## Jian-Jia Chen  *Editor*

# A Journey of Embedded and Cyber-Physical Systems

Essays Dedicated to Peter Marwedel on the Occasion of His 70th Birthday

A Journey of Embedded and Cyber-Physical Systems

Jian-Jia Chen Editor

# A Journey of Embedded and Cyber-Physical Systems

Essays Dedicated to Peter Marwedel on the Occasion of His 70th Birthday

*Editor* Jian-Jia Chen Dortmund, Germany

#### ISBN 978-3-030-47486-7 ISBN 978-3-030-47487-4 (eBook) https://doi.org/10.1007/978-3-030-47487-4

© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

*Dedicated to Prof. Peter Marwedel*

## **Foreword**

Professor Dr. Peter Marwedel has made many invaluable contributions to research in the Electronic Design Automation area and specifically to the field of Embedded System Design. Already in his young years, he placed a landmark in the field of high-level synthesis and hardware description languages. In the mid-1970s, he started to work on high-level synthesis—almost a decade before it became mainstream. Peter Marwedel also laid ground for the field of retargetable compilation, which he considered as a special case of high-level synthesis in which the architecture was fixed. He designed the very first approach for generating a compiler from a description of the processor architecture and published this work already in the early 1980s—long before compilers were an accepted topic at EDA conferences. In the early 1990s, Peter Marwedel realized that compilers for embedded processors will become a crucial element in the toolchain for designing efficient embedded systems. One of his most prominent activities in the area of compilers for embedded systems is that of energy-aware compilation where he considered the energy efficiency of compiled code. His work on exploiting scratchpad memories for improving energy efficiency is well-known throughout the world. His engagement in memory-architecture aware compilation is highly influential and can be seen as an early contribution towards green computing.

In addition to being an outstanding researcher, Peter Marwedel has also proven himself as a great educator and passionate teacher in the classroom. This is best evidenced by his textbook "Embedded System Design" which has become standard literature in higher-level education on Embedded Systems for over a decade now. His excellence in teaching is documented, e.g., by the invitation of the Advanced Institute of Information Technology (AIIT) to give a 1-week compact course on embedded system design to Korean professors in Seoul in 2008 or by an invited talk on embedded system education during the cyber-physical systems program at the annual meeting of the US National Science Foundation (NSF).

While this Festschrift features a variety of contributions from his peers in the professional community, we would like to take this opportunity to emphasize Peter Marwedel's role as an exemplary advisor for his students, all the way from freshmen studying for their BS and MS degrees to PhD graduates and rising university professors. His dedicated mentorship and valuable advice has had a tremendous impact on many careers, as the following personal quotes show:

Back in July 1995, I was a student at the University of Dortmund, working hard to complete my diploma thesis. At that time, my advisor Prof. Marwedel left Dortmund for a sabbatical at the University of California. I vaguely remember the day when he sent me an email, asking me if I would be interested to study for PhD in the USA. It took me less than 5 min to respond with 'No, thank you!' as I had always envisioned to spend my life and work in Germany. Nine days later Peter Marwedel changed my life with a second email that essentially read 'Think twice!'. So I did. Long story short, I moved to California and live there since. I will be forever grateful for that second email which opened up a once-in-alifetime opportunity for my successful career. Thank you!

Rainer Dömer (Professor, University of California at Irvine, USA)

I got in touch with Peter Marwedel's chair for the first time in 1996 in the context of my student project work on HW/SW co-design and FPGA synthesis. With my successful graduation 1998, Peter offered me a PhD position in his group so that I had the chance to collaborate with the IMEC in Leuven on high-level source code optimization techniques. Having completed my PhD in 2004, Peter continued to support me so that I was able to establish a team working on compilers for real-time systems. Besides this excellent mentorship for more than a decade that finally led my to my current professorship at TUHH, I always appreciated Peter Marwedel's team spirit very much. Besides professional activities, team building was always very important for him, leading to countless memories of garden parties, barbecues with students in his chair's backyard, bicycle tours, workshops in the middle Rhine Valley, etc. This high degree of support and friendship is extraordinary and I am happy about the opportunity to say thank you to Peter Marwedel in this Festschrift.

Heiko Falk (Professor, Hamburg University of Technology, Germany)

Even though I started as a postdoc in Peter Marwedel's research group in 2010, I was able to learn a lot from his decades of experience in teaching, scientific writing, and designing research proposals. We were always eager to experiment with new approaches, such as flipped classroom teaching, and to explore new research topics. All of this has had a significant influence on my current work as professor at Coburg University. The extensive research on compilers in Peter's group also contributed to obtaining my new position at NTNU. An invaluable benefit when working in a large, well-funded research group was that I got to travel to so many interesting places and meet colleagues from all around the world. Traveling was also a common topic on the private level, since Peter loves to travel and to photograph. I fondly remember many hours of browsing through Peter's diligently prepared photo albums and listening to the stories of his trips. Going to Norway, I hope that I can provide an exciting destination for an upcoming trip. Thanks a lot for everything, Peter, and see you soon in Trondheim!

Michael Engel (Professor, NTNU, Trondheim, Norway).

Numerous anecdotes just like these could be added here, but that would fill an entire book by itself. So without further ado, we conclude this foreword here with a big THANK YOU to Peter Marwedel.

Workshop at TU Dortmund, July 2019 Rainer Dömer

Heiko Falk Michael Engel

## **Acknowledgements**

The workshop on embedded systems, dedicated to Peter Marwedel on the occasion of his 70th birthday, and its festschrift have been partially supported by Deutsche Forschungsgemeinschaft (DFG), Collaborative Research Center SFB 876 (http:// sfb876.tu-dortmund.de/), and Alumni der Informatik e.V. TU Dortmund.

## **Contents**



## **Contributors**

**Hussam Amrouch** Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

**Emad Arasteh** Center for Embedded and Cyber-Physical Systems, University of California, Irvine, CA, USA

**M. Balakrishnan** Indian Institute of Technology Delhi, Delhi, India

**Jian-Jia Chen** TU Dortmund, Dortmund, Germany

**Kuan-Hsun Chen** TU Dortmund, Dortmund, Germany

**Zhongqi Cheng** Center for Embedded and Cyber-Physical Systems, University of California, Irvine, CA, USA

**Bryan Donyanavard** University of California, Irvine, CA, USA

**Rainer Dömer** Center for Embedded and Cyber-Physical Systems, University of California, Irvine, CA, USA

**Caio Batista de Melo** University of California, Irvine, CA, USA

**Nikil Dutt** University of California, Irvine, CA, USA

**Heiko Falk** Institute of Embedded Systems, Hamburg University of Technology (TUHH), Hamburg, Germany

**Gernot Fink** TU Dortmund, Dortmund, Germany

**Gernot Gebhard** AbsInt Angewandte Informatik GmbH, Saarbrücken, Germany

**Jörg Henkel** Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

**Wen-Hung Huang** TU Dortmund, Dortmund, Germany

**Shashank Jadhav** Institute of Embedded Systems, Hamburg University of Technology (TUHH), Hamburg, Germany

**Matthias Jung** Fraunhofer IESE, Kaiserslautern, Germany

**Daniel Kästner** AbsInt Angewandte Informatik GmbH, Saarbrücken, Germany

**Arno Luppold** Institute of Embedded Systems, Hamburg University of Technology (TUHH), Hamburg, Germany

**Pouya Mahmoody** Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

**Biswadip Maity** University of California, Irvine, CA, USA

**Daniel Mendoza** Center for Embedded and Cyber-Physical Systems, University of California, Irvine, CA, USA

**Kasra Moazzemi** University of California, Irvine, CA, USA

**Kateryna Muts** Institute of Embedded Systems, Hamburg University of Technology (TUHH), Hamburg, Germany

**Tiago Mück** University of California, Irvine, CA, USA

**Heinrich Müller** TU Dortmund, Dortmund, Germany

**Dominic Oehlert** Institute of Embedded Systems, Hamburg University of Technology (TUHH), Hamburg, Germany

**Nina Piontek** Institute of Embedded Systems, Hamburg University of Technology (TUHH), Hamburg, Germany

**Markus Pister** AbsInt Angewandte Informatik GmbH, Saarbrücken, Germany

**Behnaz Pourmohseni** Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

**Martin Rapp** Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

**Amir M. Rahmani** Technische Universität Wien, Vienna, Austria

**Sascha Roloff** Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

**Mikko Roth** Institute of Embedded Systems, Hamburg University of Technology (TUHH), Hamburg, Germany

**Sami Salamin** Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

**Wolfgang Schröder-Preikschat** Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

**Kenneth Stewart** University of California, Irvine, CA, USA

**Jürgen Teich** Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

**Niklas Ueter** TU Dortmund, Dortmund, Germany

**Georg von der Brüggen** TU Dortmund, Dortmund, Germany

**Norbert Wehn** TU Kaiserslautern, Kaiserslautern, Germany

**Christian Weis** TU Kaiserslautern, Kaiserslautern, Germany

**Stefan Wildermann** Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

**Reinhard Wilhelm** Universität des Saarbrücken, Saarbrücken, Germany

**Saehanseul Yi** University of California, Irvine, CA, USA

## **Chapter 1 Peter Marwedel and the Department of Computer Science of the TU Dortmund University**

**Gernot Fink and Heinrich Müller**

#### **1.1 Introduction**

Peter Marwedel was appointed Professor at the Department of Computer Science of the University of Dortmund in 1989. He represented the area "Computer Engineering and Embedded Systems" and headed the Chair of Computer Science 12 until his retirement in 2014. During this time, he has made a great contribution to the Department of Computer Science, which continues to have a lasting effect today. The following presentation is intended to give an idea of the extraordinary breadth and importance of his activities in teaching, academic self-government, basic research, and technology transfer.

#### **1.2 Teaching**

A special passion of Professor Marwedel is teaching, which is not self-evident for a dedicated and successful researcher. In Dortmund, he was extremely involved in basic teaching, which is particularly challenging given the high number of students attending these courses. For many years he has held the introductory compulsory lecture "Computer Structures" and the elective lectures "Embedded Systems" and "Computer Architecture." Through these courses, he has substantially contributed to the basic training in technical computer science. His lectures were extremely well prepared. The material was thoughtfully selected in all details, didactically carefully prepared, and presented objectively and clearly. The courses have gained a high

TU Dortmund, Dortmund, Germany

G. Fink · H. Müller (-)

e-mail: Gernot.Fink@tu-dortmund.de; heinrich.mueller@cs.uni-dortmund.de

reputation among the students. This was reflected in consistently excellent student ratings and led to Professor Marwedel receiving the prestigious Teaching Award of TU Dortmund University in 2003.

From the course "Embedded Systems" the first edition of the English textbook "Embedded System Design" emerged in 2003. The book has established itself as an international textbook and is often cited. Professor Marwedel has adapted it over the years to current developments. The third edition has been published in 2018.

In addition to traditional teaching, Professor Marwedel has shown great interest in the possibilities of new media and has made strong use of it. On YouTube he has made available a significant number of educational videos that are particularly associated with his courses "computer structures" and "computer architecture." The design of the videos focuses on the essentials and avoids superfluous, distracting visual effects.

In his last active years at the department, Professor Marwedel experimented with the concept of the inverted classroom as a natural consequence. This form of teaching is still little used at German universities.

Particularly noteworthy is the fact that Professor Marwedel has conveyed his concept of education in the field of embedded systems beyond its actual implementation. It was the subject of the Workshop on Embedded and Cyber-Physical System Education (WESE) organized by him in Finland in 2012. In the same year, he was invited to the annual meeting of the cyber-physical systems program of the prestigious National Science Foundation (NSF) in Washington, USA, where he gave a talk focusing on the Dortmund education concept in the field of Embedded Systems.

Finally, it should be mentioned that Professor Marwedel contributed to the internationalized teaching of the Department by participating in the English Master's program "Automation and Robotics" of the TU Dortmund University and his contacts to Indian Institutes of Technology, which have led to internships of Indian students in Dortmund.

#### **1.3 Academic Self-Government**

Professor Marwedel has always been active in the Department's academic administration and beyond. He was Dean of Studies of the Department and member of the Academic Senate of TU Dortmund. But here, too, his commitment to teaching is particularly reflected. He was chairman of the teaching committee of the Department and chairman of the educational committee of the Academic Senate of the university. He was Dean of Studies of the Department from 2012 to 2014. Beyond the university, he has worked on standards for the accreditation of computer science curricula in the national organizations ASIIN and AVI.

#### **1.4 Basic Research and SFB 876**

Professor Marwedel has contributed significantly to the international visibility of the Department of Computer Science through his research in embedded and cyberphysical systems. He has received several international honors, in particular the EDAA Lifetime Achievement Award (2013), the ESWEEK Lifetime Achievement Award (2014), and the ACM SIGDA Distinguished Service Award (2014). In 2010 he became an IEEE Fellow and a Fellow of the Design, Automation and Test Conference in Europe (DATE).

A very special contribution to research at the Department and the university was made by Professor Marwedel as co-initiator of the Collaborative Research Center (CRC/SFB) 876, together with the main initiator Prof. Dr. Katharina Morik, spokeswoman for the CRC. The CRC program of the Deutsche Forschungsgemeinschaft (DFG) enjoys a high reputation and is extremely competitive. The topic of the SFB 876 is "Providing Information by Resource-Constrained Data Analysis." The collaborative research center SFB 876 brings together data mining and embedded systems. On the one hand, embedded systems can be improved using machine learning. On the other hand, data mining algorithms can be realized in hardware, e.g. FPGAs, or run on GPGPUs. The restrictions of ubiquitous systems in computing power, memory, and energy demand new algorithms for known learning tasks. At the time of the application, merging data analysis and resource constraints was visionary—today it is highly relevant in many applications. In the SFB 876, about 20 research groups of the Department, of TU Dortmund University, and of neighboring universities are working together since 2011 for 12 years in an interdisciplinary manner.

#### **1.5 Technology Transfer and ICD**

Professor Marwedel has not only been active in basic research, but has also dealt with applied research and the transfer of scientific results to the economy. For many years he is CEO of the "Informatik Centrum Dortmund e.V." (ICD) and head of the Embedded Systems Group at ICD. The ICD was founded in 1989 from the Department of Computer Science at the University of Dortmund. The ICD is available to companies from all sectors of the economy. Its goal is to accelerate the transfer of current research results in computer science and information technology into industrial products. Under Professor Marwedel's leadership, the ICD has become a well-established, successful, and economically stable association.

#### **1.6 Conclusion**

The Department of Computer Science is extremely grateful to Professor Marwedel. On the occasion of his seventieth birthday, it wishes him all the best for the future.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 2 Testing Implementation Soundness of a WCET Analysis Tool**

**Reinhard Wilhelm, Markus Pister, Gernot Gebhard, and Daniel Kästner**

#### **2.1 Introduction**

*Timing verification* of a set of hard real-time tasks to be executed on a given hardware platform attempts to prove that all tasks in the set when executed on that platform always respect their deadlines, i.e., each task finishes its execution within its deadline. Traditionally, timing verification is split into two subtasks: a *timing analysis* also known as *WCET analysis*, which statically determines upper bounds on the execution times of the tasks, and a *schedulability analysis*, which takes these upper bounds and attempts to verify that all tasks in the given set, assuming these upper bounds on their execution times, will respect their deadlines.

A preliminary version of this paper appeared in [16].

R. Wilhelm (-)

Universität des Saarbrücken, Saarbrücken, Germany e-mail: wilhelm@cs.uni-saarland.de

M. Pister · G. Gebhard · D. Kästner

AbsInt Angewandte Informatik GmbH, Saarbrücken, Germany e-mail: pister@absint.com; gebhard@absint.com; kaestner@absint.com

#### *2.1.1 Tool Qualification*

WCET analysis is applied to time-critical and safety-critical embedded-system software in problem-aware parts of the embedded-systems industry. Such systems have to be developed in accordance with international safety norms, e.g., DO-178B/C, DO-254, IEC 61508, and ISO 26262. While there are differences between these norms, in particular regarding prescriptiveness and required level of rigor, they have many aspects in common. All of them include guidance on the use of software tools as a part of the development and verification process of safety-critical software.

The criticality level (also known as the design assurance level (DAL) or safety integrity level (SIL)) of a component determines the effort to invest and the methods required or recommended to deliver assurance of the correct functioning of the component. The criticality level is derived from the impact of a failure of the component on the functioning of the system. Similarly, the required activities to provide confidence in the correct functioning of a software tool depend on its criticality with respect to the overall system. For example, DO-178C, the current international standard for avionics systems, defines five different *tool qualification levels (TQLs)*. The TQL is determined by the potential tool impact and the design assurance level of the software. There are three tool-impact categories; the most critical, *Category 1*, applies to tools whose output becomes part of the airborne software. Similar considerations are also made in other norms, e.g., the ISO 26262 defines a *tool confidence level (TCL)* in a very similar way.

The overall goal of *tool qualification* is to provide confidence that the tool operates correctly, i.e., according to its functional specification, in the operational context of the tool user. In the following, we will focus on the tool qualification requirements of the avionics industry, which are the most rigid of the safetycritical industries. Certification of avionics systems is regulated by the international standard DO-178C [1]. WCET analysis tools fare under *verification tools*. Verification tools have no overly rigid certification requirements, unlike *development tools*: their impact category is *Category 2* or *Category 3*, mostly depending on whether the output of the tool is used to justify the elimination or reduction of other verification or development activities or not. A prerequisite for tool qualification is a specification of the tool functionality. The *tool operational requirements (TOR)* specify the tool functions and technical features, which are stated as low-level requirements on tool behavior under normal operating conditions. Another required input is the *verification test plan (VTP)*, which defines test cases demonstrating the correct functioning of all specified requirements of the TOR. Test-case definitions include the overall test setup as well as a detailed structural and functional description of each test case, i.e., how the individual test case works and what the expected result is.

Certification becomes more challenging through DO-333, the formal-methods supplement to DO-178C. It asks for a statement that a formal method including the underlying theory is *adequate* for solving the corresponding verification problem. This introduces and enforces *soundness* of the methods and tools.

Since the required effort for tool qualification can be high, ideally the software qualification process is supported by a *qualification support kit (QSK)* supplied by the tool provider. It must include TOR and VTP and typically provides a validation suite, which allows users to execute the relevant test cases in the relevant operational context. TOR, VTP, and a test execution report become part of the certification package. Furthermore, it is typically required for a tool provider to supply *qualification software life cycle data* to demonstrate that the development process and the invested efforts to assure correctness, quality, and traceability are adequate for usage in a safety-critical system context. The qualification software life cycle is not covered in this article.

DO-178C exhales a test-based spirit: many verification activities are test based. Well-defined coverage criteria try to capture to which extent the behavior of the system under test has actually been exerted during testing. Note that in case of a static verification tool, test coverage does not apply to the code to be analyzed: a sound static-analysis tool provides full data and control coverage, i.e., it analyses all paths and takes into account all potential data values for its analyses. What is needed in case of the microarchitectural analysis, which is the focus of this article, is to demonstrate the correctness of the microarchitecture model used by the analyzer. To this end it is the instruction set architecture (ISA) and the set of paths through the execution platform that need to be covered. Huge sets of test traces in qualification suites are used at tool-qualification time to cover the sets of paths through the execution platform.

Note the difference to measurement-based WCET analyses. It is known that they are in general unsound. In order to provide a sufficient level of confidence in the real-time behavior of industrial-size code they need an unacceptably huge set of traces and accordingly an excessive effort at verification time. In the case of a static WCET analysis tool, the testing effort is applied at tool-qualification time when ample time is available.

#### *2.1.2 Predictability*

Timing predictability [3, 15] has long been recognized as essential for achieving precise results of timing estimation at reduced analysis effort. In the context of the current article, it is worth mentioning that it also reduces the number of test cases for the validation of an abstract architectural model. In general, an increase in the timing predictability of the underlying architecture leads to a decreasing number of different instruction flow paths through the processor pipeline since they feature less average-case performance-enhancing micro-optimizations like instruction and data queues and buffers, data forwards, etc. Such architectures show a more regular hardware design.

#### *2.1.3 WCET Analysis*

Performance-enhancing architectural components such as caches, pipelines, and speculation have made WCET analysis difficult. Execution times of consecutively executed instructions do not compose easily because instruction execution times are now dependent on the execution state in which they are executed. In the composition *A;B* the execution time of statement *B* depends on the execution state produced by executing statement *A*. The variability of execution times grows with several architectural parameters, e.g., the cache-miss penalty and the costs for pipeline stalls and for control-flow mispredictions. As approaches using exhaustive measurements are infeasible due to the size of the search space, abstraction is applied leading to an over-approximation of the set of potential executions. This over-approximation introduces remaining uncertainty in the results of the microarchitectural analysis, which grows with the same architectural parameters mentioned above unless the architectural platform is predictable [18], see Sect. 2.1.2.

#### *2.1.4 The Central Idea: Proving Safety Properties*

We needed to solve the WCET problem for architectures with state-dependent execution times. Figure 2.1 shows that this problem could be decomposed into many subproblems. The main problem, specific for WCET analysis, was the *microarchitectural analysis*, a combined cache and pipeline analysis. Let us describe the central idea behind this phase in our WCET analysis method [17], first in a conceptual way, i.e., not quite like it is implemented, later closer to how it is implemented:

• We define any architectural effect that causes an instruction to execute longer than its fastest execution time to be a *timing accident*. Typical such timing accidents are cache misses, pipeline stalls, bus-access conflicts, or branch mispredictions. Each timing accident is associated with a *timing penalty*. Timing penalties may be constant, but may also be execution-state dependent. A cachemiss penalty may be constant if the bus is always guaranteed to be free for the cache reload. If this guarantee cannot be given, however, its size depends on the execution state, namely whether the bus happens to be free.

The property that the execution of an instruction at some program point will not cause a particular timing accident is then a safety property. The occurrence of a timing accident thus violates a corresponding safety property.

• We then use an appropriate method for the verification of safety properties to prove that for the instructions in the program some of the potential timing accidents will never happen. The goal is to prove as many of such safety properties as possible. Conceptually, the safety properties shown to hold could be used to reduce the worst-case execution-time bound for an instruction, which a naive, sound WCET analysis would have to assume, by the cost for the excluded

**Fig. 2.1** The architecture of the aiT tool

timing accidents. In practice, pipeline analysis drives a cycle-wise transition, which considers the abstract execution state, e.g., makes no transition under a cache miss if a cache miss can be excluded.

• We then prove these safety properties by abstract interpretation (AI) [4] in the following way: Compute invariants at each program point, in our case an overapproximation of the set of execution states that are possible when execution reaches this program point. Derive the above mentioned safety properties, that certain timing accidents will not happen, from these invariants. For example, AI computes an abstract cache state at each program point, which overapproximates the sets of concrete cache states that may reach this program point. The abstract cache states are used to classify some memory accesses as definite hits. Another cache analysis that underapproximates the set of possible concrete cache states is able to predict definite misses. Predicted cache hits are then used to prove that the timing accident, this memory access will miss the cache, will never happen [8, 10].

This method for the microarchitectural analysis was the main innovation that made our WCET analysis work for real-life architectures and scale to industrialsize software [6].

Now follows the description of the microarchitectural analysis that is closer to the implementation. Driver of this analysis is the pipeline analysis [14]. It goes through the instruction stream, instruction by instruction, and executes the current instruction in the current abstract execution state. This abstract execution state contains uncertainty, i.e., it lacks information about some state components. Transitions to all potential successor states are performed whenever the transition to the next state depends on such a missing part of the state. The timing contributions of these transitions are accumulated until an instruction can be retired. In the end, upper bounds on the execution times of basic blocks are obtained that are coefficients in an integer linear program representing the control flow of the program [17]. Another type of result is described below.

#### *2.1.5 Terminology*

We consider only sound WCET analysis methods. *Soundness* means that a method and associated tool will always produce *conservative* WCET estimates, i.e., estimates that will never be exceeded in any execution. Being conservative is a Boolean property. Unfortunately, *conservative* is often used as a metric property, *more conservative* meaning *less precise*. However, calling results of an unsound method conservative is a misnomer. The really meant, other dimension, in addition to soundness, is *accuracy*. Accuracy of some WCET estimate, obtained by a sound method, expresses the degree of over-estimation, the difference between a WCET estimate and the real WCET. It does not make sense to talk about the accuracy of an unsafe estimate or an unsound method. In case of an unsound method it is not even clear whether a "more conservative" estimate moves towards the real WCET from below or is larger than the real WCET and moves further away from it. In general, WCET estimates are below, i.e., underestimate the real WCET, if endto-end measurements are used. On the other hand, if piecewise measurements are applied whose results are combined to an estimate of the overall execution times, this often results in over-estimation of the real WCET.

WCET analysis can be seen as the search for a longest path in the state space spanned by the program under analysis and by the architectural platform. Most realtime software is written as to guarantee termination. Its state space can thus be easily abstracted to a finite abstract state space, which is still too large to be exhaustively explored. We can, therefore, not expect to identify the real WCET, but only safe upper bounds to all execution times, which we will call WCET estimates. (Safe) over-approximation is used in several places. In particular, an abstraction of the execution platform is employed by the WCET analysis. How to convince oneself (or the certification authorities) of the correctness of this architectural model is the main subject of the next section.

#### **2.2 Validation of Our WCET Analysis Tool**

The claim that our WCET analysis tools produce safe results is a strong one and often disputed by some proponents of unsound WCET analysis methods. Their argument is, to develop an error-free instantiation of the, in principle, sound WCET analysis technology is so difficult, that one might use a simpler unsound method in the first place. The main complaint is the complexity of the abstract architectural models. So, what is the basis for our claims?

Several analyses in the tools are instances of abstract interpretation [4], a scientific method with a strong underlying theory, relating analysis results to semantic properties of analyzed programs. Value and loop bound analysis, c.f. Fig. 2.1, are more or less standard abstract interpretations. The difference is that these analyses are performed on the binary level and not on the source level. Still, adequacy of these analyses is easily accepted. The instantiation of the abstractinterpretation framework for the microarchitectural analysis of a given execution platform, however, is far from trivial. In particular, it contains an abstraction of the execution platform. How does one make sure that such an abstraction is conservative? This will be explained in Sect. 2.2.3.

Let us give short descriptions of the different component analyses alongside the particular validation activities before we come to the validation of the central component.

#### *2.2.1 Control-Flow Graph Reconstruction*

The reconstruction of the control-flow graph (CFG) from a binary executable means to compute a safe approximation of the inter-procedural control flow of the executable [13]. This is achieved by the following two steps after having loaded the executable:


For Step 1, a specification of the instruction encoding is required. Instructionset-architecture manuals provide this information, which is then used to implement instruction identification in the binary decoder of the aiT tool chain. To validate the implementation, we perform the so-called *decode* tests. For each supported instruction (in each supported addressing mode) we write a test case providing a reference as the expected result of the decoding. The decoding result is then compared to this reference.

In Step 2, the decoder uses the identified instruction stream to compose a safe control-flow approximation. To validate this, we compile a representative set of control structures (in a high-level language like C) and decode the resulting executable to compare the reconstructed control flow with a reference result.

#### *2.2.2 Value Analysis*

The value analysis determines safe approximations of the values in processor registers and memory cells for every program point. These approximations are used to determine bounds on the iteration number of loops and information about the addresses of memory accesses. The value analysis is based on the instruction semantics of the underlying target architecture. Like the instruction encoding, architecture manuals provide this information.

To validate the instruction-semantics implementation, we create a test case for each instruction and define pre- and post-conditions according to the expected effect of the particular instruction. These conditions are expressed by user annotations, which are read by the value analyzer. Pre-conditions are used to generate the machine state needed to execute the tested instruction. The post-conditions define the expected state after having executed the instruction under test.

#### *2.2.3 Microarchitectural Analysis: Trace Validation*

The microarchitectural analysis combines a cache and a pipeline analysis. It is an abstract interpretation of the program's execution on the underlying cache and pipeline architecture. The execution of a program is abstractly executed by feeding instruction sequences from the control-flow graph to the timing model, which then computes the changes of the abstract execution state at cycle granularity and keeps track of the elapsing clock cycles. The correctness proofs of the method have been conducted by Thesing [14] based on the theory of abstract interpretation.

The cache analysis described in [2, 5, 7] is incorporated into the pipeline analysis. At each memory access, where the concrete hardware would query and update the contents of the cache(s), the cache analysis applies the corresponding abstract cache effects to the abstract cache state.

The result of the microarchitectural analysis is either an upper execution-time bound for every basic block or a *prediction graph*. In the first case, these upper bounds are the coefficients in an integer linear program that represents the control flow of the program. This is the version usually described in publications about static WCET analysis, as it presents a clean work distribution. However, it has the disadvantage that too much information is lost at basic-block boundaries, namely the precise matching of final states at predecessor blocks to initial states at successor blocks. This loss of information entails a loss in precision. The prediction graph avoids this loss of precision. It consists of abstract states as nodes and edges for the transition between states and represents the evolution of the abstract execution states at processor-clock granularity and beyond basic-block boundaries. Note that in the description of trace validation a prediction event graph appears, which is the prediction graph extended by event annotations at its edges.

**Fig. 2.2** Evolution of abstract hardware states *s*ˆ*i*. Each edge denotes a single cycle transition in the abstract state space. The gray boxes span the set of states that belong to the same basic block

As an example consider the prediction graph of Fig. 2.2, where the longest path is four transitions long, i.e., it takes four processor cycles to complete the program. Adding up the length of each longest path per basic block (denoted by the gray boxes) would neglect that there is no connection between the abstract states *s*ˆ<sup>3</sup> and *s*ˆ<sup>5</sup> and thus yield a worst-case estimate of five processor cycles.

Due to the complexity of the abstract architectural model, validation of the pipeline analysis cannot be done solely by testing the abstract implementation of individual instructions as we do it for CFG reconstruction and value analysis.

#### **2.2.3.1 Semi-Automatic Derivation of the Abstract Architecture Model**

Nowadays, hardware circuits are automatically synthesized from formal hardware specifications like VHDL or Verilog. Besides a formalization of the functional details, such specifications implicitly contain an execution model that also reflects the timing behavior of the whole system. It was a tempting idea to derive a pipeline analysis from the formal hardware model such that analysis and synthesized circuit share the same basis [11, 12].

However, the semi-automatic derivation of a timing model approach has not proven effective in the industrial context. Even if the hardware manufacturers grant access to their formal models (which is often not the case), the derivation process requires to fully understand the design, which might be a complex task for a complete processor including peripheral devices. Additionally, the quality of the resulting analysis depends on the coding style of the hardware model [11]. Results are excellent if the code features minimal dependencies between processes, a clear logical separation of different functionality into different processes/subprograms and a sequential logic design. Ideally, the code reflects the structural composition of the processor pipeline with explicit control signals to steer the flow of instructions and data. Models not adhering to those design principles complicate state abstractions and thus result in prohibitively resource-consuming analyses.

#### **2.2.3.2 Trace Validation**

For the reasons given above, the abstract architectural models are *hand-crafted* by human experts based on the available hardware reference documentation, which sometimes contains errors and usually lacks relevant details. Reverse engineering based on specific runtime measurements needs to fill this gap. Even if it were semi-automatically derived from a specification, the implementation of the microarchitectural analysis would still need to be validated. *Trace validation* checks for safe over-approximations of the predictions by matching observable hardware events recorded during concrete executions of instruction sequences against predictions of those events produced by the microarchitectural analysis. This is done for a sufficiently large set of instruction sequences that structurally covers the possible instruction flows (wrt. the different functional units, instructions, dependencies between instructions, etc.) of the processor pipeline.

Figure 2.3 shows the trace-validation workflow. An instruction sequence is executed on the actual hardware, or its execution is simulated using a VHDL model, to obtain an *observed event trace*. The microarchitectural analysis is modified to predict those events and annotate them to the edges of the generated prediction graph. In this fashion the microarchitectural analysis of an instruction sequence generates a *prediction event graph* that describes an over-approximation of all possible event traces that could occur while executing the instruction sequence. The observed trace of events, the reached execution state, and the consumed time are

**Fig. 2.3** Trace validation according to [9]. The instruction sequences together with the generated prediction graphs annotated by state and timing information are part of the Qualification Support Kit

checked for containment in the prediction graph. Trace validation is successful if the sequence of traced events is found in the prediction graph, and their predicted execution time does not underestimate the observed execution time.

The granularity at which the comparison takes place strongly depends on the debug facilities provided by the hardware. At best, timer interrupts are used to stop execution after each execution cycle. This way, the execution of instruction sequences is extended cycle by cycle to observe actual execution states and execution times.

The behavior of some components of the architectural state, such as the cache state, is unfortunately not directly observable. These need to be indirectly observed through executions that are forced to lead to cache hits and cache misses.

Thus a tremendous effort is required to cover both all instructions and all architectural components. This is essentially achieved by triggering many different architectural states through the execution of dedicated test cases.

The validation suite of the AbsInt static WCET tool aiT may contain several hundred individual test cases, even for a simple DLX-like architecture like the ARM Cortex-M4. For multi-core architectures, such as the TriCore TC275, which features three different cores, several thousand test cases are necessary to cover all architectural features.

How many test cases are required to cover the whole architectural behavior correlates to the complexity of the analyzed hardware, i.e., with the number of available instructions of the instruction set architecture, the number of components of the pipeline architecture like functional units, internal buffers, queues, memories, buses, and their states. Often unexpected (undocumented) hardware behavior is exposed while trying to understand existing test cases. This leads to additional test cases. Hence, the number of test cases that are sufficient in order to cover the (timing) relevant hardware behavior cannot be easily quantified in advance.

#### **2.3 Conclusion**

The AbsInt WCET analyzer aiT uses a combination of sound methods to derive safe upper bounds on execution times. Their implementation is quite complex, such that it is natural to query the soundness of the implementation of the technology. We describe the validation efforts employed to convince ourselves, the customers, and the certification authorities of the soundness of the implementation. The European Aviation Safety Agency (EASA), obliged to follow the strictest certification rules, those of DO178-C, has accepted AbsInt's aiT as a validated WCET analysis tool for several time-critical subsystems in the Airbus A380 and A350 planes.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 3 The Dynamic Random Access Memory Challenge in Embedded Computing Systems**

**Matthias Jung, Christian Weis, and Norbert Wehn**

#### **3.1 Introduction**

*Dynamic random access memories* (DRAMs) are key components in all computing systems that require large working memory. Due to the strong increase in data volume in many embedded applications, such as machine learning, image processing, autonomous systems, etc., DRAMs largely impact the overall system performance and power consumption. In many of these applications, the overall system performance is often limited by the memory bandwidth or latency and not by the computation itself. Due to the dynamic storage scheme of DRAMs and shrinking technology nodes, reliability is also a major concern in current and future DRAMs.

Therefore, new challenges arise, which we will discuss in this chapter. The most important metrics, which are typically considered for DRAM subsystems (especially in the *high-performance computing* (HPC) domain), are *bandwidth*, *latency*, and *capacity*. However, in the context of embedded systems it requires to consider further metrics, such as *power*, *temperature*, *reliability*, *safety*, and *security*. In the following we will highlight these challenges and refer to some of our recent contributions, which tackle these challenges.

M. Jung

Fraunhofer IESE, Kaiserslautern, Germany e-mail: Matthias.Jung@iese.fraunhofer.de

C. Weis · N. Wehn (-) TU Kaiserslautern, Kaiserslautern, Germany e-mail: weis@eit.uni-kl.de; wehn@eit.uni-klde

#### **3.2 Bandwidth and Latency**

*Bandwidth* is the amount of data that can be transferred between DRAM and a computational unit within 1 s. The maximum DRAM bandwidth is limited to the number of data pins times the interface frequency. *Latency* is the access time that it takes to complete an access. In fact, latency helps bandwidth, but not vice versa [33]. For instance, lower DRAM latency results in more accesses per second, and therefore higher bandwidth, whereas increasing the number of data pins increases the bandwidth without decreasing latency. A fast execution of applications on embedded systems must not only be supported by the computational units, but the memory subsystem must be designed to avoid hitting the *memory wall* [43]. For example, embedded applications for autonomous driving will require between 400 and 1024 GB/s of memory bandwidth [16], which is hard to realize with the current DRAM technologies. To put the problem in perspective, we survey current memory architectures.

Figure 3.1 shows different DRAM-based memory subsystems, and Figs. 3.2 and 3.3 show their properties with respect to interface frequency, maximum theoretical bandwidth, and energy consumption per transferred bit.<sup>1</sup> The maximum bandwidth of conventional DIMM-based DDR solutions is limited by the I/O count and interface speed. This limitation arises due to the package, power considerations, and costs on both the memory and processing sides.

**Fig. 3.1** DRAM-based memory subsystems

<sup>1</sup>Note that the latency, actual sustainable bandwidth, and the total energy consumption of a DRAM strongly depend on the application being executed. Reaching the maximum theoretical bandwidths in Fig. 3.2 is practically impossible on general-purpose systems.

**Fig. 3.2** Interface frequency and maximum bandwidth of different DRAM types

**Fig. 3.3** Properties of today's DRAMs (Sources: Micron, Hynix, Nvidia, Xilinx, JEDEC)

To avoid pin limitations, designers and vendors are using *Buffer on Board* (BoB) organizations [7], in which an additional logic component is interposed between the CPU and DRAM to control the memory and to communicate with the CPU over a narrow, high-speed, serial interface. This technique is mainly used in server applications where several terabytes of DRAM are required. The required storage capacity in embedded systems is much smaller than in the high-performance systems BoB targets, and thus this organization is inappropriate. All the other following DRAM devices can achieve easily several GB capacity, which is enough for most of the embedded applications.

*Package on Package* (PoP) organizations reduce the distance between the DRAM and the MPSoC (from centimeters to millimeters), providing higher bandwidth, lower latency, better power efficiency, and smaller form factors, all of which are especially important for smartphones and tablets. Low power DDR DRAMs (e.g., LPDDR4) can be used either as a device on a PCB or mounted directly as PoP. The latter organization permits only one device to be connected, requiring DRAM commands to be serialized due to the resulting low pin count. For example, if eight LPDDR4 devices are used on a PCB, they deliver a theoretical bandwidth of 137 GB/s.

To address the huge memory demand of highly parallel GPUs, graphic DDR DRAMs (e.g., GDDR5X or GDDR6) use techniques like *quad data rate* (QDR) to deliver high bandwidth compared to conventional DDR DRAM. While LPDDR4 devices are designed and optimized for ultra-low power consumption with aggressive power gating and higher-threshold transistors, GDDR5X/6 devices focus on delivering the highest achievable bandwidth. Both use an architecture with distributed banks (heavy sub-banking) due to the wider data I/O widths of ×16/×32 and the larger data prefetch of up to 16 bit per data I/O. However, GDDR5X/6 devices improve the column-to-column cycle time (*tCCD*) by reducing data path delays from primary sense amplifiers to the global sense amplifiers. Furthermore, GDDR5X/6 chips use an on-die *phase lock loop* (PLL) to achieve very high I/O performance in QDR mode. In contrast, LPDDR4 devices have no on-die PLL or *delay lock loop* (DLL). Combining 16 GDDR6 devices in QDR mode yields a theoretical bandwidth of 1 TB/s, as shown in Fig. 3.2.

Another way of achieving high bandwidth is 3D stacking: examples include WIDE I/O, Micron's *Hybrid Memory Cube* (HMC), and Samsung's *High Bandwidth Memory* (HBM). These memories reduce the distance between CPU and external RAM from centimeters to micrometers by means of *through-silicon via* (TSV) technology. The available bandwidth increases due to more pins provided by the TSVs, but, more importantly, this technology provides a major boost in energy efficiency compared to standard off-chip (G)DDR devices.

The combination of high bandwidth communication and the lower power consumption of 3D integrated memory is an ideal fit for embedded systems. For example, four parallel HBM2 devices on a 2.5D silicon interposer can provide up to 1 TB/s [16]. However, 3D memories suffer from thermal issues, which we discuss in Sect. 3.4.

**Fig. 3.4** Latency for an application running on DDR3 DRAM

From an application point of view, the DRAM subsystem has non-deterministic timing behavior [8] due to its complex protocol (i.e., the latency of a DRAM request depends on previous issued commands) and the runtime optimization of the memory controller; this makes it difficult to provide predictable performance and thus to guarantee real-time task predictability [1]. Figure 3.4 shows a histogram of the CHStone ADPCM benchmark [11] simulated on the DRAMSys framework [18].<sup>2</sup> Although the average latency is concentrated around 40 ns, the memory latency can easily vary by an order of magnitude.

As with the bandwidth issues discussed above, the memory controller plays an integral role in this non-deterministic timing behavior. The memory controller has to manage, on one side, accesses to the DRAM memory from the compute fabric and, on the other side, the complex interface protocol of the DRAMs. In the following we discuss the main contributions to the DRAM latency that origin from the complex internal memory architecture and the memory controller.

• **Row Misses**: The latency of a bank access varies depending on the state of its row buffer. If a memory access targets the same row as the row currently cached in the buffer (a row hit), it results in lower latency and lower energy memory

<sup>2</sup>The simulated DRAM is a DDR3 with a RBC address mapping and disabled scheduler.

accesses. On the other hand, if a memory access targets a different row from that currently in the buffer (a row miss), it results in higher latency and energy consumption.


<sup>3</sup>In fact, the degradation grows linearly with the capacity, which means it grows exponentially with each density generation. Liu et al. [27] and Bhati et al. [3] predicted that 40–50% of the power consumption of future DRAM devices will be caused by refresh commands, and the maximum DRAM bandwidth will be significantly reduced.

Due to this unpredictable timing behavior, processors for embedded applications with real-time and strict latency constraints have thus far largely avoided using DRAM. For example, Infineon's Aurix CPU, which is widely used for safety-critical applications, does not provide a DRAM controller.

In past years there were many investigations with respect to DRAM controllers for real-time and mixed-criticality applications in embedded systems. A detailed book which summarizes those approaches has been presented by Goossens et al. [8]. Most of these approaches concentrate operating the DRAM with statically precomputed command patterns which guarantee a predictable behavior. However, this predictability often comes with a degradation of sustainable bandwidth. Moreover, the bandwidth numbers presented in Fig. 3.2 are theoretical maxima: the sustainable memory bandwidth is much less, and it strongly depends on how the data is stored in the memories, i.e., the memory access pattern [12]. Therefore, it is not only important to choose a memory that provides high bandwidth, it is also important to design a DRAM controller that can bring the sustainable bandwidth closer to the theoretical maximum.

As already mentioned, general-purpose DRAM controllers use online scheduling techniques to improve the sustainable bandwidth, e.g., by reducing the number of row misses or read/write transitions. In order to reduce the number of read/write transitions, DRAM controllers buffer read and write commands in two distinct queues. An arbiter switches between read and write mode to diminish the *tWTR* penalty, the minimum time interval between the end of a WR burst and a RD command.

However, in embedded systems, many applications (e.g., signal, image, or neural network processing) have regular, fixed, and deterministic memory access patterns. On the compute side, inherent application-specific knowledge has been heavily exploited for efficient compute architectures. However, on the memory side there is limited research that exploits application knowledge to improve the memory access behavior. In [12] we presented an *application-specific memory controller* (ASMC). Key of this controller is an optimized mapping of the logical addresses to physical DRAM addresses such that the row misses in the access pattern stream are minimized. The corresponding mathematical optimization problem is an *integer linear programming* problem. The solution of this problem maximizes the number of row buffer hits and exploits the bank-level parallelism of the DRAM device in order to reduce the latency and therefore to keep up the sustainable bandwidth near to the maximum. Therefore, such an ASMC can outperform online schedulers because it was designed with a *global* application view. Furthermore, for real-time embedded systems with this method we can easily determine WCET bounds, since no non-deterministic online scheduling is involved.

The efficiency of this approach is demonstrated on an industrial embedded image processing application that consists of image rotation and FFT. Due to realtime requirements this application requires a minimum bandwidth of 9*.*57 GB/s. Figure 3.5 shows the bandwidth and energy for the standard address mappings of a standard memory controller with standard row-bank-column (RBC) mapping and bank-row-column (BRC) mapping, a manual optimization of the mapping of an

**Fig. 3.5** Industrial image processing application

experienced engineer and the ASMC approach. The ASMC approach has a runtime of ∼50 min, whereas the manual approach requires ∼1 week for an engineer to fully understand the application and analyze the behavior. Furthermore, by using the generated address mapping, all the online scheduling capabilities of the memory controller could be removed, which reduced the required area of the memory controller by 35%.

As mentioned already in Sect. 3.2 the refresh has a large impact on DRAM's bandwidth and latency. The overhead of refreshes can be reduced by only refreshing the memory cells inside the DRAM that hold data that are still alive. A large body of research exists developing schemes that manually refresh the DRAM row-by-row, characterizing each row's ability to retain data and eliminating unnecessary *Refresh* operations on rows that can be refreshed less often. These schemes have been shown to be extremely efficient. Since eliminating refresh improves both energy and performance of the memory system, these schemes offer the potential for significant gains in DRAM-system efficiency. However, these schemes are incompatible with the modern *auto-refresh* mechanism that is widely used: *auto-refresh* operates on multiple rows at once and not on a row-by-row basis. In addition, *auto-refresh* cannot skip any row, whether that row needs to be refreshed or not. Thus, the manual schemes use explicit row-level *Activate* (ACT) and *Precharge* (PRE) commands to refresh row-by-row, called *row granular refresh* (RGR). However, it was shown in [4] that techniques based on RGR could never be as effective as the DRAM's internal auto-refresh.

In [28] we presented a technique called optimized RGR which allows a rowby-row refresh with the same efficiency as the auto-refresh. Here, we investigated the timings that are relevant to *Activate* and *Precharge* commands and showed that these DRAM timing parameters can be reduced for performing the *Refresh*

operation row-by-row. We could demonstrate a reduction of latency and increase of bandwidth compared to standard *auto-refresh*, as shown in Fig. 3.6. The results can be even improved if only alive data is refreshed (ORGR select). Additionally, ORGR improved the energy efficiency compared to RGR.

It is becoming clear that embedded applications must concentrate on DRAM solutions like GDDR and HBM in combination with ASMCs and sophisticated refresh mechanisms in order to cope with their high bandwidth and low latency requirements.

#### **3.3 Power Consumption**

Power is one of the major challenges in today's embedded system development. According to Fig. 3.3, the preliminary choice for low power designs is LPDDR4 and Wide I/O2 due to their very low energy consumption. However, when aiming at high memory bandwidth, e.g., 1 TB/s these devices are not optimal. For example, to achieve the aforementioned bandwidth with LPDDR4, 64 devices (×32) are required. Although the average power would be only ∼17 W at a peak frequency of 2000 MHz, the high number of resulting I/O pins (2048) becomes unfeasible. Hence, the only alternative candidates for high bandwidth are HBM2 and GDDR5X/6. According to Figs. 3.2 and 3.3 the average power consumed<sup>4</sup> by the HBM (4 stack ×1024) and GDDR6 (16 devices, QDR, ×32) devices are ∼60 W and ∼150 W, respectively. These numbers show that DRAM will be a significant power contributor to embedded systems which require a high memory bandwidth. Therefore, it is mandatory to efficiently use DRAM's power-down modes in order to reduce power consumption.

In state-of-the-art memory controllers the entry to a power-down mode is scheduled when there was no activity in a period of time called timeout. DRAMs

<sup>4</sup>Operated at respective peak frequency.

**Fig. 3.7** Comparison of energy savings normalized to power-down disabled

offer three power-down modes, called *active power-down* (PDNA), *precharge power-down* (PDNP), and *self-refresh* (SREF). In [35] we showed that a highly opportunistic SREF entry results in an increased power consumption, since the SREF will always execute an internal refresh in the beginning. Therefore, the timeout for a SREF entry should be at least 500 clock cycles for a Wide I/O DRAM.

In [19] we presented an optimized power-down policy, called *staggered powerdown*, which considers all three available DRAM power-down modes to achieve the maximum saving in energy and the minimum in slow-down on the execution of the applications. The basic idea is to change to the next more efficient power-down state on a refresh event. With this method, unnecessary SREF entries will be avoided and the hardware timeout counters, as used in state-of-the-art controllers, are not required anymore. As shown in Fig. 3.7 for Wide I/O DRAMs an energy reduction up to 10% in high activity periods and up to 13% in idle phases is feasible.

A high power consumption also fosters a high thermal dissipation that largely impacts the reliability of a DRAM. This challenge is discussed in more detail in the next paragraph.

#### **3.4 Temperature vs. Reliability**

DRAMs are very sensitive to high temperature, which increases the leakage in the memory cells. Figure 3.8 shows the different leakage paths in a DRAM cell:

**Fig. 3.8** Leakage paths in modern buried wordline DRAM architecture [23, 34]


In order to avoid data corruption by retention errors due to leakage, the refresh frequency needs to be increased. The general rule of thumb is to double the refresh rate for every 10 ◦C increase over 85 ◦C [20]. For example, the refresh period must be decreased from 64 ms to 4–8 ms for 125 ◦C, which leads to a serious collapse of the sustainable bandwidth [16].

This situation is even worse for today's 3D stacked DRAM systems (e.g., Wide I/O, HBM, HMC, etc.), which aggravate the thermal crisis: i.e., these DRAMs are even more sensitive to temperature changes because of the stacked thin dies. Additionally, when aiming for highest bandwidths with HBM or HMC, these devices will consume, as mentioned before, a significant amount of power on a small area compared to their commodity counterparts. Thus, the self-heating of 3D-DRAMs is even more accelerated. Besides the leakage currents, crosstalk on bitlines and wordlines can also disturb the data stored in the cells or disturb their sensing. Due to the aforementioned effects and shrinking technology nodes, reliability is a major concern in DRAMs. Many techniques exist to improve the reliability, e.g., using *error correcting codes* (ECC) and/or spatial redundancy.

*Approximate* and *probabilistic computing* evolved as design paradigms that exploit the error resilience of applications to increase their performance and decrease the power consumption [10]. This paradigm can be extended to DRAMs resulting in *approximate DRAMs* (ADRAM) that enable a trade-off between energy efficiency, performance, and reliability. The inherent error resilience of applications allows sacrificing data storage robustness and stability by lowering the refresh rate or disabling refresh in DRAMs completely, as shown in Fig. 3.9. However, to apply ADRAM the statistical DRAM behavior with respect to retention time, process variation, and temperature has to be characterized.

Several studies for the usage of ADRAM are presented in [13–15, 20]. One scenario is safe refresh disabling, i.e., if the data lifetime is smaller than the refresh period, the refresh can be completely switched off without impact on the system

**Fig. 3.9** Approximate DRAM design space

**Fig. 3.10** Simulation framework for approximate DRAM explorations

behavior. To perform an accurate characterization we measured state-of-the-art DRAM devices, such as DDR3, DDR4, and Wide I/O. These measurements were the base for a simulation platform for ADRAM investigations.

Exploring ADRAM in a system context is challenging, since a trade-off between accuracy and simulation performance must be considered. Our framework relies on SystemC *Transaction Level Models* (TLM) for fast and accurate simulation. Figure 3.10 shows the closed loop simulation. This simulation loop consists of (1) *DRAM and gem5 Core Models* [5, 18], (2) a DRAM power model [6, 29], which uses either parameters from datasheets, or real measurements [13, 15], (3) a *thermal model* based on 3D-ICE [38], and (4) a *DRAM retention error model* [42].

As mentioned before, ECC is an efficient technique to improve DRAM's reliability, e.g., retention errors or errors induced by crosstalk. State-of-the-art ECC DDR DIMMs, for instance, consist of 8 DRAM devices and a further device for storing the ECC redundancy. Moreover, vendors recently introduced on-die ECC for LPDDR4 [24, 26] to correct retention errors. With ECC the refresh rate can be lowered by 4×, which largely reduces the power consumption. Finding an efficient ECC is a non-trivial task. Traditional ECC techniques for DRAMs assume a symmetric behavior of the retention errors, i.e., the error probability for a stored 0 and 1 is identical. In [22] and [23] we presented a more accurate error model for the retention behavior that exploits the internal cell structure (the so-called true- or anticells) of a DRAM. This model is asymmetric and we could show that the channel capacity according to Shannon's capacity definition (the memory cell is considered as a noisy channel) of a single memory cell is larger than in the traditional commonly used symmetrical model. Hence, a more efficient coding must exist. In [23] we presented a new and low-overhead coding scheme that improves the reliability with respect to retention errors.

#### **3.5 Safety and Security**

Since DRAMs are more and more used in safety-critical applications like automotive, safety and security are major concerns for DRAMs that were originally mainly developed for consumer applications. Apart from the temperature based retention errors discussed in Sect. 3.4, DRAMs are also prone to transient soft errors, i.e., effects of cosmic particle strikes [9]. Moreover, due to the high frequency of DRAM interfaces transient transmission errors on the DRAM bus can occur. Furthermore, there can be hard errors related to stuck at failures or aging, which could result in a defect column decoder. There exists only a limited amount of studies on DRAM error rates in the field since manufacturers and data centers are very careful to share this sensitive information [30, 36, 41]. Sridharan et al. report 20–66 FIT for a single DRAM device [39, 40]. This highlights the need for appropriate safety mechanisms in order to decrease the FIT rates. For example, the memory controller contains an additional logic that tests the interface periodically in order to detect errors or a strong ECC that is able to correct errors online in order to guarantee functional safety.

Apart from random failures, malicious causes can lead to a safety goal violation, too. Because of transitions to open environments for IoT or *Car2X* communication the vulnerability of DRAMs for embedded systems must be considered. As DRAM process technology scales down, the electrical interference between the memory cells increases, which leads to disturbance errors. Recently, the *row-hammer* problem [21, 32] and its exploits [25, 37] have caused a lot of attention in research and newspapers. By repeatedly opening and closing a DRAM row, called *hammering*, bits in adjacent rows can flip. This effect can be exploited to write on memory locations with prohibited access rights to, e.g., gain kernel privileges or escape a sandbox or hypervisor. In [25] the author showed that secret data can be read with a combination of row-hammer and data dependencies [23]. The row-hammer security attack [21] is a potential malicious behavior that has to be avoided. Controller triggered techniques like *target row refresh* where rows will be refreshed when their activation count exceeds a threshold or techniques on the device level like [2, 44] can alleviate this problem. In [17] a methodology for reverse engineering DRAMs by reconstructing the physical location of memory cells without opening the device package and microscoping the device was presented. This method consists of a retention error analysis while a temperature gradient is applied to the DRAM device. With this insight into the internal DRAM structure row-hammer countermeasure techniques can be improved.

#### **3.6 Conclusion**

Emerging applications executed on embedded computing systems require ever increasing main memory sizes. Thus, DRAMs are indispensable to be integrated in such systems. However, the use of DRAMs implies many new challenges. In this chapter, we highlighted some of the major challenges for the integration of DRAM subsystems into embedded computing systems. These challenges are namely: *bandwidth*, *latency*, *power*, *temperature*, *reliability*, *safety*, and *security*. Furthermore, we showed several solutions from our recent research activities in order to tackle and overcome these challenges.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 4 On the Formalism and Properties of Timing Analyses in Real-Time Embedded Systems**

**Jian-Jia Chen, Wen-Hung Huang, Georg von der Brüggen, Kuan-Hsun Chen, and Niklas Ueter**

#### **4.1 Introduction**

The advanced development of embedded computing devices, accessible networks, and sensor devices has triggered the emergence of complex cyber-physical systems (CPS). In such systems, advanced embedded computing and information processing systems heavily interact with the physical world. Cyber-physical systems are integrations of computation, networking, and physical processes to achieve high stability, performance, reliability, robustness, and efficiency [26]. A cyberphysical system continuously monitors and affects the physical environment which also interactively imposes feedback to the information processing system. The applications of CPS include healthcare, automotive systems, aerospace, power grids, water distribution, disaster recovery, etc.

Due to their intensive interaction with the physical world, in which time naturally progresses, *timeliness* is an essential requirement of correctness for CPS. Communication and computation of safety-critical tasks should be finished within a specified amount of time, called *deadline*. Otherwise, even if the results are correctly delivered from the functional perspective, the reaction of the CPS may be too late and have catastrophic consequences. One example is the release of an airbag in a vehicle, which only functions properly if the bag is filled with the correct amount of air in the correct time interval after a collision, even in the worst-case timing scenario. While in an entertainment gadget a delayed computation result is inconvenient, in the control of a vehicle it can be fatal. Therefore, a modern society cannot adopt a technological advance when it is not safe.

J.-J. Chen (-) · W.-H. Huang · G. von der Brüggen · K.-H. Chen · N. Ueter TU Dortmund, Dortmund, Germany

e-mail: Jian-Jia.Chen@tu-dortmund.de; Wen-Hung.Huang@tu-dortmund.de; Georg.von-der-Brueggen@tu-dortmund.de; Kuan-Hsun.Chen@tu-dortmund.de; Niklas.Ueter@tu-dortmund.de

Cyber-physical systems that require both functional and timing correctness are called cyber-physical real-time systems. Since cyber-physical real-time systems are replacing mechanical and control units that are traditionally operated manually, providing both predictability and efficiency for such systems is crucial to satisfy the safety and cost requirements in our society. Real-time computing for such systems is to provide safe bounds for deterministic or probabilistic timing properties. For providing deterministic timing guarantees, worst-case bounds are pursued. Specifically, the worst-case *execution* time (WCET) of a program (when it is executed exclusively in the system, i.e., without any interference) has to be safely calculated, for details the reader is referred to [31]. The WCETs of multiple programs are then used for analyzing the worst-case *response* time (WCRT) when multi-tasking in the system.

The strongest deterministic timing guarantee ensures that there is no deadline miss of a task by validating whether the WCRT is less than or equal to the specified relative deadline. When the deadlines of all tasks in a system are satisfied, the *hard* real-time requirements are met and the system is a *hard real-time system*. The assumption behind the requirements of hard real-time guarantees is that a deadline miss can result in fatal errors of the system. Ensuring worst-case timing properties has been an important topic for decades. Initially, such worst-case guarantees were achieved by constructing cyclically repetitive static schedules. The timing properties of static schedules can be analyzed easily, but the constructed realtime systems were inflexible to accommodate any upgrades or changes that were not planned in advance.

The seminal work by Liu and Layland [23] provided fundamental knowledge to ensure timeliness and allow flexibility for scheduling periodic real-time tasks in a uniprocessor system. A *periodic task τi* is an infinite sequence of task instances, called *jobs*, where two consecutive jobs of a task should arrive recurrently with a period *Ti* (i.e., the time interval length between the arrival times of two consecutive jobs is always *Ti*), all jobs of a task have the same *relative* deadline *Di* = *Ti* (i.e., the *absolute* deadline of a job arriving at time *t* is *t* + *Di*), and each job has the same worst-case execution time (WCET) *Ci* [23]. The *utilization Ui* of a task *τi* is hence defined as *Ui* = *Ci/Ti*. Although the periodic real-time task model is not always suitable for industrial applications, the exploration of the fundamental knowledge in the past decades provides significant insights. Specifically, Liu and Layland proved the applicability of preemptive dynamic-priority and fixed-priority scheduling algorithms and provided worst-case utilization analysis. To be precise, they showed that as long as the utilization *n <sup>i</sup>*=<sup>1</sup> *Ci/Ti* of the given *<sup>n</sup>* tasks is no more than *n(*2 1 *<sup>n</sup>* − 1*)*, which is ≥ 69*.*3%, then the worst-case response time of a task *τi* is guaranteed to be no longer than *Ti* if the priorities are assigned in the rate-monotonic (RM) order, i.e., *τi* has a higher priority when its period is shorter. Similarly, under preemptive earliest-deadline-first (EDF) scheduling, the utilization bound is guaranteed to be 100%.

However, in many scenarios occasional deadline misses are possible and acceptable. Systems that can still function correctly under these conditions are called *soft* real-time systems. When the deadline misses are bounded and limited, the term *weakly hard* real-time system is used. For such cases, safe and tight quantitative properties of deadline misses have to be analyzed so that the system designers can verify whether the occasional deadline misses are acceptable from the system's perspective. For this purpose, probabilistic timing properties can be very useful, in which the probability of deadline misses or miss rates are pursued. In safety standards, e.g., IEC-61508 and ISO-26262, the probability of failure has to be proved to be sufficiently low. Probabilistic timing properties are important to assure the service level agreements in many applications that require realtime communication and real-time decision-making, such as autonomous driving, smart building, internet of things, and industry 4.0. Deterministic guarantees of interest for weakly hard real-time systems include the quantification of the number of deadline misses within a specified time window length, the worstcase tardiness, and the worst-case number of consecutive deadline misses. Such deadline misses may be allowed and designed on purpose, especially to verify the controller for the physical plant in a CPS. With potential deadline misses in mind, suitable control approaches that can systematically account for data losses can be applied. Such weakly hard real-time systems have been proposed as a feature towards timing-aware control software design for automotive systems in [33].

To design a timing predictable and rigorous cyber-physical real-time system, two separate but co-related problems have to be considered:


The real-time systems research results in the past half-century have a significant impact on the design of cyber-physical systems. Allowing system design flexibility by using dynamic schedules (either fixed-priority or dynamic-priority schedules) has not only academic values but also industrial penetration. Nowadays, most realtime operating systems support fixed-priority schedulers and allow periodic as well as sporadic task activations. When task synchronization or resource sharing is necessary, the priority inheritance protocol and the priority ceiling protocol developed by Sha et al. [27] are part of the POSIX Standards (in POSIX.1-2008).

Existing analyses and optimizations for scheduling algorithms and resource management policies in complex real-time systems are usually *ad-hoc* solutions for a specific studied problem. In this chapter, we challenge this design and analytical practice, since the future design of real-time systems will be more complex, not only in the execution model but also in the parallelization, communication, and synchronization models.

#### **Our Conjecture**

We strongly believe that the future design of real-time systems require *formal properties* that can be used modularly to compose safe and tight analysis as well as optimization for the scheduler design and schedulability test problems. This chapter summarizes our recent progress at TU Dortmund for property-based analyses of real-time embedded systems with respect to both deterministic and probabilistic properties.

#### **4.2 Formal Analysis Based on Schedule Functions**

For uniprocessor systems, at most one job is executed at a time. Therefore, a **scheduling algorithm** (or **scheduler**) determines the order, in which jobs are executed on the processor, called a schedule. A schedule is an assignment of the given jobs to the processor, such that each job is executed (not necessarily consecutively) until completion. Suppose that **J** = {*J*1*, J*2*, ... Jn*} is a set of *n* given jobs. A schedule for **J** can be defined as a function *σ* : R → **J**∪{⊥}, where *σ (t)* = *Jj* denotes that job *Jj* is executed at time *t*, and *σ (t)* = ⊥ denotes that the system is idle at time *t*.

If *σ (t)* changes its value at some time *t*, the processor performs a **context switch** at time *t*. For a schedule *σ* to be valid with respect to the arrival time, the absolute deadline, and the execution time of the given jobs, we need to have the following conditions for each *Jj* in **J** for *hard real-time guarantees*:


*Note that the integration of* <sup>1</sup>*σ (t)*=certain job *over time used in this chapter is only a symbolic representation for summation.*

For a given sporadic task set **T**, each task *τi* in **T** can generate an infinite number of jobs as long as the temporal conditions of arrival times of the jobs generated by task *τi* can satisfy the minimum inter-arrival time constraint.

Suppose that the *j* th job generated by task *τi* is denoted as *Ji,j* . Let the set of jobs generated by task *τi* be denoted as **FJ***i*. A feasible set of jobs generated by a sporadic real-time task *τi* satisfies the following conditions:


A feasible set of jobs generated by a periodic real-time task *τi* should satisfy the first two conditions above and the following condition:

• By periodic releases, we have *ri,*<sup>1</sup> = *Oi* and *ri,j* = *ri,j*−<sup>1</sup> + *Ti* for any integer *j* with *j* ≥ 2.

A *feasible collection* **FJ** *of jobs* generated by a task set **T** is the union of the feasible sets of jobs generated by the sporadic (or periodic) tasks in **T**, i.e., **FJ** = ∪*τi*∈**TFJ***i*. It should be obvious that there are infinite feasible collections of jobs generated by a sporadic real-time task set **T**.

For a feasible collection **FJ** of jobs generated by **T**, a uniprocessor schedule for **FJ** can be defined as a function *σ* : R → **FJ** ∪ {⊥}, where *σ (t)* = *Ji,j* denotes that job *Ji,j* is executed at time *t*, and *σ (t)* = ⊥ denotes that the system is idle at time *t*. Recall that we assume that the jobs of task *τi* should be executed in the FCFS manner. Therefore, if *σ (t)* = *Ji,j* then *σ (t ) /*∈ *Ji,h*|*h* = 1*,* 2*,...,j* − 1 , for any *t > t* and *j* ≥ 2.

The feasibility and optimality of scheduling algorithms should be defined based on all possible feasible collections of jobs generated by **T**.

**Definition 4.1** Suppose that we are given a set **T** of sporadic real-time tasks on a uniprocessor system. A schedule *σ* of a feasible collection **FJ** of jobs generated by **T** is feasible for hard real-time guarantees if the following conditions hold for each *Ji,j* in **FJ**:

• *σ (t)* = *Ji,j* for any *t* ≤ *ri,j* and *t>di,j* ,

$$\bullet \quad \int\_{r\_{l,j}}^{d\_{l,j}} \mathbbm{1}\_{\sigma(t) = J\_{l,j}} dt = C\_{l,j}, \text{ and}$$

• if *σ (t)* = *Ji,j* , then *σ (t ) /*∈ *Ji,h*|*h* = 1*,* 2*,...,j* − 1 , for any *t > t* and *j* ≥ 2.

A sporadic real-time task set **T** is *schedulable* for hard real-time guarantees under a scheduling algorithm if the resulting schedule of any feasible collection **FJ** of jobs generated by **T** is always feasible. A scheduling algorithm is *optimal* for hard realtime guarantees if it always produces feasible schedule(s) when the task set **T** is schedulable under a scheduling algorithm.

#### *4.2.1 Preemptive EDF*

For the preemptive earliest-deadline-first (EDF) scheduling algorithm, the job in the ready queue whose absolute deadline is the earliest is executed on the processor. To validate the schedulability of preemptive EDF, the *demand bound function* DBF*i(t)*, defined by Baruah et al. [1], has been widely used to specify the maximum demand of a sporadic (or periodic) real-time task *τi* to be released and finished in a time interval with length equal to *t*:

$$\text{DBF}\_l(t) = \max\left\{0, \left\lfloor \frac{t - D\_l}{T\_l} \right\rfloor + 1\right\} \times C\_l. \tag{4.1}$$

To prove the correctness of such a demand bound function, we focus on all possible feasible sets of jobs generated by a sporadic/periodic real-time task *τi*. Recall that a feasible set **FJ***<sup>i</sup>* of jobs generated by a sporadic/periodic real-time task *τi* should satisfy the following conditions:


**Lemma 4.1** *For a given feasible set F J<sup>i</sup> of jobs generated by a sporadic/periodic real-time task τi, let F Ji,*[*r,r*+*t*] *be the subset of the jobs in F J<sup>i</sup> arriving no earlier than r and have absolute deadlines no later than r* + *t. That is,*

$$F\mathbf{J}\_{l,[r,r+t]} = \left\{ J\_{l,j} \mid J\_{l,j} \in F\mathbf{J}\_l, r\_{l,j} \ge r, d\_{l,j} \le r+t \right\}.\tag{4.2}$$

*For any r and any t >* 0*,*

$$\sum\_{\{J\_{l,j}\in \mathbf{F} \: \mathbf{J}\_{l,\{r,r+t\}}\}} C\_{l,j} \le \mathsf{DBF}\_l(t). \tag{4.3}$$

*Proof* By definition, DBF*i(t)* ≥ 0. Therefore, if **FJ***i,*[*r,r*+*t*] is an empty set, we reach the conclusion.

We consider that **FJ***i,*[*r,r*+*t*] is not empty for the rest of the proof. Let *Ji,j*<sup>∗</sup> be the first job generated by task *τi* in **FJ***i,*[*r,r*+*t*]. By the definition of **FJ***i,*[*r,r*+*t*] in Eq. (4.2), the arrival time *ri,j* <sup>∗</sup> of job *Ji,j* <sup>∗</sup> is no less than *r*, i.e., *ri,j* <sup>∗</sup> ≥ *r*. Since **FJ***i,*[*r,r*+*t*] is not empty, *ri,j* <sup>∗</sup> + *Di* ≤ *r* + *t*.

Since *ri,j* ≥ *ri,j*−<sup>1</sup> +*Ti* for any integer *j* with *j* ≥ 2 for the jobs in **FJ***<sup>i</sup>* as well as the jobs in **FJ***i,*[*r,r*+*t*], the absolute deadlines of the *subsequent* jobs in **FJ***i,*[*r,r*+*t*] are *at least ri,j* <sup>∗</sup> + *Ti* + *Di, ri,j* <sup>∗</sup> + 2*Ti* + *Di, ri,j* <sup>∗</sup> + 3*Ti* + *Di,...*. Therefore, there are at most *<sup>r</sup>*+*t*−*(ri,j*∗+*Di) Ti* + 1 ≤ *t*−*Di Ti* + 1 jobs in **FJ***i,*[*r,r*+*t*] since *r* ≤ *ri,j* <sup>∗</sup> . Since the actual execution time *Ci,j* of each job *Ji,j* is no more than *Ci* by the definition of the jobs in **FJ***i*, we reach the conclusion.

With the help of Lemma 4.1, the following theorem holds.

**Theorem 4.1** *A set* **T** *of sporadic tasks is schedulable under uniprocessor preemptive EDF if and only if*

$$\forall t > 0, \qquad \sum\_{\mathbf{r}\_l \in \mathbf{T}} \mathsf{DBF}\_l(t) \le t. \tag{4.4}$$

*Proof* **Only-if part**, i.e., the necessary schedulability test. We prove the condition by contrapositive. Suppose that there exists a *t >* 0 such that - *τi*∈**<sup>T</sup>** DBF*i(t) > t*, for contrapositive.

For each task *τi*, we create a feasible set of jobs generated by task *τi* by releasing the jobs periodically starting from time 0, and their actual execution times are all set to *Ci*. By the definition of a uniprocessor system in our scheduling model, at most one job is executed at a time. Therefore, the demand of the jobs that are released no earlier than 0 and must be finished no later than *t* is strictly more than the amount of available time since - *τi*∈**<sup>T</sup>** DBF*i(t) > t*. Therefore, (at least) one of these jobs misses its deadline no matter which uniprocessor scheduling algorithm is used.

Therefore, we can conclude that if the task set **T** is schedulable under EDF-P, then - *τi*∈**<sup>T</sup>** DBF*i(t)* <sup>≤</sup> *t,* <sup>∀</sup>*t >* 0.

**If part**, i.e., the sufficient schedulability test: We prove the condition by contrapositive. Suppose that the given task set **T** is not schedulable under EDF-P for contrapositive.

Then, there exists a feasible collection of jobs generated by **T** which cannot be feasibly scheduled under EDF-P. Let **FJ** be such a collection of jobs, where **FJ***<sup>i</sup>* is its subset generated by a sporadic real-time task *τi* in **T**. Let *σ* : R → **FJ** ∪ {⊥} be the schedule of EDF-P for **FJ**. Since at least one job misses its deadline in *σ*, let job *Jk,* be the first job which misses its absolute deadline *dk,* in schedule *σ*. That is,

$$\int\_{r\_{k,\ell}}^{d\_{k,\ell}} \mathbb{1}\_{\sigma(t) = J\_{k,\ell}} dt \, \prec C\_{k,\ell} \le C\_k. \tag{4.5}$$

Let *t*<sup>0</sup> be the earliest instant prior to *dk,*, i.e., *t*<sup>0</sup> *< dk,*, such that the processor only executes jobs with absolute deadlines no later than *dk,* in time interval *(t*0*, dk,*] under EDF-P. That means, immediately prior to time *t*0, i.e., *t* = *t*<sup>0</sup> − for an infinitesimal , *σ (t)* is either ⊥ or a job whose absolute deadline is (strictly) greater than *dk,*. We note that *t*<sup>0</sup> exists since it is at least the earliest arrival time of the jobs in **FJ**. Moreover, since EDF-P does not let the processor idle unless there is no job in the ready queue, *t*<sup>0</sup> ≤ *rk,*.

Let **FJ***i,*[*t*0*,dk,*] be the subset of the jobs in **FJ***<sup>i</sup>* arriving no earlier than *t*<sup>0</sup> and have absolute deadlines no later than *dk,*. That is, we define **FJ***i,*[*t*0*,dk,*] by setting *r* to *t*<sup>0</sup> and *t* to *dk,* − *t*<sup>0</sup> in Eq. (4.2). Let **FJ**[*t*0*,dk,*] be ∪*τi*∈**<sup>T</sup> FJ***i,*[*t*0*,dk,*] for notational brevity.

By the definition of *t*0, *dk,*, and EDF-P, the processor executes only the jobs in **FJ**[*t*0*,dk,*], i.e., *σ (t)* ∈ **FJ**[*t*0*,dk,*] for any *t*<sup>0</sup> *< t* ≤ *dk,*. Therefore,

$$\begin{split} d\_{k,\ell} - t\_0 &\stackrel{\scriptstyle 1}{=} & \left( \int\_{t\_0}^{d\_{k,\ell}} \mathbf{1}\_{\sigma(t) = J\_{k,\ell}} dt \right) + \sum\_{J\_{i,j} \in \mathbf{F} \mathbf{J}\_{[0,\ldots k\_{\ell,\ell}]} \cup \{I\_{k,\ell}\}} \left( \int\_{t\_0}^{d\_{k,\ell}} \mathbf{1}\_{\sigma(t) = J\_{i,j}} dt \right) \\ & \stackrel{\scriptstyle 2}{\leq} & \left( \int\_{t\_0}^{d\_{k,\ell}} \mathbf{1}\_{\sigma(t) = J\_{k,\ell}} dt \right) + \left( \sum\_{\tau\_i \in \mathbf{T} \; \space J\_{i,j} \in \mathbf{F} \mathbf{J}\_{[i,l\_0,d\_{k,\ell}]}} \; C\_{i,j} \right) - C\_{k,\ell} \\ & \stackrel{\scriptstyle 3}{=} & \left( \int\_{\tau\_{k,\ell}}^{d\_{k,\ell}} \mathbf{1}\_{\sigma(t) = J\_{k,\ell}} dt \right) + \left( \sum\_{\tau\_i \in \mathbf{T} \; \space J\_{i,j} \in \mathbf{F} \mathbf{J}\_{[i,l\_0,d\_{k,\ell}]}} \; C\_{i,j} \right) - C\_{k,\ell} \end{split}$$

$$\begin{aligned} \text{Eq. (4.5)} \\ \overset{\text{Eq. (4.5)}}{\leqslant} & C\_{k,\ell} + \left( \sum\_{\tau\_i \nparallel \mathbf{T}} \sum\_{\substack{J\_{i,j} \in \mathbf{F} \mathbf{J}\_{i} \sqcup\_{\mathbf{0}\_0, d\_{k,\ell}}}} C\_{i,j} \right) - C\_{k,\ell}, \\ \overset{\text{Eq. (4.3)}}{\leq} & \sum\_{\tau\_i \nparallel \mathbf{T}} \text{DBF}\_{\ell} (d\_{k,\ell} - t\_0), \end{aligned}$$

where the condition <sup>1</sup> = is due to *σ (t)* ∈ **FJ**[*t*0*,dk,*] for any *t*<sup>0</sup> *< t* ≤ *dk,*, the condition 2 ≤ is due to the definition of a schedule of the jobs in **FJ**[*t*0*,dk,*] \ *Jk,* , the condition 3 = is due to *t*<sup>0</sup> ≤ *rk,*, and *σ (t)* = *Jk,* for *t*<sup>0</sup> *< t* ≤ *rk,*. Hence, there is a certain = *dk,* −*t*<sup>0</sup> with - *τi*∈**<sup>T</sup>** DBF*i() >* . We reach our conclusion by contrapositive.

#### *4.2.2 Preemptive Fixed-Priority Scheduling Algorithms*

Under preemptive fixed-priority (FP-P) scheduling, each task is assigned a unique priority before execution and does not change over time. The jobs generated by a task always have the same priority defined by the task. Here, we define *hp(τk)* as the set of higher-priority tasks than task *τk* and *lp(τk)* as the set of lower-priority tasks than task *τk*. When task *τi* has a higher priority than task *τj* , we denote their priority relationship as *τi > τj* . We assume that the priority levels are unique.

For FP scheduling algorithms, we need another notation

$$\mathbf{FRJ}\_{l, \{r, r + \Delta\}} = \left\{ J\_{l, j} \mid J\_{l, j} \in \mathbf{FJ}\_l, r\_{l, j} \ge r, r\_{l, j} < r + \Delta \right\}. \tag{4.6}$$

That is, for a given feasible set **FJ***<sup>i</sup>* of jobs generated by a sporadic/periodic realtime task *τi*, let **FRJ***i,*[*r,r*+*)* be the subset of the jobs in **FJ***<sup>i</sup>* arriving in time interval [*r, r* + *)*. By extending the proofs like in Sect. 4.2.1, we can also prove the following lemma and theorem.

**Lemma 4.2** *The total amount of execution time of the jobs of τi that are released in a time interval* [*r, r* + *) for any* ≥ 0 *is*

$$\sum\_{J\_{l,j}\in\mathbf{FRJ}\_{l,\{r,r+\Delta\}}} C\_{l,j} \le \left\lceil \frac{\Delta}{T\_l} \right\rceil C\_l \stackrel{def}{=} demand\_l(\Delta). \tag{4.7}$$

**Theorem 4.2** *Let* min *>* 0 *be the minimum value that satisfies*

$$
\Delta\_{\min} = C\_k + \sum\_{\mathbf{r}\_l \in hp(\mathbf{r}\_k)} demand\_l(\Delta\_{\min}).\tag{4.8}
$$

*The WCRT Rk of task τk in a preemptive fixed-priority uniprocessor scheduling algorithm is*


Theorem 4.2 can be re-written into a more popular form, called time-demand analysis (TDA) proposed by Lehoczky et al. [21]: A (constrained-deadline) task *τk* is schedulable under FP-P scheduling if and only if

$$\exists t \left| 0 < t \le D\_k \le T\_k \right. \\ \left. C\_k + \sum\_{\mathfrak{r}\_l \in hp(\mathfrak{r}\_k)} \left\lceil \frac{t}{T\_l} \right\rceil C\_l \le t. \right. \tag{4.9}$$

Theorem 4.2 is a very interesting and remarkable result, widely used in the literature. It suggests to validate the worst-case response time of task *τk* by


To explain the above phenomena, Liu and Layland in their seminal paper [23] in 1973 defined two terms (according to their wording):


Liu and Layland [23] concluded the famous **critical-instant theorem** as follows: "*A critical instant for any task occurs whenever the task is requested simultaneously with requests for all higher-priority tasks*." Their proof was in fact incomplete. Moreover, their definition of the critical-instant theorem was incomplete since the condition min *> Tk* was not considered in their definition. A **precise definition of the critical-instant theorem** is revised as follows:

	- a job of task *τk* released at this instant has the largest response time if it is no more than *Tk* or
	- the worst-case response time of a job of task *τk* released at this instant is more than *Tk*.

#### **4.3 Utilization-Based Analyses for Fixed-Priority Scheduling**

The TDA in Eq. (4.8) requires pseudo-polynomial-time complexity to check the time points in *(*0*, Dk*] for Eq. (4.8), which can be further generalized for verifying the schedulability of task *τk* under fixed-priority scheduling:

$$\exists 0 < t \le D\_k \text{ s.t. } C\_k + \sum\_{\mathfrak{r}\_l \in hp(\mathfrak{r}\_l)} \sigma \left( \left\lceil \frac{t}{T\_l} \right\rceil C\_l + bC\_l \right) \le t,\tag{4.10}$$

where *σ >* 0 and *b* ≥ 0. Equation (4.10) can be used in many cases if *Dk* ≤ *Tk*, such as


Although testing Eq. (4.10) takes pseudo-polynomial time, it is not always necessary to test all possible time points to derive a safe worst-case response time or to provide sufficient schedulability tests. The general and key concept to obtain sufficient schedulability tests in **k2U** in [7, 8] and **k2Q** in [6, 10] is to test only a subset of such points for verifying the schedulability. Traditional fixedpriority schedulability tests often have pseudo-polynomial-time (or even higher) complexity. The idea implemented in the **k2U** and **k2Q** frameworks is to provide a general *k*-point schedulability test, which only needs to test *k* points under *any* fixed-priority scheduling when checking schedulability of the task with the *k*th highest priority in the system. Suppose that there are *k* − 1 higher-priority tasks, indexed as *τ*1*, τ*2*,...,τk*−1, than task *τk*. Recall that the task utilization is defined as *Ui* = *Ci/Ti*. The success of the **k2U** framework is based on a *k*-point effective schedulability test, defined as follows:

**Definition 4.2 (Chen et al. [7, 8])** A *k*-point effective schedulability test is a sufficient schedulability test of a fixed-priority scheduling policy that verifies the existence of *tj* ∈ {*t*1*, t*2*,...tk*} with 0 *< t*<sup>1</sup> ≤ *t*<sup>2</sup> ≤···≤ *tk* such that

$$C\_k + \sum\_{l=1}^{k-1} \alpha\_l t\_l U\_l + \sum\_{l=1}^{j-1} \beta\_l t\_l U\_l \le t\_j,\tag{4.11}$$

where *Ck >* 0, *αi >* 0, *Ui >* 0, and *βi >* 0 are dependent upon the setting of the task models and task *τi*.

The properties in Definition 4.2 lead to the following lemmas for the **k2U** framework which are proven in [8].

**Lemma 4.3** *For a given k-point effective schedulability test of a scheduling algorithm, defined in Definition 4.2, in which* 0 *< tk and* 0 *< αi* ≤ *α, and* 0 *< βi* ≤ *β for any i* = 1*,* 2*,...,k* − 1*, task τk is schedulable by the scheduling algorithm if the following condition holds:*

$$\frac{C\_k}{t\_k} \le \frac{\frac{\alpha}{\beta} + 1}{\prod\_{j=1}^{k-1} (\beta U\_j + 1)} - \frac{\alpha}{\beta}.\tag{4.12}$$

**Lemma 4.4** *For a given k-point effective schedulability test of a scheduling algorithm, defined in Definition 4.2, in which* 0 *< tk and* 0 *< αi* ≤ *α and* 0 *< βi* ≤ *β for any i* = 1*,* 2*,...,k* − 1*, task τk is schedulable by the scheduling algorithm if*

$$\frac{C\_k}{t\_k} + \sum\_{l=1}^{k-1} U\_l \le \frac{(k-1)((\alpha+\beta)^{\frac{1}{k}}-1) + ((\alpha+\beta)^{\frac{1}{k}}-\alpha)}{\beta}.\tag{4.13}$$

*Example 4.1* Suppose that *Dk* = *Tk* and the tasks are indexed by the periods, i.e., *T*<sup>1</sup> ≤ ··· ≤ *Tk*. When *Tk* ≤ 2*T*1, task *τk* is schedulable by preemptive ratemonotonic (RM) scheduling if there exists *j* ∈ {1*,* 2*,...,k*} where

$$C\_k + \sum\_{i=1}^{k-1} C\_i + \sum\_{i=1}^{j-1} C\_i = C\_k + \sum\_{i=1}^{k-1} T\_i U\_i + \sum\_{i=1}^{j-1} T\_i U\_i \le T\_j. \tag{4.14}$$

Therefore, the coefficients in Definition 4.2 for this test are *αi* = *βi* = 1 and *ti* = *Ti* for *i* = 1*,* 2*,...,k* − 1, and *tk* = *Tk*. Based on Lemma 4.3, the schedulability of task *τk* under preemptive RM is guaranteed if

$$\frac{C\_k}{T\_k} \le \frac{2}{\prod\_{j=1}^{k-1} (\beta U\_j + 1)} - 1 \quad \Rightarrow \quad \prod\_{j=1}^k (\beta U\_j + 1) \le 2. \tag{4.15}$$

Based on Lemma 4.4, the schedulability condition of task *τk* under preemptive RM is

$$\sum\_{l=1}^{k} U\_l \le k(2^{\frac{1}{k}} - 1). \tag{4.16}$$

The schedulability test in Eq. (4.15) was originally proposed by Bini and Buttazzo [3], called *hyperbolic bound*, as an improvement of the utilization bound in Eq. (4.16) by Liu and Layland in [23]. We note that the original proof in [23] was incomplete, pointed out and fixed by Goossens [15].

The success of the **k2Q** framework is based on a *k*-point effective schedulability test, defined as follows:

**Definition 4.3** A *k*-point last-release schedulability test under a given ordering *π* of the *k* − 1 higher-priority tasks is a sufficient schedulability test of a fixed-priority scheduling policy that verifies the existence of 0 ≤ *t*<sup>1</sup> ≤ *t*<sup>2</sup> ≤ ··· ≤ *tk*−<sup>1</sup> ≤ *tk* such that

$$C\_k + \sum\_{l=1}^{k-1} \alpha\_l t\_l U\_l + \sum\_{l=1}^{j-1} \beta\_l C\_l \le t\_j,\tag{4.17}$$

where *Ck >* 0, for *i* = 1*,* 2*,...,k* − 1, *αi >* 0, *Ui >* 0, *Ci* ≥ 0, and *βi >* 0 are dependent upon the setting of the task models and task *τi*.

The properties in Definition 4.3 lead to the following lemmas for the **k2Q** framework which are proven in [10].

**Lemma 4.5** *For a given k-point last-release schedulability test of a scheduling algorithm in Definition 4.3, in which* 0 *< αi, and* 0 *< βi for any i* = 1*,* 2*,...,k*−1*,* 0 *< tk, k*−1 *<sup>i</sup>*=<sup>1</sup> *αiUi* <sup>≤</sup> <sup>1</sup>*, and k*−1 *<sup>i</sup>*=<sup>1</sup> *βiCi* <sup>≤</sup> *tk, task τk is schedulable by the fixedpriority scheduling algorithm if the following condition holds:*

$$\frac{C\_k}{t\_k} \le 1 - \sum\_{l=1}^{k-1} \alpha\_l U\_l - \frac{\sum\_{l=1}^{k-1} (\beta\_l C\_l - \alpha\_l U\_l (\sum\_{\ell=l}^{k-1} \beta\_\ell C\_\ell))}{t\_k}. \tag{4.18}$$

*Example 4.2* Suppose that *Dk* = *Tk* and the tasks are indexed by the periods, i.e., *T*<sup>1</sup> ≤ ··· ≤ *Tk*. When *Tk* ≤ 2*T*1, task *τk* is schedulable by rate-monotonic (RM) scheduling if there exists *j* ∈ {1*,* 2*,...,k*} where

$$C\_k + \sum\_{i=1}^{k-1} C\_i + \sum\_{i=1}^{j-1} C\_i = C\_k + \sum\_{i=1}^{k-1} T\_i U\_i + \sum\_{i=1}^{j-1} C\_i \le T\_j. \tag{4.19}$$

Therefore, the coefficients in Definition 4.3 for this test are *αi* = *βi* = 1 and *ti* = *Ti* for *i* = 1*,* 2*,...,k* − 1, and *tk* = *Tk*. Based on Lemma 4.5, the schedulability of task *τk* under preemptive RM is

$$\frac{C\_k}{T\_k} \le 1 - \sum\_{i=1}^{k-1} U\_i - \frac{\sum\_{l=1}^{k-1} (C\_l - U\_l(\sum\_{\ell=l}^{k-1} C\_\ell))}{T\_k}.\tag{4.20}$$

The test in Eq. (4.20) is a quadratic form. The first *quadratic bound* (QB) by Davis and Burns in Equation (26) in [14] and Bini et al. in Equation (11) in [4] is

$$\sum\_{l=1}^{k} U\_l + \frac{\sum\_{l=1}^{k-1} C\_l - \sum\_{l=1}^{k-1} U\_l C\_l}{T\_k} \le 1. \tag{4.21}$$

The test in Eq. (4.20) is superior to the test in Eq. (4.21).

The generality of the **k2Q** and **k2U** frameworks has been demonstrated in [8, 10]. We believe that these two frameworks, to be used for different cases, have great potential in analyzing many other complex real-time task models, where the existing analysis approaches are insufficient or cumbersome.

For the **k2Q** and **k2U** frameworks, their characteristics and advantages over other approaches have been already discussed in [8, 10]. In general, the **k2U** framework is more precise by using only the utilization values of the higher-priority tasks. If we can formulate the schedulability tests into the **k2U** framework, it is also usually possible to model it into the **k2Q** framework. In such cases, the same pseudo-polynomial-time test is used. When we consider the worst-case quantitative metrics like utilization bounds, resource augmentation bounds, or speedup factors, the result derived from the **k2U** framework is better for such cases. However, there are also cases, in which formulating the test by using the **k2U** framework is not possible. These cases may even start from schedulability tests with exponentialtime complexity. We have successfully demonstrated three examples in [6] by using the **k2Q** framework to derive polynomial-time tests. In those demonstrated cases, either the **k2U** framework cannot be applied or with worse results (since different exponential-time or pseudo-polynomial-time schedulability tests are applied).

The automatic procedure to derive the parameters in the **k2U** can be found in [9]. Previously, the parameters in all the examples in [8] were manually constructed. This automation procedure significantly empowers the **k2U** framework to automatically handle a wide range of classes of real-time execution platforms and task models, including uniprocessor scheduling, multiprocessor scheduling, self-suspending task systems, real-time tasks with arrival jitter, services and virtualizations with bounded delays, etc. We believe that the **k2U** framework and the automatic parameter derivations together can be a very powerful tool for researchers to construct utilization-based analyses almost automatically. Depending on the needs of the use scenarios, a more suitable schedulability test class should be chosen for deriving better results.

**Utilization-Based Analysis for Dynamic-Priority Scheduling Algorithms** The **k2U** and **k2Q** frameworks provide general utilization-based timing analyses for fixed-priority scheduling. One missing building block is the utilization-based timing analyses for dynamic-priority scheduling algorithms, like EDF. The analytical framework in [8, 10] is based on analytical solutions of linear programming. However, such formulations do not work for EDF.

#### **4.4 Probabilistic Schedulability Tests**

In many real-time systems, it is tolerable that at least some of the tasks in the system miss their deadline in rare situations. Regardless, these deadline misses must be quantified to ensure the system's safety. We examine the problem of determining the deadline miss probability of a task under uniprocessor static-priority preemptive scheduling for an uncertain execution behavior, i.e., when each task has distinct execution modes and a related known probability distribution.

One important assumption for real-time systems is that a deadline miss, i.e., a job that does not finish its execution before its deadline, will be disastrous and thus the WCET of each task is always considered during the analysis. Nevertheless, if a job has multiple distinct execution schemes, the WCETs of those schemes may differ significantly. Examples are software-based fault-recovery techniques which rely on (at least partially) re-executing the faulty task instance, mixed-criticality systems, and a reduced CPU frequency to prevent overheating. In all these cases, it is reasonable to assume that schemes with smaller WCET are the common case, while larger WCETs happen rarely.

We use the example of software-based fault-recovery in the following discussion. When such techniques are applied, the probability that a fault occurs and thus has to be corrected is very low, since otherwise hardware-based fault-recovery techniques would be applied. If re-execution may happen multiple times, the resulting execution schemes have an increased related WCET, while the probability decreases drastically. Therefore, solely considering the execution scheme with the largest WCET at design time would lead to largely overdesigning the system resources. Furthermore, many real-time systems can tolerate a small number of deadline misses at runtime as long as these deadline misses do not happen too frequently. This holds true especially if some of the tasks in the system only have weakly hard or soft real-time constraints. Hence, being able to predict the probability of a deadline miss is an important property when designing real-time systems.

We focus on the probability of deadline misses for a single task here, which is defined as follows:

**Definition 4.4 (Probability of Deadline Misses)** Let *Rk,j* be the response time of the *j* th job of *τk*. The probability of deadline misses (DMP) of task *τk*, denoted by  *<sup>k</sup>*, is an upper bound on the probability that a job of *τk* is not finished before its (relative) deadline *Dk*, i.e.,

$$\Phi\_k = \max\_j \left\{ \mathbb{P}(R\_{k,j} > D\_k) \right\}, \quad j = 1, 2, 3, \dots \tag{4.22}$$

It was shown in [24] that the DMP of a job of a constrained- or implicit-deadline task is maximized when *τk* is released at its critical instant. Hence, the time-demand analysis (TDA) in Eq. (4.8) can be applied to determine the worst-case response time of a task when the execution time of each job is known. This implicitly assumes that no previous job has an overrun that interferes with the analyzed job, i.e., we are searching for the probability that the first job of *τk* misses its deadline after a longer interval where all deadlines were met.

When probabilistic WCETs are considered, the WCET obtains a value in *(Ci,*1*,...,Ci,h)* with a certain probability P*i(j )* for each job of each task *τi*. Therefore, TDA for a given *t* is not looking for a binary decision anymore. Instead, we are interested in the probability that the accumulated workload *St* over an interval of length *t* is at most *t*. The probability that *τk* cannot finish in this interval is denoted accordingly with P*(St > t)*. The situation where *St* is larger than *t* is called an *overload* for an interval of length *t* and hence P*(St > t)* is the *overload probability* at time *t*. Since TDA only needs to hold for one *t* with 0 *< t* ≤ *Dk* to ensure that *τk* is schedulable, the probability that the test fails is upper bounded by the minimum probability among all time points at which the test could fail. As a result, the probability of a deadline miss  *<sup>k</sup>* can be upper bounded by

$$\Phi\_k = \min\_{0 < t \le D\_k} \mathbb{P}(S\_l > t). \tag{4.23}$$

The number of points considered in the TDA can be reduced by only considering the *points of interest*, i.e., *Dk* and the releases of higher-priority tasks.

Therefore, testing the schedulability efficiently requires an efficient routine to calculate P*(St > t)* for a given *t* and a combination of given random variables *St* . The research results at TU Dortmund have recently achieved efficient calculations as follows:


We note that the DMP is not identical to the *deadline miss rate* of a task and that the deadline miss rate may be even higher than this probability, as detailed by Chen et al. [11]. However, the approach in [11] utilizes approaches to approximate the deadline miss probability as a subroutine when calculating the rate.

**Generality of Using** P*(St > t)*

The efficient calculation of P*(St > t)* results in efficient probabilistic schedulability tests and deadline miss rate analyses for preemptive fixedpriority uniprocessor systems. The general question is whether this holds also for other scheduling problems and platforms, like multiprocessor systems. Whether the applicability can be generalized is an open problem.

#### **4.5 Conclusion**

The critical-instant theorem has been widely used in many research results. Some of the extensions of the critical-instant theorem are correct, e.g., the level-i busy window concept in [20], and some are unfortunately incorrect, e.g., for self-suspending tasks in [18, 25]. Specifically, the misconception of modeling self-suspension time of a higher-priority task as its release jitter in the worstcase response time analysis in [25] and [19] had become a standard approach in multiprocessor locking protocols in real-time systems since 2009 until the error was found in 2016, summarized in Section 6 in [12].

In addition to the lack of formalism, the existing properties that have been widely used in analyzing timing satisfactions in cyber-physical real-time systems are also *biased towards computation*. One key assumption used in computation is that the execution of one cycle on a processor reduces the execution of a task by one cycle. If the problem under analysis does not have such a property, the workload characterized by using uniprocessor systems cannot be used at all. To explain this mismatch, consider the preemptive worm-hole switching protocol in communication as an example. Suppose that a message has to be sent from node A to node B by using two switches, called *S*<sup>1</sup> and *S*2. Namely, the message has to follow the path *A* → *S*<sup>1</sup> → *S*<sup>2</sup> → *B*. Suppose that the message is divided into *f* communication units, in which a communication unit can be sent and received in every time unit. A fast transmission plan is to fully parallelize the communication if possible. That is, one communication unit from *A* to *S*<sup>1</sup> for the first time unit, one communication unit from *A* to *S*<sup>1</sup> and *S*<sup>1</sup> to *S*<sup>2</sup> for the second time units, etc. Therefore, the communication time of the message can be modeled as *f* + 2. This analysis is correct under the assumption that *S*<sup>1</sup> and *S*<sup>2</sup> are not used by other flows. However, if the usage of *S*<sup>1</sup> or *S*<sup>2</sup> is blocked during the transmission of the message flow, using *f* +2 time units for analysis is problematic. For the fast transmission plan with *f* +2 communication time, it is actually possible that the message is transmitted in 3*f* time units as the links are blocked for any communication parallelism. To handle the increase of time, several factors have been introduced into the real-time analyses for priority-preemptive worm-hole networks, including direct interference, indirect interference, backpressure, non-zero critical instant, sub-route interference, and downstream multiple interference (summarized in Table VII in [17]). However, since the problem under analysis is essentially not the same as a uniprocessor schedule, applying the uniprocessor timing analysis with extensions is in my opinion only possible after a rigorous proof of equivalence. This mismatch leads to a significant amount of flaws in the literature in this topic. Specifically, the analysis in [28] had been considered safe for a few years until a counterexample was provided in [32].

To successfully tackle complex cyber-physical real-time systems that involve computation, parallelization, communication, and synchronization, we believe that new, mathematical, modulable, and fundamental properties for property-based (schedulability) timing analyses and scheduling optimizations are strongly needed. They should capture the pivotal properties of cyber-physical real-time systems and thus enable mathematical and algorithmic research on the topic. The view angles should not be limited to the processor- or computation-centric perspective. When there are abundant cores/processors, the bottleneck of the system design becomes the synchronization and the communication among the tasks [16, 29]. Different flexibility and tradeoff options to achieve real-time guarantees should be provided in a modularized manner to enable tradeoffs between execution efficiency and timing predictability.

**Acknowledgments** Part of this work has been supported by European Research Council (ERC) Consolidator Award 2019, PropRT (Number 865170), and Deutsche Forschungsgemeinschaft (DFG), as part of the priority program "Dependable Embedded Systems"—SPP1500, project GetSURE, and the Collaborative Research Center SFB 876 (http://sfb876.tu-dortmund.de/), subprojects A1, A3, and B2.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 5 ASSISTECH: An Accidental Journey into Assistive Technology**

**M. Balakrishnan**

#### **5.1 The Beginning: Mainly a Facilitator (2000–2005)**

My ASSISTECH journey started by accident. Myself and Mr. Dipendra Manocha (see the adjoining profile) had a common acquaintance with whom he used to meet and exchange audio cassettes in the late 90s and early 2000. Note Dipendra lost his eye sight when he was around 12 years of age. On his reference, Dipendra one day came to meet me with a request that can I help in making the "emacs" editor in Linux accessible through the screen reader software. At that time Dipendra was managing the computer center in NAB (National Association of Blind). At that time I myself was deeply into research and tool development only in the broad areas of VLSI/EDA tools with special focus on system level design that included high level synthesis as well as hardware-software co-design. For such application development, I neither had the skills nor any special interest. On the other hand, I was completely overwhelmed by the sincerity and focus of Dipendra and thus I decided to involve some students through a mini-project. They did manage to develop a basic solution but the solution itself did not get widely deployed but formed the beginning of a very long and fruitful relationship. In the initial years, most of our engagements were similar—he would propose a project (mainly software based) and I would identify a set of project students who would be jointly mentored by us to work on the solution. We became closer and I was always impressed with his ease of understanding the technological capabilities as well as limitations without any formal training in Science or Engineering. Being blind did not seem to matter at all! Note Dipendra's formal college education consists of an under-graduate and post-graduate degree in music. He had immense capability to explain the needs of the visually impaired in a

M. Balakrishnan (-)

Indian Institute of Technology Delhi, Delhi, India e-mail: mbala@cse.iitd.ac.in

<sup>©</sup> The Author(s) 2021 J.-J. Chen (ed.), *A Journey of Embedded and Cyber-Physical Systems*, https://doi.org/10.1007/978-3-030-47487-4\_5

language that could be easily understood by my CSE students. I started inviting him regularly for engaging my CS class to talk about the challenges faced by the VI and these talks were always a big hit and inspired many students.

#### **5.2 Early Phase: Focus on Embedded Systems (2005–2010)**

#### *5.2.1 ASSISTECH and COP315*

COP315 is a project based embedded systems course developed by me in late 90s. This course has an interesting German connection. I spent a year as a visiting Professor (Konrad Zuse Fellow) in the Computer Science Department at the University of Dortmund in 1994–1995. I came across this semester long course (group project course) where a group of students (upto 10) did a project that typically resulted in a small system development which they demonstrated at the end of the semester. The course had no formal lectures and though they were mentored by the research assistants but primarily were expected to rely on the material available in the open domain for building their backgrounds and solving the problems they encountered. I was impressed by the complexity as well as quality of projects including use of FPGAs, etc. by which they could build their projects and demonstrate. This was in spite of the fact that students at Dortmund had relatively much less exposure through formal instruction in hardware design vis-à-vis IIT Delhi students.

On my return to IIT Delhi, I decided to revamp a course which I used to teach for third year students "Microprocessor based System Design". This course initially had lectures, regular practical assignments as well as a small project. After the change, we made groups of 4–6 students, offered them a list of projects while being open to project suggestions from them as well. As in initial years almost all projects were built around microcontrollers and peripheral devices, I gave them a set of lectures on microcontroller based design. COP315 played a major role in the ASSISTECH journey as it incubated many of the ideas that became successful products later not only in Assistive Technology (AT) space but also in other domains.

#### *5.2.2 SmartCane*

The SmartCane journey began as a COP315 project in 2005. This was taken up by a group of 4 students—Rohan Paul, Dheeraj Mehra, Vaibhav Jain, and Ankush Garg who were at that time in the first semester of their third year. The first three students were dual degree students and thus were to stay in the Department for another 3 years. Before the start of the semester, Dipendra first mentioned a key problem in independent mobility of visually impaired in Indian infrastructure—**White cane's inability to detect knee above overhanging obstacles on walking paths without a footprint on the walking path itself. These obstacles can range from low overhanging tree branches on roads and footpaths to jutting out window airconditioners and room coolers in corridors. This often resulted in upper body injuries and resulted in loss of confidence in independent mobility.**

I am sure that the problem specification meetings the students had with Mr. Manocha inspired them a lot. The unusually intense engagement of this student group led by Mr. Rohan Paul with the objectives of the project was evident from the very beginning. As they worked on the initial prototypes, they also made it known to me that if opportunity is given they would like to work on it even beyond the semester. This group of students continued to work on it for the next 3 years (except Ankush who graduated in 2007), developed a series of prototypes and did multiple user testing at NAB (National Association of Blind) with the help of Mr. Dipendra Manocha. Many of the initial technical challenges were addressed in this phase that lasted from 2005–2008. Except at the very end of this phase, we worked without any external funding but were responsible for building some key collaborations that have lasted more than a decade. First 3-D printing facility has been established in IIT Delhi and was being managed by Prof. P.V.M. Rao in Mechanical Engineering Department. To prototype the casing that was in the form of handle we established contact with Prof Rao. He not only helped them fabricate but got soon involved in various aspects of design. This was the beginning of the relationship that has played a critical role in the success of ASSISTECH. Key technical achievements were also very inter-disciplinary in nature. Apart from innovative use of ultrasonic sensor to get the obstacle distance information and then to convey the same using a vibrator,

**Fig. 5.1** SmartCane through various stages of development: Laboratory (2005) to Product (2014)

key technical challenge revolved around reducing power consumption. It was not only in the electronics but also in the coupling between the vibrator and handle for efficient power transfer [2, 18, 19].

At the end of the prototype development (summer of 2007), by a chance coincidence all the three students could get support to present their work at TRANSED 2007 at Montreal. The fact that the paper also got noticed in the conference and received very positive feedback, it helped to motivate the students. Soon after that we were able to sign an agreement with Phoenix Medical Systems (PMS), Chennai for technology transfer and licensing. The period from 2008–2011 was very frustrating as we wrote several proposals for support for translational research to various Government agencies. Challenge was that significant support was also required at the manufacturer's end and agencies were willing to fund only the IIT Delhi Component (Fig. 5.1).

A chance meeting with Dr. Shirshendu Mukherjee, Country Manager of WT (Wellcome Trust, UK) in India resulted in a major success in getting the grant from WT in their first call for "Affordable Health Care in India". Our application itself was successful because the committee was impressed by the team composition academic entity with product know-how, a credible industrial partner and an NGO in the space as dissemination partner. This clearly was a turning point for the lab. The funding not only catered to the requirements of all the three partners but also included many provisions for creating a quality product, e.g. plastic injection mold costs, large scale multi-city validation trials and costs of getting CE marking. This funding clearly helped us to do many things very systematically and with professional support which is not typical of an academic project. Some of the key features of the development process are listed below.


We discuss some of the interesting user-centric design decisions related to ergonomics and aesthetics. Initial designs were difficult for women to grip due to their smaller hand sizes. This came to notice later as the focus group had only men. This required a major redesign including a completely novel packaging. Initially the color of the device was not being considered as it was thought that anyway it is a product for a blind person. A simple question from a blind woman settled this question—do you choose color of the clothes to wear only for yourself or for others as well to see and appreciate?

The success in launching the product within the project duration was highly appreciated by the sponsor (WT) and resulted in significant funding for national and International dissemination. While the industry partner focused on scaling manufacturing, over the next 3 years a team from IIT Delhi and Saksham focused on training users and mobility instructors and building partnership with 40+ agencies across the country. This has played a significant role in widespread acceptance of the device with sales crossing 70,000 units in 5 years (Figs. 5.2 and 5.3).

#### *5.2.3 OnBoard*

OnBoard is a globally unique solution for assisting visually impaired users to board public buses independently. It addresses two challenges that are faced by the VI in this process of independent boarding (Figs. 5.4 and 5.5).

**Fig. 5.2** SmartCane as a product comes with number of support material for trainers as well as for users (self-learning). It contains manuals in Braille and audio manual in different languages

**Fig. 5.3** SmartCane assembly line facility at Phoenix Medical Systems Chennai and a batch of devices packed in a box

**Fig. 5.4** This is a typical bus stop in Delhi. Very often a very large number of routes use the same bus stop and visually impaired person requires the help of another passenger waiting at the stop to help him/her identify the route number

**Fig. 5.5** Buses for various reasons do not come and stop in the bay necessitating waiting passengers to walk even up to 25 m to board the bus. For a visually impaired it becomes very difficult as well as unsafe to locate the entry door to board the bus especially as the time available is short and number of passengers may be rushing towards the bus

1. Typically many route numbers use the same bus stop. In Indian Metros in some of the bus stops the number of route numbers being serviced may even go up to 50 but 15+ is quite common and 5+ is almost the norm except in the suburbs. A VI commuter always requires help of another commuter at the bus stop and that poses two challenges.


Technically the work on OnBoard started more or less concurrently with Smart-Cane but after the development of prototypes and some testing, further development was shelved as the focus shifted to translational research on SmartCane. In hindsight it was the right decision as the solution involved changes to the infrastructure provided by a third party (bus operators) struggle has been much higher to get it implemented. The key technical features included a simple user interface, implementation of a slotted network protocol to simultaneously handle up to 8 buses at a bus stop and dual frequency protocol between the user device and the bus device for meeting both the requirements—querying and getting the route number and locating the entry door using an audio cue (Figs. 5.6 and 5.7).

**Fig. 5.6** User presses the query button (user module): (1) RF query is sent to all buses in the vicinity and buses respond with their route numbers. (2) User module reads out all the route numbers one by one

**Fig. 5.7** In case a route number is of user interest: (1) User presses the select button (user module), This triggers a voice output from the speaker (bus module fixed near the door). (2) Acts as auditory cue for locating the door

Starting 2014, with the help of TIDE scheme of DST, Govt of India, we restarted this work and did some preliminary testing on DIMTS buses in Delhi. Subsequently with the help of a very active organization in Mumbai (XRCVC1) we were able to reach out to BEST where between January and April 2015 we installed these devices on 25 buses (all buses on route numbers 121 and 134 from Back Bay depot). We identified 21 blind bus users, trained them on the device with 5 to 6 supervised boardings each and then asked them to do a total of 350 unsupervised boardings. As they were unsupervised boardings, we needed to evolve a mechanism of assessing the effectiveness of the solution. We asked the users to record the time they reached the bus stop and the time they boarded the bus. Comparing the waiting time at the bus stop with the frequency of the service, we could determine whether the user could board the first bus on the route or not. We achieved 92%+ success rate in users independently boarding the first bus on the route implying the device was very effective. We believe the failure was lower than 8% as sometimes the depot change the buses deployed on a specific route due to operational reasons and it was possible that some of the buses deployed on these routes did not contain the OnBoard bus device [8, 13, 16, 21].

Last 2 years had been spent in reducing the size of the user device and making it aesthetically pleasing. The bus device has also been significantly miniaturized and now it can be operated from the power source of the bus instead of a separate battery as was the case during Mumbai trials Further, a simple protocol by which the device can be retro-fitted in a few minutes has been developed and tested. At present we are awaiting funding for a much larger trials where the users would use the device over a period for their regular commute requirements. We feel such trials

<sup>1</sup>http://www.xrcvc.org/.

**Fig. 5.8** Mumbai trials on BEST Buses (Jan–April 2015): Bus device was mounted on the window and the unit operated with its own battery. The battery unit is seen below the front seat reserved for the disabled

**Fig. 5.9** Miniaturized bus device and user device (not to scale) used in the second Delhi trials. The bus device is mounted on the DMITS orange cluster bus in Delhi. Again the validation trials on these new devices were conducted successfully during May to July 2018

are required before we pursue the same becoming a regulatory requirement so that the bus systems become inclusive which is part of the Government policy (Figs. 5.8 and 5.9).

#### **5.3 Collaborations and Research: Formation of ASSISTECH (2010–2013)**

#### *5.3.1 Student Projects to Research*

This phase also saw a consolidation of our activities through the formation of ASSISTECH group with a clear objective of working in the space of mobility and education of visually impaired. Allocation of laboratory space in the newly constructed School of Information Technology building significantly facilitated this process. Apart from involvement of students through under-graduate projects, registration of first PhD student (Mr. Piyush Chanana) changed the nature of activities in the laboratory. This also meant that a more structured approach to involving users in evolving product/project specification became possible. Saksham our collaborator posted two of its blind employees to the laboratory and that had an impact at multiple levels. By this time it was clear that any design and product development without involvement of users at all stages, a process that is today referred to as co-creation is essential in this space. It is not sufficient for the designers to understand the limitations of VI users but they also have to understand their strengths in equal measure.

Inter-disciplinary research has gained traction globally but still very often the laboratories draw students from within a single disciplinary background. ASSIS-TECH with its user focus has been able to break this barrier and today has computer science, electrical engineering, mechanical engineering and design expertise apart from user community members under one roof. This has created a unique ecosystem for user-centric approaches to problem solving.

#### *5.3.2 NVDA Activities*

NVDA or Non-visual Desk Top access is a screen reader software that makes Microsoft Office products accessible to the visually impaired. It is an open source movement and has been successful in creating a large user base globally. ASSIS-TECH also participated in improving the NVDA tools to essentially support tables that are found in documents. Navigation across the cells of a table in an intuitive and non-verbose manner is critical for VI to comprehend tabular information. ASSISTECH helped augment NVDA in multiple ways and worked closely with the user groups [3, 4].

#### *5.3.3 TacRead and DotBook*

Refreshable Braille displays for accessing digital text are a technology which is more than three decades old. These are line display devices that contain an array of cells (typically ranging from 8 to 40) consisting of 6 or 8 pins per cell to create a 6-dot or 8-dot Braille character, respectively. The pins are driven by piezo actuators and through their up and down movement form the Braille characters. Though the devices have been around for a long time but the costs have been so high that there was hardly any penetration in low-income countries like India. Combination of patented technologies, low volume production associated with high margins from monopoly cell manufactures meant the devices continued to cost USD 50 to USD 100 per cell. The rapid growth of screen readers that had started becoming available on all platforms also meant that even in the high-income countries the Braille displays saw a dwindling market. On the other hand, many studies have now shown

**Fig. 5.10** Initial single Braille cell based on SMA (Shape memory alloys) designed and fabricated during 2014–2017

that if visually impaired persons do not learn to read their ability to write would be minimal. This resulted in the transforming Braille project with worldwide interest in producing low-cost Braille reading and writing devices.<sup>2</sup>

At the same time at ASSISTECH we were engaged in developing refreshable Braille cells using shape memory alloys. The development process involved efforts of multiple batches of mechanical engineering students working with our research staff. These students not only worked on it as their course projects but often stayed back after graduation to work on refining their designs to improve performance or reduce size/weight/cost. It has taken more than 5 years of design, testing, and instrumentation to produce reliable Braille cell modules. Based on the success of the SmartCane, WT funded us again for the Refreshable Braille displays. This time we had two industry partners (PMS and KSPL—KritiKal Solutions Private Limited) and Saksham was again our dissemination partner. The modules produced by Phoenix are being called TacRead, whereas the devices produced by KSPL using these modules are named DotBook. We did a launch of 20-cell and 40-cell devices on Feb 2019. Small volume production is on but some key changes and tooling required for volume production is being setup (Fig. 5.10).

The device is highly complex and represents innovation and design in software, electronics design, mechanical design as well as ergonomics. This also resulted in a number of research papers as well as a well awarded Master's thesis by Suman Muralikrishna [1, 7, 17, 20]. DotBook provides all the key functionality of a laptop including document editing, web browsing as well as emailing. Today all standard

<sup>2</sup>http://transformingbraille.org/.

tools have software towards taking inputs from graphical user interfaces. They all needed to be thought afresh as it was not only a character line display but had a length of only 40 characters. Power delivery as well as consumption was a huge challenge as SMA wires required heating to actuate and needed to be cooled to bring the pin return to its original position. Overheating meant not only possibility of wire breakage over time but also slower refresh rate due to higher latency required for deactivation. Mechanical complexity was evident with 320 moving parts (8 pins per cell and 40 cells) which need to be controlled such that the heights of actuated pins are within + or − 0.1 mm in a 1 mm movement. Many user studies were conducted both to identify the functions associated with the limited set of keys that are available as well as their suitable positioning on the device. Volume production is expected to begin by early 2020 as major reliability related issues in the cell module have been sorted out (Fig. 5.11).

#### **5.4 Change of Focus: Technology to Users (2013–2016)**

#### *5.4.1 Tactile Graphics Project*

During the dissemination phase of SmartCane, it was decided to create a selflearning manual for visually impaired. It was easy to print the Braille text in English and Hindi but the manual also contained a set of simple diagrams to explain ultrasonic ranging. In this process we realized that there is no structured way for producing tactile diagrams in some volume in India. This was in sharp contrast to what we saw in UK and USA where almost all the text material including diagrams were available to visually impaired students as Braille books. Clearly unavailability of tactile diagrams created a significant barrier to visually impaired students to pursue STEM subjects—typically they were forced to study only subjects like history and literature that did not require access to diagrams. This

**Fig. 5.11** DotBook: 20-cell and 40-cell devices that were launched on 28 Feb 2019. Volume production is expected to start in first quarter of 2020

prompted us to develop a low-cost technology for production of tactile graphics. Under a project sponsored by MEITY (Ministry of Information Technology), knowhow was created for production of low-cost tactile diagrams. This involved some software adaptations, standardization of 3-D printing for preparing low-cost molds, and setting up of facilities for production of tactile material using thermoforming. As tactile route to learning is very different from visual route, guidelines have been developed over many decades in the USA and Europe (e.g. BANA1). A set of tactile designers were trained using these guidelines and then in collaboration with an apex agency in India that is responsible for school curriculum and education, existing Science and Mathematics books for School grades 9th and 10th were produced with tactile diagrams. These were extensively tested with both children studying in blind schools as well as inclusive schools [15]. Once the project was completed, a non-profit company named Raised Lines Foundation<sup>3</sup> (RLF) has been incubated in August 2018. RLF is at present engaged in design and production of tactile diagrams and other tactile material for education. Initial feedback suggests that it is helping many visually impaired students to pursue STEM subjects all over India. In the next few years not only we intend to scale this venture but also create partnerships across the country for creation of tactile material in different Indian languages. It is also planned to create a design service for organizations outside India (Figs. 5.12 and 5.13).

#### *5.4.2 More Research Projects and International Collaboration*

This period also saw enrollment of number of PhD and Masters students in this space. As Design students started enrolling with us, it was clear that there were many

**Fig. 5.12** Tactile diagrams developed as part of the project Tactile Graphics sponsored by MEITY, Govt of India made innovative use of 3-D printing for production of low-cost tactile diagrams

<sup>3</sup>http://raisedlines.org/.

**Fig. 5.13** A non-profit company has been formed using the know-how developed under the project to produce school text books and other reading material for visually impaired. The company is in operation since August 2018

open questions to be answered to make the tactile diagrams effective. Also modern tactile production processes have created more flexibility in production but there have been little study on their effectiveness. Ms. Richa Gupta, a design graduate, started studying effectiveness of various forms of representation of tactile diagrams and is now close to finishing her work [10, 14].

This research on tactile diagrams also created our first significant collaboration with IUPUI in Indianapolis. Prof. Steven Mannheimer, whom we met in a Indo-US workshop in India was also interested in pursuing this area. He had already done some work with Indianapolis School for the Blind (ISB). This collaboration enabled Richa to conduct her studies and establish effectiveness of representation techniques in two distinct geographies—NAB in New Delhi and ISB at Indianapolis [11].

#### **5.5 Consolidation and Growth (2016 - )**

The period after 2016 has seen lot of growth—both in terms of research projects, students as well as activities. This period also saw lots of national and International recognitions.

#### *5.5.1 RAVI*

Reading Assistant for Visually Impaired (RAVI) is a project aimed at making pdf documents accessible. The work is under progress and we expect some initial results next year [6, 12]. The challenge is manifold


#### *5.5.2 MAVI*

Mobility Assistant for Visually Impaired (MAVI) is a project to use modern AI based image classification techniques for safe and efficient mobility of visually impaired. The focus of this work is to look at object detection in the context of street infrastructure that is typical of Indian cities. The initial prototypes have been built but a usable solution is likely only by 2021. The video stream from the camera is processed using multiple streams to detect


#### *5.5.3 NAVI*

Navigation Assistant for Visually Impaired (NAVI) is a mobile app based solution that can help in outdoor as well as indoor mobility. Mr. Piyush Chanana, a senior scientist in ASSISTECH, understood the challenges of VI persons in independent mobility by interacting with hundreds of blind users whom he has trained in the use of SmartCane. He has captured this in the specification and design of an app that can help visually impaired in outdoor mobility. Among other things, it involves annotation of tactile landmarks that are "visible" to a blind user [5, 9].

Currently another PhD student, Mr. Vikas Upadhyay has started working on indoor navigation. The work involves effective mapping of internal spaces, localization with limited additional infrastructure, and an appropriate user interface for visually impaired. Intent is not only to propose novel algorithms and techniques but also to install it in couple of public buildings and validate.

#### *5.5.4 Outreach Through Conferences*

This period also saw our intent to outreach and create forums for all Assistive Technology stakeholders to come together. Initially we organized two Indo-US workshops with participation of 50+ persons in this space. This was followed by two major assistive technology conferences in 2018 and 2019. EMPOWER 20184 and EMPOWER 20195 brought 250<sup>+</sup> AT researchers, users, innovators and entrepreneurs, educators, exhibitors, etc. on one platform and have been a major success.

<sup>4</sup>http://assistech.iitd.ac.in/empower2018/.

<sup>5</sup>http://assistech.iitd.ac.in/empower2019/.

#### *5.5.5 Major Recognitions*

This period also saw numerous awards being conferred for ASSISTECH activities. A spotlight talk in London as part of the Grand Challenges6 meeting organized by Gates foundation and Wellcome Trust in London on 24th Oct 2016 was the first major recognition of ASSISTECH activities. Major Indian awards included NCPEDP-Mphasis Universal Design award and three national awards. ACM recognized our contribution through the ACM Eugene L Lawler7 award for Humanitarian Contributions within Computer Science and Informatics at their annual awards function in San Francisco on 15th June 2019.

#### **5.6 Conclusion**

Clearly my venturing into Assistive Technology from Embedded Systems/EDA space has been an accident and thus I titled this paper as an accidental journey. The work has been hugely satisfying primarily because of the impact it potentially has on the lives of visually impaired people. The feedback we frequently get from our numerous users on how our devices and solutions have positively affected their lives is sufficient to drive and inspire us. Over a period ASSISTECH design philosophy has become completely user-centric. In the initial years we used to choose the problems to look at based on our own experience and expertise. Now if we learn of a major challenge in the visually impaired community and then scout around and try to put up a team by collaborating with people with the required expertise. Our motto is to touch a million people by 2022.

#### **References**


<sup>6</sup>https://grandchallenges.org/video/smartcane-m-balakrishnan-india.

<sup>7</sup>https://awards.acm.org/lawler.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 6 Reflecting on Self-Aware Systems-on-Chip**

**Bryan Donyanavard, Tiago Mück, Kasra Moazzemi, Biswadip Maity, Caio Batista de Melo, Kenneth Stewart, Saehanseul Yi, Amir M. Rahmani, and Nikil Dutt**

#### **6.1 Introduction to Self-Aware Systems-on-Chip**

We are seeing an increasing number of complex cyber-physical systems (CPS) deployed for various applications, such as road-traffic control involving communicating autonomous cars and infrastructure, or smart grids controlling energy delivery down to the individual device. These distributed applications follow common design objectives, such as energy-efficiency, and require guarantees for high availability, real time or safety. In this context, autonomy is crucial: multiple system goals varying over time need to be adaptively managed and objectives holistically coordinated. By empowering future CPS with self-awareness, these systems promise to dynamically adapt, learn, and manage unforeseen changes [6].

#### *6.1.1 Computational Self-Awareness*

Computational self-awareness is the ability of a computing system to recognize its own state, possible actions, and the result of these actions on itself, its operational goals, and its environment, thereby empowering the system to become autonomous [6]. Computational self-awareness in itself is not a new field, but

University of California, Irvine, CA, USA

B. Donyanavard (-) · T. Mück · K. Moazzemi · B. Maity · C. B. de Melo · K. Stewart · S. Yi · N. Dutt

e-mail: bdonyana@uci.edu; tmuck@uci.edu; moazzemi@uci.edu; maityb@uci.edu; cbatista@uci.edu; kennetms@uci.edu; saehansy@uci.edu; dutt@uci.edu

A. M. Rahmani Technische Universität Wien, Vienna, Austria e-mail: amirr1@uci.edu

rather a unification of subjects studied disjointly in various fields including control systems, artificial intelligence, autonomic computing, software engineering, among others, and how such research can be applied toward building computer systems with varying degrees of self-awareness in order to accomplish a task [8].

#### *6.1.2 Cyber-Physical Systems-on-Chip*

Battery-powered devices are the most ubiquitous computers in the world. Users of battery-powered devices expect support for various high-performance applications running on same device, potentially at the same time. Applications range from interactive maps and navigation, to web browsers and email clients. In order to meet performance demands by users utilizing complex workloads, increasingly powerful hardware platforms are being deployed in battery-powered devices. Systems-onchip (SoCs) can integrate hundreds of heterogeneous cores and uncore components on a single chip. Such systems are constrained by a limited amount of shared system resources (e.g., power, interconnects). Simultaneously, the systems are expected to support workloads with diverse characteristics and demands that may conflict with system constraints. These platforms include a number of configurable knobs throughout the system stack and with different scope that allow for a trade-off between power and performance, e.g., dynamic voltage and frequency scaling (DVFS), power gating, idle cycle injection. These knobs can be set and modified at runtime based on the workload demands and system constraints. Heterogeneous many-core processors (HMPs) have extended this principle of dynamic powerperformance trade-offs by incorporating single-ISA, architecturally differentiated cores on a single processor, with each of the cores containing a number of independent trade-off knobs. All of these configurable knobs allow for a huge range of potential trade-off. With such a large number of possible configurations, SoCs require intelligent runtime management in order to achieve system goals for complex workloads. Additionally, the knobs may be interdependent, so the decisions must be coordinated.

Cyber-physical systems-on-chip (CPSoC) [21] provide an infrastructure for system introspection and reflective behavior, which is the foundation for computational self-awareness. Figure 6.1 shows the infrastructure of a sensor-actuator rich platform, integrated with decision-making entities that observe system state through virtual and physical sensors at various layers in order to set the system configuration through actuators. The actuations are determined by policies that enforce the overall application goals while considering system constraints. Such an infrastructure can deploy *re*active policies through the traditional *Observe*, *Decide*, and *Act* (ODA) feedback loop, as well as *pro*active policies through the augmented *self-aware* feedback loop. Figure 6.2 shows how the traditional ODA loop is augmented with reflection to provide self-aware adaptation. In this chapter we explore the

**Fig. 6.1** CPSoC infrastructure: sensors and actuators throughout the system stack, with support for adaptive policies that enforce a given goal (from [3])

**Fig. 6.2** Self-aware feedback loop. Policies are deployed to make action decisions toward achieving a goal by controlling the CPSoC based on observations and self-aware adaptation

use of computational self-awareness to address challenges of adaptive resource management in cyber-physical systems-on-chip.<sup>1</sup>

<sup>1</sup>Throughout the remainder of this chapter we use SoC as an umbrella term that includes CPSoC.

#### **6.2 Reflective System Models**

Traditionally, resource managers deploy an ODA feedback loop (lower half (in black) of Fig. 6.3) to manage systems at runtime. However, recent works [1, 27] have shown that a runtime model of the system can better manage the unpredictable nature of workloads.

Reflection can be defined as *the capability of a system to reason about itself and act upon this information* [26]. A reflective system can achieve this by maintaining a representation of itself (i.e., a self-model) within the underlying system, which is used for reasoning. Reflection is a key property of self-awareness. Reflection enables decisions to be made based on both *past* observations, as well as *predictions* made from past observations. Reflection and prediction involve two types of models: (1) a self-model of the subsystem(s) under control, and (2) models of other policies that may impact the decision-making process. Predictions consider *future* actions, or events that may occur before the next decision, enabling "what-if" exploration of alternatives. Such actions may be triggered by other policies invoked more frequently than the decision loop. The top half of Fig. 6.3 (in blue) shows prediction enabled through reflection that can be utilized in the decision-making process of a feedback loop. The main goal of the predictive model is to estimate system behavior based on potential actuation decisions as well as system dynamics.

#### *6.2.1 Middleware for Reflective Decision-Making*

The increasing heterogeneity in a platform's resource types and the interactions between resources pose challenges for coordinated model-based decision-making in the face of dynamic workloads. Self-awareness properties address these challenges for emerging SoC platforms through reflective resource managers. Reflective resource managers build a model of the system which represents the software organization or the architecture of the target platform. Resource managers can use reflective models to anticipate the effects of changing the system configuration

**Fig. 6.3** Feedback loop overview. The bottom part of the figure represents a simple observe– decide–act loop. The top part (in blue) adds the reflection mechanism to this loop, enabling predictions for smart decision-making

**Fig. 6.4** MARS framework overview from [10]. Different layers of the system stack coordinate through policies to orchestrate the management of resources: sensors inform policies of the system state; policies coordinate with models to perform reflective queries, and make resource management decisions; policies set actuators to enact changes on the system

at runtime. However, with SoC computing platform architectures evolving rapidly, porting the self-aware decision logic across different hardware platforms is challenging, requiring resource managers to update their models and platform-specific interfaces. To address this problem, we propose MARS (Middleware for Adaptive and Reflective Systems), a cross-layer and multi-platform framework that allows users to easily create resource managers by composing system models and resource management policies in a flexible and coordinated manner.

Figure 6.4 shows an overview of the MARS framework (shaded), with *Sensors* and *Actuators* interfacing across multiple layers of the system stack: *Applications*, *Linux kernel*, and *HW Platform*. The components of MARS are explained next.


MARS is implemented in the C++ language following an object-oriented paradigm and works on hardware (e.g., Odroid-XU3, Nvidia Jetson TX2), simulated (e.g., gem5), and trace-based offline [11] platforms. The framework is open source and available online.<sup>2</sup> While the current version of MARS targets energy-efficient heterogeneous SoCs, we believe the MARS framework can be ported to a wider range of systems (e.g., webservers, high-performance clusters) to support self-aware resource management.

#### **6.3 Managing Energy-Efficient Chip Multiprocessors**

Dynamic resource management for HMPs is a well-known challenge: integration of hundreds of cores running various workloads with conflicting constraints increases the pressure on limited shared system resources. A promising and well-established approach is the use of control-theoretic solutions based on rigorous mathematical formalisms that can provide bounds and guarantees for system resource management. In this context, we discuss efforts that deploy control-theoretic-centric runtime resource management of HMPs, from simple Single Input Single Output (SISO) controllers to more complex Supervisory Control Theory (SCT) methods.

#### *6.3.1 Single Input Single Output Controllers*

Conventional control theory methods proposed for resource management use Single Input Single Output (SISO) controllers for the ease in deployment and the guarantees they provide in tracking the target output. These SISO controllers use Proportional Integral (PI), Proportional Integral Derivative (PID), or lead-lag implementations [22]. Figure 6.5 depicts a first-order feedback SISO controller which can be deployed either as a PI or a PID controller. The error *e* is the input to the controller. Note that to compute the current control input *u*, the controller

<sup>2</sup>Code repository at https://github.com/duttresearchgroup/MARS.

**Fig. 6.5** *Single Input Single Output* (SISO) feedback loop

needs to have the current value of the error *e* along with the past value of the error and the past value of the control input. It is this memory inherent in the controller that makes it dynamic.

#### *6.3.2 Multiple Input Multiple Output Controllers*

Modern HMPs execute diverse set of workloads with varying resource demands, which sometimes exhibit conflicting constraints. In this context, the use of SISO controllers might not be effective as multiple system goals varying over time need to be managed in a coordinated and holistic manner. Multiple Input Multiple Output (MIMO) control theory is able to coordinate and prioritize multiple design goals and actions. MIMO controllers have proven effective for coordinating management of multiple goals in unicore processors [17] and HMPs [12].

#### *6.3.3 Adaptive Control Methods*

Ideally, control-theoretic solutions should provide formal guarantees, be simple enough for runtime implementation, and handle nonlinear system behavior. Static linear feedback controllers such as SISO and MIMO can provide robustness and stability guarantees with simple implementations, while adaptive controllers modify the control law at runtime to adapt to the discrepancies between the expected and the actual system behavior. However, modifying the controller at runtime is a costly operation that also invalidates the formal guarantees provided at design time. In order to be able to take predicted responsive actions against nonlinear behavior of the computer systems, a well-established and lightweight adaptive control-theoretic technique called *Gain Scheduling* can be used. This method is used for dynamic power management in chip multiprocessors in [2].

#### *6.3.4 Hierarchical Controllers*

Supervisory Control Theory (SCT) [19, 30] provides formal and systematic supervision of classical MIMO/SISO controllers. SCT uses modular decomposition of control problems to manage their complexity. Specifically, supervisory control has two key properties: (1) rapid adaptation in response to abrupt changes in management policy and (2) low computational complexity by computing control parameters for different policies offline. New policies and their corresponding parameters can be added to the supervisor on demand. Therefore, SCT is suitable for resource management problems (such as managing power, energy, and qualityof-service metrics) that can be modeled using logic and discrete system dynamics.

Figure 6.6 depicts a high-level view of supervisory control for HMP resource management. Either the user or the system software may specify *Variable Goals and Policies*. The *Supervisory Controller* aims to meet system goals by managing the low-level controllers. High-level decisions are made based on the feedback given by the *High-level Plant Model*, which provides an abstraction of the entire system. Various types of *Classic Controllers*, such as PID or state-space controllers, can be used to implement each low-level controller based on the target of each subsystem. The flexibility to incorporate any pre-verified off-the-shelf controllers without the need for system-wide verification is essential for the modularity of this approach. The supervisor provides parameters such as output references or gain values to each low-level controller during runtime according to the system policy. Low-level controller subsystems update the high-level model to maintain global system state,

**Fig. 6.6** High-level view of Supervisory Control Theory

and potentially trigger the supervisory controller to take action. The high-level model can be designed in various fashions (e.g., rule-based or estimator-based) to track the system state and provide the supervisor with guidelines. Supervisory control provides the opportunity to benefit from both classical control-theoretic methods and heuristics in a robust fashion. The SCT hierarchy in Fig. 6.6 is successfully used to manage quality-of-service (QoS) goals within a power budget on an HMP in [18].

#### **6.4 Heterogeneous Mobile Governors: Energy-Efficient Mobile System-on-a-Chip**

Mobile games stress modern SoCs by utilizing heterogeneous processing elements, CPUs and GPUs, concurrently. However, the utilization of each processing element may vary between games. Performance of these games that usually is measured in frames per second (FPS) can highly depend on the operating frequency of compute units. However, conventional DVFS governors conservatively choose high frequencies without considering the utilization pattern of the games [16]. In order to meet a performance goal while conserving energy, the frequency of each processing element should be as low as possible without an observable effect on the FPS.

#### *6.4.1 Sensors to Capture Dynamism*

To coordinate frequency configuration decisions, a cooperative CPU-GPU DVFS strategy, Co-Cap [14], limits the maximum frequency of CPUs and GPUs on a game-specific basis. Based on the utilization of each processing element, games are classified as one of the following classes: (1) No CPU-GPU Dominant; (2) CPU Dominant; (3) GPU Dominant; and (4) CPU-GPU Dominant. Figure 6.7 shows the classes and gives an example of each class. To determine a maximum frequency for each game class, Co-Cap implements a frame rate sensor, which is affected by both CPU and GPU frequencies. By limiting maximum frequencies for each game class, Co-Cap reduces energy consumption without observable performance degradation.

The assumption in Co-Cap is that games can only belong to one of the classes. However, some games might change their dynamic behavior throughout their life cycle. To proactively respond to the dynamic CPU and GPU frequency requirements of games, a DVFS governor policy requires more information about a game's workload dynamism. A Hierarchical Finite State Machine (HFSM) based CPU-GPU governor, HiCAP [13], models the dynamic behavior of mobile gaming workloads and applies a cooperative, dynamic CPU-GPU frequency-capping policy to conserve energy by adapting to a game's inherent dynamism. Using the HFSM, a DVFS governor can predict the next workload feature for a certain window


**Fig. 6.7** Classification of CPU-GPU workloads observed in mobile games from [14]

at a game's runtime. Through this added self-awareness, HiCAP reduces energy consumption even further than Co-Cap.

Further dynamism exists in a game's memory access patterns. Some scenes in mobile games read more graphics data than others, resulting in increased memory utilization. This may slow down the CPU portion of the game, but on the other hand when memory utilization is low, it may run faster than originally predicted by a conventional DVFS governor. A conventional DVFS governor cannot detect these memory utilization changes by sensing utilization, causing prediction errors to increase. MEMCOP, a Memory-aware Cooperative Power Management Governor for Mobile games [5], senses the number of last level cache misses to monitor the memory pressure of the system in addition to CPU, and GPU memory utilization. This prevents the CPU DVFS governor from increasing frequency due to inaccurate predictions caused by variation in memory access time.

#### *6.4.2 Toward Self-Aware Governors*

Co-Cap, HiCap, and MEMCOP DVFS policies are each steps toward a self-aware DVFS governor policy for heterogeneous SoCs. Each policy monitors system's state using novel sensors, and defines runtime prediction rules to reflect and adapt to changes in mobile game behavior. However, the predictive models are generated statistically at design time, and remain the same during the execution. Moreover, as the predictive model becomes more complex, prediction errors increase due to the assumption of a linear relationship between the model's input and output. ML-Gov, a machine learning enhanced integrated CPU-GPU governor [15], tries to address these issues by applying machine learning algorithms. This method does not require rule tuning at design-time. ML-Gov's machine learning algorithm helps to exploit nonlinear characteristics between frequency and performance. ML-Gov currently builds the model offline, but through enhanced self-awareness via online updates of the reflective model, could adapt to previously unknown games and classes.

#### **6.5 Adaptive Memory: Managing Runtime Variability**

Heterogeneous processing elements on mobile SoCs share limited memory resources, leading to memory contention and stalled processes waiting for data. This performance degradation is exacerbated by the Von Neumann bottleneck, a prevalent problem in modern day computer systems. Data transfer speeds in memory have not been able to keep up with the performance gains of processors exemplified by Moore's law. However, with the end of Moore's law on the horizon there is an ever increasing need to alleviate the Von Neumann bottleneck to increase the performance of computer systems. There have been various approaches over the years to address the Von Neumann bottleneck such as putting critical memory in an easily accessible cache [25] and recently in an easily accessible Software-Programmable Memory (SPM), also known as a scratchpad, using multi-threading [9], and exploiting cache-coherency [7]. We address the bottleneck by providing self-awareness with respect to memory resource utilization.

#### *6.5.1 Sharing Distributed Memory Space*

Software-Programmable Memories are a promising alternative to hardwaremanaged caches in embedded systems. However, traditional approaches for managing SPMs do not support sharing of distributed memory resources, missing the opportunity to utilize those memory resources. Employing operating-systemlevel awareness of SPM utilization, memory resources can be shared by allowing threads to opportunistically exploit the entire memory space for unpredictable application workloads. Best-effort policies can be used to maximize the usage of on-chip SPMs. The policies can be supported by hardware via distributed memory management units (MMUs), an on-chip component that can be used to exchange information between the NoC and an MMU's local SPM. Sharing distributed SPM space reduces memory contention, resulting in reduced memory latency by reducing off-chip memory accesses by about 14%. The off-chip access reduction decreases average execution time by about 19.5%, which in turn reduces energy consumption [23, 29]. More intelligent policies that explore a mixed SPM/cache hierarchy for many-core embedded systems can yield further improvements.

#### *6.5.2 Memory Phase Awareness*

Modern mobile devices use multi-core platforms that allow for concurrent execution of multimedia and non-multimedia applications that enter and exit at unpredictable times. Each application also has variable memory demands during these unpredictable times. By being aware of the periodic patterns, or *phasic behavior*, of an application's memory usage (memory phases), a system's on-chip memory can be more efficiency utilized. Memory phases can be identified from memory usage information extracted on an application basis, and can be used to prioritize different memory pages in a multi-core platform without having any prior knowledge about running applications. The identification process can be integrated into the runtime system and done online. For example, memory phases can be used for effective sharing of distributed SPMs for multi-core platforms to reduce memory access latency and contention. Experiments on workloads with varying intra- and interapplication memory-intensity show that using phase detection schemes can reduce memory access latency up to 45% for configurations up to 16 cores [28]. Ongoing work investigates more aggressive use of memory phasic behavior in many-core architectures with hundreds of cores.

#### *6.5.3 Quality-Configurable Memory*

We have established how self-awareness can be achieved through formal control theory. Figure 6.8 shows a closed feedback control loop with a quality monitor that can measure memory utilization and processor usage with respect to a QoS goal to fit the runtime requirements of applications. The quality monitor gives a quality score and sends the collected data to a high-level controller. The controller reflects on the data, then tunes knobs to adapt the memory utilization and processor usage to minimize the error between the current quality and the quality-of-service goal. The self-aware approach enables dynamic convergence toward dynamic memory utilization and quality targets for unpredictable workloads. While current

**Fig. 6.8** Control loop implementing a self-aware approach to memory utilization optimization. The controller optimizes memory knobs to improve application performance

results indicate that a self-aware memory controller outperforms a manual quality configuration scheme, there is much work to be done with to analyze energy tradeoffs when using a self-aware memory controller, and whether a MIMO controller could be more effective for resource management in many-core systems with the self-aware approach [24].

#### **6.6 What's Ahead?**

Self-awareness enables a system to observe its context and make changes to optimize its execution at runtime. For instance, it is possible to allow a system to tune its execution to optimize power consumption. Through observing how it has reacted to past changes in certain conditions, the system can learn what the impact on the overall execution and power consumption was, and if a different adaptation would be more appropriate in the future. To further explore such opportunities in computing systems, we shift our focus to a new project: the Information Processing Factory (IPF). IPF is a step toward autonomous many-core platforms in cyberphysical systems (CPS) and the Internet of Things (IoT). It represents a paradigm shift in platform design, with robust and independent platform operation in the focus of platform-centric design rather than existing semiconductor device or software technology, as mostly seen today [4].

We use the metaphor of an Information Processing Factory to draw similarities between microelectronics systems and factories as follows: in a factory, all components must adapt to the current workload [20]. Additionally, this adaptation cannot be done offline and must instead be done in real time without interrupting the baseline operations. Future microelectronic systems (e.g., MPSoCs) should operate in a similar manner.

Clusters of component-specific, uncorrelated control occurrences cannot handle operations of large scale systems with multi-criteria objective functions. Similarly, a centralized controller model is also inadequate in this case because it cannot scale. The goal of the IPF project is to demonstrate that a hybrid hierarchical approach, sporting as much modularity as possible and as much centralized as necessary, is a much more effective means of achieving the desired goal while maintaining cost efficiency, low overhead, and scalability.

Figure 6.9 depicts how we envision the platform to be structured. Information provided by sensors is gathered and merged into self-organizing, selfaware (SO/SA) control processing instances across different hardware/software abstraction layers comprising an MPSoC-based CPS system. The SO/SA instances generate actuation directives affecting the MPSoC system components at same or lower levels of abstraction. The SO/SA paradigm is not limited in scope to optimization of CPS operational parameters/metrics. In fact, self- and groupawareness can also enable higher level tasks such as self-protection of both the MPSoC and the overall CPS system.

#### *6.6.1 Example Use Case: Autonomous Driving*

The key innovation in automated driving as compared to driver assistance systems is the transition of decision-making from the driver to the vehicle. The application processing and communication requirements ask for platform performance, memory capacity, and communication bandwidth and latency far beyond the capabilities of current architectures. At the same time, these platforms must be highly reliable and guarantee sufficient functionality under platform errors, aging, and degradation to meet safety standards. That is, platforms and their components must be failoperational, i.e., must be able to continue driving, instead of fail-safe, as today.

Thus, the automated driving requirements can be mapped to corresponding requirements of an Information Processing Factory. The system must be capable of in-field integration, i.e., able to adapt to changes in the workload of both critical and non-critical (best-effort) functions. The system must find a new suitable mapping and must prevent the changes from violating the guarantees of other software components. The software must be able to detect and to adapt to transient errors in order to provide a reliable service. This requires self- diagnosis and self-healing. The system must be predictable and provide for minimum performance guarantees for all scenarios.

Allowing and exploiting dynamic system behavior through IPF can significantly improve platform performance and resource utilization. Thus, the system must be able to optimize the execution and mapping online: self-optimization. The optimization may target e.g. aging (temperature), power consumption, response time, and resource utilization.

#### **6.7 Summary**

Future cyber-physical systems will host a large number of coexisting distributed applications on hardware platforms with thousands to millions of networked components communicating over open networks. These distributed applications will include both critical and best-effort tasks, may be subject to permanent change, environment dynamics and application interference. Using wisdom gathered from our initial exploration into self-aware SoCs, we introduce a new Information Processing Factory paradigm to manage current and future cyber-physical systems.

**Acknowledgments** The authors would like to acknowledge Santanu Sarma, Chen-Ying Hsieh, JurnGyu Park, Majid Shoushtari, and Hossein Tajik for their research contributions. We acknowledge financial support by NSF grant CCF-1704859, and the Marie Curie Actions of the European Union's H2020 Program.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 7 Pushing the Limits of Parallel Discrete Event Simulation for SystemC**

**Rainer Dömer, Zhongqi Cheng, Daniel Mendoza, and Emad Arasteh**

#### **7.1 Introduction**

The IEEE standard SystemC language [13] is widely used for the specification, modeling, validation, and evaluation of electronic system level (ESL) models. The Accellera Systems Initiative maintains not only the official SystemC language definition, but also provides an open source proof-of-concept library that can be used to simulate SystemC design models [1]. However, implementing the classic scheme of discrete event simulation (DES), this reference simulator runs sequentially and cannot utilize the parallel computing resources available on multi- and many-core processor hosts. This severely limits the execution speed of SystemC simulation.

In order to provide faster execution, parallel discrete event simulation (PDES) [8, 12] techniques can be applied. While significant obstacles exist specifically for the SystemC language [7], many parallel simulation approaches have been proposed [5, 11, 19, 21–24]. Beyond these synchronous PDES techniques, *out-of-order* PDES [6] is even more aggressive. By localizing the simulation time to individual threads and carefully handling events at different times, the simulator engine can issue threads in parallel and *ahead of time*, following a partial ordering without loss of accuracy. This results in better exploitation of the available parallelism and thus maximum simulation speed.

The *Recoding Infrastructure for SystemC (RISC)* project described in this paper implements out-of-order PDES for the IEEE SystemC language as open source. Specifically, RISC provides a dedicated SystemC compiler and corresponding outof-order parallel simulator [2, 8, 16]. Compared to the other approaches, RISC automatically analyzes the SystemC source code, identifies all potential race condi-

R. Dömer (-) · Z. Cheng · D. Mendoza · E. Arasteh

Center for Embedded and Cyber-Physical Systems, University of California, Irvine, CA, USA e-mail: doemer@uci.edu

J.-J. Chen (ed.), *A Journey of Embedded and Cyber-Physical Systems*, https://doi.org/10.1007/978-3-030-47487-4\_7

tions, and then instruments the model to prevent any conflicts. This transformation does not require any manual recoding or application-specific knowledge.

We share our RISC proof-of-concept implementation with the EDA community as an open source software project in order to facilitate evaluation, promote parallel SystemC simulation, and achieve fruitful collaboration [3, 4].

#### **7.2 RISC Framework**

While the RISC software framework may be used for many other analysis and transformation tasks on SystemC models, parallel simulation is the main purpose. To perform semantics-compliant parallel simulation with out-of-order scheduling, we introduce a dedicated SystemC compiler that works hand in hand with a new simulator. This is in contrast to the traditional SystemC simulation flow where a SystemC-agnostic C++ compiler includes the SystemC headers and links the design model directly against the Accellera reference library.

As shown in Fig. 7.1, the RISC compiler acts as a frontend that processes the input model and generates an intermediate model with special instrumentation for conflict-free parallel execution. The instrumented model is then linked against the extended RISC SystemC library by the target compiler (a regular C++ compiler, such as GNU gcc or Intel icpc) in order to produce the output executable model. Out-of-order parallel simulation is then performed simply by running the generated executable model.

From the user perspective, we simply replace the regular C++ compiler with the SystemC-aware RISC compiler (which in turn calls the underlying C++ compiler). Otherwise, the overall SystemC validation flow remains the same as the traditional tool flow. Simulation is just faster due to the parallel execution. Note also that this process is fully automated. No user interaction or manual code transformation is necessary.

**Fig. 7.1** RISC tool flow for out-of-order parallel simulation of SystemC models [16]

#### *7.2.1 RISC Compiler*

In order to produce a safe parallel model, the RISC compiler performs three major tasks, namely segment graph construction, conflict analysis, and finally source code instrumentation.

#### **7.2.1.1 Segment Graph Construction**

A segment graph (SG) [6] is a directed graph that represents the source code segments executed during the simulation between scheduling steps. More specifically, every segment is associated with a corresponding scheduler entry point, namely a wait statement in SystemC. All other statements in the SystemC source code become part of those segment nodes where they are executed when the wait statement resumes its execution.

The segment graph construction is a fully automatic but complex process which we will not describe here (see [6] for detailed coverage). However, the RISC compiler must parse the SystemC input model first into an Abstract Syntax Tree (AST). Since SystemC is a syntactically regular C++ code, RISC relies here on the ROSE compiler infrastructure [18]. The ROSE internal representation (IR) provides RISC with a powerful C/C++ compiler foundation that supports AST generation, traversal, analysis, and transformation.

As illustrated with the RISC software stack shown in Fig. 7.2, the RISC compiler then builds a SystemC IR on top of the ROSE IR which accurately reflects the SystemC structures, including the module and channel hierarchy, port connectivity, and other SystemC-specific constructs. On top of the SystemC IR, the compiler architecture then builds the Segment Graph generator and data structures, as well as all other RISC analysis and transformation functions.

#### **7.2.1.2 Conflict Analysis**

The segment graph data structure serves as the foundation for segment conflict analysis. At run time, the scheduler in the simulator must ensure that every parallel thread to be issued has no conflicts with any other threads currently in the *READY*

**Fig. 7.2** Software stack of the RISC compiler [8] RISC

and *RUN* queues. For this we use the RISC compiler to detect any possible conflicts between these threads already at compile time.

Potential conflicts in SystemC include data hazards, event hazards, and timing hazards, all of which may exist among the segments executed by the threads considered for parallel execution. Again, we refer to [6] for a detailed discussion of these hazards and their static or dynamic detection in RISC. However, we note that if the hazards would be ignored, this would lead to race conditions at run time and jeopardize the correctness of the SystemC simulation.

#### **7.2.1.3 Source Code Instrumentation**

As a result of the conflict analysis, the RISC compiler generates a set of conflict and timing tables that describe all possible hazards between any two threads. Using this conservative conflict information, the simulator can then at run time quickly determine by a simple table look-up whether or not it is safe to issue a given thread in parallel or ahead of time.

As shown above in Fig. 7.1, the RISC compiler and simulator work closely together. The compiler performs conservative conflict analysis and passes the analysis results to the simulator which then can make safe scheduling decisions quickly.

To pass information from the compiler to the simulator, we use automatic source code instrumentation. That is, the intermediate model generated by the compiler contains instrumented (automatically generated) code which the simulator can then safely rely on.

At the same time, the RISC compiler also instruments the SystemC wait statements with corresponding segment ID and furnishes user-defined channels with automatic protection against race conditions among communicating threads.

#### *7.2.2 RISC Simulator*

The RISC simulator supports out-of-order discrete event simulation (OoO PDES) [6] for fast SystemC simulation. In OoO PDES, we break the strict order of time (the synchronous barrier) by localizing time stamps to each thread. Since each thread has its own time stamp, the OoO PDES scheduler relaxes the event and simulation time updates, allowing more threads (at different simulation cycles) to run in parallel and ahead of time. This results in a higher degree of parallelism and thus higher simulation speed. We are using advanced static compile-time analysis to identify all such potential conflicts. Based on this information (a simple table look-up is sufficient), the OoO PDES scheduler can then at run time quickly decide whether or not a set of threads has any conflicts with each other.

**Fig. 7.3** Module hierarchy visualization of a SystemC model of a Canny edge detector [17]

#### *7.2.3 RISC Analysis and Transformation Tools*

As an example of other SystemC analysis tools built on top of RISC, visual [17] enables the user to visualize the SystemC module hierarchy. It supports a graphical user interface implemented with the Gtk API and renders a specified SystemC source file's module hierarchy, which is drawn using the Cairo API. The tool obtains module data from the SystemC IR in the RISC software stack which contains information about nested modules and thus can recursively iterate through nested lists of child modules in order to obtain enough information to visualize the hierarchy of the entire SystemC source file. The input SystemC source file may contain thousands of lines of code which can make manually drawing a representation of the modules, ports, and channels described by the code a difficult and time-consuming task. Thus the visual tool was created to address this issue. It can automatically generate a visual representation of a SystemC model in a very short period of time. Figure 7.3 shows the module visualization of a Canny edge detector application.

#### **7.3 Experiments**

We will now evaluate the performance of the RISC simulator. The following experiments show the speedup on an Intel Xeon PhiTM Coprocessor 5110P manycore architecture. The coprocessor contains 60 cores where each core has a vectorization unit of 512 bit. To obtain unambiguous measurements, we turn CPU frequency scaling off for all experiments.

#### *7.3.1 Mandelbrot Renderer*

The Mandelbrot renderer is a parallel video application to compute the Mandelbrot set. Basically, the device under test (DUT) hosts a number of renderer units. Each

**Fig. 7.4** Speedup of the Mandelbrot Renderer [20]

unit computes a different slice of the Mandelbrot image. At compile time, the user defines how many slices are available.

Figure 7.4 shows the simulation results [20]. Due to the minimal communication needs in this application, highest speedups are reached. The vectorization unit with 512 bit can execute up to eight double-precision floating-point operations in parallel. A speedup *M* of 6.9x is achieved. The thread-level parallelization increases strongly on the 60 cores with a speedup *N* of 50x. Afterwards, the speed slows down due to the 60 physical cores and use of hyper-threads. Notably, the combination of the thread and data level parallelization *N* × *M* generates a speedup of up to 212x.

#### **7.4 RISC Open Source Project**

We make the Recoding Infrastructure for SystemC (RISC) described in this article freely available online as a software artifact [9]. Generally, an artifact is a software program together with an applicable data set and test suite that accompanies a research publication for the purpose of independent evaluation.1 The point here is that the proposed algorithms and data structures are made available as proof-of-concept implementation and can be used and evaluated by others. Experimental results may be replicated and validated. The proposed approach can also be compared against related work and in the presence of source code even be extended. Otherwise, great challenges are posed in repeatability [15].

<sup>1</sup>Because of its importance, artifact evaluation has been adopted as integral part of the review process in several computer science areas, such as Software Engineering and Programming Languages [10, 14].

Specifically, the presented RISC compiler and simulator are available as open source on the web [2] and can be used without restrictions (BSD license terms). RISC can be downloaded in both source code and binary format.

#### *7.4.1 Open Source Code and Documentation*

In its current version [4], the RISC open source package consists of approximately 162,000 lines of code and includes the C++ source code for the RISC compiler and simulator, Linux build scripts and installation instructions, as well as comprehensive documentation of the compiler and simulator APIs and tool manual pages. Example SystemC models, such as an abstract DVD player and the Mandelbrot renderer, and a regression test bench are included as well.

Given a suitable Linux platform,<sup>2</sup> the RISC source code package can be easily installed and then tested. After downloading and adjusting the installation Makefile, a simple make all command builds and installs the RISC framework and runs several demo examples. The user can then fully evaluate the software with other SystemC examples and even extend our proof-of-concept implementation with new features.

#### *7.4.2 Binary Image for "Plug-and-Play" Evaluation*

For a quick test run without compilation and installation, we also provide a Docker container [3] for using RISC in "plug-and-play" fashion. The Docker image contains RISC (and all needed libraries) in binary format and allows the user to test it with just a few Linux commands, as shown in Fig. 7.5.

**Fig. 7.5** Linux commands to use RISC in a Docker container

<sup>2</sup>Red Hat Enterprise and CentOS Linux version 6 and 7 are verified to work.

#### **7.5 Conclusion**

The Recoding Infrastructure for SystemC (RISC) provides an automatic compilerbased framework to analyze and simulate IEEE SystemC models in parallel. In particular, we have introduced the RISC compiler and simulator. Using automatic conflict analysis based on segment graph (SG) abstraction, OoO PDES can execute threads safely in parallel and out-of-order (ahead of time) and thus achieves fastest simulation speed but nevertheless maintains the classic SystemC modeling semantics. In order to foster collaboration in the EDA community, we provide the RISC framework as a free open source artifact for full evaluation and possible extension.

For the future, we intend to expand our open source efforts and hope to involve other members of the EDA community to use, evaluate, and extend the RISC framework.

**Acknowledgments** This work has been supported in part by substantial funding from Intel Corporation for two projects titled "Out-of-Order Parallel Simulation of SystemC Virtual Platforms on Many-Core Architectures" and "Scaling the Recoding Infrastructure for Parallel SystemC Simulation." The authors thank Intel Corporation for the valuable support.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 8 Impact of Negative Capacitance Field-Effect Transistor (NCFET) on Many-Core Systems**

#### **Hussam Amrouch, Martin Rapp, Sami Salamin, and Jörg Henkel**

#### **8.1 Introduction**

More than a decade ago, the semiconductor technology had entered the so-called nano-CMOS era, in which the transistor's feature sizes became below 90 nm. Since then, the prior trend of voltage scaling came to an end leading to the discontinuation of Dennard's scaling [7]. In Dennard's scaling, both the dimensions of transistor and the operating voltage are typically scaled by the same factor in order to ensure a constant electric field. Due to the non-scalable voltage, ever-increasing power densities in chips became a substantial obstacle for technology scaling due to the limited ability of existing cooling solutions to dissipate the generated heat [8]. To overcome this fundamental problem, the maximum frequency of processors had stopped increasing with every new generation in order to keep the on-chip power densities under acceptable levels and since 2005 the era of many-core processors had started.

To understand the inability of technology to scale voltage, we need to understand what determines the speed of a processor. As a matter of fact, the drive current (ON current) of a transistor dictates its switching speed and hence it ultimately determines the maximum delay of logic paths that form the processor's netlist. The ON current of a transistor is proportional to *(VDD* − *VT )*, where *VT* denotes the threshold voltage of transistor and *VDD* denotes the operating voltage. In order to maintain the same level of current, while *VDD* is scaled down, *VT* must also be reduced by almost the same amount. However, reducing *VT* comes with an exponential increase in the leakage current (OFF current) of transistor. This is primarily because that the sub-threshold swing of transistor is fundamentally

H. Amrouch · M. Rapp · S. Salamin · J. Henkel (-)

107

Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany e-mail: amrouch@kit.edu; henkel@kit.edu

J.-J. Chen (ed.), *A Journey of Embedded and Cyber-Physical Systems*, https://doi.org/10.1007/978-3-030-47487-4\_8

limited to 60 mV/decade at room temperature akin to "Boltzmann tyranny" [21]. Such a fundamental limit inevitably restricts the minimum possible *VT* to be at least 300 mV. To ensure a reliable operation, different kinds of safety margins need to be added on top of the minimum voltage, which enforces the operating voltage to remain almost the same with every new technology generation. As abovementioned, the inability to scale voltage has led to the discontinuation of Dennard's scaling, which, in turn, had led to preventing the frequency of processors from increasing.

*In summary, the fundamental limit of sub-threshold swing of transistor is the primary reason behind not scaling voltage and it is the origin of on-chip power density problems that processor's designers are facing since more than a decade ago.*

#### *8.1.1 Negative Capacitance Field-Effect Transistor (NCFET)*

NCFET integrates a ferroelectric layer inside the gate stack of a transistor, which acts as a negative capacitance. Such a layer provides an amplification of the vertical electric field that the transistor perceives. This, in turn, allows the transistor to overcome the fundamental limitation of sub-threshold swing of 60 mV at room temperature. The principle of NCFET was first proposed in 2008 by S. Salahuddin and S. Datta [16]. After which, it very rapidly gained a large popularity due to the remarkable steep switching and high ON current of transistors [1]. Many experiments have consistently proved NCFET [10]. A breakthrough has recently occurred when GlobalFoundries demonstrated NCFET-based circuits using their state-of-theart industrial 14 nm FinFET technology [9]. This showed, for the first time, that NCFET technology has become compatible with the existing CMOS fabrication process. In fact, such a compatibility is essential for any emerging technology to be adopted by semiconductor companies. Otherwise, massive production will never be possible.

In practice, NCFET technology enables the transistor to reach the same ON current, without increasing the OFF current, but at a much lower voltage [2]. This is only possible due to steeper sub-threshold swing. Therefore, in an NCFET technology, the processor can still meet the same performance (as in the conventional FET) but at a lower operating voltage leading to a significant power saving. Beside the *low-power* usage scenario of NCFET, *high-performance* usage scenario does also exist. NCFET enables the processor to be clocked at a higher frequency (compared to the conventional FET), while it still be operated at the same voltage due to the increase in the ON current. NCFET technology comes with an important side effect in which it increases the total capacitance of transistor. Such an increase can lead to reliability problems caused by IR-drop and voltage fluctuation during circuit's operation [2, 18]. At the same time, because NCFET technology enables circuits to operate at lower voltages, it is expected that other reliability problems, related to lifetime, to become much less because all the underlying aging mechanisms, such as negative bias temperature instability (BTI) and hot-carrier injection (HCI), strongly depend on the operating voltage [20].

In the following sections, we explain how modeling the NCFET effects from physics all the way to the system level can be done. Then, we explore how a manycore system can profit from the NCFET technology. Finally, we explore the impact that NCFET has on power management schemes and how existing assumptions w.r.t voltage-leakage dependency become not valid anymore when it comes to NCFET, which creates the necessity to develop novel power management techniques.

#### **8.2 Modeling NCFET at the System Level**

In the following we provide an overview of how NCFET is modeled at the system level, i.e., for the purpose of simulating many-core processors. Fundamentally, the properties of the ferroelectric layer are modeled at the physics level [12]. Figure 8.1 presents our methodology in which we traverse all layers from physics, through device, gate, and processor level, to model NCFET at the system level. The behavior of transistors with varying thickness of the ferroelectric layer is modeled following the industrial-standard compact model (BSIM-CMG) [5, 14]. Based on this model, we created NCFET-aware cell libraries supporting four different thicknesses of the ferroelectric layer under a wide range of the operating voltage [1]. The thickness ranges from 1 nm (called TFE1) up to 4 nm (TFE4). We then implemented a single many-core tile to the GDSII level and performed timing and power signoffs. The results are explained in detail in the next section. Signoff tools allow to compare power and performance of a processor implemented in different NCFET configurations and are used to extract frequency-dependent scaling factors for dynamic and leakage power. These factors serve as an abstraction at the system level and allow to estimate the power of an NCFET-based processor if the power of

**Fig. 8.1** Modeling NCFET at the system level (many-core processors) requires to traverse the whole stack from the physics level, where the effects of the ferroelectric layer are modeled, to the system level, where performance and power of many-core processors are affected

a baseline implementation (conventional FinFET) is known. Finally, these factors are used to simulate a many-core processor (further details in Sect. 8.2.2).

#### *8.2.1 Processor-Level Investigation*

This section shows how NCFET affects the performance and power of a single processor. The insights gained from this evaluation are important to build systemlevel NCFET models and explain observations from system-level simulations. We implemented the layout (GDSII level) of a single tile of the *OpenPiton* manycore [3], which contains a CPU, caches, and a NoC router. Power and timing signoffs are performed for different NCFET configurations (TFE1 to TFE4) and different operating voltages. Further details of the experimental setup can be found in [15].

Figure 8.2a shows how NCFET increases the performance of a processor. It allows to clock a processor at a higher frequency at the same operating voltage or allows to reduce the voltage while still maintaining the same performance (frequency). This is due to the inherent voltage amplification provided by the additional

**Fig. 8.2** (**a**) NCFET increases the frequency of a processor at a certain operating voltage, but (**b**) also increases the dynamic power consumption due to the increase in the transistor gate capacitance and frequency. (**c**) While leakage increases almost linearly with the operating voltage with conventional FinFET (baseline), this dependency gets weaker with a thin ferroelectric layer and even reverses with TFE4 due to a negative DIBL effect

ferroelectric layer. Like explained earlier, the ferroelectric layer increases the total gate capacity. Together with increased frequency, this increases the dynamic power consumption (Fig. 8.2b). The thicker the ferroelectric layer gets, the higher get the gains in the frequency, but also the higher gets the dynamic power. Figure 8.2c shows that leakage power is affected more severely. NCFET fundamentally changes the trend. With conventional FinFET (baseline), leakage power increases strongly with increasing voltage. When a thin ferroelectric layer is added (TFE1 and TFE2), this dependency becomes weaker, until at TFE3, leakage is almost independent of the voltage. With a thick ferroelectric layer (TFE4), an effect called negative drain-induced barrier lowering (negative DIBL) reverses the leakage dependency on the voltage [13]. Here, leakage increases at lower voltages. We will explain later (Sect. 8.3.3) how this necessitates developing novel power management techniques.

#### *8.2.2 Simulation of NCFET-Based Many-Core*

We use the *Sniper* many-core simulator [6] to simulate many-core processors. *McPAT* [11] is used to periodically estimate the power consumption of each core. Since *McPAT* does not support NCFET, it is used to estimate the power with conventional FinFET instead. We develop frequency-dependent scaling factors for dynamic and leakage power based on the processor-level investigation explained earlier.

Figure 8.3 shows the dynamic and leakage power of the single processor studied in the previous section depending on the *frequency*, as opposed to voltage like in Fig. 8.2. Two effects play a role for the dynamic power: NCFET technologies increase the dynamic power at a certain operating voltage (Fig. 8.2b), but also

**Fig. 8.3** (**a**) While NCFET technologies increase the dynamic power at iso-voltage, they also lower the required operating voltage at iso-frequency, which in total decreases the dynamic power at the same frequency. (**b**) NCFET technologies with a thin ferroelectric layer lower the leakage power, whereas leakage increases with a thick layer (TFE4). Most importantly, the negative DIBL effect reverses the leakage dependency, where lowering the V/f-levels increases leakage

allow to go to a lower operating voltage (Fig. 8.2a) while still maintaining the same frequency. Lowering the operating voltage has a stronger effect on the dynamic power. Consequently, NCFET technologies lower the dynamic power when operating at the same frequency (Fig. 8.3a). Figure 8.3b shows how leakage power depends on the V/f-level. The reverse leakage dependency with TFE4 strongly increases the leakage power. Below 700 MHz, TFE4 would allow to reduce the voltage below 0.2 V, which is the lower limit of the cell library.

Figure 8.3a,b allows to estimate the dynamic and leakage power consumption of a processor that is implemented in NCFET, if the power consumption in the baseline (conventional FinFET) is known. We extract frequency-dependent scaling factors for both dynamic and leakage power. These factors serve as an abstraction that allows simulation of complex benchmark applications, like *PARSEC* [4], on manycore processors with dozens of cores. We thereby scale the leakage and dynamic power that is estimated by *McPAT* to estimate the power consumption of NCFETbased many-cores. For brevity, details on this approach are omitted here and can be found in [15].

#### **8.3 Performance, Power, and Cooling Trade-Offs with NCFET-based Many-Cores**

NCFET fundamentally changes the characteristics of transistors and therefore also changes the performance and power of circuits [19], single-core processors [1], and many-core processors [15]. This section demonstrates the impact of the thickness of the ferroelectric layer on the power and performance of a many-core processor. We show that the optimal thickness depends on many factors, such as the application characteristics and the cooling scenario. This section evaluates performance, power, and cooling of a 25-core many-core operating under a thermal constraint of 80◦C. We study *PARSEC* [4] tasks with up to eight slave threads. Their characteristics range from highly memory-bound (e.g., *canneal*) to highly compute-bound (e.g., *swaptions*).

#### *8.3.1 Impact of NCFET on Performance*

Due to high power densities (failure of Dennard's scaling) and limited cooling capabilities, it is not always possible in modern technology nodes to simultaneously operate all cores at the peak V/f-levels without violating the thermal constraint. This study investigates the use-case in which cores with an active thread are operated at the peak V/f-levels and cores without a thread mapped to it are power-gated. In this use-case, four factors affect the thermally sustainable utilization (i.e., the number of cores that can be turned on): the application characteristics (power consumption), the mapping of threads to cores, the cooling system, and the transistor technology.

**Fig. 8.4** NCFET technologies increase the thermally sustainable utilization of a 25-core manycore, i.e., the number of usable cores without violating the thermal constraint, compared to the baseline (conventional FinFET). The optimal thickness of the ferroelectric layer depends on the application characteristics

We use an Integer Linear Program to obtain the thermally-optimal mapping of threads to cores, which minimizes the formation of hotspots and, thereby, maximizes the thermally sustainable utilization. We study the use-case of a passive cooling, i.e., there is no fan on top of the heat sink.

Figure 8.4 shows the thermally sustainable utilization of two benchmarks *bodytrack* and *swaptions* during the parallel section of the benchmarks (Region of Interest) for different NCFET technologies. Other benchmarks are available in [15]. *Swaptions* is a highly compute-intensive task, which results in high power consumption and therefore, the thermally sustainable utilization in the baseline is low (only 8 out of 25 cores). Dynamic power forms the major part of the total power consumption and therefore, thicker ferroelectric layers increase the thermally sustainable utilization because dynamic power is reduced (compare Fig. 8.3a). Consequently, the highest performance is observed with the thickest ferroelectric layer (TFE4). *Bodytrack* is less compute-intense and has lower dynamic power consumption and consequently lower total power. This results in a higher thermally sustainable utilization compared to *swaptions*. However, due to lower dynamic power, leakage power accounts for a larger fraction of the total power. As demonstrated in Fig. 8.3b, TFE4 increases the leakage significantly over TFE3. Consequently, TFE4 results in a lower thermally sustainable utilization than TFE3 for *bodytrack* and the highest performance is observed with TFE3.

These investigations show that *the optimal thickness of the ferroelectric layer depends on the application characteristics*. Further investigations on how NCFET affects the performance in the case that cores are not operated at the peak V/f-levels can be read in [15]. These investigations additionally study forced-convection cooling (a heat sink with a fan) and reveal that *the optimal thickness of the ferroelectric layer also depends on the cooling scenario*.

#### *8.3.2 Impact of NCFET on Cooling Requirements*

This section studies how NCFET reshapes the existing trade-off between cooling costs and achievable performance, where higher performance comes at the cost of higher power dissipation and therefore higher cooling costs. We study the use-case in which the many-core is operated at its peak performance, i.e., all cores are active at the peak V/f-levels and determine the cooling capabilities that are required to make this use-case thermally safe. The cooling capabilities are measured by the inverse of thermal resistance of the heat sink 1*/Rth*. Varying this value corresponds to changing the air convection.

Figure 8.5 shows the required cooling capabilities for the three *PARSEC* benchmarks *swaptions*, *bodytrack*, and *canneal* during the parallel section of the benchmarks (Region of Interest). NCFET technologies allow to reduce the cooling capabilities over the baseline (conventional FinFET). Most importantly, the required cooling capabilities are minimized at different thicknesses of the ferroelectric layer depending on the application. *Swaptions* is highly computeintensive and consequently, dynamic power accounts for the majority of the total power. Increasing the thickness of the ferroelectric layer reduces the dynamic power (see Fig. 8.3) and therefore reduces the required cooling. *Canneal* on the other side is highly memory-bound and therefore, the power consumption is dominated by leakage. Leakage is minimized at TFE2, which consequently minimizes the cooling requirements. *Bodytrack* shows intermediate values for the dynamic power and therefore, TFE3 is optimal. *This investigation shows again that the optimal thickness of the ferroelectric layer depends on the application characteristics and ranges from 2 nm to 4 nm.*

**Fig. 8.5** NCFET technologies decrease the required cooling capabilities while maintaining the same maximum temperature of 80◦C under full system utilization (all cores active at peak V/f-levels). The thickness of the ferroelectric layer that results in the lowest cooling costs depends on the application characteristics and ranges from 2 nm (with *canneal*) to 4 nm (with *swaptions*)

#### *8.3.3 Impact of NCFET on Power Management Techniques*

The above investigations use the well-established concept of V/f-pairs that are determined at design time by selecting the operating voltage for a given frequency as the lowest voltage that makes operating at this frequency reliable. This is a reasonable approach with conventional transistor technologies, because using a higher voltage would unnecessarily increase both dynamic and leakage power. However, this is no longer true with NCFET with a thick ferroelectric layer (TFE4). Here, increasing the voltage decreases the leakage power. This leads to new optimization potential by selecting the operating voltage for a given frequency, which is demonstrated in the next section.

#### **8.4 NCFET-Aware Voltage Scaling**

Dynamic voltage scaling (DVS) technique for processor power management is considered to be one of the most effective ways to reduce the energy consumption of an application. DVS technique typically selects the minimum operating voltage *V*min that sustains the operating frequency of the processor at runtime based on the frequency demands of the application being executed. Reducing the operating voltage, in conventional FET, results in reducing the total power consumption, which implicitly reduces both dynamic and leakage power. However, such a wellknown voltage dependency becomes inverse with respect to leakage power in NCFET due to the negative DIBL effect (see Sect. 8.2.1). With such opposed dependencies (dynamic and leakage) to the operating voltage, total power follows the dominant component when voltage changed which leads to a novel trade-off. Consequently, power is not necessarily minimized at the minimum voltage *V*min, which traditional DVS selects, but at another voltage *Vopt* . Unawareness of NCFET and its trade-off could lead to not minimize the total power consumption. Therefore, in this section, a novel NCFET-aware voltage scaling technique is presented [17] to overcome the shortness that traditional DVS has in NCFET-based processors.

#### *8.4.1 Importance of NCFET-Aware DVS*

With traditional DVS, a set of voltage-frequency pairs are typically selected at design time and later are employed by the DVS technique at runtime to optimize the power. In this case, the lower the selected voltage is, the lower the total power is. Due to the new inverse dependency in leakage power that NCFET exhibits, this is not always valid with respect to NCFET. To demonstrate the consequence of such an inverse dependency at the system level, we plot the total power consumption and its components of the master thread of PARSEC *canneal* benchmark in Fig. 8.6.

The power examined starting from *V*min, that traditional DVS selects to sustain the required frequency, and then to overscale the operating voltage. The result shows that the power is not minimized at *V*min (i.e., *Vopt*=*V*min).

Different workloads exhibit different characteristics and hence different total power. Therefore, the contribution of power components differs. Traditional DVS neglects this difference as both contributions (leakage and dynamic) are affected in the same manner with voltage (both are reduced). With NCFET, the contribution of leakage to the total power cannot be neglected because it affects the operating voltage selection when DVS tries to minimize total power. Hence, based on the leakage share, *Vopt* could differ from *V*min.

For the aforementioned reasons, NCFET-aware DVS is crucial due to the change in the behavior of total power consumption over voltage scaling which emerges from the inverse dependency with respect to leakage power in NCFET.

#### *8.4.2 NCFET-Aware DVS Technique*

To enable runtime voltage selection, DVS first needs to determine workload characteristics and then *Vopt* can be correctly selected. Therefore, determining V/fpairs at runtime, like in traditional DVS techniques, is not possible here. Instead, the results from Sect. 8.2.1 have been used to build the power and performance analytical models at design time. Then, these models can be integrated with our new NCFET-aware DVS technique for runtime voltage selection.

#### **8.4.2.1 Design-Time Models**

**Power and Performance Modeling** The maximum operating frequency *f*max*(V )* depends on the voltage *V* over the minimum delay *d*min*(V )*:

$$d\_{\min}(V) = a\_{del} \cdot V^{b\_{del}} + c\_{del}; \quad f\_{\max}(V) = \frac{1}{d\_{\min}(V)}\tag{8.1}$$

*adel>*0, *bdel<*0, *cdel*≥0 are constants fitting parameters obtained at design time. Peak leakage and peak dynamic power consumption results by operating at maximum frequency are

$$P\_{leak}(V) = a\_{leak} \cdot V^{b\_{leak}} \tag{8.2}$$

$$P\_{dyn}^{peak}(V, d\_{\rm min}(V)) = a\_{dyn} \cdot V^{b\_{dyn}} + c\_{dyn} \tag{8.3}$$

*adyn>*0, *bdyn>*1, *cdyn*≥0, *aleak>*0, *bleak<*0 are constant fitting parameters obtained at design time. Both *Ppeak dyn (V , d*min*(V ))* and *Pleak(V )* are *convex* in *V* . By lowering the operating frequency of the CPU (higher delay), dynamic power decreases. However, since leakage power is independent from CPU activity, it is not affected.

$$P\_{dyn}^{peak}(V,d) = \frac{d\_{\rm min}(V)}{d} \cdot P\_{dyn}^{peak}(V, d\_{\rm min}(V)) \tag{8.4}$$

Therefore, *Ppeak dyn (V , d)* is convex in *V* (for constant *d*) if *bdyn* + *bdel>*1.

#### **8.4.2.2 Runtime Models**

**Workload-Dependent Power Modeling** Dynamic power consumption *Pdyn(V , d)* is affected by the running workload, which is reduced by a factor 0≤*rdyn*≤1 from the peak dynamic power *Ppeak dyn (V , d)*:

$$P\_{dyn}(V,d) = r\_{dyn} \cdot P\_{dyn}^{peak}(V,d) \tag{8.5}$$

$$P\_{\text{total}}(V, d) = P\_{\text{dyn}}(V, d) + P\_{\text{leak}}(V) \tag{8.6}$$

*rdyn* is not constant since it represents the current workload activity. Therefore, total power consumption *Ptotal(Vc,d)* at the current voltage *Vc*, *rdyn* is

$$r\_{\rm dyn} = \frac{P\_{\rm dyn}(V\_c, d)}{P\_{\rm dyn}^{peak}(V\_c, d)} = \frac{P\_{\rm total}(V\_c, d) - P\_{\rm leak}(V\_c)}{P\_{\rm dyn}^{peak}(V\_c, d)}\tag{8.7}$$

**Optimal Voltage Computing** *Vopt* that minimizes the total power can be obtained from the power and performance models:

$$V\_{\min}(d) = \left(\frac{d - c\_{del}}{a\_{del}}\right)^{\frac{1}{b\_{del}}}\tag{8.8}$$

**Algorithm 1** NCFET-aware voltage scaling algorithm to select the optimal voltage (*Vopt*) at runtime [17]

**Require:** Power and performance models: *Ppeak dyn (c, d)* and *Pleak (V )*, current operating voltage *Vc* and delay *d*, current power consumption *Pcurr*, min. voltage resolution

**Ensure:** Optimal operating voltage *Vopt*

1: *rdyn* <sup>←</sup> *(Pcurr*−*Pleak (Vc)) /Ppeak dyn (Vc,d)* Eq. (8.7) 2: *Vopt* ← *V*min*(d)* Eq. (8.8) 3: **repeat** 4: *Vopt* ← −*Ptotal(Vopt,d) /Ptotal(Vopt,d)* 5: *Vopt* ← *Vopt* + *Vopt* iterative update 6: **if** *Vopt<V*min*(d)* **then return** *V*min*(d)* out of bounds 7: **if** *Vopt>V*max **then return** *V*max out of bounds 8: **until** *Vopt <*  Termination criteria 9: **return** *Vopt*

$$V\_{opt}(d, r\_{dyn}) = \underset{V\_{\min}(d) \le V \le V\_{\max}}{\text{arg min}} P\_{total}(V, d) \tag{8.9}$$

Since *Ptotal(V , d)* is composed of convex functions, our implemented algorithm exploits that *Ptotal(V , d)* is convex in *V* . This guarantees that *Ptotal(V , d)* has exactly one minimum w.r.t. *V* within the range [*V*min*(d), V*max]. Algorithm 1 summarizes our implemented DVS technique and obtaining *Vopt* .

#### *8.4.3 Operating Voltage Selection*

Both DVS techniques differ in the way they select the operating voltage. Therefore, to show the different behavior between both techniques in operating voltage selection, the design space of the operating voltage selection with NCFET-aware (*Vopt*) and NCFET-unaware DVS (*V*min) has been explored in Fig. 8.7. NCFETunaware DVS sets *V*min that is needed to sustain the required frequency and therefore workload characteristic is not considered. Contrarily, NCFET-aware DVS considers the workload characteristic as it depends on the ratio of leakage to total power measured at *V*min. The explored design space in Fig. 8.7 shows two distinct regions: (1) For low leakage to total power ratio and for high frequencies, the same voltage is selected (similar action) by both techniques (i.e., *Vopt*=*V*min). (2) For high ratios of leakage to total power or low frequencies, NCFET-aware DVS selects a higher voltage (*Vopt>V*min). Moreover, Fig. 8.7 reveals that: the higher the required frequency or the higher the leakage to total power ratio, the higher *Vopt* is.

**Fig. 8.7** Operating voltage selection using both DVS techniques. Two regions appear: (1) NCFETaware selection differs from NCFET-unaware (*V*min=*Vopt*). (2) Similar action is done by both DVS as they select the same operating voltage (*V*min=*Vopt*). NCFET-unaware DVS selects *V*min (that sustains the required frequency) and NCFET-aware selects *Vopt* to minimize the power depending on the frequency and the ratio of leakage to total power. NCFET-aware DVS selects higher voltages when leakage power becomes prominent or at lower frequency

#### *8.4.4 Evaluation*

#### **8.4.4.1 Experimental Setup**

Using the same setup in Sect. 8.2.1, power and delay results were examined using the highest ferroelectric thickness (4 nm). Afterwards, the power and performance analytical models have been developed as described in Sect. 8.4.2.

For system-level simulation, relying on the setup described in Sect. 8.2.2, the NCFET-aware DVS technique (Algorithm 1) has been used to select the operating voltage when a set of tasks were examined from the PARSEC benchmark suite [4]. The frequencies are set between 1*.*0 GHz and 2*.*4 GHz. *Vdd* is set between 0*.*2 V and 0*.*7 V. The low operating voltages *Vdd* in NCFET are lower than traditional FET due to the inherent voltage amplification in NCFET provided by the negative capacitance. For fair comparisons, simulators for both DVS cases were configured to have: the same frequencies, and architecture, in addition to running the same benchmarks. Hence, only voltage selection differs based on DVS decision.

#### **8.4.4.2 NCFET-Aware DVS Results and Analysis**

To show the effectiveness of the NCFET-aware DVS, we first show how NCFETaware DVS actually operates to save power and later to report the energy savings

**Fig. 8.8** (**a**) Operating voltage and (**b**) total power consumption during an interval of the execution time of the *canneal* master thread with NCFET-unaware and NCFET-aware DVS. NCFET-aware DVS selects higher voltage most of the time (in this particular example) and reduces the power further at the same CPU frequency. Voltage selection is based on workload characteristics

for different benchmarks in comparison with NCFET-unaware DVS. Accordingly, an illustrative example of the master thread of PARSEC *canneal* benchmark was selected. Figure 8.8 shows distinct phases during an interval of the execution time. In phase-1, in Fig. 8.8b, it shows the total power consumption when the frequency is set at 1*.*7 GHz. Traditional DVS sets *Vdd* to the minimum voltage (0*.*28 V) which required to sustain this frequency. Thus, dynamic power is minimized but the leakage power is not. NCFET-aware DVS sets *Vdd* to a higher value to guarantee a better trade-off. This will increase the dynamic power but strongly decreases leakage power resulting in a power saving. In phase-2, the master thread is idle and waits for the termination of the slave threads. Therefore, frequency is reduced to the minimum frequency (1*.*0 GHz). Traditional DVS reduces *Vdd* to 0*.*2 V due to the low required frequency in which it increases the leakage power. NCFET-aware DVS, instead of reducing *Vdd* , increases the voltage to 0*.*53 V, which decreases the leakage power. Thereby, the total power consumption in phase-2 is reduced by 67 % compared to the traditional DVS. In phase-3, after the slaves terminated, the master resumes operation and its frequency is boosted again to 1*.*7 GHz. It is worth to mention that the performance obtained with both DVS techniques is the same. This is because they do not affect the frequency, but only set the *Vdd* under performance constraint.

To reveal the energy savings, different PARSEC benchmarks were examined when active threads are operated at 1*.*7 GHz and idle cores are suppressed to 1*.*0 GHz. Figure 8.9 summarizes the energy savings. Energy savings range as shown in Fig. 8.9 from 14 % up to 27 % and in average are up to 20 %.

**Fig. 8.9** Energy saving results of different benchmarks using the NCFET-aware DVS compared to NCFET-unaware DVS. Energy savings range from 14 % up to 27 % , and in average 20%

#### **8.5 Conclusion**

In this chapter, we investigated how NCFET technology impacts the existing trade-offs in processors and how it can reshape the future of many-core systems. Compared to the existing FinFET technology, NCFET technology allows the processor to operate at a much lower voltage while it still meets the same performance. This results in a considerable power saving and as a result the total number of cores, that can be simultaneously turned on at the maximum frequency, increases without violating the predetermined thermal constraints. We also showed how NCFET inverses the leakage-voltage dependency and proposed a new NCFETaware DVS technique that provides an energy saving of 20% on average compared to conventional DVS techniques, which are unaware of the new leakage-voltage dependency that NCFET brings.

**Acknowledgments** The authors would like to thank Yogesh S. Chauhan, Girish Pahwa, and Amol Gaidhane from the Indian Institute of Technology Kanpur (IIT Kanpur) for their valuable contribution and support in the NCFET analysis at the device level and compact modeling.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 9 Run-Time Enforcement of Non-functional Program Properties on MPSoCs**

**Jürgen Teich, Pouya Mahmoody, Behnaz Pourmohseni, Sascha Roloff, Wolfgang Schröder-Preikschat, and Stefan Wildermann**

#### **9.1 Introduction**

In a broad range of embedded systems, e.g., in real-time and safety-critical domains, applications require guarantees (rather than a best-effort behavior) w.r.t. non-functional properties of their execution such as timing characteristics and reliability. Delivering the required guarantees is, therefore, of utmost importance for the successful introduction of multi-/many-core architectures in the embedded domains of computing. In a many-core context, existing analysis tools either impose an immense computational complexity or deliver worst-case guarantees that suffer from a massive over-/under-approximation for the vast majority of execution scenarios (due to the inherent uncertainty of these scenarios) and, hence, are of no practical interest. Noteworthy, a major source of this uncertainty originates from the interferences among concurrent applications.

In view of abundant computational and storage resources becoming available, new programming paradigms such as *invasive computing* [22] have proved effective in alleviating these interferences by means of spatial isolation among applications. Here, hybrid (static analysis/dynamic mapping) approaches, e.g., [11, 19, 20, 24], enable a static generation of different mappings for each application on system resources in form of mapping classes rather than individual mappings. For each concrete mapping within such a class, safe bounds on the non-functional execution properties, e.g., latency, may hold, see, e.g., [25]. The statically generated and analyzed sets of optimal mapping classes are then provided to the run-time system which checks the availability of such constellations of resources under the current system workload, and, if enough resources are available, finally launches the

J. Teich (-) · P. Mahmoody · B. Pourmohseni · S. Roloff · W. Schröder-Preikschat · S. Wildermann

Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany e-mail: juergen.teich@fau.de

application [25]. Such a hybrid approach has been implemented within the language InvadeX10, a library-based extension of the X10 programming language. In this extension, the so-called *requirements* [23] on non-functional execution properties, e.g., latency, may be annotated to individual applications or program segments thereof.

Although spatial isolation among applications significantly reduces the aforementioned uncertainties, a considerable degree of them remain unaffected which might be unacceptable, e.g., for safety-critical applications. But also, real-world applications from the domain of streaming often exhibit a large jitter in the latency and throughput (in spite of inter-application resource isolation) which is not tolerable, e.g., in case of camera-based medical surgery. This intolerable or annoying variation mainly stems from two sources of uncertainty that cannot be eliminated or restricted through resource isolation:


In the presence of execution state and input uncertainties, *application-specific run-time techniques* can offer a practical approach to confine the non-functional properties of execution within acceptable bounds or to prevent the violation of requirements. Such techniques dynamically adjust a given set of control knobs, e.g., voltage/frequency settings, in reaction to observed (or predicted) changes in the input and/or environment states to steer the non-functional properties of execution within the desired range. Examples of such approaches include the enforcement of safety properties using automata [13] or the satisfaction of timing constraints (while minimizing energy) using control-theory oriented approaches [9]. We refer to this emerging class of application-specific run-time techniques as *Run-Time Requirement Enforcement (RRE)*. This paper presents the fundamentals, definitions, and taxonomy of RRE in the context of many-core systems. We exemplify the practice of different classes of RRE techniques and present a discussion on their advantages, drawbacks, and challenges in a case study on the enforcement of timing requirements for a distributed real-time image processing application.

#### **9.2 Preliminaries and Definitions**

#### *9.2.1 System Model*

A many-core *architecture* is typically organized as a set of so-called storage, I/O, and compute tiles which are interconnected by a Network-on-Chip (NoC) for scalability, see, e.g., Fig. 9.1. Memory and I/O tiles enable mass storage and offchip access, respectively. Each compute tile is typically organized as a multi-core or a processor array and comprises a set of processing cores, peripherals such as memories, and a network adapter which are interconnected via one or more buses. An *application* to be executed on the architecture is typically composed of a set of processing tasks with known data dependencies, provided as a task graph. In case of periodic applications, actor-based models of computation and languages such as ActorX10 [17] may be used for parallel programming of MPSoCs. Each application may be augmented with one or a set of *requirements* on specific non-functional properties of its execution, e.g., execution time, throughput, or power corridors. In the following, a *mapping* of an application on a given architecture corresponds to a binding of its tasks to platform cores, a routing of the data exchanged between communicating tasks, an allocation of the required processing, communication, and storage resources, and a scheduling of tasks and communications on the allocated

**Fig. 9.1** A heterogeneous invasive MPSoC architecture

resources. Alternatively to concrete mappings, a set of constraints that reflect a constellation of required resources and, hence, correspond to several concrete deployments of the application on the architecture may be characterized at design time through techniques of design space exploration [19, 24, 25].

#### *9.2.2 \*-Predictability*

Non-functional requirements of applications, e.g., real-time constraints, can often be expressed in form of intervals according to the definition for the predictability of a non-functional property from [23]:

**Definition 9.1 (\*-predictability)** Let *o* denote a non-functional property of a program (implementation) *p* and the uncertainty of its input (space) given by *I* and environment by *Q*. The predictability (marker) of objective *o* for program *p* is defined by the interval

$$o(p, \mathcal{Q}, I) = [\inf\_{o} (p, \mathcal{Q}, I), \dots, \sup\_{o} (p, \mathcal{Q}, I)] \tag{9.1}$$

where *inf <sup>o</sup>* and *supo* denote the infimum and supremum of property *o*, respectively, under variation of state *q* ∈ *Q* and input *i* ∈ *I* .

Figure 9.2 exemplifies Definition 9.1 for three implementations *p*1, *p*2, and *p*<sup>3</sup> of an application with two requirements in terms of latency and power consumption.<sup>1</sup> The rectangle associated with each implementation *pi* confines the observable latency and power range for *pi* under the variation of input *i* ∈ *I* and state *q* ∈ *Q*. As illustrated, *p*<sup>1</sup> never satisfies the latency requirement under any input/state and, thus, is of no interest. Contrarily, *p*<sup>2</sup> satisfies both requirements in all input/state scenarios which—although offering desirable qualities—is achieved through, e.g., an over-reservation of resources or a persistently maximized core voltage/frequency which is often not affordable and/or practical. Contrarily to *p*<sup>1</sup> and *p*2, *p*<sup>3</sup> exhibits an attractive case: Under certain input/state scenarios it satisfies the requirements (with an affordable resource demand), while under other scenarios the acceptable latency-power region is surpassed.

In real-life use cases, the observable predictability intervals are often too coarse, so that a large share of viable implementations (like *p*3) do not satisfy the

<sup>1</sup>Note that a lower bound on latency makes sense in many applications that communicate result data to other applications or systems. Here, either buffer limitations would cause overflows in case the producer would be faster than the consumer. Alternatively, data might get lost if the producer overwrites not yet consumed data. Similarly, a minimal lower bound is the default in the case of reliability requirements. There, the lower bound could indicate a minimal expected lifetime. Finally, even lower power bounds can be found in the area of high-performance computing. In fact, the energy bill of a supercomputer increases by the amount of not consumed power but reserved by the provider.

**Fig. 9.2** Example of a program *p* with a latency requirement and a power requirement given each by an interval (corridor). Shown are three program implementations. *p*<sup>1</sup> does not satisfy the latency requirement for any possible execution. *p*<sup>2</sup> satisfies the two requirements for any possible variation in input *i* ∈ *I* and state *q* ∈ *Q*. Finally, *p*<sup>3</sup> may satisfy the two requirements, but obviously not for all observable executions. Here, run-time requirement enforcement techniques might be applicable to control the resources of the platform based on run-time monitoring to stay within the requirement corridors

given requirements under all input/state scenarios. For such partially satisfactory implementations, run-time techniques can be employed to render them consistently satisfactory by regularly monitoring (or predicting) the online input/state scenario and either acting proactively to avoid any violation of a set of given requirements, e.g., by adjusting the voltage/frequency settings of cores prior to program execution, or in reaction to any observed violation. The purpose of such run-time techniques is, therefore, to enforce that the desired latency and power corridor are never (or only occasionally) violated. We refer to these application-specific run-time techniques as Run-Time Requirement Enforcement (RRE) in the following.

#### **9.3 Run-Time Requirement Enforcement**

To satisfy a set of given requirements, the observable predictability intervals of the partially satisfactory implementations must be obviously reduced. In general, this can be achieved by techniques such as *restricting* the input space *I* or using *approximate computing* [23]. Alternatively, *isolation* techniques that reduce the state space *Q* may be applied such as the use of simpler cores, resource reservation protocols, or using *invasive computing* [22]. In the latter approach, an application program invades a set of processing and communication resources prior to execution. Through inter-application isolation, composability is established which is essential for an independent analysis of individual applications [1, 8, 12].

**Definition 9.2 (Run-Time Requirement Enforcer (RRE))** A Run-Time Requirement Enforcer (RRE) of a requirement *ro(p)* = [*LBo, UBo*] of a program *p* is a

**Fig. 9.3** Example of Run-Time Requirement Enforcement (RRE)

control technique to steer *o* within the corridor spanned by a lower bound *LB<sup>o</sup>* and an upper bound *UB<sup>o</sup>* for each execution of *p*.

Figure 9.3 exemplifies Definition 9.2 for a latency and a power requirement corridor of an implementation *p* of an application. An RRE is also depicted whose task is to confine the observable predictability interval of *p* within the corridor specified by the latency and power requirements. Given the actual (current) input *i*act ∈ *I* and state *q*act ∈ *Q*, the RRE in this case proactively estimates the expected latency *L*est and power consumption *P*est based on which it takes actions (outgoing arcs of the RRE) with the goal to avoid any violation of the requirements. Examples of RRE actions include adjusting the voltage/frequency of the cores or awaking reserved cores that are currently in a sleep state for power reduction, or even changing the mapping of some tasks to other cores [14].

#### **9.4 Taxonomy of Run-Time Requirement Enforcers**

According to [23], each requirement of an application can be either *soft* or *hard*. In case of a soft requirement, occasional violations are still considered acceptable. In this context, a RRE can be classified as either a *loose* or a *strict* enforcement technique as follows:

**Definition 9.3 (Loose/Strict RRE)** A Run-Time Requirement Enforcer (RRE) of a requirement *ro(p)* = [*LBo, UBo*] of a program *p* is called strict if it can be formally proven that no concrete execution of *p* will leave the given corridor at run-time. It is called loose, if one or multiple consecutive violations of *o* are tolerable.

Independent from the above definition, an RRE can be classified as a *centralized* or a *distributed* enforcement technique:

**Fig. 9.4** Centralized RRE

**Definition 9.4 (Centralized/Distributed RRE)** A Run-Time Requirement Enforcer (RRE) of a requirement *ro(p)* = [*LBo, UBo*] of a program *p* is called centralized if a single enforcer instance is used to enforce the requirement. It is called distributed in case multiple enforcers jointly enforce the requirement.

Figure 9.4 illustrates an example of a centralized RRE of a latency requirement for an object detection streaming application from the area of robot vision illustrated in Fig. 9.6. Here, the execution time of the 9 tasks (actors) of the application is monitored by a local so-called *Run-Time Requirement Monitor (RRM)* instantiated on each of the invaded tiles. A centralized RRE instance is also instantiated which, in the example, receives the monitored timing information of the last actor in the chain, i.e., Image Sink (ISi), from the RRM on the respective tile based on which it conducts enforcement decisions. During the execution of the application, each RRM derives the time elapsed for the execution of its local actor(s) for the current image frame and creates a time stamp that is sent together with the processed frame to the subsequent actor. Thus, each actor in the chain is provided with the information about the time already elapsed for the processing of the frame by the previous actors based on which it determines the slack available for the remainder of the processing. Once the last actor in the chain has completed its processing, the local RRM computes and sends a completion time stamp to the centralized RRE. In case of a soft latency requirement, a loose RRE would react to any latency corridor violation by adjusting the voltage/frequency of the tiles that host the time-critical actors, i.e., SIFT Description (SD) and SIFT Matching (SM), with the goal to steer the latency of the chain back into the corridor for subsequent image frames.

**Fig. 9.6** Object detection streaming application

Figure 9.5 illustrates an example of a distributed RRE for the object detection application. Here, the overall time requirement per image frame could be partitioned into sub-corridors (or interval budgets) which are assigned to the invaded tiles. Also, in addition to the RRMs, a local RRE is instantiated per tile to enforce its assigned sub-corridor locally. Evidently, distributed RRE benefits from a simpler realization and scalability in comparison to centralized RRE. Nonetheless, centralized enforcement could better use global information to optimize secondary goals such as energy consumption, as we will show in Sect. 9.5.

#### *9.4.1 Enforcement Automata (EA)*

Although arbitrary algorithmic behavior can be envisioned for enforcement, in the following we focus on automata-based enforcement techniques, as they are simpler to generate and ideal for application of formal verification techniques for proof of correctness due to their strong formal semantics. Formal proofs are necessary

**Fig. 9.7** Example of an Enforcement Automaton (EA). Depending on the input *i* of a program *p* and a current state *s*, the automaton takes a state transition to enforce a requirement. In the example, in state *s* ∈ *S*, the EA outputs how many cores *n(s)* shall be powered on and in which power mode *m(s)* (voltage, frequency) *p* shall be executed

particularly for enforcement of hard requirements. Figure 9.7 illustrates an example of an enforcement automaton (EA) of type Moore in which the input is a measure of the current workload *i* of a periodically executed program (segment or task) *p*, e.g., an image processing actor or kernel. In each state *s* ∈ *S*, the EA produces a vector of two outputs: the number *n(s)* of cores to be powered on for executing the current job and the power mode *m(s)* to be applied to the active core(s). As illustrated, the RRE acts as an interface between the application and the system software of the tile. Although in the examples provided in this paper, only the power management facility (voltage/frequency settings) and the degree of parallelism are controlled by RREs, they could in general control or restrict other system software components as well, e.g., the thread scheduler or the memory managing unit, for the enforcement of the given requirements.

#### *9.4.2 i-lets and e-lets*

Whereas for centralized enforcement, we assume that only one enforcer is instantiated per application program *p*, each task/actor of a distributedly mapped application program will be assigned its own local enforcer. In our implementation, an enforcer is implemented as a preferential thread called *e-let* in the following whereas application threads spawned for each task execution are called *i-lets*. Note that even if both are considered logically equivalent in terms of executable threads at the level of operating system, there is a notable difference between both: Even if i-lets present the application code for which requirements need to be guaranteed, they are usually not preferred by the operating system over other threads of their kind of the same application program. Whereas e-lets are always considered the preferred execution entities of the application program, they dominate the i-lets also included in this program. In addition, according to the principle of least privilege, elets have the capability to overrule or restrict the behavior of system-level software components including schedulers as well as cache, memory, and power managers in order to be able to enforce the properties required of their assigned i-lets. Another major difference between application i-lets and e-lets is the way they are executed. Whereas i-lets are created and executed upon each activation, e-lets are created only once at the time where an application program invades a tile of cores. They remain active not only for one iteration, but until the whole application retreats from all occupied resources. e-lets, in particular, state transitions in case the behavior is described by an EA, are triggered by incoming events, very similar to data-driven execution. In case of the following robot vision application, e.g., a state transition is triggered each time a new frame is arriving from a neighbor tile. In normal execution, the i-lets of an activated application task start after the EA has transited from the actual to the next state and have run-to-completion semantics. Whereas, elets may alternatively be triggered by asynchronous events, e.g., an exception from a temperature monitor.

Since an e-let is to be provided with special system privileges, including in particular the capability for immediate and low-latency response in bounded time to system events and operational state changes, it is implemented as a *kernel-level thread*. As a first-class object of the operating-system kernel, such a thread makes it possible to establish and maintain a semantic relationship between system-level and user-level code. The same applies to an i-let, but without granting the associated kernel-level thread any special privileges.

An ensemble of kernel-level threads with and without special capabilities for the control of system behavior depending on the particular requirements of an application program is managed by the kernel in the shape of a *squad*. A squad is a special unit within a *team* (a non-empty set of processes sharing a common address space and common computing resources [4]) of related kernel-level threads. This unit consists of two types of threads: on the one hand, those that make up the actual *lead* of the application program and on the other hand at least one *aide* who assists the lead threads as system mediator. From the kernel's point of view, the aide has all the capabilities to assure the lead the required system behavior in a controlled manner. In addition to be able to override or modify certain operatingsystem decisions, the aide is able to instantly respond to system events. In such a setting, an e-let is mapped to an aide, while the i-lets for which certain properties are enforced appear as lead.

#### **9.5 Case Study**

In this section, we present examples of enforcement techniques for strict vs. loose as well as distributed vs. centralized enforcement of timing requirements for the case study of the previously introduced object detection application depicted in Fig. 9.6. The application consists of a chain of 9 tasks (actors) processing each input image in succession: an image source (ISo) actor to read in input images periodically at a constant rate, a gray-scale (GS) conversion actor, a Sobel edge detection (ED) actor and a Harris corner (HC) detection actor to determine, respectively, edges and corners in an image, a SIFT orientation (SO) actor to achieve invariance to image rotation, a SIFT description (SD) actor to extract the *features* in an image, a SIFT matching (SM) actor to detect objects in the image based on a previously trained set of object features, and a RANSAC (RS) actor to insert the detected objects into the image which is finally sent out by an image sink (ISi) actor. As platform, we consider a NoC-based 3 × 3 many-core architecture as depicted in Fig. 9.1 and map the application's actors on the architecture as illustrated in Figs. 9.4 and 9.5. All evaluations presented in this section are carried out using InvadeSIM [16, 18], a high-level functional simulation framework for multi-/many-core architectures and supporting resource-aware programming.

#### *9.5.1 Enforcement Problem Description*

In the following, we assume that each image frame of the given time-critical application must be processed within a latency upper bound *UB<sup>L</sup>* = 115 ms. Table 9.1 provides the average, standard deviation, and overall contribution of each actor's latency when processing a sequence of 9 149 images stemming from different sources of video streams when each actor is processed in isolation on a single core and running constantly at maximum frequency. As can be seen in Table 9.1, the SD and SM actors exhibit the highest degree of input-dependent variation in execution time and also the highest contribution to the overall latency. The remaining actors, on the other hand, do not exhibit a comparable execution time jitter and/or a comparable contribution to the overall application latency across the input space.

**Table 9.1** Average, standard deviation, and overall contribution to the overall latency of each actor of the object detection application in Fig. 9.6 when processing a test sequence of 9 149 images and executed each in isolation on a single core running constantly at maximum frequency according to Table 9.2


In the following, we present examples of RRE techniques using Dynamic Voltage and Frequency Scaling (DVFS) [3, 10, 21, 26] to enforce the global latency upper bound *UB<sup>L</sup>* = 115 ms for the given application. Due to the small variation and overall latency contribution of all except the actors SD and SM according to Table 9.1, we dedicate a time budget of 20 ms to the other actors altogether, assuming that their cumulative latency per input image does not exceed this budget. This translates into a latency upper bound of *UB<sup>L</sup>* = 95 ms for the SD and SM actors. For the demonstration of distributed RRE techniques, we further decide to split this bound into two individual latency upper bounds, namely *UB<sup>L</sup>* = 80 ms for SD and *UB<sup>L</sup>* = 15 ms for SM. Next, we present examples of loose vs. strict as well as distributed vs. centralized enforcement. As a merit of profit, we also investigate the potential energy savings of each RRE strategy in addition to evaluating its capability in enforcing the latency requirement(s).

According to Fig. 9.7, the following RRE techniques implemented as enforcement automata are privileged to adjust two control knobs prior to processing an image frame: (a) the degree of execution parallelism per actor that is adjusted by setting the number *n* of active cores that process the workload of each actor and (b) the power (voltage/frequency) mode *m* of the core(s) allocated for each actor adjusted through DVFS (for active cores) and power gating (for inactive cores). To this end, each RRE decides on a per input image basis how to distribute the workload of each actor being enforced between one and four cores available per tile according to the mappings shown in Figs. 9.4 and 9.5. At the same time, it sets the power mode of the cores of each tile to either a power-gated mode (with *f* =0 and *V*DD =0) or 20 possible DVFS configurations (with a frequency step size of 0*.*2 GHz and a maximum frequency of 4 GHz) summarized in Table 9.2. For both actors under enforcement, SD and SM, we analyzed the major source of latency variation according to Table 9.1 (single core, constant maximal frequency) as stemming from the variability in the number *i* of *features* in each image to be processed. Therefore, this number is used as a direct indicator of the input workload to the following RRE strategies.


**Table 9.2** Voltage/frequency (DVFS) modes of each core

#### *9.5.2 Power, Latency, and Energy Model*

Our investigation of enforcement strategies involves the evaluation of power consumption, execution latency, and energy demand per actor under enforcement. To evaluate the power consumption *P (m)* of a core in power mode *m*, we use Eq. (9.2) in which the first summand represents the dynamic power contribution calculated based on the effective switching capacitance *C*eff and the supply voltage *V*DD*(m)* and operating frequency *f (m)* of the core in power mode *m*. The second summand describes the static power consumption calculated as the product of leakage current *I*leak and supply voltage *V*DD*(m)*.

$$P(m) = C\_{\rm eff} \cdot V\_{\rm DD}(m)^2 \cdot f(m) + I\_{\rm leak} \cdot V\_{\rm DD}(m) \tag{9.2}$$

For the construction of proper enforcement automata, we need to know the relation between the number *i* of input features and the execution latency *L* of each actor to be enforced in dependence of the number *n* of cores and power mode *m*. Let *L(*1*,* 1*, m*max*)* denote the latency for processing one feature on one core in power mode *m*max (highest voltage and frequency). In the following, *L(*1*,* 1*, m*max*)* is determined by simulatively determining the execution latency of each actor per image for a representative set of 9 149 test images that fully covers the considered input space. Subsequently, the latency per feature of an actor is determined for each image by dividing its latency by the number of features *i* in that image. Figure 9.8 illustrates the distribution (left) and the cumulative distribution (right) of the per-feature latency for the SD actor. Based on the obtained distribution, we then determine *L(*1*,* 1*, m*max*)* according to the *strictness* of the latency requirement which specifies the minimum rate *s* ∈ [0*,* 1] of requirement satisfaction that must be achieved, specified by the user. In case of (a) strict enforcement, a strictness of *s* = 1 is considered, and hence, the maximum observed per-feature latency among all images is used as *L(*1*,* 1*, m*max*)*. For (b) loose enforcement, i.e., when *s <* 1, *L(*1*,* 1*, m*max*)* is set to the lowest per-feature latency among all images such that for *s*· 100 % of images, the latency per feature is lower than or equal to the selected *L(*1*,* 1*, m*max*)*. In Fig. 9.8 (right), this calculation corresponds to finding the lowest *x*-coordinate with a cumulative density of *s*. Having *L(*1*,* 1*, m*max*)* determined, the following Eq. (9.3) is then used to determine the actor latency *L(i, n, m)* based on

**Fig. 9.8** Distribution (left) and cumulative distribution (right) of observed per-feature latency of the SD actor for a test sequence of 9 149 input images with number *i* of features to be processed varying between 0 and 5 513. To the right, the value of *L(*1*,* 1*, m*max*)* is marked for requirement strictness values of *s* = 0*.*5, 0*.*84, and 0*.*98

the number of features *i* to be processed within an image, the number of cores *n* employed, and the power mode *m* selected by an RRE scheme. In Eq. (9.3), *e(n)* denotes the parallel efficiency in dependence of the number of cores *n* employed for the computation with *e(n)* = 1 in the best case. In our experiments, we consider *e(n)* = 1.

$$L(i, n, m) = L(1, 1, m\_{\text{max}}) \cdot \left\lceil \frac{i}{n \cdot e(n)} \right\rceil \cdot \frac{f(m\_{\text{max}})}{f(m)} \tag{9.3}$$

Note that Eq. (9.3) is a latency model specific to the SD and SM actors of our running application where *L(*1*,* 1*, m*max*)* must be determined individually for each actor to be enforced. Moreover, Eq. (9.3) could be alternatively replaced with an elaborate many-core timing analysis, e.g., those from [2, 5–7, 15, 25], to derive tight worst-case latencies that support a variety of different resource arbitration policies and resource sharing schemes. Based on the power consumption and latency models in Eqs. (9.2) and (9.3), the energy *E(i, n, m)* required by the actor for processing an image with *i* features using *n* cores running in power mode *m* is derived using Eq. (9.4).

$$E(i, n, m) = L(i, n, m) \cdot P(m) \cdot n \tag{9.4}$$

Finally, the maximum number of features that can be processed within a given latency bound *UB<sup>L</sup>* using *n* active cores running in power mode *m* can be determined using Eq. (9.5) which is derived from Eq. (9.3), considering *L(i, n, m)* ≤ *UBL*.

$$i\_{\text{max}}(UB\_L, n, m) = \left\lfloor n \cdot e(n) \cdot \left\lfloor \frac{UB\_L}{L(1, 1, m\_{\text{max}})} \cdot \frac{f(m)}{f(m\_{\text{max}})} \right\rfloor \right\rfloor \tag{9.5}$$

For example, with Eq. (9.5), we may compute *i*max*(*80*,* 4*,* 20*)* for the SD actor which is the highest number *i* of features of an input image for which a latency upper bound of *UB<sup>L</sup>* =80 ms can be enforced with a strictness of *s* ∈ [0*,* 1]. For instance, for loose enforcement with *s* = 0*.*5, we obtain *i*max*(*80*,* 4*,* 20*)* = 828, and for strict enforcement where *s* = 1, the maximum enforceable workload decreases to *i*max*(*80*,* 4*,* 20*)* = 760 features.

#### *9.5.3 Energy-Minimized Timing Enforcement*

According to Fig. 9.7, RRE may involve to set, modify, or impose restrictions on typically OS-related techniques such as thread scheduling or memory management. In the following examples, we exemplify enforcement strategies for latency enforcement of individual actor executions or complete applications by varying the number *n* ∈ [1*,* 4] of cores (parallelism) and the power mode *m* ∈ [1*,* 20] configuration for each actor execution. As, in general, multiple ways and settings for *n* and *m* might be feasible to enforce a requirement, the question becomes which requirementadhering constellation the enforcer selects at run-time. Often, this freedom of choice may be exploited by optimizing one or more (secondary) objectives in addition to satisfying the given requirement. In the following, we consider energy demand as an objective to be minimized.<sup>2</sup> Given a latency requirement *UB<sup>L</sup>* and the RRE decision space of *n* ∈ [1*,* 4] and *m* ∈ [1*,* 20], design space exploration can be conducted per actor (or a set of actors) to derive, e.g., in our running example for the SD actor, the maximum number *i*max of features that can be processed under each choice of *(n, m)* while respecting the latency requirement. Taking the SD actor with a latency upper bound *UB<sup>L</sup>* =80 ms as an example, Fig. 9.9 illustrates the maximum workload *i*max and the respective energy demand for each of the 80 possible *(n, m)* configurations

**Fig. 9.9** Maximum enforceable workload *i*max and energy demand of the SD actor under the variation of the number *n* ∈ [1*,* 4] of active cores and their power mode *m* ∈ [1*,* 20] for a hard (*s* = 1) latency bound of *UB<sup>L</sup>* = 80 ms. Pareto-optimal *(n, m)* configurations are connected by a red line. For an exemplary subset of them, the Pareto-optimal configuration *(n, m)* is also annotated

<sup>2</sup>Other objectives for choice of settings could be to activate the least number *n* of cores for increasing aspects of long-term reliability.

derived using Eqs. (9.5) and (9.4), respectively, in case of strict enforcement (*s* =1). The red line designates Pareto-optimal *(n, m)* configurations.

Based on such a design space exploration and the Pareto front of *(n, m)* configurations derived thereby, an energy-minimizing enforcement automaton may be systematically constructed in which prior to each execution of the SD actor, the enforcement automaton selects a state (Pareto-optimal *(n, m)* configuration) that is energy-minimal while satisfying the latency requirement in case input *i* is enforceable, thus if *i* ≤ *i*max*(UBL, n, m)*. For the example in Fig. 9.9, the enforcement automaton has 31 states, each corresponding to one of the 31 Pareto-optimal *(n, m)* configurations and the maximum enforceable workload *i*max associated with that configuration. Here, the state selection is steered solely by the number *i* of features in the image to be processed by the SD actor.<sup>3</sup> For instance, for images with *i* ≤ 9 features, *n* = 1 and *m* = 1 minimizes the energy demand of the SD actor without violating the given latency requirement *UB<sup>L</sup>* = 80 ms. For input images with 142 ≤ *i* ≤ 152 features, an energy-minimal and requirement-adhering execution can be realized only if *n* = 4 cores are used for SD in parallel and power mode *m* = 4. Finally, a strict enforcement becomes impossible if *i >* 760, even using the configuration with the highest compute power, i.e., *n* = 4 and *m* = 20. For non-enforceable inputs, the enforcer needs to either throw an exception, stop processing (drop) the image, or process only as much as the latency bound allows to be processed. In Sect. 9.5.6, we propose a number of exception handling techniques under the topic of range extension. Before that, we first present techniques for distributed enforcement where each actor is individually enforced. Subsequently, we present also an example of centralized enforcement in which a more global view of the system state can be obtained by a centralized RRE instance that can take decisions affecting multiple actors and resources.

#### *9.5.4 Distributed Enforcement*

Figure 9.10 shows the resulting automatically generated energy-minimizing enforcement automata for a distributed enforcement strategy of the two individual actors SD and SM with latency upper bounds 80 ms and 15 ms, respectively. The EAs for selecting the energy-minimizing *(n, m)* configurations obtained through the previously presented design space exploration are implemented as lightweight lookup tables for each actor. At run-time, once an image is ready to be processed, the number *i* of features in it becomes known. Prior to processing an image, the RRE (e-let) retrieves the energy-minimizing *(n, m)* configuration corresponding to

<sup>3</sup>Note that in this example, the RRE could also be represented by a function table rather than an FSM, as the selection of state is only dependent on the input. More general cases such as restricting the allowed settings in each state to allow only step-wise increase or decrease of DVFS modes can be constructed.

**Fig. 9.10** Implementation of distributed RRE using pre-explored energy-optimal parallelism degree and DVFS settings *(n, m)* for SD and SM actors with hard (*s* = 1) latency upper bounds of *UB<sup>L</sup>* =80 ms and *UB<sup>L</sup>* =15 ms, respectively

*i* features from the table and instructs the power manager to use these settings. As shown in Fig. 9.10, the integration of enforcers may be achieved at the level of actor graphs as a model transformation by inserting the enforcer as an actor in front of each actor to be enforced, such that for each image to be processed, the energyminimizing *(n, m)* configuration is set prior to execution of the image, and the configuration stays constant over the duration of processing this image. Employing the above enforcement strategy, the run-time manager is not compelled to run the enforced actors constantly with the maximum number *n* = 4 of cores and in the highest power mode *m* = 20 to guarantee the satisfaction of latency constraints in the presence of input variations, unless *i* ≥721 for SD or *i* ≥985 for SM. Also note that the given latency bounds cannot be strictly enforced for a feature count *i >*760 for SD and *i >* 1 036 for SM. Thus, the maximum workload that can be strictly enforced by both actors is limited to *i* = 760 features.

The histograms of observable latencies of the SD and SM actors (a) without enforcement (*n*=4 and *m*=20) and (b) with enforcement considering hard (*s* = 1) latency upper bounds of *UB<sup>L</sup>* = 80 ms and 15 ms for SD and SM, respectively, are illustrated in Fig. 9.11. As shown in the plots, the RREs choose a power mode that maximizes energy savings while satisfying the given latency upper bound of each actor under enforcement. For a variety of requirement strictness levels, Table 9.3 finally presents the average dynamic energy consumption and the achieved dynamic energy savings of the SD and SM actors compared to the non-enforced scenario with *n* = 4 and *m* = 20. As can be seen, in case of *loose enforcement*, i.e., a strictness *s <* 1, the RRE achieves between 38*.*3 % and 41*.*2 % dynamic energy savings per enforced actor (respectively, between 39*.*3 % and 40*.*8 % collectively

**Fig. 9.11** Latency distribution for the SD (left) and SM (right) actors. The enforced case corresponds to the energy-minimized enforcement for hard (*s* = 1) latency bounds of *UB<sup>L</sup>* =80 ms for SD and *UB<sup>L</sup>* =15 ms for SM. The non-enforced case corresponds to a fixed setting of *n* = 4 and *m* = 20 per actor

for the two actors) while satisfying latency upper bounds of 80 ms and 15 ms for the SD and SM actors, respectively. In case of *strict enforcement* which corresponds to a requirement satisfaction rate of *s* = 1, the RRE still is able to achieve dynamic energy savings of 37*.*6% for SD and 37*.*2% for SM (respectively, 37*.*6% collectively for the two actors) while guaranteeing that the given latency upper bound for each actor will never be violated. Evidently, this guarantee holds only for enforceable input images, i.e., those with *i* ≤760 features for the SD actor and *i* ≤1 036 for SM (see the RRE tables in Fig. 9.10). In Sect. 9.5.6, we discuss approaches that can be employed to enable the enforcement of latency requirements for inputs which are not enforceable merely using the given RRE control knobs. Finally, when analyzing the overall energy consumption of all actors per input frame, we obtain an overall dynamic energy reduction of 33*.*8 % in case of strict enforcement (*s* = 1) and between 35*.*4 % and 36*.*8 % in case of loose enforcement (*s <* 1) for the whole application, even though only two out of 9 actors are enforced. Noteworthy, the additional execution time and energy consumption of the RREs themselves can be neglected as these are implemented by simple table lookups.

#### *9.5.5 Centralized Enforcement*

In this section, we consider the combined enforcement of the SD and SM actors using centralized enforcement. As depicted in Fig. 9.12, a single instance of an RRE is now enforcing the overall hard (*s* = 1) latency upper bound of *UB<sup>L</sup>* = 80 + 15 = 95 ms for both SD and SM actors collectively. Similar to the distributed case, the energy-minimizing *(n, m)* configurations which are required for the construction of the RRE are obtained through a previously presented design space exploration, but now considering a unified latency upper bound *UB<sup>L</sup>* = 95 ms for the execution of both SD and SM actors. Note that considering a compound


The energy consumption without enforcement (*n*=4, *m* =20) serves as a baseline

**Fig. 9.12** Implementation of a centralized RRE using pre-explored energy-optimal parallelism degree and DVFS settings *(n, m)* for SD and SM actors with a hard (*s* = 1) latency upper bound of *UB<sup>L</sup>* = 95 ms for both actors collectively

**Fig. 9.13** Latency distribution of the SD and SM actors when collectively enforced for a hard (*s* = 1) compound latency bound of *UB<sup>L</sup>* =95 ms. The enforced scenario is realized using energyminimized enforcement, and the non-enforced scenario corresponds to a fixed configuration of *n* = 4 and *m* = 20 per actor

latency bound for both actors enables enforcing this bound for images with up to *i* = 790 features.

The histogram of observable collective latency of the SD and SM actors (a) without enforcement (*n* = 4 and *m* = 20 for both actors) and (b) with enforcement considering a hard latency upper bound, i.e., for a requirement strictness of *s* = 1, is illustrated in Fig. 9.13. As shown in the plots, the RRE assigns the number *n* of active cores and their power mode *m* for each actor under enforcement to maximize energy savings while satisfying the given latency upper bound of *UB<sup>L</sup>* = 95 ms collective for both actors. For a variety of requirement strictness levels, Table 9.4 finally presents the average dynamic energy consumption and the achieved dynamic energy savings of the two enforced actors using centralized enforcement compared to the non-enforced scenario with *n* = 4 and *m* = 20. As can be seen, the RRE achieves in case of *loose enforcement*, i.e., strictness *s <* 1, between 39*.*7 % **Table 9.4** Average dynamic energy consumption and savings per image for the SD and SM actors through centralized enforcementfor a compound latency bound of *UBL* = 95 ms in dependence of requirement strictness defined as the minimum acceptablerequirement satisfaction rate


The energy consumption without enforcement (*n*=4, *m* =20) serves as a baseline

and 42*.*1 % dynamic energy savings collectively for the two enforced actors while satisfying a compound latency upper bound of 95 ms. In case of *strict enforcement* (*s* = 1), the RRE still is able to achieve dynamic energy savings of 38*.*8% collectively for the two actors while guaranteeing that the given latency upper bound *UB<sup>L</sup>* = 95 ms will never be violated. Finally, when analyzing the overall energy consumption of all actors per input frame, we obtain a dynamic energy reduction of 35 % in case of strict enforcement (*s* = 1) and between 35*.*7 % and 37*.*9 % in case of loose enforcement (*s <* 1), even though only two out of 9 actors are enforced. In summary, compared to distributed enforcement, the centralized scheme is able to even save slightly more dynamic energy while enforcing a higher workload.

#### *9.5.6 Lower Latency Bound Enforcement and Range Extenders*

In certain cases, a latency requirement may introduce—in addition to an upper bound *UBL*—also a lower bound, *LBL*, thus, demanding the enforcement of a latency corridor. Such a lower latency bound could be enforced by means of, e.g., a simple timer (counter) that measures the time elapsed from the beginning of the current execution of the actor(s) under enforcement. The transmission of the produced result(s) to the next actor(s) could then be simply delayed to the time the timer indicates that the time interval of *LB<sup>L</sup>* has passed, see Fig. 9.14. More

**Fig. 9.14** Examples of range extenders and enforcement of lower latency bounds *LBL* and thus latency corridors

difficult and also diverse in the space of possible solutions, however, is the question of how to deal with non-enforceable inputs. In case of our running distributed object detection application, our test image sequences on purpose contained images with more features *i* than for which the given latency upper bound can be enforced with only *n* = 4 cores in highest power mode *m* = 20. In case of strict enforcement (*s* = 1) corresponding to hard real-time requirements, not even a single violation of a latency upper bound is tolerable. Hence, there must be techniques to avoid such violations per construction, if a non-enforceable input is observed. This is a matter of current research. We therefore briefly outline a few techniques how to deal with these cases: input omission (dropping), approximate computing to trade off processing speed with result accuracy (if applicable), revision of scheduling decisions, over-allocation of resources, or a dynamic reconfiguration between different mappings at run-time (change of operating point [14]), see also Fig. 9.14.

#### **9.6 Conclusions**

In this paper, we presented a formalization, classification, and the practice of a class of run-time techniques subsumed under the term of Run-Time Requirement Enforcement (RRE) that make the system management software of an MPSoC platform become the advocate of a parallel application program instead of both acting independently with the goal to provide means for the satisfaction of given non-functional requirements of parallel program execution such as performance (latency, throughput), power or energy consumption, or reliability. The nonfunctional requirements can thereby be expressed by interval ranges and specified over the application program as a whole, e.g., when specified by an actor graph. Alternatively, requirements can be specified for individual actors/tasks or threads, or even segments thereof. The goal of RRE is to enforce the satisfaction of these requirements at run-time. It has been shown by introductory examples on latency enforcement of a distributed object detection application that enforcers may be generated through profiling and the creation of high priority system-level threads called *e-lets* that are formally described in behavior by an enforcement automaton each. These e-lets proactively control the system resources claimed by an application program in view of observed workload variation. First, based on the assignment of exclusive resources to periodic workload such as streaming applications, composability is created that is necessary to allow for a static and independent analysis of each application running on a given MPSoC platform. This enables us to statically analyze non-functional properties of applications or parts thereof and define RRE techniques to control requirements dynamically. For a distributed object detection application as an example, it has been shown that the variability of non-functional execution properties can be greatly reduced in dependence of the level of strictness that shall be fulfilled for each requirement. Moreover, it has been shown that RRE techniques can be either implemented in a centralized or distributed manner. In the future, we want to look at how to decompose requirement corridors for distributed enforcement and study the control overheads of centralized enforcement. Finally, techniques for simultaneous enforcement of multiple non-functional requirements need to be investigated, as here, not only the input (workload) variation as considered in this seminal paper, but also the shared system state must be taken into account once multiple RREs are at work.

**Acknowledgments** This paper is dedicated to Peter Marwedel on behalf of his 70th birthday in recognition of his lifetime achievements in the area of design automation for embedded systems. Special thanks also to Zhai Ming for several experiments conducted for this paper. Finally, we would like to acknowledge the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project Number 146371743—TRR 89 Invasive Computing that funded our work.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 10 Compilation for Real-Time Systems a Decade After PREDATOR**

**Heiko Falk, Shashank Jadhav, Arno Luppold, Kateryna Muts, Dominic Oehlert, Nina Piontek, and Mikko Roth**

#### **10.1 Introduction**

PREDATOR was a collaborative research project running from February 2008 until January 2011 that was funded by the European 7th Framework Programme under the lead of Reinhard Wilhelm, Saarland University, Germany. It was concerned *"with embedded systems that are characterized by efficiency requirements on the one hand and worst-case constraints on the other. [...] Embedded systems with critical constraints need off-line guarantees for the satisfaction of these constraints. Unfortunately, it can be observed that in computer system design, the gap between average-case and worst-case behavior increases rapidly. This entails a decreasing precision of performance analysis results."* Therefore, PREDATOR proposed *"a new research and design discipline that looks at predictability and efficiency in a synergistic manner and that involves all levels of abstraction and implementation in embedded system design"* [6, 34].

These different abstraction levels were reflected by the project's scientific work packages. WP1 (led by Luca Benini, University of Bologna, Italy) dealt with predictable and efficient hardware architectures. Both functional and power models of a predictable architecture were developed and their sensitivity to architectural parameters that influence predictability and their costs were analyzed. WP2 (lead: Peter Marwedel, University of Dortmund, Germany) focused on compiler and code generation techniques for a single application task. Here, optimizations that are aware of hard real-time constraints and of Worst-Case Execution Times (WCET) were proposed; multi-objective trade-offs between real-time guarantees and energy consumption or code size were envisioned. WP3 was led by Giorgio Buttazzo

H. Falk (-) · S. Jadhav · A. Luppold · K. Muts · D. Oehlert · N. Piontek · M. Roth Institute of Embedded Systems, Hamburg University of Technology (TUHH), Hamburg, Germany e-mail: Heiko.Falk@tuhh.de

(Scuola Superiore Sant'Anna, Pisa, Italy) and targeted the coordination of multiple application tasks. Off-line and online coordination techniques were investigated such that guarantees on tasks' response times were derived under simultaneous optimization of resource usage. In the context of WP4 (lead: Lothar Thiele, ETH Zürich, Switzerland), distributed embedded systems were considered and the modular analysis of Multi-Processor Systems on Chip (MPSoC) with respect to performance and predictability was investigated. Finally, cross-layer aspects of the design and analysis of predictable and efficiency were considered in WP5 (lead: Reinhard Wilhelm).

Overall, PREDATOR was a high-quality collaborative effort that produced many seminal results in the field of designing predictable and efficient hardware and software architectures. On the occasion of Peter Marwedel's 70th anniversary, this article surveys the results in the area of compilers for real-time systems that have been achieved under his leadership within PREDATOR. The foundational character of this project is highlighted by providing an overview over code optimizations and analyses proposed in the past decade since PREDATOR was executed. These recent works directly base on challenges identified during and on results produced by PREDATOR.

Section 10.2 puts the state-of-the-art in compilation for real-time systems by the end of PREDATOR in a nutshell. Recent developments that integrate task coordination into compiler optimizations are described in Sect. 10.3. The combination of system-level analysis and code generation techniques for parallel multi-core systems is the subject of Sect. 10.4. Section 10.5 discusses multi-objective compiler optimizations that are able to adhere to real-time constraints, and Sect. 10.6 concludes this article and provides an outlook over future work.

#### **10.2 Challenges and State-of-the-Art in WCET-Aware Compilation During PREDATOR**

A program's WCET stands for its maximal possible execution time, irrespective of possible input data and of initial states of the hardware architecture. For the design of hard real-time systems, the WCET is a critical design parameter, since it allows to reason about whether a program always meets its deadline or not. However, the exact computation of a program's WCET is infeasible in general so that conservative WCET estimates are used instead. In the domain of hard real-time systems, such WCET estimates are usually produced by static timing analysis tools, e.g., aiT [1]. During PREDATOR's single-task activities carried out within Work Package WP2, such a timing analyzer was tightly integrated into a compiler framework. This allowed the compiler to perform WCET analyses in a fully automated fashion during code generation. The WCET data gathered this way constitutes a precise worst-case timing model inside the compiler which contrasts sharply with standard compilers that focus on average-case scenarios and that do not feature any timing models at all. The resulting WCET-aware C Compiler WCC [8] finally exploits this precise timing model in dedicated WCET-aware, single-task code optimizations.

However, modern real-time systems do not consist of only a single task—they are multi-task systems instead where tasks are preempted and scheduled according to an operating system's scheduling policy. Thus, the design of a timing predictable multitask system includes the consideration of all tasks' end-to-end latencies including blocking times due to preemptions, i.e., the tasks' Worst-Case Response Times (WCRT). Based on the tasks' WCRTs, a subsequent schedulability analysis can be used to determine whether all tasks definitely meet their respective deadlines. Since WCETs are characterized by the behavior of machine code for a given processor architecture, and since WCRTs and schedulability analyses rely on given WCET values and mostly depend on task-level scheduling properties, there is a natural link between compilers and operating systems: the former generate the machine code that the latter have to schedule. This link was already identified during PREDATOR:

#### **Challenge #1**

"The compiler [*...*] will apply optimizations not for each individual task in isolation, but will consider all tasks of the entire system in a holistic view. Furthermore, it is planned to take the individual scheduling policies [*...*] into account" [5].

Plazar et al. proposed a software-based cache partitioning for real-time multitask systems [29]. Cache partitioning is able to make the behavior of instruction caches more predictable, since each task of a system is assigned to a unique cache partition. The tasks in such a system can only evict cache lines residing in the partition they are assigned to. As a consequence, multiple tasks do not interfere with each other any longer w.r.t. the cache during context switches. This allows to apply static WCET analysis for each individual task of the system in isolation. The overall WCET of a multi-task system using partitioned caches is then composed of the WCETs of the single tasks given a certain partition size, plus the overhead required for scheduling and context switching. Until the completion of PREDATOR, an integration of schedulability analyses and a consideration of individual scheduling policies during compilation could not be realized due to a shortage of time.

In the context of performance analysis for massively parallel multi-core architectures, PREDATOR proposed a modular approach where a WCET analysis is performed for each application per individual processor core in isolation. By exploiting how often each core accesses the shared bus that connects all cores in a given MPSoC architecture, the additional timing interference that each processor core exhibits due to temporarily blocked bus accesses is estimated. According to PREDATOR's design rules for predictable architectures [38], TDMA-arbitrated shared buses were considered during modular performance analysis. In the end, upper timing bounds of all applications running on all processor cores are derived in a modular fashion which allows to reason about schedulability for such parallel multi-core systems [30].

Various execution models for the applications running on such an MPSoC architecture were considered. In the so-called Dedicated Access Model, applications are structured into three distinct phases: acquisition, execution, and replication. Only during the first and the latter, a task is allowed to access the shared bus in order to fetch input data or to write back computed results, resp. Since the main execution phase of a task is free of shared bus accesses, it cannot suffer from delays induced by other cores which allows for a very precise timing analysis. In the General Access Model, accesses to the shared bus can happen anytime during acquisition, replication, and execution. Thus, a timing analysis becomes more pessimistic here [31].

As the precision of timing analysis for MPSoCs thus strongly depends on the execution behavior of tasks, mechanisms enforcing well-suited and predictable access patterns to shared buses would be advantageous.

#### **Challenge #2**

"One new possibility to reduce the effect of (timing) interactions [*...*] is the use of traffic shapers. It is an open problem to include these units into a system-wide performance analysis that considers computation and communication resources" [5].

However, PREDATOR did not come up with approaches addressing this challenge. PREDATOR explicitly considered the trade-off between predictability where hard constraints on a system's resource usage must be met versus the efficiency of a system in the average case.

#### **Challenge #3**

"We will develop models capturing various optimization objectives within the compiler, e.g. code size or energy dissipation [*...*]. Novel optimization strategies are designed in order to minimize an objective other than WCET, under simultaneous adherence to real-time constraints" [5].

Since the WCC compiler featured a detailed WCET timing model right from the project start, and since modeling code size at the assembly code level is trivial from a compiler's point of view, it was obvious to consider trade-offs between these two objectives in the beginning. For this purpose, simple heuristics for the optimization Procedure Cloning were proposed where WCETs were minimized as long as the resulting code sizes did not exceed a user-provided threshold [19]. Later, WCC was coupled with an instruction set simulator allowing to perform dynamic profiling during compilation. Furthermore, data from an instruction-level energy model [32] was also integrated. This way, the compiler was able to simultaneously model WCET, code size, ACET, and energy consumption of generated machine code.

These models were used to determine Pareto-optimal sequences of compiler optimizations. It is a well-known problem that the order in which a compiler applies its optimizations can have a significant impact on the quality of the finally generated code. In the context of PREDATOR, a stochastic evolutionary multiobjective algorithm [20, 21] found optimization sequences that trade pairs of objectives, i.e., WCET, ACET and WCET, code size, resp. True multi-objective code optimizations that inherently model and consider different criteria at the same time during code generation have, however, not been investigated in depth during PREDATOR.

#### **10.3 Integration of Task Coordination into WCET-Aware Compilation**

Many architectures are equipped with fully software-controllable secondary memories. These are memories that are tightly integrated with the processor to achieve the best possible performance. These Scratchpad Memories (SPMs) can be accessed directly and are therefore in general well-suited for optimizations regarding energy consumption and execution times.

SPMs turned out to be ideal for WCET-centric optimizations, since their timing is fully predictable. The WCC compiler exploits SPMs for WCET minimization by placing assorted parts of a program into a scratchpad memory. During PREDATOR, an Integer-Linear Program (ILP) originally proposed by Suhendra et al. [33] was extended towards a single-task SPM allocation where binary decision variables *xi* are used per basic block *bi*. *bi* is moved from main memory onto the scratchpad memory if *xi* equals 1. The overall goal of this ILP is to find an assignment of values to the variables *xi* such that the resulting SPM allocation leads to the minimal WCET of the whole task. Constraints are added to the ILP that model the task's internal program structure. For each basic block *bi* and each successor *bsucc* of *bi* in the task's Control Flow Graph (CFG), a constraint is set up bounding the WCET *ci* of *bi*:

$$\mathbf{c}\_{l} \ge \mathbf{c}\_{\text{succ}} + \text{cost}\_{l, \text{main\\_mem}} - \text{gain}\_{l} \ast \mathbf{x}\_{l} \tag{10.1}$$

This constraint states that the WCET *ci* of a path starting in *bi* must be larger than the WCET *csucc* of any of the successors of *bi*, plus the contribution of *bi* to the WCET itself with *bi* located in main memory (*costi,main*\_*mem*), minus the potential gain when moving *bi* from main memory onto the scratchpad memory. Additional constraints in the ILP model loops and function calls. The limited available capacity of the SPM is considered as well as the additional overhead due to long-distance jumps from the main memory to the SPM or back. In the end, the WCET of an entire task is represented in the ILP model by a variable *c entry* main which models the WCET of the path starting at the task's entry point, i.e., at its main function [7].

This basic ILP model turned out to be very powerful and flexible so that it served as the basis for the optimization of multi-task systems. For this purpose, all tasks of a multi-task application were modeled in the ILP as described above. As a consequence, the ILP variables *cj* associated with the entry points of the tasks *τj* describe a safe upper bound of the tasks' WCETs. An early work [22] towards PREDATOR's Challenge #1 on optimization of multi-task systems under consideration of scheduling policies integrated Joseph's schedulability analysis [15] into this multi-task ILP.

For priority-based scheduling, a task *τj* 's WCRT *rj* is the maximum possible time interval between the activation of a task and its end, including penalties due to preemptions by higher-priority tasks. The tasks' WCRTs are computed as follows:

$$r\_j = c\_j + \sum\_{h=0}^{j-1} \left\lceil \frac{r\_j}{T\_h} \right\rceil \* c\_h \tag{10.2}$$

Eq. (10.2) accumulates the net WCET *cj* of task *τj* and the penalties due to tasks *τ*0*,...,τj*−<sup>1</sup> having higher priority than *τj* . Each such high-priority task *τh* preempts *τj* a total of *rj Th* times where *Th* denotes a task's period. For each preemption of *τj* by *τh*, the higher-priority task's WCET *ch* is considered.

However, it is not straightforward to integrate Eq. (10.2) into an optimization's ILP, since both the WCETs *ch* and the WCRTs *rj* are ILP variables so that the multiplication of *rj Th* by *ch* is infeasible. In order to solve this problem, an integer variable *pj,h* is added to the ILP for every combination of low- and high-priority tasks *τj* and *τh*, resp. *pj,h* denotes the timing penalty that is added to *τj* 's WCRT due to preemptions by *τh*. Using these variables, Eq. (10.2) can be rewritten to:

$$r\_j = c\_j + \sum\_{h=0}^{j-1} p\_{j,h} \tag{10.3}$$

In order to model *pj,h*, the following linearization scheme is applied: If *rj* is lower than or equal to *τh*'s period *Th*, *τj* can be preempted at most once by *τh*, thus leading to *pj,h* = 1 ∗ *ch*. If *rj* is greater than *Th* but lower than or equal to 2 ∗ *Th*, *pj,h* = 2 ∗ *ch* results, etc. In general, it has been proven that

**Theorem 10.1** *If τj is preempted at least N times by τh, then pj,h* ≥ *(N* + 1*)* ∗ *ch must hold.*

Such so-called conditional constraints can efficiently be translated into ILP in Eq. [25]. A natural upper bound for the number *N* of preemptions of *τj* by *τh* is *Dj Th* where *Dj* denotes task *τj* 's deadline. Thus, the conditional constraints from Theorem 10.1 are added to the ILP for all values of *N* with 0 ≤ *N* ≤ *Dj Th* − 1 and for all pairs of low- and high-priority tasks *τj* and *τh*, resp. Finally, the schedulability of the entire multi-task set is ensured during this ILP-based optimization by adding constraints

$$r\_j \le D\_j \tag{10.4}$$

such that the WCRT of each task *τj* must be within its respective deadline.

While this work is a first step towards schedulability-aware compiler optimization, it suffers from a couple of limitations: First, the task model only supports fixed-priority scheduling and periodic tasks. Second, preemption costs due to the execution of an actual scheduler and context switching overheads are not considered. Finally, the number of constraints of the ILP proposed in [22] grows quadratically with the size of the task set, and it depends on the actual values for tasks' deadlines and periods.

The consideration of Liu and Layland's schedulability test [18] helped to overcome the limitation to fixed priorities:

$$
\mu = \sum\_{j} \frac{c\_j}{T\_j} \le 1 \tag{10.5}
$$

Eq. (10.5) states that a system that is scheduled with dynamic priorities using Earliest Deadline First (EDF) is schedulable if and only if the system load *u* is less than or equal to 1. Due to the already linear nature of Eq. (10.5), it is easy to integrate this schedulability test into an ILP [22].

The relaxation of strictly periodic task sets required to use an event-based task model supporting arbitrary task activation patterns and deadlines [24]. For this purpose, the ILP described above has been extended by support for density and interval functions *η* and , resp., as originally proposed by Gresser [10] and later taken up by Albers et al. [2]. In this approach, an arbitrary kind of task activation pattern can be characterized by the density function *η* that denotes the maximum number of events (i.e., task activations) in some time interval *t*. The interval function models the inverse behavior and returns the minimal time interval *t* in which *n* tasks are activated. This task model provides a high flexibility so that periodical multi-task systems, periodical systems with jitter or bursts, or systems with fully arbitrary task activations can be modeled in the optimization's ILP.

The consideration of an actual scheduler's overhead for context switching can be added to the ILP-based framework described above by introducing an implicit task *τ*<sup>0</sup> with the highest priority into the multi-task system. *τ*<sup>0</sup> represents the periodically executed scheduler, and by considering an actual scheduler's WCET *c*<sup>0</sup> and its period *T*0, it can smoothly be integrated into the optimization framework [23].

As an alternative to Joseph's schedulability test (cf. Eq. (10.2)), Baruah proposed the so-called processor demand test [3]. It states that a multi-task system is schedulable if and only if the amount of *required* computation time is less than or equal to the amount of *available* computation time:

$$
\Delta t \ge \sum\_{\forall \mathbf{r}\_j} \left[ \eta\_j \left( \Delta t - D\_j \right) \* \left( c\_j + o\_j \right) \right] \tag{10.6}
$$

According to the event-based task model described above, *t* denotes one time interval to be analyzed. *ηj (t* − *Dj )* returns the number of activations of task *τj* that happen within *t* and that must be finished before the deadline *Dj* . Each task activation is multiplied by the task's respective maximum computational demand, i.e., its WCET plus additional preemption overheads *oj* .

Since Eq. (10.6) is linear, it can directly be added to our multi-tasking ILP model for each task *τj* . This schedulability test has to be modeled for all possible time intervals *t*. The maximal interval to be considered is regularly given by the task set's hyperperiod. Checking all possible intervals *t* up to the hyperperiod is practically infeasible. Fortunately, task preemptions can only occur if a new task is ready for execution for many real-life scheduling policies like, e.g., EDF. Thus, the schedulability test from Eq. (10.6) has to be modeled in the ILP only at the points of discontinuity of the task set's density functions *η*. Finally, one constraint needs to be added that ensures that the system's overall load due to periodically repeating task activations stays below 100%. It is also possible to extend this approach towards fixed-priority scheduling, and the resulting ILP model grows only linearly with the number of events that have to be analyzed, in contrast to the quadratic nature inherent to [22, 24].

Figure 10.1 shows the effect of our ILP-based multi-task SPM allocation on schedulability of task sets featuring 8 tasks. We randomly selected 20 different task sets from TACLeBench [9]. Task periods were also randomly determined using UUniFast [4] and adjusted [39]. For each task set, periods were assembled such that the entire system has an approximate initial load of 0.8, 1.0, . . . , 2.2 i.e., 8 different system loads are evaluated per task set. Task deadlines were chosen uniform randomly between 0.8 and 1.2 times the task's period. Furthermore, a jitter of up to 1% of each task's period was chosen uniform randomly. Our evaluation considered an ARM-based architecture with access latencies for main memory and SPM of 6 and 1 clock cycles, resp. The scratchpad size was set to 40% of each task set's total size.

**Fig. 10.1** Evaluation of schedulability-aware SPM allocation for 8 tasks

Figure 10.1 shows the schedulability of the task sets for the given initial system loads, using Deadline-Monotonic Scheduling (DMS) and Earliest Deadline First (EDF). The green and orange bars show the percentage of schedulable systems without any optimization applied, while the purple and yellow bars represent the schedulability after our ILP-based multi-task SPM allocation.

For an initial system load of 0.8, all task sets are schedulable, irrespective of the considered scheduling policy or whether the SPM allocation was applied or not. This is not surprising, since the considered systems feature sufficient idle times so that valid schedules are always found. The situation changes when considering higher initial system loads that range from 1.0 up to 2.2. In these scenarios, no task set was schedulable in an unoptimized state where the scratchpad memories were not used at all. However, our multi-task optimization is able to turn the vast majority of initially unschedulable task sets schedulable. For DMS scheduling, our ILP-based optimization achieves rates of schedulable task sets ranging from 100% (initial system loads of 1.0 and 1.2) to still 75% for an initial system load of 2.2. For EDF scheduling, the percentages of finally schedulable task sets are slightly smaller they range from 95% (initial system load of 1.0) to 75% again. The time required to solve our ILPs is moderate. The whole compilation, analysis, and optimization process using a modern ILP solver like, e.g., gurobi required less than 6 CPU minutes on average over all considered task sets.

#### **10.4 Analysis and Optimization of Multi-Processor Systems on Chip**

To address PREDATOR Challenge #2 on analyzing and shaping the communication traffic for MPSoC architectures, it is important to understand when events happen in a multi-core architecture which potentially influences the cores' timing behavior. For this purpose, modular performance analyses use so-called request functions *α* which are very similar to the density function *η* from Sect. 10.3. In the context of MPSoCs, however, such functions characterize how often a processor core requests the shared bus of a multi-core architecture within a certain interval of time. Usually, such functions are provided at a very abstract level assuming execution models consisting of, e.g., the aforementioned acquisition, execution, and replication phases. For a precise analysis when each core attempts to access a shared hardware resource, it is, therefore, beneficial to extract request functions at the machine code level [14, 27].

For a precise and tight MPSoC performance analysis, both lower and upper bounds of resource requests are generated. Positions within the machine code executed on the different cores are identified where timing-relevant requests are generated, i.e., where shared hardware resources are accessed. Based on the code's Control Flow Graph (CFG), all possible sub-paths inside the code that feature these identified positions have to be considered. For this purpose, the well-known Implicit

**Fig. 10.2** Extracted request functions for selected benchmarks. (**a**) Compressdata. (**b**) binarysearch

Path Enumeration Technique (IPET) [17] has been modified to find the maximum number of requests potentially occurring in a given time interval along any path of a program. An algorithm has been proposed [27] that provides bounds on the number of requests for time intervals *t* of a program's runtime under consideration of all possible paths inside the CFG. This algorithm can be parameterized to trade precision of the generated request functions versus required execution time by varying the number of sampling points, i.e., the granularity of time intervals *t* considered by the algorithm.

Examples of lower (*α*−) and upper (*α*+) request functions generated for two selected benchmarks compressdata and binarysearch are shown in Fig. 10.2. The vertical distance between the lower and upper functions shows the variation of the number of produced requests. For example, compressdata can terminate with solely 82 shared bus accesses in total, or with up to 131. For binarysearch, both the lower and upper request functions converge to a common value, since each possible path through the program's code covers an identical number of bus requests. Only the points in time when these events occur differ.

Figure 10.3 shows the influence of the number of considered sampling points on the precision of the upper request function *α*+ of compressdata. The finestpossible granularity, i.e., *t* = 1 clock cycle, leads to 131 samples in total and to a very smooth and precise result. When reducing the granularity such that only 50 samples are considered, the resulting request function has a clearly visible stepwise shape. However, the resulting function for 50 samples always dominates the most precise function so that no unsafe results are produced. For the highest precision with 131 samples, our algorithm requires 48 CPU seconds. In contrast, the time required to generate the request function for compressdata decreases down to 10 CPU seconds if 50 sampling points are considered.

**Fig. 10.3** Request functions for compressdata with different precision levels

**Fig. 10.4** Request functions *α* and delivery functions *β*

While request functions *α* denote the resource demand of a task w.r.t. shared bus accesses, so-called delivery functions *β* model the available capacity of a shared hardware resource during modular performance analysis [12, 35]. The relationship between both types of functions is illustrated in Fig. 10.4. The maximal horizontal distance between *α*<sup>+</sup> and *β* represents the maximum delay *d*max a task exhibits due to blocked shared bus requests. In the figure, a task requests 2 bus accesses during interval lengths of 3 clock cycles. However, the bus can deliver the desired capacity only within 13 clock cycles. Thus, a blocking time of 10 clock cycles results from Fig. 10.4.

If a compiler could modify the generated code such that a task's request function is shifted towards the rightmost end of Fig. 10.4, its blocking time gets reduced which in turn probably decreases WCRTs and improves schedulability for the entire MPSoC system. This approach was investigated by a Master's Thesis [28] where instruction scheduling was exploited. Locally within basic blocks, those instructions requesting shared bus accesses were postponed by scheduling independent instructions in front of them. If this succeeds for all program paths of a given length *t* (e.g., for *t* = 3 in Fig. 10.4), then the request functions are actually shifted as intended. This work revealed that compilers can be enabled to systematically reduce blocking times this way. For MPSoC task sets generated from the MRTC [11] and UTDSP [37] benchmark collections, blocking time reductions of up to 22.5% were reported. A solely local rescheduling of instructions, however, suffers from the inherent limitation that there is not too much potential for postponing shared bus accesses within a single basic block. Thus, maximal WCRT reductions of only up to 7.3% were achieved.

This basic idea to reshape bus requests at the code level is also pursued in currently ongoing work. By transforming the behavior of a task, its request function is modified such that its traffic will match a required profile. This is done by inserting additional machine instructions into the code, i.e., NOPs. Therefore, this approach does not rely on specific hardware or on operating systems that realize traffic shaping. Instead, the notion of code-inherent traffic shaping is introduced. If the places where to insert such additional instructions in a task's CFG are carefully chosen, parts of its request function that do not fit to a given access profile can be shaped systematically, even without necessarily increasing the task's WCET. For this purpose, two shaping algorithms using a greedy heuristic and an evolutionary algorithm have been designed which support various kinds of Leaky Bucket shapers [36].

The effectiveness of code-inherent shaping is depicted in Fig. 10.5 by means of MRTC's select benchmark. Based on a Leaky Bucket that generates a stepwise shaping profile, a delivery function *β* is assumed such that only half of the requests originally issued by the task within 1000 clock cycles can be fulfilled. It can be seen that the systematic insertion of a total of 408 NOP instructions results in a shaped request function that always stays below this delivery function. For this particular select task, its WCET increases from originally 36,019 clock cycles up to 50,317 clock cycles. While this WCET increase by 40% seems disadvantageous at a first glance, it is absolutely acceptable if the task still meets its deadline and if the shaped request function enables schedulability of the entire MPSoC task set.

**Fig. 10.5** Traffic shaping of select with *β(t)* being 50% of *α(t)* for *t* = 1000

#### **10.5 Multi-Objective Compiler Optimizations Under Real-Time Constraints**

The simultaneous consideration of multiple optimization objectives by a compiler according to PREDATOR Challenge #3 can, to some extent, already be achieved using ILP-based techniques, even though ILPs only allow for one objective function to be maximized or minimized. PREDATOR's distinction between efficiency requirements on the one hand and worst-case constraints on the other hand naturally suggests to model critical constraints that must always be fulfilled as inequations in an ILP. Efficiency requirements are then modeled by an ILP's objective function and get optimized in addition to the satisfaction of critical constraints. This way, it is rather straightforward to turn the multi-task scratchpad memory allocation described in Sect. 10.3 into a multi-objective WCET-, schedulability- and energyaware optimization.

The schedulability tests from Eq. (10.4) or (10.6) are mandatory constraints in the SPM allocation's ILP model. Using an energy model like, e.g., [32], the energy consumption *ei* of each basic block *bi* can be characterized in dependence of the ILP's binary decision variables *xi*. By combining these block-level energy values with profiling-based information about the blocks' execution frequencies, the overall energy consumption *ej* of task *τj* can be modeled. Multiplying these task-level energy values with the tasks' activation functions *ηj* (cf. Sect. 10.3) over the entire task set's hyperperiod *H* yields an expression that models the energy dissipation of the complete multi-task system and that thus can be minimized under simultaneous adherence to the ILP's schedulability constraints:

$$\min \sum\_{j} \eta\_{j} \left( H \right) \* e\_{j} \tag{10.7}$$

Evaluation results for randomly generated sets of 6 tasks are depicted in Fig. 10.6, the experimental setup is the same as described in Sect. 10.3. Figure 10.6a shows the task sets' schedulability for their respective initial system loads, again using DMS and EDF scheduling. As can be seen, the multi-objective ILP is able to turn more than 95% of all task sets schedulable for initial system loads of up to 1.6. For higher initial loads, schedulability was still achieved for more than 70% of all task sets.

Simultaneously, considerable energy reductions compared to systems that do not use the SPM were achieved, cf. Fig. 10.6b. For initial system loads of up to 1.8, the task sets' energy dissipation was reduced down to less than 70%. For higher initial system loads, the resulting energy consumption still ranges from 71 to 77%.

Another common additional optimization goal is to meet code size requirements. Code compression might be used to meet code size constraints in embedded systems. However, the performance overhead of such techniques might be critical for real-time systems that must adhere to strict timing constraints. In the context of PREDATOR Challenge #3, we thus recently considered compiler-based code compression for hard real-time systems for the very first time [26]. This approach

**Fig. 10.6** Evaluation of multi-objective schedulability- and energy-aware SPM allocation for 6 tasks. (**a**) Schedulability. (**b**) Energy consumption

exploits lossless asymmetric compression algorithms [13] where a computationally demanding and highly effective code compression is performed at compile time, while the decompression is computationally lightweight so that it is feasible to perform it at runtime.

In the proposed approach, complete binary executable functions are selected and compressed by the WCC compiler and the resulting bit stream is added to the executable code produced by the compiler. Furthermore, the executable is extended by specifically tailored code for the decompression of the selected functions. Upon execution of a program optimized this way, all compressed functions are decompressed in one go during the program's start. For this purpose, a processor's scratchpad memory is used as a buffer that finally holds all decompressed functions. These functions are then directly executed from the SPM.

This approach trades code size reductions due to the selection of functions to be compressed with the decompression overheads in terms of WCET which should be as small as possible. For this purpose, an ILP is proposed whose binary decision variables *xi* encode whether function *fi* is compressed or not.

For each function *fi* that might be compressed, its original, uncompressed code size *S*orig *<sup>i</sup>* and its Worst-Case Execution Time *<sup>C</sup>*orig *<sup>i</sup>* are pre-computed. Assuming that *fi* would be compressed, the corresponding values *<sup>S</sup>*comp *<sup>i</sup>* and *<sup>C</sup>*comp *<sup>i</sup>* can also be pre-determined. For the WCET analysis of a potentially compressed function *fi*, the decompression routine is added by the compiler, and the loops therein are precisely annotated with upper iteration bounds for the decompression of the currently considered function *fi* in order to support the WCET analyzer aiT. Based on this data, the impact of *fi*'s compression on the entire program's code size *Si* and Worst-Case Execution Time *Ci* can be expressed in the ILP.

ILP constraints ensure that the decompressed functions fit in the available SPM, that the entire program never gets larger due to the inserted decompression routine, and that the WCET increases of all functions always stay below a user-provided

**Fig. 10.7** Evaluation of compiler-based WCET-aware code compression for MediaBench

threshold *C*limit. Under these constraints, the ILP finally minimizes the entire program's code size by selecting appropriate functions *fi* for compression.

For six large-sized benchmarks from MediaBench [16], the effects of the proposed compiler-based code compression for an Infineon TriCore architecture are depicted in Fig. 10.7. For each considered benchmark, the diagram shows the resulting relative WCETs and code sizes, as well as the code size of the decompression routine added by the compiler. The 100% baseline of Fig. 10.7 denotes the WCETs and code sizes of the original, unoptimized benchmarks, resp. For the ILP-based selection of functions to be compressed, the threshold *C*limit was set to 0.5 so that maximum WCET increases by 50% were still accepted by the optimization.

As can be seen from Fig. 10.7, the finally obtained WCET increases are way below this user-provided upper bound. For epic and mpeg2, the WCETs degrade only marginally by 0.6% and 0.5%, resp. The WCETs of the other benchmarks increase between 3.5% and 14.1% only. In contrast to this, our approach achieves rather large code size reductions. After the optimization of gsm\_dec, its executable occupies only 73% of its original memory space. For all other benchmarks, an even higher degree of compression was achieved that reduces code sizes by more than a half. This way, the code size of cjpeg\_transupp was reduced to 42% of its original size, and a maximal reduction down to only 13% of the original code size was achieved for mpeg2. Finally, Fig. 10.7 shows that adding extra code to the generated binaries for the decompression routine is worthwhile, since this overhead is over-compensated by the achieved overall code size reductions. As can be seen, the code size overhead due to the decompressor varies between 2% (gsm, gsm\_enc and mpeg2) up to 15% (cjpeg\_transupp) only, compared to the benchmarks' original code size.

#### **10.6 Conclusions**

This article presented a survey of work done in the field of compiler techniques for real-time systems in the authors' group during the past 10 years. Origin of all these activities was the collaborative research project PREDATOR funded by the European 7th Framework Programme. During this project, seminal work was carried out in order to design predictable yet efficient embedded systems. A couple of scientific challenges has been identified that have initially been considered during PREDATOR and that, due to their complexity, required continuous research effort over many years even after the end of this collaborative research project. This article summarized these compiler-centric activities and their corresponding scientific challenges:


Despite the advances in the field of compilation for real-time systems achieved in the past years, we expect that a continuation of this effort is necessary in the future. This is motivated by the trend towards massively parallel embedded real-time systems on the one hand, which still requires dedicated analyses and optimizations that are capable to support current and future many-core architectures. On the other hand, the simultaneous trade-off of various optimization objectives and the corresponding systematic exploration of the design space is still an unsolved problem for optimizing compilers. Last but not least, another important driver for future research is the increasing complexity of the involved system- and code-level analyses and optimizations which needs to be managed to obtain automated design tools that are usable in practice even for highly sophisticated and massively parallel systems.

**Acknowledgments** Parts of the work surveyed in this article received funding from Deutsche Forschungsgemeinschaft (DFG) under project No. 200265263 and 380772147. Other parts received funding from the European Union's 7th Framework Programme under grant agreement No. 216008 (PREDATOR) and from the Horizon 2020 research and innovation programme under grant agreement No. 779882 (TEAMPLAY).

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Index**

#### **A**

Abstract interpretation (AI), 9, 11, 12 AI, *see* Abstract interpretation (AI) Application-specific memory controller (ASMC), 25–27 Application-specific run-time techniques, 126, 129 ASMC, *see* Application-specific memory controller (ASMC) Assistive technology collaboration/research (2010–2013) NVDA, 68 students projects, 67–68 TacRead/DotBook, 68–70 consolidation and growth (2016) major recognitions, 75 MAVI, 74 NAVI, 74 outreach through conferences, 74 RAVI, 73 embedded systems (2005–2010) ASSISTECH/COP315, 58–59 OnBoard, 63–67 SmartCane, 59–63 facilitator, 57 technology to users (2013–2016) international collaboration, 71–73 tactile graphics project, 70–71

#### **B**

Bank-row column (BRC), 25 Bias temperature instability (BTI), 109 Boltzmann tyranny, 108 BRC, *see* Bank-row column (BRC)

BTI, *see* Bias temperature instability (BTI) Buffer on Board (BoD), 20, 21

#### **C**

Centralized enforcement dynamic energy power consumption, 143, 145 latency distribution, 144 SD and SM actors, 142 CFG, *see* Control flow graph (CFG) Closed page policy (CPP), 24, 28 Commercial off-the-shelf (COTS), 24 Computational self-awareness, 79–81 Computer architecture, 1, 2 Computer structures, 1, 2 Control flow graph (CFG), 9, 11–13, 155, 159, 160, 162 COTS, *see* Commercial off-the-shelf (COTS) CPP, *see* Closed page policy (CPP) CPS, *see* Cyber-physical systems (CPS) Cyber-physical systems (CPS), 37, 79 biased towards computation, 52 preemptive earliest-deadline-first (EDF), 41–44 preemptive fixed-priority (FP-P) scheduling, 44–45 problem, 39 scheduling algorithm, 40–41 utilization-based analyses, 46–51 Cyber-physical systems-on-chip (CPSoC) battery-powered devices, 80–81 computational self-awareness, 80 DVFS, 80 feedback loop, 81

Cyber-physical systems-on-chip (CPSoC) (*cont.*) HMPs, 80 infrastructure, 81 ODA, 80 resource management (*see* Dynamic resource management, HMPs)

#### **D**

DAL, *see* Design assurance level (DAL) Deadline-Monotonic Scheduling (DMS), 158, 159, 163, 164 Delay lock loop (DLL), 22 Dennard's scaling, 107, 112 DES, *see* Discrete event simulation (DES) Design assurance level (DAL), 6 Device under test (DUT), 101 Discrete event simulation (DES), 97–104 Distributed enforcement energy-minimizing, 140 integration of enforcers, 141 latency constraints, 141 RRE, 142 DLL, *see* Delay lock loop (DLL) DMS, *see* Deadline-Monotonic Scheduling (DMS) DRAM, *see* Dynamic random access memories (DRAMs) Drive current (ON current), 107 DUT, *see* Device under test (DUT) DVFS, *see* Dynamic voltage and frequency scaling (DVFS) Dynamic random access memories (DRAMs), 19 arbitration, 24 auto-refresh mechanism, 26–27 bangwidth, 20 BoD, 21 command/address, 24 CPP *vs.* OPP, 24 current limiting, 24 industrial image processing application, 26 integer linear programming problem, 25 interface frequency, 21 latency, 20, 23 memory subsystems, 20 memory wall, 20 PoD, 22 power consumption, 27–28 properties, 22 refresh, 24

safety/security, 32 scheduling, 24 temperature *vs.* reliability cell capacitor leakage, 29 drain leakage, 29 simulation framework, 31 sub-threshold leakage, 29 Dynamic resource management, HMPs adaptive control methods, 85–86 MIMO, 85 SCT (*see* Supervisory Control Theory (SCT)) SISO, 84–85 Dynamic voltage and frequency scaling (DVFS), 80, 87, 88, 136, 137, 141, 144 Dynamic voltage scaling (DVS) technique, 115, 116, 118–121 design-time models, 116–117 energy saving results, 121 operating voltage, 120 PARSEC *canneal* benchmark, 115, 120 processor power management, 115 runtime models, 117–118

#### **E**

Earliest deadline first (EDF), 8, 41, 157, 159 ECC, *see* Error correcting codes (ECC) EDF, *see* Earliest deadline first (EDF) Electronic system level (ESL), 97 Embedded system design, 2, 151 Embedded systems, 1–3, 6, 19–22, 25, 27, 32, 40, 58–67, 89, 152, 166 Energy-minimized timing enforcement Pareto-optimal configuration, 139 SD actor, 140 thread scheduling or memory management, 139 Enforcement automata (EA) degree of parallelism, 133 distributed (*see* Distributed enforcement) formal verification techniques, 132 lower latency bound enforcement and range extenders, 146–147 Enforcement problem DVFS configurations, 136 energy demand, 137 execution latency, 137 input-dependent variation, 135 overall latency, 136 power consumption, 137 RRE scheme, 138

Error correcting codes (ECC), 30–32 ESL, *see* Electronic system level (ESL) European Aviation Safety Agency (EASA), 15

#### **F**

Four activate window *(tFAW),* 24

#### **G**

Gate induced drain leakage (GIDL), 29 GIDL, *see* Gate induced drain leakage (GIDL)

#### **H**

Hard real-time system, 38, 39, 152, 163 HBM, *see* High Bandwidth Memory (HBM) HCI, *see* Hot-carrier injection (HCI) Heterogeneous many-core processors (HMPs), 80, 84–87 HFSM, *see* Hierarchical Finite State Machine (HFSM) Hierarchical Finite State Machine (HFSM), 87 High Bandwidth Memory (HBM), 21, 22, 27, 30 High-performance computing (HPC), 19 HMC, *see* Hybrid Memory Cube (HMC) HMPs, *see* Heterogeneous many-core processors (HMPs) Hot-carrier injection (HCI), 109 HPC, *see* High-performance computing (HPC) Hybrid Memory Cube (HMC), 20–22, 30

#### **I**

ILP, *see* Integer-Linear Program (ILP) Instruction set architecture (ISA), 7, 15, 80 Integer-Linear Program (ILP), 155–159, 163–165 Integration of task coordination, WCET CFG, 155 conditional constraints, 156 DMS, 159 EDF, 159 ILP, 155, 156 Liu and Layland's schedulability test, 157 schedulability-aware compiler optimization, 157 schedulability-aware SPM allocation, 158 SPMs, 155 Invasive computing, 125, 129, 148 ISA, *see* Instruction set architecture (ISA)

#### **L**

Leakage current (OFF current), 107

#### **M**

Mandelbrot renderer, 101–103 MAVI, *see* Mobility Assistant for Visually Impaired (MAVI) Memory phase awareness, 90 Metal insulator metal (MIM), 29 Microarchitectural analysis abstract hardware, evolution, 13 observed event trace, 14 prediction event graph, 14 prediction graph, 12 semi-automatic derivation, 13 trace validation, 14 Middleware for Adaptive and Reflective Systems (MARS) framework overview, 83 policy manager, 84 resource management policies, 83 sensors and actuators, 83 MIM, *see* Metal insulator metal (MIM) MIMO, *see* Multiple Input Multiple Output (MIMO) Mobile games stress modern SoCs FPS, 87 HiCap, 88 MEMCOP DVFS policies, 88 ML-Gov, 89 Moore's law, 89 prediction models, 88 sensors to capture dynamism, 87–88 SPM, 89 Von Neumann bottleneck, 89 Mobility Assistant for Visually Impaired (MAVI), 74 MPSoC, *see* Multi-Processor Systems on Chip (MPSoC) MPSoC platform, 147 Multiple Input Multiple Output (MIMO), 85, 86, 91 Multiple optimization, compiler code compression, 163, 165 ILP's schedulability constraints, 163 SPM, 164 WCET, 164 Multi-processor systems, analysis and optimization CFG, 159 compressdata, 161 delivery functions, 161 IPET, 160

Multi-processor systems, analysis and optimization (*cont.*) MPSoC architectures, 159 MRTC's select benchmark, 162 Multi-Processor Systems on Chip (MPSoC), 152

#### **N**

Nano-CMOS, 107 National Science Foundation (NSF), 2 NAVI, *see* Navigation Assistant for Visually Impaired (NAVI) Navigation Assistant for Visually Impaired (NAVI), 74 Negative capacitance transistor (NCFET) bodytrack, 113 BTI, 109 canneal, 114 characteristics, 112 cooling requirements, 114 DVS (*see* Dynamic voltage scaling (DVS) technique) ferroelectric layer, 108 HCI, 109 leakage-voltage dependency, 121 operating voltage selection, 118–119 power management techniques, 115 principle, 108 processor-level investigation, 110–111 *Sniper* many-core simulator, 111–112 swaptions, 113 system level, 109–110 technology, 108 Negative drain-induced barrier lowering (negative DIBL), 111 Non-functional property, \*-predictability, 128–129 Non-visual Desk Top access (NVDA), 68 NSF, *see* National Science Foundation (NSF) NVDA, *see* Non-visual Desk Top access (NVDA)

#### **O**

*Observe*, *Decide*, and *Act* (ODA), 80, 82 Open page policy (OPP), 24 Open source project, RISC Docker container, 103 documentation, 103 plug-and-play evaluation, 103 proof-of-concept, 102

OPP, *see* Open page policy (OPP) Out-of-order discrete event simulation (OoO PDES), 100, 104

#### **P**

Package on Package (PoP), 20, 22 Parallel discrete event simulation (DES), 97–104 Peter Marwedel academic self-government, 2 ICD, 3 SFB 876, 3 teaching, 1–2 technology transfer, 3 PoP, *see* Package on Package (PoP) PREDATOR abstraction levels, 151 cache partitioning, 153 collaborative project, 151 General Access Model, 154 individual scheduling policies, 153 MPSoC, 152 multiple optimization (*see* Multiple optimization, compiler) novel optimization strategies, 154 off-line and online coordination techniques, 152 real-time systems, 166 system-and code-level analyses, 166 WCET, 151 WCRT, 153 WP2, 151 Proportional Integral Derivative (PID), 84–86

#### **Q**

QDR, *see* Quad data rate (QDR) QoS, *see* Quality-of-service (QoS) QSK, *see* Qualification support kit (QSK) Quad data rate (QDR), 3, 22 Qualification support kit (QSK), 7, 14 Quality-of-service (QoS), 87, 90

#### **R**

RAVI, *see* Reading Assistant for Visually Impaired (RAVI) RBC, *see* Row-bank-column (RBC) Reading Assistant for Visually Impaired (RAVI), 73

Recoding Infrastructure for SystemC (RISC) analysis and transformation tools, 101 compiler (*see* RISC compiler) EDA community, 104 IEEE SystemC language, 97 Mandelbrot renderer, 101–102 OoO PDES, 104 open source project (*see* Open source project, RISC) simulator, 100 Reflective system models decision-making process, 82 feedback loop, 82 MARS (*see* Middleware for Adaptive and Reflective Systems (MARS)) resource managers, 82 self-awareness, 82 Resource isolation execution state uncertainty, 126 input uncertainty, 126 RGR, *see* Row granular refresh (RGR) RISC compiler conflict analysis, 99–100 SG, 99 source code instrumentation, 100 Row-bank-column (RBC), 25 Row granular refresh (RGR), 26, 27 Run-time requirement enforcement (RRE), 126, 129–133, 136, 138–142, 144, 146, 147 approximate computing, 129 centralized/distributed, 131–132 EA, 132–133 i-lets and e-lets, 134–135, 148 invasive computing, 129 loose/strict, 130

#### **S**

Safety integrity level (SIL), 6 Schedulability analysis, 5, 153, 156 Scratchpad memories (SPMs), 89, 155, 158, 159, 163, 164, 168 Segment graph (SG), 98, 99, 104 Self-aware systems-on-chip automated driving, 92–93 computational self-awareness, 79–80 CPS, 79, 91, 93 CPSoC (*see* Cyber-physical systems-onchip (CPSoC)) IFS, 91 IoT, 91

memory phase, 90 quality-configurable memory, 90–91 reflection (*see* Reflective system models) SO/SA, 91, 92 Self-organizing, self-aware (SO/SA), 91, 92 SIL, *see* Safety integrity level (SIL) Single Input Single Output (SISO) feedback loop, 84 PI/PID, 84 Sobel edge detection (ED), 132, 135, 136 Software-Programmable Memory (SPM) memory resources, 89 MMUs, 89 Soundness, 10 SPMs, *see* Scratchpad memories (SPMs) Staggered powerdown, 28 Supervisory Control Theory (SCT) classic controller, 86 high-level view, 86 MIMO/SISO controllers, 86 QoS, 87 supervisory controller, 86 variable goals and policies, 86 SystemC DES, 97 ESL, 97 PDES, 97

#### **T**

TAT, *see* Trap assisted tunneling (TAT) TCL, *see* Tool confidence level (TCL) *tFAW*, *see* Four activate window *(tFAW)* Through-silicon via (TSV), 22 Timing analysis, 5, 52, 138, 152, 154 Timing verification, 5 TLM, *see* Transaction level models (TLM) Tool confidence level (TCL), 6 Tool operational requirements (TOR), 6, 7 Tool qualification levels (TQL), 6 TOR, *see* Tool operational requirements (TOR) TQL, *see* Tool qualification levels (TQL) Trace validation, 12–15 Transaction level models (TLM), 31 Trap assisted tunneling (TAT), 29 TSV, *see* Through-silicon via (TSV)

#### **V**

Verification test plan (VTP), 6 VTP, *see* Verification test plan (VTP)

#### **W**

WCET, *see* Worst-case execution time (WCET); Worst-Case Execution Times (WCET) WCET analysis, 5–15, 153, 164 aiT tool, 9 predictability, 7 safety properties, 8–10 terminology, 10 timing accident, 8 timing penalty, 8 TQL, 6–7 validation CFG reconstruction, 11

microarchitectural analysis (*see* Microarchitectural analysis) value analysis, 12 WCRT, *see* Worst-case response time (WCRT) WESE, *see* Workshop on Embedded and Cyber-Physical System Education (WESE) Workshop on Embedded and Cyber-Physical System Education (WESE), 2 Worst-case execution times (WCET), 5–15, 25, 38, 40, 50, 51, 151–159, 162–166 Worst-case response time (WCRT), 38, 44,

156, 157, 162