# Xuefeng Zhou · Hongmin Wu · Juan Rojas · Zhihao Xu · Shuai Li

# Nonparametric Bayesian Learning for Collaborative Robot Multimodal Introspection

Nonparametric Bayesian Learning for Collaborative Robot Multimodal Introspection

Xuefeng Zhou • Hongmin Wu • Juan Rojas • Zhihao Xu • Shuai Li

# Nonparametric Bayesian Learning for Collaborative Robot Multimodal Introspection

Xuefeng Zhou Robotic Team Guangdong Institute of Intelligent Manufacturing Guangzhou, Guangdong, China

Juan Rojas School of Electromechanical Engineering Guangdong University of Technology Guangzhou, China

Shuai Li School of Engineering Swansea University Swansea, UK

Hongmin Wu Robotic Team Guangdong Institute of Intelligent Manufacturing Guangzhou, Guangdong, China

Zhihao Xu Robotic Team Guangdong Institute of Intelligent Manufacturing Guangzhou, Guangdong, China

#### ISBN 978-981-15-6262-4 ISBN 978-981-15-6263-1 (eBook) https://doi.org/10.1007/978-981-15-6263-1

© The Editor(s) (if applicable) and The Author(s) 2020. This book is an open access publication. Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

To our ancestors and parents, as always

# Preface

Improving collaborative robot safety by data-driven monitoring and protection is critical for robots to be actively useful in human daily life. While some success has been demonstrated in structured and static environments that robots are slow, clumsy, and not general or robust enough when performing dexterous manipulations. As humans, doubt and understanding our own limitations, failures, and shortcomings is a key for improvement and development. Such knowledge alters our behaviors e.g. to execute tasks in a more cautious way. The main reason is humans have years of experience to collect data and develop good internal models of what happens when interacting with their environment. Equipping robots with a set of skills that allows them to assess the quality of their sensory data, internal models, used methods, etc. is correspondingly believed to greatly improve the overall performance of an autonomous system.

Statistical models have shown powerful performance for signal processing in recent years. Particularly, the nonparametric Bayesian learning methods are widely applied to the study of underlying dynamical phenomena generated in fields as diverse as robotics, bioinformatics, econometrics, and autonomous systems and control. Within these fields, there has been an explosion of multimodal data of increasingly complex phenomena, resulting in a push toward building more intricate multivariate time series models for accessing the quality of those sensory data. In recent robotics, human-robot coexistence and cooperation have become an urgent demand in modern manufacturing and services industries, resulting in the concept of "Human-Robot Collaboration, HRC" and "Physical Human-Robot Interaction, pHRI". Sharing intelligence, having a common behavior, and cooperating with people to accomplish common tasks (behavior, task, and intelligence) are their basic characteristics and elements, and also they have become the consensus of domestic and foreign academic circles. The prerequisite of HC is human-robot peaceful coexistence, that is, "safety collaboration". Therefore, the aim of this book is to introduce the Nonparametric Bayesian learning methods for robot introspection that ensures the robot with longer-term autonomy and safer HRC environment.

In this book, the robot introspection relates to safety that has a direct impact on a variety of research areas, such as HRC, pHRI, long term autonomy, and many others. Long term autonomy can benefit from autonomous anomaly monitoring and diagnosis as well as the anomaly recovery strategies. The ability to reason and solve its own anomalies, and proactively enrich owned knowledge is a direct way to improve autonomous behaviors of a robot. To this end, we start by considering the underlying pattern of multimodal observation during robot manipulation, which can be well modeled as a parametric Hidden Markov Model (HMM). And then, we instead take a nonparametric Bayesian approach in defining a prior through the use of the Hierarchical Dirichlet Process (HDP) on the standard HMM parameters, named Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM). The HDP-HMM can examine an HMM with an unbounded number of possible states and allow for flexibility in the complexity of the learned model and for the development of reliable and scalable variational inference methods.

Therefore, although many stable and robust robot control algorithms exist, internal modeling errors or external disturbances still affect the robotic systems, such as human collision, object sliding, and tool collision. The pHRI scenarios are the unstructured and non-standardized dynamic environments that cannot be completely modeled and analyzed. To endow robots with longer-term autonomy and safer HRC environment, it is necessary to model the real-time multi-modal signals for implementing the accurate behavioral introspection and anomaly recovery policy learning. In this book, the robot multimodal introspection and learning are theoretically investigated, and the main contents and achievements are as follows:

Chapter 1—In this chapter, we concisely introduce the definition of robot introspection that is considered in this book, and explain those interesting topics to be addressed. Moreover, the detailed outline of this book is presented. To make the following contents clear and easy to follow, we take all the research contents and achievements of this book into consideration; a multi-modal fusion-based robot introspective and learning system framework named SPAIR (Sense-Plan-Act-Introspect-Recover) is proposed based on the traditional robot control framework Sense-Plan-Act (SPA), which adds the phases of Robot Introspection (behavior recognition, anomaly monitoring, anomaly diagnosis) and anomaly recovery. The framework intentionally endows the robot with longer-term autonomy and safer human-robot collaborative environment, which mainly includes four functional modules: (1) directed graph representation of complex robot manipulation tasks; (2) robot movement generation and generalization learning; (3) real-time anomaly monitoring and diagnosis during robot execution; (4) robot anomaly recovery policy learning.

Chapter 2—In this chapter, to address the problem of multi-modal fusion, we abstract its research object as to how to effectively establish and interpret the probabilistic model of multivariate time series. The uncertainty of the number of hidden states and the high transition frequency between hidden states are two critical problems of HMM in modeling multivariate time series, which will greatly weaken the modeling performance and time consistency. Based on this, several nonparametric Bayesian models named Sticky Hierarchical Dirichlet Process Hidden Markov Models (sHDP-HMM) are investigated to jointly model the robot end-effector's velocity, force/torque, tactile signals of as well as their related statistical information (such as mean and variance) during robot manipulation task.

Chapter 3—In this chapter, to address the problem of learning and generalizing complex tasks of robots multimodal fusion and system framework construction, a method of combining Dynamic Movement Primitive (DMP) and Finite State Machine (FSM) is proposed in this paper, which divides the robot complex manipulation task into sequential motion primitives for improving the adaptability and diversity of the manipulation tasks, i.e. parameterized directed graph descriptions. The proposed method is based on the basic theory of robot learning from demonstration.

Chapter 4—In this chapter, nonparametric Bayesian models-based methods with the multi-modal fusion for robot real-time motion behavior recognition and anomaly monitoring are proposed. First, the normal multi-modal signals of each movement primitives are modeled using sHDP-HMM with the help of a parametric description of the robot's manipulation task. Then, the robot behavioral recognition is implemented by comparing the cumulative log-likelihood values of real-time observations. Finally, three types of abnormal thresholds for robot abnormal monitoring are proposed when the behavior is known, which include log-likelihood value, log-likelihood gradient value as well as mapping relationship between the latent state and log-likelihood value.

Chapter 5—In this chapter, nonparametric Bayesian models-based multi-objective classifiers for multi-modal anomaly diagnosis are also investigated when the multi-modal anomaly is triggered. Additionally, anomalous samples are extracted before and after the occurrence of anomaly events according to the size of the given window; sHDP-HMM model is learned for each anomaly type individually; the optimal model is selected by cross-validation method; anomaly diagnosis is tackled by comparing the cumulative log-likelihood values of testing samples across all learned models. Besides this, an anomaly diagnosis method by the multimodal sparse representation method of anomaly samples is presented.

Chapter 6—In this chapter, we mainly focus on learning the recovery policy based on anomaly monitoring and diagnosis. Typically, two task-level robot anomaly recovery policies are, respectively, proposed by learning the experience and intention of human treating accidental and persistent anomaly events, which include the re-enactment policy for accidental anomalies by using a multinomial distribution as well as the adaptation policy for persistent anomalies by the parametric representation of human's demonstrations.

In summary, this book mainly presents methods and algorithms for the collaborative robot multimodal introspection via nonparametric Bayesian methods, together with the corresponding theoretical analysis and experimental verification on two type of robots that perform three kinds of manipulation tasks, including HIRO-NX robot performing electronic assembly, Baxter robot performing pick-and-place task, and Baxter robot performing kitting experiment. This book is written for graduate students, academic, and industrial researchers in the field of nonparametric Bayesian methods, robotics, physical human-robot interaction, multivariate time series modeling, and autonomous system design. This book covers the clear explanation of the importance of robot introspection in the tread the robotic techniques, included but not limited to the task representation of robot complex task, the multimodal time series modeling, the anomaly monitoring, and diagnosis as well as the anomaly recovery policy during robot manipulation task. It provides not only rigorous theoretical analysis on the nonparametric Bayesian based methods and algorithms for assessing the quality of multimodal sensory data but also many intuitive real-robot experiments with or without cooperating with a human. We do hope that this book will benefit the readers, in terms of knowing the basic ideas in the nonparametric Bayesian modeling of multivariate time series for robot introspection. We also hope that the presented new advancements in the field of anomaly monitoring, diagnosis, and recovery could explore new theoretical tools and practical applications.

At the end of this preface, it is worth pointing out that, in this book, some distributed methods for robot introspection and their applications are provided. The ideas in this book may trigger more studies and researches in next generation of robotics, especially nonparametric Bayesian methods based multivariate time series modeling. There is no doubt that this book can be extended. Any comments or suggestions are welcome, and the authors can be contacted via e-mail: hm.wu@giim.ac.cn (Hongmin Wu).

Guangzhou, China Xuefeng Zhou Guangzhou, China Hongmin Wu Hong Kong, China Juan Rojas Guangzhou, China Zhihao Xu Swansea, UK April 2020

Shuai Li

# Acknowledgements

During the work on this book, we have had the pleasure of discussing its various aspects and results with many cooperators and students. We highly appreciate their contributions, which particularly allowed us to improve our book manuscript.

The continuous support of our research is provided by the Foshan Key Technology Research Project (Grant No. 1920001001148), GDAS' Project of Thousand doctors (post-doctors) Introduction (2020GDASYL–20200103128), Basic and Applied Basic Research Project of Guangzhou (Grant No. 202002030237), Foshan Innovation and Entrepreneurship Team Project (Grant No. 2018IT100173), Guangdong Province Key Areas R&D Program (Grant No. 2019B090919002), Guangzhou Science Research Plan's Major Project (Grant No. 201804020095) Guangdong Innovative Talent Project of Young College (Grant No. 2016TQ03X463), Guangzhou Science and Technology Plan Project (Grant No. 201803010106).

# Contents





# Acronyms


# **Chapter 1 Introduction to Robot Introspection**

**Abstract** In this chapter, we mainly introduce the definition, background, significance, and the start-of-the-art methods of collaborative robot multimodal introspection. The current issues of robot introspection are also introduced, which including the complex task representation, anomaly monitoring, diagnoses and recovery by assessing the quality of multimodal sensory data during robot manipulation. The overall content of this book is presented at the end.

#### **1.1 What is Robot Introspection?**

As humans, doubt and understanding our own limitations, failures and shortcomings is a key for improvement and development, such knowledge alters our behaviors e.g. to execute tasks in a more cautious way [1, 2]. Due to the humans can learn and recognize the environment for a long time through the five senses (visual, acoustical, gustatory, osphretic, and tactile) and generate corresponding internal models in the daily execution of operational tasks [3–7]. Correspondingly, equipping robots with a set of skills that allows them to assess the quality of their sensory data, internal models, used methods etc. is correspondingly believed to greatly improve the autonomous operability and safety performance of an autonomous system, e.g. presented in literature [8, 9].

In recent years, the ability of robots to learn intrinsic introspective models like humans has been favored by robotics researchers [10, 11]. With respect to humans, introspection is the process of examining one's internal state by evaluating the potential patterns of multi-modal sensing data [12, 13, 15], as shown in Fig. 1.1. Its application lies in giving robots three capabilities: "What's it doing?", "How to do?", and "How's it doing?". Among them, "What's doing?" is to solve the problem of continuous state estimation during the robot execution; "How's it doing?" is to monitor the real-time state during robot execution, such as normal or abnormal; "How to do?" including the robot task planning and decision making on future motion. The development and application of robot introspection can effectively

**Fig. 1.1** Illustration of assessing the underlying dynamics of the multimodal signals that generated

by a robot performing the electronic assembly at real-time

(1) estimate, monitor and prevent abnormal events; (2) evaluate the internal state of the robot; (3) strengthen the ability to recovery abnormalities; (4) optimize control and motion decisions.

However, robots do not have thoughts neither feelings, only data, hardware and algorithms. Therefore, robots can only assess the quality of sensor data, internal models, representations, information, perception input etc. Such knowledge can later lead to a modification of robots behavior by including the assessed quality score in the planning process. Consequently, robot introspection relates to safety, active perception, mapping and many topics. These topics have a direct impact on a variety of research areas, such as long term autonomy, search and rescue, and many others. Long term autonomy can benefit from autonomous failure recovery and active learning. For search and rescue, estimation of the confidence of the sensor input and used maps is essential for overall risk assessment. Moreover, for a large variety of tasks assessing the quality of sensor data, internal models, representation, information will directly affect mission success. The ability to reason and solve its own failures, and proactively enrich owned knowledge is a direct way to improve autonomous behaviours of a robot.

#### **1.2 Background and Significance**

Traditional industrial robots have played a significant role in standardization and automated production by isolating from human, which is difficult to meet unstructured, personalized, and flexible non-standard complex task requirements in a dynamic environment such as the human robot interactive scenarios that shown in Fig. 1.2. Due to they lack of multi-functional and intelligent capabilities that per-

**Fig. 1.2** Robots are applied in diverse, individualized and flexible non-standard tasks in unstructured dynamic environments

formed and without sufficient external sensing information. In addition, with the widespread application and development of collaborative robots [16, 17], robots will gradually move from a traditional closed manufacturing environment to a shared space that coexists and interacts with human [18], from semi-automatic operation tasks to autonomously perform, which inevitably leads to various unpredictable anomalies, such as objects slipping, collisions between end-effector and the environment, human collisions, and system abnormalities. Therefore, in order to give the robot a longer-term autonomy and a safer human-machine collaborative environment, the robot should perform real-time multi-modal fusion modeling to achieve accurate introspection of its own movement behavior (identification of movement behaviors, abnormal monitoring, and abnormal diagnoses), and anomaly recovery are essential in the next generation of intelligent collaborative robots.

Generally speaking, the robot follows the control framework of the Sense-Plan-Act (SPA) in a structured environment. First, the robot observes the surrounding environment and builds an internal model [14]. Then, it formulates a task execution plan, and finally executes this plan. However, the control framework is difficult to meet the increasingly complex robot operation tasks in unstructured dynamic environments [20, 23]. In the recent decades, increasing the robot multi-modal introspection and anomaly recovery policy after the robot execution would be an effective way to address the aforementioned issues. To this end, a novel Sense-Plan-Act-Introspect-Recover (SPAIR) control framework is proposed in this book, which can provide a theoretical framework and solutions for autonomous robot operation, human-computer interaction, and human-machine integration in the future.

#### **1.3 Issues to be Addressed in Robot Introspection**

This book intends to provide the reader with a comprehensive overview of the newly developed methods for improving robot introspection. In recent years, with the rapid development of collaborative robots, scholars in the field of robotics and artificial intelligence have carried out explorations on learning from human demonstration, integration with humans, multimodal sensing, and learning methods and theories of robots. Although lots of research results and research ideas within those topics are presented, the difficulties and key issues of multi-modal modeling and its application of robots are not fully considered, and effective theoretical frameworks and systematic integration have not yet been formed. The application research of robot multi-modal sensing makes the research in this book more valuable. Therefore, this book develops a theoretical study of robot multi-modal introspection and learning based on non-parametric Bayesian models, aiming to endow robots a longer-term autonomy and a safer human-robot collaboration environment. With respect to the state-of-the-art status of robot introspection, the following five issues need to be addressed urgently:

#### • **(1) How to assess the quality of internal models, methods and sensor data, used by robots and how to alter their behavior upon this information?**

Humans can accurately perceive the state of the outside world or the objects they need to operate through the five senses based on past experience and internal models, and can self-recognize the advantages and disadvantages to learn and adjust existing experiences and models. However, robots can only sensing the world through sensory data such that how to model and analyze of multi-modal time series is a difficult problem to realize robot introspection.

#### • **(2) How to represent the complex robot task and generalize to the environmental changes during unstructured scenarios?**

It's difficult to complete tasks from the beginning to the end with fixed and preprogrammed movements in the case of an unstructured dynamic human-robot collaboration environment. Resulting in encountering fault or abnormality of the task because of the unstructured environment, the changing pose of the operating object, and the unpredictable robot state. Therefore, how to learn and generalize the robot complex tasks from human demonstration for adapting to the changes in the environment is an valuable question.

#### • **(3) How to implement the anomaly monitoring and diagnoses of multimodal time series during robot execution?**

The robot anomalies usually derived from joint encoders, the environmental changes, and human interference in the human robot interaction scenarios. Assuming that the robot can monitor and classify these anomalies, it will reduce or prevent the robot from being potentially injured and improve the safety of human-robot collaboration. Due to the diversity and complexity of the types of anomalies, it isn't possible to directly enumerate and model all anomalies, resulting the anomaly monitoring cannot be achieved through supervised learning methods. Generally, the robot anomaly monitoring were implemented by learning the outstanding difference between the normal ones. In other words, anomalies are identified from models of normal robot behavior. Making the robot anomaly monitoring and diagnoses should be critical factor for realizing the robot longer-term autonomy.

#### • **(4) How to learn the recovery policy from human demonstration for specific anomalous event?**

Robot anomaly recovery cannot be performed by the traditional robot's own motion planning algorithm under the human robot interaction scenarios. Whereas human expectations for robot motion should be taken into consideration. The main challenge is how to incrementally learn the robot recovery policy when encounter various abnormal events from unstructured human demonstrations.

#### • **(5) How to develop a stable and extendable software framework for integrating the introspective abilities by modeling the multimodal sensory data?**

How to develop a software for robot multimodal introspection that including multifunctional modules for complex task representation, multi-modal fusion, anomaly monitoring, anomaly diagnoses, and anomaly recovery. It's another common issue to be solved for realizing long-term autonomy and safe human robot collaborative environment.

#### **1.4 System Framework for Robot Multimodal Introspection**

#### *1.4.1 Introduction*

With the rapid development of collaborative robots, robots are moving from the traditional structured environment of industrial manufacturing to unstructured dynamic shared workspaces that coexist with, collaborate with, and integrate with people. Although the robot has a more robust and stable control algorithm, the algorithm cannot precisely model the unstructured environment, and unexpected events will occur. In order to endow the robot with a longer-term autonomy and a safer human-robot collaborative environment, the robot must implement self-monitoring and recovery capabilities by evaluating multi-modal sensory sensor in real time.

This book aim to propose a theoretical framework and extendable system platform for robot multimodal introspection and learning based on non-parametric Bayesian models. It mainly includes the following four aspects:


The proposed framework divides the complex tasks of the robot into the directed graph, and learns the motion primitive model and realizes the recognition of the motion behavior for the motion behavior between the nodes in the directed graph; The non-parametric Bayesian time series model learns models for anomaly monitoring and anomaly diagnoses models, and compares and analyzes the performance and results of anomaly monitoring under three different anomaly threshold conditions. A robot anomaly recovery policy is designed at the top of the framework to handle anomalies encountered by external disturbances based on the representation of robot complex task and introspection. We assume the considered recovery policy should be learned incrementally from the context of task. Therefore, the proposed Re-enactment and Adaptation recovery policies correspond to accidental anomalies and persistent anomalies, respectively. Among them, the reenact policy uses a probability model of a polynomial distribution to count the humans decision-making at different anomaly situations for the accidental anomalies. Additionally, the robot learns and stores its recovery policy, while the motion adaptation uses human intuition to teach the recovery of persistent anomalies.

The anomalous event of robot is unpredictable, so that the recovery policy of the robot under different abnormal conditions cannot be planned offline. Therefore, learning the recovery policy incrementally is the key to achieving human-robot safety collaboration. The framework has fast and robust anomaly monitoring and diagnoses capabilities for both initially planned motion primitives and newly learned recovery motion primitives. In addition, by using non-parametric Bayesian models to incrementally learn models from multi-modal sensing data, the stability, reliability and fault tolerance of the framework are well improved.

#### *1.4.2 Modules of Introspective System*

To endow robots a longer-term autonomy and a safer human-robot collaboration environment, the robot's Introspection phase and the Recovery phase were added to the traditional robot control framework SPA. The system framework of the multimodal sensing and learning is shown in Fig. 1.3 and named SPAIR. The framework mainly includes four modules: (1) directed graph representation of complex robot tasks; (2) generalization and identification of robot motion behavior; (3) real-time abnormal monitoring and diagnoses during robot execution; (4) learning recovery policy for recovering from robot abnormal events.

In recent years, the directed graph representation of complex tasks of robots has been investigated by a large number of scholars, such as Kappler et al. [19], Fox et al. [12], Kroemer et al. [21], and Niekum et al. [22]. To address the problem of robot task representation, many scholars tried to represent the complex task by a series of movement primitives, and realized the contact state estimation and task representation of the robot [13, 24–26]. This book assumes that the robot's tasks are represented by a directed manipulation graph that include a set of movement primitives, and each primitive is composed of nodes: the start node and the target node, and along with the transition between different movement primitives is expressed too. In order to accurately represent the robot complex tasks and the subsequent recovery behavior, the movement primitive is formulated as follows: (1) The behavior of the robot is used to represent a complete movement of the robot, as shown in Fig. 1.3.

**Fig. 1.3** A systematic framework for robot introspection and learning

The complete process from the "Home" node to the "Pre-pick" node; (2) The robot task is usually composed of one or more movement primitives, as shown in Fig. 1.3. The "Pick-and-Place" task of the robot consists of four motions The behavior is composed of five action nodes, among which the movement primitives are "Home → Pre-pick", "Pre-pick → Pick", "Pick → Pre-place" and "Pre-place → Place".

In particular, the directed manipulation graph will be updated when the robot encounter an abnormal event, which including: *Ni j* indicates a recovery node between nodes *N<sup>i</sup>* and *N<sup>j</sup>* and node *Nijk* denotes a recovery node between nodes *Ni j* and *N<sup>k</sup>* . By analogy, nodes for recovery will be generated in the original graph structure, and the connection between the nodes (including the recovery node and the node of the original task) will generate new motion behavior. From the recovery module at the top of Fig. 1.3, it can be known that the manipulation graph *G* of the robot tasks will gradually robust and stable with the increasing recovery behaviors.

The nodes in the SPAIR framework will include multiple functions performed in parallel, including: the robot movement properties *M*, the robot introspective properties *(I )*, and the robot sensing external environment properties *V* . The *M* in this book refer to the robot movement primitives; *(I )* includes the robot's movement identification, anomaly monitoring, and anomaly diagnoses;*(V )*represent the visual estimation of the object's pose in the environment. Among them, the robot's introspective attributes are mainly responsible for real-time monitoring of robot anomalies. Once anomalies are detected, the system will classify the anomalies, and the system will recovery according to a limited recovery strategy for different types of anomalies. If the recovery strategy does not exist, The system will prompt the need for human recovery demonstration, and learn and update its recovery behavior. According to the diagnoses result, if it is a accidental anomaly, the robot will perform the motion redo recovery behavior, as shown in "*anomaly* − *i*" and "*anomaly* − *j*" shown in Fig. 1.3; if it is a persistent anomaly, the robot will execute the motion adaptation recovery behavior, as shown in "*anomaly* − *k*" and "*anomaly* − *m*" in Fig. 1.3.

The proposed framework is built on the *ROS-Indigo*<sup>1</sup> platform, allowing signals of different modalities to communicate in the form of topics, and the sampling frequency of different sensors is fixed to a specific value by synchronous sampling (this book uses 10 Hz), and combines multiple open source software and code, such as the description of the robotic task by the finite state machine *SMACH*<sup>2</sup> and the *ar*\_*track*\_*alvar* <sup>3</sup> for pose estimation of the object. In addition, the system will automatically collect multi-modal sensor signal data for each motion primitive and the subsequent proposed recovery behaviors during the robot's movement. So that the robot can learn experience and generalize movement from the generated data. To this end, the relevant code, videos, pictures and usage documents of this book will be all open source,<sup>4</sup> so that others researchers can improve and share it.

#### **1.5 Summary**

In this chapter, we mainly introduce the definition and significance of robot introspection in the background of human-robot collaboration. We inspired from human that doubt and understanding our own limitations, failures and shortcomings is a key for improvement and development, such knowledge alters our behaviors e.g.to execute tasks in a more cautious way. To endow robots a longer-term autonomy and a safer human-robot collaboration environment, we referred to the state-of-the-art status of robot introspection, five issues are needed to be addressed urgently. To this end, we proposed an introspective system framework with four modules, named Sense-Plan-

<sup>1</sup>http://wiki.ros.org/indigo.

<sup>2</sup>http://wiki.ros.org/smach.

<sup>3</sup>http://wiki.ros.org/ar\_track\_alvar.

<sup>4</sup>https://github.com/birlrobotics/smach\_based\_introspection\_framework.

Act-Introspect-Recover (SPAIR) for address the problems including robot complex task graphical representation and execution (skill) identification from unstructured demonstrations, multimodel time series modelling for robot anomaly monitoring and diagnose as well as the recovery policies learning for different anomalies.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 2 Nonparametric Bayesian Modeling of Multimodal Time Series**

**Abstract** In this chapter, we take a Bayesian nonparametric approach in defining a prior on the hidden Markov model that allows for flexibility in addressing the problem of modeling the complex dynamics during robot manipulation task. At first, considering the underlying dynamics that can be well-modeled as a hidden discrete Markov process, but in which there is uncertainty about the cardinality of the state space. Through the use of the hierarchical Dirichlet process (HDP), one can examine an HMM with an unbounded number of possible states. Subsequently, the sticky HDP-HMM is investigated for allowing more robust learning of the complex dynamics through a learned bias by increasing the probability of self-transitions. Additionally, although the HDP-HMM and its sticky extension are very flexible time series models, they make a strong Markovian assumption that observations are conditionally independent given the discrete HMM state. This assumption is often insufficient for capturing the temporal dependencies of the observations in real data. To address this issue, we consider extensions of the sticky HDP-HMM for learning the switching dynamical processes with switching linear dynamical system. In the later chapters of this book, we will verify the performances in modeling mulitmodal time series and present the results of robot movement identification, anomaly monitoring, and anomaly diagnose.

# **2.1 Introduction**

Human can perceive the state of the outside world or the objects that need to be operated through the five perspectives based on past experience and internal models, and can recognize the advantages and disadvantages of self to learn and adjust existing experiences and models. However, robots can only understand the outside world through sensing data, so modeling and analysis of multi-modal time series is a difficult and hot research topic to realize robot perception. The multi-modal information during the execution of the robot is a highly complex and dynamic process that incorporates non-linearities such as motion encoders, vision, forces/torques, haptics, and sounds. Modeling based on the traditional hidden Markov model has the problems of uncertainty of the number of recessive states and rapid conversion of recessive states, and it is difficult to accurately capture the potential modes of multi-modal data. In this chapter, based on the Bayesian basic theory and the theoretical derivation of the HMM model, combined with the inherent characteristics of the robot's multi-modal sensor information, it proposes that it is more efficient and complicated in terms of complex dynamic phenomena, uncertainty, and model parameter learning. The robust non-parameterized Bayesian model implements a solution in which the number of recessive states of the HMM model is determined by the complexity of the data and a high self-transition expected probability of the recessive state. To a certain extent, the non-parameterized Bayesian model is Research on robot multi-modal information sensing and fusion provides theoretical framework and application guidance.

#### **2.2 Related Works**

In neurosciences, studies have shown that the different senses of humans can be closely combined [1], so that they can implement the complex manipulation tasks and environmental perception in daily life. In recent years, inspired by the results of this research in neurosciences, scholars have carried out research on the role of multimodal sensing information fusion for implementing practical manipulation tasks in robotics [2, 3]. Fitzpatrick et al. [4] introduced a method of cross-modal verification, which enhances the robot's perception of multi-modal events by repeating and redundancy of sensing information, so that the robot can learn the underlying correlation between vision and auditory. Wu et al. [5] studied the use of a combination of acceleration and sound measurements to detect structural defects in aircraft parts. Su et al. [6] introduced a multi-modal event detection framework based on the robot's axis hole assembly task, including the robot's end pose and haptic sensing signals. It can be known from aforementioned instances that multimodal fusion refers to the synthesis of sensory data with a certain temporal sequence and spatial relationship from multiple different sensors, modeling at different levels and learning its potential patterns [7]. Then, the essence of multimodal fusion is to realize the modeling and analysis of Multivariate Time Series (MTS). For example, in order to allow the robot can perform flexible operations in an uncertain environment, a force/torque sensor is mounted on the end-effector [8], a tactile sensor and a acoustical sensor are installed on the robot claw, and a visual sensor is placed for monitoring the robot workspace [9–11], as shown in Fig. 2.1. In this way, the multi-sensor fusion method can be used to more comprehensively and accurately monitor environmental characteristics, eliminate signal variability and information uncertainty, and improve system reliability and stability [7].

Time series modeling is pervasive in fields as diverse as speech recognition, robotics, economics, medicine, multimedia, bioinformatics, and system control [12]. In recent years, related modeling methods have emerged endlessly, among which the Hidden Markov Model (HMM) [13] is the most popular method that were used in speech recognition [14–16], computational biology [17, 18], Machine translation [19, 20], cryptanalysis [21, 22], economics [23, 24], and human behavior recogni-

**Fig. 2.1** Robotic manipulating platforms integrated with multiple sensors

tion [25, 26], etc. In particular, HMM has been widely used in the field of robotics, mainly including process supervision [27, 28], robot state estimation [29, 30], decision making [31, 32], and learning robot motion primitive transformation [33, 34], etc. In literature [27] indicated that HMM can be used for anomaly monitoring, providing a solution for the follow-up work of this book in anomaly implementation. Although there have been a lot of successful application cases of HMM, this model has two inherent shortcomings [35]: (1) It's difficult to determine the size of the HMM hidden state space (the number of hidden states); (2) HMM follows the Markov chain hypothesis (in a given hidden state, the observation data are distributed independently of each other) and the traditional Expectation Maximization (EM) algorithm is used to infer unknown parameters [36, 37]. This inherent disadvantage greatly weakens the accuracy and efficiency of HMM modeling in time series data.

To address the problems of HMM, non-parameterized Bayesian model [38] and HMM of infinite hidden state space [39] were proposed. Yee Whye The et al. [40] proposed a Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM). HDP-HMM uses the hidden state transition probability distribution and observation distribution of HMM. The methods of introducing hyper-parameters to Bayesian a priori are added separately, so that the number of hidden states can be learned from the complexity of the training data, and is calculated by the Markov Chain Monte Carlo (MCMC) method. The Gibbs Sampling (GS) method infers the unknown parameters of the model [41, 42]. HDP-HMM uses the full Bayesian method to learn the number of hidden states by assuming an infinite state space, it effectively avoids overfitting the data and repeated trial and error model selection. From the process of the model implementation and experimental results, it can be known that HDP-HMM's modeling of hidden states still retains the properties of the original Markov chain, which will cause rapid transitions between hidden states, resulting in a lack of the principle of time continuity. The so-called time continuity means that the occurrence of events can always maintain a certain consistency, which makes it difficult to analyze events in the real world. To address the problems of HDP-HMM, Fox et al. [43–45] proposed a Sticky Hierarchical Dirichlet Process Hidden Markov Model (sHDP-HMM). The sHDP-HMM tackle the rapid transition of hidden states in HMM is to increase the self-transition of hidden states by adding the bias of the probability of each hidden state transition, so as to meet the requirement of time continuity. The experimental results show that the "sticky" hidden state enhances the consistency modeling of time series data, and can achieve more accurate results than HDP-HMM in speech recognition and multiple speaker recognition experiments [43].

#### **2.3 Theoretical Basis of Bayesian Model**

#### *2.3.1 Multimodal Time Series*

With the development of multi-modal sensing fusion technology and the diversity of the environment, except the joint encoders, robots often need to add force/torque sensors and tactile sensors to sense the surrounding environment, so that the robot's observation is often multimodal. In general, a multidimensional time series with dimension *D* and length *T* is represented as follows:

$$Y\_{T \times D} = [\mathbf{y}\_1(t), \mathbf{y}\_2(t), \dots, \mathbf{y}\_d(t), \dots \\ \mathbf{y}\_D(t)], t = 1, 2, \dots, T; 1 \le d \le D \quad (2.1)$$

where, *t* represents a time frame, *d* represents a variable, and *yd* (*t*) denotes the observation of the *d*-dimensional variable at the *t*-th time step. For ease of writing, define the observation vectors for all dimensions at time *<sup>t</sup>* as *yt* <sup>∈</sup> <sup>R</sup>*<sup>D</sup>*. Hence, the Eq. 2.1 is expressed as a matrix with *T* × *D*, and the matrix is set as an example of a multidimensional time series, that is,

$$\begin{aligned} \mathbf{y}\_{T\times D} = \begin{bmatrix} \mathbf{y}\_1 \\ \mathbf{y}\_2 \\ \vdots \\ \mathbf{y}\_T \end{bmatrix} = \begin{bmatrix} \mathbf{y}\_1(1) & \mathbf{y}\_2(1) & \dots & \mathbf{y}\_d(1) & \dots & \mathbf{y}\_D(1) \\ \mathbf{y}\_1(2) & \mathbf{y}\_2(2) & \dots & \mathbf{y}\_d(2) & \dots & \mathbf{y}\_D(2) \\ \vdots & \vdots & \dots & \vdots & \dots & \vdots \\ \mathbf{y}\_1(T) & \mathbf{y}\_2(T) & \dots & \mathbf{y}\_d(T) & \dots & \mathbf{y}\_D(T) \end{bmatrix} \end{aligned} \tag{2.2}$$

Each row in Eq. 2.2 represents the observation at time *t*, and each column represents the data of a signal. This book uses a non-parametric Bayesian hidden Markov model to learn and analyze the multi-dimensional time series [45, 46].

#### *2.3.2 Bayes' Rules*

Assuming two random variables are *y* and θ respectively, then according to the definition of Bayes' rule as follows:

$$p(\theta|\mathbf{y}) = \frac{p(\mathbf{y}|\theta)p(\theta)}{p(\mathbf{y})} \tag{2.3}$$

Where *p*(θ|*y*)is the posterior probability, *p*(*y*|θ )is the likelihood probability, *p*(θ )is the prior probability, and *p*(*y*)is the standardized constant. From Eq. 2.3, the posterior probability is proportional to the product of the likelihood probability and the prior probability. Bayesian law summarizes the process of Bayesian inference: given the likelihood probability and the standardized constant, the posterior probability can be inferred from the prior probability. The Bayesian time series model mentioned in this book means that the variables in the model are time-dependent, but due to the endless time correlation, the model cannot be established, so it is proposed that the variables require certain constraints are time-independent such as Markov chain constraints. So that two adjacent variables *yt*+<sup>1</sup> and *yt* can be predicted by Eq. 2.3

$$p(\mathbf{y}\_{t+1}|\mathbf{y}\_t) = \frac{p(\mathbf{y}\_t|\mathbf{y}\_{t+1})p(\mathbf{y}\_{t+1})}{p(\mathbf{y}\_t)}\tag{2.4}$$

Therefore, if *y* in Eq. 2.3 is expressed as the random variable and θ as parameters of corresponding probability model, where unknown parameter is accompanied by a hyper-parameter, Bayesian theory can be extended to

$$p(\theta|\mathbf{y},\lambda) = \frac{p(\mathbf{y}|\theta)p(\theta|\lambda)}{p(\mathbf{y})} \tag{2.5}$$

From Eq. 2.5, we can know that the parameters of the model are estimated by maximizing the posterior probability, that is, the maximum posterior probability estimation method (Maximum a Posteriori, MAP), i.e.

$$\theta\_{MAP} = \underset{\theta}{\text{arg}\,\text{max}}\, p(\theta|\mathbf{y}, \lambda) \tag{2.6}$$

If the prior probability *p*(θ|λ) is a constant, then the maximum posterior estimate is equivalent to the Maximum Likelihood (ML), that is,

$$\theta\_{ML} = \underset{\theta}{\arg\max} \ p\{\mathbf{y}|\theta\} \tag{2.7}$$

Equations 2.6 and 2.7 are both point estimation methods for model parameters, which will widely used to the later chapters.

#### *2.3.3 Conjugate Prior*

As we all know, the posterior distribution and the prior distribution belong to the same type in Bayes' rule such that the prior distribution and the posterior distribution are called conjugate distributions. Where the prior distribution is called the conjugate prior of the likelihood function. Therefore, the used of conjugate priors is often motivated by practical considerations that the parameters θ can't be directly learned from the observations. Namely, the conjugate priors allow for introducing a computationally tractable mechanism for incorporating new observations into the posterior distribution of the parameters θ. With the Bayesian framework, we are interested in incorporating a prior distribution on the latent model parameter θ for making predication about the new observations. To this end, assuming the associated conditional exist, and given *N* i.i.d. observations, this predictive likelihood is written by

$$p(\mathbf{y}|\mathbf{y}\_1, \dots, \mathbf{y}\_N, \lambda) = \int\_{\boldsymbol{\Theta}} p(\mathbf{y}|\vartheta a) p(\vartheta|\mathbf{y}\_1, \dots, \mathbf{y}\_N, \lambda) d\vartheta \tag{2.8}$$

To simplify the computational complexity, we only discusses conjugate priors that require the use of three similarity functions in subsequent chapters. In particular, three common observation model are considered in the hidden Markov model, including Multinomial Distribution, Multivariate Gaussian Distribution as well as Linear Gaussian Regression Model.

#### • **Multinomial Observations**

Multinomial Distribution [130, 131] considers a random variable *y* on a finite sample space *Y* = {1,..., *K*}, its probability mass function can be denoted by π = [π1,...,π*<sup>K</sup>* ], and describes the probability of a string of *N* observations of *y* taking on values *y*1,..., *yn* by

$$p(\mathbf{y}\_1, \mathbf{y}\_2, \dots, \mathbf{y}\_N | \boldsymbol{\pi}) = \frac{N!}{\prod\_k N\_k!} \prod\_k \pi\_k^{N\_k}, N\_k \stackrel{\Delta}{=} \sum\_n \delta(\mathbf{y}\_n, k) \tag{2.9}$$

Where, the notation δ(*j*, *k*) indicate the discrete Kronecker delta. When *K* = 2, this distribution is referred to as the binomial distribution.

The Dirichlet prior distribution is used for formulating the conjugate prior of the multinomial distribution. A *K*-dimentional Dirichlet distribution is the conjugate prior for the class of *K*-dimensional multinominal distribution and is uniquely defined by a set of hyperparameters α = [α1,...,α*<sup>K</sup>* ], which has the following form:

$$\begin{array}{l}\text{Dir}(\alpha\_1, \alpha\_2, \dots, \alpha\_K) = p(\pi \, | \alpha) \\ = \frac{\Gamma(\sum\_k \alpha\_k)}{\prod\_k \Gamma(\alpha\_k)} \prod\_k \pi\_k^{\alpha\_k - 1}, \alpha\_k > 0 \end{array} \tag{2.10}$$

Where, Γ (·)represents the standard Gamma function. When *K* = 2, this distribution is referred to as the Beta distribution, which can be denoted by *Beta*(α1, α2). The initial value of the distribution in Eq. 2.10 is given by

$$\left[\pi\_i\right] = \frac{\alpha\_i}{\sum\_k \alpha\_k} \tag{2.11}$$

Then, refer to the conjugacy of the Dirichlet distribution, conditioned on *N* multinomial observations *y*1,..., *yN* , the posterior distribution on π is also Dirichlet that

#### formulated by

$$p(\pi|\mathbf{y}\_1, \mathbf{y}\_2, \dots, \mathbf{y}\_N, \mathbf{a}) \propto p(\pi|\mathbf{a})p(\mathbf{y}\_1, \mathbf{y}\_2, \dots, \mathbf{y}\_N|\pi) \propto Dir(\mathbf{a}\_1 + N\_1, \dots, \mathbf{a}\_K + N\_K) \tag{2.12}$$

Where, the *Nk* represents the observed times of observation *yn* = *k*. Subsequently, using the normalizing constant of the Dirichlet distribution, and substituting to Eq. 2.8, one can derive the predictive likelihood to be

$$p(\mathbf{y} = k | \mathbf{y}\_1, \dots, \mathbf{y}\_N, \alpha) = \frac{N\_k + \alpha\_k}{N + \sum\_{k=1}^K \alpha\_k} \tag{2.13}$$

#### • **Multivariate Gaussian Observations**

Multivariate Gaussian distribution is parameterized by a mean vetor μ and covariance matrix Σ, which provides a useful description of continuous-valued random variables that concentrate about a given value and have constrained variability [130]. This distribution over a sample space *<sup>y</sup>* <sup>∈</sup> *<sup>Y</sup>* <sup>=</sup> <sup>R</sup>*<sup>D</sup>* that each observation *<sup>y</sup>* is a *D*-dimensional vector can be derived by

$$\mathcal{L}^{\mathcal{H}}(\mathbf{y};\mu,\Sigma) = p(\mathbf{y}|\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{y}-\mu)^{T}\Sigma^{-1}(\mathbf{y}-\mu)\right\} \tag{2.14}$$

We conveniently denote this multivariate Gaussian distribution by *N* (μ, Σ). Three categories of priors about this distribution in terms of the parameters are considered in this book, including known covariance, known mean, and both the mean and covariance are uncertain.

#### **Known Covariance**

For known covariance Σ, we use the normal prior distribution as the conjugate prior on the mean vector μ, and represented by *N* (μ0, Σ0).

#### **Known Mean**

For known mean and only the convariance Σ is uncertain, the conjugate prior is formulated by the inverse-Wishart distribution [130]. The *D*-dimensional inverse-Wishart distribution with two parameters: covariance parameter Δ and ν degrees of freedom, and given by

$$p(\boldsymbol{\Sigma}|\boldsymbol{\Delta},\boldsymbol{\upsilon}) = \frac{|\boldsymbol{\upsilon}\boldsymbol{\Delta}|^{\frac{\boldsymbol{\upsilon}}{2}}|\boldsymbol{\Sigma}|^{\frac{\boldsymbol{\upsilon}+\boldsymbol{D}+1}{2}}}{2^{\frac{\boldsymbol{\upsilon}\boldsymbol{D}}{2}}\pi\frac{D(\boldsymbol{D}-1)}{4}\prod\_{i=1}^{k}\Gamma(\frac{\boldsymbol{\upsilon}+1-i}{2})}\exp\left\{-\frac{1}{2}tr(\boldsymbol{\upsilon}\boldsymbol{\Delta}\boldsymbol{\Sigma}^{-1})\right\} \qquad (2.15)$$

Where, *tr*(·) denotes the trace of matrix. We write this distribution by *I W* (ν, Δ), and the first moment is given by calculating the expectation value:

$$E[\Sigma] = \frac{\nu \Delta}{\nu - D - 1} \tag{2.16}$$

#### **Both Covariance and Mean are Uncertain**

We use the normal-inverse-Whishart distribution as the conjugate prior when both the mean and covariance are uncertain. This distribution defines a conditionally normal prior on the mean μ|Σ ∼ *N* (ϑ, Σ/κ), and an inverse-Wishart distribution on the covariance, Σ ∼ *I W* (ν, Δ). Therefore, the joint prior for uncertain mean and covariance is then given by

$$p(\mu, \Sigma | \kappa, \vartheta, \upsilon, \Delta) \propto |\Sigma|^{\frac{\upsilon + D + 1}{2}} \exp\left\{ \frac{1}{2} tr (\upsilon \Delta \Sigma^{-1}) - \frac{\kappa}{2} (\mu - \vartheta)^T \Sigma^{-1} (\mu - \vartheta) \right\} \tag{2.17}$$

Where, κ represents the degree of trust in the mean generated by this prior distribution, ϑ represents the mean of this prior distribution, ν represents the degree of trust in the variance generated by this prior distribution, and Δ represents the mean of the variance Σ matrix. We will use the notation *N IW* (κ, ϑ, ν, Δ) to represent the normal-inverse-Whishart distribution with four parameters.

#### • **Multivariate Linear Gaussian Regression Observations**

The normal multivariate linear regression model is one in which the observations *yi* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* can be described as a linear combination of a set of known regressors *xi* <sup>∈</sup> <sup>R</sup>*<sup>n</sup>* with errors accounted for by additive Gaussian noise, written as

$$y\_i = x\_{i1}a\_1 + \dots + x\_{in}a\_n + e\_i, \qquad e\_i \sim \mathcal{N}(0, \Sigma) \tag{2.18}$$

For considering the temporal correlation in a multimodal system, we may combine a set of *N* observation vectors into a matrix *Y* = [*y*<sup>1</sup> ··· *yN* ], the regressors also into a matrix *X* = [*x*<sup>1</sup> ··· *xN* ], and the noise terms into *E* = [*e*<sup>1</sup> ··· *eN* ] and formulated by

$$Y = AX + E \tag{2.19}$$

where the notation *A* = [*a*<sup>1</sup> ··· *aN* ] is referred to as the design matrix.

Assuming the noise covarianceΣ is know, the conjugate prior on the design matrix *<sup>A</sup>* is the matrix-normal distribution [132]. A matrix *<sup>A</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>*×*<sup>m</sup>* has a matrix-normal distribution *M N* (*A*; *M*, *V*, *K*) that given by

$$p(A) = \frac{|K|^{\frac{4}{2}}}{|2\pi V|^{\frac{n}{2}}} e^{\frac{1}{2}tr((A-M)^{T})V^{-1}(A-M)^{K})}\tag{2.20}$$

where, the *M* is the mean matrix, and *V* and *K* <sup>−</sup><sup>1</sup> are related to the covariance along the rows and the columns of *A*.

Along with Eqs. 2.15 and 2.20, the conjugate prior on the set of parameters *A* and Σ is the matrix-normal inverse-Wishart prior that places a conditionally matrixnormal prior on matrix *A* given Σ, that is

$$\begin{cases} \Sigma \sim \mathcal{J}\mathcal{W}(\mathbb{v}, \Delta) \\ A|\Sigma \sim \mathcal{M}\mathcal{J}(A; M, \Sigma, K) \end{cases} \tag{2.21}$$

#### **2.4 Hidden Markov Model**

HMM has been a workhorse in pattern recognition able to encode probabilistic statespace model. The HMM is a stochastic process where a finite number of latent states have Markovian state transitions. Conditioned on the mode sequence, the model assumes (discrete or continuous) conditionally independent observations given the latent state. Let *zt* denote the latent state of the Markov chain at time *t* ∈ *T* , and π*<sup>j</sup>* the state-specific transition distribution for state *j* ∈ *K*. Given the state *zt* , the observation *yt* is conditionally independent of observations and states at other time steps. Then, the Markovian structure on the state sequence and the observation are simply described as:

$$p(z\_t \mid z\_{t-1}) = \pi\_{z\_{t-1}}, \quad p(\mathbf{y}\_t \mid z\_t) = F(\theta\_{z\_t}) \quad t > 1. \tag{2.22}$$

where, the state at the first time step is distributed according to an initial transition distributions π0; *F*(·) represents a family of distributions (e.g., the multinomial for discrete data, or the multivariate Gaussian for real-vector-valued data); θ*zt* are the state-specific emission parameters and the *z*1:*<sup>T</sup>* is a state path over the hidden state space. Therefore, the resulting joint density for *T* observations is given by:

$$p(z\_{1:T}, \mathbf{y}\_{1:T}) = \pi\_0(z\_1) p(\mathbf{y}\_1 \mid z\_1) \prod\_{t=2}^T p(z\_t \mid z\_{t-1}) p(\mathbf{y}\_t \mid z\_t). \tag{2.23}$$

Then, to calculate the log-likelihood value of an observation vector *y*1:*<sup>T</sup>* = (*y*1, *y*2, *y*3,..., *yT* ), we first write the joint distribution with the Markov property *p*(*yt*|*yt*−<sup>1</sup>,..., *y*1) = *p*(*yt*|*yt*−<sup>1</sup>) and derive the joint probability of observation *y*1:*<sup>T</sup>* as:

$$\begin{aligned} p(\mathbf{y}\_{1:T}) &= \sum\_{t=1}^{T} p(\mathbf{y}\_t | \mathbf{y}\_{1:t-1}) \\ p(\mathbf{y}\_t | \mathbf{y}\_{1:t-1}) &= \sum\_{k=1}^{K} p(\mathbf{z}\_t = k | \mathbf{y}\_{1:t-1}) p(\mathbf{y}\_t | \mathbf{z}\_t) \end{aligned} \tag{2.24}$$

As illustrated in Eq. 2.23, if the state path *z*<sup>1</sup>:*<sup>T</sup>* is estimated, the maximum probability of the observation sequence can be obtained. However, the true state-path is hidden from the observations. There are different approaches to compute it: (i) the maximum likelihood state at any given moment; however, this approach does not consider uncertainty; (ii) a marginal probabilistic representation over hidden states at each time step. Here, the log-probability at time *t* is derived by computing the natural logarithm of the sum of exponentials over all hidden states:

$$\log p(\mathbf{y}\_{1:T}) = \log \sum\_{k=1}^{K} \exp \left( \frac{a\_t(k) \cdot \beta\_t(k)}{p(\mathbf{y}\_{1:T})} \right) \tag{2.25}$$

where α*t*(*k*) = *p*(*y*1:*<sup>t</sup>*,*zt*), β*t*(*k*) = *p*(*yt*+1:*<sup>T</sup>* |*zt*), represent the forward message passing and backward message respectively in the standard forward-backward algorithm [37, 47].

$$\begin{split} p(z\_{t} = k | \mathbf{y}\_{1:t-1}) &= \sum\_{j} p(z\_{t} | z\_{t-1} = j) p(z\_{t-1} | \mathbf{y}\_{1:t-1}) \\ \alpha\_{t}(k) &= p(z\_{t} = k | \mathbf{y}\_{0:t}) \\ &= \frac{1}{p(\mathbf{y}\_{t} | \mathbf{y}\_{1:t-1})} p(\mathbf{y}\_{t} | z\_{t} = k) p(z\_{t} = k | \mathbf{y}\_{1:t-1}). \end{split} \tag{2.26}$$

The denominator represents the probability of a sequence of observations that can be calculated by Eq. 2.24.

#### *2.4.1 Forward-Backward Algorithm*

To learn unknown parameters, the forward-backward algorithm [48] provides an efficient message passing procedure on computing node marginals of interest problems including *filtering p*(*zn*|*y*1,..., *yn*), *prediction p*(*zn*+*m*|*y*1,..., *yn*), and *smoothing p*(*zn*|*y*1,..., *yN* ), *N* > *n*, where, *zn* denotes the hidden state and *yn* represents the observation. According to the implementation of the belief propagation algorithm in Bayesian graphical model, we define a set of forward and backward messages by

$$\begin{aligned} for word: & \quad \alpha\_n(\mathbf{z}\_n) \triangleq p(\mathbf{y}\_1, \dots, \mathbf{y}\_n, \mathbf{z}\_n) \\ backward: & \quad \beta\_n(\mathbf{z}\_n) \triangleq p(\mathbf{y}\_{n+1}, \dots, \mathbf{y}\_N|\mathbf{z}\_n) \end{aligned} \tag{2.27}$$

To address the problem of filtering, we formulate the filtering with the forward messages defined in Eq. 2.27. The filtering is simply generated as

$$filtering: \quad p(z\_n|\mathbf{y}\_1, \dots, \mathbf{y}\_n) = \frac{\alpha\_n(z\_n)}{\Sigma\_z \alpha\_n(z)}\tag{2.28}$$

To address the problem of prediction, we can utilize the Markov structure for the underlying chain to derive, that is

$$p(\mathbf{z}\_{n+m}|\mathbf{y}\_1, \dots, \mathbf{y}\_n) = \frac{\Sigma\_{\mathbf{z}\_{n+m-1}} p(\mathbf{z}\_{n+m}|\mathbf{z}\_{n+m-1}) \cdots \Sigma\_{\mathbf{z}\_n} p(\mathbf{z}\_{n+1}|\mathbf{z}\_n) \alpha\_n(\mathbf{z}\_n)}{\Sigma\_{\mathbf{z}} \alpha\_n(\mathbf{z})} \tag{2.29}$$

From Eq. 2.29, it shows that the prediction is equivalent to propagating the forward message without incorporating the missing observations *yn*+<sup>1</sup>,..., *yn*+*<sup>m</sup>*. However, we utilize both of the forward and backward messages for addressing the problem of smoothing for any *m* by

$$\begin{split} p(z\_n|\mathbf{y}\_1, \dots, \mathbf{y}\_N) &= \frac{p(\mathbf{y}\_1, \dots, \mathbf{y}\_N|z\_n) p(z\_n)}{p(\mathbf{y}\_1, \dots, \mathbf{y}\_N)} \\ &= \frac{p(\mathbf{y}\_1, \dots, \mathbf{y}\_n|z\_n) p(\mathbf{y}\_{n+1}, \dots, \mathbf{y}\_N|z\_n) p(z\_n)}{p(\mathbf{y}\_1, \dots, \mathbf{y}\_N)} \\ &= \frac{\alpha\_n(z\_n) \beta\_n(z\_n)}{\Sigma\_\varepsilon \alpha\_n(z) \beta\_m(z)} \end{split} \tag{2.30}$$

Hence, we can referred to the belief propagation algorithm for addressing the problems of filtering, predication, and smoothing of HMM. Among of them, we can use the forward messages for calculating the marginal likelihood of observations.

#### *2.4.2 Viterbi Algorithm*

The Viterbi algorithm is proposed for addressing the problem of the most likely state sequence to have generated an observation sequence *y*1,..., *yN* when given a set of HMM parameters, which provides an efficient dynamic programming approach to computing this Maximum A Posteriori (MAP) hidden state sequence by

$$\begin{split} \hat{\boldsymbol{z}} &= \max\_{\boldsymbol{z}\_{1}, \dots, \boldsymbol{z}\_{N}} \pi^{0}(\boldsymbol{z}\_{1}) p(\mathbf{y}\_{1}|\boldsymbol{z}\_{1}) \prod\_{n=2}^{N} p(\boldsymbol{z}\_{n}|\boldsymbol{z}\_{n-1}) p(\mathbf{y}\_{n}|\boldsymbol{z}\_{n}) \\\\ &= \min\_{\boldsymbol{z}\_{1}, \dots, \boldsymbol{z}\_{N}} \left[ -\log \pi^{0}(\boldsymbol{z}\_{1}) - \log p(\mathbf{y}\_{1}|\boldsymbol{z}\_{1}) + \sum\_{n=1}^{N} -\log p(\boldsymbol{z}\_{n}|\boldsymbol{z}\_{n-1}) - \log p(\mathbf{y}\_{n}|\boldsymbol{z}\_{n}) \right] \end{split} \tag{2.31}$$

From the above, we can note that choosing the MAP sequence is not necessarily equivalent to choosing the maximum marginal independently at each node by *z*ˆ*<sup>n</sup>* = *maxp*(*zn*|*y*1,..., *yN* ). The marginal sequence also may not even be a feasible sequence for the HMM.

The Viterbi algorithm can be used to compute the most probable sequence of states in HMM based on the dynamic programming principle that the minimum cost path to *zn* = *k* is equivalent to the minimum cost path to node *zn*−<sup>1</sup> plus the cost of a transition from *zn*−<sup>1</sup> to *zn* = *k*. Therefore, the MAP hidden Markov model state sequence *z*ˆ1,...,*z*ˆ*<sup>N</sup>* can be computed in the following four steps that originally presented in literature [45]. Here, we first initialize the minimum path sum to state *z*<sup>1</sup> = *k* for each *k* ∈ {,..., *K*} by

$$n = 1: \quad \mathcal{P}\_1(z\_1 = k) = \log \pi^0(z\_1 = k) - \log p(\mathbf{y}\_1 | z\_k = k) \tag{2.32}$$

For *n* = 2,..., *N*, and for each *k* ∈ {1,..., *K*}, calculate the minimum path sum th state *zn* = *k* by

$$n = \{2, \dots, N\}: \quad \mathcal{J}\_n^\rho(\underline{z}\_n = k) = -\log p(\mathbf{y}\_n | \underline{z}\_n = k) + \min\_{\mathbf{z}\_{n-1}} \{ \mathcal{J}\_{n-1}^\rho(\underline{z}\_{n-1}) \} \tag{2.33}$$

$$-\log p(\underline{z}\_n = k | \underline{z}\_{n-1}) \}$$

and simplifies to

$$z\_{n-1}^\*(z\_n) = \underset{z\_{n-1}}{\arg\min} \{ \mathcal{P}\_{n-1}^\ell(z\_{n-1}) - \log p(z\_n = k | z\_{n-1}) \}\tag{2.34}$$

thus, computing the probability of hidden state sequence given observation sequence by

$$\min\_{\mathbf{z}\_1,\dots,\mathbf{z}\_N} -\log p(\mathbf{z}\_1,\dots,\mathbf{z}\_N | \mathbf{y}\_1,\dots,\mathbf{y}\_N) = \mathcal{F}\_N^\ell(\mathbf{z}\_N) \tag{2.35}$$

and then set the notation *<sup>z</sup>*ˆ*<sup>N</sup>* <sup>=</sup> *arg min zN S<sup>N</sup>* (*zN* ). Finally, the most likely hidden state sequence given observations can be iteratively calculated by

$$\hat{z}\_n = z\_n^\*(\hat{z}\_{n+1}), \quad n \in \{N-1, \dots, 1\} \tag{2.36}$$

We will propose methods that have straightforward connections with the Viterbi algorithm for anomaly monitoring during robot manipulation in Chap. 4.

#### **2.5 Nonparametric Bayesian Hidden Markov Model**

Bayesian methods are widely applied to various fields such as biomedicine, clinical trials, social science, and economics. The collected data are more and more complicated with the rapid development of modern science and technology. These complicated data include high/ultrahigh dimensional data, big data, discrete and continuous data. The aim of this section is to introduce the newly developed Bayesian methods, including Bayesian variable selection (e.g., fixed-dimensional data analysis and high/ultrahigh dimensional data analysis), Bayesian influence analysis (e.g., case deletion method and local influence analysis), Bayesian estimation and clustering methods (e.g., fixed dimensional data, high/ultrahigh dimensional data analysis, Bayesian network, and Bayesian clustering for big data), Bayesian hypothesis test including discrete and continuous random variables, variational Bayesian analysis and Bayesian clinical trials including design and dose-finding algorithm, and their applications in various fields.

Moreover, Bayesian nonparametric methods avoid the often restrictive assumptions of parametric models by defining distributions on function spaces, especially, for the prior function in Bayesian formulation. If effectively designed, these methods allow for data-driven posterior inference. The overview of Bayesian nonparametric inference in recent decades, see [49–51], and we briefly describe two aspects of Bayesian nonparametric methods on the hidden Markov model, including the Dirichlet process and its hierarchical extension, such that formulated the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) for modeling multimodal observations in robotics.

#### *2.5.1 Bayesian Nonparametric Priors*

HMMs restrictively assumes a fixed model complexity. In process monitoring, latent states can represent robot primitives. It's clear that not all robot skills have the same number of primitives and that even the same skill might have variation when conducted under different conditions. To introduce flexibility into the number of computed hidden states, we leverage priors as probability measures.

HMMs can also be represented though a set of transition probability measures *G <sup>j</sup>* . Probability measures yield strictly positive probabilities and sum to 1. Consider if instead of using a transition distribution on latent states, we use it across emission parameters θ *<sup>j</sup>* ∈ Θ. Then,

$$G\_j = \sum\_{k=1}^{K} \pi\_{jk} \delta\_{\theta\_k},\tag{2.37}$$

where δθ*<sup>k</sup>* is the unit mass for mode *k* at θ. The emission parameter θ *<sup>j</sup>* , can be evaluated at time index *t* − 1, such that:

$$p(\theta'\_t \mid \theta'\_{t-1}) = G\_{j\_{t-1}}, \quad p(\mathbf{y}\_t \mid \theta'\_t) = F(\theta'\_t) \tag{2.38}$$

So, given θ *<sup>j</sup>* , different probability weights are assigned to possible successor candidates θ*<sup>k</sup>* . We can also assign a prior to the categorical probability measure *G <sup>j</sup>* . The Dirichlet distribution is a natural selection due to conjugacy. Thus, the transition probabilities π*<sup>j</sup>* = [π*<sup>j</sup>*<sup>1</sup> ··· π*j K* ] are independent draws from a *K*-dimensional Dirichlet distribution:

$$\begin{aligned} p(\pi\_j) &= Dir(\alpha\_1, \dots, \alpha\_K), \quad j = 1, \dots, K\\ \sum\_{k}^{K} \pi\_{jk} &= 1 \end{aligned} \tag{2.39}$$

Additionally, emission parameters are drawn from a base measure *H* such that, *p*(θ *<sup>j</sup>*) = *H*. The Dirichlet process (DP) was used as the base measured instead of the Dirichlet distribution. The DP is a distribution over countably infinite probability measures *G*<sup>0</sup> : Θ ⇒ *R*+, where *G*0(θ ) ≥= 0 and <sup>Θ</sup> *G*0(θ )*d*θ = 1. The DP has a joint Dirichlet distribution:

$$p(G\_0(\theta\_j), \dots, G\_0(\theta\_K)) = \operatorname{Dir}(\operatorname{\mathcal{H}}(\theta\_j), \dots, \operatorname{\mathcal{H}}(\theta\_K)).\tag{2.40}$$

We can further summarize the probability measure as *p*(*G*0) = *D P*(γ , *H*) as:

24 2 Nonparametric Bayesian Modeling of Multimodal Time Series

$$G\_0 = \sum\_{k=1}^{\infty} \beta\_k \delta\_{\theta\_k} \quad p(\theta\_k) = H,\tag{2.41}$$

where γ is the concentration parameter and *H* is the base measure over parameter space Θ. The weights β*<sup>k</sup>* are sampled via a stick-breaking construction:

$$
\beta\_k = \nu\_k \Pi\_{l=1}^{k-1} (1 - \nu\_l), \tag{2.42}
$$

where *p*(ν*<sup>k</sup>* ) = β(1,γ). For succinctness, the stick-breaking process is defined as: *p*(β) = *GEM*(γ ). The DP is used to define a prior on the set of HMM transition probability measures *G <sup>j</sup>* . However, if each transition measure *G <sup>j</sup>* is an independent draw from *D P*(γ , *H*), where *H* is continuous, like a Gaussian distribution, transition measures lead to non-overlapping support. This means that previously seen modes (robotic primitives) cannot be selected again. To deal with this limitation, a Hierarchical Dirichlet Process (HDP) is used. The latter constructs transition measures *G <sup>j</sup> on the same support points* (θ1, θ2, ...θ*<sup>K</sup>* ). This can be done when *G <sup>j</sup>* is only a variation on a *global* discrete measure *G*0, such that:

$$\begin{aligned} G\_0 &= \sum\_{k=1}^{\infty} \beta\_k \delta\_{\theta\_k}, \quad p(\beta \mid \boldsymbol{\gamma}) = GEM(\boldsymbol{\gamma})\\ p(\theta\_k \mid G\_0) &= G\_0\\ G\_j &= \sum\_{k=1}^{\infty} \pi\_{jk} \delta\_{\theta\_k}, \quad p(\pi\_j \mid \boldsymbol{\alpha}, \beta) = DP(\boldsymbol{\alpha}, \beta). \end{aligned} \tag{2.43}$$

This HDP is used as a prior on the HMM. The implications are a mode complexity learned from the data and a sparse state representation. The *D P*(α, β) distribution encourages modes with similar transition distributions, but does not distinguish between self- and and cross-mode transitions. This is problematic for dynamical data. The HDP-HMM yields large posterior probabilities for mode sequences with unrealistically fast dynamics. Emily B. Fox et al. [52] introduced the sticky parameter into the HDP, yielding the sHDP-HMM, thus the transition probability π*<sup>j</sup>* is defined as:

$$p(\pi\_j \mid \alpha, \kappa, \beta) = DP\left(\alpha + \kappa, \frac{\alpha\beta + \kappa\delta\_i}{\alpha + \kappa}\right) \tag{2.44}$$

The sticky HDP increases the expected probability of self-transitions by an amount proportional to κ and leads to posterior distributions with smoothly varying dynamics. Finally, priors are placed on the concentration parameters α and γ , and the sticky parameter κ. Latent state creation is influenced by α and γ , while self-transition probabilities are biased by κ. These priors allow to integrate expert knowledge for better modeling than the expectation-maximization (EM) algorithm traditionally used in HMMs.

#### *2.5.2 The sHDP-VAR-HMM*

Many complex dynamical phenomena are not adequately described as conditionally independent of the observations given the state *zi* . That is, the observations are a noisy linear combination of some finite set of past observations plus additive white noise. In such cases, the dynamical processes can be modeled by an autoregressive (AR) process. Particularly, an order *r* vector AR process, denoted by VAR(*r*), with observations *yt* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* . The VAR(*r*) system can be considered an extension of the HMM; instead of having conditionally independent observations given the hidden state sequence, the system has conditionally linear dynamics. Figure 4.9 shows a graphical model representation and compares it to the sHDP-HMM described in previous section.

The VAR has simplifying assumptions that make it a practical choice in applications [52]. The switching regime can be combined with the sHDP-HMM from Sect. 2.5.1 to leverage the expressiveness of the VAR system with the ability of nonparametric priors to learn the mode complexity of the model. The VAR(r) process, with autoregressive order *r* consists of a latent state *zt* ∈ *R<sup>n</sup>* with linear dynamics observed through *yt* ∈ *R<sup>d</sup>* . The observations have mode specific coefficients and process noise as:

$$\begin{aligned} p(\mathbf{z}\_t) &= \pi\_{\mathbf{z}\_{t-1}}, \quad t > 1\\ \mathbf{y}\_t &= \sum\_{i=1}^r A\_i^{\mathbf{z}\_t} \mathbf{y}\_{t-i} + e\_t(\mathbf{z}\_t), \quad p(\mathbf{e}\_t) = \mathcal{A}'(\mathbf{0}, \boldsymbol{\Sigma}). \end{aligned} \tag{2.45}$$

We see a generative model for a time-series {*y*1, *y*2,..., *yT* } of observed multimodal data, a matrix of regression coefficients **<sup>A</sup>**(*k*) = [*A*(*k*) <sup>1</sup> ··· *A*(*k*) *<sup>r</sup>* ] ∈ <sup>R</sup>*dx*(*d*∗*r*) ], and a measurement noise Σ, with a symmetric positive-definite covariance matrix. Given the observation data, we are interested in learning the "*rth*" model order, for which we need to infer {**A**(*k*) , Σ(*k*) }. We leverage the Bayesian approach through the placement of conjugate priors on both parameters for posterior inference. As the mean and covariance are uncertain, the Matrix-Normal Inverse-Wishart (MNIW) serves as an appropriate prior on the multivariate AR distribution. The MNIW places a conditionally matrix-normal prior on **A**(*k*) given Σ:

$$p(\mathbf{A}^{(k)} \mid \Sigma) = \mathcal{A} \mathcal{A} \mathcal{N}(\mathbf{A}; M, \Sigma, K), \tag{2.46}$$

and an inverse-Wishart prior on Σ:

$$p(\boldsymbol{\Sigma}) = I \, W(\boldsymbol{\upsilon}\_0, \mathcal{S}\_0). \tag{2.47}$$

See [45] Appendix F.1 for a detailed derivation of resulting posterior and definitions for variables *M* and *K* which are parameters that define the matrix-normal portion of the MNIW prior.

#### **2.6 Summary**

In this chapter, a Bayesian nonparametric approach in defining a prior on the hidden Markov model that allows for flexibility in addressing the problem of modeling the complex dynamics during robot manipulation task is proposed. considering the underlying dynamics that can be well-modeled as a hidden discrete Markov process, but in which there is uncertainty about the cardinality of the state space. We use the hierarchical Dirichlet process (HDP) for examining an HMM with an unbounded number of possible states. Subsequently, the sticky HDP-HMM is investigated for allowing more robust learning of the complex dynamics through a learned bias by increasing the probability of self-transitions. Although the HDP-HMM and its sticky extension are very flexible time series models, they make a strong Markovian assumption that observations are conditionally independent given the discrete HMM state. This assumption is often insufficient for capturing the temporal dependencies of the observations in real data. Consequently, an extension of the sticky HDP-HMM for learning the switching dynamical processes with switching linear dynamical system is investigated. As the basis, the detailed formulations and theoretical process of those models are explained step-by-step in each sections, and widely used in the later chapters of this book.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Incremental Learning Robot Task Representation and Identification**

**Abstract** In this chapter, we present a novel method for incremental learning robot complex task representation, identifying repeated skills, and generalizing to new environment by heuristically segmenting the unstructured demonstrations into movement primitives that modelled with a dynamical system. The proposed method combines the advantages of recent task representation methods for learning from demonstration in into an integrated framework. In particular, we use the combination of finite state machine and dynamical movement primitives for complex task representation, and investigate the Bayesian nonparametric hidden Markov model for repeated skill identification. To this end, a robot should be able to identify its actions not only when failure or novelty occurs, but also as it executes any number of skills, which helps a robot understand what it is doing at all times. Two complex, multi-step robot tasks are designed to evaluate the feasibility and effectiveness of proposed methods. We not only present the results in task representation, but also analyzing the performance of skill identification by various nonparametric models with various modality combinations.

#### **3.1 Introduction**

As we all know, a sentence in a language is composed of words according to grammatical rules, and a word is composed of letters according to word formation rules. Then, a robot complex manipulation task corresponds to a sentence, the movement primitives of the robot coupling correspond to a word, and the movement primitives of the robot's respective degrees can be expressed as letters. Therefore, the robot task can be represented by a set of learned movement primitives. Through the description of the movement primitives, the movement of the robot can be generalized, and the diversity and adaptability of the manipulation tasks can be improved. Based on the representation of the movement primitives of the robot tasks, the identification of the robot's movement will facilitate the abnormal monitoring and recovery of the robot system.

#### **3.2 Related Work**

In recent years, as robots have been widely used in fields such as human interaction, home services, medical care, and industrial production, to enable non-robot professionals and people who lack specialized knowledge to quickly complete robot movement programming. Compared to the traditional robot control scheme [18, 23], robots Learning from human demonstrations has become mainstream for complex task descriptions [1, 3]. Generally, in order to effectively learn and generalize the robot's manipulation tasks, the tasks are often divided into multiple segments of movement primitives for learning [4, 6], and thereafter the tasks are described by stitching or serializing different movement primitives [7–9]. Therefore, the learning methods of robot movement primitives mainly include the following four categories [12, 13]: Probabilistic movement primitives, dynamic movement primitives (DMPs), and probabilistic movements based on a Guassian Mixture Model (GMM). Probabilistic Movement Primitive (ProMP) and Stable Estimator of Dynamical System (SEDS).

In particular, GMM has strong noise processing ability and strong robustness for human unstructured demonstration trajectories, and it is easy to deal with the problem of high-dimensional trajectories. The key implementation process is to first use the multivariate Gaussian distribution to model the demonstration trajectory, then use the Expectation-maximization (EM) algorithm and nonlinear regression technology to learn the model, and finally adjust the input new task configuration to achieve a generalized description of the trajectory, as DA Duque et al. presented in [11].

This method is embodied in that an arbitrary continuous probability distribution can be approximated by a weighted average of a mixture of multiple Gaussian distributions, and has been widely used in robot motion primitive learning and motion generation [14]. Elena et al. [15] learned the trajectories of human demonstrations through GMM, and estimated the stability of the non-linear multi-dimensional dynamic system on the generated trajectories, so that they can not only generate trajectories for unknown tasks but also perform online in the presence of external disturbances. Adjustment Calinon et al. [16, 17, 19] proposed a trajectory learning framework based on variance in the task space, modeled human trajectories through GMM, and learned EM algorithms for model unknown parameters. Finally, Gaussian Mixture regression model (Gaussian Mixture Regression (GMR) and optimization estimator. Peter et al. [20] used the Gaussian Process (GP) method to establish a regression model on the trajectory of human demonstration to express motion and used KLback–Leibler Divergence (KL) as an indicator of trajectory generalization performance. This learning through probability distribution guarantees the stability of motion generation.

Additionally, Schaal and Ijspeert et al. [13, 22, 24] to study the demonstration movement with a dynamic system of nonlinear differential equations. The most advanced advantage of this method is that only one demonstration trajectory can be used to complete the learning and generate the Adapted movement. At the same time, DMP inherits many advantages of dynamic systems, such as conditional convergence, robustness to external disturbances, and time independence. Khansari et al. [21, 25, 26] aimed at the problem that the motion primitives learned in the GMM method could not guarantee its motion stability, and proposed a Stable Estimator of Dynamical Systems (SEDS) for nonlinear dynamic systems. The motion primitive learning method combines a dynamic system with a probabilistic statistical model to obtain its globally stable constraints. Finally, the parameter estimation problem is transformed into an optimization problem.

The robot's manipulation tasks can be divided into multiple parameterized motion primitive, so that the combination and serialization of multiple different motion primitives can effectively improve the diversity and adaptability of robot tasks. Therefore, how to correctly select the next motion primitive after the execution of the current motion primitive is another research difficulty of the complex task description of the robot. In response to the serialization of motion primitives, Pastor et al. [27] proposed a motion description based on the Associative Skill Memories (ASM) of DMP motion primitives for motion selection, which differ from the control scheme [31]. Statistical data such as mean and variance are saved at all times. After completing 90% of the current motion primitives, use the remaining 10% of the sensing data and the first 10% of the data of the candidate primitives to calculate the distance of several miles, and the nearest one is executed as the next step.

Niekum et al. [28, 29] proposed the serialization of robot manipulation tasks by the finite state machine method, and described the target points of each motion primitive in different coordinate systems, and then described by different coordinate systems. The classifier constituted by the target, and finally selects the motion primitive to be executed next in the way of classification. Manschitz et al. [4, 9, 30, 32] proposed a method for learning task manipulation graphs from Kinesthetic Teaching, and learned the conversion probability between various motion primitives from the experience of multiple teachings. The method is successfully applied to the task of unscrewing a light bulb by a multi-step robot.

Zhe Su et al. [33] introduced and evaluated a framework to autonomously construct manipulation graphs from manipulation demonstrations. Our manipulation graphs include sequences of motor primitives for performing a manipulation task as well as corresponding con-tact state information. The sensory models for the contact states allow the robot to verify the goal of each motor primitive as well as detect erroneous contact changes. The proposed framework was experimentally evaluated on grasping, unscrewing, and insertion tasks on a Barrett arm and hand equipped with two BioTacs.

In summary, due to the teaching programming and offline programming methods of traditional industrial robots can no longer satisfy the representation of multistep and complex robot manipulation tasks. In this chapter, through comparison and analysis, methods and theories for describing complex tasks of robots have been developed. The research combined the methods of dynamic movement primitives (DMP) and finite state machines (FSM) to achieve task representation and serialization selection among different movement primitives, an simple pick-and-place task was set in [34].

#### **3.3 Graphical Representation of Robot Complex Task**

In the case of unstructured and dynamic environments, the task of the robot is difficult to complete from beginning to end with fixed and pre-programmed movements, because the uncertainty of the environment and the unpredictability of the state of the robot in actual manipulation will cause the task The failure or exception occurred. For example, during the movement of the Baxter robot during the pick-and-place task, the robot will not be able to adjust the movement to adapt to changes in the environment due to changes in the pose of the object. Based on this problem, this paper proposes a way to artificially segment a complex task of a robot by describing multiple movement primitives, and uses a finite state machine mechanism to construct the task to implement a directed graph and motion primitive description of the task.

A finite state machine is a mathematical model that represents a finite number of states and the transition between these states. It is usually represented by a five-tuple: *M* = (*Q*, *q*0, -, δ, *F*), where *Q* is a non-empty finite set of states; *qi* ∈ *Q* represents a state; *q*<sup>0</sup> is the initial state; is the input condition Instruction; δ is a transfer function between states. Each state in the state machine stores the related information of the past, reflecting the changes before and after the input of the state machine; and the state transition indicates that the current state has changed. State transition rules are used to describe the necessary conditions for state transition. Generally, a directed state transition diagram can be used to describe the working condition of a finite state machine, which can clearly indicate the transition relationship and transition conditions between different states. As shown in Fig. 3.1, a system includes a finite state machine with four states, where each node represents a state *qi* , and

**Fig. 3.1** Illustrates the graphical representation of FSM state transition

each edge represents a transition relationship between states. If the input of state *q*<sup>0</sup> is *a*, the movement proceeds to state *q*1, if the input is *b*, the movement proceeds to state *q*2, and the other states transition The relationship is the same. The final node *q*<sup>3</sup> represents the end state of the system.

According to the understanding of the finite state machine, the robot's execution task is represented by a multi-segment movement primitive, and its primitive is derived from the human's definition and understanding of the task. Generally, the following five factors need to be considered when segmenting tasks:


Based on the above considerations, the finite state machine description of the autonomous pick-and-place task of the Baxter robot in Sect. 4.5.3 is implemented, as shown in Fig. 3.2. Finally, in order to improve the adaptability and generalization ability of operating tasks in an unstructured dynamic environment, this paper uses the model learning method to model motion primitives between states on the premise of finite state machine description of tasks. In order to be able to adapt to the needs of the task and changes in the environment. In particular, in this paper, the motion primitive learning method DMP based on dynamic systems is used to model the movement

**Fig. 3.2** Illustrates the FSM implementation of Baxter robot performing pick-and-place task

primitive. Thanks to this method, it only needs to learn from a human demonstration trajectory and good generalization. After learning the motion primitives, when a new starting amount and target amount are given to each movement behavior, a new movement behavior can be generated by referring to the demonstration movement, and a new manipulation task is generated. The specific parameterization and generalization process will be Explained in the next subsection.

# *3.3.1 Learning Graphical Representation from Unstructured Demonstration*

# • **Demonstration Collection**

Generally, capturing the demonstrations by receiving the multimodal input from the end-user, such as a kinesthetic demonstration. The only restriction on the multimodal data is that all the signals must be able to be synchronously recorded at the same frequency, i.e. temporal alignment. Additionally, the multimodal data at each time step should include the Cartesian pose and velocity of the robot end-effector (in case of object-grasping, will along with any other relevant data about the end-effector, e.g. the open or closed status of the gripper and the relative distance between the end-effector and object.) as well as the signals from F/T sensor and tactile sensor. Subsequently, the recorded Cartesian pose and velocity trajectory of the end-effector will be referred to as the kinematic demonstration trajectory for controlling the robot motion, and the recorded signals of F/T sensor and tactile sensor are applied for learning the introspective capacities.

#### • **Finite State Machine**

Defining the set *M* as the union of all the unique states *z* ∈ {1, 2,..., *K*} from the derived hidden state sequences *Z* = {*Z*1, *Z*2,..., *Z<sup>N</sup>* ,}, a Finite State Machine (FSM) that represents the task can designed to be constructed by creating nodes<sup>1</sup> *N*1, *N*2,..., *Nm*, where *K* ≤ *m* and each element should correspond to the same hidden state *z*. For *k* ∈ {1,..., *m*}, each node *Nk* is assigned a set of *M<sup>k</sup>* of all exemplars that have the same hidden state and executing phase, and the starting and goal pose are also recorded with *Ns*, *Ne*, respectively. A *m* × *m* transition matrix *T* can then be constructed, where each element *Ti*,*<sup>j</sup>* is originally set to 1 if there exists a directed transition from *Ni* to *Nj* , and 0 otherwise. We will implement the incremental task exploration by modulating this transition value as described in Chap. 6 in two ways. Consequently, we hopefully learn the robot complex task representation by formulating each state of constructed finite state machine using a state-specific movement primitive technique.

<sup>1</sup>Here, the term "node" is used to instead of the default "state" in FSM for avoiding confusing hidden state of HDP-VAR-HMM or state of the robot.

#### *3.3.2 Robot Movement Primitive Learning*

Learning movement primitive has been commonly applied to solve different robot tasks in isolation. This usually requires either careful feature engineering, or a significant number of demonstrations. This is far from what we desire: ideally, robots should be able to learn from very few demonstrations of any given complex task, and instantly generalize to new situations of the same task, without requiring taskspecific engineering. In this paper, we explore the Dynamical Movement Primitives (DMPs) for achieving such capability for equipping robot with the aforementioned generalization capability of movement primitive, which provides a framework for describing the dynamical systems as set of nonlinear differential equations. DMPs guarantee the stability and convergence by introducing an additional canonical system and also provide simple mechanisms for LfD. Particularly, a discrete movement DMP can be formulated by the transformation system, that is

$$\begin{aligned} \tau \dot{\mathbf{v}} &= K(\mathbf{g} - \mathbf{x}) - D\mathbf{v} - K(\mathbf{g} - \mathbf{x}\_0)\mathbf{s} + Kf(\mathbf{s}), \\ \tau \dot{\mathbf{x}} &= \mathbf{v}. \end{aligned} \tag{3.1}$$

where, the Eq. 3.1 is an extended PD control signal with spring and damping constants *K* and *D* respectively, position and velocity *x* and *v*, goal *g*, scaling *s*, and temporal scaling factor τ . The scaling term originates from an additional system that controls the system's phase execution by

$$
\tau \dot{\mathbf{s}} = -\alpha \mathbf{s},
\tag{3.2}
$$

where, the scalar α can be an arbitrary constant.

Additionally, the forcing term *f* (*s*)in Eq. 3.1 is used to alter attractor point dynamics and achieve an arbitrary trajectory, which commonly learned from an one-sot demonstration [1]. Thus, along with the Eq. 3.1, the forcing term *f* (*s*) can be formulated as a phase-dependent linear combination of a set of basis functions ψ*i*(*s*) (often use the Gaussian basis function), that is,

$$\begin{aligned} \tau f(\mathbf{s}) &= \frac{\sum\_{i=1}^{N} w\_i \psi\_i(\mathbf{s}) \mathbf{s}}{\sum\_i \psi\_i(\mathbf{s})} \\ \psi(\mathbf{s}) &= (-h\_i (\mathbf{s} - c\_i)^2) \end{aligned} \tag{3.3}$$

where, the basis function ψ(*s*) is formulated with mean *ci* and variance *hi* . Then, the forcing item is represented the linear combination of basis functions with variable weights*wi* and normalization constant- *<sup>i</sup>* ψ*i*(*s*). Phase *s* monotonically varies from 1 to 0 and controls phase progress by activating Gaussian centered at *ci* . The decreasing phase value ensures that the contribution of the forced term disappears and returns to the simpler point attractor dynamics to converge to the target. Thus, the (*g* − *x*) term in Eq. 3.1 that with respect to spatio-temporal scaling, which performs spatial scaling for guaranteeing the system can adjust to varying goals immediately. While the τ variable in Eq. 3.1 is designed for allowing the system to speed up or slow

**Fig. 3.3** Illustrates the learning and generalization of dynamical movement primitive

down the execution of a movement. Generally, the forcing term weights are learned from a kinesthetic demonstration with pose *x*(*t*), velocity *x*˙ as well as acceleration *x*¨ with duration *T* are extracted as in [29]. After learning movement primitive from One-shot demonstration, the target forcing turn of an unseen scenario is computed by adjusting Eq. 3.1, integrating Eq. 3.2 to convert from time to phase, and substituting appropriate values to yield,

$$f\_{target}(\mathbf{s}) = \frac{-K(\mathbf{g} - \mathbf{x}(\mathbf{x})) + D\dot{\mathbf{x}}(\mathbf{s}) + \tau \ddot{\mathbf{x}}(\mathbf{s})}{\mathbf{g} - \mathbf{x}\_0}. \tag{3.4}$$

Next, the goal is set to *g* = *x*(*T* ) and τ is selected such that a DMP reaches 95% convergence at *t* = *T* before using standard linear regression to compute the weights *wi* .

In summary, when a new starting pose and target pose are given after learning the motion primitives, a new movement trajectory can be generated with reference to the demonstration motion, thereby improving the diversity and adaptability of complex tasks. As shown in Fig. 3.3, the DMP model is learned from a human demonstration trajectory (blue curve in the figure). When different starting points and target points are given, the corresponding motion can be generated by the learned DMP model. Track (gray curve in the picture). In addition, the DMP model has a good generalization ability, which is conducive to improving the diversity and adaptability of robot manipulation tasks in an unstructured dynamic environment. In particular, in the generalization process of the SPAIR system proposed in this paper, the starting position of each motion primitive is determined by the human demonstration movement or the external vision system.

#### **3.4 Nonparametric Bayesian Method for Robot Movement Identification**

#### *3.4.1 Problem Statement*

As those aforementioned related works of robot complex task representation, the task should be composed of multiple segments of movement primitives, and new manipulation tasks can be generated through the learning and generalization of learned primitives. As the planned robot manipulation tasks become more and more complex, the movement primitive library will increase accordingly, and there will be multiple feasible primitives candidate during the transition in the task. At this time, the manipulation task described by the finite state machine will select the primitive with the highest probability value for execution according to the current observation value. By real-time monitoring of the robot's current behavior, that is, the identification that the robot is in or subsequently selecting that primitive, will help estimate the current execution status of the robot, and provide a basis for subsequent chapters for robot anomaly monitoring and recovery.

#### *3.4.2 Gradient of Log-Likelihood for Movement Identification*

Assume that *N* normal multimodal trajectories of each movement are collected for learning the common dynamics among them via kinesthetic demonstration or movement generation. To jointly model those trajectories using the Bayesian nonparametric hidden Markov model, particularly, the HDP-VAR-HMM (see details in Chap. 2) with linear Gaussian observation model is investigated, where each trajectory *n* ∈ *N* is represented as a multivariate time series that including multidimensional observation data *yn* = [*y*1, *y*2,..., *yT* ] at time *t*. Take a robot common manipulation task as example, *yt* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* could be an instant of wrench force, torque, and end effector velocity, where *d* is the total dimensionality of observed features. In HDP-VAR-HMM, each observation *yt* is interpreted by a positive integer, named hidden state *zt* = *k* ∈ {1, 2, 3,...}. A countably infinite set of consecutive integers *zt* that should be satisfied with the requirements that an initial probabilistic state distributions π<sup>0</sup> and a infinite transition distribution {π}<sup>∞</sup> *<sup>k</sup>*=<sup>1</sup> are generated using Markovian method. Consequently, the state sequence for all *t* > 1 with Markovian structure can be intuitively formulated as

$$\begin{aligned} z\_l &\sim \pi\_{z\_{l-1}}, \\ p(z\_1 = k) &= \pi\_0, \\ p(z\_l = k' | z\_{l-1} = k) &= \pi\_{kk'}.\end{aligned} \tag{3.5}$$

According to Eq. 3.5, the hidden state *zt*−<sup>1</sup> indexes all observations *yt* are assigned with *zt*−<sup>1</sup> at time step *t*. Since that, the observation *yt* given specific hidden state *zt* = *k* can be drew from its corresponding observation likelihood functions.

Since we assign the emission model of HDP-VAR-HMM as the first-order autoregressive Gaussian likelihood. As such, each observation *yt* are considered to be a noisy linear combination of the previous observation *yt*<sup>1</sup> plus additive Gaussian white noise and can be expressed as

$$
\epsilon\_t \mathbf{y}\_t = A\_k \mathbf{y}\_{t-1} + \epsilon\_t(\mathbf{z}\_t = k), \quad \epsilon\_t(k) \sim \mathcal{J}'(0, \Sigma\_k). \tag{3.6}
$$

Since the is assumed with zero mean, each state *k* should described with two dynamic parameters {*Ak* , Σ*<sup>k</sup>* }, where, *Ak* is a *d* × *d* matrix of regression coefficients that defines the expected value of each successive observation as <sup>E</sup>[*yt*|*yt*−1] = *Ak yt*−1, and Σ*<sup>k</sup>* is a *d* × *d* symmetric positive definite matrix that defines the covariance matrix of state *k*. Then, the problem turns into how to calculate the VAR observation likelihood?

Therefore, the Gaussian regression observation model explains a observed data pairs of input *yt* and *yt*−1, and output *zt* , which is different from the HMM case. Resulting that each input is a vector of length *<sup>y</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* , while each output is a scalar. With this dimension reduction, the task segmentation by grouping the unique hidden states. In particular, we focus on a generative model for learning the output data that depends on *yt* . As presented in [5], there are various generative models can be considered, such as full-covariance Gaussian, diagonal-covariance Gaussian, are possible for the observed input data. Consequently, the VAR observation likelihood can be directly define by the multivariate Gaussian log-likelihood function at specific state *zk* ,

$$\begin{split} \log p(\mathbf{y}\_t | \theta\_k, \mathbf{y}\_{t-1}) &= \log p(\mathbf{y}\_t | A\_k, \Sigma\_k, \mathbf{y}\_{t-1}) \\ &= \log \mathcal{N}(\mathbf{y}\_t | A\_k \mathbf{y}\_{t-1}, \Sigma\_k) \\ &= -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma\_k| - \\ &\frac{1}{2} (\mathbf{y}\_t - A\_k \mathbf{y}\_{t-1})^T \Sigma\_k^{-1} (\mathbf{y}\_t - A\_k \mathbf{y}\_{t-1}). \end{split} \tag{3.7}$$

We define the parameter space Θ = {θ1, θ2,..., θ*<sup>K</sup>* } for all the hidden states, where θ*<sup>k</sup>* = {*Ak* , Σ*<sup>k</sup>* } for denoting the parameters of *k*-th state. Where θ*zt*=*<sup>k</sup>* represents the observation parameters for the trained HMM and the *z*<sup>1</sup>:*<sup>T</sup>* is a state path over the hidden state space. As such, if the state path is estimated, the maximum probability of the observation sequence can be obtained. Therefore, given *M* trained models for *S* skills in the robot manipulation task (*M* is equal to *S* when using *k*-fold cross validation for optimal model selection), the optimal model of a skill is used along with the standard forward-backward algorithm to compute the *expected cumulative likelihood* of a sequence of observations.

However, since the true state path is hidden from the observations. For the computational convenience, the general approach would be to use the maximum likelihood state at any given moment, but this would neglect uncertainty. Instead, we use a marginal probabilistic representation over hidden states at each time step. Thus, the log-probability at time *t* is derived by computing the logarithm of the sum of exponentials over all hidden states

$$\mathcal{QC}\_t = \log \sum\_{k=1}^K \exp\left(\frac{\alpha\_t(k) \cdot \beta\_t(k)}{p(\mathbf{y}\_{0:T})}\right),\tag{3.8}$$

where α*t*(*k*) = *p*(*zt* = *k*|*y*0:*<sup>t</sup>*), β*t*(*k*) = *p*(*yt*+1:*<sup>T</sup>* |*zt*), are presented the forward message passing and backward message in the standard forward-backward algorithm, respectively.

$$\begin{aligned} \alpha\_t(k) &= \frac{1}{p(\mathbf{y}\_t|\mathbf{y}\_{0:t-1})} p(\mathbf{y}\_t|z\_t = k) p(z\_t = k|\mathbf{y}\_{1:t-1}), \\ p(z\_t = k|\mathbf{y}\_{0:t-1}) &= \sum\_j p(z\_t|z\_{t-1} = j) p(z\_{t-1}|\mathbf{y}\_{0:t-1}). \end{aligned} \tag{3.9}$$

The *expected cumulative likelihood* isE[*log P*(*y*1:*<sup>T</sup>* <sup>|</sup> <sup>Θ</sup>*m*)]for each trained model *m* ∈ *M* using Eq. 3.8, where Θ*<sup>m</sup>* represents the parameters of the trained model. That is, given a test trial ξ, the cumulative log-likelihood is computed for test trial observations conditioned on all available trained skill model parameters *log P*(*y*<sup>ξ</sup>1:ξ*<sup>t</sup>* | Θ)*<sup>M</sup> m* at a rate of 100 Hz. The process is repeated when a new skill*s* ∈ *S* is started. Given the observable position in the FSM <sup>ξ</sup>*c*, we can index the correct log-likelihood <sup>I</sup>(ξ*<sup>c</sup>* <sup>∈</sup> *<sup>s</sup>*) and see if the probability density of the test trial given the correct model is greater than the rest:

$$\begin{aligned} \log P(\mathbf{y}\_{\xi\_1;\xi\_c} \mid \Theta\_s) &> \log P(\mathbf{y}\_{\xi\_1;\xi\_c} \mid \Theta\_m) \\ \forall m (m \in M \land m \neq s) . \end{aligned} \tag{3.10}$$

If so, the skill identification is deemed correct and we record the time at which the correct classification was flagged. We compute both a classification accuracy matrix and the mean time threshold value by cross-validation period.

#### **3.5 Experiments and Results**

Results for nominal and anomalous skill identification are reported independently for clarity. In the case of anomaly monitoring, the trials within the test folds for skill identification were randomly intermixed. A 12-dimensional observation vector consisting of 6D wrench measurements and 6D pose values at each time step *t*: *yt* = ( *fx* , *f <sup>y</sup>* , *fz*, *mx* , *my* , *mz*, *x*, *y*,*z*,*r*, *p*, *y*). All the parameters of the nonpara-

**Fig. 3.4** Experiment 1: the skills of a traditional pick-and-place task by FSM: **i** Pre-pick, **ii** Pick, **iii** Pre-place, and **iv** Place

metric methods presented in this paper are set to the same values as those found in [5], when used for speech recognition classification.

An autoregressive order *r* = 2 is used. The parameter mean matrices *M* and *K* are set such that the mass of the matrix-normal distribution is centered around stable dynamic matrices while allowing variability in the matrix values. We start by assuming the mean matrix *M* = **0** and setting *K* = 10 ∗ *Im*. The inverse-Wishart portion of the prior is given by ν<sup>0</sup> = *m* + 2 DoF (the smallest integer setting to maintain a proper prior). The scale matrix *S*<sup>0</sup> = 0.75∗ is the empirical covariance of the data set. Also, setting the prior from the data can move the distribution mass to reasonable parameter space values. A Gamma(a, b) prior is set on the sHDP concentration parameters α + κ and γ. A *Beta*(*c*, *d*) prior is set on the self-transition parameter ρ. We choose a weekly informative setting and choose: *a* = 1, *b* = 0.01, *c* = 10, *d* = 1. Finally, the Gibbs sampling parameters truncation level and maximum iterations are set to 20 and 500 respectively.

In order to assess the robot introspection ability of the sHDP-VAR(*r*)-HMM, two real-robot platforms were tested: (i) a Baxter robot for a pick-and-place task, as shown in Fig. 3.4 and (ii) a HIRO-NX robot for the snap assembly task, as shown in Fig. 3.9. Both robots used F/T sensors (Robotiq FT300 and a JR3 respectively) that were placed on the robot wrists to measure forces and moments at 100 Hz. End-effector pose signals were computed using forward kinematics. Our workstation consisted of an 8-core Intel Xeon processor, 16GB RAM, running Linux Ubuntu 14.04 and ROS-Indigo.

#### *3.5.1 Experiment 1: Baxter Robot Performs Pick-and-Place Task*

The dual-arm Baxter robot is used for the pick-and-place task. The latter is bootstrapped via a finite state machine composed of four skills (see Fig. 3.4) and the goal is to use the sensed wrench and pose signals in building a statistical model for each skill through the proposed sHDP-VAR(*r*)-HMM method. Figure 3.5 shows an example of the wrench signals for each skill (shown in different colors) of this task. The signals were segmented into four skills according to the states shown in the

**Fig. 3.5** Experiment 1: recorded wrench signals in the pick-and-place task. Different colors encode different skills

finite state machine in Fig. 3.4. Notice that as the task is being executed, significant patterns emerge within and in-between skills. For instance, if we examine the signals for *Go-to-Pick-Hover* and *Go-to-Pick* shown in Fig. 3.4, we can identify patterns that can be encoded with statistical significance using our model. To evaluate the performance of the sHDP-VAR(*r*)-HMM models, we tested 24 samples for each skill through leave-one-out cross validation [35]. The optimal model for each skill was selected via maximum likelihood estimation.

For nominal skill identification, we test normally executing skills. Figure 3.6 shows corresponding signals generated on the basis that for *ith* skill model output is 1 if its cumulative likelihood is greater than the rest of the models at each time step, else 0. If a model has a maximum output cumulative likelihood value during the last time step in a sample, then the value is set to 1, implying this observation belongs to that model class (the correct skill identification); otherwise, the firing would be 0. From the skill intervals shown in Fig. 3.5 and the models outputs of in Fig. 3.6, the monitoring accuracy for the sHDP-VAR(2)-HMM based method can be readily proposed. The same scheme can be inferred for the other skills.

We assessed the performance of the model by comparing with other similar techniques. Namely, the sticky HDP-HMM with Gaussian observation model and the sHDP-VAR(1)-HMM with a first order autoregressive model. We also show the performance by using different multimodal signal combinations. A confusion matrix is used for accuracy evaluation of skill identification for the different methods and shown in Fig. 3.8. Notice from Fig. 3.6, that at the beginning of each skill, the identifi-

**Fig. 3.6** Experiment 1: sHDP-VAR(2)-HMM model outputs for the pick-and-place task

cation suffers from low accuracies. This is due to the way in which we cumulatively compute the log-likelihood (not unlike humans that often need more information before confidently classifying). We compute the average test time to perform correct classification and report it as a percentage of total duration for a given skill (see Table 3.1). Low percentages imply quick identification, while large ones imply slow decisions.

For anomaly monitoring, we manually induced three types of external perturbations to collect anomaly data. First we introduce human collisions. Human collisions cause a pose displacement of a held object and leads to failures in the placement/assembly of that object. Five failure trials were induced for each perturbation, for a total of 15 anomalous trials (and 24 nominal trials). Receiver Operating Characteristic (ROC) curves were used to measure the discriminative ability of the system across wrench and pose sensor-modalities (i.e. wrench and wrench-pose modalities). Constant *k* determines our classification threshold and is optimized to obtain the best monitoring performance. Comparisons between unimodal wrench signals and multimodal pose-wrench signals are conducted. Figure 3.7 shows ROC curves evaluate the relationship between the false positive rate (FPR) and true positive rate (TPR). For any given true-positive rate, multimodal tests resulted in lower false-positive rates when compared to unimodal versions. Our method also enjoyed high true-positive rate when compared to other introspective methods.

**Fig. 3.7** Experiment 1: Receiver operating characteristic (ROC) curves for the pick-and-place task. The figure shows ROC curves that compare the performance of multimodal and unimodal sensing for anomaly monitoring and the performance of our anomaly monitoring method versus other baseline methods

**Fig. 3.8** The identification accuracy of different methods with different multimodal signals in a pick-and-place experiment

#### *3.5.2 Experiment 2: HIRO-NX Robot Performs Snap Assembly Task*

In this experiment, so as to show the applicability and superiority of the suggested sHDP-VAR-HMM monitoring scheme in a real-world assembly task of industrial relevance. A 6 DoF dual-arms HIRO robot with electric actuators and a Robotiq


**Table 3.1** Average time for correct anomaly identification in a pick-and-place task. The notation "pPick" and "pPlace" are the abbreviation from "Pre-pick" and "Pre-place", respectively

**Fig. 3.9** Experiment 2: the skills in a snap assembly task FSM: **i** Approach; **ii** Rotation; **iii** Insertion; **iv** Mating

6DoF force-torque sensor attached on the wrist is used to perform a snap assembly task of camera parts as shown in Fig. 3.9. A custom end-effector holds a male part, while a female part is fixed to a table in front of the robot. The tool center point (TCP) is placed at the location where the male and female parts make contact. The world reference frame was located at the manipulator's base. The TCP position and orientation were determined with reference to the world coordinate frame *T o*. The force and moment reference frames were determined with respect to the wrist's frame. OpenHRP [10] executes the FSM and modular hybrid pose-force-torque controllers execute the skills. Four nominal skills are connected by the FSM: (i) a guarded approach, (ii) an alignment procedure, (iii) a snap insertion with high elastic forces, and (iv) a mating procedure. Unexpected events occur during initial parts' contact, (e.g. the wrong parts localization) or during the insertion stage where wedging is possible. An example of the recorded wrench signals is shown in Fig. 3.10.

**Fig. 3.10** Experiment 2: recorded wrench signals in the snap assembly task. Different colors encode different skill

**Fig. 3.11** Experiment 2: sHDP-VAR(2)-HMM model outputs in the snap assembly task

As the procedure in experiment 1, forty-four real-robot nominal trials and 16 anomalous trials were conducted. We measured the skill identification and anomaly monitoring for the modeling schemes considered in experiment 2. The output of the proposed HDP-VAR(2)-HMM scheme is shown in Fig. 3.11. The identification accuracy per robot skill is presented in Fig. 3.13 and the average time for computing correct classifications is reported as a percentage of total skill duration (see Table 3.2). And

**Fig. 3.12** Experiment 2: Receiver operating characteristic (ROC) curves for the snap assembly task across multimodal scenarios and baseline methods. Unimodal wrench signals are compared with multimodal pose and wrench signals

**Fig. 3.13** The identification accuracy of different method with variable sensor signals in snap assembly experiment

as in experiment 1, we still compare performance with the same baseline methods. Meanwhile, the ROC curves confirm the anomaly monitoring performance which is illustrated in Fig. 3.12. It is thus seen, that the proposed process monitoring scheme has both accurate and efficient performance in contact tasks.


**Table 3.2** The average time of computing correct identification in snap assembly task

#### **3.6 Summary**

This chapter mainly focuses on the representation of robot complex, multi-step manipulation tasks and recognition of movement during online execution. Address the issue of task representation, we should give enough consideration to human understanding and intention in human-robot interactive manipulation tasks such that artificially segment complex tasks into multiple sub-tasks, and adopt dynamics that require only one demonstration movement of humans. The dynamical movement primitive (DMP) modeling method learns a parameterized primitive for each subtask. When given a task, a combination of finite state machine and movement primitives is proposed to describe a complex manipulation task. The described sub-tasks can use different starting points and ending points than the demonstration as the motion basis. The input of the meta-model is used to generate new manipulation tasks, thereby improving the diversity and adaptability of manipulation tasks.

For the problem of robot movement recognition, a method of comparing the log-likelihood function gradient values of the observed values at the current moment under all the learned nonparametric Bayesian models is proposed. First, it is assumed that the manipulation task of the robot is decomposed into a set of movement primitives. Then, the multidimensional signals of each robot movement primitive under normal conditions is collected by repeatedly performing the manipulation task, and the probability model is established for each primitive by using the method of Chap. 2. Finally, the log-likelihood function gradient values of the observations under each model are evaluated in real time to realize the recognition of the movement behavior during the execution of the robot. Two evaluation indexes of recognition accuracy and recognition efficiency (response efficiency) of movement are applied to the proposed method on the two tasks of "HIRO-NX robot performing electronic component assembly tasks" and "Baxter robot performing autonomous pick and place tasks". The comparisons of parametric and non-parametric models and various modalities combinations were performed, resulting in proofing the feasibility and effectiveness of the proposed movement recognition method in this chapter.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 Nonparametric Bayesian Method for Robot Anomaly Monitoring**

**Abstract** In this chapter, we introduce an anomaly monitoring pipeline using the Bayesian nonparametric hidden Markov models after the task representation and skill identification in previous chapter, which divided into three categories according to different thresholds definition, including (i) log-likelihood-based threshold, (ii) threshold based on the gradient of log-likelihood, and (iii) computing the threshold by mapping latent state to log-likelihood. Those method are effectively implement the anomaly monitoring during robot manipulation task.We also evaluate and analyse the performance and results for each method, respectively.

## **4.1 Introduction**

Anomalies, also known as outliers, are observations that deviate too much from other observations to cause suspicion that they are caused by different mechanisms. Anomaly monitoring is a widely studied problem in machine learning, and has important significance in the fields of network intrusion monitoring, credit card fraud monitoring, and medical health monitoring. In robot manipulation, abnormality refers to abnormal behaviors, events, or situations. Robotic abnormality indicates that different from normal execution or observed data that is different from normal, especially in the dynamic environment of human-computer interaction, abnormalities Usually stems from changes in the environment and human beings. Assuming that the robot can monitor, classify, and repair these anomalies, it will reduce or prevent the robot from being potentially injured and improve the safety of human-robot collaboration. Due to the diversity and complexity of anomaly types, it is not possible to directly enumerate and model all anomalies, and to implement anomaly monitoring through supervised learning. In general, research on robot abnormality monitoring is often implemented using unsupervised learning methods. In other words, it is abnormal to identify abnormal phenomena from the normal behavior model of the robot.

#### **4.2 Related Work**

The premise of performing anomaly monitoring is to endow robots with longerterm autonomy. To this end, we would like to discriminate between nominal and anomalous behavior while the robot executes a task. Researchers usually use two different methods to solve the problem of robot multimodal anomaly monitoring [1, 2]. On the one hand, a probability statistical model is established for each modal signal such as Rodriguez et al. [3] classify the success and failure of the robot in the assembly task by extracting the magnitude of the single-modal output force signal [4]. Ando et al. [5] used the K-Nearest Neighbor (KNN) algorithm to independently measure the differences between modalities to monitor the abnormal behavior of a mobile robot. Pastor et al. [6] used multi-modal sensing signals to monitor the failure of the robot to flip the box with chopsticks. This method first uses DMP to describe the motion of each stage, and then models each modality independently and combines normal statistical features (e.g. mean, variance, etc.) of the data for establishing ASM of each movement primitive to realize abnormality monitoring. The occurrence of abnormality is defined as if the signal value at least five times within three consecutive seconds is lower than a given threshold. Chu et al. [7] proposed an anomaly monitoring approach by training multiple HMM models for each modality given the observed values, and then the log-likelihood of each HMM was calculated as the feature vector for support vector machine model SVM (Support Vector Machine, SVM).

On the other hand, a probabilistic statistical model is established for all modality signals. Clifton et al. [8] used Gaussian Mixture Model (GMM) to model multi-modal signals, and realized anomaly monitoring through extreme value theory. Qiao et al. [9] proposed the use of a sequence of recessive states in the HMM model to monitor abnormal instructions. Zhang et al. [10] proposed a hierarchical HMM (Hierarchical HMM, HHMM) for system intrusion anomaly monitoring under the premise of the HMM method. This method uses HHMM to model normal data and is reflected by anomalies in hidden state sequences. Out the system intrusion event. Park et al. [11– 13] carried out an online anomaly monitoring method based on multi-modal sensory signals fusion during robot manipulation, where a parametric HMM (Left-to-right HMM) models a multi-modal time series of a robot performing normal motions, and proposes abnormal thresholds that change by the execution progress to achieve abnormal monitoring.We extend the HMM for anomaly monitoring in the subsequent part of this chapter that inspired by the HMM cases. In addition, the proposed method uses a variety of complementary modal information to detect a variety of potential anomalies in the human-robot interaction process on the system where the robot feeds the disabled, such as human abnormal contact and environmental collision.

Hu et al. [14] proposed an abnormal event monitoring method based on HDP-HMM. This method combines the Fisher Kernel into the OCSVM (One-Class Support Vector Machine) model, which has the advantages of generating models and discriminative models. Dilello et al. [15, 16] first proposed the application of the nonparameterized Bayesian time series model (sHDP-HMM) for abnormal monitoring and fault diagnoses during robot operation, using the model to transmit force/torque of the robot to perform normal alignment tasks [17, 18]. Sensing data is used for modeling, and the threshold value for abnormality monitoring is calculated from the validation data set. Then, the log-likelihood function value of the online observation data is calculated based on the learned model. Finally, the robot online is compared with the abnormality threshold value for anomaly monitoring.

In this chapter, we define an anomaly whenever a data stream is structurally different from prior learned models [19]. As Pimentel et al. review literature shows, many anomaly monitoring approaches have been proposed in the last decade [1]. However, few works focused on multivariate time series with spatial and temporal dependencies as in our case. In Milacski et al. [20], the method requires the full data sequence for processing, thus precluding the online detection capability. Besides our previous works [21, 22], others have used non-parametric Bayesian approaches to learn HMMs in contact tasks for abnormal detection and recognition. In [16, 23], DiLello et al. used the sticky hierarchical Dirichlet process prior to learn the model parameters of an HMM based on wrench signatures for an industrial robot alignment task. The work learned specific failure modes online. The approach is robust as it is models not just individual contact events but rather the evolution of signals, it does so while incorporating domain knowledge, and adjusting its model complexity according to the data patterns presented. Nonetheless, the approach is limited by the assumptions of the HMM in which observations are conditionally independent given the state. This is often an insufficient condition to capture the temporal dependencies of complex dynamical behaviors. In [24], Niekum et al. used a beta-process prior on the vector auto-regressive (BP-AR-HMM) to extract the state complexity from the data. The BP-AR-HMM has the ability to discover and model primitives from multiply related time-series. Robot skill discrimination was reliable and consistent; however, this work only modeled robot joint angles in a primarily noncontact oriented task. In [25], Hu et al. implemented anomaly monitoring based on HDP-HMM models and analyzed the efficiency and effectiveness of beam sampling in the HDP-HMM model. This implementation however suffers of low computational efficiency. Instead, we explore efficient online inference algorithms instead of the conventional sampling algorithms.

#### **4.3 Problem Statement**

Anomaly monitoring is used to continuously monitor the motion status of the robot to identify unexpected events during execution. Based on the recognition of robot motion in Chap. 3, this chapter describes in detail the abnormal event monitoring method for robots based on non-parametric Bayesian models. The proposed method is also applicable to the subsequent anomaly diagnoses. The main implementation ideas are: first, modeling the normally executed multi-modal data; then, test the normal movement behavior data based on this model. In order to obtain the definition of the boundary between normal and abnormal, that is, the calculation of the abnormal threshold; finally, monitoring the testing samples based on the model and the corre-

**Fig. 4.1** Illustration of the anomaly monitoring procedure after skill identification

sponding threshold. In order to better study the theory and methods of multivariate time series anomaly monitoring, this chapter proposes the non-parametric Bayesian models HDP-HMM and sHDP-HMM in different observation models based on the parametric HMM model. An anomaly monitor implemented under the combination of modal information and different anomaly threshold methods. The anomaly decision of the proposed anomaly monitor on robot motion behavior includes: normal, single anomaly, and multiple anomalies.

An ideal online anomaly monitoring system often requires the use of multimodal fusion methods to respond to different types of anomalous events, such as force/torque sensing information that can directly detect robotic collision anomalies and tactile sensing information that can directly sense objects. And the ability to respond to abnormal event occurrences in a timely manner (high response rate) and low false touch rate. In order to meet such requirements, this chapter addresses the problem of multi-modal fusion robot anomaly monitoring, and proposes the performance of anomaly monitors with three different probability models, different sensor information fusion, and different anomaly thresholds. The overview framework proposed after skill identification in Chap. 3 as shown in Fig. 4.1.

#### **4.4 Log-Likelihood for Anomaly Monitoring**

#### *4.4.1 Problem Formulation*

The robot introspection model uses non-parametric Bayesian priors along with a Hidden Markov model (HMM) and either Gaussian or Autoregressive emission models. First, HMMs are a doubly stochastic and generative process used to make inference on temporal data. The underlying stochastic process contain a finite and fixed number of latent states or modes *zt* which generate observations *Y* = {*y*1,..., *yt*} through mode-specific emission distributions *b*(*yt*). These modes are not directly observable and represents sub-skills or actions in a given node of a task. Transition distributions, encoded in transition matrix *Aji* , control the probability of transitioning across modes over time. The model assumes conditionally independent observations given the generating latent state. Given a set of observations, the Baum–Welch algorithm is used to infer model parameters Π = (*A*, *b*). The fixed-modes assumption in HMMs limits the model's expressive power as it is unable to derive natural groupings. Bayesian nonparametric priors extend HMM models to learn latent complexity from data [26, 27]. We use Fox's et al. sticky-Hierarchical Dirichlet Process (sHDP) prior with an auto-regressive switching system [26] to model nominal skills as in our previous work [21] and derive more robust failure identification methods in manipulation tasks, specially in recovery scenarios.

Bayesian statistics are combined with the sHDP prior to both learn model complexity *k* from the data but also to model the transition distribution of an HMM. The sticky parameter in the prior discourages fast-mode-switching otherwise present. Consider a set of training exemplars *<sup>t</sup>* = {*x*1,..., *xT* } of observed multi-modal data τ consisting of Cartesian pose and wrench values. Then, a mode-dependent matrix of regression coefficients **<sup>A</sup>**(*k*) = [*A*(*k*) <sup>1</sup> ... *A*(*k*) *<sup>r</sup>* ] ∈ <sup>R</sup>*d*×(*d*∗*r*) ] with an *r*th autoregressive order and *d* dimensions is used along with a measurement noise Σ, with a symmetric positive-definite covariance matrix. The generative model for the sHDP-AR-HMM is summarized as:

$$G\_0 = \sum\_{k=1}^{\infty} \beta\_k \delta\_{\theta\_k} \quad \beta|\_{\mathcal{Y}} \sim GEM(\boldsymbol{\nu}).$$

$$\theta\_k | G\_0 \sim G\_0. \tag{4.1}$$

$$G\_j = \sum\_{k=1}^{\infty} \pi\_{jk} \delta\_{\theta\_k} \ \pi\_j | \alpha, \beta \sim DP(\alpha, \beta).$$

The probability measure *G <sup>j</sup>* , which models the transition distribution of the modes π*<sup>j</sup>* determines the weights (probabilities) of transitioning between modes δθ*<sup>k</sup>* . To avoid fixing the mode complexity, *G <sup>j</sup>* uses a prior *G*<sup>0</sup> that is unbounded and can grow with the complexity of the data. While *G <sup>j</sup>* uses the same set of modes as *G*0, *G <sup>j</sup>* introduces variations over those points. *G*<sup>0</sup> provides support for a possibly infinite space, but due to the Dirichlet's process properties (the Chinese Restaurant Process), a finite set of modes is selected. In fact, we can understand the hierarchical specification as *G <sup>j</sup>* = *D P*(α, *G*0).

Different observation models can be included into the HMM. A Gaussian distribution with different covariance models (full, diagonal, and spherical) are considered. For Gaussian models, mode specific means and standard deviations are used θ*zt* = *N* (μ, σ<sup>2</sup>). Additionally, the sHDP-HMM can be used to learn VAR processes, which can model complex phenomena. The transition distribution is defined as in the sHDP-HMM case, however, instead of independent observations, each mode now has conditionally linear dynamics, where the observations are a linear combination of the past *r* mode-dependent observations with additive white noise. A prior on the dynamic parameters {*A*(*k*) , Σ(*k*) } is necessary. A conjugate matrix-normal inverse Wishart (MNIW) was used to this end. The generative process for the resulting HDP-AR-HMM is then found in Eq. (4.2).

$$\text{Observation Dynamic: } \mathbf{y}\_t = \sum\_{i=1}^r A\_i^{(z\_t)} \mathbf{y}\_{t-i} + e\_t(z\_t).$$

$$e\_t \sim \mathcal{N}(\mathbf{0}, \Sigma). \tag{4.2}$$

$$\text{Mode dynamics: } z\_t^{(i)} \sim \pi\_{z\_{t-1}^{(i)}}^{(i)}.$$

By using the model over a set of multi-modal exemplar data *<sup>t</sup>* , the sHDP-AR-HMM can discover and model shared behaviors in the data across exemplars. Scalable incremental or "memoized" variational coordinate ascent, with birth and merge moves [27] is used to learn the posterior distribution of the sHDP-HMM family of algorithms along with mean values for the model parameters θ of a given skill, hence θ*Sm* = {Π, **A**}*Sm* .

#### *4.4.2 Skill Identification*

The robot introspection system simultaneously detects nominal skills and anomaly events. Models are trained for individual skills to capture dynamics from multi-modal observations through vector τ*m*. τ*<sup>m</sup>* consists of end-effector pose and wrench values. Scalable incremental or "memoized" variational coordinate ascent, with birth and merge moves [27] is used to learn the posterior distribution of the sHDP-HMM along with mean values for the model parameters Π of a given skill *s*. Hence Π*<sup>s</sup>* = {π, **A**}: the transition matrix and regressor coefficients.

Given *S* trained models for *M* robot skills, scoring is used to compute the *expected cumulative likelihood* of a sequence of observationsE[*log P*(*<sup>Y</sup>* <sup>|</sup>Π*s*)]for each trained model *s* ∈ *S*. Given a test trial *r*, the cumulative log-likelihood is computed for test trial observations conditioned on all available trained skill model parameters *log P*(*yr*1:*rt* |Π )*<sup>S</sup> <sup>s</sup>* at a rate of 200 Hz (see Fig. 4.2 for an illustration). In Fig. 4.2, note the 4 probabilistic models. Given an indexed position in the graph, an anomaly threshold corresponding to that skill's expected cumulative log likelihood. The figure also illustrates how at the end of a skill, all data is reset and restarted. The process is repeated when a new skill *m* is started. Given the position in the graph *sc*, we can index the correct log-likelihood <sup>I</sup>(Π*<sup>s</sup>* <sup>=</sup> *sc*) and see if its probability density of the test trial given the correct model is greater than the rest:

**Fig. 4.2** Illustration of cumulative log-likelihood (L-) curves for four skills *s* in a task. As such, 4 probabilistic models Π*<sup>s</sup>* are derived. Given an indexed position in graph *Ni* , one can generate four L-curves. The curve for the correct model, has a value increasing likelihood, while the other graphs have value decreasing likelihoods. One can also see how at the termination of a skill, all data is reset and restarted. No anomalies are shown in this plot

$$\begin{aligned} \log \, P(\mathbf{y}\_{r\_1; r\_t} | \varPi\_{correct}) &> \log \, P(\mathbf{y}\_{r\_1; r\_t} \mid \varPi\_s) v \\ &\forall s (s \in S \land s \neq s\_c) .\end{aligned}$$

If so, the identification is deemed correct, and the time required to achieve the correct diagnoses recorded. At the end of the cross-validation period, a diagnoses accuracy matrix is derived as well as the mean time threshold value (these results were also reported in [21]).

#### *4.4.3 Anomaly Monitoring*

Due to the characteristics of abnormality, diversity, and unpredictability in the robot's motion behavior, unsupervised learning is the only way to achieve abnormality monitoring. In particular, this section proposes the use of log-likelihood function values based on non-parameterized Bayesian probability statistical models to implement anomaly monitoring. The main reason is that the probability model learned from normal data will behave when testing anomalous values The cumulative loglikelihood function value decreases sharply. Additionally, the curve of the cumulative log-likelihood function value that calculated from normal movement behavior will show an increasing trend, and the curve calculated with abnormal motion behavior will drop significantly when anomaly occurred. Therefore, a method based on the log-likelihood function of the model is proposed to realize the abnormal monitoring of the robot's motion behavior by setting the abnormal lower limit threshold.

Anomaly monitoring assumes that the cumulative log-likelihood *L* of a set of nominal skill exemplars share similar cumulative log-likelihood patterns. If so, the expected cumulative log-likelihood derived in training can be used to implement an anomaly threshold *F*1. Initially, we consider a likelihood curve generated from training data for a given skill *s*. Then, for each time step in an indexed skill *sc*, the anomaly threshold is set to

$$F1\_{s\_c} = \mu(L) - c \ast \sigma(L) \tag{4.3}$$

where *c* is a real-valued constant that is multiplied by the standard deviation to change the threshold. Here, we are only interested in the lower (negative) bound. Then, an anomaly is flagged if the cumulative likelihood crosses the threshold at any time: if *log P*(*yr*1:*rt* | Π*correct*) < *F*1*sc* : anomaly, else nominal. However, it is found that the threshold obtained by Eq. (4.3) makes the abnormal monitor frequently trigger falsely (False Positive) during actual application, which means that the normal execution condition monitoring becomes abnormal. In addition, the determination of constant variables requires tedious interactive verification. Aiming at this problem, this section refers to the experience of robot motion recognition, and determines the threshold of abnormality by solving the gradient information of each log-likelihood function value vector.

Upon our initial exploration of recovery schemes we noticed that after resetting the cumulative log-likelihood observations in the HMM model, false-positive anomalies were triggered at the beginning of the skill. Further examination revealed that the standard deviation of cumulative log-likelihood graphs during training began with small variances but grew over time as shown in Fig. 4.3. Given that variances are small at the beginning of the task, small variations in observations can trigger failure flags.

A second threshold definition was designed to overcome this situation and used in our work instead of F1. As the difference between *L* and *F*1 is minimal at first the new anomaly threshold *F*2 (for an indexed skill) is focused on computing the derivative of the difference:

$$F2\_{s\_c} = \frac{d \mid L - F1\_{s\_c} \mid}{dt} \tag{4.4}$$

Figure 4.4, illustrates the derivative signal crossing the empirical anomaly threshold as anomalies are triggered by external disturbances.

#### *4.4.4 Experiments and Results*

Three manipulation experiments are designed to test the robustness of the online decision making system for anomaly recovery. We use accidental human collisions as disturbances for a pick-and-place and open-and-close-drawer tasks, and tool col-

**Fig. 4.3** This plot shows how the standard deviation of the cumulative log-likelihoods computed during training grows over time. The standard deviation grows as observations show greater variance due to the accumulation of error

**Fig. 4.4** Example of how the anomaly identification metric crosses the anomaly threshold as anomalies are triggered by external perturbations during a skill execution

**Fig. 4.5** Two examples (pick-and-place (UPPER) and open-and-close-drawer) in which a human collaborator accidentally collides the robot. The introspection system identifies an anomaly (see bottom left plots) and triggers a recovery behavior (see the fluorescent node in the graph on the right)

lisions for a pick-and-place task. For each of the three experiments, we test four separate conditions:


The pick-and-place task consists of 5 basic nodes (not counting the home node and the recovery node): pre-pick, pick, pre-pick (returns to an offset position), preplace, and place. The open-and-close-drawer task consists of 5 nodes: pre-grip, grip, pull-to-open, push-to-close, and go-back-to-start. The direction and intensity of the human collisions was random but all executed under the sense that these are accidental contacts as a human user reaches into the workspace of the robot without noticing the robot's motion. We note that while the tasks are simple, the main focus of the work is placed on the robustness of the recovery systems given different external disturbance scenarios. A dual-armed humanoid robot -Baxter- was used. All code was run in ROS Indigo and Linux Ubuntu 14.04 on a mobile workstation with an Intel Xeon processor, 16 GB RAM, and 8 cores.

For motion generation, two techniques were used for the different tasks. For the pick-and-place task, we used Baxter's internal joint trajectory action server which uses cubic splines for interpolation. The goal target is identified online through imageprocessing routines. For the open-and-close-a-drawer task, we trained dynamic movement primitives using kinesthetic teaching.

In terms of robot introspection, the sHDP-HMM code with "memoization" variational coordinate ascent, with birth and merge moves was implemented using BNPY [28] and wrapped with ROS. Training used 10-trial batches. Observations used 13 dimensional vectors composed of 7 DoF pose values (position and quaternion) and 6 DoF wrench values. A baseline HMM algorithm was implemented through *hmmlearn* [28] and wrapped with ROS. The anomaly threshold for each skill was computed through leave-one-out cross-validation.

Figure 4.5 shows a representative image of the Baxter robot attempting a pick operation. A human collaborator accidentally collides with robot before a pick action. Note that collisions were strong enough to move the current pose significantly from the intended path and sometimes collide with other parts of the environment. The robot introspection system identifies an anomaly and triggers a recovery behavior. The lower left part of the image shows the anomaly F2-metric flagging the anomaly. The system then begins recovery as seen in the directed graph on the right (implemented in ROS-SMACH). Video and auxiliary data for the three experiments under the four conditions are available in [21].

For results reporting, two markers are provided: (i) A success recovery rate for the recovery policy and (ii) an F-score, precision, and recall numbers for assessing anomaly identification. In particular, these markers are used under experimental conditions (iii) and (iv) conditions described at the beginning of the experiment and results section. The first two conditions are use as a baseline and compare with the generic recovery behavior success rates.

#### (1) Human Collisions

**One anomaly per node**: 27 test trials were used for the pick-and-place task and 24 test trials were used for the open-and-close drawer task. Given that both of the tasks have 5 nodes, a total of 5 anomalies were induced in the task, 1 per node. Table 4.1, shows the recovery success rate of the task (represented by the F-score) and the robustness of the identification system through the precision-and-recall quantities for both micro and macro settings. The pick-and-place recovery had an average

**Table 4.1** Results for accidental human and tool collisions. Recovery strategy success rates are presented by the F-score, and robustness of the robot introspection system for anomaly is shown in the recall and precision settings. We also compare the performance of the more expressive sHDP-HMM model with an HMM that serves as a baseline. Generally, the sHDP-HMM model allowed for better introspection and thus better recovery rates and better recall


success rate of 85% with a maximum of 88%. The precision was 100% (for both macro/micro) indicating strong resistance to false positives. The recall was∼82% (for macro/micro) indicating the existence of some false-negatives. This might indicate that there might have been some collisions that were of lower magnitudes than the ones we might have trained with and were not detected by the system. For the drawer task, the recovery success rate was 91.67%. The precision was 95% (macro/micro) and recall was ∼84.5%. As with the human collision, weaker contact signals in tests compared to training might be the reason for the presence of false-negatives in our system.

**Multiple anomalies per node**: Under this scenario the pick-and-place task ran 42 test trials and the drawer task ran 30 test trials. Five anomalies were induced repeatedly one-after-the other for each node. The pick-and-place recovered 85% of the time with a precision of ∼95% and a recall of ∼84.5% (micro/macro). The drawer task recovered 93.3% of the time with a precision of ∼100%, and a recall of ∼93% (micro/macro).

#### (2) Tool Collisions

For the tool collision experiment, we only tested recovery in the pick-and-place task. The results for this scenario were similar to that of human collision. The recovery success rate was ∼88.89%, the precision 100%, and the recall 88.89%.

#### *4.4.5 Discussion*

This work showed the ability of a system to recover from unmodeled and accidental external disturbances that can't be anticipated. Such disturbances will be more common in shared human-robot workspaces. Our work demonstrated that our recovery strategy in connection with our previous introspection framework recovered 88% of the time from accidental human and tool collisions under single-anomaly and multiple-anomaly scenarios per node. The results indicate the system can recover at any part of the task, even when it is abused and multiple anomalies occur consecutively. From the videos in [21], we can see that even when in cases where the robot is in constant duress, the robot recovers consistently. Such robustness will play a role in enabling robots have increasing levels of long-term autonomy.

We by using a more expressive model to do robot introspection, our recovery ability also increased. We will continue to explore improved models that can better capture spatio-temporal relationships of high-dimensional multi-modal data. As well as looking for representations that scale over time in order to acquire a useful and practical library of skill identification and motion generation.

Yet there is much work to be done. Manual annotations for state-dependency are an important weakness. Crucially we wish to move towards modeling human understanding for decision making in the midst of robot-object-environment interactions. Scalability is an important factor in this domain as the system must scale to ever growing number of learned tasks. Learning how anomalies and recovery decisions are made and re-used across similar scenarios will be important. The manual approach will not scale. Adaptability and not only reverting is also important. Incremental learning is also key. The current work is limited to reverting. By simply revering we don't model recovery behaviors and also do not learn how to adapt. We also cannot handle new scenarios. We must develop action-confidence metrics that let us learn new scenarios on demand.

#### **4.5 Gradient of Log-Likelihood for Anomaly Monitoring**

Based on the implementation in Sect. 4.4, it is difficult to determine the constant parameters *c* in actual applications, which leads to frequent false positives. To address the problem, this section explores an anomaly monitoring method based on the loglikelihood function gradient value and mathematically verifies its implementation process to prove its feasibility and effectiveness. After identifying the motion behavior *s*, this method assumes the parameters of the specific motion behavior and its established probability model are known as Θ. From Sect. 2.4.1, we can know that the log-likelihood function value for the observation sequence can be determined by the probability in all credible states, and the simplified description is

$$\mathcal{L}^{\rho}(t) = \log \sum\_{i=1}^{N} \alpha\_i(t) = \log \sum\_{i=1}^{N} \exp \left( \log \alpha\_i(t) \right) \tag{4.5}$$

Therefore, given an HMM model Θ and an incoming multimodal time series *Y*1:*<sup>t</sup>* , the natural logarithm of the filtered belief state associated with the forward model for latent state *i* ∈ {1, 2,..., *N*} can be represented according to Eq. (4.5). To compute *L<sup>t</sup>* , we first compute log α*i*(*t*). According to the forward-algorithm, we derive:

$$\alpha\_i(\mathbf{l}) = \pi\_i b\_i(\mathbf{y}\_1),$$

$$\alpha\_i(t+1) = b\_i(\mathbf{y}\_{t+1}) \sum\_{j=1}^N \alpha\_j(t) A\_{ji}. \tag{4.6}$$

From Eqs. (4.5) and (4.6), we know that log α*i*(*t*) can be computed recursively through log α∗(*t* − 1). Expanding the *log* in Eq. (4.5) we have:

$$\log a\_i(t) = \log b\_i(\mathbf{y}\_t) + \log \sum\_{j=1}^{N} \exp(\log \alpha\_j(t-1) + \log A\_{ji}) \tag{4.7}$$

#### *4.5.1 The Hidden Markov Model Forward Gradient*

# • **Viterbi Path in HMMs**

The Viterbi algorithm, expanded in Eq. (4.8), attempts to estimate the most likely state sequence. Viterbi uses dynamic programming to estimate the underlying state sequence *z*ˆ<sup>1</sup>:*<sup>t</sup>* through MAP computations given a sequence of observations *y*1:*<sup>t</sup>* :

$$\begin{aligned} \hat{z}\_{1:t} &= \operatorname\*{arg\,max}\_{z\_{1:t}} P(z\_{1:t} \mid \mathbf{y}\_{1:t}) \\ &= \operatorname\*{arg\,max}\_{z\_t} (b\_{z\_t}(\mathbf{y}\_t) \\ &\quad \operatorname\*{arg\,max}\_{z\_{1:t-1}} (A\_{z\_{t-1}z\_t} P(z\_{1:t-1}, \mathbf{y}\_{1:t-1})) \\ &= \operatorname\*{arg\,max}\_{z\_t} (b\_{z\_t}(\mathbf{y}\_t) \\ &\quad \operatorname\*{arg\,max}\_{z\_{t-1}} (A\_{z\_{t-1}z\_t} b\_{z\_{t-1}}(\mathbf{y}\_{t-1}) \\ &\quad \cdot \quad \cdot \\ &\quad \operatorname\*{arg\,max}\_{z\_1} (A\_{z\_{1:t}z\_1} \pi\_{z\_1} b\_{z\_1}(\mathbf{y}\_1)) \dots )) \end{aligned} \tag{4.8}$$

First, we introduce simplified notation for the Viterbi algorithm, where a node *j* has a maximal belief state δ*t*(*j*) = max*<sup>z</sup>*1:*<sup>t</sup> P*(*z*1:*<sup>t</sup>* = *j* | *y*1:*<sup>t</sup>*) with associated traceback *z*ˆ*<sup>t</sup>* .

**Theorem 4.1** *For an incremental time series Y , a good HMM model outputs an incremental Viterbi path that stably expands on the previous one. The stable expansion of the Viterbi path is as follows: given a Viterbi path "11223" for an input Y* [1*:t*]*, then the path at Y* [1*:t* + 1] *becomes "11223\*", where \* is the newly appended hidden state.*

Good models are those that predict their data as accurately as possible and can be achieve through two steps: (i) HMM parameters optimization: the Baum–Welch algorithm given a proper initialization or similar algorithm incrementally optimize HMM parameters until a local maximum of likelihood is reached; minimizing the perplexity of the model. (ii) Model selection optimization: this consists in selecting the number of hidden states and values for observation models. Many techniques exist including BIC, MCMC, Variational Bayes, or non-parametric HMMs [29].

*Proof* Consider, without loss of generality, a Viterbi graph where we examine two consecutive time-steps (*t* − 1, *t*) along two possible latent states *l*, *k* (the analysis generalizes to HMMs with more states). We also assume ∀*i*,*i* = arg max*<sup>j</sup> Ai j* ; that is, all hidden states states tend to self-transition. Also at time *t* − 1:

$$
\delta\_{t-1}(l) \succ \delta\_{t-1}(k). \tag{4.9}
$$

We also define the following symbol:

$$w\_{ji}(t) = A\_{ji} \* b\_i(\mathbf{y}\_t). \tag{4.10}$$

Due to our first assumption, we have: max*<sup>j</sup> wji*(*t*) = *wii*(*t*). Then, at time *t*, the δ values are:

$$\delta\_t(l) = \max(\delta\_{t-1}(l) \* w\_{ll}(t), \delta\_{t-1}(k) \* w\_{kl}(t))\tag{4.11}$$

$$\delta\_t(k) = \max(\delta\_{t-1}(l) \* w\_{lk}(t), \delta\_{t-1}(k) \* w\_{kk}(t))\tag{4.12}$$

According to (4.9) and our max weight formulation, the max function in (4.11) is: δ*t*(*l*) = δ*<sup>t</sup>*−<sup>1</sup>(*l*) ∗ *wll*(*t*). So, the max state *l* at time *t* − 1 will contribute to itself instead of *k* at time *t*. Therefore, there is only one condition under which the Viterbi sequence breaks:

$$
\delta\_t(k) = \delta\_{t-1}(k) \* w\_{kk}(t) \quad \text{and} \quad \delta\_t(k) > \delta\_t(l).
$$

In other words, given our original assumption, the Viterbi sequence breaks if the following inequalities are met:

$$
\delta\_{t-1}(k) \* \boldsymbol{w}\_{kk}(t) > \delta\_{t-1}(l) \* \boldsymbol{w}\_{lk}(t)
$$

$$
\text{and}
$$

$$
\delta\_{t-1}(k) \* \boldsymbol{w}\_{kk}(t) > \delta\_{t-1}(l) \* \boldsymbol{w}\_{ll}(t)
$$

In ratio form:

$$
\frac{\omega\_{kk}(t)}{w\_{lk}(t)} > \frac{\delta\_{t-1}(l)}{\delta\_{t-1}(k)}\tag{4.13}
$$

$$
\text{and}
$$

$$
\frac{w\_{kk}(t)}{w\_{ll}(t)} > \frac{\delta\_{t-1}(l)}{\delta\_{t-1}(k)}\tag{4.14}
$$

If an observation is emitted by state *l* and it is not undergoing a state switch, (δ*<sup>t</sup>*−<sup>1</sup>(*l*)>δ*<sup>t</sup>*−<sup>1</sup>(*k*) and*wll*(*t*) > *wkk* (*t*)), inequality (4.14) fails and the Viterbi sequence does not break.

When we transition from state *l* to *k* and begin emitting *wkk* (*t*) > *wll*(*t*). However, the momentum in δ*t*(*k*) and δ*t*(*l*) prevent their inequality relationship to switch. Nonetheless, after *p* time steps, the inequality δ*<sup>t</sup>*−<sup>1</sup>(*l*)>δ*<sup>t</sup>*−<sup>1</sup>(*k*) becomes δ*<sup>t</sup>*−1+*<sup>p</sup>*(*l*)<δ*<sup>t</sup>*−1+*<sup>p</sup>*(*k*). The latter is reasonable when state *k* has been emitting for some time. It is before this time *t* − 1 + *p* that inequalities (4.13) and (4.14) are met. To see why, one will notice the left hand side inequalities in (4.13) and (4.14) are larger than 1 while the right hand side ratios become smaller than 1 at time *t* − 1 + *p*, thus their inequality relationships must've swapped before this time. Note that the order in which inequalities (4.13) and (4.14) are met matters. If (4.14) is met but (4.13) is not, a clean cut occurs since we have:

and

$$
\delta\_{t-1}(k) \* \boldsymbol{w}\_{kk}(t) \prec \delta\_{t-1}(l) \* \boldsymbol{w}\_{lk}(t) \tag{4.15}
$$

$$
\delta\_{t-1}(k) \* \left. \omega\_{kk}(t) \right| > \delta\_{t-1}(l) \* \left. \omega\_{ll}(t) \right| \tag{4.16}
$$

Equation (4.16) asserts a switch from *l* to *k*. Equation (4.15) states that δ*t*−<sup>1</sup>(*l*) contributes to δ*t*(*k*), implying the previous max state contributes to the next max state. And since the max state has changed, the roles of *l* and *k* swap and inequality (4.14) begins to fail and renders sequence break unattainable. This yields a clean transition cut. If (4.13) and (4.14) are met simultaneously, a sequence break occurs. But since (4.14) is met, the states switch and the roles of *l* and *k* swap and preclude a further sequence break.

If, let's say at time *ta*, inequality (4.13) is met but (4.14) is not, a future sequence break is destined to happen at the future moment when (4.14) is met, let's say at time *tb*. When that sequence break occurs, the history between *ta* and *tb* flips and after the sequence breaks the state transitions from *l* to *k*. Again, roles switch and no further sequence breaks occur. Above all, we can safely conclude that the during the execution of the stable period of a hidden state, no sequence break will occur. It is only during state transitions that a sequence could break and even if it does, it only last for a single time step–the time step when inequality (4.14) is met. The analysis extends to HMM models with more than 2 hidden states, so long as we apply the analysis to pairs of hidden states composed of the current max state and another non-max state.

**Corollary 4.1** *Given Theorem 4.1 the gradient of the log-likelihood of the forward algorithm (from now on referred to as the forward gradient) will depend solely on the latest emission probabilities and the transition matrix.*

*Proof* According to the log-exp-sum trick [29], the approximation log*<sup>N</sup> <sup>i</sup>*=<sup>1</sup> exp(*yi*) ≈ max*<sup>i</sup>*∈{1,...,*N*} *yi* is best approached for larger values of *yi* . Applying this approximation to Eqs. (4.5) and (4.7), which is supported by Theorem 4.1, we have:

$$\mathcal{QC}\_l \approx \max\_{i \in \{1, \ldots, N\}} (\log \alpha\_i(t)) \tag{4.17}$$

$$\begin{aligned} \log \alpha\_i(t) &\approx \log b\_i(\mathbf{y}\_t) + \\ &\max\_{j \in \{1, \ldots, N\}} (\log \alpha\_j(t - 1) + \log A\_{ji}) \end{aligned} \tag{4.18}$$

Substitute Eq. (4.18) into Eq. (4.17), rename *i* and *j* to *zt* and *zt*−1, then recursively decompose log α, we have:

#### 4.5 Gradient of Log-Likelihood for Anomaly Monitoring 67

$$\begin{aligned} \mathcal{Q}\_t &= \max\_{z\_t \in \{1, \dots, N\}} (\log \alpha\_{z\_t}(t)) \\ &= \max\_{z\_t \in \{1, \dots, N\}} (\log b\_{z\_t}(\mathbf{y}\_t) + \\ &\quad \max\_{z\_{t-1} \in \{1, \dots, N\}} (\log A\_{z\_{t-1:t}} + \log \alpha\_{z\_{t-1}}(t-1))) \\ &= \max\_{z\_t \in \{1, \dots, N\}} (\log b\_{z\_t}(\mathbf{y}\_t) + \\ &\quad \max\_{z\_{t-1} \in \{1, \dots, N\}} (\log A\_{z\_{t-1:t}} + \log b\_{z\_{t-1}}(\mathbf{y}\_{t-1}) + \\ &\quad \ddots \end{aligned} \tag{4.19}$$
 
$$\begin{aligned} \max\_{z\_t \in \{1, \dots, N\}} (\log A\_{z\_{t:t}} + \log \pi\_{z\_t} b\_{z\_t}(\mathbf{y}\_1)) \dots \text{i} \end{aligned}$$

Equation (4.19) is the log version of Eq. (4.8). This suggests that, after approximations by Eqs. (4.17) and (4.18), the computation of the log-likelihood is the same as the computation of the Viterbi path using the Viterbi algorithm.

Now, since Theorem 4.1 shows that in general the Viterbi path at time *t*, *z*ˆ1:*<sup>t</sup>* , expands on the Viterbi path at time *t* − 1, *z*ˆ1:*t*−1, we have:

$$\mathcal{QC}\_t = \max\_{\boldsymbol{z}\_t \in \{1, \dots, N\}} (\log b\_{\boldsymbol{z}\_t}(\mathbf{y}\_t) + \\\\ \begin{aligned} & \max\_{\boldsymbol{z}\_t \in \{1, \dots, N\}} (\log A\_{\boldsymbol{z}\_{t-1:t}} + \log \alpha\_{\boldsymbol{z}\_{t-1}}(t-1)) \\ & = \log b\_{\hat{\boldsymbol{z}}\_t}(\mathbf{y}\_t) + \log A\_{\hat{\boldsymbol{z}}\_{t-1}\hat{\boldsymbol{z}}\_t} + L\_{t-1}. \end{aligned} \tag{4.20}$$

Then, the forward gradient can be derived from Eq. (4.20) as:

$$
\nabla \mathcal{X}\_t^\ell = \mathcal{X}\_t^\ell - \mathcal{X}\_{t-1}^\ell = \log b\_{\hat{\mathbb{z}}\_t}(\mathbf{y}\_t) + \log A\_{\hat{\mathbb{z}}\_{t-1}\hat{\mathbb{z}}\_t}. \tag{4.21}
$$

Equation (4.21) supports Corollary 4.1, where the forward gradient depends on the latest emission probability *bz*<sup>ˆ</sup>*t*(*yt*) and transition probability from hidden state *z*ˆ*<sup>t</sup>*−<sup>1</sup> to *z*ˆ*<sup>t</sup>* . Also, given that good HMM models have strong inertia (high probabilities of self-transitions), state-switching should be sparse and then *z*ˆ*<sup>t</sup>*−<sup>1</sup> will equal to *z*ˆ*<sup>t</sup>* most of the time.

#### • **Normal Skill Identification Based on the Forward Gradient**

Corollary 4.1 led us to design a new method for skill identification. If we use *n* HMM models to represent *n* robot skills, with observations coming from a certain skill, the HMM model corresponding to that skill *m*ˆ , should output a value-increasing forward log-likelihood curve that is greater than the rest of the HMM models. This also means model *m*ˆ will output a larger forward gradient value compared to other models.

The forward gradient depends on the latest emission probabilities, which in turn depend on the latest observation. The largest probabilities and thus gradients will belong to the HMM model of a currently executing robot skill. The key insight however is the inverse relationship: the use of the forward gradients to infer the currently executing skill, and then it can validate the strong correlation between the forward gradient and skill observations. This is contrasted with the log-likelihoods of the observations *log P*(*Y*1:*<sup>t</sup>*|Π ) do not. The forward-gradient measure for skill identification is defined as follows: given *p* skills *s*<sup>1</sup> : *sp*, we have HMM models *ms* for *s* ∈ {1,..., *p*} and an input time series *Y* , then the most probable skill *s*ˆ generating *Y* [*t*] is inferred as:

$$\hat{s} = \arg\max\_{s \in \{1, \ldots, n\}} \left( \nabla \mathcal{L}\_t^{\rho\_n} (Y) \right) \tag{4.22}$$

where, <sup>∇</sup>*Lms <sup>t</sup>* (*Y* ) is the forward gradient output by model *ms* at time *t* computed using time series *Y* .

#### *4.5.2 Anomaly Monitoring Based on the Forward Gradient*

We now build on the premise established in Eq. (4.22). Furthermore, consider a set of nominal observations for an executing skill, we know that the corresponding skill HMM model will output a value-increasing forward log-likelihood curve, and hence, a stable positive forward gradient. So, when an anomaly occurs, the forward gradient decreases significantly as illustrated in Fig. 4.6. Given that anomalies influence the forward gradient value, we propose a gradient-based metric for HMM anomaly monitoring.

Consider an HMM model *m* representing a skill *s* with *n* time-series trials *Yi* for *i* ∈ {1,..., *n*} collected from successful executions of skills *s* ∈ *S*. To detect anomalies in a new time series *Y* we can first derive:

$$\begin{aligned} \nabla \mathcal{QC}\_{\max} &= \max\_{i \in \{1, \dots, n\}} (\max\_{t \in \{1, \dots, T\_i\}} (\nabla \mathcal{QC}^m\_t(Y\_i))), \\\\ \nabla \mathcal{QC}\_{\min} &= \min\_{i \in \{1, \dots, n\}} (\min\_{t \in \{1, \dots, T\_i\}} (\nabla \mathcal{QC}^m\_t(Y\_i))), \\\\ \nabla \mathcal{QC}\_{\text{range}} &= \nabla \mathcal{QC}\_{\max} - \nabla \mathcal{QC}\_{\min}, \end{aligned}$$

**Fig. 4.6** For each skill (denoted in different background color), the log-likelihood gradient can be used for skill identification

where *Ti* is the time length of trial *Yi* and ∇*L<sup>m</sup> <sup>t</sup>* (*Yi*) is the forward gradient output by model *m* at time *t* computed using time series *Yi* . Then, we use an empirically-derived test to trigger an anomaly for *Y* :

$$
\nabla \mathcal{L}\_t^{\text{on}}(Y) < \nabla \mathcal{L}\_{\text{min}} - \frac{\nabla \mathcal{L}\_{\text{range}}^{\rho}}{2}. \tag{4.23}
$$

This test detects if the gradient is an outlier compared with gradients of successful skill executions.

#### *4.5.3 Experiments and Results*

As for experimental setup, a dual armed humanoid Baxter robot was used to perform a pick-and-place operation. The robot consisted of a Robotiq force-torque sensor and standard Baxter fingers. 5 nominal trials were used for training the HMM model. In testing, 5 trials were used for skill identification and anomaly monitoring respectively. The pick-and-place task consists of 5 skills:


Figure 4.7, shows the experimental setup and the execution of the five skills.

For training, the observation vector concatenates a 7-dimensional Cartesian endeffector pose and a 6-dimensional wrench. For each skill, we train corresponding HMM models using the Baum–Welch algorithm. The number of hidden states is selected such that emission probabilities are maximized leading to distinct and uniquely grouped hidden states.

For results reporting, we use the three factors identified by Pettersson's survey on Event Detection [30]. Namely: diagnoses accuracy, robustness (false-positive rate), and reaction time (the time it takes to identify a skill from the beginning of a skill execution). Note that for anomaly identification, internal and external perturbations are used including: unexpected movement of target object, object absence, slippery picks, and human collisions.

#### • **(A) Skill Identification Performance**

Figure 4.8 presents The skill identification confusion matrix. Skills 1 and 3 were recognized with 100% accuracy, 2 and 4 with 99% accuracy, and skill 5 with the largest surface contacts with 94% accuracy. Overall accuracy was 98.4%.

**Fig. 4.7** The Baxter humanoid robot performing a pick-and-place task. 5 independent skills used to perform the task. Executed skill motions are sketched with red arrows

**Fig. 4.8** Skill identification confusion matrix for pick-and-place

#### • **(B) Reaction Time Performance**

In terms of reaction time, a percentage is computed to assess the time it takes for the identification to execute from the beginning of a skill. The reaction percentage, using t as "true", is computed as: *offset* = *predicted-t\_beginning, length*=*t\_endt\_beginning*, and *reaction*=*offset/length*. The closer the reaction percentage is to 0% the better the identification method. A negative reaction percentage means the predicted start occurs earlier than ground truth, while a positive percentage implies a delayed identification. We assess two forms to determine the beginning of a skill (i) use the "first skill" occurrence, or (ii) use the "first 10 successive skill" occurrences. The reaction percentage for these two formats is found in Table 4.2. The average reaction time for absolute values (i.e. looking at the average time difference of the prediction, whether early or late) the "first skill" is 2.70% across all skills and the average reaction time for the "first 10 skills" is 0.97%. Between the two measures, we have a total average of 1.84%.

#### • **(C) Anomaly Monitoring Performance**

For anomaly monitoring performance of our gradient-based method, we use two environments: (i) anomaly identification as it occurs and any false-positives before an actual anomaly occurs, and (ii) an external collision is given to the robot to trigger a recovery. Then, when the robot completes its recovery behavior, we count how many false-positives are triggered before moving to the next skill execution. The robot recovery behavior is detailed in [31]. Five nominal and five anomalous trials are used for the analysis. The results are compared with two other baseline methods: the magnitude-based metric from the description in normal skill identification based on the forward gradient, and the derivative-of-difference metric from the definition in anomaly monitoring based on the forward gradient.

The five anomalous trials contain a total of 14 anomalies, consisting of:



**Table 4.2** Reaction time as a duration percentage of a skill


**Table 4.3** Anomaly monitoring metrics performance comparison

# • **(D) Pre-recovery Performance**

For pre-recovery performance computation, during each trial, we record the anomalies triggered by the testing metric and count its true positives, false positives and false negatives. The results are shown in Table 4.3. The result shows our proposed forward gradient detected all anomalies and triggered no false positives or false negatives. The other two baseline methods suffer from false positives though they deliver high true positives.

# • **(E) Post-recovery Performance**

For post-recovery performance metrics, we trigger an intentional anomaly, after recovery is completed, we count any false-positives before next skill execution. Both the forward gradient method and the derivative-of-difference method had not falsepositives. Whilst the magnitude-based metric had more than 10 and prevented the system from continuing its task execution.

#### • **Comparison with Related Works**

Comparisons across works is challenging as results use different formats across experiments. Table 4.4 is an effort to harmonize results across related literatures. The comparison should be done loosely as different tasks (small levels of contact vs. large levels of contacts, structured environment vs. unstructured environment) present different challenges to event detection. For skill identification, our current approach ranks 2nd behind the tool breakage work that identified anomalies in structured milling processes. Our work did better than [16, 22], albeit these works modeled more complex dynamical phenomena. Similar statements can be made about anomaly identification. As for reaction times, our approach offers about double the speed-up compared to the only other work that reported this number. In conclusion, based on internal and external evidence, we hold that our measure is the most robust, stable, and fastest measure reported to date.

#### *4.5.4 Discussion*

This work presented a theoretically derived event detection measure useful for nominal and anomalous behavior identification, even in post-recovery actions. Our results


**Table 4.4** Skill and anomaly identification, and reaction time comparison for state-of-the-art eventdetection methods

aAverage of Table 3 is cited from [32], k-fold: gesture classification performance (novice, inter, and expert and both macro/micro levels)

bAverage reaction time for their best multimodal setup

showed very strong performance compared with state-of-the-art results across the board. More experimental validation is certainly necessary: both in number of trials and robotic tasks. This work also remains to be tested in the area of anomaly diagnoses. The latter is concerned not only with the identification problem with the grouping of anomaly types which is more challenging. We anticipate working in conjunction with machine or deep learning models for the diagnoses of this signals. Some works [11, 12, 16, 23] already provide some characterizations.

#### **4.6 Mapping Latent State to Log-Likelihood for Anomaly Monitoring**

#### *4.6.1 Hierarchical Dirichlet Process Vector Auto-regressive Hidden Markov Models*

HMMs segment sequential data into interpretable hidden states that are assumed Markovian. They are limited by its a priori determination of hidden states (and fixed throughout the modeling process) as well as the HMMs limited ability to model complex dynamics due to the assumption that observations are conditionally independent 74 4 Nonparametric Bayesian Method for Robot Anomaly Monitoring

given the state. Instead, this section uses Bayesian non-parametric techniques that model HMMs through priors that can model an unbounded number of hidden states. In particular, we use HDP priors to learn HMMs with infinite hidden states. We also use the switching vector auto-regressive (HDP-VAR-HMM) to model multimodal observation sequences and its element consists of features extracted from an F/T sensor, joint encoders, and a tactile sensor (for a discussion on the used features see experimental setup). Figure 4.9 illustrates graphical model representations for the sHDP-HMM and the sHDP-VAR-HMM.

Assume that we record *N* sequential executions for each skill and expect to learn the common dynamics among them.We utilize the HDP-VAR-HMM to jointly model those sequences, where sequence *n* ∈ *N* has multidimensional observation data *yn* = [*y*1, *<sup>y</sup>*2,..., *yT* ] at time *<sup>t</sup>*. For instance, *yt* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* could be an instant of wrench signatures and end effector velocity, where *d* is the dimensionality of observed features. The HDP-VAR-HMM interprets each observation *yt* by assigning it a single hidden state *zt* . The hidden state value is derived from: (i) a countably infinite set of consecutive integers *zt* = *k* ∈ {1, 2, 3,...}, (ii) an initial probabilistic state distributions π<sup>0</sup> generated with Markovian dynamics, and (iii) a transition distribution {π}<sup>∞</sup> *<sup>k</sup>*=<sup>1</sup>. Then, the Markovian structure on the state sequence dictates that for all *t* > 1:

$$\begin{aligned} z\_t &\sim \pi\_{z\_{t-1}}, \\ p(z\_1 = k) &= \pi\_0, \\ p(z\_t = k' | z\_{t-1} = k) &= \pi\_{kk'}. \end{aligned} \tag{4.24}$$

Since *zt* ∼ π*zt*−<sup>1</sup> , we see that *zt*−<sup>1</sup> indexes all observations *yt* are assigned with *zt*−1. We can draw the observation *yt* given specific hidden state *zt* = *k* from its corresponding observation likelihood functions. In our case, we consider the first-order auto-regressive Gaussian likelihood. As such, observations *yt* are considered to be a noisy linear combination of the previous observation plus additive white noise ε and can be expressed as:

$$
\varepsilon\_t \mathbf{y}\_t = A\_k \mathbf{y}\_{t-1} + \varepsilon\_t(\underline{\mathbf{z}}\_t = k), \quad \varepsilon\_t(k) \sim \mathcal{A}'(\mathbf{0}, \Sigma\_k). \tag{4.25}
$$

where, each state *k* has two dynamic parameters {*Ak* , Σ*<sup>k</sup>* }: *Ak* , a *d* × *d* matrix of regression coefficients that defines the expected value of each successive observation as <sup>E</sup>[*yt*|*yt*−<sup>1</sup>] = *Ak yt*−1, and <sup>Σ</sup>*<sup>k</sup>* , a *<sup>d</sup>* <sup>×</sup> *<sup>d</sup>* symmetric positive definite matrix that defines the covariance matrix of state *k*.

Therefore, the Gaussian regression observation model explains a observed data pairs of input *<sup>y</sup>* and output *<sup>z</sup>*. Each input is a vector of length *<sup>y</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* , while each output is a scalar. In the paper, we assume that the input data *y* are fixed dependence on previous observation and focus on a generative model for learning the output data *z* which depends on *y*. Various generative models, such as full-covariance Gaussian, diagonal-covariance Gaussian, are possible for the observed input data. These can be directly define the multivariate Gaussian log-likelihood function given specific state *zk* ,

**Fig. 4.9** Graphical representation of HDP-HMM and HDP-VAR-HMM. **a** The graphical representation of a sticky HDP-HMM over *T* time steps. The latent, discrete-valued Markov process *zt* captures the temporal dependencies in the observations *yt* . **b** The graphical model for a sticky HDP-VAR (*r*)-HMM over *T* time steps. The latent, discrete-valued Markov process *zt* dictates the linear dynamical model at time *t*. For an order *r* switching VAR process (*r* = 1 in this paper), the conditionally linear dynamics are completely determined by previous observations *yt*−<sup>1</sup>

$$\begin{split} \log p(\mathbf{y}\_{l}|\boldsymbol{\theta}\_{k}, \mathbf{y}\_{t-1}) &= \log p(\mathbf{y}\_{l}|A\_{k}, \boldsymbol{\Sigma}\_{k}, \mathbf{y}\_{t-1}) = \log \, \mathcal{N}(\mathbf{y}\_{l}|A\_{k}\mathbf{y}\_{t-1}, \boldsymbol{\Sigma}\_{k}) \\ &= -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}\_{k}| - \frac{1}{2}(\mathbf{y}\_{l} - A\_{k}\mathbf{y}\_{t-1})^{T}\boldsymbol{\Sigma}\_{k}^{-1}(\mathbf{y}\_{l} - A\_{k}\mathbf{y}\_{t-1}). \end{split} \tag{4.26}$$

However, it is often the case that we don't know the parameters of this distribution, but instead to estimate them. We need to learn the approximate values θ*<sup>k</sup>* = {*Ak* , Σ*<sup>k</sup>* } for each state by defining a conjugate prior distribution on it. Particularly, the Matrix Normal Inverse Wishart (MNIW) distribution is conjugate for both the *Ak* and Σ*<sup>k</sup>* are uncertain. This distribution assume that when only the covariance Σ*<sup>k</sup>* is uncertain, the conjugate prior is defined as *d*-dimensional Inverse Wishart (IW) distribution with covariance parameter Δ, a symmetric, positive definite scale matrix, and ν, the degrees of freedom, i.e.

$$
\Sigma\_k \sim \mathcal{J} \mathcal{W}(\boldsymbol{\upsilon}, \Delta), \tag{4.27}
$$

where the first moment is given by the mean

$$\mathbb{E}[\Sigma\_k] = \frac{\nu \Delta}{\nu - d - 1}. \tag{4.28}$$

After that, we place a conditionally Matrix-Normal (MN) prior on *Ak* given Σ*<sup>k</sup>* , which is given by

$$A\_k \sim \mathcal{A} \mathcal{U} (\mathbf{M}, \Sigma\_k, V), \tag{4.29}$$

where *M* is the mean matrix of *Ak* and *V* is a symmetric, positive definite matrix that determine the covariance of *Ak* relative to Σ*<sup>k</sup>* . Therefore, we derive the useful expectations

$$\begin{aligned} \mathbb{E}(A\_k) &= M, \\ Cov[A\_k] &= \frac{1}{\nu - d + 1} (V^{-1} \times \Delta). \end{aligned} \tag{4.30}$$

The resulting log-likelihood function of joint distribution of *Ak* , Λ*<sup>k</sup>* is defined as *log p*(*Ak* , Σ*<sup>k</sup>* |ν, Δ, *M*, *V* <sup>−</sup><sup>1</sup>)

$$\log p(A\_j, \Lambda\_j | \upsilon, \Delta, M, V^{-1}) = F\_{\mathcal{J}' \mathcal{W}}(\upsilon, \Delta) + F\_{\mathcal{A}' \cup \mathcal{V}}(M, V) + F,\qquad(4.31)$$

where,

$$\begin{aligned} F\_{WI}(\boldsymbol{\upsilon},\Delta) &= -\frac{d(d-1)}{4}\log\pi - \frac{d}{2}(\log 2)\boldsymbol{\upsilon} - \sum\_{i=1}^{d} \log \Gamma\left(\frac{\boldsymbol{\upsilon} + 1 - d}{2}\right) + \frac{\boldsymbol{\upsilon}}{2}\log|\Delta|, \\ F\_{MN}(M,V) &= -\frac{d^2}{2}\log 2\pi + \frac{d}{2}\log|V|, \\ F &= \frac{\boldsymbol{\upsilon} - 1}{2}\log|A\_k| - \frac{1}{2}tr(A\_k^T A\_k A\_k V) + tr(A\_k A\_k V M^T) - \frac{1}{2}tr(A\_k(\Delta + M V M^T)). \end{aligned} \tag{4.32}$$

where Γ () is represented the gamma function. Therefore, all the optional hyperparameters of HDP-VAR-HMM are specified by ν, Δ, *M*, *V* . Apparently this straightforwardly method is computationally feasible when those four parameters are known. We will specify the detailed options in experimental section through all our experiments.

With reference to the Eq. (4.24), for the probabilistic representation of transition distribution π*<sup>j</sup>* under the HDP-VAR-HMM prior, the number of states is unbounded and every observation assigned a unique state. The HDP prior encourages sharing states over time through a latent root probability vector β over the infinite set of states (see Fig. 4.9). Consider the probability <sup>∞</sup> *<sup>j</sup>*=<sup>1</sup> β*<sup>j</sup>* = 1 on a countably infinite set, where defines the sticking-breaking construction of the prior on β and first draws independent variable μ*<sup>j</sup>* ∼ *Beta*(1,γ) for each state *j*, and then sets β*<sup>j</sup>* = μ*<sup>j</sup> j*−<sup>1</sup> =<sup>1</sup> (1 − μ), where the concentration parameter γ controls the relative proportion of the weights β*<sup>j</sup>* . Here, we interpret μ*<sup>j</sup>* as the conditional probability of choosing state *j* amongs states {*j*, *j* + 1, *j* + 2,...}

In expectation, we always describe the first *K* orders in stick-breaking as the most common states and represent their probabilities with the vector [β1, β2,..., β*<sup>K</sup>* , β>*<sup>K</sup>* ], where β>*<sup>K</sup>* = <sup>∞</sup> *<sup>j</sup>*=*K*+<sup>1</sup>. Given this (K+1) dimensional probability vector β, we simplify the HDP-VAR-HMM transition distributions π*<sup>j</sup>* for each state *j* from a Dirichlet distribution with mean equal to β and variance governed by concentration parameter α > 0

$$[\beta\_1, \beta\_2, \dots, \beta\_K, \beta\_{\ge K}] \sim Dir(\alpha \beta\_1, \alpha \beta\_2, \dots, \alpha \beta\_{\ge K}),\tag{4.33}$$

Generally, we define the starting probability vector π<sup>0</sup> from the same prior, but with much smaller variance, π<sup>0</sup> ∼ *Dir*(α0β) with α<sup>0</sup>  α, which denotes that few starting states are observed.

To capture the sufficient dynamics of given time series *n* with *Tn* observations. Our goal is to interpret these observation with the hidden state *zj* . However, instead of assigning variable states in a short time period, we should treat each observation in the context of its predecessors for improving the temporal consistency. A "sticky" parameter κ of favors self-transition by assigning extra prior mass on the transition probability π*j j* was introduced. Thus, the transition representation will in a temporally consistent way, that is

$$[\beta\_1, \beta\_2, \dots, \beta\_K, \beta\_{\succeq K}] \sim Dir(a\beta\_1, \dots, a\beta\_j + \kappa, \dots, a\beta\_{\succeq K}).\tag{4.34}$$

where κ > 0 governs the degree of self-transition bias. Informally, what we are doing is increasing the expected probability of self-transition for keeping the dynamical phenomenon consistency for many time-steps.

After constructing the sHDP-VAR-HMM model, we utilize the state-of-the-art variational inference algorithm for learning the parameters, which named as Split-Merge Monte Carlo method by Michael C. hughes et al. More information can be found in [28].

#### *4.6.2 Anomaly Monitoring Based on Hidden State*

The anomaly detector identifies whether a skill execution contains: no anomalies, one anomaly, or multiple anomalies. In this section, we introduce the mathematical modelling that makes up our detector.

#### • **(A) Hidden State Representation**

Hidden state space modeling of multivariate time-series is one of the most important tasks in representation learning by dimensional reduction. In this work, we propose the anomaly detector is a thresholding marginal log-likelihood with Bayesian nonparametric hidden Markov model, which leads to tackle the anomaly monitoring problem in a more natural way to meet the need of real-world applications.

As we discussed above, the sHDP-VAR-HMM interpreted each observation *yt* by assigning to a single hidden state *zt* , where the hidden state value is derived from a countably infinite set of consecutive integers *zt* = *k* ∈ {1, 2, 3,..., *K*}. Figure 4.10 shows the low-dimensional hidden state space of the multi-modal time series generated by the robot during manipulation, which contains 5 hidden states.

We denote Θ*<sup>s</sup>* to represent all the parameters of the trained sHDP-VAR-HMM from nominal demonstrations, including the hyperparameters for learning the posterior and the parameters for the observation model definition. Our anomaly detector performs anomaly monitoring by comparing the marginal log-likelihood*log p*(*yt*|Θ*s*) based on current executing skill of the observation *yt* with a learned threshold ρ(*z*), that derives from the most likely estimated hidden state given current observations.

**Fig. 4.10** An illustration for the representation of multimodal time series in low dimensional hidden state space

Here, the *z* would be included variable values, that is, we should synchronously update the threshold in the same skill period. Let's say the definition of an anomaly data point at any time during the manipulation is one that *logp*(*yt*|Θ*s*) < ρ(*z*), otherwise it deems the manipulation to be nominal.

The proposed method generates the mapping pairs of marginal log-likelihood and the most likely hidden state at any time. As specific hidden state *z*, we calculated the fixed log-likelihood threshold from testing the nominal demonstrations by subtracting half of range between maximum and minimum from the minimum. Unlike the anomalous log-likelihood was deviate by a certain standard deviation from the mean in [11, 12, 21], that is ρ(*z*) = μ − *c*σ, where the symbol *c* is a constant value for adjusting the sensitivity of the detector, μ and σ are the estimated mean and standard deviation of the log-likelihood. Our implementation don't need to take the constant *c* into consideration for each skill or hidden state.

#### • **(B) Joint Probability of the Observed Sequence**

The threshold ρ(*z*) is associated with the joint probability of the observed sequence. Firstly, in order to derive the forward-backward procedure of first-order autoregressive HMM for the joint state sequence *z*1:*<sup>T</sup>* given observation *y*1:*<sup>T</sup>* , plus the initial observation *y*0, (*T* ≥ 1). We have this problem when dealing with conventional HMM, here the auto-regressive structure of ARHMM makes this situation more complicated, a modified forward-backward method could solve the problem efficiently. Let's first write the joint distribution with the Markov property *p*(*yt*|*yt*−<sup>1</sup>,..., *y*1) = *p*(*yt*|*yt*−<sup>1</sup>), that is

$$\begin{aligned} p(\mathbf{y}\_{0:T}) &= \sum\_{t=1}^{T} p(\mathbf{y}\_t|\mathbf{y}\_{1:t-1}), \\ p(\mathbf{y}\_t|\mathbf{y}\_{1:t-1}) &= \sum\_{k=1}^{K} p(\mathbf{z}\_t = k|\mathbf{y}\_{1:t-1}) p(\mathbf{y}\_t|\mathbf{z}\_t). \end{aligned} \tag{4.36}$$

Where θ*zt* represents the observation parameters for the trained HMM and the *z*<sup>1</sup>:*<sup>T</sup>* is a state path over the hidden state space. As such, if the state path is estimated, the maximum probability of the observation sequence can be obtained. However, since the true state path is hidden from the observations. For the computational convenience, the general approach would be to use the maximum likelihood state at any given moment, but this would neglect uncertainty. Instead, we use a marginal probabilistic representation over hidden states at each time step (see details in hidden state estimation of auto-regressive model section). Thus, the log-probability at time *t* is derived by computing the logarithm of the sum of exponentials over all hidden states

$$\mathcal{QC}\_t = \log \sum\_{k=1}^K \exp\left(\frac{\alpha\_t(k) \cdot \beta\_t(k)}{p(\mathbf{y}\_{0:T})}\right),\tag{4.37}$$

where α*t*(*k*) = *p*(*zt* = *k*|*y*0:*<sup>t</sup>*), β*t*(*k*) = *p*(*yt*+1:*<sup>T</sup>* |*zt*), are presented the forward message passing and backward message in the standard forward-backward algorithm, respectively.

$$\begin{split} \alpha\_{t}(k) &= \frac{1}{p(\mathbf{y}\_{t}|\mathbf{y}\_{0:t-1})} p(\mathbf{y}\_{t}|z\_{t} = k) p(z\_{t} = k|\mathbf{y}\_{1:t-1}), \\ p(z\_{t} = k|\mathbf{y}\_{0:t-1}) &= \sum\_{j} p(z\_{t}|z\_{t-1} = j) p(z\_{t-1}|\mathbf{y}\_{0:t-1}). \end{split} \tag{4.38}$$

The denominator represents the probability of a sequence observations, which can be calculated by Eq. (4.36). Assume that we recorded *N* sequential data for each skill and the length of sequence *n* ∈ *N* is *Tn*. Thus, we can get a set of log-probabilities vector of each skill

$$\mathcal{A}^{\ell:N} = \{\mathcal{A}^1\_{T\_1}, \mathcal{A}^2\_{T\_2}, \dots, \mathcal{A}^N\_{T\_N}\}.\tag{4.39}$$

The above implies that the forward—backward procedure can be used to find the most likely state for any moment *t*. It cannot, however, be used to find the most likely sequence of states.

#### • **(C) Hidden State Estimation of Auto-regressive Model**

For offline constructing the state and log-likelihood pairs, we are interested in the most likely state sequence upon an observation sequence *y*0:*<sup>T</sup>* . One might be derived from copulating the MAP sequence by maximizing the marginal probability at each observation *i*

$$
\hat{z}\_i = \arg\max\_{z\_i} p(z\_i|\mathbf{y}\_{0:T}).\tag{4.40}
$$

We note this combined single observation estimated sequence might yield unlikely (even 0-probability) sequences. We therefore consider

$$
\hat{z}\_{1:T} = \underset{z\_{1:T}}{\text{arg}\max} \ p(z\_{1:T}, \mathbf{y}\_{0:T}).\tag{4.41}
$$

where, *p*(*z*<sup>1</sup>:*<sup>T</sup>* , *y*<sup>0</sup>:*<sup>T</sup>* ) was computed in Eq. (4.36). Actually, we note the most likely state sequence that minimum the cost path to *zt* is equivalent to the minimum cost path to node *zt*−<sup>1</sup> plus the cost of a transition from *zt*−<sup>1</sup> to *zt* , and the cost incurred by observation *yt* .

For convenience, we abbreviate the joint log-likelihood function *L* (*z*<sup>1</sup>:*<sup>T</sup>* ) - *log p*(*z*<sup>1</sup>:*<sup>T</sup>* , *y*<sup>0</sup>:*<sup>T</sup>* ), and the transition probability π*zt*−<sup>1</sup> *zt*−1, then

$$\mathcal{L}\mathcal{C}\_T(z\_{1:T}) = s\_1(z\_1) + \sum\_{t=2}^T s\_t(z\_t, \pi\_{z\_{t-1}}),\tag{4.42}$$

where

4.6 Mapping Latent State to Log-Likelihood for Anomaly Monitoring 81

$$\begin{aligned} s\_l(z\_1) &= \log \left( \pi\_{z\_l} p(\mathbf{y}\_l | \theta\_{z\_l}, \mathbf{y}\_0) \right), \quad t = 1\\ s\_t(z\_t, z\_{t-1}) &= \log \left( p(z\_t | z\_{t-1}) p(\mathbf{y}\_t | \theta\_{z\_t}, \mathbf{y}\_{t-1}) \right), \ t \ge 2 \end{aligned} \tag{4.43}$$

Obviously, *Lt*(*z*1:*<sup>t</sup>*) = *L<sup>t</sup>*−1(*z*1:*t*−1) + *st*(*zt*,*zt*−1), and set

$$\begin{aligned} \upsilon\_t(z\_t) &= \max\_{z\_{1:t-1}} \mathcal{Q}\_t^\rho(z\_{1:t}), \qquad z\_t \in \{1, 2, 3, \dots, K\} \\ \upsilon\_t(z\_{t-1}) &= \max\_{z\_t} \{\upsilon\_{t-1}(z\_t) + s\_t(z\_t, z\_{t-1}), \end{aligned} \tag{4.44}$$

From the above, The maximizing state sequence is iteratively obtained, for *n* ∈ {*T* − 1, *T* − 2,..., 1}

$$
\hat{z}\_T = \underset{z\_{l-1}}{\text{arg}\max} \ \upsilon\_n(z\_{l-1}), \tag{4.45}
$$

Assume that we recorded *N* sequential data for each skill and the length of sequence *n* ∈ *N* is *Tn*. Thus, we can get a set of the most probable states vector of each skill

$$\mathcal{X}^{\diamond 1:N} = \{\mathcal{X}^{\diamond 1}\_{T\_1}, \mathcal{X}^{\diamond 2}\_{T\_2}, \dots, \mathcal{X}^{\diamond N}\_{T\_N}\}.\tag{4.46}$$

#### • **(D) Threshold Calculation**

Having computed the joint probability of the observations and the most probable states for each skill *s*. we can concatenate them to compute the threshold of specific state *z* = *k*, that is ρ(*zk* ). Each state consists of its maximum *L max zk* and minimum *L min zk* . We cluster the log-likelihoods with the same state, which represents the clustering data by HMM belong to the same statistical distribution.

$$\begin{aligned} \mathcal{L}\_{z\_k}^{\rho\_{\max}} &= \max\_{i \in \{1, \ldots, N\}} \mathcal{L}\_i^n \mathbf{1}(z\_k == k), \\ \mathcal{L}\_{z\_k}^{\rho\_{\min}} &= \min\_{i \in \{1, \ldots, N\}} \mathcal{L}\_i^n \mathbf{1}(z\_k == k), \\ \mathcal{L}\_{z\_k}^{\rho\_{\max}} &= \mathcal{L}\_{z\_k}^{\rho\_{\max}} - \mathcal{L}\_{z\_k}^{\rho\_{\min}}, \end{aligned} \tag{4.47}$$

where the notation **1**(η) to represent an indicator random variable that is 1 if the event η occurs and 0, otherwise. Using Eq. (4.47), we can set the anomaly monitoring threshold of each active state *zk* at any moment according to:

$$
\rho(z\_k) = \mathcal{E}\_{z\_k}^{\prime min} - \frac{\mathcal{E}\_{z\_k}^{range}}{2}. \tag{4.48}
$$

In testing implementation, we preprocessed the incoming multimodal data in the same scheme as the training process and then get the most possible state *ztest* by recursively compute the filtered marginals and the corresponding log-likelihood *Ltest* using Eqs. (4.38) and (4.37) respectively.

$$z\_{test} = \underset{1,\ldots,K}{\arg\max} \left\{ p(z\_t = 1 | \mathbf{y}\_{0:t}), \ldots, p(z\_t = K | \mathbf{y}\_{0:t}) \right\}.\tag{4.49}$$

Here, this approach detects an anomaly when the log-likelihood is lower than the state dependent threshold ρ(*ztest*) This approach also extends to online detection. Given the trained sHDP-VAR-HMM and streaming sensor data, the detector can recursively compute the *zt* and *L<sup>t</sup>* at time *t*, and then perform the comparison, if *L<sup>t</sup>* < ρ(*ztest*), then anomaly, else nominal.

#### *4.6.3 Experiments and Results*

#### • **Finite State Machine Based Kitting Experiment**

In this section, we describe the kitting experimental setup, external disturbances, data collection, and anomaly monitoring and diagnoses model parameters.

A kitting experiment is setup, where a robot and a human co-worker closely collaborate to place a set of goods in a packaging box. Specifically, a human coworker is tasked to place a set of 6 objects on the robot's "collection bin" (located in front of the robot) in a one-at-a-time fashion. The objects may accumulate in a queue in front of the robot. As soon as the first object is placed on the table, the robot identifies the object and places it in a packaging box located to the right of the robot. Thus, the robot picks an object and transports it towards the box, after which, the robot appropriately places each of the six objects in different parts of the box. The whole implementation is shown in Fig. 4.11.

The kitting experiment consists of 6 skills (Skill 3) : Home → Pre-pick; (Skill 4) : Pre-pick → Pick; (Skill 5): Pick → Pre-pick; (Skill 7) : Pre-pick → Pre-place; (Skill 8) : Pre-place → Place; (Skill 9) : Place → Pre-place that were effectively

**Fig. 4.11** A kitting experiment consists of 6 skills (in various color) that were modeled by DMP, respectively. Objects that need to be packaged are placed by a human collaborator before the robot in a collection bin. The shared workspace affords possibilities for accidental contact and unexpected alteration of the environment. The robot is tasked to pick each one of the objects and place them in the packaging box to its right. The visualization module uses the ALVAR tags to provide a consistent global pose with respect to the base of the robot. Ten nominal executions and one snapshot of each skill are shown

modeled by the dynamic movement primitives and the ROS-SMACH.<sup>1</sup> One-shot kinesthetic demonstrations were used to train DMP models for each of the skills of the kitting experiment. Note that a skill's starting and ending pose can be adapted if necessary thus providing a flexible and generalizable skill encoding procedure. The trajectory adaptations however increase the difficulty of assessing the sensory data for both anomaly monitoring and diagnoses—due to the variability in the data collection even from nominal executions.

### • **The Robot, Ojects, and Visual System**

A Baxter humanoid robot's right arm is used to pick commonplace objects set before him. The robot is equipped with a 6 DoF Robotiq FT sensor, 2 Baxter-standard electric pinching fingers, as well as Baxter's left hand camera. Each finger is further equipped with a multimodal tactile sensor composed of: (i) a 4 × 7 taxel matrix that yields absolute pressure values, (ii) a dynamic sensor which provides a single capacitive reading in millivolts (mV) useful to detect tactile events, and (iii) an IMU and gyroscope [34]. Baxter's left hand camera is placed flexibly in a region that can capture objects in the collection bin with a resolution of 1280 × 800 at 1 fps (we are optimizing pose accuracy and lower computational complexity in the system). The use of the left hand camera facilitated calibration and object tracking accuracy.

A set of 6 common household objects consisting of box-like shapes and bottles were used in our work. The objects ranged in weight from 0.0308 to 1.055 kg and in volume from 3.2 × 10−<sup>04</sup> m<sup>3</sup> to 1 × 10−<sup>03</sup> m3. The object's surfaces also varied slightly: some heavier objects had sleeker surfaces to incite object slips—we believe not an unreasonable determination as warehouses contain a wide variety of objects whilst other objects had rougher surfaces. Across executions, object locations and order was varied to promote generalization.

Alvar tags,<sup>2</sup> with 0.06 m sides, were placed around the circumference of the objects for robust visual recognition (ALVAR can handle change in lighting conditions, optical flow-based tracking, and good performance for multitag scenarios) regardless of orientation.

#### • **External Disturbances**

External disturbances may be introduced into a system for a variety of reasons. We list possible scenarios in which disturbances may occur in real-systems, as shown in Fig. 4.12:

(1) For collaborative jobs (such as kitting in warehouses, see the description in experiment setup section), humans may experience monotony leading to boredom and loss-of-attention. In such cases, it is possible for a human co-worker to accidentally collide with the robot manipulator or alter the environment in unexpected ways.

<sup>1</sup>http://www.ros.org/SMACH.

<sup>2</sup>http://wiki.ros.org/ar\_track\_alvar.

**Fig. 4.12** Seven representative unexpected anomalies in the kitting experiment (left-to-right): Human Collision (HC), Tool Collision (TC), Object Slip (OS), Human Collision with Object (HCO), Wall Collision (WC), No Object (NO), Other (OTHER)


The seven types of anomalies we just described will be declared more succinctly in the rest of the paper as: Human-Collisions (HC), Tool-Collisions (TC), Object Slips (OS), Human-Collision-with-Object (HCO), Wall-Collision (WC), No-Object (NO), and Other (OTHER) respectively. In Sect. 4.6.1, the introspection methodology used to model robot skills, continued by a presentation of anomaly monitoring in Sect. 4.6.2 and anomaly diagnoses in Chap. 5.

#### • **Dataset Collection and Sensory Preprocessing**

It is undeniable that bias exists on the side of robot researchers as to which anomalies may exist in the task. In this work, we first attempt to discover likely anomalies in this task by emulating a collaborative kitting task in which the human collaborator experiences tedium and monotony resulting in undesired events that lead to anomalous events.

We tasked 5 participants—whose ages range from 23 to 28—to act as a collaborator in the kitting task under the monotonous conditions mentioned in experiment setup section. One expert user and four novice users participated in the experiments. Novice users learned from the expert to induce anomalies by observing a round of executions. Each participant performed 1 nominal and 6 anomalous executions by placing the set of six household objects in a one-by-one fashion. For training, when an anomaly occurs, the robot execution is halted and the data stream is labelled with an anomaly class (namely the labels with HC, TC, OS, NO, etc.) for supervised classification during testing. Subsequently, anomalous time-stamps are also extracted. In total, we collected a data set with 18 nominal executions and 180 anomalous executions (each anomalous trial has at least one anomaly across the entire task), and 38 failure trails are ignored.

As for our multimodal observation vector, it is comprised of 6 DoF force-torque signals, 6 DoF Cartesian velocity signals (from Baxter's right end-effector), and 56 DoF tactile signals from both the left and right tactile sensors. Empirical feature extraction is performed on this observation vector to improve diagnoses performance. We now describe how features for each modality at time *t* were engineered.

**Wrench modality:** Assume the force and torque sequence is a time-series vector and that each element represents the magnitude of each dimension ( *fx* , *f <sup>y</sup>* , *fz*, *tx* , *ty* , *tz*). We wish to extract structural information from the wrench instead of simply using raw values. In this way, we can find signal patterns that may occur across the different DoFs. To this end, we computed the norm of both the force *n <sup>f</sup>* and the torque *nt* as features respectively:

$$m\_f = \sqrt{f\_x^2 + f\_y^2 + f\_z^2}, \quad n\_t = \sqrt{t\_x^2 + t\_y^2 + t\_z^2}. \tag{4.50}$$

**Velocity modality:** Similarly, we take the Cartesian linear (*lx* ,*ly* ,*lz*) and angular (*ax* , *ay* , *az*) velocities and compute their norm *nl* and *na* respectively:

$$m\_l = \sqrt{l\_x^2 + l\_y^2 + l\_z^2}, \quad n\_a = \sqrt{a\_x^2 + a\_y^2 + a\_z^2}. \tag{4.51}$$

**Tactile modality:** Due to the computational cost of processing the tactile sensor's high dimensionality, we empirically tested a number of features for each tactile sensor, they include: the maximum taxel value, the largest five taxel values, the mean of all taxel values, and the standard deviation for all taxel values. It was the standard deviation, which proved to be the most useful feature for anomaly monitoring and diagnoses. The standard deviation for each tactile sensor *sl* and *sr* are defined as:

86 4 Nonparametric Bayesian Method for Robot Anomaly Monitoring

$$s\_l = \sqrt{\frac{1}{28} \sum\_{i=1}^{28} (l\_i - \mu\_l)}, \quad s\_r = \sqrt{\frac{1}{28} \sum\_{i=1}^{28} (l\_i - \mu\_r)}. \tag{4.52}$$

where <sup>μ</sup>*<sup>l</sup>* <sup>=</sup> <sup>1</sup> 28 <sup>28</sup> *<sup>i</sup>*=<sup>1</sup> *li* and <sup>μ</sup>*<sup>r</sup>* <sup>=</sup> <sup>1</sup> 28 <sup>28</sup> *<sup>i</sup>*=<sup>1</sup> *ri* are the mean of each tactile sensor respectively. Equations (4.50)–(4.52) facilitate the identification of diverging signals during anomalous situations. We empirically concatenate all the valuable features for representing the robot executions both in nominal and anomalous cases. If raw signals are used directly, they result in high false-positive detection rates (FPR) in our experimentation. For instance, when a robot carries objects of varying weights, the F/T signals for such operations vary significantly making it difficult to use raw values for detection. Hence, a standardization method is used to scale the extracted features through a normalization constant. We ultimately end with an feature vector of length 18 as shown below:

$$\mathbf{y}\_{l} = \left[ \frac{f\_{x}}{10}, \frac{f\_{y}}{10}, \frac{f\_{z}}{10}, \frac{t\_{x}}{10}, \frac{t\_{y}}{10}, \frac{t\_{z}}{10}, \frac{n\_{f}}{10}, \frac{n\_{l}}{0.4}, l\_{x}, l\_{y}, l\_{z}, a\_{x}, a\_{y}, a\_{z}, \frac{n\_{l}}{0.4}, n\_{a}, \frac{s\_{l}}{250}, \frac{s\_{r}}{250} \right]. \tag{4.53}$$

Note, that all training and testing data are pre-processed in the same manner. We will also use the same pre-processing approach when comparing with other baseline approaches.

#### • **Basic Parameter Settings of Each Model**

For anomaly monitoring and diagnoses using the sHDP-VAR-HMM with a first-order autoregressive Gaussian likelihood, each state *k* has two parameters to be defined: the regression coefficient matrix *Ak* and the precision matrix Λ*<sup>k</sup>* . The conjugate prior is the MNIW for which four parameters ν, Δ, *V*, *M* are assigned values in a fashion similar to the *Motion-Capture-Dataset* in [28]. In order to guarantee the prior has a valid mean, the degrees-of-freedom variable is set with ν = *d* + 2 and we set Δ by constraining the mean or expectation of the covariance matrix <sup>E</sup>[Λ<sup>−</sup><sup>1</sup> *<sup>k</sup>* ] under the Wishart prior in Eq. (4.30).

$$\mathbb{E}[A\_k^{-1}] = s\_F \sum\_{n=1}^{N} \sum\_{t=1}^{T\_n} (\mathbf{y}\_t - \overline{\mathbf{y}})(\mathbf{y}\_t - \overline{\mathbf{y}})^T. \tag{4.54}$$

Assume that we record *N* sequential data for each skill and the length of sequence *n* ∈ *N* is *Tn*. Thus, we can easily define the parameter Δ accordingly as

$$
\Delta = (\nu - d - 1) \mathbb{E}[\boldsymbol{A}\_k^{-1}].\tag{4.55}
$$

Specifically, we placed a nominal prior on the mean parameter with mean equal to the empirical mean and expected covariance equal to a scalar *sF* times the empirical covariance, here *sF* = 1.0. This setting is motivated by the fact that the covariance is computed from polling all of the data and it tends to overestimate mode-specific covariances. A value slightly less than or equal to 1 of the constant in the scale matrix mitigates the overestimation. Also, setting the prior from the data can move the distribution mass to reasonable parameter space values. The mean matrix *M* and *V* are set such that the mass of the Matrix-Normal distribution is centered around stable dynamic matrices while allowing variability in the matrix values (see Sect. 4.6.1 for details).

$$\begin{aligned} M &= \mathbf{0}\_d, \\ V &= 1.0 \ast \mathbf{I}\_d. \end{aligned} \tag{4.56}$$

where **0***<sup>d</sup>* and **I***<sup>d</sup>* are the matrix of all zeros and the identity matrix, respectively, each of size *d* × *d*. For the concentration parameters, a *Gamma*(*a*, *b*) prior is set on HDP concentration parameters α. A *Beta*(*c*, *d*) prior is set on the self-transition parameter μ and the degree of self-transition bias κ is set to 50. We choose a weekly informative setting and choose: *a* = 0.5, *b* = 5, *c* = 1, *d* = 5. Specifically, the initial transition proportion parameter is set μ ∼ *Beta*(1, 10). The Split-Merge Monte Carlo method sampling parameter maximum iterations is set to 1000. The truncation states number is set to *K* = 5 for anomaly monitoring, and *K* = 10 for anomaly diagnoses. Then, the HDPHMM model of each anomaly class or nominal skill is trained. Note, that as with pre-processing, we also use the same model parameters in training and testing as well as in other baseline approaches.

#### • **Results and Analysis of Anomaly Monitoring in Kitting Experiment**

According to Sect. 4.6.2, the anomaly detector implementation consists of representing multimodal observations through latent states, concatenating the derived loglikelihood values of a latent state, and calculating the anomaly monitoring threshold.

We first generate the latent state partitions for nominal demonstrations with varying speeds and goals. Notice how the non-parametric method yields varying dynamics across executions reflecting statistical variations in the robot motion trajectory due to changes in the desired trajectory completion time and varying goal distances.

Then, latent state log-likelihood values are concatenated and the corresponding thresholds computed. To this end, we visualized the performance of the proposed method as well as a baseline by assessing the anomaly time reaction of anomaly flags; that is, which algorithm can make a correct detection at a time closer to the ground truth. For training, ten nominal executions were used along with *3-fold* cross validation to obtain nominal sHDP-VAR-HMM models for each skill. The other 20 nominal executions were used for testing.

The execution varying threshold for anomaly monitoring was set according to Eq. (4.48). The results for skill 3 are visualized in Fig. 4.13. As seen in Fig. 4.13, latent state *zhat* = 2 is the most prominent latent state (the majority of the observations in the skill are ascribed to this state). We now analyze the time sensitivity of our system in detecting anomalies. Anomaly identification sensitivity is intimately connected to threshold setting. We compare the results of our execution varying threshold with a fixed threshold presented in [35]. The comparison is seen in Fig. 4.14. Under nominal situations both methods have the same robustness in avoiding falsepositives. In anomalous situations, there are multiple occasions in which our varying

**Fig. 4.13** All the log-likelihood values of individual hidden state of skill 3 when the truncation states number is set to *K* = 5. Each subplot shows the mean value (black dotted line) and the corresponding log-likelihood threshold value (red solid line) in Eq. (4.48)

threshold yielded better time sensitivity and detection accuracy. From Fig. 4.14, we see that for executions 2 and 4, our varying threshold detector identified anomalies that the fixed threshold method could not. Notice that the distance between the *anomaly*\_*t*\_*by*\_*human* and the *anomaly*\_*t*\_*by*\_*detector* is larger for the fixed threshold (recall from the dataset collection section that ground truth time stamps were obtained during training). This improved sensitivity reduces false-negatives and helps us flag anomalies at times closer to the ground truth.

Finally, Table 4.5 shows the anomaly monitoring performance for our hiddenstate detector (HSD) and the baseline gradient-based detector (GD) for the six skills used in the kitting task for 368 nominal executions and 136 anomalous executions. Our results indicate that while the GD system had the HSD resulted in higher 16.6% better precision, our system had 76.0% better recall. The difference is significant in reducing false-negatives and significantly raising the sensitivity of our system. The sensitivity improvement is almost four times bigger than the difference in precision. For this reason, our system reflects an F1-score that is 25.0% larger and an accuracy that is *F1-score*, and an accuracy that is 5.7% higher standing at 91.0%.

**Fig. 4.14** The comparison of two anomaly detectors for five robot anomalous executions. Each row represents the same testing execution. *Left column:* results of our hidden state-based anomaly detector. *Right column:* results of the gradient-based anomaly detector [21]. The ground truth (*anomaly*\_*t*\_*by*\_*human*) are represented by green dashed lines and the detected anomalies are represented by yellow solid lines

#### *4.6.4 Discussion*

Comprehensive experimental results showed robust, sensitive, and better timed anomaly monitoring for a wide range of anomalous situations in an unstructured co-bot scenario where a human and a robot collaborated to complete kitting tasks. The improved anomaly detector exploits the mapping relationship between latent states and the marginal log-likelihood values from the sHDP-VAR-HMM for monitoring anomalies. Our evaluations showed that we could detect anomalies reliably (overall accuracy of 91.0%).


**Table 4.5** The comparison of anomaly monitoring performance measures between our proposed method and a baseline method

With regards to anomaly monitoring, Table 4.5 indicated that our proposed detector achieved unexpectedly superior *recall* measurements compared to the baseline (the gradient-based method), an improvement of 61.7%. By adjusting the anomaly monitoring threshold during the task's progress execution task as a function of the maximum and minimum of the log-likelihood of observations for latent states, we gained enhanced sensitivity. Human-centric safety approaches dictate that increased sensitivity is highly desirable in human-robot collaboration tasks so as to provide higher safety guarantees for a human. Robotic workcells on the other hand may work better with approaches like the gradient-based detector that was less sensitive but had higher precision. Here the reduction in the false-positive rate may support longer-term autonomy in a workcell. The anomaly detector exhibited not only a better *accuracy* but also more accurately timed anomaly flags. That is, our anomaly flags occurred closer to the ground truth time. This would be useful. Timeliness may prove a critical role in cases where anomalies are triggered as a chain reaction. That is anomalies that cause more anomalies. By acquiring more timely detection, we may be able to also provide more timely recovery techniques or even prevention so as to minimize the possibility of a cascade of anomalies.

#### **4.7 Summary**

In this chapter, there are three kinds of anomaly monitoring methods in various threshold are presented based on the Bayesian nonparametric HMMs, namely: loglikelihood-based threshold, threshold based on the gradient of log-likelihood, and computing the threshold by mapping latent state to log-likelihood. All proposed methods and implementation processes were validated on the experimental platforms of "HIRO-NX robot performs electronic component assembly tasks" and "Baxter robot performs object packing tasks" using the ROC curve or accuracy-precisionrecall rate-F1 score as a performance evaluation indicator. The results are as follows:

(1) The anomaly monitoring method using log-likelihood-based threshold can effectively realize the online monitoring, but there are a large number of false positive (false triggering) and has an uncertain constant coefficient *c*. For the task of HIRO-NX robot performing the electronic components assembly, sHDP-VAR(2)-HMM obtains the optimal performance where the constant value of the threshold is *c* = 3 and the observation model is the second-order linear Gaussian regression model, resulting in a false positive rate of about 5.0% and a true positive rate of about 95.0%.


#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 Nonparametric Bayesian Method for Robot Anomaly Diagnose**

**Abstract** In this chapter, we introduce two novel anomaly diagnose methods using the Bayesian nonparametric hidden Markov models when anomaly triggered, including i)multi-class classifier based on nonparametric models, ii) sparse representation by statistical feature extraction for anomaly diagnose. Additionally, the detail procedure for anomaly sample definition, the supervised learning dataset collection as well as the data augmentation of insufficient samples are also declared. We evaluated the proposed methods with a multi-step human-robot collaboration objects kitting task on Baxter robot, the performance and results are presented of each method respectively.

# **5.1 Introduction**

In addition, anomaly diagnoses refers to the process of clearly classifying anomalies using supervised learning on the premise that the robot detects the occurrence of anomalies. Due to the uncertainty, diversity and unpredictability of anomalies, the robot's anomaly diagnoses can effectively improve the safety performance of the robot system and provide guarantee for subsequent anomaly repair behaviors. Therefore, the online abnormal monitoring and diagnoses of robots is the focus and difficulty of robots' long-term autonomous operation.

#### **5.2 Related Work**

The robot introspection not only needs to implement the movement identification and monitoring of anomalies, but also to diagnose the types of anomalies for providing sufficient support in subsequent recovery. The samples used for anomaly diagnoses in this chapter would have the same modal information as in anomaly monitoring and belong to multidimensional time series. Since the problem of multidimensional time series classification is a supervised learning problem [15–17], which aims to determine the labels of multidimensional time series of the same structure and length, Orsenigo et al. [18] proposed a discrete support vector machine (Discrete Support Vector Machine) Machine (DSVM) classification method. This method benefits from the concepts of warping distance and softened variable margin and realizes the classification of multidimensional time series. Lee et al. [19] proposed a time series classification method based on K-Nearest Neighbor [20] (K-Nearest Neighbor, KNN), which was successfully used to evaluate the traffic prediction application of the mobile telecommunications industry. Seto et al. [21] combined dynamic time warping [8, 22] (DTW) and KNN to derive a multivariate time series classification based on dynamic time warping template selection, and applied it to human activity recognition. In addition to the above-mentioned distance-based multidimensional time series classification method, the method of establishing feature vectors through dimensionality reduction methods has also received attention [23–25]. Nanopoulos et al. [26] proposed a method based on statistical feature extraction to establish feature vectors, and then trained a multi-layer perceptron (MLP) from the feature vectors and target categories to achieve the classification of time series. Utomo et al. [27] proposed to classify the multidimensional data of the hospital intensive care unit by the method of multidimensional compression description (MultiCoRe), which extracted the features including time domain and frequency domain for classification. Jaakkola et al. [28] introduced a method combining HMM and SVM for classifying protein domains. This method can also be applied to other fields of biological sequence analysis. Raman et al. [29] proposed the use of a multi-layer HDP-HMM modeling method to achieve classification of human movements, training HDP-HMM from all movement motion samples, and building a multi-objective classifier from this model. Test the size of the log-likelihood function of the specimen to achieve classification. Dilello et al. [30] used sHDP-HMM to classify robots with multiple modal anomalies. The model assumes that the observations are independent of each other, which weakens the correlation between abnormal data to some extent. Although the training of neural networks with small samples is easy to cause overfitting and difficult to achieve online monitoring, but because of its better modeling capabilities and less data pre-processing process, multivariate time based on deep learning Sequence classification is currently being extensively studied by scholars [31–34].

Besides the aforementioned that multivariate time series classification has received significant interest in areas such as healthcare, object recognition, and human action recognition [35–37]. Meanwhile, most of the solutions to the classification problem of multidimensional time series are solved by supervised learning, that is, the multiobjective classifier learning is performed by manually labeled samples in advance by humans. In view of this, the anomaly diagnoses problem considered in this chapter is mainly for the classification of conventional anomaly samples with multimodel observation of robotic systems. In robotics, complex semi-autonomous systems that have provided extensive assistance to humans, anomalies are occur occasionally [39]. For this reason, enabling accurate, robust, and fast anomaly monitoring and diagnoses in robots has the potential to enable more effective, safer, and more autonomous systems [38, 49]. Execution monitoring, especially with the focus of detecting and classifying anomalous executions has been well studied in robotics [40]. Daehyung Park et al.[39] introduced a multimodal execution monitoring system to detect and classify anomalous executions for robot-assisted feeding. Specifically, the classifier labelled not only the anomaly type but also the cause of the anomaly by using an artificial neural network system. The neural net successfully integrated the anomaly monitoring and diagnoses for assisting a person with severe quadriplegia. However, due to the HMM anomaly detector limitations, the classified accuracy for anomaly types and causes are 90.0%, 54.0% respectively. In [41], Bjreland et al.introduced a monitoring system that also can detect, classify, and correct anomalous behaviors using a predictive model. Those two integrated system give us a lot of inspiration and confidence for improving the robot introspection with the anomaly monitoring and diagnoses.

The anomaly diagnoses has been used to determine the source of anomalies while running manipulator or mobile robots [42]. Several common time series classification algorithms are distance based metrics, such as the k-nearest neighbors (KNN) approach have proven to be successful in classifying multivariate time series [43]. Plenty of research indicates Dynamic Time Warping (DTW) as the best distance based measure to use along KNN [44]. Besides the distance based metrics, the feature based algorithms have been used over the years [45], which rely heavily on the features being extracted from the time series data or modeling the time series with the parametric methods. Hidden State Conditional Random Field (HCRF) and Hidden Unit Logistic Model (HULM) are two successful feature based algorithms that have led to state of the art results on various benchmark datasets [46]. However, HCRF is a high computational efficiency algorithm that detects latent structures of the input time series using a chain of *k-*nominal latent variables. Further, the number of parameters is linearly increasing of latent states required. To overcome this, HULM proposes using *H* binary stochastic hidden units to model 2*<sup>H</sup>* latent structures of the data with only *O*(*H*) parameters. Another approach for multivariate time series classification is by applying dimensional reduction techniques or by concatenating all dimensions of a multivariate time series into a univariate time series [47].

#### **5.3 Problem Statement**

Another main task of robotic introspection is to accurately evaluate the potential pattern of real-time multi-modal sensing data. It is not limited to the identification of robot movement and abnormal monitoring, but also has the ability to classify conventional abnormalities. Anomaly diagnoses mainly refers to the identification of conventional learned abnormal types through supervised learning after the robot detects the occurrence of abnormalities. An intelligent robot system includes but is not limited to the following types of abnormalities: system abnormalities, internal sensors anomalies, motion instruction failures, noise or damage to sensors and actuators, robot-human-environment interactions, or external disturbances in the environment. Park et al. [50] carried out a diagnoses of the types of robot anomalies and their causes, and integrated this function into a human-robot interactive robot system for feeding disabled people. The types of anomalies considered in this section mainly come from external disturbances caused by system anomalies (kinematic anomalies, sensor failures, etc.) or human improper interference behavior or robot ends in a human-machine collaboration environment. In order to improve the compactness and portability of the proposed robotic sensing system, this section still adopts the method of non-parametric Bayesian hidden Markov model to implement multi-class anomaly diagnoses.

#### **5.4 Collection and Augmentation of Anomaly Sample**

The collection of anomalous samples in this book takes into account the precursory period and the duration of a certain period of time when the anomaly occurs, which is determined by the value of the given anomalous window range *win*\_*len* (for ease of adjustment, the duration of the precursory period and duration is usually equal, both are half of *win*\_*len*). As shown in Fig. 5.1, a sample with an anomaly type of "tool collision" is extracted (light red background area) around the trigger time (black vertical dashed line) of the previous anomaly monitor.

In addition, due to the great randomness and uncertainty of the occurrence of anomalies, the number of samples between different types of anomalies is extremely unbalanced, and for some anomalies (collision between the robot and the environment), the impact on the robot body may even cause damage There is no guarantee

**Fig. 5.1** An example of extracting an "tool collision" anomaly sample with a given when anomaly detected

**Fig. 5.2** The implementation flowchart of K-means Algorithm

for the number of samples for model training. In view of this problem, this article considers the data enhancement processing for the abnormal types of fewer samples collected. The main goal is to generate synthetic data based on the statistical characteristics of the data.

The proposed multimodal augmentation method is inspired by the *K* − *means* clustering method in traditional machine learning, combined with Dynamic Time Warping (DTW) to enhance the data. Among them, the implementation flowchart of the *K* − *means* clustering method is shown in Fig. 5.2.

In the algorithm, *K* represents the number of categories, Means represents the mean. As the name suggests, this method is a method for Dimensional or multidimensional data points, and even the time series considered in this article) clustering algorithm, the core idea is to use a preset K value and the initial centroid of each category (Centroid) to distance (Euclidean distance, Manhattan distance and Time series distance measure (DTW) is used to cluster the data points that are closer, so that the average value of the clustered iteration is preferentially obtained with the smallest objective function value, as shown in the following formula:

$$J = \sum\_{j=1}^{K} \sum\_{i=1}^{N} ||x\_i^{(j)} - c\_j||^2 \tag{5.1}$$

where, the symbol *J* is the objective function of the *K* − *means* algorithm; *K* is the number of clusters; *N* is the number of data points; || · || is expressed as the distance

**Fig. 5.3** The pseudocode for data augmentation for sparse anomaly class

function between the data points and the centroid; and *c* represents the centroid of the category. From the implementation of the *K* − *means* algorithm, we know that by clustering data points close to the centroid, and data points in the same category have similar statistical characteristics, the goal of satisfying data enhancement is to hope to generate similar data based on sparse dataset New data for characteristics. To this end, the proposed algorithm, after initializing *K* random centroids, loops through the following two steps: (1) Dividing. Divide each data point to the nearest centroid according to the distance measurement of DTW; (2) Update. According to the selected mean measurement method, the centroid is moved to the center of various types of designated data points. This time, each time the updated centroid is used to generate a composite data, the algorithm's pseudo code is shown in Fig. 5.3.

Meanwhile, an example of one-dimensional data enhancement for an abnormal type is shown in Fig. 5.4. As can be seen from Fig. 5.4, the proposed algorithm can effectively capture the statistical (such as peak and mean) characteristics of the original data.

#### **5.5 Anomaly Diagnose Based on Nonparametric Bayesian Model**

The anomaly diagnoses is triggered once an anomaly is detected. A system can possibly address a wide variety of types of anomalies including low-level hardware anomalies: sensor and actuator noise or breakage; mid-level software contingencies like: logic errors or run-time exceptions; high-level misrepresentations: poor modeling of the robot, the world, their interactions, or external disturbances in the

**Fig. 5.4** An example for illustrating the data augmentation of anomaly sample

environment). In [48, 50], Park et al.identified both the anomaly class and the cause. In this work, we deal with anomalies caused by external disturbances generated either by intrusive human behavior or resulting from poor modeling or anticipatory ability on the robot's end.

#### *5.5.1 Multiclass Classifier Modeling*

We consider the problem of multiclass classification based on multivariate time series, that is, to find a function *<sup>f</sup>* (*y<sup>n</sup>*) <sup>=</sup> *<sup>c</sup><sup>n</sup>* given the *<sup>i</sup>*.*i*.*<sup>d</sup>* training data *<sup>Y</sup>* = {*y<sup>n</sup>*}*<sup>N</sup> <sup>n</sup>*=1, *C* = {*c<sup>n</sup>*}*<sup>N</sup> <sup>n</sup>*=1, where *<sup>y</sup><sup>n</sup>* = [*y<sup>n</sup>* <sup>0</sup> , *y<sup>n</sup>* <sup>1</sup> , ..., *y<sup>n</sup> Tn* ] is an observation sequence and *c<sup>n</sup>* ∈ {1, .., *c*} its corresponding anomaly class. An observation *y<sup>n</sup> <sup>t</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* consists of the same multimodal features as described in Sect. 4.6.1 at time-step *t*. Our objective is diagnoses, where given a new test observation sequence *y*ˆ, we have to predict its corresponding anomaly class *c*ˆ. Here, the *y*ˆ is recorded by considering a *window*\_*size* around the anomaly occurred moment, and the *c*ˆ is labeled manually during training procedure. We represent each anomaly class by a separate sHDP-VAR-HMM as outlined in Sect. 4.6.1, the Θ*<sup>c</sup>* are the parameters for class *c*. It would be straightforward to estimate the posterior density of parameters *p*(Θ*c*)|*Y*,*C*) by


**Fig. 5.5** An example for illustrating the progress for training nonparametric Bayesian model for anomaly diagnoses

$$p(\Theta\_c | Y, C) = p(\Theta\_c) \prod\_{c^n = c} p(\mathbf{y}^n | \Theta\_c) p(c). \tag{5.2}$$

That is, each sHDP-VAR-HMM is trained separately and a conditional density *p*(*y*|*c*) for each class is trained throughout the process as defined in Fig. 5.5.

#### *5.5.2 Experiments and Results*

#### • **Anomaly Dataset in Kitting Experiment**

The dataset captures sensory-motor signatures of the Kitting task under anomalous scenarios as outlined in this paper. A total of 136 anomalous events were recorded in 142 experimental executions across all skills. The proportions for each anomaly class were as follows: TC 15.7%, HC 16.7%, HCO 16.7%, NO 13.0%, OS 16.7%, WC 15.7%, and OTHER 5.6%. To intuitively explain, analyze, and propose corresponding diagnose methods, all the anomalies identified using our proposed method are taken into further consideration. That is, we extract all the anomaly data points from each abnormal movement and concatenate them with labels, in which the anomalous data point is also restricted with the same features as considered in anomaly identification. Those identified anomalies are visualized via the *t*-distributed Stochastic Neighbor Embedding (t-SNE) method [51] and labelled manually, as shown in Fig. 5.6.

#### • **Anomaly Diagnoses Window Considerations and Online Recording**

Note that when an anomaly monitoring is flagged, we consider a window of time duration ±*window*\_*size* for the subsequent anomaly diagnoses.1 In cases where an anomaly is detected towards the beginning of a skill execution, and the duration of existing data prior to the detection is less than our *window*\_*size*. In this work, we do not extract data for diagnoses as we deem a minimal presence of signals before and after the detection crucial for the diagnoses. The sensory-motor data collected at this stage, allows to build basic models of the anomalies described in Chap. 4.

Our system recorded sensory data online. Upon detection of an anomaly, the timestep at which the flag occurred is recorded. Then, we record (online) the multimodal signatures (F/T, Cartesian velocity, and Tactile signals) before and after the anomaly (also referred to as the sample) according to the *window*\_*size*. Signals were resampled at 10 Hz to achieve temporal synchronization for all modalities and the further preprocessed as described in preprocessing step.

#### • **Parameter Settings and Results**

To avoid overfitting, we performed *3-fold* cross validation for each anomaly type separately. An 18 dimensional feature vector was used as presented in Equation 4.53. We compare against five baselines consisting of parametric HMMs (with a fixed number of hidden states), various observation models, and various variational inference methods. The training and testing dataset for all the diagnoses methods was the same in this comparison.

*(1) Parametric HMM Settings:* We train 3 differing types of HMMs for each anomaly class. Each HMM uses four different numbers of hidden states *K* ∈ {3, 5, 7, 10} to train each class. We need to estimate the transition matrix, mean, and covariance parameters. To this end, K-Means++ [52] clusters the data and yields

<sup>1</sup>We use the Redis database for this purpose (https://redis.io/) as rosbags can only be processed offline.

**Fig. 5.6** Visualization of all the identified anomalies using t-SNE method. We heuristically label the clusters according to the analysis of unexpected anomalies and the data points density in such a kitting experiment

initial estimates. During testing, we evaluate a test sample against all classes and select the class with the largest log-likelihood. The parametric HMMs result summary is presented in Table 5.1.

◦ *HMM-Gauss-EM:* A classifier based on a classical HMM with Gaussian observations was trained independently for each anomaly class. The standard Baum-Welch Expectation Maximization algorithm [53] was used to learn the HMM model.

◦ *HMM-Gauss-VB:* A classifier based on classical HMM with Gaussian observation was trained independently for each anomaly class. The standard Variational Bayesian (VB) inference algorithm [54] was used to learn the HMM model.

◦ *HMM-AR-VB:* A classifier based on a classical HMM with first-order autoregressive Gaussian observations was trained independently for each anomaly class. The standard VB inference algorithm [54] was used to learn the HMM model.

In conclusion, we find that for parametric HMMs (i) the best diagnoses accuracy rate was 95.7% when using 5–7 fixed states; (ii) Variational inference algorithms outperformed the standard EM algorithm; (iii) The Autoregressive observation model classified better than the Gaussian model due to it's linear conditional nature; (iv) Parametric HMMs, in general, are less effective to jointly model the dynamics of the



robotic task. Therefore, we consider a Bayesian nonparametric verision of HMMs with a hierarchical prior which shares statistical strength across the training samples.

*(2) Nonparametric HMMs Setting:* We also train 4 kinds of classifiers base on nonparametric HMMs, independently for each anomaly class. We specify the truncation number of states as K=10 as explained in Sect. 4.6.1. Comparing to the parametric HMMs, the actual number of hidden states of each anomaly class is automatically learned from data in non-parametric fashion. The same procedure during testing as described in parametric HMMs. Benefits from the automatic state inference with HDPHMM, the auto-regressive correlation of the anomaly data, and the effective variational inference techniques. The summary of diagnoses results of non-parametric HMMs are presented in Table 5.1. Those numbers in blue are denoted the optimal diagnoses accuracy of specific anomaly type across all the methods, respectively.

◦ *HDPHMM-Gauss-VB:* A classifier based on HDPHMM with Gaussian observation was trained independently for each anomaly class. The standard VB inference algorithm [55] was used to learn the HDPHMM model. Similiar to the method proposed in [57], instead of the blocked Gibbs sampling algorithm, we learn the model by variational Bayesian inference.

◦ *HDPHMM-Gauss-moVB:* A classifier based on HDPHMM with Gaussian observation was trained independently for each anomaly class. A memoized online variational inference algorithm (moVB) [56] based on scalable adaptation of state complexity is used to learn this HDPHMM model.

◦ *HDPHMM-AR-VB:* A classifier based on HDPHMM with first-order autoregressive Gaussian observation was trained independently for each anomaly class. The standard VB inference algorithm [55] was used to learn the HDPHMM model.

◦ *HDPHMM-AR-moVB:* Finally, our results were evaluated on the HDPHMM with first-order autoregressive Gaussian observation for each anomaly class. A memoized online variational inference algorithm (moVB) [56] based on scalable adaptation of state complexity was used to learn this HDPHMM model.

Given that non-parametric sHDP-VAR-HMM learns the complexity of the model from the data, it produces more accurate models as is reflected by the higher diagnoses accuracies shown in Table 5.2. The learned number of states for the different anomaly types is shown in Table 5.3. Note that for equivalent parametric HMMs, a tedious model needs to be computed for each class to optimize the state cardinality between types. The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. It is evident from the confusion matrix that the diagnoses outperforms other baseline methods.

#### *5.5.3 Discussion*

The multiclass classifier that is flagged when an anomaly is detected, was also tested through the sHDP-VAR-HMM on seven anomalies and six baseline methods. Our evaluations showed that we could not only detect anomalies reliably (overall accuracy


**Table 5.2** Confusion matrix for diagnoses results on kitting experiment dataset with seven anomalies by HDPHMM-AR-moVB method

**Table 5.3** The number of hidden states being active for different anomaly classes using HDPHMM-AR-moVB method. An active state is one in which at least one observation is assigned to this state


of 91.0%, as presented in Chap. 4) but also classify them precisely (overall accuracy of 97.1%).

With regards to anomaly diagnoses, anomalies usually occur from various sources such as sensing errors, control failures, or environmental changes. If a robot identifies the type, it may be beneficial to prevent or at least recover from the anomalous situation. In our proposed diagnoses method, we trained the sHDP-VAR-HMM model for each anomaly type separately. To address this, Natraj Raman had proposed a signal discriminative HDP-HMM for all classes albeit with an extra level that is class specific. Thus, an interesting future work direction consists in investigating a multilevel sHDP-VAR-HMM for all classes for multiclass classification and the extensions of the observation model by using higher order autoregressive Gaussian models.

Development of robot introspection is expected to have a direct impact on a large variety of practical applications, such as that can be used to estimate the log-likelihood of failure and prevent the failure during robot manipulation task. Also, the improvement of safety in human-robot collaborative environment by assessing the quality of learned internal model for each skill, which can speed up the anomaly recovery and/or repair process by providing the detailed skill identification and anomaly monitoring.

There are a number of limitations in our work. Currently we do not explicitly reduce the influence of outliers occasionally found in the derived log-likelihood values for specific hidden states. The outliers have a significant impact on the calculation of the threshold and our approach needs to address them specifically to avoid their impact. Additionally, we note the fact that our kitting experiment was not conducted under real factory conditions or in a real household daily task. Thus the verifiability of the work in real-world settings is unclear and further testing in real-factory conditions is necessary. The kitting experiment provides a proof-of-concept and the authors would like to extend their work to actual scenarios through corporate partners.

**Fig. 5.7** The implementation procedure of sparse representation

#### **5.6 Anomaly Classifier Based on Feature Extraction**

The procedure of the sparse representation system for multimodal time-series is shown in Fig. 5.7.

#### *5.6.1 Anomaly Sample Collection*

#### • **Sensory Preprocessing**

The original multimodal sensory data includes a 6 DoF force and torque signals from F/T sensor, a 6 DoF Cartesian velocity signal from a manipulator's end-effector, 56 DoF tactical signals from a left and a right tactile sensor panels. To produce more consistent data content, we do not directly concatenate individual data sources, instead temporal synchronization is done across modalities at 10*H Z*. Also, different preprocessing techniques are done for specific modalities.

**Wrench modality:** Takes a force and torque time-series vector sequence and for each element represents the magnitude of each dimension ( *fx* , *f <sup>y</sup>* , *fz*, *tx* , *ty* , *tz*). Empirically, we wish that anomalies HC and TC can effectively flag external perturbations caused from different directions. We also consider the norm of force *n <sup>f</sup>* and torque *nt* respectively:

$$m\_f = \sqrt{f\_x^2 + f\_y^2 + f\_z^2}, \quad n\_t = \sqrt{t\_x^2 + t\_y^2 + t\_z^2} \tag{5.3}$$

**Velocity modality:** We measure the linear (*lx* ,*ly* ,*lz*) and angular (*ax* , *ay* , *az*) Cartesian velocity (the endpoint state of a Baxter right hand), which are reported with respect to the base frame of the robot. As with the wrench source, we also consider the norm of the linear velocity *nl* and angular velocity *na* respectively.

$$m\_l = \sqrt{l\_x^2 + l\_y^2 + l\_z^2}, \quad n\_a = \sqrt{a\_x^2 + a\_y^2 + a\_z^2} \tag{5.4}$$

**Taxel modality:** Due to the high dimensionality of tactical sensor, we do not process all the original signals as the model input. After empirical testing, the standard deviation of each tactile sensors *sl* , *sr* were selected as the preferred features and defined as:

$$s\_l = \sqrt{\frac{1}{28} \sum\_{i=1}^{28} (l\_i - \mu\_l)}, \quad s\_r = \sqrt{\frac{1}{28} \sum\_{i=1}^{28} (l\_i - \mu\_r)}\tag{5.5}$$

where the <sup>μ</sup>*<sup>l</sup>* <sup>=</sup> <sup>1</sup> 28 <sup>28</sup> *<sup>i</sup>*=<sup>1</sup> *li* and <sup>μ</sup>*<sup>r</sup>* <sup>=</sup> <sup>1</sup> 28 <sup>28</sup> *<sup>i</sup>*=<sup>1</sup> *ri* is the mean of each tactical panel, respectively.

The above three preprocessing operations capture unexpected changes in the original signals since the signals are slightly different from the robot variable progress in normal executions. We then concatenate all the features to represent robot executions both in nominal and anomalous cases. Note that raw concatenated features can easily result in high False Positive Rates (FPR) during execution due to significant task execution variability across the same task in our experiments. For instance, the F/T signals vary different across differing objects of different weights; or similarly, human collisions which occur from different directions and magnitudes.

A standardization method is used to scale original signals ξ*<sup>o</sup>* by its mean and standard deviation according to:

$$\xi(\*) = \frac{\xi\_o(\*) - mean(\xi\_o(\*))}{std(\xi\_o(\*))}, \quad \* \in \{f\_x, f\_y, ... \mathfrak{s}\_l, \mathfrak{s}\_r\}.\tag{5.6}$$

Finally, our eighteen dimensional multimodal signal(wrench, velocity, and tactical modalities) is represented as:

$$\begin{split} \mathbf{y}\_t &= [\xi(f\_x), \xi(f\_y), \xi(f\_z), \xi(t\_x), \xi(t\_y), \xi(t\_z), \\ &\quad \xi(n\_f), \xi(n\_l), \xi(l\_x), \xi(l\_y), \xi(l\_z), \xi(a\_x), \\ &\quad \xi(a\_y), \xi(a\_z), \xi(n\_l), \xi(n\_a), \xi(s\_l), \xi(s\_r)]. \end{split} \tag{5.7}$$

#### *5.6.2 Anomaly Features Extraction*

In order to keep temporal consistency, anomaly signals are considered for a given *window*\_*size* which is fixed around events flagged as anomalous. For instance, if a tool collision is detected as shown in Fig. 5.8, then *window*\_*size* can evaluate how the diagnoses reactivity performs in our system. Generally, the *window*\_*size* will equal a power of two (the a preferred size when including the Fast Fourier Transformation (FFT) between the time and frequency domain).

As described above, the sparse representation is applied for each extracted sample window. According to our previous work on anomaly diagnoses [20], the features are extracted in both the time domain and frequency domain. For anomaly diagnoses, we empirically consider the independent features and the corrective features along a specific modality signal as in Eq. 5.7. Here, the original multimodal signals with 18 DoFs and 12 feature types are considered in both the time and frequency domains, where the final feature vector is of length 558.

#### **A: Independent features**

Here, we calculate the features along a specific modality signal ξ {∗} = (*x*1, *x*2, ..., *xn*) with *n* data points

**Fig. 5.8** Our robot introspection system for extracting the anomaly data when anomaly detected. For instance, the data of tool collision is represented with a given *window*\_*size* = ±2 s in red background

#### 5.6 Anomaly Classifier Based on Feature Extraction 111

1. mean

$$
\mu = \frac{1}{n} \left( \sum\_{i=1}^{n} x\_i \right) \tag{5.8}
$$

2. standard\_deviation

$$
\sigma = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} (\mathbf{x}\_i - \mu)^2} \tag{5.9}
$$

3. mean\_diff: calculate the mean over the differences between subsequent values

$$
\mu\_{diff} = \frac{1}{n} \sum\_{i=1}^{n-1} (\mathbf{x}\_{i+1} - \mathbf{x}\_i) \tag{5.10}
$$

4. mean\_*abs*\_diff: calculate the mean over the absolute differences between subsequent values

$$
\mu\_{abs\\_diff} = \frac{1}{n} \sum\_{i=1}^{n-1} |\mathbf{x}\_{i+1} - \mathbf{x}\_i| \tag{5.11}
$$

5. abs\_energy: calculate the absolute energy, that is the sum over the squared values

$$e\_{abs} = \sum\_{i=1}^{n} x\_i^2 \tag{5.12}$$

#### **B: Correlative features**

1. autocorrelation: calculate the autocorrelation of the specified lags *k* ∈ {1, 2, 3, 4} given the mean μ and variance σ2, respectively.

$$\mathcal{A}\_k = \frac{1}{(n-k)\sigma^2} \sum\_{t=1}^{n-k} (\mathbf{x}\_t - \mu)(\mathbf{x}\_{t+k} - \mu) \tag{5.13}$$

2. mean\_autocorrelation: calculate the mean of the autocorrelation which taken over all possible lags *l* ∈ {1, ..., *n*}

$$\mu(\mathcal{R}) = \frac{1}{n-1} \sum\_{l=1}^{n} \mathcal{R}\_l \tag{5.14}$$

3. std\_autocorrelation: calculate the standard deviation of the autocorrelation which taken over all possible lags *l* ∈ {1, ..., *n*}

$$\sigma(\mathcal{R}) = \sqrt{\frac{1}{n-1} \sum\_{l=1}^{n} (\mathcal{R}\_l - \mu(\mathcal{R}))^2} \tag{5.15}$$

4. ar\_coefficient: get the first 5 coefficients by fitting the unconditional maximum likelihood of an autoregressive model *A R*(*p*) with order *p*. In case of our implementation, the order is a fixed *p* = 10. The *A R*(5) model is defined as

$$\mathbf{x}\_t = \varphi\_0 + \sum\_{i=1}^{p=10} \varphi\_i \mathbf{x}\_{t-i} + \varepsilon\_t \tag{5.16}$$

where ε*<sup>t</sup>* is drawn from a Gaussian white noise with a mean of zero and unit variance.

5. partial\_autocorrelation: calculate the value of partial autocorrelation function of given lag *k* ∈ {1, 2, 3, 4, 5}, denoted α(*k*), is the autocorrelation between *xt* and *xt*+*<sup>k</sup>* with the linear dependence of *xt* on *xt*+<sup>1</sup> through *xt*+*k*−<sup>1</sup> removed. As such, the function is defined as

$$\begin{aligned} Cov(\mathbf{x}\_t, \mathbf{x}\_{t-k}) &= Cov(\mathbf{x}\_t, \mathbf{x}\_{t-k} | \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_{t-k+1}) \\ var(\mathbf{x}\_t) &= Var(\mathbf{x}\_t | \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_{t-k+1}) \\ var(\mathbf{x}\_{t-k}) &= Var(\mathbf{x}\_{t-k} | \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_{t-k+1}) \\ \alpha(k) &= \frac{Cov(\mathbf{x}\_t, \mathbf{x}\_{t-k})}{\sqrt{var(\mathbf{x}\_t)var(\mathbf{x}\_{t-k})}} \end{aligned} \tag{5.17}$$

#### **C: Spectrum-based features**

1. fft\_coefficient: Get the first 5 Fourier coefficients of real part by FFT algorithm, respectively. In this implementation, we define the discrete Fourier transform function as

$$\begin{aligned} \mathcal{F} &= \sum\_{m=0}^{n-1} a\_m \exp\{-2\pi \frac{mk}{n}\}, \quad k = 0, \ldots, n-1\\ a\_m &= \exp\{2\pi i f m \Delta t\} \end{aligned} \tag{5.18}$$

where, Δ*t* is the sampling interval.

2. fft\_angle: Calculate the angle of obtained complex value *F* of the first 5 Fourier coefficients, respectively.

$$\theta = \tan^{-1} \frac{\mathcal{F}\_{imag}}{\mathcal{F}\_{real}} \tag{5.19}$$

#### *5.6.3 Experiments and Results*

#### • **Experimental Setup**

A kitting experiment consists of 5 basic nodes: Home, Pre-pick, Pick, Pre-place, and Place. The experiment is implemented in the following order with those nodes by 6 skills with the ROS-SMACH,2 (Skill 1): Home → Pre-pick; (Skill 2): Pre-pick → Pick; (Skill 3): Pick→Pre-pick; (Skill 4): Pre-pick →Pre-place; (Skill 5): Pre-place → Place; (Skill 6): Place → Pre-place, as shown in Fig. 4.11.

The primary goal of the kitting task is designed to transport 6 different objects to a fixed container. The right arm of Baxter humanoid robot is used to pick objects and equipped with a 6 DoF Robotiq FT sensor, 2 Baxter-standard electric pinching fingers. Each finger is further equipped with a multimodal tactile sensor that a four by seven taxel matrix that yield absolute pressure values. The left hand camera is placed flexibly in a region that can capture objects with a resolution of 1280 × 800 at 1 fps (we optimize pose accuracy and lower computational complexity in the system). The use of the left hand camera facilitated calibration and object tracking accuracy. All code was run in ROS Indigo and Linux Ubuntu 14.04 on a mobile workstation with an Intel Xeon processor, 16GB RAM, and 8 cores.

When robot collaboratively works with human in a shared workspace, so many external disturbances are likely to occur. Those anomalies in Fig. 4.12 are considered in the Kitting experiment, which including the following 7 types: Tool Collision (TC) that may be derived from the visual error or the user accidentally collide with the object during robot moving to grasp it (see Fig. 4.12b); Human Collision (HC) is usually happened by a user to unintentionally collide with the robot arm in the human-robot collaboration environment(see Fig. 4.12a). We treat the human collision differently from whether the robot carrying object or not. Thus, Human Collision with Object (HCO) is assumed that the human collision while robot carrying object from the node *Pre* − *pick* to *Pre* − *place* (see Fig. 4.12d). The object have been knock down by the robot during grasping may induce the No object (NO) or missedgrasps(see Fig. 4.12f). Another common anomaly is Object Slip (OS) that the picked object may slip from the robot's gripper if the grasping pose is not optimal or the robot moves at high speed. Finally, the False Positive (FP) is labeled when some unexpected disturbances may be detected by the anomaly detector for a variety of reasons, for instance, the system error, the object is placed at unreachable zone, without feasible inverse kinematic solution, and so on. In the rest of this paper, we intentionally achieve the spare representation of the seven types of anomalies while preserved sufficient diagnoses accuracy, respectively.

#### • **Results and Analysis**

The dataset contains a total of 108 samples from 137 experimental recordings of kitting task and the proportion for each anomaly are TC 15.7%, HC 16.7%, HCO 16.7%, NO 13.0%, OS 16.7%, WC 15.7%, FP 5.6%, respectively. For evaluating the

<sup>2</sup>http://www.ros.org/SMACH.


**Table 5.4** Multiclass classifiers

performance of the proposed sparse representation of multivariate time-series, we took the following 9 representative classifiers3into consideration and the parameter settings are described respectively in Table 5.4.

As described above, the feature selection is a process where the most significant features in predicting the outcome are selected automatically. However, irrelevant features decrease the model's accuracy and increase computational cost. We therefore calculate the *p*\_*value* of each extracted feature by using the hypothesis testing method in spscitefresh2016. That is, we preform a singular statistical test checking the hypotheses for each extracted feature *f*1, *f*2, ..., *fn*,

$$\begin{aligned} H\_0^i &= \{ \mathbf{x}\_i \text{ is irrelevant for predicting class } \mathbf{y} \}; \\ H\_1^i &= \{ \mathbf{x}\_i \text{ is relevant for predicting class } \mathbf{y} \}; \end{aligned} \tag{5.20}$$

The result of hypothesis test in Eq. 5.20 is a *p*\_*value*, which assess the probability that feature *xi* is not relevant for predicting class *y*. As such, we define the score of feature by calculating the negative logarithmic value on the *p*\_*value*. Large scores − log(*p*\_*values*) indicate features, which are relevant for predicting the target.

The performance of different classifiers is shown in Fig. 5.9, as you can see (reading left to right on the graph), the accuracy indicates to increase as the number of features are added, until a point beyond which there seems to be too few features for the classifier to make any reliable conclusions. Specifically, those features are extracted from a sorted feature vector in descending order by score values.

<sup>3</sup>http://scikit-learn.org/.

**Fig. 5.9** The comparison between diagnoses accuracy and the number of feature on different classifiers

#### *5.6.4 Discussion*

This work implements sparsely represent the recorded multimodal time-series with relevant features as small as possible while preserving diagnoses accuracy. We propose the multivariate features are extracted in both time domain and frequency domain and not only consider the static statistical characteristics, but also including the correlation and interaction of each dimensional sensory signal. Results indicate that the data set can be significantly reduced up to 72.2% ∼ 86.1% (the number of features is 100 and 200, respectively) of the raw data while keep the average diagnoses accuracy at about 85% with small data preparation. Future work should therefore include analyzing the trade-off between the value *window*\_*size* and the diagnoses accuracy. So as to represent the recorded multimodal time-series with relevant features as small as possible while preserved diagnoses accuracy.

#### **5.7 Summary**

In this chapter, anomaly diagnose methods using the Bayesian nonparametric hidden Markov models and sparse representation by statistical feature extraction when anomaly triggered are introduced. Specifically, the detail procedure for anomaly sample definition, the supervised learning dataset collection as well as the data augmentation of insufficient samples are also presented. The proposed methods is verified with a multi-step human-robot collaboration objects kitting task on Baxter robot, the performance and results are presented of each method respectively. That is, a multitarget classifier based on the nonparametric Bayesian model is proposed, Which using the sHDP-VAR-HMM model to model the anomaly sample data of various anomaly types. For the task of Baxter robot performing kitting experiment, resulting in the diagnoses accuracy of a total of 7 types of anomaly events in is 97.1%. Additionally, for the sparse representation for anomaly diagnoses, we extract a total of 31 features of time and frequency domains of the anomaly sample. Then, the extracted features importance analysis using the hypothesis testing method, after that the filtered features are verified on 9 common out-of-the-box supervised learning methods for multi-class diagnoses in *sklearn*.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 6 Learning Policy for Robot Anomaly Recovery Based on Robot Introspection**

**Abstract** In this chapter, the anomaly recovery would be acted when both of the anomaly monitoring and diagnoses are analysed, which aim to respond the external disturbances from the environmental changes or human intervention in the increasingly human-robot scenarios. To effectively evaluate the exploration, we summarize the anomalies in a robot system include only two catalogues: accidental anomalies and persistent anomalies. In particular, we first diagnose the anomaly as accidental one at the beginning such that the reverse execution is called. If and only if robot reverse many times (not less than twice) and still couldn't avoid or eliminate the anomaly, the human interaction is called. Our proposed system would synchronously record the multimodal sensory signals during the process of human-assisted demonstration. That is, a new movement primitive is learned once an exploring demonstration acquired. Then, we heuristically generate a set of synthetic demonstrations for augmenting the learning by appending a multivariate Gaussian noise distribution with mean equal to zeros and covariance equal to ones. Such that the corresponding introspective capabilities are learned and updated when another human demonstration is acquired. Consequently, incrementally learning the introspective movement primitives with few human corrective demonstrations when an unseen anomaly occurs. It is essential that, although there are only two different exploring strategies when anomaly occurs, numerous exploring behaviors can be generated according to different anomaly types and movement behaviors under various circumstances.

#### **6.1 Introduction**

In recent years, with the widespread promotion of the cooperative robot and human beings' inclusive operation, the robot's abnormal recovery behavior cannot be performed by the traditional robot's own motion planning algorithm, and human expectations for robot motion should be brought into play. The recovery strategy will reflect the human-centered concept of human-machine collaboration. For this reason, the robot anomaly recovery problems considered in this article mainly refer to the recoveries learned from humans by themselves in the case of robots that can recovery anomalies (not considering abnormal situations that cannot be recovered autonomously such as power interruptions or system crashes). The strategy makes the robot's motion change from an abnormal state to a normal state. In other words, how to learn from humans to obtain corresponding recovery strategies for different abnormal types of robots is the key issue of this chapter, and it is required that the proposed recovery strategies have certain expansibility and generalization.

#### **6.2 Related Work**

To address the unpredictable abnormalities in robot operation tasks in an unstructured dynamic environment, Gutierrez et al. [1] proposed a way of expanding the original operation graph to recovery unforeseen abnormal events. The overall idea of the method is to use finite states machine learns an initial robot operation diagram for limited human demonstration movements, then re-plans the robot's movement when there is an abnormality in the actual operation application, and adds this recovered movement to the operation diagram, as the operation diagram continues Perfect, so that we can always find a feasible recovery behavior for the abnormal event. Paster et al. [2, 3] proposed the establishment of a specific library of motion primitives for the robot's operation tasks, and taught the different ways to complete the task through human experience. After combining the methods of motion primitive selection, the abnormal premise was detected Next, the next primitive corresponding from the motion primitive library is completed [4, 5], thereby completing the behavior of abnormal recovery. Different from the robot's own method of motion conversion and selection, Salazar-Gomez et al. [6] proposed to implement robot's abnormal recovery by introducing human observation and control [7]. This method is a kind of human-in-the-loop (Human-in-the-loop) human-robot interaction. This has the advantage that humans have a more intuitive understanding of the robot's movements. When abnormal or error occurs, they can think about feasible solutions in time. Such recovery behavior is more Vulnerable to researchers and more conducive to applications in the human-machine environment. Niekum et al. [8] also described the operation tasks of the robot through the finite state machine. In the case of detecting an abnormal event, humans assist the robot to complete the current sub-task through kinematic teaching.

With associating this recovery behavior with the type of anomaly and update it to the original operation diagram, so that the robot can adopt a specific recovery behavior in the face of different anomalies. Mao et al. [9] proposed a method of human-assisted learning to implement robot abnormal recovery, where the robot's movement from the initial motion primitive will collide with the environment. At this time, the human manual Stop the robot and start the teach and teach mode to re-teach the recovered motion primitives for the robot to complete the abnormal recovery. Karlsson et al. [10, 11] proposed an online method of dynamically modifying robot motion primitives to achieve anomaly recovery. This method also pre-parameterized the robot's motion by DMP. In the event of an abnormality, it is indicated by human motion Teach ways to learn recovery behavior. In [12], the authors proposed the abort and retry behavior in grasping by minimizing the overhead time as soon as the task is likely to fail. As an alternative, Johan S. Laursen et al. introduced a system for automatically reversed in robot assembly operations using reverse execution, from which forward execution can be resumed [13]. Similarly, a recovery policy by modelling reversible execution of robotic assembly is proposed in [14]. Another perspectives for anomaly recovery, Arne Muxfeldt et al. proposed a novel approach for recovering from errors during automated assembly in typical mating operations [15–17], which is based on automated error detection with respect to a predefined process model such that a recovery strategy can be selected from an optimized repository.

In summary, the use of human-machine collaboration to demonstrate abnormal recovery behavior has been favored by a large number of researchers, because in daily life or human-machine collaboration environment, the abnormal recovery behavior of robots cannot be achieved by the traditional robot itself. The motion planning algorithm should be carried out, and human expectations for robot motion should be brought into play. The human-assisted robot abnormality recovery strategy will further reflect the "human-centered" human-machine collaboration concept. In view of this, based on anomaly monitoring and classification, this paper proposes two types of recovery strategies for accidental and persistent anomalies, and different strategies will have different recovery behaviors under different types of anomalies.

#### **6.3 Statement of Robot Anomaly Recovery**

Specifically, the task segmentation, anomaly monitoring and diagnosis can be implemented through the generative methods such as the Bayesian non-parametric time series model that described in Chaps. 3–5 respectively. With the help of the task generalization and introspective capabilities for each movement primitives, two task exploration policies are proposed for responding the anomalies: reverse execution or human interaction. These policies can be learned from the context of manipulation tasks in an increasing manner. An overview of our SPAI framework can be described by a graph *G* composed of *Nb* behaviors (or sub-tasks) *B*, which are interconnected by edges *E* such that *G* : {*B*, *E* }, *B* = {*B*1, .., *BNb* }, *E<sup>s</sup>*,*<sup>t</sup>* = {(*s*, *t*) : *s*, *t* ∈ *B*}. Behaviors in turn consist of nodes *N* and edges *E* , such that: *B* = {*N* , *E* }. Nodes can be understood as phases of a manipulation task. In our work, we prefer to name them milestones *N<sup>i</sup>* = (1, ..., *NI*), as they indicate particularly important achievements in a task. With task exploration behaviors, we may introduce a exploring nodes *Ni j* in-between milestones creating a new branch to the subsequent milestone. It is also possible to introduce further exploring nodes *Nijk* on already existing branches. The full set of nodes in a task then is described as the union of milestone nodes with all branched nodes *N* = {*N<sup>i</sup>* - *Ni j* - ... - *Ni j*...*<sup>q</sup>* }. Node Transitions *T* , behave as with behavior transitions *E* , so *T<sup>s</sup>*,*<sup>t</sup>* = {(*s*, *t*) : *s*, *t* ∈ *N* }. In our framework, a node is composed of modules. Modules are dependent processes that play a key role in the execution of a manipulation phase and could include demonstration collection, segmentation, movement learning, introspection, task representation and exploration, vision, natural language processing, navigation, as well as higher level agents. In this chapter, we restrict a node to segmentation, generalization, introspection and exploration modules.

#### **6.4 Learning and Application of Anomaly Recovery Policy**

#### *6.4.1 Learning for Unstructured Demonstrations*

As we know, a sentence in language is made up of words according to grammatical rules, and a word is made up of letters according to word formation. Correspondingly, a complex and multi-step robot manipulation task can be represented as a sentence, in which a coupled robot movement primitive is equivalent to a word, and the robot movement primitive of each degree of freedom (DoF) of can be considered as letter such that the robot manipulation task can be represented with a set of movement primitives. To this end, how to effectively learn a multi-functional movement primitive is a critical problem for intelligent robot performing complex tasks in unstructured environment. If so, that is an interesting idea for generalizing the task representation with respect to the external adjustment as well as improving the diversity and adaptability of tasks. As a consequence, the task is consist of sequential movement primitives, which not only take the kinesthetic variables into consideration but also equip with introspective capacities such as the identification of movement, anomaly detection and anomaly diagnoses. We now introduce a Bayesian nonparametric model for robust learning the underlying dynamics from unstructured demonstrations.

#### • **Demonstrations**

Generally, capturing the demonstrations by receiving the multimodal input from the end-user, such as a kinesthetic demonstration. The only restriction on the multimodal data is that all the signals must be able to be synchronously recorded at the same frequency, i.e. temporal alignment. Additionally, the multimodal data at each time step should include the Cartesian pose and velocity of the robot end-effector (in case of object-grasping, will along with any other relevant data about the end-effector, e.g. the open or closed status of the gripper and the relative distance between the end-effector and object.) as well as the signals from F/T sensor and tactile sensor. Subsequently, the recorded Cartesian pose and velocity trajectory of the end-effector will be referred to as the kinematic demonstration trajectory for controlling the robot motion, and the recorded signals of F/T sensor and tactile sensor are applied for learning the introspective capacities.

# • **Learning from Demonstration**

Learning from demonstration (LfD) has been extensively used to program the robots, which aiming to provide a natural way to transfer human skills to robot. LfD is proposed by simply teaching a robot how to perform a task as human-like, in which users can demonstrate new tasks as needed without any prior knowledge about the robot. However, LfD often yields weak interpretation about the environment as well as the task always is single step and lab-level such that lacks of robust generalization capabilities in dynamic scenarios, especially for those complex, multi-step tasks, e.g. human-robot collaborative kitting task designed in this paper.

For this reason, we present the powerful algorithms that draw from recent advances in Bayesian nonparametric HMMs for automatically segment and leverage repeated dynamics at multiple levels of abstraction in unstructured demonstrations. The discovery of repeated dynamics provides discriminating insights for understanding the task invariants, high-level description from scratch, and appropriate features for the task. In this paper, these discoveries could be concatenated using a finite state representation of the task, and consisted of movement primitives that are flexible and reusable. Thus, this implementation provides robust generalization and transfers in complex, multi-step robotic tasks. We now introduce a flowchart which integrates three major modules that critical for implementing the complex task representation from unstructured demonstrations.

#### • **Segmentation**

The aim of the segmentation is exploring the hidden state representation of the demonstrations, in which a specific hidden state usually denotes the clustering observations would be sampled from a statistical model, e.g. multivariate Gaussian distribution. Hidden state space modeling of multivariate time-series is one of the most important tasks in representation learning by dimensional reduction. In this work, we propose the segmentation approach is a hidden state determined with Bayesian nonparametric hidden Markov model, which leads to tackle the generalization problem in a more natural way to meet the need of real-world applications.

As we discussed above, the HDP-VAR-HMM interpreted each observation *yt* by assigning to a single hidden state *zt* , where the hidden state value is derived from a countably infinite set of consecutive integers *zt* = *k* ∈ {1, 2, 3, ..., *K*}. We denote Θ*<sup>s</sup>* to represent all the parameters of the trained HDP-VAR-HMM from nominal demonstrations, including the hyper-parameters for learning the posterior and the parameters for the observation model definition.

$$z = \underset{1,\ldots,K}{\text{arg}\max} \ p(z\_t|\mathbf{y}\_t, \Theta\_s). \tag{6.1}$$

Here, the *z* would be a variable value, that is, we can concatenate the derived hidden state sequences and group the starting and goal observations for each sequence, respectively.

Assume that we record *N* multimodal demonstrations via kinesthetic fashion and jointly model them using the HDP-VAR-HMM, result in *N* hidden state sequences *Z* = {*Z*1, *Z*2, ..., *Z<sup>N</sup>* ,}, where the element of *Zi*, is integer. Here, the problem is to segment the demonstrations into time intervals associated with individual movement primitives by state-specific way. After the complex task segmentation, we achieve the task representation as presented in Chap. 3 using Finite State Machine (FSM) and Dynamical Movement Primitives (DMP).

#### **6.5 Reverse Execution Policy for Accidental Anomalies**

Reverse execution allows the robot retry the current movement or several movements that independent with the current state for resolving the accidental faults, such as human collision and mis-grasp. The key question is how far back must we revert in the task? To address this, we are currently evaluating different methodologies to learn reverse policies. Ideally, the performing critic is able to include all task-relevant information: the state of the robot, the state of the environment, the affordances of a task, and the relationship between these elements. However, this is not trivial, we are studying whether we could integrate decision making processes from multiple users. It's also difficult to measure the motivation of users to select a given node. Human users might have key awareness of the task that may render them select nodes for different reasons. Expected Utility is an area of study in Risk management [18]. An utility probabilistic model is designed to reflect a users intrinsic motivations, not limited to utility, risk propensity, and the influence in learning within a single decision episode, or across episodes. We expect to present preliminary results at the workshop.

An intuitive illustration of reverse execution is presented in Fig. 6.1. We assume the robot performs the current behavior *B<sup>i</sup>* = {*N <sup>i</sup> <sup>s</sup>* , *N <sup>i</sup> <sup>g</sup>* } and an accidental fault *F<sup>x</sup>* is detected, where *N <sup>i</sup> <sup>s</sup>* and *N <sup>i</sup> <sup>g</sup>* indicate the starting and goal node, respectively. Subsequently, a new exploring node named *<sup>r</sup>* for responding this fault is autonomously appended to original task graph *G* as illustrated in Sect. 6.3, that is

$$\sigma\_r: \mathcal{P}\_{\mathcal{S}\_i, \mathcal{A}\_s} | \mathcal{P}\_x, \mathcal{A}\_i^{\mathcal{A}} \tag{6.2}$$

**Fig. 6.1** Exploring the task via reverse execution when a fault event occurs

As formulated in Eq. (6.4), the symbol *B*<sup>∗</sup> = {*N* <sup>∗</sup> *<sup>s</sup>* , *N* <sup>∗</sup> *<sup>g</sup>* } denotes a optimal way that consists of one or more selected movement primitives for retrying under current fault type and behavior situation. So, the critical problem is how to parameterize the transition probability *p*(*TBi*,*B*<sup>∗</sup> ) given *F<sup>x</sup>* and *B<sup>i</sup>* when more than one ways to explore. To address this, a statistical distribution is introduced, which the instances are independent identically distributed and belong to a discrete distribution as well as the sum of the transient probabilities of all the samples after a fault occurrence is equal to 1. For these reasons, we define the *p*(*TBi*,*B*<sup>∗</sup> )is modelled with a multinomial distribution that the random variables is the respective frequency counted based on human intention when a fault occurs. For example, **N***<sup>F</sup><sup>x</sup>* = (*N*1, *N*2, ..., *NK* ) is a frequency distribution of a random variables vector, where *K* represents the total number of movement primitives that from beginning to current movement *B<sup>i</sup>* with fault *F<sup>x</sup>* (including the *Bi*) and *Ni*,*i* ∈ {1, 2, ..., *K*} denotes how many times of movement *B<sup>i</sup>* is successfully executed. Therefore, the probability mass function of transition *p*(*TBi*,*B*<sup>∗</sup> ) is formulated as a multinomial distribution that model a total *<sup>N</sup>* <sup>=</sup> *i*=*<sup>K</sup> <sup>i</sup>*=<sup>1</sup> *Ni* reverse executions, that is,

$$p(\mathbf{N}|N,\theta) = \frac{N!}{N\_i! \dots N\_K!} \prod\_{i=1}^{K} \theta\_i^{N\_i} \tag{6.3}$$

where, θ*<sup>i</sup>* indicates the probability of movement primitive *B<sup>i</sup>* is selected, which subject to <sup>θ</sup>*<sup>i</sup>* ∈ [0, <sup>1</sup>] and *<sup>K</sup> <sup>i</sup>*=<sup>1</sup> θ*<sup>i</sup>* = 1. Therefore, we use the multinomial distribution not only to intuitively depict the expectation of human intention on the recovery behavior when an abnormal occurrence is detected in human robot interaction scenarios, but also to provide an indirect way to express the intuitive understanding of human for expecting the motion of the robot end-effector, the related manipulation objects as well as the complex relationship between "human-robot-environment".

#### **6.6 Human Interaction Policy for Persistent Anomalies**

Human interaction allows the robot explore the current movement by human-assist demonstration for resolving the persistent faults that can't be restored by reverse execution, such as tool collision and wall collision. The key question is how to fast capture and learn from the interactive demonstration base on the defined task representation. This exploring policy is activated when the system fail more than two times when reverse executions is carried out for resolving the fault *F<sup>x</sup>* . The human interaction policy is an another exploring way that through human assisted demonstration when robot encounter a persistent fault. It's essential that synchronously capture the human kinesthetic demonstration that not only the kinematic variables but also the relative coordinate frame (updating the task structure defined in Sect. 6.3 by understating the transformation relationship between demonstration and original movement) at each time step. After that, the movement is formulated by the DMP techniques intro-

**Fig. 6.2** Exploring the task via human interaction when a fault event occurs

duced in Chap. 3. In particular, the human interaction policy is not only limited to the originally designed movements, but also can be applied to the exploring movement (reverse execution or human interaction). Theoretically, our system can handle any kinds of failure situation with those aforementioned two exploring policies such that potentially achieve the long-term autonomy during robot manipulation task. Additionally, experiential verification indicates that another problem arises, there are faults also exist in reproducing the human interaction. To address this problem, we introduce a data augmentation method (without detail statement in this paper) for training a new exploring movement and introspective model (described in Chap. 4) when the same persistent fault is continuously encountered more than three times in the same (multimodal signals are recorded at each time). Consequently, we can achieve the critical behavior for incrementally addressing the faults by "exploration of exploration" and "anomaly monitoring and diagnose in exploration".

Without loss of generality, an intuitive illustration of human interaction is presented in Fig. 6.2. We assume the robot performs the current behavior *B<sup>j</sup>* = {*<sup>N</sup> <sup>j</sup> <sup>s</sup>* , *N <sup>j</sup> <sup>g</sup>* } and an accidental fault *<sup>F</sup><sup>y</sup>* is detected, where *<sup>N</sup> <sup>j</sup> <sup>s</sup>* and *N <sup>j</sup> <sup>g</sup>* indicate the starting and goal node, respectively. Subsequently, a new exploring node named *<sup>A</sup>* for responding this fault is autonomously appended to original task graph *G* as illustrated in Sect. 6.3, that is

$$\mathfrak{a}\_A \colon \mathcal{R}\_{\mathcal{R}\_j, \mathcal{R}\_h} \vert \mathcal{R}\_y, \mathcal{R}\_j \tag{6.4}$$

where, *<sup>B</sup><sup>h</sup>* indicate the human interactive demonstration *<sup>B</sup><sup>h</sup>* = {*<sup>N</sup> <sup>j</sup> <sup>s</sup>* , *N <sup>h</sup> <sup>g</sup>* }, that have the same starting node *N <sup>j</sup> <sup>s</sup>* with movement *B<sup>j</sup>* and a new goal node *N <sup>h</sup> <sup>g</sup>* is derived from demonstration. Equipping this exploring movement *B<sup>h</sup>* by formulating it using the dynamical movement primitive. Additionally, we define the goal pose (end-effector or joint variables) of demonstration as *P* that only task kinematic variables into consideration for task exploration. The *P* can be adapted by the transformation relationship between *N <sup>j</sup> <sup>s</sup>* and *N <sup>h</sup> <sup>g</sup>* , then *N <sup>h</sup> <sup>g</sup>* derived by

$$\mathcal{A}\_{\mathbf{g}}^{\prime h} = \mathbf{P} \mathcal{F}\_{\mathcal{A}\_{\mathbf{i}}^{\prime j}, \mathcal{A}\_{\mathbf{k}}^{\prime j}} \tag{6.5}$$

**Fig. 6.3** Illustrates the human interactive demonstration in our kitting experiment when the robot collides with the packaging box during transporting an object

As Fig. 6.3 shown a human robot collaborative task is designed in latter experimental verification, where a Baxter robot encounter a persistent fault (wall collision) during transporting object from human-over. In this situation, the robot likely to encounter a fault along the deficient movement (as shown in Fig. 6.3a, and the fault can't be eliminated by reverse execution because of the fixed obstacle (box) on the right hand side of robot. An modulated trajectory should be introduced for updating the original task representation, as shown in Fig. 6.3b a human interactive demonstration for exploring the failure task. Subsequently, a transformation in Eq. (6.5) is derived by learning from interactive demonstration, which can be explored in a new scenario when encounter a same fault, as shown in Fig. 6.3c.

#### **6.7 Experiments and Results**

#### *6.7.1 Platform Setup*

To evaluate the proposed method for incremental learning the introspective movement primitives. We designed a HRC task for picking and placing object into a

**Fig. 6.4** An autonomous kitting experiment by Baxter robot: The robot is designed to transport 6 marked objects with variable weights and shapes to a container, where the external anomalies may arise from accidental collisions (human-robot, robot-world, robot/object-world), mis-grasps, object slips So as to identify those unexpected anomalies, the robot arm is integrated with multimodal sensors (shown in bottom-right), including internal joint encoders, F/T sensor, tactile sensor

container using Baxter robot and integrated the ROS Indigo<sup>1</sup> and Moveit<sup>2</sup> as the middle-wares in our system. Specifically, a human co-worker is tasked to place a set of 6 objects marked with Alvar tags<sup>3</sup> on the robot's reachable region (located in front of the robot) in a one-at-a-time fashion. The objects may accumulate in a queue in front of the robot once the first object is placed on the table, the robot's left arm camera identifies the object and the robot's right arm picks and places it in a container located to the right of the robot, as shown in Fig. 6.4. Multiple sensors were installed for effectively sensing the unstructured environment and potential faults in such a kitting experiment. Here, the right arm of Baxter robot is equipped with a 6 degrees of freedom (DoF) Robotiq F/T sensor and 2 Baxter-standard electric pinching fingers, where each finger is further equipped with a multimodal tactile sensor composed of a 4 × 7 taxel matrix that yields absolute pressure values. In addition, Baxter's left hand camera is placed flexibly in a region that can capture objects in the collection bin with a resolution of 1280 × 800 at 1 fps (we are optimizing pose accuracy and lower computational complexity in the system). The use of the left hand camera facilitated calibration and object tracking accuracy. After there aforementioned integration, the robot picks each object and transports it towards the container, after which, the robot appropriately places each of the six objects in different parts of the container, several snapshots are visualized in Fig. 6.5.

In consideration of the redundant features would aggravate computational efficiency and increase false-positive rate (fault occurs even when robot's movement is

<sup>1</sup>http://wiki.ros.org/indigo.

<sup>2</sup>https://moveit.ros.org/.

<sup>3</sup>http://wiki.ros.org/ar\_track\_alvar.

**Fig. 6.5** The Snapshots of human-robot kitting experiment with a Baxter robot. We subjectively assume this task consists of 5 individual movements, including **a** Home → Pre-pick; **b** Pre-pick → Pick; **c** Pick → Pre-pick; **d** Pre-pick → Pre-place; **e** Pre-place → Place; **f** Reset

**Fig. 6.6** Multimodal signals in kitting task, where the grey represents the invalid robot movements including the adjustment, orientation alignment, reset, etc, other colors are empirically indicated an potential movement primitive along with obvious signals fluctuations

normal), we perform empirical features extraction on the original observation vector to improve identification performance. Specifically, we compute the norm of both the force *n <sup>f</sup>* and the moment *nm* as features in wrench modality, take the norm of both the Cartesian linear *nl* and angular *na* velocities in velocity modality, and consider the standard deviation for each tactile sensor*sl* and *sr*. Therefore, our feature vector *yt* of length 6 and formulated as *yt* = [*n <sup>f</sup>* , *nm*, *nl*, *na*,*sl*,*sr*], evolving extracted features of ten nominal executions are illustrated in Fig. 6.6.

#### *6.7.2 Parameter Settings for Anomaly Monitoring and Diagnose*

We need both qualitative and quantitative analysis of the proposed method. To evaluate the whole performance of IMPs in unstructured scenarios with unexpected faults that including both of the accidental and persistent causes. We organized 5 participants as collaborator (one expert user who confidently know this implementation and other four novice users) in our designed kitting experiment. Novice users first learned from the expert to induce fault during robot executions, which would aggravate the external uncertainty and increase the modeling difficulties. During data collection, each participant performed 1 nominal and 6 executions that at least one fault event by placing the set of 6 household objects in a one-by-one fashion. Consequently, totally perform 30 nominal executions and 180 failure executions. We induce the fault manually for each movement primitive, where including but no limited to the Human Collision (HC), Object Slip (OS), Tool Collision (TC), Wall Collision (WC), etc.

We first evaluate the complex task segmentation from twenty whole kinesthetic demonstrations using Bayesian nonparametric methods. As shown in Fig. 6.7, illustrating the complex task segmentation by learning the underlying dynamics using HDP-VAR-HMM, where each row indicates an independent demonstration and different color represents a specific movement primitive, which no need to guarantee each demonstration have the same segmentation order. After then, we concentrate the ordered hidden state sequence and group them by hidden state transition pair *t Pair*, where the total number of pair is computed using the permutation combination algorithm, i.e. *t Pairs* = *C*<sup>2</sup> <sup>5</sup> *A*<sup>2</sup> <sup>2</sup> = 20, where *K* is the dimension of hidden space. Thus, the frequency among the possible transition of the learned five movement primitives is illustrated in Fig. 6.8, where the kitting experiment is definitely begin with movement 2 (clustered by the hidden state 2) and then the successor should be movement 1, subsequent movement is 4 for most cases or movement 3 is the second-choice, and so on. Until now, we can effective learning the complex task representation for a set of nominal unstructured kinesthetic demonstration in a manipulation graph way. With this representation, we can evaluate the following the anomaly monitoring, diagnose as well as task exploration, respectively.

According to Chaps. 4 and 5, our proposed movement primitive equipped with two introspective capabilities: anomaly monitoring and diagnose, that necessary for endowing robot long-term autonomy and safer collaborative interaction in humanrobot collaborative scenarios. Particularly, our anomaly monitoring and diagnose are implemented based on the Bayesian nonparametric models proposed in Chap. 2 that using the HDP-VAR-HMM with a first-order autoregressive Gaussian likelihood, each state *k* has two parameters to be defined: the regression coefficient matrix *Ak* and the precision matrixΛ*<sup>k</sup>* as well as the four parameters ν, Δ, *V*, *M* of the conjugate prior MNIW are assigned in advance. To guarantee the prior has a valid mean, the degrees-of-freedom variable is set as ν = *d* + 2 and Δ is set by constraining the mean or expectation of the covariance matrix <sup>E</sup>[Λ<sup>−</sup><sup>1</sup> *<sup>k</sup>* ] under the Wishart prior in

**Fig. 6.7** Complex task segmentation by learning the underlying dynamics using HDP-VAR-HMM. Each row indicates an independent demonstration and different color represents a specific movement primitive, which no need to guarantee each demonstration have the same segmentation order

**Fig. 6.8** Illustrates the transition frequency among the derived five movement primitives. The hidden state sequences is first taken to extract the movement primitives from a set of demonstrations. A uncompleted manipulation sequence is ordered with 2 → 1 → 0 → 4 → 3 based on the derived sequences. The numbers on the transitions correspond to the transition frequency for weakly representing the task at the beginning, and subsequently strengthening by incremental learning the introspective capabilities and exploration from practical applications

Eq. (4.30).

$$\mathbb{E}[A\_k^{-1}] = s\_F \sum\_{n=1}^{N} \sum\_{t=1}^{T\_n} (\mathbf{y}\_t - \overline{\mathbf{y}})(\mathbf{y}\_t - \overline{\mathbf{y}})^T. \tag{6.6}$$

Assume that we record *N* sequential data for each skill and the length of sequence *n* ∈ *N* is *Tn*. Thus, we can easily define the parameter Δ accordingly as

$$
\Delta = (\nu - d - 1) \mathbb{E}[A\_k^{-1}].\tag{6.7}
$$

We placed a nominal prior on the mean parameter with mean equal to the empirical mean and expected covariance equal to a scalar *sF* times the empirical covariance, here *sF* = 1.0. This setting is motivated by the fact that the covariance is computed from polling all of the data and it tends to overestimate mode-specific co-variances. A value slightly less than or equal to 1 of the constant in the scale matrix mitigates the overestimation. Also, setting the prior from the data can move the distribution mass to reasonable parameter space values. The mean matrix *M* and *V* are set such that the mass of the Matrix-Normal distribution is centered around stable dynamic matrices while allowing variability in the matrix values (see Chap. 2 for details).

$$\begin{aligned} M &= \mathbf{0}\_d, \\ V &= 1.0 \ast \mathbf{I}\_d. \end{aligned} \tag{6.8}$$

where **0***<sup>d</sup>* and **I***<sup>d</sup>* are the matrix of all zeros and the identity matrix, respectively, each of size *d* × *d*. For the concentration parameters, a *Gamma*(*a*, *b*) prior is set on HDP concentration parameters α. A *Beta*(*c*, *d*) prior is set on the self-transition parameter μ and the degree of self-transition bias κ is set to 50. We choose a weekly informative setting and choose:

$$a = 0.5, b = 5.0, c = 1.0, d = 5.0\tag{6.9}$$

where, the initial transition proportion parameter is defined as μ ∼ *Beta*(1, 10) and the Split-Merge Monte Carlo method sampling parameter maximum iterations is set to 1000. The truncation states number is set to *K* = 5 for anomaly monitoring, and *K* = 10 for fault diagnose.

The anomaly monitoring is implemented by comparing the cumulative likelihood *L* of observed observations, where the fault threshold is calculated from a set of *Li*,*i* ∈ {1, 2, ..., 20} nominal demonstrations, and formulated with the expected cumulative likelihood μ(*L* ) minus the standard deviation σ (*L* ) that multiply by a constant value *c*. Additionally, the fault diagnose is activated when fault detected, which mainly implemented by comparing the sum of log-likelihood of a failure sample in a supervised fashion. Particularly, we use the K-Folders Cross Validation

**Fig. 6.9** Visualizing the robot trajectories of reverse execution when accidental faults are detected. As shown, the five normal IMPs is derived by learning from unstructured demonstrations that successfully performing human-robot collaborative kitting task and the dark-blue trajectories are explored with the maximum probability in our defined reverse policy once the fault detected

defined in *sklearn* <sup>4</sup> (here, *K* = 3) for model selection, in which the objective is the accuracy of anomaly monitoring with a fixed constant value *c* = 3.

To achieve the incremental learning introspective movement primitive from unstructured demonstrations, another critical capability would be the task exploration under the unseen and unexpected situation, especially when robot encounter a failure event. There are two independent exploring policies: reverse execution and human interaction, are proposed for responding the external accidental and persistent faults, respectively. We test the whole performance in two scenarios for evaluating how to combine the anomaly monitoring, fault diagnose, and exploration in practical applications along with the quantitative performance. As illustrated in Fig. 6.9, a set of collective 3D moving trajectories is presented for evaluating the exploring reverse execution when robot encounter accidental faults. Where the kitting task is first represented by five introspective movement primitives, and then the accidental faults (marked in yellow dots) randomly happened during robot execution, the robot immediately perform the reverse execution (dark-blue in color) according the frequency distribution shown in Table 6.1.

Additionally, As illustrated in Fig. 6.10, a set of collective 3D moving trajectories is presented for evaluating the exploring human interaction when robot encounter an wall collision that is a persistent fault. To evaluate the overall performance after human one-shot interaction, we get a 80.95% success rate when the wall collision induced by repeating the experiment 30 times.

<sup>4</sup>https://scikit-learn.org.

**Table 6.1** The frequency distribution of reverse execution policy in kitting experiment, where we record the data from five participants in total of 30 times of each accidental fault. Additionally, the <sup>2</sup> is an union of robot performing *Pre* − *pick* → *Pick* and *Pick* → *Pre* − *pick*


**Fig. 6.10** Visualizing the robot trajectories after human interaction when persistent faults are detected. As shown, the five normal IMPs is derived by learning from unstructured demonstrations that successfully performing human-robot collaborative kitting task and the dark-blue trajectories are incrementally explored by updated the demonstrated IMP parameters (including the start and goal pose, shape) once the fault detected

#### *6.7.3 Discussion*

In our previous work, we focus on the robot introspective capabilities in robotics, which only take the kinematic variables into consideration such that the task is restricted to be applied in human-robot collaborative scenarios that generalization (including motion, introspection, and decision-making, etc.) as a desirable characteristic. To address this problem, robot movement primitive augmented with introspective capacities *IMPs* is investigated in this paper, which by associating the generalization, anomaly monitoring, fault diagnoses, and task exploration during robot manipulation task. We mainly introduce the IMPs can be acquired by assessing the quality of multimodal sensory data of unstructured demonstrations using a nonparametric Bayesian model, named HDP-VAR-HMM, and IMPs can incremental learning the exploring policy using multimodal distribution and human one-shot modification when robot encounter fault. Particularly, reverse execution and human interaction are two independent policies for task exploration, which proposed to respond the external accidental and persistent fault, respectively. Experimental evaluation on a humanrobot collaborative packaging task with a Rethink Baxter robot, results indicate that our proposed method can effectively increase robustness towards perturbations and adaptive exploration during human-robot collaborative manipulation task. We need to emphasize that our method presents a solution for endowing robot with introspection in sense-plan-act control methodology for robot manipulation task.

Recently, we are working on extending IMPs to be more human-like manipulation that including the visual and audial information for human robot collaborative electronic assembly task. We are also investigating the use of variational recurrent autoencoder neural network to facilitate our proposed framework for more complex scenarios.

#### **6.8 Summary**

In this chapter, we present two policies to deal with accidental and persistent anomalies respectively: movement reverse execution and human interaction. Reverse execution allows the robot retry the current movement or several move-ments that independent with the current state for resolving the accidental faults by redo current or some previous movement primitives by updating the parameters f primitives (including shape, starting point and target point, etc.), and to use polynomial distribution to realize the modeling of a primitive after the occurrence of an anomaly occurrence. Human interaction allows the robot explore the current movement by human-assist demonstration for resolving the persistent faults that can't be restored by reverse execution, such as tool collision and wall collision, which mainly to learn the anomaly behavior from the human demonstration, and can realize the growth of the task representation of the recovery behavior. Importantly however we learned that the recovery ability of the system grows in difficulty with an increased number of adaptations as variations in sensory-motor signals increase as more recoveries are attempted.

The proposed recovery policies are verified on a Baxter robot performs kitting experiment tasks, results indicate that the proposed policies meet the extensibility and adaptability for improving the long-term autonomy, an integrated to the SPAIR framework. Consequently, this book provides a efficient theoretical framework and software system for the implementation of the longer-term autonomy and a safer environment for human-robot interaction scenarios. Ultimately the system presented in this book significantly extended the autonomy and resilience of the robot and has broad applicability to all manipulation domains that suffer from uncertainties in unstructured environments: making industrial and service robots prime candidates for this technology.

In this book, anomalies are and will continue to be a reality in robotics despite increasingly powerful motion-generation algorithms, to address them explicitly, we presented a tightly-integrated, graph-based online motion-generation, introspection, and incremental recovery system for manipulation tasks in loosely structured co-bot scenarios, which consist of the movement identification, task representation, anomaly monitoring, anomaly diagnoses, and anomaly recovery.

#### **References**


#### References 137


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.