Jiayue Sun Shun Xu Yang Liu Huaguang Zhang

# Adaptive Dynamic Programming For Chemotherapy Drug Delivery


Jiayue Sun, The State Key Laboratory of Synthetical Automation for Process Industries and the College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, China

Yang Liu, The Department of Thoracic Surgery, The First Affiliated Hospital of China Medical University, Shenyang, Liaoning, China

Shun Xu, The Department of Thoracic Surgery, The First Affiliated Hospital of China Medical University, Shenyang, Liaoning, China

Huaguang Zhang, The State Key Laboratory of Synthetical Automation for Process Industries and the College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, China

ISBN 978-981-99-5928-0 ISBN 978-981-99-5929-7 (eBook) https://doi.org/10.1007/978-981-99-5929-7

This work was supported by China National Postdoctoral Program for Innovative Talents (BX20220060) and National Natural Science Foundation of China (62203469, 62203097)

© The Editor(s) (if applicable) and The Author(s) 2024. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Paper in this product is recyclable.

*To My Family*
Jiayue Sun

*To My Family*
Shun Xu

*To My Family*
Yang Liu

*To My Family*
Huaguang Zhang

### **Preface**

Optimization is the process of finding the best solution to a problem subject to a set of constraints. It has long been a cornerstone of both engineering and mathematics. The evolution of optimization can be traced back to the ancient Greeks, who employed geometric methods to solve optimization problems. In the eighteenth and nineteenth centuries, optimization began to take on a more formalized approach with the development of calculus and the rise of industrialization. Mathematicians such as Leonhard Euler and Joseph-Louis Lagrange developed methods for finding the maximum and minimum values of functions, which were crucial for optimizing industrial processes and designing efficient machines. The twentieth century saw a significant expansion of the field, with the development of linear programming and other optimization techniques. Linear programming, a mathematical technique for optimizing a linear objective function subject to linear constraints, was first introduced by George Dantzig in the 1940s and quickly became a powerful tool for solving complex optimization problems. In the latter half of the twentieth century, optimization began to be applied to a wide range of fields beyond mathematics and engineering. Operations research, which uses mathematical models to optimize complex systems, became a popular field in business and management. Optimization techniques were also applied to fields such as finance, transportation, and telecommunications. The models used in these applications were typically static and deterministic, meaning they did not take into account the dynamic nature of many real-world systems. Over time, researchers have developed increasingly sophisticated optimization algorithms that can adapt to changing conditions and learn from experience.

Adaptive dynamic programming (ADP) is a powerful optimization technique for improving dynamic systems. Despite being a relatively new area of optimization, it has seen broad use across various industries. ADP is rooted in the principle of dynamic programming, which involves breaking down a complex optimization problem into smaller subproblems that are solved recursively using backward induction. ADP learns from feedback and adjusts its behavior accordingly, making it useful for systems operating in uncertain environments. It has been applied to a wide range of problems, including electrical power systems, robotics, self-driving cars, and trading strategies. This book focuses on the practical application of ADP to chemotherapy drug delivery, taking into account clinical variables and real-time data. ADP's ability to adapt to changing conditions and make optimal decisions in complex and uncertain situations makes it a valuable tool for addressing pressing challenges in healthcare and other fields. As optimization technology evolves, we can expect even more sophisticated and powerful solutions to emerge.

Shenyang, China April 2023

Jiayue Sun Shun Xu Yang Liu Huaguang Zhang

**Acknowledgements** Our acknowledgments go to our fellow colleagues, who have offered invaluable support and encouragement throughout this research effort. The authors are especially grateful to their families for their encouragement and never-ending support when it was most required. Finally, we would like to thank the editors at Springer for their professional and efficient handling of this project.

The writing of this book was partially supported by China National Postdoctoral Program for Innovative Talents (BX20220060), National Natural Science Foundation of China (62203097, 62203469).

## **Contents**




## **List of Figures**



## **List of Tables**


## **Chapter 1 Introduction**

The stability analysis of dynamical systems, which are ubiquitous in nature, has long been a hot topic of research and several approaches have been proposed. However, control scientists often demand optimality in addition to the stability of the control system. In the 1950s and 1960s, motivated by the development of space technology and the practical use of digital computers, the theory of optimization of dynamical systems developed rapidly, forming an important branch of the discipline: optimal control. It is increasingly used in many fields, such as space technology, systems engineering, economic management and decision-making, population control, and optimization of multi-stage process equipment. In 1957, Bellman proposed an effective tool for solving optimal control problems: the dynamic programming (DP) method [1]. At the heart of this approach is Bellman's optimality principle, which states that the optimal policy for a multilevel decision process has the property that, regardless of the initial state and initial decision, the remaining decisions must also be optimal for the state formed by the initial decision. This principle can be reduced to a basic recursive formula for solving multilevel decision problems by starting at the end and working backward to the beginning. It applies to a wide range of discrete, continuous, linear, nonlinear, deterministic, and stochastic systems.

ADP is a new approach to approximate optimality in the field of optimal control, and it is a current research topic in the international optimization field. The ADP method uses a function approximation structure to approximate the solution of the Hamilton-Jacobi-Bellman (HJB) equation and uses offline iteration or online updating to obtain an approximately optimal control strategy for the system, which can effectively solve the optimal control problem of nonlinear systems [2–11]. Bertsekas et al. summarized neuro-dynamic programming in the literature [12, 13], describing in detail dynamic programming, the structure of neural networks, and training algorithms. Meanwhile, several effective methods have been proposed for applying neuro-dynamic programming. Si et al. summarized the development of ADP methods across disciplines and discussed the connection of DP and ADP methods with artificial intelligence, approximation theory, control theory, operations research, and statistics [14]. In [15], Powell showed how to use ADP methods to solve deterministic or stochastic optimization problems and pointed out future directions for ADP methods. In [16], Balakrishnan et al. reviewed previous approaches to the design of feedback controllers for dynamic systems using the ADP method, covering both model-based and model-free cases. In [17], the ADP method was described from the perspectives of requiring and not requiring initial stability.

The ADP method has a unique algorithm and structure compared to other existing optimal control methods. It overcomes the drawback that classical variational theory cannot handle optimal control problems with closed-set constraints on the control variables. Like the maximum principle, the ADP method is suitable not only for optimal control problems with open-set constraints but also for those with closed-set constraints. While the maximum principle can only provide necessary conditions for optimal control problems, the DP method gives sufficient conditions. However, the direct application of the DP method is difficult due to the "curse of dimensionality" encountered in solving the HJB equation. Hence the ADP method, as an approximate solution to the DP method, overcomes this limitation. It is more suitable for applications in systems with strong coupling, strong nonlinearity, and high complexity. For example, the literature [18] presented a constrained adaptive dynamic programming (CADP) algorithm that can solve general nonlinear non-affine optimal control problems with known dynamics. Unlike previous ADP algorithms, it handles problems with state constraints directly by proposing a constrained generalized policy iteration framework that transforms the traditional policy improvement process into a constrained policy optimization problem with state constraints. To solve the problem of robust tracking control, the literature [19] designed an online adaptive learning structure to build a robust tracking controller for nonlinear uncertain systems. The literature [20] proposed a biased policy iteration method for solving data-driven optimal control problems for unknown continuous-time linear systems; by adding a bias parameter, the conditions on the initial admissible controller can be further relaxed.
The literature [21] made the first attempt at ADP control for a nonlinear Itô-type stochastic system, transforming a complex optimal tracking control problem into a stabilizing control optimization problem by reconstructing a new stochastic augmented system. The use of a critic neural network in iterative learning subsequently simplifies the actor-critic structure and reduces the computational load. The ADP approach is also very widely used in a number of common practical systems. The literature [22] developed an event-triggered adaptive dynamic programming method to design formation controllers, solving the problem of distributed formation control for multi-rotor UAS. For wind/solar hybrid energy systems, the literature [23] presented an adaptive dynamic programming method based on Bellman's principle, which enables accurate current sharing and voltage regulation. Based on this approach, it is possible to obtain the optimal control variables for the objective of each energy unit.

Optimal control of nonlinear systems has long been one of the hot spots and difficulties in the field of control research. As a novel technique for solving the optimal control problem, the ADP method integrates the theories of neural networks, adaptive critic design, reinforcement learning, and classical dynamic programming to overcome the "curse of dimensionality", and it also enables the acquisition of an approximately optimal closed-loop feedback control law. As a consequence, delving deeper into the theory of ADP and its algorithms for solving optimal control of nonlinear systems holds immense theoretical significance and practical application value. Although research on the ADP method is still in its early stages, this book aims to equip readers with a foundational understanding of the method and empower them to apply it to diverse optimization problems in fields such as medicine, science, and engineering.

#### **1.1 Optimal Control Formulation**

There are several schemes of dynamic programming [1, 13, 24]. One can consider discrete-time systems or continuous-time systems, linear systems or nonlinear systems, time-invariant systems or time-varying systems, deterministic systems or stochastic systems, etc. Discrete-time (deterministic) nonlinear (time-invariant) dynamical systems will be discussed first. Time-invariant nonlinear systems cover most of the application areas and discrete time is the basic consideration for digital implementation.

#### *1.1.1 ADP for Discrete-Time Systems*

Consider the following discrete-time nonlinear systems:

$$x_{k+1} = F(x_k, u_k), \quad k = 0, 1, 2, \dots, \tag{1.1}$$

where $x_k \in \mathbb{R}^n$ is the state vector and $u_k \in \mathbb{R}^m$ is the control input vector. The corresponding cost function (performance index function) of the system takes the form of

$$J(x_k, \bar{u}_k) = \sum_{i=k}^{\infty} \gamma^{i-k} U(x_i, u_i), \tag{1.2}$$

where $\bar{u}_k = (u_k, u_{k+1}, \dots)$ is the control sequence starting at time $k$, $U(x_i, u_i)$ is the utility function, and $\gamma$ is the discount factor, satisfying $0 < \gamma < 1$. Note that the function $J$ depends on the initial time $k$ and the initial state $x_k$. Generally, it is desired to determine $\bar{u}_0 = (u_0, u_1, \dots)$ so that $J(x_0, \bar{u}_0)$ is optimized (i.e., maximized or minimized). We will use $\bar{u}_0^* = (u_0^*, u_1^*, \dots)$ and $J^*(x_0)$ to denote the optimal control sequence and the optimal cost function, respectively. The objective of the dynamic programming problem in this book is to determine a control sequence $u_k$, $k = 0, 1, \dots$, so that the function $J$ (i.e., the cost) in (1.2) is minimized. The optimal cost function is defined as

$$J^*(x_0) = \inf_{\bar{u}_0} J(x_0, \bar{u}_0) = J(x_0, \bar{u}_0^*), \tag{1.3}$$

which depends upon the initial state $x_0$.
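To make the definitions above concrete, the following sketch approximates the discounted cost (1.2) for a fixed control policy by truncating the infinite sum. The scalar dynamics $F$, utility $U$, and all numbers are hypothetical illustrations, not from the book:

```python
# Hypothetical scalar example: F(x, u) = 0.9*x + u, U(x, u) = x^2 + u^2.
def F(x, u):
    return 0.9 * x + u

def U(x, u):
    return x ** 2 + u ** 2

def discounted_cost(x0, policy, gamma=0.95, horizon=200):
    """Approximate J(x_0) of (1.2) by truncating the infinite sum."""
    x, J = x0, 0.0
    for k in range(horizon):
        u = policy(x)
        J += gamma ** k * U(x, u)
        x = F(x, u)
    return J

# Doing nothing (u = 0) still lets this stable plant decay, so the cost is finite;
# a crude proportional policy u = -0.5*x drives x down faster at some control cost.
J_zero = discounted_cost(1.0, lambda x: 0.0)
J_prop = discounted_cost(1.0, lambda x: -0.5 * x)
```

Because $\gamma < 1$ and the utility along these trajectories is bounded, the tail of the sum is negligible for a modest horizon, which is why truncation gives a good estimate of $J$.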

The control action may be determined as a function of the state. In this case, we write $u_k = u(x_k)$, $\forall k$. Such a relationship, or mapping $u: \mathbb{R}^n \to \mathbb{R}^m$, is called feedback control, control policy, or simply policy (also control law). For a given control policy $\mu$, the cost function in (1.2) is rewritten as

$$J^{\mu}(\mathbf{x}\_k) = \sum\_{i=k}^{\infty} \gamma^{i-k} U(\mathbf{x}\_i, \mu(\mathbf{x}\_i)),\tag{1.4}$$

which is the cost function for system (1.1) starting at $x_k$ when the policy $u_k = \mu(x_k)$ is applied. The optimal cost for system (1.1) starting at $x_0$ is determined as

$$J^\*(\mathbf{x}\_0) = \inf\_{\mu} J^{\mu}(\mathbf{x}\_0) = J^{\mu^\*}(\mathbf{x}\_0),\tag{1.5}$$

where μ<sup>∗</sup> denotes the optimal policy.

Dynamic programming is based on Bellman's principle of optimality [1, 13, 24]: An optimal (control) policy has the property that no matter what previous decisions have been, the remaining decisions must constitute an optimal policy with regard to the state resulting from those previous decisions.

According to Bellman, the minimum cost of any state starting at time $k$ consists of two parts: the minimum cost at time $k$ itself, and the discounted cumulative minimum cost from time $k + 1$ onward. In terms of equations, this means that

$$\begin{split} J^*(x_k) &= \min_{u_k} \{ U(x_k, u_k) + \gamma J^*(x_{k+1}) \} \\ &= \min_{u_k} \{ U(x_k, u_k) + \gamma J^*(F(x_k, u_k)) \}. \end{split} \tag{1.6}$$

This is known as the Bellman optimality equation, or the discrete-time Hamilton-Jacobi-Bellman (HJB) equation. One then obtains the optimal policy, i.e., the optimal control $u_k^*$ at time $k$ is the $u_k$ that achieves this minimum:

$$u_k^* = \arg\min_{u_k} \{ U(x_k, u_k) + \gamma J^*(x_{k+1}) \}. \tag{1.7}$$

Since one must know the optimal policy at time $k + 1$ in order to use (1.6) to determine the optimal policy at time $k$, Bellman's principle yields a backwards-in-time procedure for solving the optimal control problem. It is the basis for dynamic programming algorithms in extensive use in control system theory, operations research, and elsewhere.
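Over an infinite horizon, (1.6) can also be solved forward in time as a fixed-point (value) iteration: start from an arbitrary guess and repeatedly apply the Bellman backup until the value function stops changing. The sketch below is our illustration, not the book's algorithm; the toy dynamics, utility, grids, and discount factor are all hypothetical:

```python
import numpy as np

gamma = 0.95
xs = np.linspace(-2.0, 2.0, 81)      # discretized state space
us = np.linspace(-1.0, 1.0, 41)      # discretized control set
J = np.zeros_like(xs)                # initial guess J_0 = 0

X, Uc = np.meshgrid(xs, us, indexing="ij")

def backup(J):
    """One Bellman backup of (1.6): min over u of U(x,u) + gamma * J(F(x,u)),
    for the toy system F(x,u) = 0.9*x + u, U(x,u) = x^2 + u^2."""
    Xnext = np.clip(0.9 * X + Uc, xs[0], xs[-1])       # successor states on the grid
    Q = X ** 2 + Uc ** 2 + gamma * np.interp(Xnext, xs, J)
    # return updated values and the greedy control of (1.7) for each grid point
    return Q.min(axis=1), us[Q.argmin(axis=1)]

for _ in range(300):                 # iterate (1.6) to (near) a fixed point
    J, u_star = backup(J)
```

After convergence, `u_star` is an approximately optimal feedback law on the grid; the interpolation and clipping are the usual discretization crutches and introduce their own error.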

#### *1.1.2 ADP for Continuous-Time Systems*

For continuous-time systems, the cost function $J$ is also the key to dynamic programming. By minimizing $J$, one gets the optimal cost function $J^*$, which is often a Lyapunov function of the system. As a consequence of Bellman's principle of optimality, $J^*$ satisfies the Hamilton-Jacobi-Bellman (HJB) equation. But usually, one cannot obtain the analytical solution of the HJB equation; even finding an accurate numerical solution is very difficult due to the so-called curse of dimensionality.

Consider the continuous-time nonlinear dynamical system

$$\dot{x}(t) = F(x(t), u(t)), \quad t \ge t_0, \tag{1.8}$$

where $x \in \mathbb{R}^n$ is the state vector and $u \in \mathbb{R}^m$ is the control input vector. The corresponding cost function of the system can be defined as

$$J(x_0, u) = \int_{t_0}^{\infty} U(x(\tau), u(\tau)) \, d\tau, \tag{1.9}$$

with utility function $U(x, u) \ge 0$, where $x(t_0) = x_0$. Bellman's principle of optimality can also be applied to continuous-time systems. In this case, the optimal cost

$$J^\*(\mathbf{x}(t)) = \min\_{u(t)} \{ J(\mathbf{x}(t), u(t)) \}, t \ge t\_0,\tag{1.10}$$

satisfies the HJB equation

$$-\frac{\partial J^\*}{\partial t} = \min\_{u(t)} \left\{ U(\mathbf{x}, u) + \left(\frac{\partial J^\*}{\partial \mathbf{x}}\right)^T F(\mathbf{x}, u) \right\}. \tag{1.11}$$

The HJB equation in (1.11) can be derived from Bellman's principle of optimality [24]. Meanwhile, the optimal control $u^*(t)$ will be the one that minimizes the cost function,

$$u^*(t) = \arg\min_{u(t)} \{ J(x(t), u(t)) \}, \quad t \ge t_0. \tag{1.12}$$
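As a concrete illustration (our example, not from the book), the scalar linear-quadratic case shows how (1.11) can be solved in closed form. For $\dot{x} = ax + bu$ with $U(x, u) = qx^2 + ru^2$ and the infinite-horizon cost (1.9), $J^*$ is time-invariant, so $\partial J^*/\partial t = 0$ and (1.11) reduces to an algebraic condition. Trying the quadratic ansatz $J^*(x) = px^2$:

$$\begin{aligned} 0 &= \min_{u}\left\{ qx^2 + ru^2 + \frac{dJ^*}{dx}(ax + bu) \right\}, \qquad \frac{dJ^*}{dx} = 2px, \\ u^*(x) &= -\frac{bp}{r}\,x \qquad \text{(setting the derivative with respect to } u \text{ to zero)}, \\ 0 &= q + 2ap - \frac{b^2 p^2}{r} \qquad \text{(scalar algebraic Riccati equation, } p > 0\text{)}. \end{aligned}$$

Substituting $u^*$ back into the minimized expression confirms that every term in $x^2$ cancels exactly when $p$ satisfies the Riccati equation; for general nonlinear $F$, no such closed form exists, which is what motivates the approximation schemes below.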

In 1994, Saridis and Wang [25] studied the nonlinear stochastic systems described by

$$dx = f(x, t)\,dt + g(x, t)u\,dt + h(x, t)\,dw, \quad t_0 \le t \le T, \tag{1.13}$$

with the cost function

$$J(x_0, u) = \mathbb{E}\left\{ \int_{t_0}^{T} \left( Q(x, t) + u^T u \right) dt + \phi(x(T), T) : x(t_0) = x_0 \right\},$$

where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$, and $w \in \mathbb{R}^k$ are the state vector, the control vector, and a separable Wiener process, respectively; $f$, $g$, and $h$ are measurable system functions; and $Q$ and $\phi$ are nonnegative functions. A value function $V$ is defined as

$$V(x, t) = \mathbb{E}\left\{ \int_t^T \left( Q(x, t) + u^T u \right) dt + \phi(x(T), T) : x(t_0) = x_0 \right\}, \quad t \in I,$$

where $I = [t_0, T]$. The HJB equation is modified to become the following equation

$$\frac{\partial V}{\partial t} + \mathcal{L}_u V + Q(x, t) + u^T u = \nabla V, \tag{1.14}$$

where $\mathcal{L}_u$ is the infinitesimal generator of the stochastic process specified by (1.13) and is defined by

$$\begin{aligned} \mathcal{L}_u V &= \frac{1}{2} \operatorname{tr}\left\{ h(x, t) h^T(x, t) \frac{\partial}{\partial x}\left( \frac{\partial V(x, t)}{\partial x} \right)^T \right\} \\ &+ \left( \frac{\partial V(x, t)}{\partial x} \right)^T \left( f(x, t) + g(x, t)u \right). \end{aligned}$$

Depending on whether $\nabla V \le 0$ or $\nabla V \ge 0$, an upper bound $\bar{V}$ or a lower bound $\underline{V}$ of the optimal cost $J^*$ is found by solving equation (1.14), such that $\underline{V} \le J^* \le \bar{V}$. Using $\bar{V}$ (or $\underline{V}$) as an approximation to $J^*$, one can solve for a control law. This leads to so-called suboptimal control. It was proved that such controls are stable for the infinite-time stochastic regulator optimal control problem, where the cost function is defined as

$$J(x_0, u) = \lim_{T \to \infty} \mathbb{E}\left\{ \frac{1}{T} \int_{t_0}^{T} \left( Q(x, t) + u^T u \right) dt : x(t_0) = x_0 \right\}.$$

The benefit of the suboptimal control is that the bound $\bar{V}$ (or $\underline{V}$) of the optimal cost $J^*$ can be approximated by an iterative process. Beginning from certain chosen functions $u_0$ and $V_0$, let

$$u\_i(\mathbf{x}, t) = -\frac{1}{2} \mathbf{g}^T(\mathbf{x}, t) \frac{\partial V\_{i-1}(\mathbf{x}, t)}{\partial \mathbf{x}}, i = 1, 2, \dots \tag{1.15}$$

Then, by repeatedly applying (1.14) and (1.15), one obtains a sequence of functions $V_i$. This sequence $\{V_i\}$ converges to the bound $\bar{V}$ (or $\underline{V}$) of the cost function $J^*$. Consequently, $u_i$ approximates the optimal control as $i$ tends to $\infty$. It is important to note that the sequences $\{V_i\}$ and $\{u_i\}$ are obtainable by computation, and they approximate the optimal cost and the optimal control law, respectively.


Some further theoretical results for ADP have been obtained in [2]. These works investigated the stability and optimality for some special cases of ADP. In [2], Murray et al. studied the (deterministic) continuous-time affine nonlinear systems

$$\dot{x} = f(x) + g(x)u, \quad x(t_0) = x_0, \tag{1.16}$$

with the cost function

$$J(x, u) = \int_{t_0}^{\infty} U(x, u) \, dt, \tag{1.17}$$

where $U(x, u) = Q(x) + u^T R(x) u$, with $Q(x) > 0$ for $x \ne 0$, $Q(0) = 0$, and $R(x) > 0$ for all $x$. Similar to [25], an iterative procedure is proposed to find the control law as follows. For the plant (1.16) and the cost function (1.17), the HJB equation leads to the following optimal control law

$$\boldsymbol{u}^\*(\boldsymbol{x}) = -\frac{1}{2}\boldsymbol{R}^{-1}(\boldsymbol{x})\boldsymbol{g}^T(\boldsymbol{x}) \left[\frac{\mathrm{d}J^\*(\boldsymbol{x})}{\mathrm{d}\boldsymbol{x}}\right].\tag{1.18}$$

Applying (1.17) and (1.18) repeatedly, one obtains sequences of estimates of the optimal cost function $J^*$ and the optimal control $u^*$. Starting from an initial stabilizing control $v_0(x)$, for $i = 0, 1, \dots$, the approximation is given by the following iterations between value functions

$$V_{i+1}(x) = \int_t^{\infty} U\left( x(\tau), v_i(x(\tau)) \right) d\tau$$

and control laws

$$v\_{i+1}(\mathbf{x}) = -\frac{1}{2} R^{-1}(\mathbf{x}) \mathbf{g}^T(\mathbf{x}) \left[ \frac{\mathbf{d} V\_{i+1}(\mathbf{x})}{\mathbf{d} \mathbf{x}} \right]$$

The following results were shown in [2].

(1) The sequence of functions $\{V_i\}$ obtained above converges to the optimal cost function $J^*$.

(2) Each of the control laws $v_{i+1}$ obtained above stabilizes the plant (1.16), for all $i = 0, 1, \dots$

(3) Each of the value functions $V_{i+1}(x)$ is a Lyapunov function of the plant, for all $i = 0, 1, \dots$

Abu-Khalaf and Lewis [26] also studied the system (1.16) with the following value function

$$V(x(t)) = \int_t^{\infty} U(x(\tau), u(\tau)) \, d\tau = \int_t^{\infty} \left( x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau) \right) d\tau,$$

where $Q$ and $R$ are positive-definite matrices. The successive approximation to the HJB equation starts with an initial stabilizing control law $v_0(x)$. For $i = 0, 1, \dots$, the approximation is given by the following iterations between policy evaluation

$$0 = \mathbf{x}^T \mathcal{Q} \mathbf{x} + v\_i^T(\mathbf{x}) R v\_i(\mathbf{x}) + \nabla V\_i^T(\mathbf{x}) \left( f(\mathbf{x}) + \mathbf{g}(\mathbf{x}) v\_i(\mathbf{x}) \right)$$

and policy improvement

$$v\_{i+1}(\mathbf{x}) = -\frac{1}{2}R^{-1}\mathbf{g}^T(\mathbf{x})\nabla V\_i(\mathbf{x})$$

where $\nabla V_i(x) = \partial V_i(x)/\partial x$. In [26], the above iterative approach was applied to systems (1.16) with saturating actuators through a modified utility function, with convergence and optimality proofs showing that $V_i \to J^*$ and $v_i \to u^*$ as $i \to \infty$. For continuous-time optimal control problems, attempts at successive solutions to the HJB equation have been going on for a long time; published works date back to as early as 1967, by Leake and Liu [26]. The brief overview presented here serves only as an introduction to many more recent results [26–28].
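This policy evaluation/improvement recursion can be checked numerically in the scalar linear-quadratic case, where the fixed point is the positive root of an algebraic Riccati equation. The following sketch is our illustration under hypothetical plant and weight values (not from the book): for $\dot{x} = ax + bu$ and $U = qx^2 + ru^2$, a policy $v(x) = -kx$ has quadratic value $V(x) = px^2$, so both steps reduce to scalar formulas:

```python
import math

# Hypothetical scalar plant xdot = a*x + b*u with cost integrand q*x^2 + r*u^2.
a, b, q, r = 1.0, 1.0, 1.0, 1.0

def evaluate(k):
    """Policy evaluation: for v(x) = -k*x and V(x) = p*x^2, the equation
    0 = q + r*k**2 + 2*p*(a - b*k) gives p in closed form."""
    assert a - b * k < 0, "the policy must be stabilizing"
    return (q + r * k ** 2) / (2.0 * (b * k - a))

k = 2.0                        # initial stabilizing gain: v_0(x) = -2x
for _ in range(20):
    p = evaluate(k)            # V_i(x) = p*x^2 for the current policy
    k = b * p / r              # improvement: v_{i+1} = -(1/2) R^{-1} g^T dV_i/dx

# The iterates converge to the positive root of the algebraic Riccati
# equation q + 2*a*p - (b**2/r)*p**2 = 0.
p_star = r * (a + math.sqrt(a ** 2 + b ** 2 * q / r)) / b ** 2
```

In higher dimensions the closed-form `evaluate` step is replaced by a Lyapunov or least-squares solve, and in the nonlinear case by a function approximator, which is exactly where the neural-network machinery of ADP enters.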

#### **1.2 Publication Outline**

The general layout of this monograph is as follows. Adaptive dynamic programming is used to design drug dosage regulation mechanisms that provide adaptive viral treatment strategies for input-constrained organisms, and this framework is extended to tumor cells, immune cells, and the interplay and regulation schemes within the immune system. The main contents of this monograph are as follows:



Firstly, the main objective is to approximate a Nash equilibrium between the tumor cells and the immune cell population, which is regulated through chemotherapy drugs and immunoagents guided by the mathematical growth model of the tumor cells. Secondly, a novel intelligent nonzero-sum-games-based ADP is put forward to solve the optimization control problem by reducing the growth rate of tumor cells while minimizing chemotherapy and immunotherapy drug dosages. Meanwhile, convergence analysis and an iterative ADP algorithm are specified to prove feasibility. Finally, simulation examples are provided to demonstrate the availability and effectiveness of the research methodology.


dynamics is established to model the relations among the tumor cells (TCs), virus particles, and the immune response. The ADP method is extended to approximately obtain the optimal strategy for the interaction system to reduce the population of TCs. Due to the consideration of asymmetric control constraints, nonquadratic functions are proposed to formulate the value function, such that the corresponding Hamilton-Jacobi-Bellman equation (HJBE) is derived, which can be deemed the cornerstone of ADP algorithms. Then, an ADP method with a single-critic network architecture integrating MDRM is proposed to obtain approximate solutions of the HJBE and eventually derive the optimal strategy. The design of the MDRM makes it possible for the dosage of the agentia containing oncolytic virus particles to be regulated in a timely and necessary manner. Furthermore, the uniform ultimate boundedness of the system states and critic weight estimation errors is validated by Lyapunov stability analysis. Finally, simulation results are given to show the effectiveness of the derived therapeutic strategy.

#### **References**



## **Chapter 2 Neural Networks-Based Immune Optimization Regulation Using Adaptive Dynamic Programming**

#### **2.1 Introduction**

In the fight against cancer, there were no effective measures before chemotherapy and radiation appeared, since only tiny differences exist between cancer cells and normal cells. Doctors operate to remove solid tumors that have not yet spread, but surgery cannot guarantee that the cancer will not recur. Since radiotherapy and chemotherapy have considerable side effects, and targeted therapy is inflexible because of its strong specificity, research attention turned to the human immune system. Generally, tumor cells escape from the immune system not because the immune system fails to recognize them or is not activated, but because cancer cells have evolved a way to block the activation of T cells through specific binding. Thus, the medical community has sought means to prevent cancer cells from intercepting the activation of T cells, thereby freeing up the immune system. Compared with traditional treatments such as surgery, radiation, and chemotherapy, immunotherapy has fewer side effects and better therapeutic effects. However, it is difficult to handle the transient period of immune agents. Therefore, a hybrid therapy of chemotherapy and immunotherapy is a better choice. As shown in [1], neither chemotherapy nor immunotherapy alone is sufficient to control tumor growth, but tumor cells can be eradicated by adopting the combination therapy known as biochemotherapy, described in [2].

With extensive development of nonlinear dynamic [3, 4], its engineering application scenarios enjoy increasing diversification such as competitive Nash equilibrium problems, especially in the biomedical field. And not coincidentally, game theory has been introduced into the interaction model of tumor cells and immune cells. Both of the chemotherapy and immunotherapy aim at reducing the number of tumor cells. Based on this fact, the collaborative game is formed and one can design adaptive therapy from the view of game theory. Multiple biological interactions constitute complex nonlinear growth process of tumor cells, however, regarding major influence factors of tumor cell populations as research object is the focus. Hunting cells refer to the immune cells participating in removing foreign agents and strengthening the immune response. temperatures have suggested that cell-mediated anti-tumor immunity contributed to increasing the population of hunting tumor cells to maintain a specific proportion between the resting and the hunting predator cells as 40% in literature [5], which was beneficial for maintenance of the tumor dormant state. The immune regulations vary from individual to individual, but immunotherapy-based optimal regulation plays the role of reducing tumor cells without considering certain circumstances in case of special invocation. Enhanced tumor antigen presentation could effectively stimulate dendritic cells and increase the immunotherapy-based curative effect in [6]. The known "predator-prey" between immune cells and tumor cells leads to cyclic growth and reduction, which can be continue indefinitely or reach an equilibrium saddle point determined by system parameters. Literature [7] investigated nonlinear dynamical model which provided guiding significance for introducing that into cybernetics. As known, system identification or optimal control is of great practical value. 
As a powerful and effective optimization algorithm, the ADP method can solve nonlinear optimal control problems well, yielding the most appropriate therapeutic strategy.

Of course, the immune system is responsible for restraining tumor growth, but it can hardly fight off the tumor cells alone. Firstly, the self-like character of tumor cells compared with normal cells within the body means that the immune system neither excludes nor rejects them. Secondly, the immune system has no strong defense mechanism of its own against cancer cells, which leads to the failure of the immune response. Finally, immune function was observed to be protective through intervention with organic binding agents of CD4 and CD8 cells. Chemotherapy can rapidly kill differentiated tumor cells, but it also destroys regular cells. This side effect of chemotherapy can be lessened by introducing immunotherapy; thus the combined therapy of chemotherapy and immunotherapy is more reasonable. Immunotherapies strengthen the immune system through extra stimulation and, on the other hand, improve its ability to recognize foreign entities. Therefore, the control objective is to decelerate the growth rate of tumor cells with minimized doses of chemotherapeutic and immunotherapeutic drugs. Furthermore, the optimal control strategy is obtained through the ADP method, giving the optimal level of each treatment regimen through the nonzero-sum differential game strategy developed in [8].

Prescribed performance tracking control has been creatively developed, as in [9]; however, little literature in this scope considers the mutual relationships among tumor cells, immune cells, chemotherapy drugs and immunotherapy drugs, let alone setting the performance as the eventually acquired optimal therapeutic effect associated with the coupling behaviors mentioned above. Looking back at literature such as [10], this chapter transforms the problem into a multi-player nonzero-sum game whose optimal control is obtained by complex decoupling of the Hamilton-Jacobi equations, as in [11]. Subsequently, online adaptive and off-policy learning algorithms were developed in [12–14]. Constrained inputs were taken into consideration for practical applications in [15], and even more intensive work on uncertain constraints was contemplated in [16]. As in [17], the control policies of the distributed subsystems act as players; notably, this chapter is formulated as a two-player nonzero-sum game including chemotherapy and immunotherapy. [18] first introduced an updating strategy based on intertask relationships. Likewise, the reciprocal action between the tumor cells and immune cells is analogous to the interactions between subsystems in [19, 20].

Unknown nonlinear dynamics are usually handled by fuzzy control, as in [21, 22], and neural networks, as in [18, 23], where the actor network and critic network are adopted for updating the control policy at an appropriate time through the policy iteration technique, as in [24–26]. The convergence of the model-based policy iteration algorithm is equivalent to that of data-based learning, as shown in [27]. Similarly, the system states and critic error are required to be uniformly ultimately bounded during the process of value iteration, which is guaranteed through the event-triggered formation control scheme first proposed for all signals of the closed-loop system in [28]. According to the value iteration algorithm, the optimum can be obtained through continuous learning [29, 30]. However, there is little research on the two-player nonzero-sum game considering tumor cells and immune cells using the proposed value iteration learning.

#### **2.2 Preliminaries**

As is known, there exist interaction relationships among the anticancer agent cells, lymphocytes and macrophages that constitute the basic immune system microenvironment, which can be described as follows. Firstly, T-lymphocytes and cytotoxic macrophages/natural killer cells can effectively damage tumor cells. Secondly, the destructive behaviour of macrophages can also activate T-lymphocytes for launching another attack. Meanwhile, the population of T-lymphocytes can be fed by resting cells. Finally, the model is governed by the degradation of resting cells and the activation of immune cells at a natural growth rate. This section gives the nonlinear growth equation that represents the whole immune response.

$$N\_{total} = \frac{\upsilon N\_H(t) N\_T(t)}{\nu + N\_T(t)} \tag{2.1}$$

where $N\_H(t)$ and $N\_T(t)$ denote the numbers of hunting cells and tumor cells at time $t$, respectively, and $\upsilon$ and $\nu$ are positive constants. The changes in quantity caused by the inactivation of the immune cells and the apoptosis of tumor cells are presented as:

$$\begin{split} \frac{dN\_T(t)}{dt} &= -\sigma\_1 N\_H(t) N\_T(t) \\ \frac{dN\_H(t)}{dt} &= -\sigma\_2 N\_H(t) N\_T(t) \end{split} \tag{2.2}$$

where $\sigma\_1$ denotes the loss rate of $N\_T(t)$ caused by $N\_H(t)$, and $\sigma\_2$ represents the loss rate of $N\_H(t)$ caused by $N\_T(t)$. The situations above reflect the competition between tumor cells and the host cells. Then we construct the dynamic equations as follows


**Table 2.1** Detailed descriptions of system parameters

$$\begin{aligned} \dot{N}\_T(t) &= \iota\_1 N\_T(t)(1 - \varrho\_1 N\_T(t)) - \sigma\_1 N\_T(t) N\_H(t) \\ &- \delta\_1 N\_{CD}(t) N\_T(t) \\ \dot{N}\_H(t) &= \frac{\upsilon N\_H(t) N\_T^2(t)}{\nu + N\_T^2(t)} + \frac{\varsigma N\_H(t) N\_{ID}(t)}{\vartheta + N\_{ID}(t)} - \sigma\_2 N\_T(t) N\_H(t) \\ &- \mathfrak{D} N\_H(t) - \delta\_2 N\_{CD}(t) N\_H(t) \end{aligned} \tag{2.3}$$

where $\mathfrak{D}$ represents the death rate of cells without considering any tumor cells, and $\iota\_\alpha$ and $\varrho\_\alpha$ ($\alpha = 1, 2$) denote the per capita growth rates and reciprocal carrying capacities. The descriptions of the other associated parameters are given in Table 2.1.

Consider the chemotherapy and immunotherapy drugs given at time $t$ as $u(t)$ and $v(t)$, regarded as multiple-dose administration, analogous to the influence of recombinant human interleukin-11 for injection or recombinant human granulocyte colony-stimulating factor injection. Assume that targeted therapy cannot be achieved through chemotherapeutic drugs alone. Then we can obtain

$$f\_{response}(t) = s\_{\alpha}(1 - e^{-\lambda u(t)})\tag{2.4}$$

where $s\_\alpha$ denotes the response coefficient distinguishing the change rates of different cells. The mathematical model considering the injected drugs is presented as
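As a quick numerical sanity check, the saturating response (2.4) can be evaluated directly; a minimal Python sketch (the function name and the sample doses are our own, with the chapter's default $\lambda = 1$):

```python
import math

def drug_response(u, s_alpha=1.0, lam=1.0):
    """Cell-kill response of (2.4): zero at u = 0 and saturating at s_alpha."""
    return s_alpha * (1.0 - math.exp(-lam * u))

r0 = drug_response(0.0)     # no dose, no response
r_hi = drug_response(50.0)  # large dose: response saturates near s_alpha
```

The saturation means that beyond a certain dose, extra drug yields almost no additional cell kill, which is what motivates minimizing the dose later in the chapter.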

$$\begin{aligned} \dot{N}\_{CD}(t) &= u(t) - \varphi\_1 N\_{CD}(t) \\ \dot{N}\_{ID}(t) &= v(t) - \varphi\_2 N\_{ID}(t) \\ \dot{N}\_T(t) &= \iota\_1 N\_T(t)(1 - \varrho\_1 N\_T(t)) - \sigma\_1 N\_T(t) N\_H(t) \\ &- \delta\_1 N\_{CD}(t) N\_T(t) - s\_2 (1 - e^{-\lambda u(t)}) \\ \dot{N}\_H(t) &= \frac{\upsilon N\_H(t) N\_T^2(t)}{\nu + N\_T^2(t)} + \frac{\varsigma N\_H(t) N\_{ID}(t)}{\vartheta + N\_{ID}(t)} - \sigma\_2 N\_T(t) N\_H(t) - \mathfrak{D} N\_H(t) \\ &- \delta\_2 N\_{CD}(t) N\_H(t) - s\_1 (1 - e^{-\lambda u(t)}) \end{aligned} \tag{2.5}$$

where $N\_{CD}(t)$ and $N\_{ID}(t)$ are the concentrations of the chemotherapy and immunotherapy drugs, and $u(t)$ and $v(t)$ are the doses of the chemotherapeutic and immunotherapeutic drugs, respectively. Generally speaking, $\lambda$ is taken as 1 owing to the unknown role of cytokines.

**Remark 2.1** The model (2.5) describes the relations among the hunting cells, the tumor cells, the concentration of the chemotherapy agent, and the concentration of the immunotherapy agent. From (2.5) we can see that both the hunting cells and the chemotherapy agent can reduce the number of tumor cells, and the immunotherapy agent can stimulate the growth of hunting cells. On the other hand, the tumor cells can influence the number of hunting cells. Based on this complicated interactive relationship, we can attain the optimal objective through ADP, that is, minimization of tumor cells while ensuring the number of normal cells at a certain time $t$.
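To make the coupling in (2.5) concrete, its right-hand side can be coded directly. The sketch below is ours: the parameter keys are invented names, and the illustrative values only echo the magnitudes used in the simulation of Sect. 2.4.

```python
import numpy as np

def model_rhs(X, u, v, p):
    """Right-hand side of (2.5); X = [N_T, N_H, N_CD, N_ID]."""
    N_T, N_H, N_CD, N_ID = X
    kill = 1.0 - np.exp(-p["lam"] * u)  # drug response (2.4)
    dN_T = (p["iota1"] * N_T * (1.0 - p["rho1"] * N_T)
            - p["sigma1"] * N_T * N_H
            - p["delta1"] * N_CD * N_T
            - p["s2"] * kill)
    dN_H = (p["ups"] * N_H * N_T**2 / (p["nu"] + N_T**2)
            + p["vsig"] * N_H * N_ID / (p["theta"] + N_ID)
            - p["sigma2"] * N_T * N_H
            - p["D"] * N_H
            - p["delta2"] * N_CD * N_H
            - p["s1"] * kill)
    dN_CD = u - p["phi1"] * N_CD
    dN_ID = v - p["phi2"] * N_ID
    return np.array([dN_T, dN_H, dN_CD, dN_ID])

# Illustrative parameter values (same magnitudes as Sect. 2.4).
params = dict(lam=1.0, iota1=0.00431, rho1=1.02e-9, sigma1=6.41e-11,
              delta1=0.08, ups=0.0125, nu=2.02e7, vsig=0.125, theta=2e7,
              sigma2=3.42e-6, D=0.204, delta2=3.42e-6,
              s1=1.0, s2=1.0, phi1=0.1, phi2=1.0)
```

With zero doses, the drug concentrations simply decay at their clearance rates, which gives an easy consistency check of the implementation.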

Before proceeding, let $X = [N\_T, N\_H, N\_{CD}, N\_{ID}]^T$; then the model (2.5) can be simplified as

$$\dot{X}(t) = f(X) + \mathbf{g}(X)u(t) + \kappa(X)v(t) \tag{2.6}$$

where $f(X)$ denotes the right-hand dynamics of (2.5) excluding the controls $u(t)$ and $v(t)$, and the matrices are $g(X) = [0, 0, 1, 0]^T$ and $\kappa(X) = [0, 0, 0, 1]^T$.

For system (2.6), the performance index function of player $\epsilon$ can be given as

$$J\_{\epsilon}(X\_0) = \int\_0^{\infty} \left( X^T \mathcal{Q}\_{\epsilon} X + u^T \mathcal{R}\_{\epsilon 1} u + v^T \mathcal{R}\_{\epsilon 2} v \right) d\tau \tag{2.7}$$

where $\mathcal{Q}\_\epsilon$ is a positive definite matrix, and $\mathcal{R}\_{\epsilon 1}$ and $\mathcal{R}\_{\epsilon 2}$ are symmetric positive definite matrices. The corresponding cost functions are presented as:

$$\mathcal{V}\_{\epsilon}(X, u, v) = \int\_{t}^{\infty} \mathfrak{R}\_{\epsilon}(X, u, v)d\tau \tag{2.8}$$

with the utility function

$$\mathfrak{R}\_{\epsilon}(X, u, v) = X^{T}\mathcal{Q}\_{\epsilon}X + u^{T}\mathcal{R}\_{\epsilon 1}u + v^{T}\mathcal{R}\_{\epsilon 2}v. \tag{2.9}$$
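For a trajectory sampled at step $dt$, the cost (2.8) can be approximated by a Riemann sum of the utility (2.9). A minimal sketch for scalar controls (function names are ours):

```python
import numpy as np

def utility(X, u, v, Q, R1, R2):
    """Utility (2.9) for one player, with scalar controls u and v."""
    return X @ Q @ X + R1 * u**2 + R2 * v**2

def cost_estimate(traj, controls, Q, R1, R2, dt):
    """Finite-horizon Riemann-sum approximation of the cost (2.8)."""
    return sum(utility(X, u, v, Q, R1, R2)
               for X, (u, v) in zip(traj, controls)) * dt
```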

**Definition 2.2** For the two-player NZS game of system (2.6), the Nash equilibrium solution is said to be attained by the control pair $(u^\*, v^\*)$ satisfying

$$\begin{aligned} \mathcal{V}\_{\epsilon}(u^\*, v^\*) &\le \mathcal{V}\_{\epsilon}(u, v^\*)\\ \mathcal{V}\_{\epsilon}(u^\*, v^\*) &\le \mathcal{V}\_{\epsilon}(u^\*, v) \end{aligned} \tag{2.10}$$

for any admissible control policies *u* and v.

The Hamilton functions can be constructed as:

$$\begin{aligned} \mathcal{H}\_{\epsilon}(X, u, v) &= X^{T} \mathcal{Q}\_{\epsilon} X + u^{T} \mathcal{R}\_{\epsilon 1} u + v^{T} \mathcal{R}\_{\epsilon 2} v \\ &+ \nabla \mathcal{V}\_{\epsilon}^{T} \left( f(X) + g(X) u(t) + \kappa(X) v(t) \right) \end{aligned} \tag{2.11}$$

where $\nabla\mathcal{V}\_\epsilon$ is the partial derivative of the cost function with respect to $X$ and $\epsilon = 1, 2$. According to the stationarity conditions at equilibrium points, the optimal controls of the two players are obtained as

$$u^\* = -\frac{1}{2} \mathcal{R}\_{11}^{-1} \mathbf{g}^T(X) \nabla \mathcal{V}\_1^\*$$

$$v^\* = -\frac{1}{2} \mathcal{R}\_{22}^{-1} \kappa^T(X) \nabla \mathcal{V}\_2^\* \tag{2.12}$$

with $\mathcal{V}\_1^\*$ and $\mathcal{V}\_2^\*$ being the solutions of the coupled HJ equations

$$\begin{aligned} X^T \mathcal{Q}\_1 X - \frac{1}{4} \nabla V\_1^{\*T} \mathcal{g}(X) \mathcal{R}\_{11}^{-1} \mathcal{g}^T(X) \nabla V\_1^\* + \nabla V\_1^{\*T} f(X) \\ + \frac{1}{4} \nabla V\_2^{\*T} \kappa(X) \mathcal{R}\_{22}^{-1} \mathcal{R}\_{12} \mathcal{R}\_{22}^{-1} \kappa^T(X) \nabla V\_2^\* \\ - \frac{1}{2} \nabla V\_1^{\*T} \kappa(X) \mathcal{R}\_{22}^{-1} \kappa^T(X) \nabla V\_2^\* = 0, \end{aligned} \tag{2.13}$$

and

$$\begin{split} &X^{T}\mathcal{Q}\_{2}X-\frac{1}{4}\nabla\mathcal{V}\_{2}^{\*T}\kappa(X)\mathcal{R}\_{22}^{-1}\kappa^{T}(X)\nabla\mathcal{V}\_{2}^{\*}+\nabla\mathcal{V}\_{2}^{\*T}f(X) \\ &\quad +\frac{1}{4}\nabla\mathcal{V}\_{1}^{\*T}\mathcal{g}(X)\mathcal{R}\_{11}^{-1}\mathcal{R}\_{21}\mathcal{R}\_{11}^{-1}\mathcal{g}^{T}(X)\nabla\mathcal{V}\_{1}^{\*} \\ &\quad -\frac{1}{2}\nabla\mathcal{V}\_{2}^{\*T}\mathcal{g}(X)\mathcal{R}\_{11}^{-1}\mathcal{g}^{T}(X)\nabla\mathcal{V}\_{1}^{\*}=0. \end{split} \tag{2.14}$$

**Lemma 2.3** *For the nonlinear system (2.6), suppose that $\mathcal{V}\_1^\*$ and $\mathcal{V}\_2^\*$ satisfy equations (2.13) and (2.14). Then under the optimal control (2.12), the system is asymptotically stable.*

*Proof* The proof is omitted since it is similar to that in [31, 32].

By solving the coupled HJ equations (2.13) and (2.14), one can obtain the optimal control (2.12), which means the Nash equilibrium of the two-player NZS game system is attained. Nevertheless, due to the existence of nonlinear and coupled terms, these partial differential equations are difficult to solve. Since ADP is a powerful approximate learning method, approximate solutions of (2.13) and (2.14) can be acquired.

#### **2.3 Design of Adaptive Controller**

To find the optimal control strategy, a critic network based on a neural network is constructed first. The optimal value function can then be expressed as:

$$\mathcal{V}\_{\epsilon}^{\*} = (\zeta\_{\epsilon}^{\*})^{T} \xi\_{\epsilon}(X) + o\_{\epsilon}, \epsilon = 1, 2,\tag{2.15}$$

where $\zeta\_\epsilon^\* \in \mathbb{R}^{p}$, $\xi\_\epsilon \in \mathbb{R}^{p}$ and $o\_\epsilon \in \mathbb{R}$ are the ideal weight vector, the activation function and the approximation error of the neural network, respectively. As it is scarcely possible to obtain the ideal weight $\zeta\_\epsilon^\*$, we give the approximate version

$$
\hat{V}\_{\epsilon}^{\*} = (\hat{\zeta}\_{\epsilon})^T \xi\_{\epsilon}(X). \tag{2.16}
$$

Based on (2.12) and (2.15), we obtain the optimal control as

$$\begin{aligned} u^\* &= -\frac{1}{2} \mathcal{R}\_{11}^{-1} \mathbf{g}^T(X) ( (\nabla \xi\_1(X))^T \zeta\_1^\* + \nabla o\_1) \\ v^\* &= -\frac{1}{2} \mathcal{R}\_{22}^{-1} \kappa^T(X) ( (\nabla \xi\_2(X))^T \zeta\_2^\* + \nabla o\_2) \end{aligned} \tag{2.17}$$

Then we further get the approximate control policies as

$$\begin{aligned} \hat{u} &= -\frac{1}{2} \mathcal{R}\_{11}^{-1} \mathbf{g}^T(X) (\nabla \xi\_1(X)^T \hat{\zeta}\_1) \\ \hat{v} &= -\frac{1}{2} \mathcal{R}\_{22}^{-1} \kappa^T(X) (\nabla \xi\_2(X)^T \hat{\zeta}\_2) \end{aligned} \tag{2.18}$$

**Remark 2.4** Owing to the unknowable nature of the ideal weights, NNs are used to approximate the value functions as in (2.16); aiming at minimizing the current estimates of the value functions in (2.15), the policies (2.18) can be obtained with available closed-form expressions.
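With $g(X) = [0, 0, 1, 0]^T$ and $\kappa(X) = [0, 0, 0, 1]^T$ from (2.6), the policies (2.18) reduce to scaled partial derivatives of the critic outputs with respect to the drug concentrations. A sketch assuming the quadratic activation (all second-order monomials of $X$) used later in Sect. 2.4; function names are ours:

```python
import numpy as np

def grad_quadratic_activation(X):
    """Gradient (10 x 4) of xi(X), the quadratic monomials of X in R^4."""
    n = len(X)
    pairs = [(i, j) for i in range(n) for j in range(i, n)]  # 10 monomials
    G = np.zeros((len(pairs), n))
    for k, (i, j) in enumerate(pairs):
        if i == j:
            G[k, i] = 2.0 * X[i]       # d(x_i^2)/dx_i
        else:
            G[k, i], G[k, j] = X[j], X[i]  # d(x_i x_j)
    return G

def approximate_policies(X, zeta1, zeta2, R11=1.0, R22=1.0):
    """Approximate controls (2.18) with g = [0,0,1,0]^T, kappa = [0,0,0,1]^T."""
    g = np.array([0.0, 0.0, 1.0, 0.0])
    kappa = np.array([0.0, 0.0, 0.0, 1.0])
    gradV1 = grad_quadratic_activation(X).T @ zeta1  # nabla V1_hat in R^4
    gradV2 = grad_quadratic_activation(X).T @ zeta2
    u_hat = -0.5 / R11 * g @ gradV1
    v_hat = -0.5 / R22 * kappa @ gradV2
    return u_hat, v_hat
```

For instance, a critic whose only nonzero weight multiplies $N\_3^2$ yields $\hat{u} = -N\_3$.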

According to (2.18), the closed-loop system can be rewritten as

$$
\dot{X}(t) = f(X) + \mathbf{g}(X)\hat{u} + \kappa(X)\hat{v}.\tag{2.19}
$$

Furthermore, we can attain the approximate Hamilton as

$$\begin{split} \mathcal{H}\_{\epsilon}(X, \hat{u}, \hat{v}) &= X^{T} \mathcal{Q}\_{\epsilon} X + \hat{u}^{T} \mathcal{R}\_{\epsilon 1} \hat{u} \\ &+ \hat{v}^{T} \mathcal{R}\_{\epsilon 2} \hat{v} + (\hat{\zeta}\_{\epsilon})^{T} \nabla \xi\_{\epsilon}(X) \dot{X}(t) \\ &= e\_{\epsilon}(t). \end{split} \tag{2.20}$$

To approach the optimal strategy and minimize $e\_\epsilon(t)$, the goal of adaptive learning is set to $\mathcal{E} = \mathcal{E}\_1 + \mathcal{E}\_2 = \frac{1}{2}e\_1^2 + \frac{1}{2}e\_2^2$. Then, applying the gradient descent method, we obtain the learning law of the critic for player $\epsilon$

$$\dot{\hat{\zeta}}\_{\epsilon} = -\varrho\_{\epsilon} \frac{1}{(\delta\_{\epsilon}^{T}\delta\_{\epsilon} + 1)^{2}} \frac{\partial \mathcal{E}(t)}{\partial \hat{\zeta}\_{\epsilon}} = -\varrho\_{\epsilon} \frac{1}{(\delta\_{\epsilon}^{T}\delta\_{\epsilon} + 1)^{2}} \frac{\partial \mathcal{E}\_{\epsilon}(t)}{\partial \hat{\zeta}\_{\epsilon}} = -\varrho\_{\epsilon} \frac{\delta\_{\epsilon}e\_{\epsilon}(t)}{(\delta\_{\epsilon}^{T}\delta\_{\epsilon} + 1)^{2}} \tag{2.21}$$

where $\delta\_\epsilon = \nabla\xi\_\epsilon(X)\dot{X}(t)$ and $\varrho\_\epsilon$ is the positive learning rate. Let $\tilde{\zeta}\_\epsilon = \zeta\_\epsilon^\* - \hat{\zeta}\_\epsilon$; then we have

$$\dot{\tilde{\zeta}}\_{\epsilon} = \varrho\_{\epsilon} \frac{\delta\_{\epsilon} \sigma\_{\epsilon}(t)}{(\delta\_{\epsilon}^{T} \delta\_{\epsilon} + 1)^{2}} - \varrho\_{\epsilon} \frac{\delta\_{\epsilon} \delta\_{\epsilon}^{T} \tilde{\zeta}\_{\epsilon}}{(\delta\_{\epsilon}^{T} \delta\_{\epsilon} + 1)^{2}} = \varrho\_{\epsilon} \underline{\delta}\_{\epsilon} \sigma\_{\epsilon}(t) - \varrho\_{\epsilon} \bar{\delta}\_{\epsilon} \bar{\delta}\_{\epsilon}^{T} \tilde{\zeta}\_{\epsilon}, \tag{2.22}$$

where $\underline{\delta}\_\epsilon = \delta\_\epsilon/(\delta\_\epsilon^T\delta\_\epsilon + 1)^2$, $\bar{\delta}\_\epsilon = \delta\_\epsilon/(\delta\_\epsilon^T\delta\_\epsilon + 1)$, and $\sigma\_\epsilon(t) = -\nabla o\_\epsilon^T(X)(f(X) + g(X)\hat{u} + \kappa(X)\hat{v})$ is the approximation residual error when employing the critic neural network [33].
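The update (2.21) and the residual (2.20) amount to only a few lines of code. A sketch for scalar controls (the names and the scalar simplification are ours):

```python
import numpy as np

def bellman_residual(X, u, v, zeta_hat, grad_xi, X_dot, Q, R1, R2):
    """Approximate Hamiltonian e(t) of (2.20) for one player."""
    return (X @ Q @ X + R1 * u**2 + R2 * v**2
            + zeta_hat @ (grad_xi @ X_dot))

def critic_step(zeta_hat, grad_xi, X_dot, e, rho):
    """One step of the normalized gradient law (2.21); delta = grad_xi @ X_dot."""
    delta = grad_xi @ X_dot
    return zeta_hat - rho * delta * e / (delta @ delta + 1.0) ** 2
```

The normalization by $(\delta^T\delta + 1)^2$ keeps the step size bounded regardless of how large the regressor $\delta$ becomes.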

Before presenting the main results of this chapter, two regular assumptions are necessary [34–36].

**Assumption 2.1** For $\epsilon = 1, 2$, the signal $\bar{\delta}\_\epsilon$ is persistently excited such that the following inequality is satisfied

$$
\zeta\_{\epsilon} I\_{\nu\_{\epsilon} \times \nu\_{\epsilon}} \le \int\_{t}^{t+T} \bar{\delta}\_{\epsilon} \bar{\delta}\_{\epsilon}^{T} d\varepsilon,\tag{2.23}
$$

where $\nu\_\epsilon$ denotes the number of neurons of the $\epsilon$th critic network.

**Assumption 2.2** For $\epsilon = 1, 2$, there exist positive constants $\xi\_{\epsilon max}$, $o\_{\epsilon max}$ and $\sigma\_{\epsilon max}$ such that the following inequalities hold: $\|\nabla\xi\_\epsilon(X)\| \le \xi\_{\epsilon max}$, $\|\nabla o\_\epsilon\| \le o\_{\epsilon max}$ and $\|\sigma\_\epsilon\| \le \sigma\_{\epsilon max}$.

Applying the Lyapunov method, the stability in the sense of UUB is demonstrated to be guaranteed by the following theorem.

**Theorem 2.5** *For system (2.6), when the weight updating laws of the critic networks are given by (2.21), the UUB property of the weight estimation errors $\tilde{\zeta}\_\epsilon$ is guaranteed under the obtained control policies (2.18).*

*Proof* Select the Lyapunov function as

$$\mathcal{L} = \frac{1}{2} \varrho\_1^{-1} \tilde{\zeta}\_1^T \tilde{\zeta}\_1 + \frac{1}{2} \varrho\_2^{-1} \tilde{\zeta}\_2^T \tilde{\zeta}\_2. \tag{2.24}$$

Taking the time derivative of (2.24), then we obtain

$$\begin{split} \dot{\mathcal{L}} &= \varrho\_1^{-1} \tilde{\zeta}\_1^T \dot{\tilde{\zeta}}\_1 + \varrho\_2^{-1} \tilde{\zeta}\_2^T \dot{\tilde{\zeta}}\_2 \\ &= \tilde{\zeta}\_1^T (\underline{\delta}\_1 \sigma\_1(t) - \bar{\delta}\_1 \bar{\delta}\_1^T \tilde{\zeta}\_1) + \tilde{\zeta}\_2^T (\underline{\delta}\_2 \sigma\_2(t) - \bar{\delta}\_2 \bar{\delta}\_2^T \tilde{\zeta}\_2) \end{split} \tag{2.25}$$

According to Young's inequality, we have

$$
\tilde{\zeta}\_1^T \underline{\delta}\_1 \sigma\_1(t) \le \frac{1}{2} \tilde{\zeta}\_1^T \bar{\delta}\_1 \bar{\delta}\_1^T \tilde{\zeta}\_1 + \frac{1}{2} \sigma\_{1max}^2. \tag{2.26}
$$

Similarly,

$$
\tilde{\zeta}\_2^T \underline{\delta}\_2 \sigma\_2(t) \le \frac{1}{2} \tilde{\zeta}\_2^T \bar{\delta}\_2 \bar{\delta}\_2^T \tilde{\zeta}\_2 + \frac{1}{2} \sigma\_{2max}^2. \tag{2.27}
$$

Substituting (2.26) and (2.27) into (2.25), we get

$$\dot{\mathcal{L}} \le -\frac{1}{2} \tilde{\zeta}\_1^T \bar{\delta}\_1 \bar{\delta}\_1^T \tilde{\zeta}\_1 - \frac{1}{2} \tilde{\zeta}\_2^T \bar{\delta}\_2 \bar{\delta}\_2^T \tilde{\zeta}\_2 + \frac{1}{2} (\sigma\_{1\text{max}}^2 + \sigma\_{2\text{max}}^2). \tag{2.28}$$

From (2.28) we can conclude that $\dot{\mathcal{L}} < 0$ when one of the following conditions holds

$$\|\tilde{\zeta}\_1\| > \sqrt{\frac{\sigma\_{1max}^2 + \sigma\_{2max}^2}{\lambda\_{\min}(\bar{\delta}\_1 \bar{\delta}\_1^T)}},\tag{2.29}$$

or

$$\|\tilde{\zeta}\_2\| > \sqrt{\frac{\sigma\_{1max}^2 + \sigma\_{2max}^2}{\lambda\_{\min}(\bar{\delta}\_2 \bar{\delta}\_2^T)}}. \tag{2.30}$$

According to Lyapunov theory, it yields that the weight estimation errors for both critic networks are UUB.

**Remark 2.6** The weight matrices are usually updated through certain renewal equations, and from (2.29) and (2.30) we can deduce that the weight approximation errors converge to a small neighborhood of zero, which shrinks to zero asymptotically as the neuron number $\nu\_\epsilon \to \infty$ (since the residual $\sigma\_\epsilon$ then vanishes).

**Theorem 2.7** *Consider system (2.6) with the weight updating laws for the critic networks given by (2.21). Then the obtained policies (2.18) force the system state $X$ to be UUB.*

*Proof* In order to discuss the stability of the closed-loop system, the derivative of $V = \mathcal{V}\_1^\* + \mathcal{V}\_2^\*$ is considered as

$$\begin{split} \dot{V} &= (\nabla \mathcal{V}\_1^\*)^T (f(X) + g(X)\hat{u} + \kappa(X)\hat{v}) \\ &+ (\nabla \mathcal{V}\_2^\*)^T (f(X) + g(X)\hat{u} + \kappa(X)\hat{v}). \end{split} \tag{2.31}$$

Recalling (2.13) and (2.14), we have

$$\begin{split} \nabla V\_{1}^{\*T} f(X) &= -X^{T} \mathcal{Q}\_{1} X + \frac{1}{4} \nabla V\_{1}^{\*T} g(X) \mathcal{R}\_{11}^{-1} g^{T}(X) \nabla V\_{1}^{\*} \\ &\quad - \frac{1}{4} \nabla V\_{2}^{\*T} \kappa(X) \mathcal{R}\_{22}^{-1} \mathcal{R}\_{12} \mathcal{R}\_{22}^{-1} \kappa^{T}(X) \nabla V\_{2}^{\*} \\ &\quad + \frac{1}{2} \nabla V\_{1}^{\*T} \kappa(X) \mathcal{R}\_{22}^{-1} \kappa^{T}(X) \nabla V\_{2}^{\*}, \end{split} \tag{2.32}$$

and

$$
\begin{split}
\nabla \mathcal{V}\_2^{\*T} f(X) &= -X^T \mathcal{Q}\_2 X + \frac{1}{4} \nabla \mathcal{V}\_2^{\*T} \kappa(X) \mathcal{R}\_{22}^{-1} \kappa^T(X) \nabla \mathcal{V}\_2^\* \\ &\quad - \frac{1}{4} \nabla \mathcal{V}\_1^{\*T} g(X) \mathcal{R}\_{11}^{-1} \mathcal{R}\_{21} \mathcal{R}\_{11}^{-1} g^T(X) \nabla \mathcal{V}\_1^\* \\ &\quad + \frac{1}{2} \nabla \mathcal{V}\_2^{\*T} g(X) \mathcal{R}\_{11}^{-1} g^T(X) \nabla \mathcal{V}\_1^\*. \end{split} \tag{2.33}
$$

For $\epsilon = 1$, we can obtain $\dot{\mathcal{V}}\_1^\*$ as

$$\begin{split} \dot{\mathcal{V}}\_{1}^{\*} &= -X^{T}\mathcal{Q}\_{1}X - \frac{1}{4}\nabla \mathcal{V}\_{1}^{\*T}g(X)\mathcal{R}\_{11}^{-1}g^{T}(X)\nabla \mathcal{V}\_{1}^{\*} \\ &\quad - \frac{1}{4}\nabla \mathcal{V}\_{2}^{\*T}\kappa(X)\mathcal{R}\_{22}^{-1}\mathcal{R}\_{12}\mathcal{R}\_{22}^{-1}\kappa^{T}(X)\nabla \mathcal{V}\_{2}^{\*} \\ &\quad - \nabla \mathcal{V}\_{1}^{\*T}\left(g(X)(u^{\*}-\hat{u})+\kappa(X)(v^{\*}-\hat{v})\right). \end{split} \tag{2.34}$$

According to (2.15) and (2.16) we have

$$\begin{split} \dot{\mathcal{V}}\_{1}^{\*} &= -X^{T}\mathcal{Q}\_{1}X - \frac{1}{4}\nabla \mathcal{V}\_{1}^{\*T}g(X)\mathcal{R}\_{11}^{-1}g^{T}(X)\nabla \mathcal{V}\_{1}^{\*} \\ &\quad - \frac{1}{4}\nabla \mathcal{V}\_{2}^{\*T}\kappa(X)\mathcal{R}\_{22}^{-1}\mathcal{R}\_{12}\mathcal{R}\_{22}^{-1}\kappa^{T}(X)\nabla \mathcal{V}\_{2}^{\*} \\ &\quad + \frac{1}{2}((\nabla\xi\_{1}(X))^{T}\zeta\_{1}^{\*} + \nabla o\_{1})^{T}\Big(g(X)\mathcal{R}\_{11}^{-1}g^{T}(X) \\ &\qquad \times ((\nabla\xi\_{1}(X))^{T}\tilde{\zeta}\_{1} + \nabla o\_{1}) + \kappa(X)\mathcal{R}\_{22}^{-1}\kappa^{T}(X) \\ &\qquad \times ((\nabla\xi\_{2}(X))^{T}\tilde{\zeta}\_{2} + \nabla o\_{2}) \Big). \end{split} \tag{2.35}$$

Due to Assumption 2.2 and Theorem 2.5, we obtain that

$$\begin{split} \dot{V}\_{1}^{\*} \leq & -X^{T} \mathcal{Q}\_{1} X - \frac{1}{4} \nabla V\_{1}^{\*T} \mathcal{g}(X) \mathcal{R}\_{11}^{-1} \mathcal{g}^{T}(X) \nabla V\_{1}^{\*} \\ & - \frac{1}{4} \nabla V\_{2}^{\*T} \kappa(X) \mathcal{R}\_{22}^{-1} \mathcal{R}\_{12} \mathcal{R}\_{22}^{-1} \kappa^{T}(X) \nabla V\_{2}^{\*} + \theta\_{1}, \end{split} \tag{2.36}$$

where the positive constant $\theta\_1$ denotes the bound of the term $\frac{1}{2}((\nabla\xi\_1(X))^T\zeta\_1^\* + \nabla o\_1)^T\big(g(X)\mathcal{R}\_{11}^{-1}g^T(X)((\nabla\xi\_1(X))^T\tilde{\zeta}\_1 + \nabla o\_1) + \kappa(X)\mathcal{R}\_{22}^{-1}\kappa^T(X)((\nabla\xi\_2(X))^T\tilde{\zeta}\_2 + \nabla o\_2)\big)$. As $\mathcal{R}\_{11}$, $\mathcal{R}\_{12}$ and $\mathcal{R}\_{22}$ are symmetric positive definite, we have

$$\frac{1}{4}\nabla \mathcal{V}\_{2}^{\*T}\kappa(X)\mathcal{R}\_{22}^{-1}\mathcal{R}\_{12}\mathcal{R}\_{22}^{-1}\kappa^{T}(X)\nabla \mathcal{V}\_{2}^{\*} + \frac{1}{4}\nabla \mathcal{V}\_{1}^{\*T}g(X)\mathcal{R}\_{11}^{-1}g^{T}(X)\nabla \mathcal{V}\_{1}^{\*} > 0.\tag{2.37}$$

Furthermore, we attain

$$\dot{V}\_1^\* \le -X^T \mathcal{Q}\_1 X + \theta\_1 \le -\lambda\_{\min}(\mathcal{Q}\_1) \|X\|^2 + \theta\_1. \tag{2.38}$$

Similarly, for $\epsilon = 2$, it yields

$$
\dot{V}\_2^\* \le -X^T \mathcal{Q}\_2 X + \theta\_2 \le -\lambda\_{\min}(\mathcal{Q}\_2) \|X\|^2 + \theta\_2,\tag{2.39}
$$

where the definition of $\theta\_2$ is similar to that of $\theta\_1$. Then it can be concluded that $\dot{V} < 0$ when the following inequality is satisfied

$$\|X\| > \max\left\{ \sqrt{\frac{\theta\_1}{\lambda\_{\min}(\mathcal{Q}\_1)}}, \sqrt{\frac{\theta\_2}{\lambda\_{\min}(\mathcal{Q}\_2)}} \right\} \stackrel{\Delta}{=} \Theta. \tag{2.40}$$

Thus, with the proposed control policies (2.18), the system state $X$ is UUB with the bound $\Theta$. This completes the proof.

**Remark 2.8** From Theorems 2.5 and 2.7, we can conclude that under the obtained control policies, the system state $X$ and the critic weight errors $\tilde{\zeta}\_\epsilon$ are uniformly ultimately bounded.

**Remark 2.9** According to the clinical requirements, the specific value of the cost function is finalized. A transformation is implemented from the mathematical mechanism model to the solvable affine model. Subsequently, this chapter solves the optimal control problem, meaning that the minimum dose of medicine realizes the best therapeutic effect.

#### **2.4 Simulation and Numerical Experiments**

To verify the method proposed in the previous section, a simulation is given as follows.

#### *2.4.1 States Analysis on Tumor Cell Growth*

According to clinical medical statistics borrowed from the literature [37], the specific parameters of the dynamic models are presented in Table 2.2.

According to (2.5) and Table 2.2, we construct the model (2.41):

$$\begin{aligned} \dot{N}\_T(t) &= 0.00431 N\_T(t) (1 - 1.02 \times 10^{-9} N\_T(t)) \\ &- 6.41 \times 10^{-11} N\_T(t) N\_H(t) \\ &- 0.08 N\_{CD}(t) N\_T(t) - (1 - e^{-u(t)}) \\ \dot{N}\_H(t) &= 0.33 + \frac{0.0125 N\_H(t) N\_T^2(t)}{2.02 \times 10^7 + N\_T^2(t)} + \frac{0.125 N\_H(t) N\_{ID}(t)}{2 \times 10^7 + N\_{ID}(t)} \\ &- 3.42 \times 10^{-6} N\_T(t) N\_H(t) - (1 - e^{-u(t)}) \\ &- 0.204 N\_H(t) - 3.42 \times 10^{-6} N\_{CD}(t) N\_H(t) \\ \dot{N}\_{CD}(t) &= u(t) - 0.1 N\_{CD}(t) \\ \dot{N}\_{ID}(t) &= v(t) - N\_{ID}(t) \end{aligned} \tag{2.41}$$

Here $N\_1(t)$ and $N\_2(t)$ denote the tumor cells and immune cells in a patient who follows a certain chemotherapy and immunotherapy regimen; correspondingly, $N\_3(t)$ and $N\_4(t)$ denote the concentrations of the chemotherapy and immunotherapy drugs. The curves of the system states (tumor cells, immune cells, chemotherapy and immunotherapy drugs) are shown in Fig. 2.1. The initial value is set as $X\_0 = [20, 10, 8, 6]^T$.
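A forward-Euler rollout of (2.41) from this initial state can be sketched as follows; the zero-dose inputs, step size and horizon are illustrative assumptions of ours (the chapter's actual inputs are the learned ADP policies):

```python
import numpy as np

def rhs_241(X, u, v):
    """Right-hand side of the numerical model (2.41)."""
    N_T, N_H, N_CD, N_ID = X
    kill = 1.0 - np.exp(-u)
    dN_T = (0.00431 * N_T * (1.0 - 1.02e-9 * N_T)
            - 6.41e-11 * N_T * N_H - 0.08 * N_CD * N_T - kill)
    dN_H = (0.33 + 0.0125 * N_H * N_T**2 / (2.02e7 + N_T**2)
            + 0.125 * N_H * N_ID / (2e7 + N_ID)
            - 3.42e-6 * N_T * N_H - kill
            - 0.204 * N_H - 3.42e-6 * N_CD * N_H)
    return np.array([dN_T, dN_H, u - 0.1 * N_CD, v - N_ID])

def simulate(X0, u, v, dt=0.01, steps=2000):
    """Forward-Euler rollout under constant doses u and v."""
    X = np.array(X0, dtype=float)
    for _ in range(steps):
        X = X + dt * rhs_241(X, u, v)
    return X

# No-treatment rollout to t = 20 from the initial state above.
X_final = simulate([20.0, 10.0, 8.0, 6.0], u=0.0, v=0.0)
```

With zero dosing, the residual drug concentrations decay at their clearance rates, and the tumor population shrinks while the residual chemotherapy concentration remains significant.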


**Table 2.2** Concentration variation on immune cells, tumor cells, chemotherapeutic drug and immunoagents

**Fig. 2.1** The curves of system states

It is obvious that the control policies can stabilize the nonlinear system and make the system states tend to zero, which means that the closed-loop system is stable and the control method is effective. Recall the original problem: the key is to minimize cancer cells and reduce therapy toxicity as much as possible.

#### *2.4.2 Weight Analysis of Control Policies*

The weights $\zeta\_\epsilon^\*$ of the control policies $u(t)$ and $v(t)$ can be estimated through the value function $\hat{\mathcal{V}}\_\epsilon^\* = (\hat{\zeta}\_\epsilon)^T \xi\_\epsilon(X)$ in (2.16), and the performance index is given as (2.7) with $\mathcal{Q}\_1 = I\_{4\times4}$, $\mathcal{Q}\_2 = 5\mathcal{Q}\_1$, $\mathcal{R}\_{11} = \mathcal{R}\_{22} = 1$, $\mathcal{R}\_{12} = \mathcal{R}\_{21} = 2$. The initial weights are set as $[-0.25, -0.25, -1, -0.25]^T$. The activation function is selected as $\xi(X) = [\xi\_{1\to5}^T, \xi\_{6\to8}^T, \xi\_{9\to10}^T]^T$, where $\xi\_{1\to5} = [N\_1^2(t), N\_1(t)N\_2(t), N\_1(t)N\_3(t), N\_1(t)N\_4(t), N\_2^2(t)]$, $\xi\_{6\to8} = [N\_2(t)N\_3(t), N\_2(t)N\_4(t), N\_3^2(t)]$ and $\xi\_{9\to10} = [N\_3(t)N\_4(t), N\_4^2(t)]$.
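Stacking these monomials, the critic output $\hat{\mathcal{V}} = \hat{\zeta}^T\xi(X)$ of (2.16) can be sketched as follows (function names are ours):

```python
import numpy as np

def activation(X):
    """xi(X): the ten quadratic monomials, in the order listed above."""
    N1, N2, N3, N4 = X
    return np.array([N1 * N1, N1 * N2, N1 * N3, N1 * N4, N2 * N2,
                     N2 * N3, N2 * N4, N3 * N3, N3 * N4, N4 * N4])

def value_estimate(X, zeta_hat):
    """Critic output V_hat = zeta_hat^T xi(X) from (2.16)."""
    return zeta_hat @ activation(X)
```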

According to Fig. 2.2, we can conclude that the proposed optimal control demonstrates a shorter convergence time than the scheme without optimal control: the former needs only 10 s, whereas the latter may take 38 s, which shows the superiority of the proposed method.

From Fig. 2.3 we can see that lower drug doses are another advantage compared with the scheme without optimal control. Taking Figs. 2.2 and 2.3 into comprehensive consideration, we can draw the conclusion that the adopted algorithm can not only

**Fig. 2.2** Optimal control policies *u*(*t*)

**Fig. 2.3** Optimal control policies v(*t*)

**Fig. 2.4** The curves of system states

decrease the convergence time but also reduce the doses of chemotherapy drugs and immune agents, so patients will benefit from the minimal toxicity and shorter response time.

When the initial weight vector is set as $[-0.5, -0.1, -1, -0.4]^T$ and the other parameters are unaltered, we obtain another set of figures, Figs. 2.4, 2.5 and 2.6. In Figs. 2.5 and 2.6, the proposed algorithm shows even more obvious advantages over the scheme without optimal control in response time and control policies, and we can conclude that the effectiveness of the control method does not vary with different initial weights.

#### **2.5 Conclusion**

This chapter has introduced adaptive dynamic programming to solve the optimal control policies of the tumor cell growth model, realizing the objective of minimizing tumor cells with the minimum doses of chemotherapeutic and immunotherapeutic drugs. As is known, the negative effects caused by chemotherapy and immunotherapy must be reduced for a reasonable treatment plan, which is extracted from the optimal control behavior. Convergence properties have been proved through Lyapunov theory. Meanwhile, the system states and critic errors have been demonstrated to be uniformly ultimately bounded. Simulations have been given to verify

**Fig. 2.5** Optimal control policies *u*(*t*)

**Fig. 2.6** Optimal control policies v(*t*)

the rationality of the proposed method. In future work, we will further investigate medical frontier topics and propose adaptive therapeutic methods to solve these issues by employing the ADP approach.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 3 Optimal Regulation Strategy for Nonzero-Sum Games of the Immune System Using Adaptive Dynamic Programming**

#### **3.1 Introduction**

With the rapid increase in the number of tumor patients, immunotherapies integrated with multi-pronged approaches are burgeoning for the treatment of cancers of specific forms, especially poorly immunogenic tumors, as in [1]. The original intention of immunotherapy is to fight cancer cells with the immune cells' own lethality. AIDS, a typical immunodeficiency syndrome caused by failure of the immune response, tends to be attributed to weakened immune levels; on the other hand, the natural killer cell population determines whether the immune system shuts down, and once the activated immune system cannot be suspended it keeps producing cytokines [2], which is regarded as an overreaction of the immune system, as in COVID-19. Thus, the Nash equilibrium between the tumor cells and the immune cell population needs to be solved through optimal regulation based on a specific learning method. An optimal control scheme is first brought into this field with its unique superiority; furthermore, nonzero-sum games-based ADP enjoys both superiority and practicability.

Decisions and estimation on unknown nonlinearities exist extensively in the fields of engineering practice, medical treatment and even the social sciences; for example, literature [3] first proposed the evaluation of a designed S-Box with high nonlinearity on the basis of Chinese I-Ching philosophy. It is of great importance to make a suitable treatment decision in the field of health care, where high nonlinearity remains. To obtain an optimal mixed treatment strategy, the growth model of cell population levels was developed based on the combination of immunotherapy and chemotherapy in [4, 5]. When it comes to the reaction of the immune system to tumor growth, a rather complicated nonlinear model of the immune system is requisite to simulate well the overall aggressive combination treatment plan of immunotherapy and chemotherapy. Thus, solving the nonlinear function can hardly be achieved without exceptionally optimized iterative algorithms, such as backstepping techniques in [6], self-learning optimal regulation in [7], hierarchical lifelong learning in [8], broad learning adaptive neural control in [9] and adaptive dynamic programming, which benefits from its adaptive capability and strong autonomous iterative learning ability [10, 11]. Both backstepping and adaptive dynamic programming can guarantee that the control objective is achieved, with the unknown nonlinear function matched by successive approximation through neural networks or fuzzy control, as in [12–15].

*H*<sup>∞</sup> control enjoys excellent disturbance suppression while minimizing a performance index. It is recognized as a typical two-player zero-sum problem, equivalent to solving algebraic Riccati equations, and is generally applied to linear dynamic systems; systems with a quadratic performance index can likewise be solved, as in [16]. Meanwhile, the familiar Hamilton-Jacobi-Isaacs equation is perceived as an effective medium for systems with inherent nonlinearity, such as unknown mechanical parameters [17], which conventional methods struggle with in the absence of exact system parameters. The mainstream analysis of ADP seeks the optimal control strategy via the solution of Bellman equations without information on the system dynamics, and it has ascended to the core methodologies of optimization and artificial intelligence. In actual models, control constraints have been explicitly considered [18–20]. This chapter therefore focuses on a dynamic model of the immune system in which a single injection of drugs is limited to an intervention level, so the optimal control scheme is transformed into a constrained one; a discount factor is introduced to avoid the infinite time dimension effectively, leading to the development of an optimal constrained control policy.

Model-free adaptive control was developed to obtain the optimal control strategy without knowledge of exact system parameters [21–23], and multiple neural networks were constructed to achieve multi-objective approximation or to optimize the control process. Research on multiple networks has been extended to multitudinous actor-critic constructions. A tremendous number of practical application scenarios need multiple controllers, each of which minimizes its individual performance function, namely a nonzero-sum problem. As elaborated in nonzero-sum game theory, the control objective in [24] was to minimize the individual performance functions and maintain stability so as to yield a Nash equilibrium. In [25], the saddle point of the Nash equilibrium was explored throughout the nonzero-sum games-based optimization iterative process using ADP; even when no feasible saddle point existed, the optimum was realized iteratively through a mixed optimal control scheme, which is of universal significance for conditions that are difficult to satisfy in practical applications. The widespread local-optimum problem was first effectively avoided through fault-tolerant adaptive multigradient recursive reinforcement learning [9]. Seeking the Nash equilibrium requires solving the simultaneous algebraic Riccati or Hamilton-Jacobi-Isaacs equations for nonlinear systems, which leads to the "curse of dimensionality" with a huge amount of computation, especially for multitudinous actor-critic constructions suffering from a many-fold higher computational burden, such as the double-loop policy iteration in [26]. For these reasons, this chapter adopts an acceptable compromise of actor-critic neural networks with appropriate dimensions, effectively realizing the transformation from value iteration to the cost function.

Value and policy iterations generally constitute the whole family of iterative methods, beginning with a semidefinite function or an admissible control law, respectively. ADP has been applied to solve the optimal control strategy for both continuous-time [27, 28] and discrete-time systems [29, 30]; however, traditional ADP cannot satisfy the physical application in the immune system when the mixed treatment strategy of chemotherapy drugs and immunotherapy is considered, which nonzero-sum games-based ADP somewhat improves. There is hardly any literature on nonzero-sum games-based ADP methods for solving optimal regulation schemes of the immune system, let alone work that also considers optimal constrained control, policy iteration, tumor regression, and a mixed control strategy of chemotherapy and immunotherapy, that is, a cost function covering the simultaneous minimization of the tumor cells, the chemotherapy drugs, and the immunotherapy drugs.

#### **3.2 Establishment of Mathematical Model**

This part introduces the mathematical growth model of the tumor cells, which considers the influence of external factors such as chemotherapy drugs and immunotherapy on the tumor cells, as well as the mutual effect between the two types of cells. In the following model, *T u*(*t*) represents the amount of tumor cells, *I m*(*t*) denotes the number of immune cells, and *Che*(*t*) and *I mpy* (*t*) denote the concentrations of chemotherapy drugs and immunotherapy drugs in the bloodstream, respectively.

#### *3.2.1 Growth Model of Tumor Cells*

Considering only the natural growth law of tumor cells, without their interaction with immune cells or any external effect on them, the growth of tumor cells is subject to logistic growth.

$$Tu(t+1) = Tu(t) + C\_1 Tu(t)(1 - C\_2 Tu(t)).\tag{3.1}$$

When the interaction between immune cells and tumor cells, as well as the direct killing of cells by chemotherapeutic drugs, is taken into account, the growth model of tumor cells is revised to:

$$\begin{split} Tu(t+1) &= Tu(t) + C\_1 Tu(t)(1 - C\_2 Tu(t)) \\ &- C\_{Im, Tu} Tu(t)Im(t) - C\_{Che, Tu} Tu(t)Che(t), \end{split} \tag{3.2}$$

where the parameters are specified in Table 3.1.
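As a quick check on the structure of (3.1)–(3.2), the tumor update can be sketched in a few lines. All numeric parameter values below are illustrative assumptions, not the values of Table 3.1.

```python
# Sketch of the tumor-cell update (3.1)-(3.2); parameter values are
# illustrative assumptions only.
def tumor_step(Tu, Im, Che, C1=0.1, C2=1e-9, C_ImTu=1e-7, C_CheTu=0.08):
    """One discrete-time step of the tumor population model (3.2)."""
    growth = C1 * Tu * (1.0 - C2 * Tu)       # logistic self-growth, eq. (3.1)
    kill_immune = C_ImTu * Tu * Im           # killing by immune cells
    kill_chemo = C_CheTu * Tu * Che          # killing by chemotherapy drug
    return Tu + growth - kill_immune - kill_chemo

# Without immune cells or drugs, the model reduces to pure logistic
# growth (3.1), saturating below the carrying capacity 1/C2:
Tu = 1e6
for _ in range(100):
    Tu = tumor_step(Tu, Im=0.0, Che=0.0)
```

With `Im = Che = 0` the trajectory grows monotonically toward the carrying capacity `1/C2`, as the logistic term dictates.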


**Table 3.1** Parameter specifications of the tumor cells

#### *3.2.2 Growth Model of Immune Cells*

Considering only the natural growth law of immune cells, we assume that a fixed number of immune cells is produced per unit of time and that these cells have a finite life cycle.

$$Im(t+1) = Im(t) + C\_3 - C\_{Im,d} Im(t). \tag{3.3}$$

The tumor cells in the body stimulate the growth of immune cells, which exhibit the positive nonlinear change given by (3.4).

$$
\Delta\_{im} = \frac{\alpha\_1 T u(t)^2 I m(t)}{\beta\_1 + T u(t)^2}. \tag{3.4}
$$

In immunotherapy, the addition of immune agents can produce an immune response, which leads to the non-linear growth of immune cells.

$$
\Delta\_{Im\_{py}} = \frac{\alpha\_2 Tu(t) Im\_{py}(t)}{\beta\_2 + Im\_{py}(t)}.\tag{3.5}
$$

Simultaneously, in the struggle between immune cells and tumor cells, the immune cells themselves also suffer losses,

$$
\Delta\_{C\_{Tu,Im}} = -C\_{Tu,Im} Tu(t) Im(t). \tag{3.6}
$$

and in chemotherapy, chemotherapeutic drugs also damage immune cells,

$$
\Delta\_{C\_{Che,Im}} = -C\_{Che,Im} Che(t) Im(t). \tag{3.7}
$$

Combining (3.3)–(3.7), (3.8) is obtained.


**Table 3.2** Parameter specifications of the immune cells

$$\begin{aligned} Im(t+1) &= Im(t) + C\_3 - C\_{Im,d} Im(t) + \Delta\_{im} + \Delta\_{Im\_{py}} + \Delta\_{C\_{Tu,Im}} + \Delta\_{C\_{Che,Im}} \\ &= Im(t) + C\_3 - C\_{Im,d} Im(t) + \frac{\alpha\_1 Tu(t)^2 Im(t)}{\beta\_1 + Tu(t)^2} \\ &\quad + \frac{\alpha\_2 Tu(t) Im\_{py}(t)}{\beta\_2 + Im\_{py}(t)} - C\_{Tu,Im} Tu(t) Im(t) \\ &\quad - C\_{Che,Im} Che(t) Im(t). \end{aligned} \tag{3.8}$$

The parameters of the immune cell model are outlined in Table 3.2.
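The combined update (3.8) can likewise be sketched term by term. All parameter values below are illustrative assumptions; the actual ones are those of Table 3.2.

```python
# Sketch of the immune-cell update (3.8); parameter values are illustrative
# assumptions only.
def immune_step(Im, Tu, Che, Impy,
                C3=1.3e4, C_Imd=0.04, a1=0.12, b1=2e7,
                a2=0.1, b2=1.0, C_TuIm=1e-9, C_CheIm=0.06):
    """One discrete-time step of the immune population model (3.8)."""
    stim_tumor = a1 * Tu**2 * Im / (b1 + Tu**2)   # eq. (3.4): tumor-stimulated growth
    stim_agent = a2 * Tu * Impy / (b2 + Impy)     # eq. (3.5): immunotherapy response
    loss_fight = C_TuIm * Tu * Im                 # eq. (3.6): losses fighting the tumor
    loss_chemo = C_CheIm * Che * Im               # eq. (3.7): chemotherapy damage
    return Im + C3 - C_Imd * Im + stim_tumor + stim_agent - loss_fight - loss_chemo
```

Without tumor cells or drugs, the recursion reduces to (3.3) and settles at the fixed point `C3 / C_Imd`, the balance between constant production and natural death.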

#### *3.2.3 Drug Attenuation Model*

We assume that after an injection of a chemotherapy drug, the concentration of the drug in the body decreases exponentially, while fresh chemotherapy drug is added to the body at each step to guarantee the effectiveness of the treatment.

$$Che(t+1) = Dr\_{Che}(t) - e^{-\gamma\_1} Che(t). \tag{3.9}$$

Similarly, we can obtain the attenuation model of the immunoagents:

$$Im\_{py}(t+1) = Dr\_{Im}(t) - e^{-\gamma\_2} Im\_{py}(t). \tag{3.10}$$

where *DrChe*(*t*) and *DrI m*(*t*) denote the concentrations of the chemotherapy drugs and immunoagents injected at time *t*, respectively, and γ1 and γ2 are the decay rates of the chemotherapy drugs and immunoagents.
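The attenuation recursions (3.9)–(3.10) are a one-liner; the sketch below reproduces them exactly as printed, with the carried-over concentration entering through the e^{-γ} term. The decay rate γ = 0.7 and the dose 0.05 are illustrative assumptions.

```python
import math

# Sketch of the attenuation recursions (3.9)-(3.10) as printed; gamma and
# the dose below are illustrative assumptions.
def drug_step(conc, dose, gamma=0.7):
    """Next-step blood concentration after a fresh injection, cf. (3.9)."""
    return dose - math.exp(-gamma) * conc

# Repeated equal doses settle at the fixed point dose / (1 + e^{-gamma}):
c = 0.0
for _ in range(200):
    c = drug_step(c, dose=0.05)
```

With γ = 0.7 the concentration converges to 0.05/(1 + e^{-0.7}) ≈ 0.0334, oscillating toward it with contraction factor e^{-0.7}.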

#### *3.2.4 The Design of the Optimization Problem*

Combining the contents of Sects. 3.2.1–3.2.3, we finally obtain the mathematical model governing the growth of tumor cells:

$$\begin{cases} Tu(t+1) = Tu(t) + C\_1 Tu(t)(1 - C\_2 Tu(t)) - C\_{Im,Tu} Tu(t) Im(t) - C\_{Che,Tu} Tu(t) Che(t) \\ Im(t+1) = Im(t) + C\_3 - C\_{Im,d} Im(t) + \dfrac{\alpha\_1 Tu(t)^2 Im(t)}{\beta\_1 + Tu(t)^2} + \dfrac{\alpha\_2 Tu(t) Im\_{py}(t)}{\beta\_2 + Im\_{py}(t)} - C\_{Tu,Im} Tu(t) Im(t) - C\_{Che,Im} Che(t) Im(t) \\ Che(t+1) = Dr\_{Che}(t) - e^{-\gamma\_1} Che(t) \\ Im\_{py}(t+1) = Dr\_{Im}(t) - e^{-\gamma\_2} Im\_{py}(t) \end{cases} \tag{3.11}$$

Given that *T u*(*t*), *I m*(*t*) are biomass, and *Che*(*t*), *I mpy* (*t*) are the drug concentrations in the bloodstream,

$$Tu(t), Im(t), Che(t), Im\_{py}(t) \ge 0, \quad \forall t > 0. \tag{3.12}$$

And all parameters in the model are non-negative:

$$C\_1, C\_2, C\_3, C\_{Im,Tu}, C\_{Che,Tu}, C\_{Im,d}, C\_{Tu,Im}, C\_{Che,Im}, \alpha\_1, \alpha\_2, \beta\_1, \beta\_2, \gamma\_1, \gamma\_2 \ge 0, \quad \forall t > 0. \tag{3.13}$$
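The complete model (3.11) together with the positivity constraint (3.12) can be simulated directly; the sketch below runs a fixed-dose regimen and checks (3.12) along the trajectory. Every numeric value is an illustrative assumption, not a clinical parameter.

```python
import math

# Direct simulation sketch of the coupled model (3.11) under a fixed-dose
# regimen; all parameter values are illustrative assumptions.
def model_step(Tu, Im, Che, Impy, Dr_Che, Dr_Im,
               C1=0.1, C2=1e-9, C_ImTu=1e-7, C_CheTu=0.08,
               C3=1.3e4, C_Imd=0.04, a1=0.12, b1=2e7, a2=0.1, b2=1.0,
               C_TuIm=1e-9, C_CheIm=0.06, gamma1=0.3, gamma2=0.3):
    Tu_n = Tu + C1 * Tu * (1 - C2 * Tu) - C_ImTu * Tu * Im - C_CheTu * Tu * Che
    Im_n = (Im + C3 - C_Imd * Im
            + a1 * Tu**2 * Im / (b1 + Tu**2)      # eq. (3.4)
            + a2 * Tu * Impy / (b2 + Impy)        # eq. (3.5)
            - C_TuIm * Tu * Im                    # eq. (3.6)
            - C_CheIm * Che * Im)                 # eq. (3.7)
    Che_n = Dr_Che - math.exp(-gamma1) * Che      # eq. (3.9)
    Impy_n = Dr_Im - math.exp(-gamma2) * Impy     # eq. (3.10)
    return Tu_n, Im_n, Che_n, Impy_n

state = (1e7, 3e5, 0.0, 0.0)                      # (Tu, Im, Che, Impy)
for t in range(20):
    state = model_step(*state, Dr_Che=0.05, Dr_Im=0.05)
    assert all(v >= 0 for v in state), "constraint (3.12) violated"
```

For this (assumed) parameter set the short trajectory stays in the non-negative orthant, consistent with (3.12)–(3.13).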

We now consider how to minimize the residual tumor cell population in the bloodstream while using as few drugs as possible, including chemotherapy drugs and immunoagents. This problem can be described quantitatively as (3.14).

$$\min \Big\{ aTu(t)^2 + b\_1 \int\_0^{Dr\_{Che}(t)} \tanh^{-1}(\bar{U}\_1^{-1}s)\bar{U}\_1 R\_1 ds + b\_2 \int\_0^{Dr\_{Im}(t)} \tanh^{-1}(\bar{U}\_2^{-1}s)\bar{U}\_2 R\_2 ds \Big\}. \tag{3.14}$$

It is emphasized here that a single dose of either drug should be limited to avoid drug poisoning, so we use a definition with input constraints. Over the whole treatment process, we get:

$$\sum\_{t=t\_0}^{t\_f} \lambda^t \Big\{ aTu(t)^2 + b\_1 \int\_0^{Dr\_{Che}(t)} \tanh^{-1}(\bar{U}\_1^{-1}s)\bar{U}\_1 R\_1 ds + b\_2 \int\_0^{Dr\_{Im}(t)} \tanh^{-1}(\bar{U}\_2^{-1}s)\bar{U}\_2 R\_2 ds \Big\}, \tag{3.15}$$

where 0 < λ < 1, and *U*¯<sup>1</sup> and *U*¯<sup>2</sup> represent the maximum permissible single-injection doses of the chemotherapy drug and the immune agents, respectively.
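The input-penalty integral in (3.14)–(3.15) has a closed form, ∫₀ᵘ tanh⁻¹(s/Ū)ŪR ds = ŪR[u·tanh⁻¹(u/Ū) + (Ū/2)ln(1 − u²/Ū²)], which is handy when evaluating the cost. The sketch below checks the closed form against simple midpoint quadrature; the numeric values are illustrative.

```python
import math

# Closed form of the input-penalty term in (3.14)-(3.15), checked against
# midpoint-rule quadrature; u, U_bar, R below are illustrative values.
def input_penalty(u, U_bar, R, b=1.0):
    """b * integral_0^u artanh(s/U_bar) * U_bar * R ds, valid for |u| < U_bar."""
    return b * U_bar * R * (u * math.atanh(u / U_bar)
                            + 0.5 * U_bar * math.log(1.0 - (u / U_bar) ** 2))

def input_penalty_numeric(u, U_bar, R, b=1.0, n=20000):
    """Midpoint-rule approximation of the same integral."""
    h = u / n
    return b * U_bar * R * h * sum(math.atanh((k + 0.5) * h / U_bar)
                                   for k in range(n))

u, U_bar, R = 0.03, 0.05, 1.0
assert abs(input_penalty(u, U_bar, R) - input_penalty_numeric(u, U_bar, R)) < 1e-8
```

The penalty is finite only for doses strictly inside the bound, which is exactly how the integrand enforces the single-injection constraint.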

#### **3.3 The Proposed Nonzero-Sum Games-Based ADP Scheme**

To solve the problems given above, we propose an aggressive treatment plan, i.e., a control scheme based on the nonzero-sum games-based ADP algorithm.

#### *3.3.1 Theoretical Introduction*

For a discrete-time control system *x*(*t* + 1) = *F*(*x*(*t*), *u*(*t*), *t*), *x*(*t*) is the state variable, *u*(*t*) is the control variable, and *F* is the transition mapping between states. The cost of a state transition is *U*(*x*(*t*), *u*(*t*), *t*), and the total cost over the whole period is $\sum\_{t=t\_0}^{t\_f} U(x(t), u(t), t)$.

When solving a finite-time problem, we can equivalently rewrite it as

$$\sum\_{t=t\_0}^{\infty} \lambda^t U(x(t), u(t), t), \quad 0 < \lambda < 1. \tag{3.16}$$

Applying Bellman's principle of optimality to solve (3.16), we first stipulate $J(x(t\_0)) = \sum\_{t=t\_0}^{\infty} \lambda^t U(x(t), u(t), t)$, and then we can obtain

$$J^\*(\mathbf{x}(t)) = \min\_{u(t)} \left\{ U[\mathbf{x}(t), u(t)] + \lambda J^\*[\mathbf{x}(t+1)] \right\}, t \in (t\_0, \infty). \tag{3.17}$$

The corresponding optimal control takes the following form.

$$u^\*[\mathbf{x}(t)] = \underset{u(t)}{\text{arg min}} \left\{ U[\mathbf{x}(t), u(t)] + \lambda J^\*[\mathbf{x}(t+1)] \right\}, t \in (t\_0, \infty). \tag{3.18}$$

This typical solution approach is a considerable challenge for computing and storage space.

**Remark 3.1** Adaptive dynamic programming, as an optimizing learning method, is usually used to approximate the cost function, which in this chapter is designed to minimize not only the tumor cells but also the doses of chemotherapy drugs and immunoagents.

#### *3.3.2 Iterative ADP Algorithm*

To solve (3.11), we use an iterative adaptive dynamic programming algorithm together with a revised model that facilitates solving the difference equations.

(1) Brief interpretation of ADP algorithm

First, we take a value function *K*(*x*) to approximate the cost function *J* (*x*). The purpose of the iteration is to ensure that the approximate function converges to the optimal value function and yields the optimal decision law. Namely,

$$\begin{cases} K(\mathbf{x}) \to J^\*(\mathbf{x}) \\ \kappa \to \boldsymbol{\mu}^\*. \end{cases} \tag{3.19}$$

Second, in the specific solution process, given *K*<sup>0</sup>(·) = 0, we let

$$\kappa^0(\mathbf{x}(t)) = \underset{\mathbf{u}(t)}{\text{arg min}} \left\{ U[\mathbf{x}(t), \mathbf{u}(t)] + \lambda K^0(\mathbf{x}(t+1)) \right\},\tag{3.20}$$

and update the value function as

$$K^1(\mathbf{x}(t)) = U[\mathbf{x}(t), \kappa^0(\mathbf{x}(t))] + \lambda K^0(\mathbf{x}(t+1)),\tag{3.21}$$

for *i* = 1, 2, 3, ... , we can get

$$\kappa^i(\mathbf{x}(t)) = \operatorname\*{arg\,min}\_{\mathbf{u}(t)} \left\{ U[\mathbf{x}(t), \mathbf{u}(t)] + \lambda K^i(\mathbf{x}(t+1)) \right\}. \tag{3.22}$$

and

$$K^{i+1}(\mathbf{x}(t)) = \min\_{\mathbf{u}(t)} \left\{ U[\mathbf{x}(t), \mathbf{u}(t)] + \lambda K^i(\mathbf{x}(t+1)) \right\}. \tag{3.23}$$

Thus,

$$K^{i+1}(\mathbf{x}(t)) = U[\mathbf{x}(t), \kappa^i(\mathbf{x}(t))] + \lambda K^i(\mathbf{x}(t+1)). \tag{3.24}$$

and the optimal solution is obtained once the error requirement is adequately satisfied, i.e., *K*<sup>*i*</sup>(*x*(*t*)) → *K*<sup>∗</sup>(*x*(*t*)) and ‖*K*<sup>*i*+1</sup>(*x*(*t*)) − *K*<sup>*i*</sup>(*x*(*t*))‖ ≤ ε, where *i* represents the number of iterations.
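The iteration (3.20)–(3.24) can be demonstrated on a toy problem: scalar dynamics x(t+1) = 0.9x(t) + u(t) with stage cost U(x, u) = x² + u² on a small grid. Everything here is an illustrative stand-in, not the immune-system model.

```python
# Toy value-iteration sketch of (3.20)-(3.24); the scalar dynamics and cost
# are illustrative assumptions.
lam = 0.95
xs = [i * 0.1 for i in range(-20, 21)]          # state grid on [-2, 2]
us = [i * 0.1 for i in range(-20, 21)]          # control grid on [-2, 2]

def nearest(x):
    """Index of the grid state closest to x (clamped to the grid edges)."""
    return min(max(round(x / 0.1), -20), 20) + 20

K = [0.0] * len(xs)                              # K^0(.) = 0
for i in range(100):                             # value updates (3.22)-(3.24)
    K = [min(x*x + u*u + lam * K[nearest(0.9*x + u)] for u in us) for x in xs]
```

After enough sweeps `K` approximates K*; at the origin the value stays exactly zero, since u = 0 incurs no cost there, while states farther from the origin accumulate at least their first-step state cost.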

**Algorithm**: **Evolutionary ADP algorithm**

**Initialization**:
1. A certain initial state *x*(*t*) is given randomly in the feasible region;
2. Set Λ<sup>0</sup>(·) = 0;
3. Specify the required parameters: error bound ε, discount factor λ;

**Iteration and Update**:
4. *i* = 0; substitute *x*(*t*) into "(3.26) = 0" to yield κ<sup>*i*</sup>(*t*);
5. Plug *x*(*t*) and κ<sup>*i*</sup>(*t*) into (3.25) to get *x*(*t* + 1);
6. According to (3.29), calculate Λ<sup>*i*+1</sup>(*x*(*t*)) = ∂*U*(*x*(*t* + 1), κ<sup>*i*</sup>(*t*))/∂*x*(*t*) + Λ<sup>*i*</sup>(*x*(*t* + 1));
7. From the data set [*x*(*t*), Λ<sup>*i*+1</sup>(*x*(*t*))], train the neural network of the relationship *x* ∼ Λ;
8. Using the neural network obtained in Step 7, calculate the value at the same state. If ‖Λ<sup>*i*+1</sup>(*x*(*t*)) − Λ<sup>*i*</sup>(*x*(*t*))‖ ≤ ε, the algorithm ends; otherwise, return to Step 4.

To converge faster to the optimal solution, we update the value function and the control law in each iteration along the current direction of steepest descent, that is,

$$\frac{\partial K^{i+1}(\mathbf{x}(t))}{\partial \mathbf{x}(t)} = \frac{\partial U(\mathbf{x}(t), \kappa^{i+1}(t))}{\partial \mathbf{x}(t)} + \lambda [\frac{\partial \mathbf{x}(t+1)}{\partial \mathbf{x}(t)}]^T \frac{\partial K^i(\mathbf{x}(t+1))}{\partial \mathbf{x}(t+1)},\qquad(3.25)$$
 
and

$$\frac{\partial K^{i+1}(\mathbf{x}(t))}{\partial \kappa^{i+1}(t)} = \frac{\partial U(\mathbf{x}(t), \kappa^{i+1}(t))}{\partial \kappa^{i+1}(t)} + \lambda [\frac{\partial \mathbf{x}(t+1)}{\partial \kappa^{i+1}(t)}]^T \frac{\partial K^i(\mathbf{x}(t+1))}{\partial \mathbf{x}(t+1)}.\tag{3.26}$$

Setting $\Lambda^i(x(t+1)) = \frac{\partial K^i(x(t+1))}{\partial x(t+1)}$.

#### (2) Modification of Model (3.11)

Compared with a traditional control strategy, we solve the problem proposed in this chapter directly by ADP, although the model is difficult to solve; we therefore propose a fitting idea to modify the model. Analysis of (3.11) shows that the injection of chemotherapy drugs into the body has a direct effect on tumor cells, whereas immunoagents act on immune cells, which in turn affect the tumor cell population. Throughout the whole process, we consider only the inputs of chemotherapy drugs and immunoagents at each moment as the two control inputs of the system, and the state variables of the system are selected from the intermediate transition variables, namely the tumor cell and immune cell populations.

1. The standard expressions of control variables, state variables, cost functions and so on are given as follows,

$$x(t) = Tu(t), \quad u\_1(t) = Dr\_{Che}(t), \quad u\_2(t) = Dr\_{Im}(t). \tag{3.27}$$

$$K(x) = \sum\_{t=t\_0}^{\infty} \lambda^t \Big\{ a x(t)^2 + b\_1 \int\_0^{u\_1(t)} \tanh^{-1}(\bar{U}\_1^{-1}s)\bar{U}\_1 R\_1 ds + b\_2 \int\_0^{u\_2(t)} \tanh^{-1}(\bar{U}\_2^{-1}s)\bar{U}\_2 R\_2 ds \Big\}. \tag{3.28}$$

2. The modified system model adopts the form of nonlinear affine system, namely:

$$x(t+1) = f(x(t)) - [g\_1(x(t)), g\_2(x(t))][u\_1(t), u\_2(t)]^T. \tag{3.29}$$

3. Update the optimal control law and value function:

Let $\frac{\partial K^{i+1}(x(t))}{\partial u\_1^i(t)} = 0$ and $\frac{\partial K^{i+1}(x(t))}{\partial u\_2^i(t)} = 0$; then

$$u\_1^{i,\*}(t) = \bar{U}\_1 \tanh\Big(\frac{\lambda}{b\_1 \bar{U}\_1 R\_1} g\_1(x(t)) \Lambda^i(x(t+1))\Big), \tag{3.30}$$

$$u\_2^{i,\*}(t) = \bar{U}\_2 \tanh\Big(\frac{\lambda}{b\_2 \bar{U}\_2 R\_2} g\_2(x(t)) \Lambda^i(x(t+1))\Big). \tag{3.31}$$

From this, we can also get

$$\frac{\partial K^{i+1}(x(t))}{\partial x(t)} = \Lambda^{i+1}(x(t)) = \lambda \Big[\frac{df(x)}{dx} - u\_1^i \frac{dg\_1(x)}{dx} - u\_2^i \frac{dg\_2(x)}{dx}\Big] \Lambda^i(x(t+1)) + 2ax. \tag{3.32}$$
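A useful property of the control law (3.30)–(3.31) is that the tanh wrapping guarantees the dose never exceeds the single-injection bound Ū, whatever the costate value. The sketch below evaluates (3.30) with illustrative numbers.

```python
import math

# Sketch of the constrained control law (3.30); U_bar, b, R, and the costate
# values below are illustrative assumptions.
def constrained_dose(costate_term, U_bar, b, R, lam=0.95):
    """u* = U_bar * tanh(lam / (b * U_bar * R) * g(x) * Lambda), cf. (3.30)."""
    return U_bar * math.tanh(lam / (b * U_bar * R) * costate_term)

# Even for a huge costate term the dose saturates at the bound U_bar:
for g_lambda in (0.0, 1.0, 1e6, -1e6):
    u = constrained_dose(g_lambda, U_bar=0.05, b=1.0, R=1.0)
    assert abs(u) <= 0.05
```

This is precisely why the tanh⁻¹ integrand appears in the cost (3.28): its stationarity condition produces a saturating, constraint-respecting optimal dose.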

**Remark 3.2** To approximate the optimal value based on the optimal decision law, the value iteration method drives the value function *K*(*x*) toward the cost function *J* (*x*).

**Remark 3.3** The fitted curve is constructed from data obtained from the original model, which is difficult to solve, and the modified model replaces it as the research object, with the chemotherapy drugs and immunoagents simultaneously considered as control inputs.

#### *3.3.3 Convergence Analysis*

This section proves the convergence of the algorithm to establish its effectiveness in theory. The proof is mainly derived from the foregoing iterative formulas and comprises two lemmas and three theorems.

**Lemma 3.4** *Take an arbitrary control sequence $\{Ar^i(x(t))\}$; bringing it into the value update yields the corresponding value function $J\_{Ar}^i(x)$. Compare it with the control sequence $\{\kappa^i(x(t))\}$ corresponding to the minimum cost $K^i(x(t))$. If $J\_{Ar}^0(\cdot) = K^0(\cdot) = 0$ and $J\_{Ar}^{i+1}(x(t)) = U[x(t), Ar^i(x(t))] + \lambda J\_{Ar}^i(x(t+1))$, while*

$$\begin{split} K^{i+1}(x(t)) &= U[x(t), \kappa^i(x(t))] + \lambda K^i(x(t+1)) \\ &= \min\_{Ar(t)} \big\{ U[x(t), Ar(t)] + \lambda K^i(x(t+1)) \big\}, \end{split} \tag{3.33}$$

*then $J\_{Ar}^i(x(t)) \ge K^i(x(t))$ for all $i$.*

*Proof* $K^i(x)$ is obtained by taking the minimum of the value function, and $\{\kappa^i(x(t))\}$ is the corresponding optimal control sequence. For an arbitrary control sequence $\{Ar^i(x(t))\}$, the corresponding value function $J\_{Ar}^i(x)$ cannot be less than $K^i(x)$.

**Lemma 3.5** *Select a stable admissible control sequence $\{Sa^i(x(t))\}$ with certain restrictions and the corresponding value function $J\_{Sa}^i(x)$. For a controllable system, if $J\_{Sa}^0(\cdot) = K^0(\cdot) = 0$ and $J\_{Sa}^{i+1}(x(t)) = U[x(t), Sa^i(x(t))] + \lambda J\_{Sa}^i(x(t+1))$, then $J\_{Sa}^i(x)$ is bounded.*

*Proof*

$$\begin{split} J\_{Sa}^{i+1}(x(t)) &= U[x(t), Sa^i(x(t))] + \lambda J\_{Sa}^i(x(t+1)) \\ &= U[x(t), Sa^i(x(t))] + \lambda U[x(t+1), Sa^{i-1}(x(t+1))] + \lambda^2 J\_{Sa}^{i-1}(x(t+2)) \\ &= U[x(t), Sa^i(x(t))] + \lambda U[x(t+1), Sa^{i-1}(x(t+1))] \\ &\quad + \lambda^2 U[x(t+2), Sa^{i-2}(x(t+2))] + \cdots + \lambda^{i+1} J\_{Sa}^0(x(t+i+1)). \end{split} \tag{3.34}$$

Thus, $J\_{Sa}^{i+1}(x(t)) = \sum\_{j=0}^{i} \lambda^j U[x(t+j), Sa^{i-j}(x(t+j))]$, where $\{Sa^i(x)\}$ is the stable admissible control sequence, and we conclude that $0 \le J\_{Sa}^{i+1}(x(t)) \le \lim\_{i\to\infty} \sum\_{j=0}^{i} \lambda^j U[x(t+j), Sa^{i-j}(x(t+j))] \le C$ for a given constant $C$. That is, $J\_{Sa}^i(x)$ is bounded.

**Theorem 3.6** *Let $\{\kappa^i(x(t))\}$ be the control sequence corresponding to the minimum value function $K^i(x)$. Assuming the initial condition $K^0(\cdot) = 0$, the value function sequence $\{K^i(x(t))\}$ is monotonically non-decreasing, i.e., $K^i(x(t)) \le K^{i+1}(x(t))$.*

*Proof* Define a value sequence $T^i(x(t))$ with $T^0(\cdot) = 0$ and $T^{i+1}(x(t)) = U[x(t), \tau^i(x(t))] + \lambda T^i(x(t+1))$. When $i = 0$, $T^1(x(t)) = U[x(t), \tau^0(x(t))] + \lambda T^0(x(t+1))$ and $T^1(x(t)) - T^0(x(t)) = U[x(t), \tau^0(x(t))] \ge 0$, so $T^1(x(t)) \ge T^0(x(t))$.

Assume $T^i(x(t)) \ge T^{i-1}(x(t))$. Then $T^{i+1}(x(t)) = U[x(t), \tau^i(x(t))] + \lambda T^i(x(t+1))$ and $T^{i+1}(x(t)) - T^i(x(t)) = \lambda \big(T^i(x(t+1)) - T^{i-1}(x(t+1))\big) \ge 0$, so $T^{i+1}(x(t)) \ge T^i(x(t))$. Hence $K^i(x(t)) \le K^{i+1}(x(t))$.
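The monotonicity asserted by Theorem 3.6 can be spot-checked numerically: starting from K⁰ = 0, each value-iteration sweep can only increase the value function pointwise. The toy dynamics and cost below are illustrative assumptions, not the immune model.

```python
# Numerical spot-check of Theorem 3.6 on a toy problem: with K^0 = 0 the
# value-iteration sequence is monotone non-decreasing, K^i <= K^{i+1}.
lam = 0.95
xs = [i * 0.1 for i in range(-10, 11)]
us = [i * 0.1 for i in range(-10, 11)]

def idx(x):
    return min(max(round(x / 0.1), -10), 10) + 10    # clamp to the state grid

K = [0.0] * len(xs)                                   # K^0(.) = 0
for i in range(30):
    K_new = [min(x*x + u*u + lam * K[idx(0.9*x + u)] for u in us) for x in xs]
    # Monotone non-decreasing at every grid state, as Theorem 3.6 asserts:
    assert all(kn >= k for kn, k in zip(K_new, K))
    K = K_new
```

The check succeeds exactly because the Bellman update is monotone and starts from the all-zero function, which is the mechanism the proof of Theorem 3.6 exploits.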

**Theorem 3.7** *Let $\{\kappa^i(x(t))\}$ be the control sequence corresponding to the minimum cost function $K^i(x)$; then $\lim\_{i\to\infty} K^i(x(t)) = K^\*(x(t))$.*

*Proof* $\{\kappa^i(x)\}$ and $K^i(x)$ are given as in Lemma 3.4, and the value function corresponding to $\{\kappa^{i,l}(x)\}$ is $K^{i+1,l}(x(t)) = U[x(t), \kappa^{i,l}(x(t))] + \lambda K^{i,l}(x(t+1))$, where $l$ is the length of the sequence. Obviously, $K^{i+1,l}(x(t)) = \sum\_{j=0}^{i} \lambda^j U[x(t+j), \kappa^{i-j,l}(x(t+j))]$.

Taking the limit, we obtain $K^{\infty,l}(x(t)) = \lim\_{i\to\infty} \sum\_{j=0}^{i} \lambda^j U[x(t+j), \kappa^{i-j,l}(x(t+j))]$ and define $K^\*(x(t)) = \inf\_l \{K^{\infty,l}(x(t))\}$. From Lemma 3.5, $K^{\infty,l}(x(t)) \le D\_l$ for some constant $D\_l$; on the other hand, from Lemma 3.4 we get $K^{i+1}(x(t)) \le K^{i+1,l}(x(t))$. Therefore $K^{i+1}(x(t)) \le K^{i+1,l}(x(t)) \le K^{\infty,l}(x(t)) \le D\_l$ holds for any $i, l$, and taking the limit gives $\lim\_{i\to\infty} K^i(x(t)) \le \inf\_l D\_l$. Since $K^\*(x(t)) = \inf\_l K^{\infty,l}(x(t))$ by definition of the optimal value function, a control sequence $\{\kappa^{i,m}\}$ can be extracted so that $K^{\infty,m} \le K^\*(x(t)) + \varepsilon$, whence $\lim\_{i\to\infty} K^i(x(t)) \le K^\*(x(t)) + \varepsilon$ for arbitrary $\varepsilon > 0$.

Conversely, for the control sequence $\{\kappa^{i,g}\}$ guaranteeing $\lim\_{i\to\infty} K^i(x(t)) = K^{\infty,g}(x(t))$, we get $\lim\_{i\to\infty} K^i(x(t)) \ge K^\*(x(t))$. Combining both aspects, $\lim\_{i\to\infty} K^i(x(t)) = K^\*(x(t))$ is obtained.

**Theorem 3.8** *For any state variable $x(t)$, the optimal value function $K^\*(x(t))$ satisfies the HJB equation.*

$$K^\*(x(t)) = U[x(t), \kappa^\*(t)] + \lambda K^\*(x(t+1)). \tag{3.35}$$

*Proof* From the lemmas and theorems proved above, a series of properties of $K^i(x(t))$ has been obtained; it remains to verify that the HJB equation is satisfied. According to (3.23), together with Theorems 3.6 and 3.7, we have $K^{i+1}(x(t)) = \min\_{\kappa(t)} \{U[x(t), \kappa(t)] + \lambda K^i(x(t+1))\}$. Taking the limit, $K^\*(x(t)) \le \inf\_{\kappa(t)} \{U[x(t), \kappa(t)] + \lambda K^\*(x(t+1))\}$ by the arbitrariness of the control sequence.

On the other hand, $K^{i+1}(x(t)) \ge \inf\_{\kappa(t)} \{U[x(t), \kappa(t)] + \lambda K^{i-1}(x(t+1))\}$; taking the limit again yields $K^\*(x(t)) \ge \inf\_{\kappa(t)} \{U[x(t), \kappa(t)] + \lambda K^\*(x(t+1))\}$. From the analysis above, we reach the final conclusion.

$$K^\*(x(t)) = U[x(t), \kappa^\*(t)] + \lambda K^\*(x(t+1)).\tag{3.36}$$

This completes the proof.

**Remark 3.9** The value function sequence $\{K^i(x)\}$, with the corresponding control sequence $\{\kappa^i(x(t))\}$, is monotonically non-decreasing and eventually tends to $K^\*(x(t))$, satisfying the HJB equation as in [31].

#### **3.4 Simulation and Numerical Experiments**

In this section, we validate the proposed scheme on the mechanism model of tumor cell growth combined with immunotherapy, chemotherapy, and combination treatments. First, the affine system model is constructed with the chemotherapy drugs and immunoagents as control inputs and the tumor cell count as the state variable. Second, according to the affine model obtained by fitting, we develop the cost function of treatment loss from the clinical treatment requirements. Finally, the optimal treatment plan for a patient with a given basic condition is computed by the algorithm.

#### *3.4.1 An Affine Model of Tumor Cell Growth*

According to clinical medical statistics and literature [4], the specific parameters of the mechanism model are given in Table 3.3.

Given the initial counts of the tumor cell population and immune cells in a patient, and following a certain chemotherapy and immunotherapy regimen, we obtain the four curves of the tumor cells and immune cell population shown in Figs. 3.1 and 3.2. Evidently, the state variable *T u*(*t*) denoting the tumor cell population tends to be stable in Fig. 3.1, and similarly *I m*(*t*) in Fig. 3.2.

When the affine system is fitted to the data obtained from the mechanism model, *DrChe*(*t*) and *DrI m*(*t*) are selected as the two control inputs and *T u*(*t*) as the state variable. Within the allowable error range, the obtained fitting relation takes the following form,


**Table 3.3** Concentration variation on immune cells, tumor cells, chemotherapeutic drug and immunoagents

**Fig. 3.1** The curves of tumor cells

$$\mathbf{x}(t+1) = f(\mathbf{x}(t)) - [g\_1(\mathbf{x}(t)), g\_2(\mathbf{x}(t))][\boldsymbol{u}\_1(t), \boldsymbol{u}\_2(t)]^T. \tag{3.37}$$

$$f(\mathbf{x}) = \mathbf{x} + 0.00431\mathbf{x} (1 - 1.02 \times 10^{-9} \mathbf{x}).\tag{3.38}$$

$$g\_1(\mathbf{x}) = \exp(8.15 \times 10^{-6} [\log(\mathbf{x})]^{6.131} + 3.482). \tag{3.39}$$

$$g\_2(\mathbf{x}) = \exp(0.05639[\log(\mathbf{x})]^{2.093} + 2.492). \tag{3.40}$$
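The fitted maps (3.38)–(3.40) can be evaluated directly for one forward step of the affine model (3.37). One caveat: `log` is taken below as the natural logarithm (`math.log`), which is an assumption about the fitting convention.

```python
import math

# Direct evaluation of the fitted maps (3.38)-(3.40) and one step of (3.37).
# Assumption: log in (3.39)-(3.40) denotes the natural logarithm.
def f(x):
    return x + 0.00431 * x * (1.0 - 1.02e-9 * x)           # eq. (3.38)

def g1(x):
    return math.exp(8.15e-6 * math.log(x) ** 6.131 + 3.482)  # eq. (3.39)

def g2(x):
    return math.exp(0.05639 * math.log(x) ** 2.093 + 2.492)  # eq. (3.40)

def affine_step(x, u1, u2):
    """x(t+1) = f(x) - g1(x)*u1 - g2(x)*u2, cf. (3.37)."""
    return f(x) - g1(x) * u1 - g2(x) * u2

# With no drug input the state follows the fitted logistic term alone:
x_next = affine_step(1e6, 0.0, 0.0)
```

Since g1 and g2 are exponentials, and hence strictly positive, any nonzero dose strictly reduces the next state relative to the drug-free trajectory, matching the minus signs in (3.37).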

The curves before and after fitting are compared in Fig. 3.3; the fit meets the precision requirements, which guarantees the accuracy of the data traced back to the original source.

**Fig. 3.2** The curves of immune cells

**Fig. 3.3** The curves of immunoagents drug concentrations in the bloodstream

#### *3.4.2 The Treatment Loss Cost Function*

The cost function takes the form proposed above in (3.28). Unlike the theoretical mechanism-model analysis, and in line with clinical requirements, the single injection of either drug must be limited to no more than 0.05. Therefore,

$$\bar{U}\_1 = 0.05, \quad \bar{U}\_2 = 0.05.\tag{3.41}$$

To avoid solving over an infinite time dimension, we choose the discount factor λ = 0.95. Finally, the specific cost function is obtained as follows:

$$K(x) = \sum\_{t=t\_0}^{\infty} 0.95^t \Big\{ 2.784 \times 10^{-5} x(t)^2 + \int\_0^{u\_1(t)} 50 \tanh^{-1}(0.05^{-1} s) ds + \int\_0^{u\_2(t)} 850 \tanh^{-1}(0.05^{-1} s) ds \Big\}. \tag{3.42}$$
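The stage cost inside (3.42) can be evaluated in closed form, since ∫₀ᵘ tanh⁻¹(s/Ū) ds = u·tanh⁻¹(u/Ū) + (Ū/2)ln(1 − (u/Ū)²). The sketch below sums the discounted stage costs over a finite trajectory; truncating the infinite sum at the trajectory length is the only simplification.

```python
import math

# Evaluation sketch of the treatment-loss cost (3.42) over a finite trajectory.
def artanh_integral(u, U_bar=0.05):
    """integral_0^u artanh(s/U_bar) ds, in closed form (|u| < U_bar)."""
    return u * math.atanh(u / U_bar) + 0.5 * U_bar * math.log(1.0 - (u / U_bar)**2)

def stage_cost(x, u1, u2):
    """The bracketed term of (3.42) at one time step."""
    return (2.784e-5 * x**2
            + 50.0 * artanh_integral(u1)
            + 850.0 * artanh_integral(u2))

def treatment_loss(x_seq, u1_seq, u2_seq, lam=0.95):
    """Discounted sum in (3.42), truncated to the trajectory length."""
    return sum(lam**t * stage_cost(x, u1, u2)
               for t, (x, u1, u2) in enumerate(zip(x_seq, u1_seq, u2_seq)))
```

A zero trajectory incurs zero loss, and any nonzero dose or residual tumor count strictly increases it, matching the design intent stated below.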

#### *3.4.3 The Optimal Solution of the Treatment*

According to the previous two subsections, we have completed the transformation from the mathematical mechanism model to a solvable affine model and determined the specific form of the cost function according to clinical requirements. The optimal treatment strategy is acquired through the proposed algorithm, and a comparison is made to prove its effectiveness and feasibility. The cost function is designed to minimize the tumor-cell population while keeping the doses of chemotherapy drugs and immunoagents at a minimum.

In the following three figures (Figs. 3.4, 3.5 and 3.6), the blue curves represent the changes in tumor cells and in the single-injection dose for patients under the normal treatment regimen. In contrast, the red curves represent the effect of the optimal treatment regimen calculated by the nonzero-sum-game-based ADP algorithm.

As shown in Fig. 3.4, there are initially many cancer cells in the body, and both dose curves are close to the upper limit; under the dual action of the drugs and the immune system, the number of cancer cells drops substantially. Under the normal regimen, the injected dose hardly changes from beginning to end: even in the closing stage, when the cancer cells have decreased significantly, a sizable dose is still given. The treatment dose obtained by our solution, by contrast, is substantially smaller.

Correspondingly, Fig. 3.5 shows that the injected dose of immunoagents on the two curves follows a trend close to that of the chemotherapy drugs. The optimized treatment slightly exceeds the traditional plan in the initial stage, when there are more cancer cells, but this does not last long. When the number of cancer cells is relatively large, the primary or indirect target of both drugs is the cancer cells; then, in the late stage of treatment, the number of

**Fig. 3.4** The injection dose curve of chemotherapy drugs under two kinds of treatment

**Fig. 3.5** The injection dose curve of immunologic agents under two kinds of treatment

**Fig. 3.6** The curves of tumor cells under two kinds of treatment

cancer cells is significantly reduced. If chemotherapy drugs are administered according to the normal regimen, the normal cells suffer considerable erosion, with a significant impact on the body. Under the optimized regimen, however, the drug dose is dramatically reduced and the normal cells are far less affected.

As shown in Fig. 3.6, the two treatment schemes control the number of tumor cells similarly in the initial stage. In the final stage, however, the ADP-optimized algorithm not only reduces the tumor-cell count significantly but, combined with Figs. 3.4 and 3.5, also minimizes the injected amount of both drugs, which shows the effectiveness of our treatment scheme.

**Remark 3.10** The optimal regulation strategy for the immune system has the advantage of decreasing the tumor-cell population; moreover, clinical treatment benefits from the accompanying minimization of chemotherapy drugs and immunoagents.

#### **3.5 Conclusion**

Nonzero-sum-game-based adaptive dynamic programming has been proposed to acquire the optimum by affecting the growth of tumor and immune cells, providing guidance for clinical practice through adjustment of the administered doses of chemotherapy and immunotherapy drugs. The obtained results show that the immune system can decrease the tumor-cell population while the optimal control behavior minimizes the chemotherapy drugs and immunoagents. Simulation examples have demonstrated the availability and effectiveness of the research methodology. Future research will focus on solving the optimal mixed treatment strategy for complex immunotherapy systems including immune cell subsets and cytokines, and on switched control policies matched to hybrid therapy.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 4 Evolutionary Dynamics Optimal Research-Oriented Tumor Immunity Architecture**

#### **4.1 Introduction**

Interactions between cancer cells, surrounding stromal cells, and immune cells through autonomous and non-autonomous signaling can influence the competition for survival. A mechanistic understanding of the evolutionary and ecological dynamics of tumor progression is therefore critical [1]. It is assumed that evolution causes traits to change continuously over time even while the ecological dynamics keep changing. More broadly, imagine an evolutionarily stable state that is a trajectory of phenotypic states, an evolutionarily stable trait attractor. This notion applies in scenarios where there is sufficient variation to facilitate rapid evolution, or where the state involves a plastic response to environmental conditions, eventually constituting evolutionary stability. At the same time, natural killer (NK) cells, as one of the players in the game, attack many tumour cell lines and are critical in anti-tumour immunity [2]; however, the interaction between NK cells and their tumour targets is poorly understood. To overcome drug resistance, anti-tumor immunotherapy is gradually replacing traditional treatment strategies [3]. The interaction between specialized cancer cell populations and immune cells has become a distinctive evolutionary-dynamics phenomenon in the tumor immunity growth architecture. The goal of optimization is to minimize the administered dosage and reduce negative effects.

The dynamic perception or learning process is realized through interactions between the cells and the organism architecture: their responses are observed, and the optimal control strategy of the underlying Markov decision process is ultimately learned. An optimal control scheme is required such that the desired administration dosage can be tracked and the optimal performance, minimizing chemotherapeutic drugs and immunological agents, can be achieved. Thus, reinforcement learning is urgently needed for an optimal research-oriented tumor immunity architecture. The classical policy iteration and value iteration frameworks are never out of date, while the new min-Hamiltonian function [4] and the low-gain-parameter ADP-Bellman equation for global stabilization are thriving [5].

The interactions between cells are highly nonlinear and coupled. When computational conditions allow, both the adaptive algorithm design based on policy iteration and the adaptive hierarchical neural network algorithm [6], which readily solves the coupled fractional-order chaotic synchronization problem, inspire ideas for solving the optimal solution of the HJB equation. When such computing conditions are not available, a model-free approach is the best idea. The iterative ADP algorithm has been developed into an iterative NDP algorithm, which does not require an accurate system model [7] but only observable system data; this reduces cost and optimizes the control action in the process of error backpropagation [8]. The development of Q-learning, from three classes of four networks, to interleaved double iteration, and then to critic-only Q-learning [9] with a single class of one network, effectively improves the utilization of resources, and the problem of insufficient exploration no longer exists. The interaction between cells coincides with a multi-agent setting, and the attack of tumor cells on normal cells may cause abnormal reactions; the neural-network-based attack detection and estimation scheme designed in [10] can easily capture such anomalies. Cells cannot proliferate without limit: when solving the optimal solution of the constrained auxiliary subsystem within the ADP framework, the idea of policy iteration is continued, and a strongly convergent synchronous iterative optimization strategy [11] is given.

The difficult-to-decouple leader-follower behavior of vehicle-vehicle communication [12], human-vehicle interaction, and the mutual influence of all participants can be handled easily with off-policy PI [13]. Switching systems [14], T-S fuzzy models, Nash equilibria, and zero-sum games [15] let each agent deal with a low-dimensional state and a local pattern, reduce conservatism, and easily obtain the minimum local cost [16]. Aided by its improved exploration, the parallel actor-critic asynchronous gradient-sharing mechanism can realize parallel optimization of diverse agents in a short time [17]. Driven by the temporal-difference error, integral reinforcement learning can obtain the estimated control strategy by updating the critic weights [18, 19]. To obtain a better stabilizing adaptive control scheme, an appropriate robust control scheme must be given for the control system [20]. Reference [21] summarizes recent outstanding progress on continuous nonlinear control systems whose controllers combine adaptability and robustness; the reliability and effectiveness of these two designs have also been demonstrated on actual power systems and on large and heavy machinery. The theory integrates ecological and evolutionary dynamics, blending ecological mathematical models with evolutionary game theory [22]. Evolutionarily stable strategies are then investigated to seamlessly integrate both sides [23]. Solvable dynamic equations can be used to explore optimal control objectives; what follows, however, is the curse of dimensionality.

To overcome it, dual-heuristic dynamic programming, derived from ADP, is proposed for the nonlinear affine evolutionary dynamics while considering the actual constraints. By introducing a discounted performance index, the optimal regulation problem is reformulated from an infinite-dimensional one into a finite-dimensional one, unlike previous value iterations, which require a strategy that initially stabilizes the system. ADP conforms to optimal formation control through the establishment of a performance index function [24]. An affine mathematical model is first introduced to twin the real scenario [25]. The optimal control problem is transformed into pursuing the solution of the HJB equation, and convergence is proved. ADP involves learners giving rise to a learning strategy, and the authors study a competitive learning system with cancer cell populations and immune cells, aiming at minimizing the administered dose.

#### **4.2 Pre-knowledge**

Consider a classical discrete-time nonlinear affine system,

$$x(t+1) = f(\mathbf{x}(t)) + g(\mathbf{x}(t))u(t) \tag{4.1}$$

where the state variable *x*(t) ∈ R<sup>n</sup>, the control variable *u*(t) ∈ R<sup>m</sup>, and f(·) ∈ R<sup>n</sup>, g(·) ∈ R<sup>n×m</sup>; the system can be stabilized on a compact set Ω ⊂ R<sup>n</sup>, with f(0) = 0 and g(0) = 0. Colloquially, the optimal control problem of (4.1) is equivalent to obtaining the optimal control law *u*∗(t) = *u*(*x*(t)) that minimizes the proposed infinite-horizon performance index:

$$\mathbf{J}(\mathbf{x}(t)) = \sum_{t=0}^{\infty} \mathbf{K}(\mathbf{x}(t), u(t)). \tag{4.2}$$

K(*x*(t), *u*(t)) is the cost function, with K(*x*, *u*) ≥ 0 for all *x*, *u*. Basically, the cost function K(·) is given in the quadratic form

$$\mathbf{K}(\mathbf{x}(t), u(t)) = \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + u^{T}(t)\mathbf{Q}u(t) \tag{4.3}$$

where P, Q > 0 are positive definite matrices.

The optimal control problem of (4.2) can be converted into solving the HJB equation. According to the Bellman optimality principle, the optimal value function should obey the following [9]:

$$\mathbf{J}^{*}(\mathbf{x}(t)) = \min_{u(t)} \left\{ \mathbf{x}^{T}(t) \mathbf{P} \mathbf{x}(t) + u^{T}(t) \mathbf{Q} u(t) + \mathbf{J}^{*}(\mathbf{x}(t+1)) \right\} \tag{4.4}$$

Minimizing the right-hand side of (4.4) yields the optimal control law and hence the optimal value function J∗(*x*(t)). As a necessary condition, one can take the partial derivative of the right-hand side of (4.4) with respect to *u*(t) to obtain *u*∗. Hence,

$$u^{*}(t) = -\frac{1}{2}\mathbf{Q}^{-1}\, g^{T}(\mathbf{x}(t))\, \frac{\partial \mathbf{J}^{*}(\mathbf{x}(t+1))}{\partial \mathbf{x}(t+1)} \tag{4.5}$$

Substituting (4.5) into (4.4), it can be obtained that

$$\mathbf{J}^{*}(\mathbf{x}(t)) = \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + \frac{1}{4} \left[ \frac{\partial \mathbf{J}^{*}(\mathbf{x}(t+1))}{\partial \mathbf{x}(t+1)} \right]^{T} g(\mathbf{x}(t)) \mathbf{Q}^{-1} g^{T}(\mathbf{x}(t)) \left[ \frac{\partial \mathbf{J}^{*}(\mathbf{x}(t+1))}{\partial \mathbf{x}(t+1)} \right] + \mathbf{J}^{*}(\mathbf{x}(t+1)). \tag{4.6}$$

As can be seen from (4.6), it is almost impossible to obtain an analytical solution for *u*∗(t): at the current moment t one cannot know J∗(*x*(t + 1)) at the next moment. To overcome this dilemma, an approximate optimal solution of the HJB equation is studied instead. In the fourth part of this chapter, the derivation of the I-DHP algorithm is introduced to solve this kind of optimal control problem [26, 27].
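To see why an iterative approximation of (4.4) is nonetheless workable, consider the scalar linear-quadratic special case, where the value function is exactly quadratic, J(x) = P·x², and the Bellman backup reduces to a Riccati-type recursion with a known fixed point. This toy sketch (not the chapter's method, which targets the nonlinear tumor model; all numbers are illustrative) iterates that backup from a zero initial value:

```python
def riccati_step(P_val, a, b, p, q):
    # One Bellman backup (4.4) for x(t+1) = a*x + b*u with cost p*x^2 + q*u^2
    # and quadratic value function J(x) = P_val * x^2:
    # min over u gives P_next = p + a^2*P - (a*b*P)^2 / (q + b^2*P)
    return p + a * a * P_val - (a * b * P_val) ** 2 / (q + b * b * P_val)

def value_iteration(a=0.9, b=0.5, p=1.0, q=1.0, tol=1e-12, max_iter=10000):
    P_val = 0.0                  # zero initial value function, far from optimal
    for _ in range(max_iter):
        P_next = riccati_step(P_val, a, b, p, q)
        if abs(P_next - P_val) < tol:
            return P_next
        P_val = P_next
    return P_val
```

The iterates increase monotonically from 0 and converge to the fixed point of the backup, which is exactly the pattern the I-DHP convergence proof in Sect. 4.4 establishes for the general constrained case.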

#### **4.3 Modeling of Mixed Immunotherapy and Chemotherapy for Tumor Cell**

In this part, a mathematical model is constructed, covering the natural growth of a single type of tumor cell, the gradually accumulating interactions between various immune cells and tumor cells in vivo, and the influence of externally applied chemotherapy drugs and immune agents on the tumor-cell population [22, 28, 29].

First, define the acronyms of the various cells and drugs: $T_u$, tumor cells; $N_K$, natural killer cells; $C_T$, cytotoxic T lymphocytes; $C_L$, circulating lymphocytes; $Ch_{dr}$, chemotherapy drug concentration; $Im_{dr}$, immunoagent concentration.


For convenience of writing, the following subsections do not write the time argument explicitly; the default is t. Lowercase letters a, b, c<sub>1</sub>, c<sub>2</sub>, e, f, g, h<sub>1</sub>, h<sub>2</sub>, i, j, l, m, n<sub>1</sub>, n<sub>2</sub>, p<sub>1</sub>, p<sub>2</sub>, q<sub>1</sub>, q<sub>2</sub>, r, s, u all represent fixed real numbers; the uppercase script letters $\mathcal{G}$, $\mathcal{H}$, $\mathcal{R}$, $\mathcal{D}$, and $\Theta$ denote different categories of gain terms, which depend on the time t; $\mathcal{L}_{(\cdot)}$ is a constant that depends on the cell type; and **e**(·) stands for the exponential function.

#### *4.3.1 The Natural Growth of Cells*

According to [2, 22], the increase of tumor cells follows a natural growth curve, $\mathcal{G}_{T_u} = \mathrm{a}T_u(1 - \mathrm{b}T_u)$ ($\mathcal{G}_{(\cdot)}$ represents the natural growth operator of each cell type). Natural killer cells [22] are assumed to be produced at a constant rate and to be influenced by circulating lymphocytes throughout the production cycle (since circulating lymphocytes represent the overall level of immune health); thus $\mathcal{G}_{N_K} = \mathrm{c}_1 C_L - \mathrm{c}_2 N_K$. In the absence of tumor cells, cytotoxic T lymphocytes are assumed to be absent, and the growth of $C_T(t)$ cells is affected only by natural mortality, $\mathcal{G}_{C_T} = -\mathrm{e}\,C_T$. Circulating lymphocytes are also produced at a constant rate during their lifetime, $\mathcal{G}_{C_L} = \mathrm{f} - \mathrm{g}\,C_L$. When the body is injected with chemotherapy drugs or immune agents, the concentrations are set to decay exponentially, $\mathcal{G}_{Ch_{dr}} = -\mathbf{e}^{-\gamma_{\alpha}}Ch_{dr}$, $\mathcal{G}_{Im_{dr}} = -\mathbf{e}^{-\gamma_{\beta}}Im_{dr}$.

#### *4.3.2 Intercellular Conditioning*

When the above cell populations coexist, there is a negative interaction between any two of them, partly due to competition for growth space and nutrients (an indirect effect), and partly due to the direct resistance of the cell populations to each other [22]:

$$\mathcal{H}_{T_u} = -\mathrm{j}\, N_K T_u, \quad \mathcal{H}_{C_T} = \mathrm{h}_1 \cdot \frac{(C_T/T_u)^{\mathrm{i}}}{\mathrm{h}_2 + (C_T/T_u)^{\mathrm{i}}} \cdot T_u$$

To simplify the writing, we write Θ for the saturating fraction above; note that Θ = Θ(t), since it depends on $C_T(t)$ and $T_u(t)$:

$$\Theta = \mathrm{h}_1 \cdot \frac{(C_T/T_u)^{\mathrm{i}}}{\mathrm{h}_2 + (C_T/T_u)^{\mathrm{i}}}, \quad \mathcal{H}_{C_T} = \Theta \cdot T_u \tag{4.7}$$

NK cells have a recruitment function: by sequentially applying cell-cycle-nonspecific and cell-cycle-specific drugs, more cells at specific stages are recruited into the proliferation cycle, so as to increase the number of tumor cells killed [29-31].

$$\mathcal{R}_{N_K} = \frac{\mathrm{l}\, T_u^{2}}{\mathrm{m} + T_u^{2}}\, N_K; \quad \mathcal{R}_{C_T}(T_u, C_T) = \mathrm{p}_1 \frac{\Theta^{2} T_u^{2}}{\mathrm{q}_1 + \Theta^{2} T_u^{2}}\, C_T$$

$C_T$ cells have a similar recruitment effect [32]. It is directly proportional to the number of tumor cells killed by NK-cell lysis, $\mathcal{R}_{C_T}(N_K, T_u) = \mathrm{n}_1 N_K T_u$. Also, the presence of tumor cells stimulates the immune system to secrete more cells, $\mathcal{R}_{C_T}(C_L, T_u) = \mathrm{n}_2 C_L T_u$. In the immune response, NK cells or CTL cells may have to undergo multiple contacts with tumor cells and are then inactivated [29, 33-35].

$$\mathcal{G}_{ac,N_K} = -\mathrm{p}_2 T_u N_K, \quad \mathcal{G}_{ac,C_T} = -\mathrm{q}_2 C_T T_u, \quad \mathcal{G}_{C_L,C_T} = -\mathrm{r}\, N_K (C_T)^{2}$$

#### *4.3.3 Drug Intervention*

All cell populations in this model are subject to the action of chemotherapy drugs, and the killing effect is not uniformly effective: at low drug concentration the kill rate increases almost linearly, while at high drug concentration it tends to saturate. A saturation term is therefore used in the model [36], $1 - \mathbf{e}^{-Ch_{dr}(t)}$.

$$\mathcal{D}_r^{Ch}(\cdot) = \mathcal{L}_{(\cdot)}\big(1 - \mathbf{e}^{-Ch_{dr}(t)}\big)(\cdot), \quad (\cdot) = T_u, C_T, C_L, N_K.$$

$\mathcal{L}_{(\cdot)}$ represents the kill-rate coefficient of the chemotherapy drug for the corresponding cell population. The model also includes immunotherapy, whose impact on immune-system efficacy can be described mathematically by a Michaelis-Menten interaction, where s and u are constants [30].

$$\mathcal{D}_r^{Im}(C_T, Im_{dr}) = \mathrm{u}\, \frac{Im_{dr}\, C_T}{\mathrm{s} + Im_{dr}}$$

Chemotherapy and immunotherapy drugs are injected over a certain period of time; denote by $\mathcal{V}_{Che}(t)$ and $\mathcal{V}_{Im}(t)$ the injected amounts of chemotherapy drug and immunotherapy drug, respectively.

#### *4.3.4 Mixed Growth Model of Cell Population*

Combined with the above contents, the total cell population growth model can be obtained:

$$Im_{dr}(t+1) = (1 - \mathbf{e}^{-\gamma_{\beta}})\, Im_{dr}(t) + \mathcal{V}_{Im}(t) \tag{4.8a}$$

$$Ch_{dr}(t+1) = (1 - \mathbf{e}^{-\gamma_{\alpha}})\, Ch_{dr}(t) + \mathcal{V}_{Che}(t) \tag{4.8b}$$

$$C_L(t+1) = \mathrm{f} + (1 - \mathrm{g} - \mathcal{L}_{C_L})\, C_L(t) + \mathcal{L}_{C_L}\mathbf{e}^{-Ch_{dr}(t)}\, C_L(t) \tag{4.8c}$$

$$T_u(t+1) = (1 + \mathrm{a} - \mathcal{L}_{T_u})\, T_u(t) - \mathrm{a}\mathrm{b}\, T_u^{2}(t) + T_u(t)\Big[ \mathcal{L}_{T_u}\mathbf{e}^{-Ch_{dr}(t)} - \mathrm{j}\, N_K(t) - \Theta(t) \Big] \tag{4.8d}$$

$$\begin{aligned} C_T(t+1) &= (1 - \mathrm{e} - \mathcal{L}_{C_T})\, C_T(t) + \big[\mathrm{n}_1 N_K(t) - \mathrm{q}_2 C_T(t) + \mathrm{n}_2 C_L(t)\big]\, T_u(t) - \mathrm{r}\, N_K(t)\, C_T^{2}(t) \\ &\quad + \mathcal{L}_{C_T} C_T(t)\, \mathbf{e}^{-Ch_{dr}(t)} + C_T(t)\Big[ \frac{\mathrm{u}\, Im_{dr}(t)}{\mathrm{s} + Im_{dr}(t)} + \frac{\mathrm{p}_1 \Theta^{2}(t)\, T_u^{2}(t)}{\mathrm{q}_1 + \Theta^{2}(t)\, T_u^{2}(t)} \Big] \end{aligned} \tag{4.8e}$$

$$N_K(t+1) = (1 - \mathrm{c}_2 - \mathcal{L}_{N_K})\, N_K(t) + \frac{\mathrm{l}\, T_u^{2}(t)}{\mathrm{m} + T_u^{2}(t)}\, N_K(t) + \Big[ \mathcal{L}_{N_K}\mathbf{e}^{-Ch_{dr}(t)} - \mathrm{p}_2 T_u(t) \Big] N_K(t) + \mathrm{c}_1 C_L(t) \tag{4.8f}$$
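For illustration, the mixed growth model (4.8a)-(4.8f) can be stepped forward directly. The sketch below uses placeholder parameter values (the chapter's calibrated constants a, b, c<sub>1</sub>, … are not reproduced here, so every number is an assumption), and only qualitative behavior, such as untreated tumor growth and drug decay, should be read into it:

```python
import math

# Placeholder (NOT calibrated) parameters; every value here is an assumption.
par = dict(a=0.43, b=1.0e-9, c1=1.0e-4, c2=0.05, e=0.04, f=1.0e4, g=0.03,
           h1=0.3, h2=0.2, i=2.0, j=1.0e-7, l=0.1, m=1.0e4,
           n1=1.0e-7, n2=1.0e-7, p1=0.12, p2=1.0e-7, q1=1.0e-2, q2=3.0e-10,
           r=1.0e-9, s=1.0, u=0.2,
           L_Tu=0.9, L_NK=0.6, L_CT=0.6, L_CL=0.6,
           gamma_a=0.9, gamma_b=0.9)

def theta(CT, Tu, p=par):
    # Saturating CTL kill fraction Theta of (4.7)
    w = (CT / Tu) ** p['i']
    return p['h1'] * w / (p['h2'] + w)

def step(state, V_che, V_im, p=par):
    # One step of (4.8a)-(4.8f); state = (Tu, NK, CT, CL, Chdr, Imdr)
    Tu, NK, CT, CL, Chdr, Imdr = state
    kill = 1.0 - math.exp(-Chdr)              # chemotherapy saturation term
    Th = theta(CT, Tu)
    Tu1 = Tu + p['a'] * Tu * (1 - p['b'] * Tu) - p['L_Tu'] * kill * Tu \
          - p['j'] * NK * Tu - Th * Tu
    NK1 = NK + p['c1'] * CL - p['c2'] * NK \
          + p['l'] * Tu**2 / (p['m'] + Tu**2) * NK \
          - p['L_NK'] * kill * NK - p['p2'] * Tu * NK
    CT1 = CT - p['e'] * CT + (p['n1'] * NK + p['n2'] * CL - p['q2'] * CT) * Tu \
          - p['r'] * NK * CT**2 - p['L_CT'] * kill * CT \
          + CT * (p['u'] * Imdr / (p['s'] + Imdr)
                  + p['p1'] * (Th * Tu)**2 / (p['q1'] + (Th * Tu)**2))
    CL1 = CL + p['f'] - p['g'] * CL - p['L_CL'] * kill * CL
    Chdr1 = (1 - math.exp(-p['gamma_a'])) * Chdr + V_che
    Imdr1 = (1 - math.exp(-p['gamma_b'])) * Imdr + V_im
    return (Tu1, NK1, CT1, CL1, Chdr1, Imdr1)
```

With no injected drugs the tumor population grows, while any positive injection immediately raises the corresponding drug concentration, matching the structure of (4.8a)-(4.8b).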

#### **4.4 Iterative-Dual Heuristic Dynamic Programming Algorithm for Mixed Treatment**

The optimal control problem has been transformed into solving the HJB equation (4.4). In this part, a constrained iterative dual-heuristic dynamic programming algorithm for mixed treatment is given. The algorithm is derived from adaptive dynamic programming [26]. Three main pieces of research content are presented: the working mechanism of the ADP algorithm, the structure of the constrained iterative dual-heuristic dynamic programming algorithm, and the proof of convergence of the I-DHP algorithm.

#### *4.4.1 Working Mechanism of ADP Algorithm*

Generally speaking, for unconstrained control problems the performance functional (4.3) is usually chosen in quadratic form. In this chapter, considering the actual constraints, the problem is transformed into a bounded control problem, and a nonquadratic functional is adopted as follows:

$$\Upsilon(t) = \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{u(t)} \tanh^{-T}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds$$

This form is convenient for mathematical calculation while avoiding unbounded returns: in a cyclic or infinite-horizon Markov decision process, rewards accumulate again and again, so a discount factor must be added to keep the value function away from infinity. By introducing the discount factor λ, 0 < λ ≤ 1, the infinite-dimensional problem is transformed into a finite-dimensional one,

$$\mathbf{J}(t) = \sum_{l=t}^{\infty} \lambda^{l-t}\, \Upsilon(\mathbf{x}(l), u(l)) = \Upsilon(t) + \lambda \sum_{l=t+1}^{\infty} \lambda^{l-(t+1)}\, \Upsilon(\mathbf{x}(l), u(l)) \tag{4.9}$$
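The decomposition in (4.9) is just the tail-recursive form of the discounted sum, J(t) = Υ(t) + λJ(t+1), which a short backward evaluation makes concrete (the stage-cost values below are hypothetical):

```python
def discounted_return(costs, lam):
    # J(t) = sum_{l >= t} lam^(l-t) * Y(l), accumulated backwards from the tail,
    # which is exactly the split J(t) = Y(t) + lam * J(t+1) used in (4.9)
    J = 0.0
    for y in reversed(costs):
        J = y + lam * J
    return J

horizon = [3.0, 1.0, 2.0]   # hypothetical stage costs Y(t), Y(t+1), Y(t+2)
```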

According to the Bellman optimality principle, the optimal value function satisfies:

$$\mathbf{J}^{*}(\mathbf{x}(t)) = \min_{u(t)} \left\{ \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{u(t)} \tanh^{-T}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds + \lambda \mathbf{J}^{*}(\mathbf{x}(t+1)) \right\}. \tag{4.10}$$

In the ADP algorithm structure, iteration proceeds according to policy iteration, selecting T<sup>ι</sup>(*x*) as the approximate value function and φ<sup>ι</sup>(*x*) as the corresponding control law. The whole iterative process is as follows:

1. Let the initial value function be T<sup>0</sup>(·) = 0 (which is far from optimal) and compute the control law at ι = 0 as follows:

$$\varphi^{0}(\mathbf{x}(t)) = \underset{u(t)}{\arg\min} \left\{ \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{u(t)} \tanh^{-T}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds + \lambda\, \mathrm{T}^{0}(\mathbf{x}(t+1)) \right\} \tag{4.11}$$

2. Get T<sup>1</sup>(*x*(t)):

$$\mathrm{T}^{1}(\mathbf{x}(t)) = \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{\varphi^{0}(\mathbf{x}(t))} \tanh^{-T}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds + \lambda\, \mathrm{T}^{0}(\mathbf{x}(t+1)). \tag{4.12}$$

3. And for ι = 1, 2, 3, ···

$$\varphi^{\iota}(\mathbf{x}(t)) = \underset{u(t)}{\arg\min} \left\{ \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{u(t)} \tanh^{-T}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds + \lambda\, \mathrm{T}^{\iota}(\mathbf{x}(t+1)) \right\}. \tag{4.13}$$

4. The iterative value function is obtained as follows:

$$\mathrm{T}^{\iota+1}(\mathbf{x}(t)) = \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{\varphi^{\iota}(\mathbf{x}(t))} \tanh^{-T}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds + \lambda\, \mathrm{T}^{\iota}(\mathbf{x}(t+1)). \tag{4.14}$$

#### *4.4.2 Structure of Constrained Iterative Dual-Heuristic Dynamic Programming Algorithm*

In dual-heuristic dynamic programming, the value function is assumed to be smooth. Modelled on (4.5), taking the partial derivative of the right-hand side of (4.14) with respect to *u*(t) gives [37]:

$$\frac{\partial \mathrm{T}^{\iota+1}(\mathbf{x}(t))}{\partial u(t)} = \frac{\partial \left\{ \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{u(t)} \tanh^{-T}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds \right\}}{\partial u(t)} + \lambda \frac{\partial \mathrm{T}^{\iota}(\mathbf{x}(t+1))}{\partial u(t)} = \mathbf{0}.$$


And, for ι = 0, 1, 2, ···

$$\varphi^{\iota}(\mathbf{x}(t)) = \overline{\mathcal{U}} \tanh \left( \frac{-\lambda}{2\overline{\mathcal{U}}\mathbf{Q}} \left[ \frac{\partial \mathbf{x}(t+1)}{\partial u(t)} \right]^{T} \frac{\partial \mathrm{T}^{\iota}(\mathbf{x}(t+1))}{\partial \mathbf{x}(t+1)} \right) \tag{4.15}$$

Doing the same with (4.14) with respect to *x*(t),

$$\frac{\partial \mathrm{T}^{\iota+1}(\mathbf{x}(t))}{\partial \mathbf{x}(t)} = 2\mathbf{P}\mathbf{x}(t) + \lambda \left[\frac{\partial \mathbf{x}(t+1)}{\partial \mathbf{x}(t)}\right]^{T} \frac{\partial \mathrm{T}^{\iota}(\mathbf{x}(t+1))}{\partial \mathbf{x}(t+1)}. \tag{4.16}$$

As can be seen from (4.15) and (4.16), both contain $\partial \mathrm{T}^{\iota}(\mathbf{x}(t+1))/\partial \mathbf{x}(t+1)$. Compared with T<sup>ι</sup>(*x*(t)) in (4.14), the DHP algorithm thus evaluates and updates the first partial derivative of the value function.

The specific algorithm structure is as follows (set the costate function C<sup>ι</sup>(*x*(t)) = ∂T<sup>ι</sup>(*x*(t))/∂*x*(t)).

#### *4.4.3 Proof of Convergence on I-DHP Algorithm*

The convergence proof of the algorithm shows that, as the number of iterations increases, the evaluation and update steps (4.15) and (4.16) are carried out repeatedly until the termination condition is satisfied and the optimal solution is obtained.

The corresponding lemmas need to be given before the formal theorem is proved. To simplify the writing, abbreviate $2\int_{0}^{u(t)} \tanh^{-T}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds$ as H(u(t)).

**Lemma 4.1** *Assume that* φ<sup>ι</sup>(t) *is the control sequence calculated by (4.13) and* T<sup>ι</sup>(*x*) *the value function calculated by (4.14). Let* η<sup>ι</sup>(t) *be any admissible control sequence in the domain and* Ω<sup>ι</sup>(*x*) *its corresponding value function,*

$$\Omega^{\iota+1}(\mathbf{x}(t)) = \mathbf{x}^{T}(t)\mathbf{P}\mathbf{x}(t) + \mathrm{H}(\eta^{\iota}(t)) + \lambda\,\Omega^{\iota}(\mathbf{x}(t+1)) \tag{4.17}$$

*and it is easy to obtain:*

*If* Ω<sup>0</sup>(·) = T<sup>0</sup>(·) = 0*, then* 0 ≤ T<sup>ι</sup>(*x*) ≤ Ω<sup>ι</sup>(*x*)*,* ∀ι*.*

*Proof* The conclusion is obvious: T<sup>ι</sup>(*x*) is the minimum value attainable on the right-hand side of (4.14), with φ<sup>ι</sup>(t) the corresponding control sequence, while Ω<sup>ι</sup>(*x*) is generated by an arbitrary admissible control sequence, so it can be no less than T<sup>ι</sup>(*x*). □

**Lemma 4.2** *Let* T<sup>ι</sup>(*x*) *be given by (4.14). If the system is controllable, then* T<sup>ι</sup>(*x*) *has an upper bound* Z *(a constant):*

$$0 \le \mathrm{T}^{\iota}(x) \le \mathrm{Z}, \quad \forall \iota$$

#### **Algorithm 1:** Procedure of the I-DHP algorithm

#### INITIAL:

1. Select a small positive number $\epsilon$, set the initial iteration index $\iota = 0$, and let $\mathbf{C}^{0}(\cdot) = 0$.

CALCULATION:

2. Calculate the control law at the 0th iteration:

$$\begin{aligned} \phi^{0}(\mathbf{x}(t)) &= \arg\min_{u(t)} \Big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{u(t)} \tanh^{-\mathrm{T}}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds + \lambda \mathcal{T}^{0}(\mathbf{x}(t+1)) \Big\} \\ &= \overline{\mathcal{U}} \tanh\Big( \frac{-\lambda}{2\overline{\mathcal{U}}\mathbf{Q}} \Big[ \frac{\partial \mathbf{x}(t+1)}{\partial u(t)} \Big]^{\mathrm{T}} \mathbf{C}^{0}(\mathbf{x}(t+1)) \Big) \end{aligned}$$

3. Update the costate function for iteration 1:

$$\mathbf{C}^{1}(\mathbf{x}(t)) = 2\mathbf{P}\mathbf{x}(t) + \lambda \Big[ \frac{\partial \mathbf{x}(t+1)}{\partial \mathbf{x}(t)} \Big]^{\mathrm{T}} \mathbf{C}^{0}(\mathbf{x}(t+1))$$

4. Similarly, compute the control law for the $\iota$th iteration:

$$\begin{aligned} \phi^{\iota}(\mathbf{x}(t)) &= \arg\min_{u(t)} \Big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + 2\int_{0}^{u(t)} \tanh^{-\mathrm{T}}(\overline{\mathcal{U}}^{-1}\mathbf{s})\,\overline{\mathcal{U}}\mathbf{Q}\,ds + \lambda \mathcal{T}^{\iota}(\mathbf{x}(t+1)) \Big\} \\ &= \overline{\mathcal{U}} \tanh\Big( \frac{-\lambda}{2\overline{\mathcal{U}}\mathbf{Q}} \Big[ \frac{\partial \mathbf{x}(t+1)}{\partial u(t)} \Big]^{\mathrm{T}} \mathbf{C}^{\iota}(\mathbf{x}(t+1)) \Big) \end{aligned}$$

5. Obtain the costate function for iteration $\iota + 1$:

$$\mathbf{C}^{\iota+1}(\mathbf{x}(t)) = 2\mathbf{P}\mathbf{x}(t) + \lambda \Big[ \frac{\partial \mathbf{x}(t+1)}{\partial \mathbf{x}(t)} \Big]^{\mathrm{T}} \mathbf{C}^{\iota}(\mathbf{x}(t+1))$$

COMPARISON:

6. If $\| \mathbf{C}^{\iota+1}(\mathbf{x}(t)) - \mathbf{C}^{\iota}(\mathbf{x}(t)) \| \le \epsilon$, stop and take $\phi^{\iota}(\mathbf{x}(t))$ as the approximate optimal control law; else, let $\iota = \iota + 1$ and jump to step 4.
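The steps above can be sketched in code. The following is a minimal scalar sketch of Algorithm 1, assuming a linear model $x(t+1) = Ax(t) + Bu(t)$ with illustrative gains (not the tumor model of Sect. 4.5); for simplicity the costate is iterated at a single representative state rather than over the whole state space.

```python
import numpy as np

# Minimal scalar sketch of Algorithm 1 (iterative DHP).  The linear model
# x(t+1) = A*x + B*u and all gains are illustrative placeholders; the
# costate recursion is evaluated at one fixed state.
A, B = 0.5, 1.0                  # dx(t+1)/dx(t) = A, dx(t+1)/du(t) = B
P, Q, U_bar, lam = 1.0, 1.0, 1.0, 0.9
eps = 1e-8                       # small positive number of step 1

def control(C):
    # Steps 2/4:  u = U_bar * tanh( -lam / (2*U_bar*Q) * [dx'/du]^T * C )
    return U_bar * np.tanh(-lam / (2.0 * U_bar * Q) * B * C)

def idhp(x, max_iter=1000):
    C = 0.0                                  # step 1: C^0(.) = 0
    for i in range(max_iter):
        # steps 3/5: C^{i+1}(x) = 2*P*x + lam * [dx'/dx]^T * C^i(x')
        C_next = 2.0 * P * x + lam * A * C
        if abs(C_next - C) <= eps:           # step 6: termination test
            return control(C_next), C_next, i
        C = C_next
    return control(C), C, max_iter

u_opt, C_opt, iters = idhp(x=1.0)
```

Because $|\lambda A| < 1$ here, the costate recursion is a contraction and the loop terminates well before `max_iter`; the returned control is automatically bounded by `U_bar` through the `tanh` saturation.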

*Proof* Let $v^{\iota}(t)$ be an admissible, stabilizing control sequence and define $\mathcal{V}^{\iota}(x)$ by

$$\mathcal{V}^{\iota+1}(x) = \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(v^{\iota}(t)) + \lambda \mathcal{V}^{\iota}(\mathbf{x}(t+1)).$$

Then, with $\mathcal{V}^{0}(\cdot) = \mathcal{T}^{0}(\cdot) = 0$, it can be obtained that

$$\begin{aligned} \mathcal{V}^{\iota+1}(x) &= \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(v^{\iota}(t)) + \lambda \mathcal{V}^{\iota}(\mathbf{x}(t+1)) \\ &= \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(v^{\iota}(t)) + \lambda\big[\mathbf{x}^{\mathrm{T}}(t+1)\mathbf{P}\mathbf{x}(t+1) + \mathbf{H}(v^{\iota-1}(t+1))\big] + \lambda^{2}\mathcal{V}^{\iota-1}(\mathbf{x}(t+2)) \\ &= \cdots \\ &= \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(v^{\iota}(t)) + \lambda\big[\mathbf{x}^{\mathrm{T}}(t+1)\mathbf{P}\mathbf{x}(t+1) + \mathbf{H}(v^{\iota-1}(t+1))\big] + \cdots \\ &\quad + \lambda^{\iota}\big[\mathbf{x}^{\mathrm{T}}(t+\iota)\mathbf{P}\mathbf{x}(t+\iota) + \mathbf{H}(v^{0}(t+\iota))\big] + \lambda^{\iota+1}\mathcal{V}^{0}(\mathbf{x}(t+\iota+1)). \end{aligned}$$

That is,

$$\mathcal{V}^{\iota+1}(x) = \sum_{l=0}^{\iota} \lambda^{l}\big[\mathbf{x}^{\mathrm{T}}(t+l)\mathbf{P}\mathbf{x}(t+l) + \mathbf{H}(v^{\iota-l}(t+l))\big] \le \lim_{\iota\to\infty}\Big\{\sum_{l=0}^{\iota} \lambda^{l}\big[\mathbf{x}^{\mathrm{T}}(t+l)\mathbf{P}\mathbf{x}(t+l) + \mathbf{H}(v^{\iota-l}(t+l))\big]\Big\}.$$

Since the control sequence $v^{\iota}(t)$ is admissible, this limit has an upper bound $\mathcal{Z}$, so that

$$\mathcal{V}^{\iota+1}(x) \le \lim_{\iota\to\infty}\Big\{\sum_{l=0}^{\iota} \lambda^{l}\big[\mathbf{x}^{\mathrm{T}}(t+l)\mathbf{P}\mathbf{x}(t+l) + \mathbf{H}(v^{\iota-l}(t+l))\big]\Big\} \le \mathcal{Z}.$$

Combined with Lemma 4.1, the result follows. $\square$

**Theorem 4.1** *For the iterative cost function* $\mathcal{T}^{\iota}(x)$ *following (4.14) and its corresponding control law* $\phi^{\iota}(t)$ *obtained by (4.13), as the number of iterations increases,* $\mathcal{T}^{\iota}(x)$ *converges to the optimal value function and* $\phi^{\iota}(t)$ *converges to the optimal control law, i.e.,* $\mathcal{T}^{\iota}(x) \to \mathcal{J}^{*}(x)$*,* $\phi^{\iota}(t) \to u^{*}(t)$*.*

*Proof* From Lemma 4.1, $\Omega^{\iota}(x(t))$ is the cost function corresponding to an arbitrary admissible control sequence $\mu^{\iota}(t)$, with $\Omega^{0}(\cdot) = 0$.

Firstly, for $\iota = 0$,

$$\mathcal{T}^{1}(\mathbf{x}(t)) - \Omega^{0}(\mathbf{x}(t)) = \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\mu^{0}(t)) \ge 0,$$

so $\mathcal{T}^{1}(x(t)) \ge \Omega^{0}(x(t))$.

Secondly, assume that for $\iota - 1$ it holds that $\mathcal{T}^{\iota}(x(t)) \ge \Omega^{\iota-1}(x(t))$, $\forall x(t)$. Then, for $\iota$, it can be concluded that

$$\begin{aligned} \mathcal{T}^{\iota+1}(\mathbf{x}(t)) - \Omega^{\iota}(\mathbf{x}(t)) &= \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\mu^{\iota}(t)) + \lambda\big(\mathcal{T}^{\iota}(\mathbf{x}(t+1)) - \Omega^{\iota-1}(\mathbf{x}(t+1))\big) \\ &\ge \lambda\big(\mathcal{T}^{\iota}(\mathbf{x}(t+1)) - \Omega^{\iota-1}(\mathbf{x}(t+1))\big) \ge 0. \end{aligned} \tag{4.18}$$

By mathematical induction, $\mathcal{T}^{\iota+1}(x(t)) \ge \Omega^{\iota}(x(t))$, $\forall \iota$. Combined with Lemma 4.1, it follows that $\mathcal{T}^{\iota+1}(x(t)) \ge \Omega^{\iota}(x(t)) \ge \mathcal{T}^{\iota}(x(t))$; that is, $\mathcal{T}^{\iota}(x(t))$ is a non-decreasing sequence, $\forall \iota$.

From Lemma 4.2, the sequence $\mathcal{T}^{\iota}(x(t))$ is bounded above by $\mathcal{Z}$, which means the iterative equation has a limit, expressed as $\lim_{\iota\to\infty} \mathcal{T}^{\iota}(x(t)) = \mathcal{T}^{\infty}(x(t))$. It is therefore natural to conjecture that $\mathcal{T}^{\infty}(x(t)) = \min_{\phi(t)} \big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\infty}(\mathbf{x}(t+1)) \big\}$. This conjecture is proved below. According to (4.14),

$$\mathcal{T}^{\iota}(\mathbf{x}(t)) \le \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\iota-1}(\mathbf{x}(t+1)). \tag{4.19}$$

From the non-decreasing property of the sequence $\mathcal{T}^{\iota}(x(t))$, it is known that $\mathcal{T}^{\iota}(x(t)) \le \mathcal{T}^{\infty}(x(t))$, $\forall \iota$.

Substituting this into (4.19),

$$\mathcal{T}^{\iota}(\mathbf{x}(t)) \le \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\infty}(\mathbf{x}(t+1)), \quad \forall \iota. \tag{4.20}$$

Since (4.20) holds for any $\iota$, it also holds as $\iota \to \infty$:

$$\mathcal{T}^{\infty}(\mathbf{x}(t)) \le \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\infty}(\mathbf{x}(t+1)), \quad \forall t. \tag{4.21}$$

Considering that $\phi(t)$ is an arbitrary control sequence, (4.21) further yields:

$$\mathcal{T}^{\infty}(\mathbf{x}(t)) \le \min_{\phi(t)} \big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\infty}(\mathbf{x}(t+1)) \big\}. \tag{4.22}$$

With (4.14), $\mathcal{T}^{\iota}(x(t)) = \min_{\phi(t)} \big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\iota-1}(\mathbf{x}(t+1)) \big\}$, $\forall \iota$.

Since $\mathcal{T}^{\iota}(x(t))$ is non-decreasing, the left-hand side satisfies $\mathcal{T}^{\infty}(x(t)) \ge \min_{\phi(t)} \big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\iota-1}(\mathbf{x}(t+1)) \big\}$. Similarly, letting $\iota \to \infty$,

$$\mathcal{T}^{\infty}(\mathbf{x}(t)) \ge \min_{\phi(t)} \big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\infty}(\mathbf{x}(t+1)) \big\}. \tag{4.23}$$

Combining (4.22) and (4.23), it follows that,

$$\mathcal{T}^{\infty}(\mathbf{x}(t)) = \min_{\phi(t)} \big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{\infty}(\mathbf{x}(t+1)) \big\}. \tag{4.24}$$

As can be seen from (4.24), the previous conjecture is proved: $\mathcal{T}^{\infty}(x(t))$ is a solution of the discrete-time HJB equation. Considering the uniqueness of the solution of the discrete-time HJB equation, $\mathcal{T}^{\infty}(x(t))$ in (4.24) and $\mathcal{J}^{*}(x(t))$ in (4.10) are the same solution. In other words, $\lim_{\iota\to\infty} \mathcal{T}^{\iota}(x(t)) = \mathcal{T}^{*}(x(t)) = \mathcal{J}^{*}(x(t))$. $\square$

Theorem 4.1 proves that $\mathcal{T}^{\infty}(x(t))$ in (4.24) and $\mathcal{J}^{*}(x(t))$ in (4.10) are the same solution of the HJB equation corresponding to the same cost function, while the termination criterion $\|\mathcal{T}^{\iota+1}(x(t)) - \mathcal{T}^{\iota}(x(t))\| \le \epsilon$ indicates that the optimal control law can be found in finite time; Theorem 4.2 makes this precise.

**Theorem 4.2** *The system (4.1) is controllable and the initial state* $x(t)$ *can be chosen arbitrarily. Under a finite iteration index* $\iota$*, the criterion on the iterative approximate cost function and the optimal cost function,* $\|\mathcal{T}^{*}(x(t)) - \mathcal{T}^{\iota}(x(t))\| \le \epsilon$*, is equivalent to the termination criterion* $\|\mathcal{T}^{\iota+1}(x(t)) - \mathcal{T}^{\iota}(x(t))\| \le \epsilon$*.*

*Proof* In Theorem 4.1, it is shown that $\mathcal{T}^{\iota}(x(t))$ is a non-decreasing sequence, that is,

$$\mathcal{J}^{*}(\mathbf{x}(t)) = \mathcal{T}^{*}(\mathbf{x}(t)) \ge \mathcal{T}^{\iota+1}(\mathbf{x}(t)) \ge \mathcal{T}^{\iota}(\mathbf{x}(t)). \tag{4.25}$$

If $\|\mathcal{T}^{*}(x(t)) - \mathcal{T}^{\iota}(x(t))\| \le \epsilon$, it can be concluded that

$$\mathcal{T}^{*}(\mathbf{x}(t)) - \mathcal{T}^{\iota}(\mathbf{x}(t)) \le \epsilon, \quad \mathcal{T}^{*}(\mathbf{x}(t)) \le \mathcal{T}^{\iota}(\mathbf{x}(t)) + \epsilon. \tag{4.26}$$

Combined (4.26) with (4.25),

$$\mathcal{T}^{\iota}(\mathbf{x}(t)) \le \mathcal{T}^{\iota+1}(\mathbf{x}(t)) \le \mathcal{T}^{*}(\mathbf{x}(t)) \le \mathcal{T}^{\iota}(\mathbf{x}(t)) + \epsilon$$

$$\Rightarrow \mathcal{T}^{\iota}(\mathbf{x}(t)) \le \mathcal{T}^{\iota+1}(\mathbf{x}(t)) \le \mathcal{T}^{\iota}(\mathbf{x}(t)) + \epsilon. \tag{4.27}$$

It then follows that

$$\left\|\mathcal{T}^{\iota+1}(\mathbf{x}(t)) - \mathcal{T}^{\iota}(\mathbf{x}(t))\right\| \le \epsilon. \tag{4.28}$$

From the other direction, if (4.28) holds and $\mathcal{T}^{\iota}(x(t))$ is non-decreasing,

$$-\epsilon + \mathcal{T}^{\iota+1}(\mathbf{x}(t)) \le \mathcal{T}^{\iota}(\mathbf{x}(t)) \le \mathcal{T}^\*(\mathbf{x}(t)) = \mathcal{J}^\*(\mathbf{x}(t)).\tag{4.29}$$

It is obvious that $\mathcal{T}^{\iota+1}(x(t)) - \mathcal{T}^{*}(x(t)) \le \epsilon$, i.e.,

$$\left\|\mathcal{T}^{\iota+1}(\mathbf{x}(t)) - \mathcal{T}^{*}(\mathbf{x}(t))\right\| \le \epsilon. \tag{4.30}$$

Based on the analysis of both directions, it can be concluded that $\|\mathcal{T}^{*}(x(t)) - \mathcal{T}^{\iota}(x(t))\| \le \epsilon \Leftrightarrow \|\mathcal{T}^{\iota+1}(x(t)) - \mathcal{T}^{\iota}(x(t))\| \le \epsilon$. $\square$
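The behavior established in Theorems 4.1 and 4.2 can be observed numerically on a toy problem. The sketch below runs value iteration on a scalar linear system with a quadratic stage cost, so the per-step minimization has a closed form; the tanh input constraint of (4.14) is dropped for this illustration, and all coefficients are illustrative. The iterates are non-decreasing from $\mathcal{T}^{0} = 0$, and stopping when $\|\mathcal{T}^{\iota+1} - \mathcal{T}^{\iota}\| \le \epsilon$ lands near the fixed point.

```python
import math

# Toy numerical check of Theorems 4.1-4.2: value iteration on a scalar
# system x(t+1) = a*x + b*u with stage cost p*x^2 + q*u^2 and discount
# lam.  The tanh input constraint of (4.14) is dropped so that the
# minimisation over u is closed-form; all numbers are illustrative.
a, b, p, q, lam = 0.9, 1.0, 1.0, 1.0, 0.95
eps = 1e-10

def vi_step(P):
    # T^{i+1}(x) = min_u { p x^2 + q u^2 + lam * P * (a x + b u)^2 } = P' x^2
    return p + lam * a * a * P * q / (q + lam * b * b * P)

P, history = 0.0, [0.0]              # T^0(.) = 0
while True:
    P_next = vi_step(P)
    history.append(P_next)
    if abs(P_next - P) <= eps:       # termination criterion of Theorem 4.2
        break
    P = P_next

# closed-form fixed point of vi_step, for comparison with the stopped value
c1 = q - p * lam * b * b - lam * a * a * q
P_star = (-c1 + math.sqrt(c1 * c1 + 4.0 * lam * b * b * p * q)) / (2.0 * lam * b * b)
```

Here `history` is monotonically non-decreasing (Theorem 4.1), and the value at which the termination test fires differs from the analytic fixed point `P_star` by far less than `eps` amplified by the local contraction factor (Theorem 4.2).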

The two theorems deal with the value function $\mathcal{T}(x(t))$, while Algorithm 1 deals with the costate function $\mathbf{C}(x(t))$. Theorem 4.3 shows that these two notions of convergence are equivalent.

**Theorem 4.3** *(4.14) defines the sequence of value functions, the control law sequence is given by (4.13), and the costate function update sequence is given by (4.16). The optimal costate is the limit* $\mathbf{C}^{*}(x(t)) = \lim_{\iota\to\infty} \mathbf{C}^{\iota}(x(t))$*; when the value function approaches the optimal value, the sequences of costate functions and control laws converge accordingly.*

*Proof* Theorems 4.1 and 4.2 show that $\mathcal{T}^{*}(x(t))$ and $\mathcal{T}^{\infty}(x(t))$ satisfy the corresponding HJB equation, i.e., $\mathcal{T}^{\infty}(x(t)) = \mathcal{T}^{*}(x(t)) = \min_{\phi(t)} \big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) + \mathbf{H}(\phi(t)) + \lambda \mathcal{T}^{*}(\mathbf{x}(t+1)) \big\}$.

Therefore, the sequence $\mathcal{T}^{\iota}(x(t))$ of value functions converges to the optimal value function of the discrete-time HJB equation, i.e., $\mathcal{T}^{\iota} \to \mathcal{T}^{*}$ as $\iota \to \infty$.

Given $\mathbf{C}^{\iota}(x(t)) = \partial \mathcal{T}^{\iota}(x(t))/\partial x(t)$, the corresponding sequence $\mathbf{C}^{\iota}(x(t))$ of costate functions also converges, $\mathbf{C}^{\iota} \to \mathbf{C}^{*}$ as $\iota \to \infty$. Since the control law (4.13) is determined by the costate function, the control law sequence converges to the optimal control law, $\phi^{\iota} \to \phi^{*}$ as $\iota \to \infty$. $\square$

#### **4.5 Multi-factor Mixed Optimization Experiment Treatment of Tumor Cells**

This section explores a novel therapeutic intervention for inhibiting tumor cell growth. A discrete-time affine control system is constructed from the multi-factor tumor cell growth model, and the iterative DHP algorithm is applied to reduce the drug dosage while greatly inhibiting the proliferation of the tumor cell population.

#### *4.5.1 Discrete Affine Model of Tumor Cell Growth*

According to clinical medical statistics and literature [2, 30, 31, 38–41], the values of each parameter in the tumor cell proliferation model affected by multiple factors are shown in Table 4.1.

Using these parameters, the behavior of the tumor cell proliferation model can be observed under given circumstances.

With reference to [1], the initial values $Tu(0) = 2 \times 10^{7}$, $Nk(0) = 1 \times 10^{3}$, $CT(0) = 10$, $CL(0) = 6 \times 10^{8}$ were selected, and a chemotherapy drug dose of $V_{Che}(t) = 3.5$ was injected every 5 days in (4.8) to observe the changes of the various cells in the body.


**Table 4.1** Estimated parameter values

Figure 4.1 shows a pulsed injection scheme of chemotherapy drugs. The drug is injected into the body to study the influence of chemotherapy on the number of the various cell populations at different times. As can be seen from the tumor cell curve in Fig. 4.1a (the second curve), the chemotherapy dose injected every 5 days for 60 days is sufficient to control the proliferation of tumor cells. The four curves showed different forms of

**Fig. 4.1** Ten doses of chemotherapy over 60 days has been sufficient to eliminate the tumor. **a** Curves of the population of the four cell species. **b** Distribution of 10 doses of chemotherapy drugs within 60 days and the trend of changes in the concentration of chemotherapy drugs in vivo

oscillatory changes in the early stage, which mainly depended on the pulsed injection of chemotherapy drugs; the immunospecific cells CT also decreased to a stable level after the tumor cells stabilized in the later stage. Figure 4.1b shows the corresponding mode of administration, with the red curve showing the administration pulses and the green curve showing the change of the chemotherapy drug concentration in the body.

#### *4.5.2 Construction of Affine Model*

Although the discrete model has been obtained in (4.8), it is too complex and contains various coupling terms, making it difficult to combine directly into the iteration-DHP structure. The idea of constructing a simple affine model is therefore introduced. From the two preceding subsections, the dynamics can be simplified to the influence of the injected concentrations of the two drugs on the tumor cells in the body. The current concentration of tumor cells is selected as the state variable, and the injected concentrations of the two drugs (chemotherapy drugs and immune agents) are used as the control variables to form a data set from a large number of random samples. The desired affine discrete model is then obtained by fitting.

$$\mathbf{x}(t+1) = f(\mathbf{x}(t)) + \begin{bmatrix} g_1(\mathbf{x}(t)) \\ g_2(\mathbf{x}(t)) \end{bmatrix}^{\mathrm{T}} u(t) \tag{4.31}$$

$$\begin{aligned} g_1\big(\log_{10}(\mathbf{x})\big) &= 0.001771\big(\log_{10}(\mathbf{x})\big)^{5} - 0.02931\big(\log_{10}(\mathbf{x})\big)^{4} + 0.1793\big(\log_{10}(\mathbf{x})\big)^{3} \\ &\quad - 0.5353\big(\log_{10}(\mathbf{x})\big)^{2} + 1.741\big(\log_{10}(\mathbf{x})\big) - 1.133 \end{aligned} \tag{4.32}$$

$$\begin{aligned} g_2\big(\log_{10}(\mathbf{x})\big) &= 0.007579\big(\log_{10}(\mathbf{x})\big)^{4} - 0.1087\big(\log_{10}(\mathbf{x})\big)^{3} \\ &\quad + 0.4838\big(\log_{10}(\mathbf{x})\big)^{2} + 0.1783\big(\log_{10}(\mathbf{x})\big) - 0.2304 \end{aligned} \tag{4.33}$$
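To make (4.31) concrete, the sketch below evaluates the fitted input gains and advances the affine model one step. The polynomial exponents are assumed to descend with the printed coefficient order (degree 5 for $g_1$, degree 4 for $g_2$), and the drift `f` is left as a user-supplied function, since its fitted form is not reproduced here.

```python
import numpy as np

# One step of the affine model (4.31), using the fitted input-gain
# polynomials (4.32)-(4.33).  Exponents descending with coefficient order
# is an assumption of this sketch; numpy.polyval takes coefficients
# highest degree first, matching that reading.
g1_coeffs = [0.001771, -0.02931, 0.1793, -0.5353, 1.741, -1.133]   # (4.32)
g2_coeffs = [0.007579, -0.1087, 0.4838, 0.1783, -0.2304]           # (4.33)

def affine_step(x, u, f):
    """x: tumor concentration, u: (chemo, immune) doses, f: drift term."""
    z = np.log10(x)
    g = np.array([np.polyval(g1_coeffs, z), np.polyval(g2_coeffs, z)])
    return f(x) + g @ u        # x(t+1) = f(x(t)) + [g1, g2] u(t)
```

With zero input the model reduces to the drift alone, which gives a quick sanity check of the wiring.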

#### *4.5.3 Optimization of Mixed Treatment Regimen*

Following the affine model above, the cost function required by iteration-DHP must be specified before the treatment can be optimized:


**Table 4.2** Default parameters

**Fig. 4.2** The iteration error change curve, after the end of the 67th iteration, satisfies the termination condition

$$\begin{aligned} \mathcal{J}(\mathbf{x}(t)) = \sum_{\iota=0}^{\infty} \lambda^{\iota} \Big\{ \mathbf{x}^{\mathrm{T}}(t)\mathbf{P}\mathbf{x}(t) &+ m_1 \int_{0}^{u_1(t)} \tanh^{-\mathrm{T}}(\overline{\mathcal{U}}_1^{-1}\mathbf{s})\,\overline{\mathcal{U}}_1 \mathbf{Q}_1\, ds \\ &+ m_2 \int_{0}^{u_2(t)} \tanh^{-\mathrm{T}}(\overline{\mathcal{U}}_2^{-1}\mathbf{s})\,\overline{\mathcal{U}}_2 \mathbf{Q}_2\, ds \Big\}. \end{aligned} \tag{4.34}$$
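In the scalar case, each nonquadratic penalty term of this form admits a closed-form antiderivative. The sketch below evaluates $\int_0^u \tanh^{-1}(s/\bar{U})\,\bar{U}Q\,ds$ both analytically and by midpoint quadrature as a cross-check; all numerical values are illustrative.

```python
import math

# Scalar evaluation of one bounded-input penalty term of (4.34),
#   H(u) = int_0^u tanh^{-1}(s / U_bar) * U_bar * Q ds,
# in closed form and by midpoint-rule quadrature as a cross-check.
def penalty_closed(u, U_bar, Q):
    r = u / U_bar
    return U_bar * Q * u * math.atanh(r) + 0.5 * U_bar**2 * Q * math.log(1.0 - r * r)

def penalty_quad(u, U_bar, Q, n=20000):
    h = u / n
    return U_bar * Q * h * sum(math.atanh((k + 0.5) * h / U_bar) for k in range(n))
```

The logarithmic term keeps the penalty finite only for $|u| < \bar{U}$, which is exactly the saturation range that the tanh control law of Algorithm 1 guarantees.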

According to clinical experience, the default parameters are shown in Table 4.2. The iteration error threshold is set to $10^{-6}$, and the iteration error curve is shown in Fig. 4.2. The error decreases extremely fast in the first twenty iterations, and the convergence rate gradually slows after 20 iterations. At $\iota = 67$, the termination condition is satisfied.

After the termination criterion is met, the tumor cell population under the optimized regimen is analyzed; the population change curve is shown in Fig. 4.3, where the growth of the tumor cells is visibly stemmed at an extremely rapid rate. The usage and dosage of the two drugs

**Fig. 4.3** Tumor cell population changes in optimized treatment

are shown in Fig. 4.4. Figure 4.4a represents the curve of injected concentration of chemotherapy drugs, and Fig. 4.4b represents the curve of injected concentration of immune drugs.

#### **4.6 Conclusion**

In this chapter, a tumor immune differential game system has been established to solve the problem of optimal clinical tumor treatment oriented to evolutionary dynamics. Firstly, a mathematical model of the game system between tumor cells and immune cells treated by immune agents and chemotherapy drugs has been given. Secondly, the bounded optimal control problem has been solved by the HJB equation with infinite horizon performance index which is subjected to practical constraints. Finally, the optimal iterative approximate control strategy has been obtained by the iterative dual heuristic dynamic programming algorithm, and the effectiveness of the proposed algorithm has been proved.

**Fig. 4.4** Optimization of treatment of different drugs usage and dosage. **a** Injection concentration of chemotherapeutic drugs, **b** Injection concentration of immune agents

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 5 N-Level Hierarchy-Based Optimal Control to Develop Therapeutic Strategies for Ecological Evolutionary Dynamics Systems**

#### **5.1 Introduction**

The death toll from tumor diseases is on the rise, and the nonlinear dynamics and control of tumor growth have attracted widespread attention [1]. The number of tumor cells increases gradually, and the most obvious feature is abnormal anti-growth signaling. Normal cells are under a strict control mechanism; however, during the ongoing process, quiescence and death signals are turned off and cell division signals are generated, which leads to the uncontrolled growth of tumor cells [2, 3]. Tumor cells promote the growth of blood vessels, which are necessary to provide nutrients; this is why blood flow in tumor tissue is related to whether a tumor is benign or malignant. Cancer cells are also polarized. They have evolved camouflage abilities in the ongoing battle with immune cells, causing the immune system to mistake them for normal cells, which makes it difficult for chemotherapeutic drugs to distinguish their biological targets [4, 5]. When the differentiation process of normal cells is no longer controlled, they evolve into tumor cells. This is the nature of tumor cells: they proliferate continuously, deprive the body of its limited energy supply, and ultimately destroy the body's function until death [6]. Therefore, in order to inhibit the growth of tumor cells, it is urgent to find a treatment that minimizes the damage to the patient.

In the fight against cancer, before the advent of chemotherapy and radiotherapy, there were no effective measures exploiting the small differences between cancer cells and normal cells [7, 8]. As the side effects of radiotherapy and chemotherapy mounted and targeted therapy proved highly specific yet inflexible, research began to turn to the human body itself [9]. The complex and unique communities of cell life are called microenvironments by scientists. The microenvironment has many characteristics that affect how cells grow, behave, and communicate with other cells nearby [10]. In the oncology world, researchers are committed to understanding the tumor microenvironment and trying to find feasible treatment opportunities. Under normal circumstances, the immune system can recognize and eliminate tumor cells in the tumor microenvironment. However, in order to survive and grow, tumor cells can adopt different strategies to suppress the body's immune system so that it fails to kill them normally, thereby surviving the various stages of the anti-tumor immune response. These characteristics of tumor cells are called immune escape. Tumor cells escape the immune system not because the immune system cannot recognize them, nor because it is not activated, but because cancer cells have evolved a way to prevent T cell activation through specific binding [11–13]. Therefore, the medical community has been trying to find special methods of treating cancer cells that release this block on T cell activation and unleash the immune system.

Chemotherapy not only kills rapidly dividing tumor cells but also affects normal cells. Its side effects are the most obvious, but they can be alleviated by immunotherapy. The blockade of immune checkpoints and the success of adoptive cell therapy have made immunotherapy a mature means of treating cancer [14, 15]. Compared with traditional therapies such as surgery, radiotherapy, and chemotherapy, immunotherapy has fewer side effects and better efficacy, but it is difficult for immunotherapy to overcome its transient nature. With the rapid increase in tumor patients, immunotherapy is rapidly emerging for the treatment of specific types of cancer, especially tumors with poor immunogenicity [2]. The original intention of immunotherapy is to fight cancer cells through the lethality of the immune cells themselves. As a typical immune deficiency syndrome, AIDS is caused by the failure of the immune response and is often attributed to a weakened immune level. However, once an activated immune system cannot be stopped, cytokines are produced, which is considered an overreaction of the immune system, as in COVID-19 [16, 17]. Therefore, the combined treatment of chemotherapy and immunotherapy is more reasonable. Immunotherapy refers to a treatment method that artificially enhances or suppresses the body's immune function, according to its hypo- or hyperimmune state, to achieve the purpose of curing disease. There are many immunotherapy methods, which are suitable for the treatment of many diseases. Tumor immunotherapy aims to activate the human immune system, relying on its own immune function to kill cancer cells and tumor tissues [18]. Unlike surgery, chemotherapy, radiotherapy, and targeted therapy, the target of immunotherapy is not the tumor cells and tissues, but the body's own immune system [19].
Different types of tumor cells interact with different types of immune cells, and these immune cells have the function of helping or attacking tumors [20].

The mechanism of immune regulation varies from person to person, but in special cases, optimal regulation based on immunotherapy can reduce tumor cells regardless of the specific circumstances. Enhancing tumor antigen presentation can effectively stimulate dendritic cells and improve immunotherapeutic efficacy [21, 22]. The well-known "predator-prey" relationship between immune cells and tumor cells causes periodic growth and reduction of the cell populations. This growth and reduction can continue indefinitely or reach a balanced saddle point determined by the system parameters [23]. All of the above forms a complex nonlinear structure, and it is difficult to achieve the global optimum with conventional optimization methods. Especially for treatment of the human body, how to use drugs rationally so as to minimize harm is particularly important. This chapter therefore proposes a novel evolutionary computation method, the N-Level Hierarchy Optimization (NLHO) algorithm. It is inspired by the hierarchical systems of biological populations in the natural world. A hierarchical system refers to the phenomenon in which the status of each animal in a group follows a certain order. The basis for the formation of the hierarchy is dominance behavior, that is, the "domination-submission" relationship [24]. When the formed hierarchy stabilizes, lower-ranking individuals generally show compromise and obedience, but sometimes they struggle again to change the hierarchical order, and so on. A stable population develops over the long term; this explains why the hierarchy has been preserved through evolutionary selection [25]. For the entire species population, this is conducive to the preservation and continuation of the species.
A variety of biological interactions constitute a complex nonlinear growth process of tumor cells, and the main influencing factors of tumor cell populations are the focus of research. Hunting cells refer to immune cells that participate in the removal of foreign objects and strengthen the immune response [26, 27].

In the NLHO algorithm, an N levels optimization structure is designed, which includes the leader level, guider level, executant level and follower level. In the entire population, the individual with the best search position is selected as the leader, who has the grasp of the entire search direction of the team it leads. The second level is the guider level, which executes the tasks issued by the leader and follows the direction of the leader to find the best. Of course, in the whole process, the guider will also refer to the task allocation of the global optimal leader to guide the executants to find the best, so as to prevent the leader of the team from falling into a local optimum. The third level is the executant level, which follows the guider to complete the task, in order to achieve a wider area of coverage search. At the same time, it will also refer to the tasks assigned by the leaders of the ethnic group to make the task goals clearer and speed up the convergence. The last level is the follower level. At this level, followers can be divided into any level to solve different optimization problems. Of course, in the later stage of searching, there may be excessive overlap between population individuals [28].

#### **5.2 Ecological Evolutionary Dynamics Systems Model**

This part mainly introduces the mathematical growth model of tumor cells, which takes into account the influence of external factors such as chemotherapy drugs and immunotherapy on tumor cells, as well as the interaction between the two cell types. In the following model, $T(t)$ represents the number of tumor cells, $I(t)$ the number of immune cells, and $Con_{che}(t)$ and $Con_{im}(t)$ the blood concentrations of chemotherapy drugs and immunotherapy drugs, respectively.

Taking into account the interaction between immune cells and tumor cells, the direct killing of chemotherapeutic drugs and the growth model of tumor cells can be written as


$$\begin{aligned} T(t+1) &= T(t) + \vartheta\_1 \times T(t) \times \left(1 - \vartheta\_2 \times T(t)\right) \\ &- \gamma \times T(t) \times I(t) - \varepsilon \times T(t) \times Con\_{che}(t) \end{aligned} \tag{5.1}$$

where $\vartheta_1$ stands for the inherent growth rate unrelated to immune cells and chemotherapy drugs, $\vartheta_2$ stands for the saturation coefficient of tumor growth ignoring immune cells and chemotherapy drugs, $\gamma$ stands for the rate at which tumor cells are inactivated when attacked by immune cells, and $\varepsilon$ stands for the stress response coefficient of tumor cells to chemotherapeutics.

Considering the natural growth law of immune cells, we assume that a fixed number of immune cells are produced in a unit time, and these cells have an inevitable life cycle. Tumor cells in the body can stimulate the growth of immune cells, which is a positive non-linear change. In immunotherapy, the addition of immune drugs can produce an immune response, leading to non-linear growth of immune cells. At the same time, in the struggle between immune cells and tumor cells, the immune cells themselves will also cause losses. In chemotherapy, chemotherapy drugs can also cause damage to immune cells.

$$\begin{aligned} I(t+1) &= I(t) + \vartheta_3 - \lambda \times I(t) \\ &\quad + \frac{\alpha_1 \times T^2(t) \times I(t)}{\beta_1 + T^2(t)} + \frac{\alpha_2 \times T(t) \times Con_{im}(t)}{\beta_2 + Con_{im}(t)} \\ &\quad - \xi_1 \times T(t) \times I(t) - \xi_2 \times Con_{che}(t) \times I(t) \end{aligned} \tag{5.2}$$

where $\vartheta_3$ stands for the rate of continuous inflow, $\lambda$ stands for the natural decay rate without any external effects, $\alpha_1$ stands for the maximum recruitment rate caused by tumor cells, $\alpha_2$ stands for the maximum recruitment rate caused by immunotherapeutic drugs, $\beta_1$ stands for the steepness factor caused by tumor cells, $\beta_2$ stands for the steepness coefficient caused by immunotherapeutic drugs, $\xi_1$ stands for the loss rate of immune cells in their response to tumor cells, and $\xi_2$ stands for the stress response coefficient of immune cells to chemotherapy drugs.

At any time after the injection of chemotherapy drugs, the concentration of the drugs in the body decreases exponentially. Assuming immune drugs are added in the same manner, the attenuation models of chemotherapy drugs and immune drugs in vivo can be obtained as

$$Con_{che}(t+1) = \chi_{che}(t) + e^{-\theta_1} Con_{che}(t) \tag{5.3}$$

$$Con_{im}(t+1) = \chi_{im}(t) + e^{-\theta_2} Con_{im}(t) \tag{5.4}$$

where $\chi_{che}(t)$ and $\chi_{im}(t)$ represent the injected concentrations of chemotherapeutic drugs and immune drugs, respectively, and $\theta_1$ and $\theta_2$ are the attenuation rates of chemotherapy drugs and immune drugs.
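A minimal forward simulation of (5.1)-(5.4) is sketched below. All parameter values are illustrative placeholders, not the clinical estimates of this chapter, and the drug recursions follow the convention above: the next concentration is the fresh injection plus the exponentially decayed residue.

```python
import math

# Forward simulation sketch of the tumor-immune-drug model (5.1)-(5.4).
# All parameter values are illustrative placeholders, not clinical data.
p = dict(th1=0.18, th2=1e-9, gamma=1e-7, eps=1e-2,        # (5.1)
         th3=1e3, lam=0.05, a1=0.1, b1=1e10, a2=0.1, b2=1.0,
         xi1=1e-10, xi2=1e-2,                             # (5.2)
         theta1=0.4, theta2=0.4)                          # (5.3)-(5.4)

def step(T, I, Cche, Cim, chi_che, chi_im):
    T_next = T + p['th1']*T*(1 - p['th2']*T) - p['gamma']*T*I - p['eps']*T*Cche
    I_next = (I + p['th3'] - p['lam']*I
              + p['a1']*T*T*I/(p['b1'] + T*T)
              + p['a2']*T*Cim/(p['b2'] + Cim)
              - p['xi1']*T*I - p['xi2']*Cche*I)
    Cche_next = chi_che + math.exp(-p['theta1'])*Cche     # injection + decay
    Cim_next = chi_im + math.exp(-p['theta2'])*Cim
    return T_next, I_next, Cche_next, Cim_next

# pulsed chemotherapy: one dose every 5 days, no immunotherapy
state = (1e6, 1e3, 0.0, 0.0)
for t in range(60):
    dose = 1.0 if t % 5 == 0 else 0.0
    state = step(*state, chi_che=dose, chi_im=0.0)
```

With these placeholder rates, the tumor population saturates logistically near $1/\vartheta_2$ while the drug concentration traces the pulsed dosing pattern of Fig. 4.1b.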

How to minimize the number of tumor cells remaining in the bloodstream while using as few drugs as possible, including chemotherapy drugs and immune drugs, can be described by a quantitative mathematical expression. From formulas (5.1)–(5.4), we can get:

$$\begin{aligned} F_{\min} = \sum_{t=t_0}^{\infty} \delta^{t} \Big\{ \omega T^2(t) &+ \int_0^{\chi_{che}(t)} \tanh^{-\mathrm{T}}(\bar{U}_1^{-1}s)\,\bar{U}_1 \mathcal{R}_1\, ds \\ &+ \int_0^{\chi_{im}(t)} \tanh^{-\mathrm{T}}(\bar{U}_2^{-1}s)\,\bar{U}_2 \mathcal{R}_2\, ds \Big\} \end{aligned} \tag{5.5}$$

where, *U*¯ <sup>1</sup> and *U*¯ <sup>2</sup> respectively represent the maximum allowable dose of chemotherapy drugs and the dose of a single injection of immunizing agent, δ is the discount factor, ω is a constant coefficient.

#### **5.3 N-Level Hierarchy Optimization Algorithm**

#### *5.3.1 Leader Level of the Hierarchy*

First of all, as individuals with high fitness values, leaders have strong self-learning capabilities. The leader's iterative formula is therefore designed as follows:

$$\mathbf{x}\_{l,j}^{t+1} = \mathbf{x}\_{l,j}^{t} \left( 1 + randn(\mu\_l, \sigma\_l) \right) \tag{5.6}$$

where $l$ denotes the $l$th leader in the population, $j$ is the dimension index, and $t$ is the iteration number. $randn(\mu_l, \sigma_l)$ is a Gaussian distribution with mean $\mu_l = 0$ and standard deviation $\sigma_l$ given below:

$$\sigma\_{l} = \begin{cases} 1 & \text{, } f\_{l}^{t} \le f\_{i}^{t} \\ \exp(f\_{l}^{t} - f\_{i}^{t}) & \text{, } f\_{l}^{t} > f\_{i}^{t} \end{cases}, i \in [1, 2, \dots, N], i \ne l \tag{5.7}$$

where, *f <sup>t</sup> <sup>l</sup>* is the fitness value of the *l*th leader at the *t*th iteration, and *f <sup>t</sup> <sup>i</sup>* is the fitness value of any individual in the population that is different from the *l*th leader.

#### *5.3.2 Guider Level of the Hierarchy*

Secondly, the guiders steer the general direction of the evolution of the entire population on behalf of the leader: a guider must not only learn from the overall best individual, but also obey the leader's command.

82 5 N-Level Hierarchy-Based Optimal Control to Develop Therapeutic Strategies …

$$\begin{split} x_{g,j}^{t+1} &= x_{g,j}^{t} + randn(\mu_{g}, \sigma_{g}^{2}) \times (x_{l,j}^{t} - x_{g,j}^{t}) \\ &\quad + s_{1} \times (x_{best,j}^{t} - x_{g,j}^{t}) \end{split} \tag{5.8}$$

where $g$ denotes the $g$th guider in the population, $best$ denotes the best individual in the current iteration, $s_1$ is the acceptance factor of the guider, and $\mu_g = 0.5$.

$$\sigma_g = \exp(f_l^t - f_g^t) \tag{5.9}$$

$$s_1 = \exp\left(\frac{f_{best}^t - f_g^t}{|f_g^t| + \varepsilon}\right) \tag{5.10}$$

where $\varepsilon$ is a small positive constant that prevents division by zero when a guider's fitness value is 0.

#### *5.3.3 Executant Level of the Hierarchy*

The executants, as the main body of the entire population, seek the optimum: on the one hand they follow the guider's arrangements, and on the other hand they follow the leader's direction.

$$\begin{split} x_{e,j}^{t+1} &= x_{e,j}^{t} + randn(\mu_{e}, \sigma_{e}^{2}) \times (x_{g,j}^{t} - x_{e,j}^{t}) \\ &\quad + s_{2} \times (x_{l,j}^{t} - x_{e,j}^{t}) \end{split} \tag{5.11}$$

where $e$ denotes each executant in the population, $s_2$ is the acceptance factor of the executant, and $\mu_e = 0.8$.

$$\sigma_{e} = \exp\left(\frac{f_g^t - f_{e}^t}{|f_{e}^t| + \varepsilon}\right) \tag{5.12}$$

$$s_2 = \exp(f_l^t - f_e^t) \tag{5.13}$$

#### *5.3.4 Follower Level of the Hierarchy*

Finally, there are the followers, who are themselves divided into multiple levels. Followers at different levels learn from each other and prompt the executants they follow to check for deficiencies.

$$\begin{aligned} x_{f_n,j}^{t+1} &= x_{f_n,j}^t + randn(\mu_{f_n}, \sigma_{f_n}^2) \times (x_{f_{n-1},j}^t - x_{f_n,j}^t) \\ &\quad + c_n \times rand \times (x_{e,j}^t - x_{f_n,j}^t) \end{aligned} \tag{5.14}$$

where $f_n$ is the $n$-level follower, $c_n$ is the absorption factor of the $n$-level follower, $\mu_{f_n} = 0.8 - 0.6 \times (t/t_{max})$, $x_{f_0,j}^t = x_{e,j}^t$, $f_{f_0}^t = f_e^t$, and $n$ is a natural number greater than 0.

$$\sigma_{f_n} = \exp(f_{f_{n-1}}^t - f_{f_n}^t) \tag{5.15}$$

$$c_n = \exp(f_{f_{n-1}}^t - f_{f_n}^t) \tag{5.16}$$
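The four update rules (5.6)–(5.16) can be collected into a small Python sketch. All names here are illustrative, minimization is assumed (smaller fitness is better), $randn(\mu, \sigma^2)$ is realized as a Gaussian draw with mean $\mu$ and standard deviation $\sigma$, and the clamping of the exponents is a numerical guard we add so that $\exp(\cdot)$ cannot overflow for large fitness gaps:

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-12  # the small constant epsilon in (5.10) and (5.12)

def _exp(v):
    # exp with a clamped argument (our numerical guard against overflow)
    return float(np.exp(min(v, 50.0)))

def leader_step(x_l, f_l, f_i):
    """Leader update (5.6)-(5.7): self-learning Gaussian perturbation."""
    sigma = 1.0 if f_l <= f_i else _exp(f_l - f_i)
    return x_l * (1.0 + rng.normal(0.0, sigma, size=x_l.shape))

def guider_step(x_g, f_g, x_l, f_l, x_best, f_best):
    """Guider update (5.8)-(5.10): follow the leader and the current best."""
    sigma = _exp(f_l - f_g)                          # (5.9)
    s1 = _exp((f_best - f_g) / (abs(f_g) + EPS))     # (5.10)
    return (x_g + rng.normal(0.5, sigma, size=x_g.shape) * (x_l - x_g)
            + s1 * (x_best - x_g))

def executant_step(x_e, f_e, x_g, f_g, x_l, f_l):
    """Executant update (5.11)-(5.13): follow the guider and the leader."""
    sigma = _exp((f_g - f_e) / (abs(f_e) + EPS))     # (5.12)
    s2 = _exp(f_l - f_e)                             # (5.13)
    return (x_e + rng.normal(0.8, sigma, size=x_e.shape) * (x_g - x_e)
            + s2 * (x_l - x_e))

def follower_step(x_fn, f_fn, x_prev, f_prev, x_e, t, t_max):
    """n-level follower update (5.14)-(5.16); x_prev is the (n-1)-level
    follower, with level 0 defined as the executant itself."""
    mu = 0.8 - 0.6 * t / t_max
    sigma = _exp(f_prev - f_fn)                      # (5.15)
    cn = _exp(f_prev - f_fn)                         # (5.16)
    return (x_fn + rng.normal(mu, sigma, size=x_fn.shape) * (x_prev - x_fn)
            + cn * rng.random() * (x_e - x_fn))
```

A full NLHO iteration would apply these steps level by level (leaders, guiders, executants, followers) and then re-evaluate fitness and reassign individuals to levels according to the distribution rates given in the next section.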

#### **5.4 Simulation and Analysis for NLHO**

In this experiment, the population size is set to 100 and the maximum number of iterations to 100. Each algorithm is run independently 50 times, and the spatial dimension is selected according to the test function. The distribution rates of the levels are LPercent = 10%, GPercent = 20%, EPercent = 40%, and FPercent = 30%. The number of update generations is G = 10.

For independent tests of the 20 test functions, we separately record the mean, minimum and standard deviation to evaluate the performance of NLHO under difficulties of various kinds. At the same time, we select several typical algorithms for comparison, namely Taboo Search (TS), Chicken Swarm Optimization (CSO), Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Simulated Annealing (SA), so as to compare the performance of the NLHO algorithm horizontally. The test results are shown in Table 5.1.

For the independent tests of the benchmark functions, we calculated five parameter indexes: the best, worst, median, average and standard deviation. At the same time, in order to verify whether the results are statistically significant, we use the Wilcoxon rank-sum test between NLHO and the other algorithms. "+", "−", and "≈" mean that the proposed NLHO is significantly better, significantly worse, and not significantly different in the comparison, respectively.

**Table 5.1** Experimental simulation results

It can be seen from the experimental results that, compared with the other five algorithms over the 60 indicators of the 20 test functions (three indicators per function), NLHO wins on 58, 29, 59, 53 and 56 indicators respectively, and the wins are spread across different types of test functions, which reflects the robustness of NLHO. Moreover, the Wilcoxon rank-sum test shows that NLHO is not only statistically different from the other population algorithms, but also has obvious advantages. The experimental results for each function are discussed in more detail below.

The Ackley function *f*<sup>1</sup> is widely used to test optimization algorithms. It is a continuous test function obtained by superimposing an exponential function on a moderately amplified cosine. It is characterized by an almost flat outer region modulated by a cosine wave that forms holes and peaks, making the surface undulating, with a large hole at its center. For the Ackley function, both NLHO and CSO found the global minimum, and their average value also equals the global minimum, so the standard deviation is 0 as well. The optimization process of the NLHO algorithm, including the initial iteration diagram, the final result diagram and the intermediate convergence curve, is shown in Fig. 5.1. These two algorithms are comparable; SA's performance is average, while TS, GA and ACO perform poorly.
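For reference, the Ackley function can be coded in its common analytic form. The chapter does not print the formula, so the constants $a = 20$, $b = 0.2$, $c = 2\pi$ below are the usual textbook choices rather than values taken from the text:

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Common form of the Ackley benchmark: an almost flat outer region
    modulated by a cosine, with one deep hole at the origin, where the
    global minimum value is 0."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x ** 2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)
```

Evaluating `ackley(np.zeros(d))` returns 0 for any dimension `d`, which is the global minimum the algorithms above are searching for.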

The Cross-in-Tray function *f*<sup>2</sup> has multiple global minima. On this function, the mean, minimum and standard deviation of NLHO are all the best, and the experimental results are shown in Fig. 5.2. ACO also found the global optimum, but its mean and standard deviation are slightly inferior to NLHO's. TS, CSO and SA performed well, while GA's performance was average.

The Drop-Wave function *f*<sup>3</sup> is multimodal and very complicated. Over each smaller input domain it features multiple ring-shaped peaks and valleys, with the depth of the valleys gradually decreasing outward from the center. During optimization it is very easy to fall into a local optimum. For this kind of complex function, NLHO has shown strong optimization ability: the global minimum, mean and standard deviation all reached the best values. The experimental results are shown in Fig. 5.3. CSO, ACO and SA performed well, while TS and GA performed poorly.

#### **5.5 Develop Therapeutic Strategies for Ecological Evolutionary Dynamics Systems Using NLHO**

In this section, we apply the NLHO algorithm to the EEDS model as an experimental verification. According to clinical treatment needs, chemotherapeutic drugs and immune drugs are used as inputs, and the cost of treatment loss is used as the objective function. Through the iteration of the NLHO algorithm, the optimal therapeutic strategies for patients with a given basic condition are worked out.

**Fig. 5.2** Cross-in-tray


**Fig. 5.3** Drop-wave


**Table 5.2** Experimental parameter

According to clinical medical statistics borrowed from the literature [29], the specific parameters of the dynamic model are presented in Table 5.2. Based on the above, we have completed the establishment of the EEDS model and determined the specific value of the cost function according to clinical needs. The feasibility and effectiveness of the NLHO algorithm have also been verified on the benchmarks. We then apply NLHO to the EEDS model to develop therapeutic strategies; the best treatment strategy obtained through the experiments demonstrates the effectiveness and feasibility of the algorithm. The cost function is designed to minimize the number of tumor cells while using the smallest doses of chemotherapeutic and immune drugs, so as to cause the least harm to the human body.

**Fig. 5.4** Quantity curve of tumor cells

Given the patient's initial numbers of tumor cells and immune cells, following the EEDS model and certain chemotherapy and immunotherapy plans, we obtain the curves of tumor cells and immune cells below. Figures 5.4 and 5.5 show the quantity curves of tumor cells and immune cells. Within a one-year treatment period, the number of tumor cells was successfully reduced to 254. Although the number of immune cells was reduced to $1.52 \times 10^6$, significant effects were obtained for the treatment of the tumor. Moreover, as shown in Figs. 5.7, 5.8 and 5.9, due to the decline of the body's immune cells, the immune drug dosage dropped to 0.022 and then increased to about 0.03. This caused the concentration of immune drugs in human blood to rise from the trough of 0.0096 to 0.03, reaching a sufficient level.

The dosage of the chemotherapeutic drug changes adaptively and dynamically, as shown in Fig. 5.6, and the concentration of the chemotherapeutic drug in the blood follows the same trend, as shown in Fig. 5.8. The reason is to minimize drug damage: chemotherapy drugs are harmful, killing not only tumor cells but also immune cells. If chemotherapeutic drugs were administered according to conventional treatment schedules, normal cells would suffer heavy erosion and the impact on the body would be even more significant. However, the drug dose optimized by NLHO changes adaptively, and the impact on normal cells is appropriately reduced without affecting the elimination of tumor cells.

**Fig. 5.5** Quantity curve of immune cells

**Fig. 5.6** Dosage of chemotherapeutic drugs

**Fig. 5.7** Quantity curve of immune drug

**Fig. 5.8** Concentration of chemotherapy drugs

**Fig. 5.9** Concentration of immune reagents

#### **5.6 Conclusion**

It is a difficult problem to solve the optimal therapeutic strategy for EEDS. Theoretically, the desired therapeutic effect can be achieved by reducing the tumor cells through the combination of chemotherapeutic drugs and immune drugs while minimizing the harm to the human body. Benefiting from the concept of heuristic algorithms in evolutionary computing, this chapter has designed the NLHO algorithm and tested it on 20 benchmark functions, including unimodal and multimodal, single-extremum and multi-extremum functions. It is compared with the five algorithms TS, CSO, GA, ACO and SA, each run independently 50 times to calculate the mean, minimum and standard deviation. The results prove that NLHO has good optimization ability and can solve various problems well. At the same time, the therapeutic strategies developed for EEDS have achieved very good results. The experimental results show that the NLHO algorithm develops therapeutic strategies well and provides valuable prior knowledge and a scientific basis for clinical medicine. Future work will further improve the EEDS and integrate optimal control strategies with evolutionary computation methods to obtain the optimal treatment method.

#### **References**




**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 6 Combination Therapy-Based Adaptive Control for Organism Using Medicine Dosage Regulation Mechanism**

#### **6.1 Introduction**

The death toll caused by neoplastic diseases is soaring, and issues concerning the nonlinear dynamics and control of tumour growth have motivated widespread concern [1]. Essential nutrients in humans are the resources for which normal cells and tumor cells compete. Tumour cells keep proliferating, robbing the body's limited energy supply, and eventually disintegrate the somatic functions, leading to death. Somatic cells constantly divide, and new cells differentiate and end with apoptosis. In this manner, a relative balance can be maintained in the human body. Nevertheless, when the process of differentiation of normal cells gets out of control, the cells may well evolve into tumor cells, whose nature is to consume the body's nutrients voraciously.

The population of tumour cells progressively increases owing to the following three characteristics. Firstly, the most obvious characteristic is insensitivity to anti-growth signals. A strict control mechanism exists for normal cells, but for tumor cells this mechanism is no longer valid. During the continuous process of division, tumor cells can escape the monitoring of the anti-growth signals, which leads to their unchecked growth. Secondly, tumour cells have the ability to promote the growth of blood vessels, which are essential for providing nutrients; this is why blood vessel density is associated with the malignant degree of the tumor tissue. Finally, tumor cells are also duplicitous, evolving camouflage abilities during their constant battle with immune cells to mislead the immune system into regarding them as normal cells, which results in tumor immune escape. Thus, to suppress the growth of tumour cells, obstructing the generative mechanisms that rely on the necessary nutrients is an effective approach, as shown in [2, 3].

Distinguishing itself from the mixed tumor treatment approach of immunotherapy and chemotherapy in [4], this chapter explores a more effective adaptive control strategy for the organism using a medicine dosage regulation mechanism (MDRM). An additional population of cells, called endothelial cells, feeds on the substances induced by malignant tumour cells and can transfer oxygen and nutrients to the primary focus, causing proliferation of blood vessels, which increases the carrying capacity of tumour cells; this is known as tumour angiogenesis [5]. As indicated in [6], an anti-angiogenic agent can particularly decrease the growth rate of tumours, reaching saturation to some extent without killing the endothelial cells completely. When the chemotherapy agent is used in combination with the anti-angiogenic agent to reduce the population of tumor cells, the latter can increase the effect of the former, as described in [7]. Nevertheless, as the key element promoting the growth of the vasculature, the endothelial cells cannot be completely destroyed; otherwise, there may not exist sufficient vasculature to construct access for the chemotherapy agent. On the basis of the pharmaceutical science concerning the chemotherapy agent and the anti-angiogenic agent, the adaptive control strategy for the organism will provide guidance for clinical practice under the medicine dosage regulation mechanism, especially in the treatment of lung cancer. Furthermore, since anti-tumor drugs often kill both tumor cells and normal cells, it is of significance to use fewer drugs to achieve the therapeutic goal during the treatment process.

ADP, derived from dynamic programming and reinforcement learning, is a powerful tool to tackle optimization issues [8–10]. In general, the successful implementation of ADP-based methods depends on the cooperative work of actor and critic networks [11]. Under this framework, the actor is responsible for performing the control strategy with current data [12], while the goal of the critic is to provide the actor with feedback derived from evaluating the cost under that strategy. The distinct merit of this type of algorithm lies in that the optimal control strategy can be approximately acquired by iterative computation, and the "curse of dimensionality" can be effectively obviated. Different ADP-based methods have been researched to tackle multifarious optimal control problems with the aid of artificial neural networks, whose approximation performance is outstanding [13, 14], such as robust control [15, 16], optimal consensus control [17–19] and optimal tracking issues [20, 21]. Furthermore, for systems with multiple controllers, the optimization issues can be formulated by game theory. As a vital branch of game theory, nonzero-sum games (NZSGs) derive from [22], with the goal of attaining the optimal strategy pair that can minimize the personal performance index of each player while stabilizing the controlled system [23–25]. Owing to their excellent ability to approximate the optimum, ADP methods have been proposed to solve NZSGs. In [26], an adaptive method with a critic-only structure was developed to solve two-player NZSGs without any initial stabilizing control. The experience replay technique was integrated into the ADP algorithm in [27] to concurrently utilize historical data together with real-time data to approximate the value function, such that the persistence of excitation condition was not indispensable. In [28], a data-based integral reinforcement learning algorithm was proposed to solve NZSGs.
More specifically, it was a novel iterative learning algorithm working in both off-line and online manners, which could extend the applicability of the data-based control scheme. Furthermore, in [29], discrete-time *N*-player NZSGs were tackled via an off-policy reinforcement learning method which was independent of the system dynamics.

Although relevant academic achievements have been presented in theories and applications [30–38], there is seldom any literature on this field according to the literature survey of the authors. The contributions can be summarized as follows. First, the near-optimal therapy for the treatment of tumor is acquired for the first time via the ADP approach, an efficient adaptive intelligent learning algorithm. Second, the interactive system with a discounted value function is constructed based on the mathematical model simulating the interaction relationships among cells and drugs. Besides, two kinds of chemotherapy drugs and one kind of anti-angiogenic agent participate in the therapy, such that the combination therapeutic strategy can be derived under the architecture of NZSGs. Third, the idea of cybernetics is extended to the frontier fields of medicine, more precisely, the therapy of tumors. Under the MDRM, the derived therapeutic strategy can achieve the therapeutic goal with the lowest doses of drugs, and the practical indications for medicine are considered for the first time.

Notations: $\mathbb{N}^+$ denotes the set of all positive integers. $\|\cdot\|$, $diag\{\cdot\}$ and $\nabla(\cdot) \triangleq \partial(\cdot)/\partial x$ respectively represent the Euclidean norm of a vector/matrix, the operation of constructing a diagonal matrix, and the gradient operator. $\lambda_m(\cdot)$ and $\lambda_M(\cdot)$ separately denote the minimum and maximum eigenvalues of a matrix. $I_{n \times n}$ is the identity matrix of dimension $n$.

#### **6.2 Preliminaries**

#### *6.2.1 Establishment of Mathematical Model*

In this section, the growth mathematical model is established, which considers the interaction relationships among normal cells, tumor cells and endothelial cells. Moreover, the effects of the control inputs, i.e., the chemotherapy and anti-angiogenic drugs, on these cells are embodied in the model. In the model formed from the ordinary differential equations below, $P_{NC}(t)$, $P_{TC}(t)$ and $P_{EC}(t)$ respectively represent the populations of normal cells, tumor cells and endothelial cells, while $P_{CDj}(t)$ ($j = 1, 2$) and $P_{AD}(t)$ denote the concentrations of the chemotherapy and anti-angiogenic drugs.

The population of normal cells, which is influenced by tumor cells, endothelial cells and the concentrations of chemotherapy and anti-angiogenic drugs, is modeled by

$$\begin{aligned} \dot{P}_{NC}(t) &= \alpha_1 P_{NC}(t) \left(1 - \frac{P_{NC}(t)}{C_1}\right) - A_1 P_{NC}(t) P_{TC}(t) \\ &\quad - \Xi_1(P_{EC}(t), P_{AD}(t)) \frac{P_{NC}(t) P_{CD1}(t)}{B_1 + P_{NC}(t)} \\ &\quad - \Xi_2(P_{EC}(t), P_{AD}(t)) \frac{P_{NC}(t) P_{CD2}(t)}{B_1 + P_{NC}(t)}, \end{aligned} \tag{6.1}$$

where $\Xi_\imath(P_{EC}(t), P_{AD}(t)) = \Xi_{\imath 1} P_{EC}(t) + \Xi_{\imath 2} P_{AD}(t) + \Xi_{\imath 0}$, $\imath = 1, 2$. The parameters $\alpha_1$, $B_1$, $C_1$ denote the proliferation rate, Holling type 2 constant and carrying capacity for normal cells, respectively. $A_1$ is the contention parameter between normal cells and tumor cells.

As the tumor cells contend with normal cells for necessary nutrients, the population of tumor cells is affected by that of normal cells. Besides, there exist mutual effects among tumor cells, endothelial cells and the drugs. Thus the corresponding model can be written as

$$\begin{aligned} \dot{P}_{TC}(t) &= \alpha_2 P_{TC}(t) - \frac{\alpha_2 P_{TC}(t) P_{TC}(t)}{C_2 + \Phi P_{EC}(t)} - \Pi_1(P_{EC}(t), P_{AD}(t)) \frac{P_{TC}(t) P_{CD1}(t)}{B_2 + P_{TC}(t)} \\ &\quad - \Pi_2(P_{EC}(t), P_{AD}(t)) \frac{P_{TC}(t) P_{CD2}(t)}{B_2 + P_{TC}(t)} - A_2 P_{NC}(t) P_{TC}(t), \end{aligned} \tag{6.2}$$

where $\Pi_j(P_{EC}(t), P_{AD}(t)) = \Pi_{j1} P_{EC}(t) + \Pi_{j2} P_{AD}(t) + \Pi_{j0}$, $j = 1, 2$. The parameters $\alpha_2$, $B_2$, $C_2$ are the multiplication rate, Holling type 2 constant and carrying capacity for tumor cells. $A_2$ is the contention parameter between normal cells and tumor cells.

The population of endothelial cells is associated with tumor cells and antiangiogenic drugs. The relations can be given as

$$\dot{P}_{EC}(t) = s_1 + K P_{TC}(t) + \alpha_3 P_{EC}(t) \left(1 - \frac{P_{EC}(t)}{C_3}\right) - \frac{\Xi_3 P_{EC}(t) P_{AD}(t)}{B_3 + P_{EC}(t)}, \tag{6.3}$$

where $K$ is the multiplication rate caused by tumor cells and $s_1$ is the inflow rate. Similarly, the parameters $\alpha_3$, $B_3$, $C_3$ are the multiplication rate, Holling type 2 constant and carrying capacity for endothelial cells. $\Xi_3$ is the killing rate for endothelial cells.

The concentrations of the drugs decrease during the treatment phases, owing to the washout process. Hence we can model the evolution process of the concentrations of chemotherapy and anti-angiogenic drugs by

$$\dot{P}_{CD1}(t) = Dr_{c1} - \left(\beta_{c1} + m_1 \frac{P_{NC}(t)}{B_1 + P_{NC}(t)} + m_2 \frac{P_{TC}(t)}{B_2 + P_{TC}(t)}\right) P_{CD1}(t) \tag{6.4}$$

$$\dot{P}_{CD2}(t) = Dr_{c2} - \left(\beta_{c2} + m_3 \frac{P_{NC}(t)}{B_1 + P_{NC}(t)} + m_4 \frac{P_{TC}(t)}{B_2 + P_{TC}(t)}\right) P_{CD2}(t) \tag{6.5}$$

and

$$\dot{P}_{AD}(t) = Dr_a - \left(\beta_a + \frac{m_5 P_{EC}(t)}{B_3 + P_{EC}(t)}\right) P_{AD}(t), \tag{6.6}$$

where $Dr_{c1}$, $Dr_{c2}$ and $Dr_a$ are the control inputs. $\beta_{c1}$, $\beta_{c2}$ and $\beta_a$ denote the washout rates of the drugs. $m_1$, $m_2$, $m_3$, $m_4$ and $m_5$ are the rates at which the drugs integrate into the cells. Based on operations similar to those in [39], we obtain the simplified version of the model as

$$\left\{ \begin{aligned} \dot{p}_{NC}(t) &= \alpha_1 p_{NC}(t)(1 - p_{NC}(t)) - a_1 p_{NC}(t) p_{TC}(t) - \frac{\xi_1 p_{NC}(t) p_{CD1}(t)}{b_1 + p_{NC}(t)} - \frac{\xi_2 p_{NC}(t) p_{CD2}(t)}{b_1 + p_{NC}(t)}, \\ \dot{p}_{TC}(t) &= \alpha_2 p_{TC}(t) \left(1 - \frac{p_{TC}(t)}{1 + \phi p_{EC}(t)}\right) - a_2 p_{NC}(t) p_{TC}(t) - \frac{\pi_1 p_{TC}(t) p_{CD1}(t)}{b_2 + p_{TC}(t)} - \frac{\pi_2 p_{TC}(t) p_{CD2}(t)}{b_2 + p_{TC}(t)}, \\ \dot{p}_{EC}(t) &= s_1 + k p_{TC}(t) + \alpha_3 p_{EC}(t)(1 - p_{EC}(t)) - \frac{\xi_3 p_{EC}(t) p_{AD}(t)}{b_3 + p_{EC}(t)}, \\ \dot{p}_{CD1}(t) &= u_{c1} - \left(\beta_{c1} + \frac{m_1 p_{NC}(t)}{b_1 + p_{NC}(t)} + \frac{m_2 p_{TC}(t)}{b_2 + p_{TC}(t)}\right) p_{CD1}(t), \\ \dot{p}_{CD2}(t) &= u_{c2} - \left(\beta_{c2} + \frac{m_3 p_{NC}(t)}{b_1 + p_{NC}(t)} + \frac{m_4 p_{TC}(t)}{b_2 + p_{TC}(t)}\right) p_{CD2}(t), \\ \dot{p}_{AD}(t) &= u_a - \left(\beta_a + \frac{m_5 p_{EC}(t)}{b_3 + p_{EC}(t)}\right) p_{AD}(t), \end{aligned} \right. \tag{6.7}$$

where $\xi_\imath(p_{EC}(t), p_{AD}(t)) = \xi_{\imath 1} p_{EC}(t) + \xi_{\imath 2} p_{AD}(t) + \xi_{\imath 0}$ and $\pi_j(p_{EC}(t), p_{AD}(t)) = \pi_{j1} p_{EC}(t) + \pi_{j2} p_{AD}(t) + \pi_{j0}$ with $\imath, j = 1, 2$. The states $p_{NC}(t)$, $p_{TC}(t)$, $p_{EC}(t)$, $p_{CD1}(t)$, $p_{CD2}(t)$ and $p_{AD}(t)$ are nonnegative.

**Remark 6.1** The differential equations (6.7) form the simplified model describing the interaction relationships among cells and drugs. Observing the model, one can discover that there exists competition between normal cells and tumor cells. The tumor cells require more nutrients, so they facilitate the proliferation of endothelial cells, which in turn provide the indispensable nutrients to promote the growth of the tumor. The tumor cells can be effectively damaged by the chemotherapy drugs, which have side effects on normal cells to some extent, and the anti-angiogenic drug contributes to the proliferation inhibition of the endothelial cells.
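A minimal simulation sketch of the simplified model (6.7) is given below. The parameter names are our own placeholders (for instance, `a1` stands for $\alpha_1$ and `c_a1` for the contention parameter $a_1$); the numerical values must be taken from the chapter's parameter table, which is not reproduced here. Integrating this right-hand side with an Euler or Runge-Kutta step yields the cell and drug trajectories:

```python
import numpy as np

def eeds_rhs(t, p, par, u):
    """Right-hand side of the simplified model (6.7).
    p = [p_NC, p_TC, p_EC, p_CD1, p_CD2, p_AD]; u = (u_c1, u_c2, u_a).
    `par` is a dict of the model parameters (placeholder names)."""
    pnc, ptc, pec, pcd1, pcd2, pad = p
    # affine kill-rate terms xi_i and pi_j from (6.7)
    xi1 = par['xi11'] * pec + par['xi12'] * pad + par['xi10']
    xi2 = par['xi21'] * pec + par['xi22'] * pad + par['xi20']
    pi1 = par['pi11'] * pec + par['pi12'] * pad + par['pi10']
    pi2 = par['pi21'] * pec + par['pi22'] * pad + par['pi20']
    dnc = (par['a1'] * pnc * (1 - pnc) - par['c_a1'] * pnc * ptc
           - xi1 * pnc * pcd1 / (par['b1'] + pnc)
           - xi2 * pnc * pcd2 / (par['b1'] + pnc))
    dtc = (par['a2'] * ptc * (1 - ptc / (1 + par['phi'] * pec))
           - par['c_a2'] * pnc * ptc
           - pi1 * ptc * pcd1 / (par['b2'] + ptc)
           - pi2 * ptc * pcd2 / (par['b2'] + ptc))
    dec = (par['s1'] + par['k'] * ptc + par['a3'] * pec * (1 - pec)
           - par['xi3'] * pec * pad / (par['b3'] + pec))
    dcd1 = u[0] - (par['bc1'] + par['m1'] * pnc / (par['b1'] + pnc)
                   + par['m2'] * ptc / (par['b2'] + ptc)) * pcd1
    dcd2 = u[1] - (par['bc2'] + par['m3'] * pnc / (par['b1'] + pnc)
                   + par['m4'] * ptc / (par['b2'] + ptc)) * pcd2
    dad = u[2] - (par['ba'] + par['m5'] * pec / (par['b3'] + pec)) * pad
    return np.array([dnc, dtc, dec, dcd1, dcd2, dad])
```

With all populations at zero, the drug states simply accumulate at the infusion rates, which is a useful sanity check before plugging in clinical parameter values.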

#### *6.2.2 Nonzero-Sum Games Formulation*

Consider the interaction model (6.7) rewritten as

$$\begin{aligned} \dot{x} &= f(x) + \begin{bmatrix} \mathbf{0}_{3 \times 1} & \mathbf{0}_{3 \times 1} \\ 0.1 & 0 \\ 0 & 0 \\ 0 & 0.12 \end{bmatrix} u_1 + \begin{bmatrix} \mathbf{0}_{3 \times 1} & \mathbf{0}_{3 \times 1} \\ 0 & 0 \\ 0.1 & 0 \\ 0 & 0 \end{bmatrix} u_2 \\ &= f(x) + g_1 u_1 + g_2 u_2, \end{aligned} \tag{6.8}$$

where $u_1 = [u_{c1}, u_a]^T$, $u_2 = [u_{c2}, 0]^T$ and $f(x)$ is constructed from the right-hand sides of (6.7) excluding the terms $u_{c1}$, $u_{c2}$ and $u_a$.

Define the value function for player $\iota$ ($\iota = 1, 2$) as

$$V_{\iota}(x(t)) = \int_{t}^{\infty} e^{-\varrho_{\iota}(\varsigma - t)} \delta_{\iota}(x, u_{1}, u_{2}) \, d\varsigma, \tag{6.9}$$

where the utility function $\delta_\iota(x, u_1, u_2) = x^T \Upsilon_\iota x + u_1^T \mathcal{R}_{\iota 1} u_1 + u_2^T \mathcal{R}_{\iota 2} u_2$. The matrices $\mathcal{R}_{\iota j}$ ($\iota, j = 1, 2$) and $\Upsilon_\iota$ are positive definite, and $\varrho_\iota > 0$ is the discount factor. According to the value function (6.9), we can define the corresponding Hamiltonian function as

$$\begin{aligned} H_\iota(x, u_1, u_2) &= (\nabla V_\iota)^T (f + g_1 u_1 + g_2 u_2) \\ &\quad + \delta_\iota(x, u_1, u_2) - \varrho_\iota V_\iota, \quad \iota = 1, 2. \end{aligned} \tag{6.10}$$

The optimal value function is defined as

$$V_\iota^* = \min_{u_\iota} \int_t^\infty e^{-\varrho_\iota(\varsigma - t)} \Big( x^T \Upsilon_\iota x + \sum_{j=1}^{2} u_j^T \mathcal{R}_{\iota j} u_j \Big) d\varsigma. \tag{6.11}$$

The target of the NZSGs is to attain the admissible strategy pair $\{u_1^*, u_2^*\}$ with the definition given in [23, 40]. According to the stationarity condition, the optimal strategy for player $\iota$ can be obtained as

$$u_{\iota}^{*} = -\frac{1}{2} \mathcal{R}_{\iota\iota}^{-1} g_{\iota}^{T} \nabla V_{\iota}^{*}. \tag{6.12}$$
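To make the stationarity condition concrete, suppose for illustration that the value function is approximated by a quadratic form $V_\iota(x) = x^T P_\iota x$; this is only an illustrative choice (the chapter approximates $V_\iota$ with a critic network), but it turns (6.12) into a one-line computation since $\nabla V_\iota = 2 P_\iota x$:

```python
import numpy as np

def optimal_strategy(x, P, g, R):
    """u* = -1/2 R^{-1} g^T (grad V), the stationarity condition (6.12),
    for an illustrative quadratic value V(x) = x^T P x."""
    grad_V = 2.0 * P @ x             # gradient of the quadratic value
    return -0.5 * np.linalg.solve(R, g.T @ grad_V)
```

Here `x` is the state, `g` the input matrix of the corresponding player, and `R` the player's own input-weight matrix $\mathcal{R}_{\iota\iota}$; `np.linalg.solve` is used instead of an explicit inverse for numerical stability.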

Thus the HJEs can be obtained as

$$\begin{aligned} H_{\iota}(x, u_1^*, u_2^*, \nabla V_{\iota}^*) &= (\nabla V_{\iota}^*)^T (f + g_1 u_1^* + g_2 u_2^*) \\ &\quad + x^T \Upsilon_{\iota} x + u_1^{*T} \mathcal{R}_{\iota 1} u_1^* + u_2^{*T} \mathcal{R}_{\iota 2} u_2^* - \varrho_\iota V_{\iota}^* = 0. \end{aligned} \tag{6.13}$$

**Remark 6.2** It's noteworthy that there exists no zero equilibrium for system (6.8), which may well result in the divergence of $V_\iota(x(t))$. To resolve this issue, the discount factor $\varrho_\iota$ is introduced to form a decay term such that $V_\iota(x(t))$ can converge.

In general, solving NZSGs is synonymous with solving the equations (6.13). Nevertheless, for nonlinear systems, it is very intractable to tackle the coupled equations. To resolve this difficulty, an ADP method utilizing the medicine dosage regulation mechanism is proposed in the following sections.

#### **6.3 MDRM-Based Adaptive Critic Learning Method for NZSGs**

Firstly, we introduce the indications for medicine to judge when the medicine dosage should be regulated. Then under the MDRM, the ADP method of single-critic architecture is proposed to approximately seek the optimal strategy for the NZSGs of model (6.7).

#### *6.3.1 MDRM-Based Optimal Strategy Derivation*

For the sake of realizing a conditioned therapy strategy, the MDRM is required to handle the clinical data such that the strategy can be changed timely when necessary. The time sequence $\{\hbar_\ell\}$ is constructed for recording the regulating instants, where $\hbar_\ell$ denotes the $\ell$th regulating instant. Then the recorded state can be denoted as

$$\breve{x}_{\ell}(t) = x(\hbar_{\ell}), \quad t \in [\hbar_{\ell}, \hbar_{\ell+1}). \tag{6.14}$$

For evaluating the difference between the real-time data and the latest recorded data, it is necessary to define the error function $z_\ell = \breve{x}_\ell - x(t)$, $t \in [\hbar_\ell, \hbar_{\ell+1})$. The operation of the MDRM depends on the regulating condition, which compares the error $z_\ell$ with a threshold associated with the real-time data. The strategy is adjusted only when $\|z_\ell\|$ is larger than the threshold. That is, $\breve{u}_\iota = u_\iota(\breve{x}_\ell)$, $\iota = 1, 2$, $\ell \in \mathbb{N}^+$. Thus the MDRM-based strategy can be obtained as

$$\check{u}\_{\iota}^{\*} = -\frac{1}{2} \mathcal{R}\_{\iota\iota}^{-1} g\_{\iota}^{T}(\check{\mathbf{x}}\_{\ell}) \nabla \check{V}\_{\iota}^{\*}, \iota = 1, 2,\tag{6.15}$$

where $\nabla \check{V}\_{\iota}^{\*} = \partial V\_{\iota}^{\*}/\partial x$ evaluated at $t = \hbar\_{\ell}$. The version of the HJEs based on the adjustment mechanism is derived as

$$\begin{split} H\_{\iota}(\mathbf{x}, \check{u}\_{1}^{\*}, \check{u}\_{2}^{\*}, V\_{\iota}^{\*}) =& \frac{1}{4} \sum\_{j=1}^{N=2} (\nabla \check{V}\_{j}^{\*})^{T} g\_{j}(\check{\mathbf{x}}\_{\ell}) \mathcal{R}\_{jj}^{-1} \mathcal{R}\_{\iota j} \mathcal{R}\_{jj}^{-1} g\_{j}^{T}(\check{\mathbf{x}}\_{\ell}) \nabla \check{V}\_{j}^{\*} \\ &+ (\nabla V\_{\iota}^{\*})^{T} \Big( f - \frac{1}{2} \sum\_{j=1}^{N=2} g\_{j} \mathcal{R}\_{jj}^{-1} g\_{j}^{T}(\check{\mathbf{x}}\_{\ell}) \nabla \check{V}\_{j}^{\*} \Big) + \mathbf{x}^{T} \Upsilon\_{\iota} \mathbf{x} - \varrho\_{\iota} V\_{\iota}^{\*}. \end{split} \tag{6.16}$$

Differing from the HJEs (6.13), due to the existence of the error $z\_{\ell}$, (6.16) does not equal zero. Before proceeding with the discussion, the following assumption is required [41].

**Assumption 6.1** The optimal strategy $u\_{\iota}^{\*}$ is locally Lipschitz. That is, for $\iota = 1, 2$, there exists a constant $\theta\_{\iota} > 0$ such that $\|u\_{\iota}^{\*} - \check{u}\_{\iota}^{\*}\|^{2} \leq \theta\_{\iota}\|x - \check{x}\_{\ell}\|^{2}$.

**Theorem 6.3** *Consider the system (6.8), and suppose that Assumption 6.1 holds and $V\_{\iota}^{\*}$ is the solution of (6.13). Then $\check{u}\_{\iota}^{\*}$ formulated as (6.15) can stabilize system (6.8) when the following medicine indication is applied*

$$\left\|z\_{\ell}\right\|^{2} \leq \frac{(1-2\zeta)\lambda\_{m}(\Upsilon)}{\theta\lambda\_{M}(Y)}\left\|x\right\|^{2},\tag{6.17}$$

*where* $\zeta \in (0, 1/2)$ *is an adjustable parameter. The terms* $\theta$*,* $\Upsilon$ *and* $Y$ *are given in (6.21) and (6.22).*

*Proof* Selecting the Lyapunov function $L\_{ya} = V\_{1}^{\*} + V\_{2}^{\*}$, we can obtain the corresponding derivative as

$$\dot{L}\_{ya} = \sum\_{\iota=1}^{N=2} (\nabla V\_{\iota}^\*)^T (f + g\_1 \check{u}\_1^\* + g\_2 \check{u}\_2^\*). \tag{6.18}$$

According to (6.13), we have

$$\begin{aligned} \left(\nabla V\_{\iota}^{\*}\right)^{T} f &= -\left(\nabla V\_{\iota}^{\*}\right)^{T} \left(g\_{1}u\_{1}^{\*} + g\_{2}u\_{2}^{\*}\right) - \mathbf{x}^{T}\Upsilon\_{\iota}\mathbf{x} \\ &- u\_{1}^{\*T}\mathcal{R}\_{\iota1}u\_{1}^{\*} - u\_{2}^{\*T}\mathcal{R}\_{\iota2}u\_{2}^{\*} + \varrho\_{\iota}V\_{\iota}^{\*}, \end{aligned} \tag{6.19}$$

and

$$(\nabla V\_{\iota}^{\*})^{T} \sum\_{j=1}^{N=2} g\_{j} (u\_{j}^{\*} - \check{u}\_{j}^{\*}) = -2u\_{\iota}^{\*T} \mathcal{R}\_{\iota\iota} g\_{\iota}^{-1} \sum\_{j=1}^{N=2} g\_{j} (u\_{j}^{\*} - \check{u}\_{j}^{\*}). \tag{6.20}$$

Let $u^{\*} = [u\_{1}^{\*T}, u\_{2}^{\*T}]^{T}$ and $\check{u}^{\*} = [(\check{u}\_{1}^{\*} - u\_{1}^{\*})^{T}, (\check{u}\_{2}^{\*} - u\_{2}^{\*})^{T}]^{T}$. Then we can derive that

$$\begin{split} \dot{L}\_{ya} &= -\mathbf{x}^{T}\Upsilon\_{1}\mathbf{x} - \mathbf{x}^{T}\Upsilon\_{2}\mathbf{x} - \sum\_{\iota=1}^{N=2} \sum\_{j=1}^{N=2} u\_{j}^{\*T} \mathcal{R}\_{\iota j} u\_{j}^{\*} \\ &\quad + 2 \sum\_{\iota=1}^{N=2} u\_{\iota}^{\*T} \mathcal{R}\_{\iota\iota} g\_{\iota}^{-1} \sum\_{j=1}^{N=2} g\_{j} (u\_{j}^{\*} - \check{u}\_{j}^{\*}) + \varrho\_{1} V\_{1}^{\*} + \varrho\_{2} V\_{2}^{\*} \\ &= -\mathbf{x}^{T}\Upsilon\mathbf{x} - u^{\*T}\mathcal{R}u^{\*} - 2u^{\*T} Z \check{u}^{\*} + \varrho\_{1} V\_{1}^{\*} + \varrho\_{2} V\_{2}^{\*}, \end{split} \tag{6.21}$$

where $\Upsilon = \Upsilon\_{1} + \Upsilon\_{2}$, $\mathcal{R} = \mathrm{diag}\{\mathcal{R}\_{11} + \mathcal{R}\_{21}, \mathcal{R}\_{12} + \mathcal{R}\_{22}\}$, and $Z = [Z\_{1}, Z\_{2}]$ with $Z\_{\iota} = [\mathcal{R}\_{11}g\_{1}^{-1}g\_{\iota}, \mathcal{R}\_{22}g\_{2}^{-1}g\_{\iota}]^{T}$, $\iota = 1, 2$. Applying Young's inequality, we have

$$\begin{split} \dot{L}\_{ya} &\leq -\mathbf{x}^{T}\Upsilon \mathbf{x} - \mathbf{u}^{\*T}\mathcal{R}\mathbf{u}^{\*} + \mathbf{u}^{\*T}\mathcal{R}\mathbf{u}^{\*} \\ &+ \breve{\boldsymbol{u}}^{\*T}\mathbf{Z}^{T}\mathcal{R}^{-1}\mathbf{Z}\breve{\boldsymbol{u}}^{\*} + \varrho\_{V} \\ &= -\mathbf{x}^{T}\Upsilon \mathbf{x} + \breve{\boldsymbol{u}}^{\*T}Y\breve{\boldsymbol{u}}^{\*} + \varrho\_{V}, \end{split} \tag{6.22}$$

where $Y = Z^{T}\mathcal{R}^{-1}Z$. Noting that $u\_{\iota}^{\*}$ is an admissible strategy, we can derive that $V\_{\iota}^{\*}$ is bounded. Hence $\varrho\_{V}$ denotes the bound of the term $\varrho\_{1}V\_{1}^{\*} + \varrho\_{2}V\_{2}^{\*}$. According to the definitions of $\Upsilon$ and $Y$, we have $\lambda\_{m}(\Upsilon) > 0$ and $\lambda\_{M}(Y) > 0$. Furthermore, we can obtain

$$\begin{split} \dot{L}\_{ya} \leq & -2\zeta \lambda\_m(\Upsilon) \|\mathbf{x}\|^2 - (1 - 2\zeta)\lambda\_m(\Upsilon) \|\mathbf{x}\|^2 \\ &+ \lambda\_M(Y)\theta \|z\_\ell\|^2 + \varrho\_V, \end{split} \tag{6.23}$$

where $\theta = \theta\_{1} + \theta\_{2}$. When the indication (6.17) is satisfied, we derive that $\dot{L}\_{ya} \leq -2\zeta\lambda\_{m}(\Upsilon)\|x\|^{2} + \varrho\_{V}$. Then we can find that $\dot{L}\_{ya} < 0$ holds when $\|x\| > \sqrt{\varrho\_{V}/(2\zeta\lambda\_{m}(\Upsilon))}$. In light of the Lyapunov theorem, the strategy (6.15) can stabilize system (6.8). This completes the proof.
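At run time the medicine indication (6.17) is just a scalar comparison between the recorded-data error and a state-dependent threshold. The following Python sketch illustrates this check; the function name and the way the constants are passed in are our own illustrative choices, not part of the chapter:

```python
import numpy as np

def regulation_needed(x, x_check, zeta, theta, lam_m_upsilon, lam_M_Y):
    """Return True when the error z violates the indication (6.17),
    i.e. the dosage strategy should be regulated.

    x             : current state x(t)
    x_check       : latest recorded state x(h_l)
    zeta          : adjustable parameter in (0, 1/2)
    theta         : theta_1 + theta_2 from Assumption 6.1
    lam_m_upsilon : smallest eigenvalue of Upsilon = Upsilon_1 + Upsilon_2
    lam_M_Y       : largest eigenvalue of Y = Z^T R^{-1} Z
    """
    z = x_check - x                            # gap between recorded and real-time data
    threshold = (1.0 - 2.0 * zeta) * lam_m_upsilon / (theta * lam_M_Y) \
                * np.dot(x, x)                 # right-hand side of (6.17)
    return np.dot(z, z) > threshold            # (6.17) violated -> regulate
```

Consistent with the role of $\zeta$ in the proof, a larger $\zeta$ shrinks the threshold and thus raises the regulation frequency.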

#### *6.3.2 Implementation of Adaptive Critic Learning Method*

In this section, the approximate optimal strategy under the MDRM is derived via an ADP method with a single-critic architecture. In light of the universal approximation property of neural networks (NNs), $V\_{\iota}^{\*}$ can be expressed as

$$V\_{\iota}^{\*} = \omega\_{\iota}^{\*T} \nu\_{\iota}(\mathbf{x}) + \sigma\_{\iota}, \iota = 1, 2,\tag{6.24}$$

where $\omega\_{\iota}^{\*}$ is the ideal weight vector, $\nu\_{\iota}$ the activation function and $\sigma\_{\iota}$ the approximation error. To acquire the approximate version of the unknown vector $\omega\_{\iota}^{\*}$, the critic NN is constructed as

$$
\hat{V}\_{\iota} = \hat{\omega}\_{\iota}^{T} \nu\_{\iota}(\mathbf{x}), \iota = 1, 2,\tag{6.25}
$$

with $\hat{\omega}\_{\iota}$ being the estimated weight vector. With the aid of the critic NN, we can present the optimal strategy as

$$
\mu\_{\iota}^{\*} = -\frac{1}{2} \mathcal{R}\_{\iota \iota}^{-1} g\_{\iota}^{T} \left( \left( \nabla \nu\_{\iota} \right)^{T} \omega\_{\iota}^{\*} + \nabla \sigma\_{\iota} \right), \iota = 1, 2. \tag{6.26}
$$

Accordingly, we can obtain the optimal and approximate optimal strategies under MDRM as

$$\check{u}\_{\iota}^{\*} = -\frac{1}{2} \mathcal{R}\_{\iota\iota}^{-1} g\_{\iota}^{T}(\check{\mathbf{x}}\_{\ell}) \Big( \left( \nabla \nu\_{\iota}(\check{\mathbf{x}}\_{\ell}) \right)^{T} \boldsymbol{\omega}\_{\iota}^{\*} + \nabla \sigma\_{\iota}(\check{\mathbf{x}}\_{\ell}) \Big), \tag{6.27}$$

and

$$\check{u}\_{\iota} = -\frac{1}{2} \mathcal{R}\_{\iota\iota}^{-1} g\_{\iota}^{T}(\check{\mathbf{x}}\_{\ell}) (\nabla \nu\_{\iota}(\check{\mathbf{x}}\_{\ell}))^{T} \hat{\omega}\_{\iota}. \tag{6.28}$$

Then the approximate Hamiltonian can be presented as

$$H\_{\iota}(\mathbf{x}, \check{\boldsymbol{u}}\_{1}, \check{\boldsymbol{u}}\_{2}, \hat{\boldsymbol{\omega}}\_{\iota}) = \hat{\boldsymbol{\omega}}\_{\iota}^{T} \psi\_{\iota} + \delta\_{\iota}(\mathbf{x}, \check{\boldsymbol{u}}\_{1}, \check{\boldsymbol{u}}\_{2}) \stackrel{\Delta}{=} \epsilon\_{\iota},\tag{6.29}$$

where $\psi\_{\iota} = \nabla\nu\_{\iota}\big(f + g\_{1}\check{u}\_{1}(\check{x}\_{\ell}) + g\_{2}\check{u}\_{2}(\check{x}\_{\ell})\big) - \varrho\_{\iota}\nu\_{\iota}$.

In order to minimize $\epsilon\_{\iota}$ in (6.29), we set the minimization target as $E = E\_{1} + E\_{2} = \frac{1}{2}\epsilon\_{1}^{2} + \frac{1}{2}\epsilon\_{2}^{2}$. By applying the gradient descent approach, we obtain

$$\dot{\hat{\omega}}\_{\iota} = -\gamma\_{\iota} \frac{1}{(\psi\_{\iota}^{T}\psi\_{\iota} + 1)^{2}} \frac{\partial E}{\partial \hat{\omega}\_{\iota}} = -\gamma\_{\iota} \frac{\psi\_{\iota}}{(\psi\_{\iota}^{T}\psi\_{\iota} + 1)^{2}} \epsilon\_{\iota} = -\gamma\_{\iota} \check{\psi}\_{\iota} \epsilon\_{\iota},\tag{6.30}$$

where $\gamma\_{\iota}$ is the adjustable learning rate and $\check{\psi}\_{\iota} = \psi\_{\iota}/(\psi\_{\iota}^{T}\psi\_{\iota} + 1)^{2}$. Define $\tilde{\omega}\_{\iota} = \omega\_{\iota}^{\*} - \hat{\omega}\_{\iota}$. From (6.30), we derive that

$$
\dot{\tilde{\omega}}\_{\iota} = -\gamma\_{\iota}\bar{\psi}\_{\iota}\bar{\psi}\_{\iota}^{T}\tilde{\omega}\_{\iota} + \gamma\_{\iota}\check{\psi}\_{\iota}e\_{\iota},\tag{6.31}
$$

where $\bar{\psi}\_{\iota} = \psi\_{\iota}/(\psi\_{\iota}^{T}\psi\_{\iota} + 1)$ and the approximation residual error $e\_{\iota} = -\nabla\sigma\_{\iota}^{T}(f + g\_{1}\check{u}\_{1} + g\_{2}\check{u}\_{2}) + \varrho\_{\iota}\sigma\_{\iota}$. For proceeding further, the following assumptions are required [11, 26, 27].
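For concreteness, the normalized gradient-descent tuning law (6.30) can be sketched as a one-step Euler update. The variable names below are ours, and `psi` and `delta` are assumed to be supplied from (6.29):

```python
import numpy as np

def critic_update(omega_hat, psi, delta, gamma, dt):
    """One Euler step of the critic tuning law (6.30).

    omega_hat : current critic weight estimate, shape (N_c,)
    psi       : regressor psi_iota from (6.29), shape (N_c,)
    delta     : known utility part delta_iota(x, u1, u2) of the Hamiltonian
    gamma     : adjustable learning rate gamma_iota
    dt        : integration step
    """
    epsilon = omega_hat @ psi + delta              # approximate Hamiltonian (6.29)
    norm = (psi @ psi + 1.0) ** 2                  # normalization (psi^T psi + 1)^2
    omega_hat_dot = -gamma * psi / norm * epsilon  # gradient of E = epsilon^2 / 2
    return omega_hat + dt * omega_hat_dot
```

When $\bar{\psi}\_{\iota}$ is persistently excited (Assumption 6.2), repeated application drives $\epsilon\_{\iota}$ toward the residual level set by $e\_{\iota}$.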

**Assumption 6.2** For any $\iota \in \{1, 2\}$, the signal $\bar{\psi}\_{\iota}$ is persistently excited on the time interval $[t, t + T]$. That is, there exists a positive constant $b\_{\psi\iota}$ such that

$$b\_{\psi\_l} I\_{N\_{cl} \times N\_{cl}} \le \int\_{t}^{t+T} \bar{\psi}\_l \bar{\psi}\_l^T d\varsigma,\tag{6.32}$$

with $N\_{c\iota}$ being the neuron number of the $\iota$th critic network.

**Assumption 6.3** For $\iota \in \{1, 2\}$, there exist positive constants such that $\|\omega\_{\iota}^{\*}\| \leq b\_{\omega\iota}$, $\|\nabla\nu\_{\iota}\| \leq b\_{\nu\iota}$, $\|\nabla\sigma\_{\iota}\| \leq b\_{\sigma\iota}$ and $\|e\_{\iota}\| \leq b\_{e\iota}$.

#### **6.4 Stability Analysis**

In this section, the asymptotic stability of the controlled system is analyzed by applying Lyapunov theory. Before presenting the main results, the boundedness of critic weight is discussed in the following lemma.

**Lemma 6.4** *For any $\iota \in \{1, 2\}$, suppose that Assumptions 6.2–6.3 hold and the initial weight is finite. If the critic tuning law (6.30) is applied, then $\tilde{\omega}\_{\iota}$ is locally ultimately bounded.*

*Proof* Consider the Lyapunov function $L\_{y\omega}$. It is noted that $\tilde{\omega}\_{\iota}$ evolves according to flow dynamics, which indicates that there are no jumps in the values of $\tilde{\omega}\_{\iota}$; more specifically, $\tilde{\omega}\_{\iota}$ is continuous at the regulating instants. Thus we only need to consider the time interval between two adjoining regulating instants.

According to Assumptions 6.2–6.3, it can be derived that

$$\begin{split} \dot{L}\_{y\omega} &= 2\tilde{\omega}\_{1}^{T} \dot{\tilde{\omega}}\_{1} + 2\tilde{\omega}\_{2}^{T} \dot{\tilde{\omega}}\_{2} \\ &= 2\gamma\_{1} (-\tilde{\omega}\_{1}^{T} \bar{\psi}\_{1} \bar{\psi}\_{1}^{T} \tilde{\omega}\_{1} + \tilde{\omega}\_{1}^{T} \check{\psi}\_{1} e\_{1}) + 2\gamma\_{2} (-\tilde{\omega}\_{2}^{T} \bar{\psi}\_{2} \bar{\psi}\_{2}^{T} \tilde{\omega}\_{2} + \tilde{\omega}\_{2}^{T} \check{\psi}\_{2} e\_{2}). \end{split} \tag{6.33}$$

By applying Young's inequality, we can get

$$\begin{split} \dot{L}\_{y\omega} &\leq -\gamma\_{1}(\tilde{\omega}\_{1}^{T} \bar{\psi}\_{1} \bar{\psi}\_{1}^{T} \tilde{\omega}\_{1} - e\_{1}^{T} e\_{1}) - \gamma\_{2}(\tilde{\omega}\_{2}^{T} \bar{\psi}\_{2} \bar{\psi}\_{2}^{T} \tilde{\omega}\_{2} - e\_{2}^{T} e\_{2}) \\ &\leq -\gamma\_{1} b\_{\psi 1} \left\lVert \tilde{\omega}\_{1} \right\rVert^{2} - \gamma\_{2} b\_{\psi 2} \left\lVert \tilde{\omega}\_{2} \right\rVert^{2} + \Gamma\_{1}, \end{split} \tag{6.34}$$

where $\Gamma\_{1} = \gamma\_{1} b\_{e1}^{2} + \gamma\_{2} b\_{e2}^{2}$. Furthermore, when $\|\tilde{\omega}\_{1}\| > \sqrt{\Gamma\_{1}/(\gamma\_{1} b\_{\psi 1})} \stackrel{\Delta}{=} b\_{\tilde{\omega} 1}$ or $\|\tilde{\omega}\_{2}\| > \sqrt{\Gamma\_{1}/(\gamma\_{2} b\_{\psi 2})} \stackrel{\Delta}{=} b\_{\tilde{\omega} 2}$, it yields that $\dot{L}\_{y\omega} < 0$. The lemma is proved.

**Theorem 6.4** *Consider the system (6.8) with the strategy formulated as (6.28). Suppose that Assumptions 6.1–6.3 hold. The tuning law for the critic network is given by (6.30). Then the state $x$ and the weight estimation error $\tilde{\omega}\_{\iota}$ are UUB provided that the following indication is applied*

$$\left\|z\_{\ell}\right\|^{2} \leq \frac{(1-\varpi\_{1}^{2})\lambda\_{m}(\Upsilon)}{(1+\varpi\_{2})\theta\lambda\_{M}(Y)}\left\|x\right\|^{2} \stackrel{\Delta}{=} \left\|z\_{e}\right\|^{2},\tag{6.35}$$

*with* $\varpi\_{1}$ *and* $\varpi\_{2}$ *being the adjustable parameters.*

*Proof* Select the Lyapunov function candidate as

$$\begin{split} L\_{Y} &= \sum\_{\iota=1}^{N=2} V\_{\iota}^{\*}(\check{\mathbf{x}}\_{\ell}) + \sum\_{\iota=1}^{N=2} V\_{\iota}^{\*}(\mathbf{x}) + \frac{1}{2} \sum\_{\iota=1}^{N=2} \tilde{\omega}\_{\iota}^{T} \tilde{\omega}\_{\iota} \\ &= L\_{Ya} + L\_{Yb} + L\_{Yc} . \end{split} \tag{6.36}$$

Due to the utilization of MDRM, we present the proof process in two cases.

*Case I*: No regulation occurs, i.e., $t \in [\hbar\_{\ell}, \hbar\_{\ell+1})$. Then we obtain $\dot{L}\_{Ya} = 0$. The derivative of $L\_{Yb}$ can be obtained as

$$\dot{L}\_{Yb} = \sum\_{\iota=1}^{N=2} (\nabla V\_{\iota}^\*)^T (f + g\_1 \check{u}\_1 + g\_2 \check{u}\_2). \tag{6.37}$$

Let $\check{u} = [(\check{u}\_{1} - u\_{1}^{\*})^{T}, (\check{u}\_{2} - u\_{2}^{\*})^{T}]^{T}$. Applying operations similar to those in Theorem 6.3, we have

$$\begin{split} \dot{L}\_{Yb} &\leq -\mathbf{x}^{T}\Upsilon \mathbf{x} + \check{u}^{T} Y \check{u} + \varrho\_{V} \\ &\leq -\mathbf{x}^{T}\Upsilon \mathbf{x} + \varrho\_{V} + \lambda\_{M}(Y)\|u\_{1}^{\*} - \check{u}\_{1}^{\*} + \check{u}\_{1}^{\*} - \check{u}\_{1}\|^{2} + \lambda\_{M}(Y)\|u\_{2}^{\*} - \check{u}\_{2}^{\*} + \check{u}\_{2}^{\*} - \check{u}\_{2}\|^{2} \\ &\leq -\mathbf{x}^{T}\Upsilon \mathbf{x} + \varrho\_{V} + \lambda\_{M}(Y)(1 + 1/\varpi\_{2})\|\check{u}\_{1}^{\*} - \check{u}\_{1}\|^{2} + \lambda\_{M}(Y)(1 + \varpi\_{2})\|u\_{1}^{\*} - \check{u}\_{1}^{\*}\|^{2} \\ &\quad + \lambda\_{M}(Y)(1 + 1/\varpi\_{2})\|\check{u}\_{2}^{\*} - \check{u}\_{2}\|^{2} + \lambda\_{M}(Y)(1 + \varpi\_{2})\|u\_{2}^{\*} - \check{u}\_{2}^{\*}\|^{2}. \end{split} \tag{6.38}$$

Recall that $\theta = \theta\_{1} + \theta\_{2}$, and substitute (6.27) and (6.28) into (6.38). Then we can derive

$$\begin{split} \dot{L}\_{Yb} \leq & -\mathbf{x}^T \Upsilon \mathbf{x} + (1 + \varpi\_2) \theta \lambda\_M(Y) \|\mathbf{x} - \check{\mathbf{x}}\_\ell\|^2 \\ &+ \varrho\_V + \Gamma\_2, \end{split} \tag{6.39}$$

where $\Gamma\_{2} = \frac{1}{4}\lambda\_{M}(Y)(1 + 1/\varpi\_{2})\big(\|\mathcal{R}\_{11}^{-1}\|^{2} b\_{g1}^{2} b\_{\nu 1}^{2} b\_{\tilde{\omega} 1}^{2} + \|\mathcal{R}\_{22}^{-1}\|^{2} b\_{g2}^{2} b\_{\nu 2}^{2} b\_{\tilde{\omega} 2}^{2}\big) + \frac{1}{4}\lambda\_{M}(Y)(1 + \varpi\_{2})\big(\|\mathcal{R}\_{11}^{-1}\|^{2} b\_{g1}^{2} b\_{\sigma 1}^{2} + \|\mathcal{R}\_{22}^{-1}\|^{2} b\_{g2}^{2} b\_{\sigma 2}^{2}\big)$, with $b\_{g1}$ and $b\_{g2}$ denoting the bounds of the known $g\_{1}$ and $g\_{2}$.

According to Assumption 6.2 and Assumption 6.3, we derive that

$$\dot{L}\_{Yc} \leq -\gamma\_{1} b\_{\psi 1} \|\tilde{\omega}\_{1}\|^{2} - \gamma\_{2} b\_{\psi 2} \|\tilde{\omega}\_{2}\|^{2} + \Gamma\_{1}. \tag{6.40}$$

Based on (6.39) and (6.40), we can obtain

$$\begin{split} \dot{L}\_{Y} \leq & -(1 - \varpi\_{1}^{2})\lambda\_{m}(\Upsilon) \|\mathbf{x}\|^{2} - \varpi\_{1}^{2}\lambda\_{m}(\Upsilon) \|\mathbf{x}\|^{2} \\ &+ (1 + \varpi\_{2})\lambda\_{M}(Y)\theta \|\mathbf{x} - \check{\mathbf{x}}\_{\ell}\|^{2} - \gamma\_{1} b\_{\psi 1}\|\tilde{\omega}\_{1}\|^{2} - \gamma\_{2} b\_{\psi 2}\|\tilde{\omega}\_{2}\|^{2} + \mathfrak{E}, \end{split} \tag{6.41}$$

where $\mathfrak{E} = \Gamma\_{1} + \Gamma\_{2} + \varrho\_{V}$. Applying the indication (6.35), we conclude that $\dot{L}\_{Y} < 0$ when one of the following conditions holds:

$$\|\mathbf{x}\| > \frac{1}{\varpi\_1} \sqrt{\frac{\mathfrak{E}}{\lambda\_m(\Upsilon)}} \stackrel{\scriptstyle \Delta}{=} \beta\_\mathbf{x},\tag{6.42}$$

$$\|\tilde{\omega}\_{\rm t}\| > \sqrt{\frac{\mathfrak{E}}{\gamma\_{\rm t} b\_{\psi\rm t}}} \stackrel{\Delta}{=} \beta\_{\tilde{\omega}\rm t}, \iota = 1, 2. \tag{6.43}$$

Thus $x$ and $\tilde{\omega}\_{\iota}$ are guaranteed to be UUB.

*Case II*: A regulation occurs, that is, $t = \hbar\_{\ell+1}$. The difference of $L\_{Y}$ can be given by

$$
\Delta L\_Y = \Delta L\_{Ya} + \Delta L\_{Yb} + \Delta L\_{Yc}, \tag{6.44}
$$

where the terms are defined by $\Delta L\_{Ya} = V\_{1}^{\*}(\check{x}\_{\ell+1}) - V\_{1}^{\*}(\check{x}\_{\ell}) + V\_{2}^{\*}(\check{x}\_{\ell+1}) - V\_{2}^{\*}(\check{x}\_{\ell})$, $\Delta L\_{Yb} = V\_{1}^{\*}(x(\hbar\_{\ell+1})) - V\_{1}^{\*}(x(\hbar\_{\ell+1}^{-})) + V\_{2}^{\*}(x(\hbar\_{\ell+1})) - V\_{2}^{\*}(x(\hbar\_{\ell+1}^{-}))$, and $\Delta L\_{Yc} = \frac{1}{2}\tilde{\omega}\_{1}^{T}(\hbar\_{\ell+1})\tilde{\omega}\_{1}(\hbar\_{\ell+1}) - \frac{1}{2}\tilde{\omega}\_{1}^{T}(\hbar\_{\ell+1}^{-})\tilde{\omega}\_{1}(\hbar\_{\ell+1}^{-}) + \frac{1}{2}\tilde{\omega}\_{2}^{T}(\hbar\_{\ell+1})\tilde{\omega}\_{2}(\hbar\_{\ell+1}) - \frac{1}{2}\tilde{\omega}\_{2}^{T}(\hbar\_{\ell+1}^{-})\tilde{\omega}\_{2}(\hbar\_{\ell+1}^{-})$. Recalling the analysis in *Case I*, we obtain that $\dot{L}\_{Y} < 0$ when $x$ or $\tilde{\omega}\_{\iota}$ is out of the corresponding bound. Furthermore, we can derive that $L\_{Yb} + L\_{Yc}$ is monotonically decreasing when $t \in [\hbar\_{\ell}, \hbar\_{\ell+1})$. In light of the properties of limits, we have

$$\begin{split} 0 \le & V\_{\iota}^{\*} \left( \mathbf{x} (\boldsymbol{\hbar}\_{\ell+1}^{-}) \right) + \frac{1}{2} \tilde{\boldsymbol{\omega}}\_{\iota}^{T} (\boldsymbol{\hbar}\_{\ell+1}^{-}) \tilde{\boldsymbol{\omega}}\_{\iota} (\boldsymbol{\hbar}\_{\ell+1}^{-}) \\ & - V\_{\iota}^{\*} (\mathbf{x} (\boldsymbol{\hbar}\_{\ell+1})) - \frac{1}{2} \tilde{\boldsymbol{\omega}}\_{\iota}^{T} (\boldsymbol{\hbar}\_{\ell+1}) \tilde{\boldsymbol{\omega}}\_{\iota} (\boldsymbol{\hbar}\_{\ell+1}). \end{split} \tag{6.45}$$

As *x* is proved to be UUB, we can obtain

$$V\_{\iota}^{\*}(\check{\mathbf{x}}\_{\ell+1}) \leq V\_{\iota}^{\*}(\check{\mathbf{x}}\_{\ell}).\tag{6.46}$$

According to (6.45) and (6.46), we can derive $\Delta L\_{Y} < 0$, which indicates that the selected Lyapunov function (6.36) is monotonically decreasing when $t = \hbar\_{\ell+1}$. This completes the proof.

**Remark 6.5** $\varpi\_{1}$ and $\varpi\_{2}$ in (6.35) are the adjustable parameters that determine the frequency of medicine dosage regulation. A larger $\varpi\_{1}$ or $\varpi\_{2}$ leads to a higher regulation frequency, and a smaller parameter implies a lower adjustment frequency. Thus we can determine these parameters according to the clinical data.

**Remark 6.6** In this chapter, the approximate optimal combination therapeutic strategy is derived via the ADP method to inhibit the proliferation of tumor cells under the medicine dosage regulation mechanism. The MDRM is constructed on the foundation of the above-mentioned medicine indication (6.35). The data at the dosage-regulating instants are recorded and utilized as reference data in the future. When the difference between the current clinical data and the latest reference data is larger than the threshold, the medicine dosage is regulated. Therefore, this mechanism guarantees that the derived therapeutic strategy is regulated timely and only when necessary.
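The recording-and-comparison mechanism described above amounts to an event-triggered simulation loop. The sketch below is illustrative only: `f`, `strategy` and `threshold_sq` are placeholders standing in for the organism dynamics (6.8), the critic-based strategy (6.28) and the right-hand side of the indication (6.35), respectively:

```python
import numpy as np

def mdrm_simulate(f, strategy, threshold_sq, x0, dt, steps):
    """Event-triggered simulation under the MDRM.

    f            : state dynamics, f(x, u) -> dx/dt
    strategy     : dosage strategy evaluated at recorded data, u(x_check)
    threshold_sq : state-dependent threshold on ||z||^2, threshold_sq(x)
    x0           : initial state
    """
    x = np.asarray(x0, dtype=float)
    x_check = x.copy()             # reference data recorded at the last instant
    u = strategy(x_check)
    instants = [0]                 # regulating instants
    for k in range(1, steps + 1):
        z = x_check - x            # gap between reference and real-time data
        if z @ z > threshold_sq(x):
            x_check = x.copy()     # record new reference data
            u = strategy(x_check)  # regulate the medicine dosage
            instants.append(k)
        x = x + dt * f(x, u)       # evolve the model one Euler step
    return x, instants
```

Between regulating instants the dosage is held constant, which is exactly why the strategy only needs clinical data sampled at $\hbar\_{\ell}$.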


**Table 6.1** Parameter specifications of the cells


**Table 6.2** Parameter specifications of the drugs

#### **6.5 Simulation Study**

In this section, the mathematical model (6.7) is considered, which presents the relations between cells and drugs. For simplicity, we have constructed the rephrased system (6.8), whose control issue can be deemed an NZSG.

In light of the clinical medical statistics and literature [38], the parameters on cells and drugs for model (6.7) are given in Table 6.1 and Table 6.2, respectively. For the discounted value function (6.9) of system (6.8), the corresponding parameters are set as $\mathcal{R}\_{11} = 0.8I\_{2\times 2}$, $\mathcal{R}\_{12} = 15I\_{2\times 2}$, $\mathcal{R}\_{21} = 5I\_{2\times 2}$, $\mathcal{R}\_{22} = I\_{2\times 2}$, $\Upsilon\_{1} = 0.02I\_{6\times 6}$ and $\Upsilon\_{2} = 0.06I\_{6\times 6}$. In addition, the discount factors are $\varrho\_{1} = \varrho\_{2} = 0.2$.

For the critic NNs, the activation functions are both set as $[x\_{1}^{2}, x\_{1}x\_{2}, x\_{1}x\_{3}, x\_{1}x\_{4}, x\_{1}x\_{5}, x\_{1}x\_{6}, x\_{2}^{2}, x\_{2}x\_{3}, x\_{2}x\_{4}, x\_{2}x\_{5}, x\_{2}x\_{6}, x\_{3}^{2}, x\_{3}x\_{4}, x\_{3}x\_{5}, x\_{3}x\_{6}, x\_{4}^{2}, x\_{4}x\_{5}, x\_{4}x\_{6}, x\_{5}^{2}, x\_{5}x\_{6}, x\_{6}^{2}]^{T}$, and the learning rates are set as $\gamma\_{1} = 1.5$ and $\gamma\_{2} = 2$. Besides, the parameters are $\theta = 8$, $\varpi\_{1} = 0.8$ and $\varpi\_{2} = 8$.
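The activation vector above collects the 21 distinct quadratic monomials of the six state components. A small helper (our own construction, mirroring the listed ordering) generates $\nu(x)$ and its Jacobian $\nabla\nu(x)$, which (6.28) and (6.30) require:

```python
import numpy as np

def quadratic_basis(x):
    """Return nu(x) = [x1^2, x1*x2, ..., x6^2]^T (21 terms for n = 6)
    and its Jacobian d(nu)/dx, as used by the critic NNs."""
    n = len(x)
    nu, grad = [], []
    for i in range(n):
        for j in range(i, n):
            nu.append(x[i] * x[j])
            row = np.zeros(n)
            row[i] += x[j]         # d(x_i x_j)/d x_i
            row[j] += x[i]         # d(x_i x_j)/d x_j (adds 2*x_i when i == j)
            grad.append(row)
    return np.array(nu), np.array(grad)
```

For $n = 6$ this yields 21 basis functions, so each critic weight vector has $N\_{c\iota} = 21$ entries.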

The evolution curves of the model (6.7) are depicted in Fig. 6.1. From Fig. 6.1 we can observe that when *t* = 200*d*, the population of tumor cells reduces to zero, and when *t* = 600*d*, the population of normal cells almost returns to 1 and that of

**Fig. 6.1** The evolutions of model states

**Fig. 6.2** The therapy strategy curves of chemotherapy drug 1

endothelial cells drops to a small steady value. This indicates that the proliferation of tumor cells can be suppressed after 600 days under the optimal therapy strategy. In Figs. 6.1, 6.2, 6.3, 6.4, 6.5, 6.6 and 6.7, we compare the medicine dosages of the derived therapy strategy with those of the initial therapy strategy. The medicine dosages of our near-optimal therapy strategy are significantly less than those of the initial strategy. This is of great practical significance, since superfluous drugs may well affect the health of patients and impose additional financial burdens

**Fig. 6.3** The therapy strategy curves of chemotherapy drug 2

**Fig. 6.4** The therapy strategy curves of anti-angiogenic drug

on patients. Besides, one can find that as the clinical data improve, the regulation frequency of the derived therapy strategy becomes lower. This implies that the therapy strategy based on the medicine dosage regulation mechanism can be regulated, in accordance with the indications for medicine, timely and only when necessary. Figures 6.5, 6.6 and 6.7 present the curves of the cells under different therapy strategies, that is,

**Fig. 6.5** The population of normal cells under different therapies

**Fig. 6.6** The population of tumor cells under different therapies

**Fig. 6.7** The population of endothelial cells under different therapies

chemotherapy drug 1, chemotherapy drug 2, the anti-angiogenic drug and the therapy comprising these three drugs. We can conclude from Figs. 6.5, 6.6 and 6.7 that the therapeutic effect of the derived combination therapy is the best. Thus the simulation results validate the effectiveness of our therapy strategy.

#### **6.6 Conclusion**

In this chapter, an ADP-based method using the medicine dosage regulation mechanism has been proposed to obtain the optimal combination therapy for curing cancer. A mathematical model is employed to describe the interactions among the normal cells, tumor cells, endothelial cells, chemotherapy drugs and anti-angiogenic drug. The mathematical model provides the foundation for solving the optimization issue under the architecture of NZSGs. The ADP method with a single-critic framework is proposed to approximately seek the optimal strategy. In addition, the introduction of the medicine dosage adjustment mechanism guarantees that the therapy strategy is adjusted timely and only when necessary. Finally, the theoretical analysis and simulation results both indicate that the designed strategy can effectively decrease the populations of tumor cells and endothelial cells with very low medicine dosages, which verifies the availability of the proposed method. Our future research direction is to seek the optimal strategy for decreasing tumor cells or other harmful cells with the latest therapies, for example, therapy applying oncolytic viruses.

#### **References**


112 6 Combination Therapy-Based Adaptive Control for Organism Using Medicine …


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 7 Adaptive Virotherapy Strategy for Organism with Constrained Input Using Medicine Dosage Regulation Mechanism**

#### **7.1 Introduction**

Traditional therapies such as surgery, chemotherapy and radiation are characterized by low efficacy and high toxicity for patients. Hence the most promising tumor treatment strategy is oncolytic virotherapy, which depends on viruses with relatively weak pathogenicity and appropriate gene modification, while the therapeutic effect benefits from their strong replication capability. Similar to the principle of targeted therapy, gene-modified viruses selectively infect tumor cells (yielding infected tumor cells, ITCs) through rapid replication, ultimately destroy TCs, and concurrently activate the body's immune response. Oncolytic virus therapy can not only kill TCs but also attract more immune cells to kill residual cancer cells, without depleting normal cells in the body. Oncolytic viruses (OVs) enjoy the advantages of minimal side effects and superior therapeutic effects compared with traditional treatment strategies [1]. The development of oncolytic viruses benefits from the virus-specific lytic CTL response, which elicits immunostimulatory signals and contributes to the killing of ITCs [2]; thus viral doses, the number of doses and the timing, supported by reliable mathematical models, are future research directions.

To study cancer virotherapy in depth, mathematical models describing the mechanisms of TCs, OVs and immune cells have been proposed and updated [3, 4]. Literature [5] expounded the inner mechanism involving uninfected tumor cells (UTCs), ITCs and free viruses. Subsequently, infected and uninfected cells were distinguished through the logistic growth of TCs and the elimination of free recombinant measles viruses [6]. What matters most is the immune response, which can inhibit viral therapy by mistaking the genetically modified viruses for a threat.

Therapy efficiency depends on whether hyperimmunity occurs; in other words, infected cancer cells and viruses may be engulfed indiscriminately. The literature has demonstrated this side effect of immune cells, and the immunosuppressive agent cyclophosphamide has been chosen to reduce the immune response [7]. Reference [8] considered the virus-free population in addition to the previous three variables, reflecting the interaction of innate immunity with infected cancer cells and virus cells, and evolving into an effective mechanism-analysis model; however, a more effective control strategy is urgently needed. Cytokines from natural killer cells contribute most to the destruction of both tumor and virus-infected cells. The proposed model clarifies the interplay among TCs, OVs and the immune response, which guides optimal therapeutic strategies and dosage regimens in oncotherapy. Although research on regulating the immune system and TCs via ADP has been reported [9], selective oncolysis through gene-modified viruses can achieve better therapeutic effect than wild-type OVs under the ADP method.

As a vital branch of machine learning that obtains information from interaction with the environment [10–12], reinforcement learning (RL) has been demonstrated to perform well in solving optimal control issues of nonlinear systems [13]. The ADP method, which was derived from RL and dynamic programming, generally attempts to obtain the optimal strategies with the aid of the classic critic-actor algorithm framework [14]. Under this architecture, the critic evaluates the cost when the current strategy is applied, and the actor updates the control strategy in accordance with the feedback information provided by the critic. Thus the approximate optimal strategy can be derived and the "curse of dimensionality" can be obviated. Recently, ADP-based methods have been widely researched to tackle various optimal control issues, for instance, tracking control [15–17], optimal consensus control [18–20], and zero-sum and nonzero-sum games [21–23]. Different from fuzzy approximation as in [24], a robust dynamic NN was established to asymptotically identify an uncertain system with additive disturbances, and the critic and actor worked together to find the equilibrium solution of nonzero-sum games subject to nonlinear dynamics. An identifier was developed to reconstruct the unknown dynamics, and the critic was tuned by a concurrent learning strategy which could effectively use real-time and recorded data such that the persistence of excitation (PE) condition could be removed. By utilizing both online and off-line data, a data-based policy gradient ADP method was developed to seek the optimal scheme in [25]. To address the global optimal control issue and avoid falling into local optimality as in [26], an ADP method combined with predesigned extra compensators was proposed in [27]. The introduction of these compensators contributed to deriving the augmented neighborhood error systems, thus avoiding the system dynamics requirement for ADP.
In [28], integrating the learning ability of neural networks and the spirit of ADP, a general architecture of intelligent critic control was proposed to solve the robustness issues of disturbed nonlinear systems.

As saturation phenomena, which exist widely in practical systems, can degrade system performance, multifarious ADP-based methods have been proposed to achieve optimal control with input constraints [29–31]. For the tumor-virus-immune system in this chapter, the control input is the medicine containing the virus particles. Redundant or insufficient medicine dosages may well impair the therapeutic effect or the patient's health. Thus we consider asymmetric input constraints and construct the corresponding non-quadratic value function associated with the tumor-virus-immune system.

Recently, ADP-based methods have been proposed to develop approximate optimal strategies in various practical applications [32–35]. However, there is seldom any literature on optimal virotherapy strategies derived from ADP-based methods. Enlightened by the literature mentioned above, we design a virotherapy-based optimal strategy via the ADP method with MDRM. The contributions can be stated as follows. Firstly, a mathematical model is introduced to simulate the relationships between TCs, OVs and immune cells. Due to the asymmetric dosage constraints on the medicine, a non-quadratic utility function is constructed to form the discounted value function. Then, on the basis of the tumor-virus-immune model, an ADP method with a single-critic architecture is proposed to solve the HJBE such that the approximate optimal strategy can be achieved, which means that the TCs can be largely eliminated with the constrained optimal virotherapy-based strategy. Furthermore, the medicine dosage regulation mechanism is introduced into this algorithm framework for the first time, providing reasonable medicine indications for regulating the dose. Finally, theoretical analysis and simulation experiments both validate the effectiveness of the designed therapeutic strategy.

#### **7.2 Problem Formulation and Preliminaries**

#### *7.2.1 Establishment of Interaction Model*

In this section, the tumor-virus-immune interaction model is introduced to describe the relations between TCs, viruses and immune cells. Due to the behavior of OVs, we divide the TCs into UTCs and ITCs. In the model, composed of the four ordinary differential equations below, $P_{TU}(t)$, $P_{TI}(t)$, $P_{VI}(t)$ and $P_{IM}(t)$ respectively denote the populations of UTCs, ITCs, free OVs and immune cells.

The population of UTCs can be affected by multiple factors, that is, the multiplication and apoptosis of TCs, the infection by OVs and the reduction caused by immune cells. Moreover, the growth dynamics of UTCs is presented as

$$\dot{P}\_{TU}(t) = A\_1 P\_{TU}(t) \left(1 - \frac{P\_{TU}(t) + P\_{TI}(t)}{K}\right) - A\_2 P\_{TU}(t) P\_{VI}(t)$$

$$-B\_1 P\_{TU}(t) P\_{IM}(t) - C\_1 P\_{TU}(t),\tag{7.1}$$

where $A_1$ is the tumor proliferation rate, $A_2$ is the infection rate of the virus, $B_1$ denotes the killing efficiency of immune cells, and $C_1$ is the apoptosis rate of UTCs.

Similarly, the population of ITCs can be modeled by

$$\dot{P}\_{TI}(t) = A\_2 P\_{TU}(t) P\_{VI}(t) - B\_2 P\_{TI}(t) P\_{IM}(t) - \varphi P\_{TI}(t),\tag{7.2}$$

where $B_2$ denotes the immune killing efficiency against ITCs and $\varphi$ is the apoptosis rate of ITCs.

The lysis of ITCs which contain multiple replicated virion particles and the input of virus agentia can both contribute to the rise of the free virus population. Thus the evolution dynamics of virus population can be presented as

118 7 Adaptive Virotherapy Strategy for Organism with Constrained Input …

$$\dot{P}\_{VI}(t) = \mathcal{U} + \kappa \varphi P\_{TI}(t) - A\_2 P\_{TU}(t) P\_{VI}(t)$$

$$-B\_3 P\_{VI}(t) P\_{IM}(t) - C\_2 P\_{VI}(t),\tag{7.3}$$

where $\mathcal{U}$ denotes the input of virus agentia, $\kappa$ the burst size of free viruses, $B_3$ the immune killing-efficiency rate against OVs, and $C_2$ the clearance rate of OVs.

The immune response dynamics can be formulated as

$$\begin{aligned} \dot{P}\_{IM}(t) &= D\_1 P\_{TI}(t) P\_{IM}(t) + D\_2 P\_{TU}(t) P\_{IM}(t) \\ &- C\_3 P\_{IM}(t), \end{aligned} \tag{7.4}$$

where $D_1$ and $D_2$ are the immune response rates stimulated by infected and uninfected cells, respectively, and $C_3$ is the apoptosis rate of immune cells. For the purpose of simplifying the interaction model, we utilize the nondimensionalization technique [36, 37] to derive the simplified version as

$$\begin{cases} \dot{p}\_{TU}(t) = a\_1 p\_{TU}(t)(1 - p\_{TU}(t) - p\_{TI}(t)) - c\_1 p\_{TU}(t) \\ \qquad \quad - a\_2 p\_{TU}(t)p\_{VI}(t) - b\_1 p\_{TU}(t)p\_{IM}(t) \\ \dot{p}\_{TI}(t) = a\_2 p\_{TU}(t)p\_{VI}(t) - b\_2 p\_{TI}(t)p\_{IM}(t) - \varphi p\_{TI}(t) \\ \dot{p}\_{VI}(t) = u + \kappa p\_{TI}(t) - a\_2 p\_{TU}(t)p\_{VI}(t) \\ \qquad \quad - b\_3 p\_{VI}(t)p\_{IM}(t) - c\_2 p\_{VI}(t) \\ \dot{p}\_{IM}(t) = d\_1 p\_{TI}(t)p\_{IM}(t) + d\_2 p\_{TU}(t)p\_{IM}(t) \\ \qquad \quad - c\_3 p\_{IM}(t). \end{cases} (7.5)$$

Herein the nonnegative states of the nondimensionalized version are represented as $p_{TU}(t)$, $p_{TI}(t)$, $p_{VI}(t)$ and $p_{IM}(t)$.
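The dynamics (7.5) can be explored numerically with a minimal forward-Euler sketch. The rate constants below are those quoted later in Sect. 7.5; the values of $\varphi$ and $\kappa$, the constant dose $u$, the horizon and the step size are illustrative assumptions rather than values fixed by the chapter.

```python
# Forward-Euler simulation of the nondimensionalized model (7.5).
# phi, kappa, the dose u, t_end and dt are illustrative assumptions.

def model_rhs(p, u, a1=0.36, a2=0.1, b1=0.36, b2=0.48, b3=0.16,
              c1=0.1278, c2=0.2, c3=0.036, d1=0.6, d2=0.29,
              phi=0.2, kappa=2.0):
    p_tu, p_ti, p_vi, p_im = p
    dp_tu = (a1 * p_tu * (1.0 - p_tu - p_ti) - c1 * p_tu
             - a2 * p_tu * p_vi - b1 * p_tu * p_im)
    dp_ti = a2 * p_tu * p_vi - b2 * p_ti * p_im - phi * p_ti
    dp_vi = (u + kappa * p_ti - a2 * p_tu * p_vi
             - b3 * p_vi * p_im - c2 * p_vi)
    dp_im = d1 * p_ti * p_im + d2 * p_tu * p_im - c3 * p_im
    return [dp_tu, dp_ti, dp_vi, dp_im]

def simulate(p0, u, t_end=300.0, dt=0.01):
    p = list(p0)
    for _ in range(int(t_end / dt)):
        dp = model_rhs(p, u)
        # clip at zero so the discretization cannot drive populations negative
        p = [max(0.0, v + dt * dv) for v, dv in zip(p, dp)]
    return p

p_final = simulate([0.8, 0.0, 0.2, 0.05], u=0.02)
```

The clipping step mirrors the nonnegativity of the states noted above; a production study would use a proper ODE integrator instead of this fixed-step scheme.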

**Remark 7.1** In virotherapy, the viruses achieve their reproductive objective by infecting tumor cells and replicating themselves. After the lysis of an infected cell, the new virions burst out and infect other tumor cells. Under this mechanism, the tumor cells can be effectively eliminated. Furthermore, compared with uninfected tumor cells, the infected cells can activate immune cells more effectively to kill tumor cells.

#### *7.2.2 Problem Formulation*

Consider the system (7.5) rewritten as

$$
\dot{\mathbf{x}} = f(\mathbf{x}) + gu,\tag{7.6}
$$

where $g = [0, 0, 1, 0]^T$, and $f(x)$ is constructed from the right-hand sides of (7.5) excluding the control input $u$. The input satisfies $u \in [u_m, u_M]$, where $u_m$ and $u_M$ denote the minimum and maximum thresholds for the medicine input dosage.

For system (7.6), the corresponding discounted value function is defined as

$$V(\mathbf{x}(t)) = \int\_{t}^{\infty} e^{-\theta(\iota - t)} W(\mathbf{x}, u) d\iota,\tag{7.7}$$

with the discount factor $\theta > 0$. The utility function is given by

$$W(\mathbf{x}, u) = \mathbf{x}^T \Upsilon \mathbf{x} + \chi(u),\tag{7.8}$$

where the matrix $\Upsilon$ is positive definite, and $\chi(u)$ is a non-negative function. Note that for system (7.6) the input constraints are not symmetric. In order to cope with this issue, the function $\chi(u)$ is defined as

$$\chi(u) = 2\hbar \int\_{\alpha}^{u} \psi^{-1}(\hbar^{-1}(\iota - \alpha)) d\iota,\tag{7.9}$$

where $\alpha = (u_m + u_M)/2$ and $\hbar = (u_M - u_m)/2$. $\psi(\cdot)$ is a monotonic odd function which is continuously differentiable with $\psi(0) = 0$. Without loss of generality, we select the hyperbolic tangent function, that is, $\psi(\cdot) = \tanh(\cdot)$.
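The penalty (7.9) with $\psi = \tanh$ can be evaluated numerically as a quick sanity check. The dosage bounds $u_m = 0$, $u_M = 0.02$ are the ones used later in Sect. 7.5; the trapezoidal quadrature and the sample doses are assumptions made here for illustration.

```python
# Numerical sketch of the non-quadratic penalty (7.9) with psi = tanh.
import math

u_m, u_M = 0.0, 0.02
alpha = (u_m + u_M) / 2.0   # midpoint of the admissible dose interval
hbar = (u_M - u_m) / 2.0    # half-width of the interval

def chi(u, n=20000):
    """chi(u) = 2*hbar * int_alpha^u atanh((s - alpha)/hbar) ds (trapezoid)."""
    total = 0.0
    for k in range(n):
        s0 = alpha + (u - alpha) * k / n
        s1 = alpha + (u - alpha) * (k + 1) / n
        total += 0.5 * (math.atanh((s0 - alpha) / hbar)
                        + math.atanh((s1 - alpha) / hbar)) * (s1 - s0)
    return 2.0 * hbar * total
```

By construction $\chi$ vanishes at the midpoint $\alpha$, is non-negative, and grows toward either bound, which is what makes it a constraint-aware utility term.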

Differentiating the value function (7.7) along system (7.6), we obtain that

$$0 = \nabla V^T (f + gu) + \mathbf{x}^T \Upsilon \mathbf{x} + \chi(u) - \theta V. \tag{7.10}$$

Then the Hamiltonian function can be expressed as

$$H(\mathbf{x}, u, \nabla V) = \nabla V^T (f + gu) + \mathbf{x}^T \Upsilon \mathbf{x} + \chi(u) - \theta V. \tag{7.11}$$

The optimal value function is defined as

$$V^\*(\mathbf{x}) = \min\_{u} \int\_{t}^{\infty} e^{-\theta(\iota - t)} W(\mathbf{x}, u) d\iota, \tag{7.12}$$

which satisfies the HJBE

$$\min\_{u} H(\mathbf{x}, u, \nabla V^\*) = 0.\tag{7.13}$$

Applying the stationary condition, we can derive the optimal strategy as

$$
u^\* = -\hbar \tanh(\frac{1}{2\hbar} g^T \nabla V^\*) + \alpha. \tag{7.14}
$$

On the basis of (7.13) and (7.14), we rewrite the HJBE as

$$\begin{aligned} \left(\nabla V^\*\right)^T f - \hbar (\nabla V^\*)^T g \tanh(\frac{1}{2\hbar} g^T \nabla V^\*) + \mathbf{x}^T \Upsilon \mathbf{x} \\ + (\nabla V^\*)^T g \alpha - \theta V^\* + \chi(\mathbf{u}^\*) = 0. \end{aligned} \tag{7.15}$$

**Remark 7.2** In conventional constrained optimal control problems, the input constraints are often required to be symmetric. Nevertheless, the method proposed in this chapter takes asymmetric input constraints into account. Thus the symmetric constraint condition is relaxed by constructing the unconventional utility function (7.8).
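The constraint-respecting property of the policy form (7.14) can be checked numerically. The bounds $u_m = 0$, $u_M = 0.02$ are the ones quoted in Sect. 7.5; the sample costate values are arbitrary.

```python
# Check that the dose law (7.14) respects the asymmetric constraint
# for any value of the costate term g^T * grad V*.
import math

u_m, u_M = 0.0, 0.02
alpha = (u_m + u_M) / 2.0
hbar = (u_M - u_m) / 2.0

def optimal_dose(costate):
    # u* = -hbar * tanh(costate / (2 * hbar)) + alpha, cf. (7.14)
    return -hbar * math.tanh(costate / (2.0 * hbar)) + alpha

doses = [optimal_dose(v) for v in (-1e6, -1.0, 0.0, 1.0, 1e6)]
```

Since $\tanh$ ranges over $(-1, 1)$, the dose always lies in $[u_m, u_M]$; a zero costate yields the midpoint $\alpha$, and the dose decreases monotonically as the costate grows.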

Due to the nonlinear nature of (7.15), it is often intractable to derive the analytical solution, which is requisite for designing the optimal strategy. To overcome this issue, in the following sections an ADP method with a single-critic network using the dosage regulation mechanism is designed to approximately solve (7.15).

#### **7.3 Optimal Strategy Based on MDRM**

In order to regulate the therapeutic strategy timely and only when necessary, the MDRM is introduced to provide medicine indications that determine when a regulation is needed. Therefore, the time sequence $\{z_\iota\}$ is required to record the regulating instants. The parameter $\iota \in \mathbb{N}^+$ represents the $\iota$th updating instant and $\mathbb{N}^+$ is the set of all positive integers. Then we can define the state as

$$\check{\mathbf{x}}\_{\iota}(t) = \mathbf{x}(z\_{\iota}), t \in [z\_{\iota}, z\_{\iota+1}). \tag{7.16}$$

In general, the clinical data recorded at the latest regulation instant differs from the current clinical data. Hence the error is given by

$$\nu\_{\iota}(t) = \check{\mathbf{x}}\_{\iota} - \mathbf{x}(t), t \in [z\_{\iota}, z\_{\iota+1}). \tag{7.17}$$

Based on $\nu_\iota$ and the threshold associated with the state $x$, the medicine regulation mechanism is established. When a regulation occurs, $\nu_\iota = 0$, which means the medicine dosage is regulated to equal the current medicine indication. The comparable data is updated by the clinical data at the regulation instant, and the medicine dosage remains unchanged until the next regulation occurs. That is, $\breve{u} = u(\breve{x}_\iota)$. Thus we derive the MDRM-based strategy as

$$\check{u}^\* = -\hbar \tanh(\frac{1}{2\hbar}g^T(\check{\mathbf{x}}\_\iota)\nabla V^\*(\check{\mathbf{x}}\_\iota)) + \alpha,\tag{7.18}$$

where $\nabla V^*(\breve{x}_\iota) = \partial V^*/\partial x$ evaluated at $t = z_\iota$. Then the medicine-regulation-mechanism-based HJBE can be denoted as

$$\begin{split} H(\mathbf{x}, \breve{u}^\*, V^\*) &= -\hbar (\nabla V^\*)^T g \tanh(\frac{1}{2\hbar} g^T(\breve{\mathbf{x}}\_\iota) \nabla V^\*(\breve{\mathbf{x}}\_\iota)) \\ &+ (\nabla V^\*)^T f + (\nabla V^\*)^T g \alpha + \mathbf{x}^T \Upsilon \mathbf{x} \\ &+ \chi(\breve{u}^\*) - \theta V^\*. \end{split} \tag{7.19}$$

The existence of the error $\nu_\iota$ means that (7.19) does not equal zero, which differs from the HJBE (7.15). Before proceeding, an assumption is necessary [31].

**Assumption 7.1** The optimal strategy $u^*$ is locally Lipschitz with respect to the error $\nu_\iota$, i.e., $\|u^* - \breve{u}^*\|^2 \le K_u \|x - \breve{x}_\iota\|^2 = K_u \|\nu_\iota\|^2$, where $K_u$ is a positive constant.

**Theorem 7.1** *Consider the nonlinear system (7.6). Suppose that Assumption 7.1 is tenable and there exists function V*<sup>∗</sup> *satisfying (7.15). If the optimal strategy is formulated as (7.18) with the medicine indication*

$$\left\|\nu\_{\iota}\right\|^{2} \leq \frac{(1-\zeta^{2})\lambda\_{m}(\Upsilon)}{K\_{u}}\left\|x\right\|^{2}\tag{7.20}$$

*where* ζ ∈ (0, 1) *is the designed parameter, then the controlled system is guaranteed to be asymptotically stable in the sense of UUB.*

*Proof* Select the Lyapunov function $\bar{Y} = V^*(x)$. Then we can obtain its derivative

$$\dot{\bar{Y}} = (\nabla V^\*)^T (f + g\breve{u}^\*). \tag{7.21}$$

According to (7.14) and (7.15), we derive that

$$(\nabla V^\*)^T f = -(\nabla V^\*)^T g u^\* - \mathbf{x}^T \Upsilon \mathbf{x} - \chi(u^\*) + \theta V^\*,\tag{7.22}$$

and

$$(\nabla V^\*)^T g = -2\hbar (\tanh^{-1}((u^\* - \alpha)/\hbar))^T. \tag{7.23}$$

Then (7.21) can be rewritten as

$$\begin{split} \dot{\bar{Y}} &= -\left(\nabla V^\*\right)^T g(u^\* - \breve{u}^\*) - \mathbf{x}^T \Upsilon \mathbf{x} - \chi(u^\*) + \theta V^\* \\ &= -2\hbar(\tanh^{-1}((u^\* - \alpha)/\hbar))^T (\breve{u}^\* - u^\*) - \mathbf{x}^T \Upsilon \mathbf{x} \\ &- \chi(u^\*) + \theta V^\* \\ &= -\mathbf{x}^T \Upsilon \mathbf{x} - \chi(u^\*) + \theta V^\* + \varpi, \end{split} \tag{7.24}$$

where $\varpi = -2\hbar(\tanh^{-1}((u^* - \alpha)/\hbar))^T (\breve{u}^* - u^*)$. Applying Young's inequality to (7.24), we derive

$$
\varpi \le \hbar^2 \|\tanh^{-1}((u^\* - \alpha)/\hbar)\|^2 + K\_u \|\nu\_\iota\|^2. \tag{7.25}
$$


Via variable substitution approach, we have

$$\chi(u^\*) = 2\hbar \int\_0^{u^\*-\alpha} \tanh^{-1}((\iota - \alpha)/\hbar) d(\iota - \alpha). \tag{7.26}$$

The function (7.26) can be further expressed as

$$\begin{split} \chi(u^\*) &= 2\hbar^2 \int\_0^{\tanh^{-1}((u^\*-\alpha)/\hbar)} \varsigma (1-\tanh^2(\varsigma)) d\varsigma \\ &= -2\hbar^2 \int\_0^{\tanh^{-1}((u^\*-\alpha)/\hbar)} \varsigma \tanh^2(\varsigma) d\varsigma \\ &\quad + \hbar^2 (\tanh^{-1}((u^\*-\alpha)/\hbar))^2. \end{split} \tag{7.27}$$

Based on (7.24), (7.25) and (7.27), we can obtain

$$\dot{\bar{Y}} \le \Xi\_1 + K\_u \left\| \nu\_\iota \right\|^2 + \theta V^\* - \mathbf{x}^T \Upsilon \mathbf{x},\tag{7.28}$$

where $\Xi_1(x) = 2\hbar^2 \int_0^{\tanh^{-1}((u^*-\alpha)/\hbar)} \varsigma \tanh^2(\varsigma) d\varsigma$. By the integral mean-value theorem, we derive that

$$\Xi\_1(\mathbf{x}) = 2\hbar^2 \tanh^{-1}((u^\* - \alpha)/\hbar)\rho \tanh^2(\rho),\tag{7.29}$$

where $\rho \in (0, \tanh^{-1}((u^* - \alpha)/\hbar))$. As $u^*$ is admissible, it can be deduced that $V^*$ and $\nabla V^*$ are bounded. Let $\|V^*\| \le b_V$ and $\|\nabla V^*\| \le b_{\nabla V}$ with $b_V$ and $b_{\nabla V}$ being positive constants. Then (7.29) becomes

$$\begin{split} \Xi\_1(\mathbf{x}) &\leq 2\hbar^2 \tanh^{-1}((u^\* - \alpha)/\hbar)\rho \\ &\leq 2\hbar^2(\tanh^{-1}((u^\* - \alpha)/\hbar))^2 \\ &= \frac{1}{2}\nabla V^{\*T} g g^T \nabla V^\* \\ &\leq \frac{1}{2}b\_g^2 b\_{\nabla V}^2 \triangleq b\_{\Xi\_1}, \end{split} \tag{7.30}$$

where the positive constant *<sup>b</sup>*g denotes the bound of <sup>g</sup>(*x*). According to (7.28) and (7.30), it can be obtained that

$$\begin{split} \dot{\bar{Y}} &\leq -\zeta^2 \lambda\_m(\Upsilon) \|\mathbf{x}\|^2 - (1 - \zeta^2) \lambda\_m(\Upsilon) \|\mathbf{x}\|^2 \\ &+ K\_u \|\nu\_\iota\|^2 + \theta b\_V + b\_{\Xi\_1}. \end{split} \tag{7.31}$$

When the indication (7.20) is satisfied, it yields that $\dot{\bar{Y}} \le -\zeta^2\lambda_m(\Upsilon)\|x\|^2 + \theta b_V + b_{\Xi_1}$. Then we can conclude that $\dot{\bar{Y}} < 0$ when $\|x\| > \sqrt{(\theta b_V + b_{\Xi_1})/(\zeta^2\lambda_m(\Upsilon))}$.

Theorem 7.1 indicates that with the utilization of the medicine regulation mechanism, the MDRM-based optimal strategy can asymptotically stabilize the controlled system.
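The event-driven regulation loop behind Theorem 7.1 can be sketched in code: between regulation instants the dose is held, and a new regulation is issued only when the deviation $\nu_\iota$ violates the indication (7.20). The scalar plant $\dot{x} = -x + u$ and the stand-in feedback law are illustrative assumptions (the chapter uses the tumor-virus-immune dynamics and the critic-based dose (7.18)); $K_u = 20$ and $\zeta = 0.9$ are the values quoted in Sect. 7.5.

```python
# Event-triggered sketch of the MDRM loop on a toy scalar plant.

def policy(x_check):
    # stand-in for the critic-based dose computed from the held state
    return -0.5 * x_check

def mdrm_simulate(x0=1.0, t_end=10.0, dt=0.001,
                  K_u=20.0, zeta=0.9, lam_m=0.2):
    x = x0
    x_check = x0              # clinical data stored at the last regulation
    u = policy(x_check)
    events = 0
    for _ in range(int(t_end / dt)):
        nu = x_check - x
        # medicine indication (7.20): regulate once the bound is violated
        if nu * nu > (1.0 - zeta ** 2) * lam_m / K_u * x * x:
            x_check = x       # record the current clinical data
            u = policy(x_check)
            events += 1
        x += dt * (-x + u)    # flow dynamics between regulations
    return x, events

x_final, events = mdrm_simulate()
```

The dose changes only `events` times rather than at every integration step, yet the state is still driven toward the origin, which is the practical appeal of the mechanism.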

#### **7.4 MDRM-Based Approximate Optimal Control Design**

The approximate optimal control strategy is designed based on the ADP algorithm which integrates the medicine regulation mechanism. Furthermore, for the closed-loop controlled system, asymptotic stability in the sense of UUB is guaranteed when the proposed medicine indication is applied.

#### *7.4.1 Implementation of the Adaptive Dynamic Programming Method*

In this section, the approximate optimal strategy is designed by the ADP method of single-critic framework which integrates the medicine regulation mechanism.

Based on the universal approximation property of NNs, $V^*$ can be represented as

$$V^\* = \omega^{\*T}\vartheta(\mathbf{x}) + \tau,\tag{7.32}$$

where $\omega^*$ is the ideal weight vector, $\vartheta(\cdot)$ the activation function and $\tau$ the approximation error. Let $\Gamma_1(\breve{x}_\iota) = \frac{1}{2\hbar} g^T(\breve{x}_\iota)\nabla\vartheta^T(\breve{x}_\iota)\omega^*$; then we have

$$\check{u}^\* = -\hbar \tanh(\Gamma\_1(\check{\mathbf{x}}\_\iota)) + \bar{\tau}(\check{\mathbf{x}}\_\iota) + \alpha, t \in [z\_\iota, z\_{\iota+1}) \tag{7.33}$$

where $\bar{\tau}(\breve{x}_\iota) = -\frac{1}{2}(1 - \tanh^2(\varrho(\breve{x}_\iota)))g^T(\breve{x}_\iota)\nabla\tau(\breve{x}_\iota)$. Herein, $\varrho(\breve{x}_\iota)$ is selected between $\frac{1}{2\hbar}g^T(\breve{x}_\iota)\nabla V^*(\breve{x}_\iota)$ and $\Gamma_1(\breve{x}_\iota)$. As the ideal weight $\omega^*$ is unknown, the approximate version of $V^*$ is derived by the critic NN, which is presented as

$$
\hat{V} = \hat{\omega}^T \vartheta(\mathbf{x}),
\tag{7.34}
$$

where $\hat{\omega}$ is the approximate weight vector. Then the MDRM-based approximate strategy can be obtained as

$$\check{u} = -\hbar \tanh(\Gamma\_2(\check{\mathbf{x}}\_{\iota})) + \alpha, t \in [z\_{\iota}, z\_{\iota+1}), \tag{7.35}$$

where $\Gamma_2(\breve{x}_\iota) = \frac{1}{2\hbar} g^T(\breve{x}_\iota)\nabla\vartheta^T(\breve{x}_\iota)\hat{\omega}$. Then the approximate Hamiltonian can be restated as

$$H(\mathbf{x}, \check{u}, \hat{\omega}) = \hat{\omega}^T \xi + \mathbf{x}^T \Upsilon \mathbf{x} + \chi(\check{u}) \stackrel{\Delta}{=} \varepsilon\_H,\tag{7.36}$$

where $\xi = \nabla\vartheta(f + g\breve{u}) - \theta\vartheta$.

The goal of tuning $\hat{\omega}$ is to minimize the term $\varepsilon_H$. Thus we set the target function as $E = \frac{1}{2}\varepsilon_H^T \varepsilon_H$. Using the gradient descent approach, we obtain

$$
\dot{\hat{\omega}} = -\ell \frac{\xi}{(\xi^T \xi + 1)^2} \varepsilon\_H = -\ell \check{\xi} \varepsilon\_H,\tag{7.37}
$$

where $\ell$ is the learning parameter and $\breve{\xi} = \xi/(\xi^T \xi + 1)^2$. Define $\tilde{\omega} = \omega^* - \hat{\omega}$. From (7.37) we derive that

$$
\dot{\tilde{\omega}} = -\ell \bar{\xi} \bar{\xi}^T \tilde{\omega} + \ell \check{\xi} e\_H,\tag{7.38}
$$

where $\bar{\xi} = \xi/(\xi^T\xi + 1)$ and the approximation residual error $e_H = -\nabla\tau^T(f + g\breve{u}) + \theta\tau$. Before presenting the main results, the following assumptions are requisite [38, 39].

**Assumption 7.2** The signal $\bar{\xi}$ is persistently excited over the time interval $[t, t+T]$. In other words, there exist positive constants $\phi$ and $T$ such that

$$
\phi I\_{N\_c \times N\_c} \le \int\_t^{t+T} \bar{\xi} \bar{\xi}^T d\iota,\tag{7.39}
$$

with *Nc* being the neuron number of the critic network.

**Assumption 7.3** The terms $\bar{\tau}$ and $e_H$ are both bounded. That is, $\|\bar{\tau}\| \le b_{\bar{\tau}}$ and $\|e_H\| \le b_{e_H}$, where $b_{\bar{\tau}}$ and $b_{e_H}$ are positive constants.
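The normalized gradient-descent update (7.37) can be sketched in a few lines: the weight moves against $\xi/(\xi^T\xi + 1)^2$ scaled by the residual $\varepsilon_H$. The synthetic regressors and the linear-in-error residual below are stand-ins for (7.36), since the true residual requires the full plant; $\ell = 1.6$ is the learning parameter quoted in Sect. 7.5.

```python
# Sketch of the normalized critic update (7.37) on synthetic data.
import random

def critic_step(omega_hat, xi, eps_H, ell=1.6):
    norm2 = sum(v * v for v in xi)
    scale = ell * eps_H / (norm2 + 1.0) ** 2
    return [w - scale * v for w, v in zip(omega_hat, xi)]

random.seed(0)
target = [0.5, -0.2, 0.1]          # plays the role of the ideal weight omega*
omega = [0.0, 0.0, 0.0]
for _ in range(5000):
    xi = [random.uniform(-1.0, 1.0) for _ in range(3)]
    # residual vanishes exactly when omega equals the target weight
    eps_H = sum((w - t) * v for w, t, v in zip(omega, target, xi))
    omega = critic_step(omega, xi, eps_H)
```

Random regressors satisfy the persistence-of-excitation requirement of Assumption 7.2, so the weight error contracts toward zero; the normalization keeps the step size bounded regardless of $\|\xi\|$.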

#### *7.4.2 Stability Analysis*

This section discusses the stability of the controlled system with the designed MDRM-based strategy.

**Theorem 7.2** *Consider system (7.6) and let Assumptions 7.1–7.3 hold. The strategy is given by (7.35) and the weight tuning law for the critic is set as (7.37). Then the closed-loop system (7.6) and the weight estimation error* ω˜ *are asymptotically stable in the sense of UUB provided that the medicine indication is applied*

$$\left\|\nu\_{\iota}\right\|^{2} \leq \frac{(1-\eta^{2})\lambda\_{m}(\Upsilon)}{2K\_{u}}\left\|\mathbf{x}\right\|^{2} \stackrel{\triangle}{=} \left\|T\_{\nu\_{\iota}}\right\|\tag{7.40}$$

*with* η ∈ (0, 1) *being the regulation parameter.*

#### *Proof* Select the Lyapunov function as

$$Y = V^\*(\check{\mathbf{x}}\_\iota) + V^\*(\mathbf{x}) + \tilde{\omega}^T\ell^{-1}\tilde{\omega} = Y\_a + Y\_b + Y\_c. \tag{7.41}$$

Note that when the medicine indication is applied, the system can be described by an impulsive model comprising two components: the flow dynamics for $t \in [z_\iota, z_{\iota+1})$ and the jump dynamics for $t = z_\iota$. Hence we discuss the two cases separately.

*Case I*: No regulation occurs, i.e., $t \in [z_\iota, z_{\iota+1})$. Then we obtain $\dot{Y}_a = 0$. In light of (7.22) and (7.23), we can derive that

$$\begin{aligned} \dot{Y}\_b &= (\nabla V^\*)^T (f + g\check{u}) \\ &= \Xi\_2 - \chi (u^\*) - \mathbf{x}^T \Upsilon \mathbf{x} + \theta V^\*, \end{aligned} \tag{7.42}$$

where $\Xi_2 = -2\hbar(\tanh^{-1}((u^* - \alpha)/\hbar))^T (\breve{u} - u^*)$. According to Young's inequality, we have

$$\Xi\_2 \le \hbar^2 \|\tanh^{-1}((u^\*-\alpha)/\hbar)\|^2 + \|\check{u} - u^\*\|^2. \tag{7.43}$$

Recalling (7.27), we obtain

$$
\Xi\_2 - \chi(u^\*) \le \Xi\_1(\mathbf{x}) + \left\| \check{u} - u^\* \right\|^2. \tag{7.44}
$$

As $\Xi_1(x)$ and $V^*(x)$ are bounded, (7.42) becomes

$$\dot{Y}\_b \le \|\check{u} - u^\*\|^2 + b\_{\Xi\_1} + \theta b\_V - \mathbf{x}^T \Upsilon \mathbf{x}.\tag{7.45}$$

Applying Young's inequality, we derive that

$$\begin{split} \left\lVert \check{u} - u^\* \right\rVert^2 &= \left\lVert \check{u} - \check{u}^\* + \check{u}^\* - u^\* \right\rVert^2 \leq 2\left\lVert \check{u} - \check{u}^\* \right\rVert^2 + 2\left\lVert \check{u}^\* - u^\* \right\rVert^2 \\ &\leq 4\left\lVert \hbar \tanh(\Gamma\_1(\check{\mathbf{x}}\_\iota)) - \hbar \tanh(\Gamma\_2(\check{\mathbf{x}}\_\iota)) \right\rVert^2 + 4\left\lVert \bar{\tau}(\check{\mathbf{x}}\_\iota) \right\rVert^2 + 2K\_u \left\lVert \nu\_\iota \right\rVert^2 \\ &\leq 8\hbar^2 \tanh^2(\Gamma\_1(\check{\mathbf{x}}\_\iota)) + 8\hbar^2 \tanh^2(\Gamma\_2(\check{\mathbf{x}}\_\iota)) + 2K\_u \left\lVert \nu\_\iota \right\rVert^2 + 4b\_{\bar{\tau}}^2. \end{split} \tag{7.46}$$

As $|\tanh(\cdot)| \le 1$, it can be obtained that

$$\dot{Y}\_b \le -\lambda\_m(\Upsilon) \|\mathbf{x}\|^2 + 2K\_u \|\nu\_\iota\|^2 + \sigma,\tag{7.47}$$

where $\sigma = 16\hbar^2 + 4b_{\bar{\tau}}^2 + \theta b_V + b_{\Xi_1}$.

Taking the derivative of *Yc*, we derive that

$$\dot{Y}\_c = -2\tilde{\omega}^T \bar{\xi} \bar{\xi}^T \tilde{\omega} + 2\tilde{\omega}^T \check{\xi} e\_H. \tag{7.48}$$

In light of Young's inequality, it yields that


$$2\tilde{\omega}^T \tilde{\xi} e\_H \le 2\tilde{\omega}^T \bar{\xi} e\_H \le \tilde{\omega}^T \bar{\xi} \bar{\xi}^T \tilde{\omega} + e\_H^T e\_H. \tag{7.49}$$

Then (7.48) can be further expressed as

$$\dot{Y}\_c \le -\tilde{\omega}^T \bar{\xi} \bar{\xi}^T \tilde{\omega} + e\_H^T e\_H \le -\lambda\_m(\delta) \|\tilde{\omega}\|^2 + b\_{e\_H}^2,\tag{7.50}$$

where $\delta = \bar{\xi}\bar{\xi}^T$.

According to (7.47) and (7.50), when the medicine indication (7.40) is satisfied, we can derive that

$$\begin{split} \dot{Y} &\leq -(1-\eta^{2})\lambda\_{m}(\Upsilon)\|\boldsymbol{x}\|^{2} - \eta^{2}\lambda\_{m}(\Upsilon)\|\boldsymbol{x}\|^{2} + 2K\_{u}\|\nu\_{\iota}\|^{2} \\ &\quad - \lambda\_{m}(\delta)\|\tilde{\omega}\|^{2} + b\_{e\_H}^{2} + \sigma \\ &\leq -\eta^{2}\lambda\_{m}(\Upsilon)\|\boldsymbol{x}\|^{2} - \lambda\_{m}(\delta)\|\tilde{\omega}\|^{2} + b\_{e\_H}^{2} + \sigma. \end{split} \tag{7.51}$$

Then it can be concluded that *Y*˙ < 0 when one of the conditions holds that

$$\|x\| > \frac{1}{\eta} \sqrt{\frac{b\_{e\_H}^2 + \sigma}{\lambda\_m(\Upsilon)}},\tag{7.52}$$

or

$$\|\tilde{\omega}\| > \sqrt{\frac{b\_{e\_H}^2 + \sigma}{\lambda\_m(\delta)}}.\tag{7.53}$$

Thus $x$ and $\tilde{\omega}$ are demonstrated to be UUB.

*Case II*: A regulation occurs, i.e., $t = z_\iota$. The difference of $Y$ is presented as

$$\begin{aligned} \Delta Y &= \underbrace{V^\*(\check{\mathbf{x}}\_{\iota+1}) - V^\*(\check{\mathbf{x}}\_\iota)}\_{\Delta Y\_a} + \underbrace{V^\*(\mathbf{x}(z\_\iota^+)) - V^\*(\mathbf{x}(z\_\iota))}\_{\Delta Y\_b} \\ &\quad + \underbrace{\frac{1}{\ell} \tilde{\omega}^T(z\_\iota^+) \tilde{\omega}(z\_\iota^+) - \frac{1}{\ell} \tilde{\omega}^T(z\_\iota) \tilde{\omega}(z\_\iota)}\_{\Delta Y\_c}. \end{aligned} \tag{7.54}$$

From the analysis in *Case I*, it can be derived that $\dot{Y} < 0$ when (7.52) or (7.53) is satisfied. It can be further deduced that $Y_b + Y_c$ is monotonically decreasing for $t \in [z_\iota, z_{\iota+1})$, that is,

$$Y\_b(\mathbf{x}(z\_\iota)) + Y\_c(\mathbf{x}(z\_\iota)) \ge Y\_b(\mathbf{x}(z\_\iota + \epsilon)) + Y\_c(\mathbf{x}(z\_\iota + \epsilon)),\tag{7.55}$$

where $\epsilon \in (0, z_{\iota+1} - z_\iota)$. According to the properties of limits, we can obtain

$$Y\_b(\mathbf{x}(z\_\iota)) + Y\_c(\mathbf{x}(z\_\iota)) \ge Y\_b(\mathbf{x}(z\_\iota^+)) + Y\_c(\mathbf{x}(z\_\iota^+)),\tag{7.56}$$

with $x(z_\iota^+) = \lim_{\epsilon \to 0} x(z_\iota + \epsilon)$. More specifically, it yields that

$$V^\*(\mathbf{x}(\boldsymbol{z}\_l)) + \frac{1}{\ell}\tilde{\omega}^T(\boldsymbol{z}\_l)\tilde{\omega}(\boldsymbol{z}\_l) \ge V^\*(\mathbf{x}(\boldsymbol{z}\_l^+)) + \frac{1}{\ell}\tilde{\omega}^T(\boldsymbol{z}\_l^+)\tilde{\omega}(\boldsymbol{z}\_l^+). \tag{7.57}$$

As *x* is proved to be UUB, it can be obtained that

$$V^\*(\check{\mathbf{x}}\_{\iota+1}) \le V^\*(\check{\mathbf{x}}\_\iota). \tag{7.58}$$

From (7.57) and (7.58), it is derived that $\Delta Y < 0$, which indicates that the constructed Lyapunov function (7.41) is monotonically decreasing at $t = z_\iota$.

**Remark 7.3** The parameter η in (7.40) is the regulation parameter determining the frequency of medicine dosage regulation. A large η shrinks the threshold so that the medicine dosage is regulated frequently, while a small η enlarges it so that regulation occurs rarely. It can be set to an appropriate value according to the clinical data.

**Remark 7.4** Theorem 7.2 indicates that the designed MDRM-based approximate optimal strategy (7.35) can asymptotically stabilize system (7.6). The medicine indication (7.40), the cornerstone of the MDRM, provides a reasonable reference threshold for the therapeutic strategy. When the difference between the current clinical data and the latest reference data exceeds the threshold, the medicine dosage is regulated, and the current data is recorded and utilized as the new reference data in the future. Thus the designed therapeutic strategy can be regulated timely and as necessary according to the medicine indication.

**Remark 7.5** The discount factor is introduced to keep the value function from becoming infinite or infinitesimal as rewards accumulate, and it encodes that an immediate return is worth more than a delayed one. Studies of human decision-making have found a close-to-exponential preference for immediate returns, so the discount factor also mimics this cognitive and biological process in making a decision.
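The boundedness claim of Remark 7.5 is easy to verify numerically: with $\theta > 0$, the value (7.7) of even a persistently nonzero utility stays finite. Here $W \equiv 1$, so the exact value is $1/\theta$; $\theta = 0.5$ is the value used in Sect. 7.5, while the quadrature step and horizon are assumptions.

```python
# Numeric illustration: the discounted integral of W = 1 equals 1/theta.
import math

theta = 0.5
dt = 1e-3
horizon = 60.0   # long enough that the discounted tail is negligible
V = sum(math.exp(-theta * k * dt) * dt for k in range(int(horizon / dt)))
```

The left-endpoint Riemann sum slightly overestimates the integral; refining `dt` drives `V` toward $1/\theta = 2$.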

#### **7.5 Simulation Study**

In this section, we consider the system (7.6) which is the simplified version of the growth dynamics of cells and viruses described by (7.1)–(7.4). Based on system (7.6), the simulation experiment is conducted to show the effectiveness of the proposed ADP method with medicine regulation mechanism.

According to clinical medical statistics and the literature [36, 37, 40], the parameters associated with the dynamics (7.1)–(7.4) are presented in Table 7.1. After the nondimensionalization, the corresponding parameters are set as $a_1 = 0.36$, $a_2 = 0.1$, $b_1 = 0.36$, $b_2 = 0.48$, $b_3 = 0.16$, $c_1 = 0.1278$, $c_2 = 0.2$, $c_3 = 0.036$, $d_1 = 0.6$, and


**Table 7.1** Parameter specifications of the tumor-virus-immune system

**Fig. 7.1** The population evolution of uninfected tumor cells

$d_2 = 0.29$. The initial state vector is $[0.8, 0, 0.2, 0.05]^T$. The minimum and maximum thresholds are given by $u_m = 0$ and $u_M = 0.02$. For the discounted value function (7.7) of system (7.6), the parameters are $\Upsilon = 0.2I_{4\times 4}$ and $\theta = 0.5$.

For the critic network, we select the activation function as $[x_1^2, x_1x_2, x_1x_3, x_1x_4, x_2^2, x_2x_3, x_2x_4, x_3^2, x_3x_4, x_4^2]^T$. The other parameters are respectively set as $K_u = 20$, $\zeta = 0.9$ and $\ell = 1.6$.
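The critic activation used here is the vector of the 10 quadratic monomials of the four states. A small helper, shown as a sketch (the function names are assumptions), builds $\vartheta(x)$ and its $10 \times 4$ Jacobian $\nabla\vartheta$ analytically:

```python
# Quadratic activation vector theta(x) and its Jacobian for the critic NN.

def theta(x):
    x1, x2, x3, x4 = x
    return [x1 * x1, x1 * x2, x1 * x3, x1 * x4,
            x2 * x2, x2 * x3, x2 * x4,
            x3 * x3, x3 * x4, x4 * x4]

def grad_theta(x):
    # row i holds the partial derivatives of theta_i w.r.t. x1..x4
    x1, x2, x3, x4 = x
    return [[2 * x1, 0.0, 0.0, 0.0],
            [x2, x1, 0.0, 0.0],
            [x3, 0.0, x1, 0.0],
            [x4, 0.0, 0.0, x1],
            [0.0, 2 * x2, 0.0, 0.0],
            [0.0, x3, x2, 0.0],
            [0.0, x4, 0.0, x2],
            [0.0, 0.0, 2 * x3, 0.0],
            [0.0, 0.0, x4, x3],
            [0.0, 0.0, 0.0, 2 * x4]]
```

With this basis, $\hat{V} = \hat{\omega}^T\vartheta(x)$ as in (7.34), and the gradient $\nabla\hat{V} = \nabla\vartheta^T\hat{\omega}$ feeds the dose law (7.35).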

Simulation results are demonstrated in Figs. 7.1, 7.2, 7.3, 7.4, 7.5, 7.6 and 7.7. For model (7.5), the evolution trajectories of the states are respectively depicted in

**Fig. 7.2** The population evolution of infected tumor cells

**Fig. 7.3** The population evolution of free oncolytic virus

Figs. 7.1, 7.2, 7.3 and 7.4. From Fig. 7.1, we can observe that under the attacks from oncolytic viruses and immune cells, the population of uninfected tumor cells rapidly declines and settles at a very low stable value after $t = 150$ d. Figures 7.2 and 7.3 reveal the relations between the population of infected tumor cells

**Fig. 7.4** The population evolution of immune cells

**Fig. 7.5** The curves of the therapeutic strategies

**Fig. 7.6** The population evolutions of uninfected tumor cells

**Fig. 7.7** The population evolutions of infected tumor cells

and that of virus particles, which are largely proportional to each other. The immune cells are activated by the uninfected and infected tumor cells to kill tumor cells, which can be observed from Fig. 7.4. The medicine dosage of the derived approximate optimal therapeutic strategy and that of the initial strategy are compared in Fig. 7.5, from which one can see that the dosage of the obtained strategy is obviously less than that of the initial strategy. On the other hand, the input dosages of the two strategies are both constrained by the pre-designed thresholds. This is of great practical significance since excess medicine may well threaten the health of patients and cause a huge overhead. Furthermore, it can be observed that the medicine dosage regulation frequency steps down as the clinical data improves, which means that with the aid of the medicine regulation mechanism, the medicine dosage can be regulated timely and only when necessary. Figures 7.6 and 7.7 present the population curves of the cells and viruses under the derived strategy with different burst sizes of viruses, that is, κ = 2, 5. This verifies that the obtained therapeutic strategy can effectively kill tumor cells with oncolytic viruses of different burst sizes. However, when the parameter κ is large enough, it may cause an oscillation. When the innate immune response is considered, the tumor-virus-immune system becomes very complicated. Though the viruses with large κ try their best to produce more replicas and infect more tumor cells, the reduction of tumor cells deactivates the immune response in the meanwhile. The viruses dominate the dynamics and the warfare between tumor cells and viruses can last a long time such that the oscillation occurs repeatedly. The oncolytic virus has the ability to effectively kill the tumor cells, while the immune response can reduce the killing efficiency of the viruses and block their infections. Furthermore, the activated immune response can eliminate tumor cells as well.
Thus there exists a subtle balance between the viruses and the immune cells which demands a further investigation.

#### **7.6 Conclusion**

A medicine regulation mechanism has been designed such that the constrained virotherapy-based therapeutic strategy can be obtained to eliminate tumor cells, guaranteeing that the medicine dosage is regulated timely and only when necessary. Firstly, a mathematical model is utilized to describe the relations among the uninfected tumor cells, infected tumor cells, oncolytic viruses and immune cells. Meanwhile, based on the simplified version of the tumor-virus-immune model, a non-quadratic function is proposed to formulate the value function and acquire the HJBE. Secondly, to address the optimal therapeutic strategy, a single-critic architecture has been designed to seek the approximate solution of the HJBE through ADP. Finally, the simulation results have verified the effectiveness of the proposed method. Furthermore, nonzero-sum optimal control based on differential games will be a new frontier in the therapy of tumors, cardiovascular diseases, orthodontic treatment, osteoporosis and cerebrovascular diseases.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.