**ICIAM 2019 SEMA SIMAI Springer Series 1**

Tomás Chacón Rebollo · Rosa Donat · Inmaculada Higueras Editors

# Recent Advances in Industrial and Applied Mathematics

## SEMA SIMAI Springer Series

## **ICIAM 2019 SEMA SIMAI Springer Series**

## Volume 1

#### **Editor-in-Chief**

Amadeu Delshams, Departament de Matemàtiques and Laboratory of Geometry and Dynamical Systems, Universitat Politècnica de Catalunya, Barcelona, Spain; Centre de Recerca Matemàtica, Barcelona, Spain

#### **Series Editors**

Francesc Arandiga Llaudes, Departamento de Matemàtica Aplicada, Universitat de València, Valencia, Spain

Macarena Gómez Mármol, Departamento de Ecuaciones Diferenciales y Análisis Numérico, Universidad de Sevilla, Sevilla, Spain

Francisco M. Guillén-González, Departamento de Ecuaciones Diferenciales y Análisis Numérico, Universidad de Sevilla, Sevilla, Spain

Francisco Ortegón Gallego, Departamento de Matemáticas, Facultad de Ciencias del Mar y Ambientales, Universidad de Cádiz, Puerto Real, Spain

Carlos Parés Madroñal, Departamento Análisis Matemático, Estadística e I.O., Matemática Aplicada, Universidad de Málaga, Málaga, Spain

Peregrina Quintela, Department of Applied Mathematics, Faculty of Mathematics, Universidade de Santiago de Compostela, Santiago de Compostela, Spain

Carlos Vázquez-Cendón, Department of Mathematics, Faculty of Informatics, Universidade da Coruña, A Coruña, Spain

Sebastià Xambó-Descamps, Departament de Matemàtiques, Universitat Politècnica de Catalunya, Barcelona, Spain

This sub-series of the SEMA SIMAI Springer Series aims to publish some of the most relevant results presented at the ICIAM 2019 conference held in Valencia in July 2019.

The sub-series is managed by an independent Editorial Board and will contain only peer-reviewed content, including the Invited Speakers volume as well as books resulting from mini-symposia and collateral workshops.

The series aims to provide useful reference material to academics and researchers at an international level.

More information about this subseries at https://link.springer.com/bookseries/16499

Tomás Chacón Rebollo · Rosa Donat · Inmaculada Higueras Editors

# Recent Advances in Industrial and Applied Mathematics

*Editors*

Tomás Chacón Rebollo
Departamento de Ecuaciones Diferenciales y Análisis Numérico & Instituto de Matemáticas (IMUS), Facultad de Matemáticas
Universidad de Sevilla
Sevilla, Spain

Rosa Donat
Departament de Matemàtiques, Facultat de Matemàtiques
Universitat de València
Burjassot (Valencia), Spain

Inmaculada Higueras
Departamento de Estadística, Informática y Matemáticas, Edificio Los Pinos, Campus Arrosadia
Universidad Pública de Navarra
Pamplona (Navarra), Spain

ISSN 2199-3041 ISSN 2199-305X (electronic)
SEMA SIMAI Springer Series
ISSN 2662-7183 ISSN 2662-7191 (electronic)
ICIAM 2019 SEMA SIMAI Springer Series
ISBN 978-3-030-86235-0 ISBN 978-3-030-86236-7 (eBook)
https://doi.org/10.1007/978-3-030-86236-7

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

## **Foreword**

During the second week of July, the ICIAM 2019 Congress took place in Valencia with almost 4,000 participants, 50 plenary talks, more than 300 mini-symposia, 550 contributed talks and 250 posters. A wide representation of world applied mathematics met in Valencia to present and discuss how mathematics is applied to the most diverse disciplines: applied mathematics for industry and engineering, biology, medicine and other natural sciences, control and systems theory, dynamical systems and nonlinear analysis, finance and management science, industrial mathematics, mathematics and computer science, numerical analysis, partial differential equations, and simulation and modeling, to name just a few.

Within the organizing committee, the idea arose that these presentations and discussions should be reflected in some way for the future. And the offer from Springer came up to launch a series of volumes that would record the most notable advances that took place in it.

This offer crystallized in the *ICIAM 2019 SEMA SIMAI Springer Series*, which includes the present volume, dedicated to the lectures of the invited speakers. This volume occupies a special place in the series, since it is offered in open access mode, thanks to the support of the Sociedad Española de Matemática Aplicada (SeMA).

The selection of the 336 mini-symposia of ICIAM 2019 was made by its academic committee. In close relationship with it, the editorial committee of this series was formed by F. Arándiga Llaudes, M. Gómez Mármol, F. Guillén-González, F. Ortegón Gallego, C. Parés, P. Quintela, C. Vázquez-Cendón, S. Xambó-Descamps and myself. The members of this committee were in charge of selecting the proposals, many of them derived from congress mini-symposia, and of acting as the editors in charge of some of the 14 volumes that make up this series:


As can easily be seen, the application of mathematics spreads through the most diverse areas, such as industry, health and energy, engineering, data science, environmental problems, geometric calculi, numerical approximation, traffic flow, education, etc.

Now is the time for the reader to delve into the volumes of this series and learn, reflect, incorporate new ideas and, in general, enjoy their content. We hope that the volumes of this series can serve as a reference for even more innovative applications of mathematics in the future.

Finally, it is time for acknowledgements. These start with the ICIAM 2019 Congress, especially its executive committee, led by Tomás Chacón and Rosa Donat as driving forces of the event, as well as the scientific committee, led by Alfio Quarteroni, and the many organizers of mini-symposia, speakers and attendees. They continue with Francesca Bonadei, the promoter within Springer of this series, and with the members of its editorial board; and they end with the editors in charge and the authors of each volume, who, with their excellent work, are the real creators of the message of this series.

Barcelona, Spain Amadeu Delshams

## **Preface**

The papers appearing in this volume are authored by some of the invited speakers of the **9th International Congress of Industrial and Applied Mathematics**, held in València from July 15 to 19, 2019. This volume is part of a series dedicated to ICIAM 2019-Valencia.

The congress, hosted by the Spanish Society for Applied Mathematics (SeMA), was organized at the Universitat de València (Spain) on behalf of the International Council for Industrial and Applied Mathematics (ICIAM). With 3983 participants from 99 different countries, more than 3400 lectures delivered and nearly 250 poster presentations, ICIAM 2019 was a great success. These figures represent a net increase in participation with respect to an already rising trend in previous editions of this series of events, a sound proof of the growing interest of the applied and industrial mathematics community in ICIAM congresses.

The industrial aspect of the congress was further enriched by organizing a specific activity oriented to mathematical technology transfer: '*The Industry Day*'. Fourteen speakers, selected from a broad representation of different sectors, presented the results of ongoing collaborations with academia and the benefits derived from them, such as better products and services; optimization of processes, organization and accounting; and growth and innovation. In addition, 19 industrial mini-symposia were scheduled during the congress, and 48 'industry-related' posters were on display during '*The Industry Day*'.

Thirty-five satellite events took place during 2018 and 2019, covering a broad range of topics within industrial and applied mathematics. These events included two CIMPA schools (Kenitra, Morocco and Tunis, Tunisia, 2019), devoted to initiating young students from developing countries into research. Also, several Spanish towns/regions were appointed sub-venues of ICIAM 2019-Valencia (Bilbao, Galicia, Málaga, Seville and Zaragoza) and, as such, organized 12 satellite events. We are deeply thankful to the organizers of all satellite events.

The long process involved in planning this complex event began with the preparation of the candidacy in 2012. Our deepest gratitude and heartiest thanks go to all the people who contributed their abilities to create ICIAM 2019-Valencia. A list of all the committees and people involved in this task is given in this book.

The congress could not have been possible without the support of a large set of sponsors. A special mention is due to our main sponsors: Banco Santander, who financed over 70% of the Grant Program of the congress, and the Universitat de València, for its generous offer to make available their facilities to hold the conference. Thanks are also due to the Spanish universities that contributed to fund over 20% of the Grant Program and to the individual donors who contributed to the remaining 5%.

A thankful recognition is also due to our four institutional sponsors: Ministry of Science, Innovation and Universities, Generalitat Valenciana, Diputació de València and Ajuntament de València.

On behalf of ICIAM 2019, we would like to express our most sincere gratitude to the invited speakers who have contributed to this volume, for taking the time to prepare their valuable contributions and helping us to make this the reference publication of the congress.

Tomás Chacón Rebollo (Sevilla, Spain)
Rosa Donat (Valencia, Spain)
Inmaculada Higueras (Pamplona, Spain)

## **ICIAM Congresses**


#### **2019-Valencia**

Opening ceremony

Traditional Valencian dances and Muixeranga (human towers)


Plenary talk

Closing ceremony and transfer of ICIAM flag

## **ICIAM Prize Winners**

#### **ICIAM Collatz Prize**

2019 Siddhartha Mishra, ETH Zurich, Switzerland.


#### **ICIAM Lagrange Prize**

2019 George Papanicolaou, Stanford University, USA.


#### **ICIAM Maxwell Prize**


2019 Claude Bardos, Université Paris Diderot (Paris VII), France.

#### **ICIAM Pioneer Prize**


2019 Yvon Maday, Sorbonne University, Paris, France.

#### **ICIAM Su Buchin Prize**

2019 Giulia Di Nunno, University of Oslo, Norway.



2019 ICIAM Prize Ceremony. From left to right: Joan Ribó (Mayor of Valencia), Ximo Puig (President of the Generalitat Valenciana), Y. Maday, G. Di Nunno, His Majesty King Felipe VI, G. Papanicolaou, C. Bardos, S. Mishra, M. Esteban and Pedro Duque (Minister for Science, Innovation and Universities)

## **Organization of ICIAM 2019-Valencia**

**Congress Director**: Tomás Chacón, University of Seville, Spain

#### **Honorary Committee**

*President*: His Majesty King Felipe VI of Spain

#### *Members*

Mr. Pedro Duque, Minister for Science, Innovation and Universities, Spain
Mr. Ximo Puig, President of Generalitat Valenciana, Spain
Mr. Vicent Marzà, Conseller d'Educació, Investigació, Cultura i Esport of Generalitat Valenciana, Spain
Prof. Josefina Bueno, Directora General d'Universitats of Generalitat Valenciana, Spain
Prof. Julio Abalde, Rector, University of A Coruña, Spain
Prof. José Ángel Narváez, Rector, University of Malaga, Spain
Prof. Nekane Balluerca, Rector, University of the Basque Country, Spain
Prof. José Mora, Rector, Universitat Politècnica de València, Spain
Prof. Antonio López, Rector, University of Santiago de Compostela, Spain
Prof. Miguel Ángel Castro, Rector, University of Seville, Spain
Prof. M. Vicenta Mestre, Rector, Universitat de València, Spain
Prof. Manuel Joaquín Reigosa, Rector, Universidade de Vigo, Spain
Prof. José Antonio Mayoral, Rector, University of Zaragoza, Spain
Mrs. Ana Botín, President of Banco Santander

## **Scientific Program Committee**

#### *Chair*

Alfio Quarteroni, EPFL, Lausanne, Switzerland, and Politecnico di Milano, Italy

#### *Members*

Tony F. Chan, Hong Kong, China
Manuel Doblaré Castellano, Seville, Spain
Qiang Du, New York, USA
Enrique Fernández Cara, Seville, Spain
Irene Fonseca, Pittsburgh, USA
Irene Gamba, Austin, USA
Markus Hegland, Canberra, Australia
Ilse Ipsen, Raleigh, USA
Ravi Kannan, Bangalore, India
Claudia Kluppelberg, Munich, Germany
Karl Kunisch, Graz, Austria
Yasumasa Nishiura, Sendai, Japan
Benoit Perthame, Paris, France
Daya Reddy, Rondebosch, South Africa
Claudia Sagastizabal, Rio de Janeiro, Brazil
Jeffrey Saltzman, Waltham, USA
Wil Schilders, Eindhoven, Netherlands
Endre Suli, Oxford, UK
Eric Vanden Eijnden, New York, USA
Pingwen Zhang, Beijing, China

### **Executive Committee**

*Chair*

Tomás Chacón (US)

*Co-Chairs*

Rosa Donat (UV) Luis Vega (UPV/EHU)

*Members*

María Paz Calvo (UVA)
Eduardo Casas (UC)
Amadeu Delshams (UPC)
Henar Herrero (UCLM)
Inmaculada Higueras (UPNA)
Juan Ignacio Montijano (UNIZAR)
Peregrina Quintela (USC)
Carlos Vázquez-Cendón (UDC)
Elena Vázquez-Cendón (USC)

#### **Thematic Committees**

#### **Academic**

*Chair*: Amadeu Delshams (UPC)

Lino Álvarez-Vázquez (UVIGO)
Rafael Bru (UV)
Fernando Casas (UJI)
Eduardo Casas (UC)
Enrique Fernández-Nieto (US)
Javier de Frutos (UVA)
Dolores Gómez-Pedreira (USC)
Jesús López-Fidalgo (UNAV)
Pep Mulet (UV)
Francisco Ortegón-Gallego (UCA)
Francisco Padial (UPM)
Carlos Vázquez-Cendón (UDC)

#### **Finance**

*Chair*: Eduardo Casas (UC)

Antonio Baeza (UV)
Luis Alberto Fernández (UC)
Julio Moro (UC3M)
Carlos Vázquez-Cendón (UDC)

#### **Fundraising**

Carlos Vázquez-Cendón (UDC)
Jesús Sanz-Serna (UC3M)

#### **Industrial Advisory**

*Chair*: Peregrina Quintela (USC)

Emilio Carrizosa (US)
David Pardo (UPV/EHU)
Antonio Huerta (UPC)
Carlos Parés (UMA)
Wenceslao González-Manteiga (USC)

#### **Communication and Outreach**

*Chair*: Henar Herrero (UCLM)

Sergio Blanes (UPV)
Fernando Casas (UJI)
Bartomeu Coll (UIB)
Inmaculada Higueras (UPNA)
Juan Ignacio Montijano (UNIZAR)
Alfred Peris (UPV)
Francisco Ortegón-Gallego (UCA)
Francisco Pla (UCLM)
Joan Solá-Morales (UPC)
Sebastià Xambó-Descamps (UPC)

#### **Publications and Promotions**

*Chair*: Inmaculada Higueras (UPNA)

Rafael Bru (UPV)
María Paz Calvo (UVA)
Domingo Hernández-Abreu (ULL)
Henar Herrero (UCLM)
Mariano Mateos (UNIOVI)
Julio Moro (UC3M)

#### **Satellite and Embedded Meetings**

*Chair*: María Paz Calvo (UVA)

Francisco Guillén-González (US)
Carlos Parés (UMA)
Luis Rández (UNIZAR)
Carlos Vázquez-Cendón (UDC)
Luis Vega (UPV/EHU)

#### **Travel Support Committee**

*Chair*: Elena Vázquez-Cendón (USC)

Macarena Gómez-Mármol (US)
José Manuel González-Vida (UMA)
Pep Mulet (UV)
Francisco Javier Sayas (U. of Delaware)
Rodrigo Trujillo-González (ULL)

#### **Local Arrangements**

*Chair*: Rosa Donat (UV)

José María Amigó (UMH)
Francesc Aràndiga (UV)
Ana María Arnal (UJI)
Antonio Baeza (UV)
Sergio Blanes (UPV)
Rafael Bru (UPV)
Fernando Casas (UJI)
Cristina Chiralt (UJI)
Rafael Cantó (UPV)
José Alberto Conejero (UPV)
Isabel Cordero-Carrión (UV)
Cristina Corral (UPV)
Juan Carlos Cortés (UPV)
María Teresa Gassó (UPV)
Olga Gil-Medrano (UV)
Alicia Herrero (UPV)
Leila Lebtahi (UV)
María del Carmen Martí (UV)
Vicente Martínez (UJI)
José Mas (UPV)
José Salvador Moll (UV)
Francisco Gabriel Morillas-Jurado (UV)
Pep Mulet (UV)
Mari Carmen Perea (UMH)
Rosa Peris (UV)
Alfred Peris (UPV)
Sergio Segura de León (UV)
Ana María Urbano (UPV)
Pura Vindel (UJI)

#### **Acronyms of Spanish Universities**



#### **Collaborators at the Universitat de València**

#### **M. Vicenta Mestre, Rector**

#### **Rector's Cabinet**

Justo Herrera, Vice-rector
Juan Vte. Climent, Manager
Beatriz Gómez, Vice-manager
José Ramírez, Vice-manager
Joan Enric Úbeda, Director
Carmen Fayos, Head of Staff

#### **Facultat de Psicologia**

M. Dolores Sancerni, Dean
Juan M. Rausell, Administrator
Juan J. Cancio, Coordinator
Concierges of the Building

#### **Facultat de Filosofia i Ciències de l'Educació**

Rosa M. Bo, Dean
Francisco J. Moreno, Administrator
Esther Bolinches, Coordinator
Concierges of the Building

#### **Health, Safety and the Environment Service**

M. José Vidal, Head of Staff
Miguel A. Toledo, Technician
Verónica Saiz, Technician
Vicente Caballer, Technician

#### **Computer Service**

Fuensanta Doménech, Head of Staff
Faustino Fernández, IT Infrastructure
Magdalena Ros, Quality Control

#### **Blasco Ibáñez Campus Management Unit**

Carmen Tejedo, Administrator
Dolores Cano, Head of Staff
M. Ángeles Llorens, Head of Staff
Inmaculada Yuste, Administrative
M. José Ballester, Services Coordinator
Maria Luisa Jordán, Concierge
Concierges of Aularios I, III and VI

#### **Facultat de Medicina i Odontologia**

Francisco J. Chorro, Dean
M. Vicenta Alandi, Administrator
Guillermo Pérez, Coordinator
Concierges of the Building

#### **Facultat de Filologia, Traducció i Comunicació**

Amparo Ricós, Dean
Francisca Sánchez, Administrator
Josep M. Valldecabres, Coordinator
Concierges of the Building

#### **Facultat de Geografia i Història**

Josep Montesinos, Dean
Joaquín V. Lacasta, Administrator
Josep Vicó, Coordinator
Concierges of the Building

#### **UVSports Service**

Vicent Añó, Director
M. Paz Molina, Administrator
Francisco Vicent, Coordinator
Francisco Barceló, Concierge

#### **Technical and Maintenance Service**

Rosa M. Mochales, Head
Rafael Antón, Technician
M. Dolores Yagüe, Technician
Jorge Vila, Technician
Ramón Doménech, Maintenance
Modesto Ramírez, Maintenance
Carles Aguado, Maintenance
Diego Cantero, Maintenance

#### **UVdisability Service**

M. Celeste Asensi, Director
Restituto Vaño, Accessibility

#### **Technical Unit**

Luis Juaristi, Head of Staff
Vicente Tarazona, Technician
José M. Zapata, Technician

#### **Collaborators at the University of Seville**

Teresa Ayuga

#### **Staff from External Partners**

Jesús Ibáñez, Security Director, UV
Luis Briz and Security Staff from *Clece Security*
Concierges from *UTE Blasco Ibáñez*
Staff from *Grupo Fissa, Cleaning Company*; Amparo Cuadrado, Coordinator
Antonio Gonzalbo and Maintenance Staff from *Ferrovial*
Alexandre Andrés and Carlos J. Soler from Valnu
Ana Mª Gómez and Gardening Staff from *Special Employment Center IVASS*

## **Opening Ceremony**

#### **Tomás Chacón Rebollo, Congress Director**

Your Majesty, President of the Region of Valencia, Minister of Science, Innovation and Universities, Mayor of Valencia, President of ICIAM, respected guests and delegates: on behalf of the Spanish Society for Applied Mathematics and the organizing committee, it is a pleasure for me to convey to you our warmest welcome to the ICIAM 2019-Valencia Congress.

Mathematics is silently shaping the present technological world. It provides deep insight into countless processes and systems, thereby advancing scientific knowledge. It also generates added value in virtually all economic sectors. On top of that, recent years have witnessed a change of paradigm, as mathematics directly provides the technological basis of emerging sectors related to data analysis.

Research and transfer in mathematics have developed rapidly in Spain, as in all sciences, since the last decades of the twentieth century; Spain occupies today the 7th world position in mathematical research by citations. Mathematics also plays a relevant role in the Spanish economy; in fact, 10% of the national gross income and 6% of employment are directly due to its use in economic activity.

The ICIAM 2019 Congress features 27 invited talks, the 5 ICIAM prizes, the Olga Taussky-Todd Lecture and the Public Lecture. It counts nearly 2000 talks as well as 250 posters. It also includes three special panels of great interest for understanding the social framework in which our job as mathematicians takes place, and an Industry Day, where 14 technological companies have agreed to present how mathematics helps to improve their production processes. This is industry talking about mathematics, instead of mathematicians talking about their collaborations with industry.

Thanks to four different funding programs, we have been able to offer over 230 scholarships to young researchers as well as to researchers coming from developing countries. In addition, we have implemented a volunteer program with over 170 young students who will greatly help the organization of the congress.

All this has been possible thanks to the collaborative work of the scientific program committee, chaired by Prof. Alfio Quarteroni, and an enthusiastic organizing committee. I convey my deepest thanks to all of them. Special thanks are addressed to the Spanish Society for Applied Mathematics and its president, Prof. Rosa Donat, who also chairs the local organizing committee. Let me also acknowledge the role of our families, for their support all along the organization of the congress.

We are indebted to ICIAM for trusting us to organize this congress, and especially to its past and present presidents, Profs. Barbara Keyfitz and Maria Esteban, for their help and advice in the organization process. We also address our deepest thanks to the many organizations that have sponsored the congress: the Spanish Government, the Region of Valencia, the Diputació de València, the City Council and the University of Valencia, Spanish centers, departments and institutes of mathematics, Springer Publishing House, Santander Bank and the many individual donors. We are also indebted to SIAM for embedding their annual meeting in this ICIAM Congress, and to all of you for organizing and participating in the many activities that take place within it.

You find yourself at the perfect time and place to learn about new mathematical tools, exchange ideas and move ahead in the thrilling challenge of shaping the world with mathematics.

Welcome to the ICIAM 2019-Valencia Congress!

## **Maria J. Esteban, President of ICIAM**

His Majesty the King, President of the Generalitat of Valencia, Mayor of Valencia, Minister of Science, Innovation and Universities, Congress Director, ladies and gentlemen, dear colleagues,

It is my great honor and pleasure to welcome you all to ICIAM 2019, the ninth International Congress on Industrial and Applied Mathematics.

The ICIAM congresses are the main event organized by our international organization, a network of more than 50 learned societies. The global ICIAM community covers many countries and all topics related to the applications of mathematics to the real world: to industry, to health, to the economy, to climate, to artificial intelligence and so on. Mathematics is unavoidable in the development of new technologies and in the advancement of our societies. As the recent report on the impact of mathematics on the Spanish economy shows, investing in mathematics is a very good idea, because the economic returns are high. This was also apparent in similar impact studies carried out previously in the UK, the Netherlands and France.

This congress is the occasion when worldwide applied and industrial mathematicians show to each other what they have done in the past years and what they plan to do next. During these days, we will prepare the future.

Spain was chosen six years ago to organize this big congress, the main event in our community, which takes place only every four years. In 2015, we were in Beijing, and in 2023, we will be in Tokyo. Here today, in the beautiful city of Valencia, we host more than 4000 mathematicians from all over the world: junior and senior, students, professors, researchers and engineers. During these six years, our Spanish colleagues have worked nonstop to make this congress a big success. In the name of the whole ICIAM community, let me thank the organizers for their huge effort. Thank you very much to the Spanish Society of Applied Mathematics (SEMA) and to the whole Spanish applied mathematics community. Thanks also to all official Spanish institutions that have offered their support.

And now, to all of you who are eager to see how the congress will develop, I wish you a productive week. Just be patient and courageous, because the program of the congress is very heavy, but this is the only way to show the whole span of our community's work in only five days. I thank you all for being here, and I wish you a great congress and a very pleasant week!

## **ICIAM 2019 in Numbers**

#### **Scope**


## **3983 Registered Delegates (Geographical Distribution)**

Percentage of participants per country

## **Number of Talks and Posters by Topic**


Number of mini-symposia talks, contributed talks and posters by topic



#### **Satellite Meetings**

#### **Bilbao**


#### **Galicia**


#### **Sevilla**


#### **Málaga**

– NumHyp 2019—Numerical Approximation of Hyperbolic Systems with Source Terms and Applications, June 17–21, 2019.

#### **Zaragoza**


#### **Other Satellite Meetings**

– European Workshop on High Order Numerical Methods for Evolutionary PDEs: Theory and Applications (HONOM 2019), April 1–5, 2019, Madrid, Spain.

## **Contents**

#### **Invited Lectures**



## **Editors and Contributors**

#### **About the Editors**

**Tomás Chacón Rebollo** is a full professor at the Department of Differential Equations and Numerical Analysis and the Institute of Mathematics of the University of Seville (IMUS). He holds Ph.D. degrees in Mathematics and in Numerical Analysis from the universities of Seville and Paris 6, respectively. His scientific interests are numerical and reduced-order modeling in fluid mechanics and their applications to real-world problems. He is interested in the promotion of mathematical research and transfer. He was the director of BCAM (2012–2013) and IMUS (2015–2019) and chairman of the ICIAM 2019 Congress.

**Rosa Donat** is a professor at the Department of Mathematics of the University of Valencia. She has worked on numerical methods for hyperbolic conservation laws and systems and on multiresolution and subdivision frameworks that incorporate nonlinear approximation techniques. She was actively involved in the Spanish Society for Applied Mathematics (SeMA), serving as its president in the 2016–2020 period.

**Inmaculada Higueras** is a full professor at the Department of Statistics, Computer Science and Mathematics of the Public University of Navarre (Pamplona, Spain). Her research focuses on time stepping methods for differential problems (ODEs, DAEs and PDEs). She has worked on numerical stability, numerical preservation of qualitative properties (positivity, monotonicity, contractivity, etc.) and on the design and implementation of robust and efficient numerical schemes.

## **Contributors**

**Marsha Berger** Courant Institute, New York University, New York, NY, USA

**Alfredo Bermúdez** Departamento de Matemática Aplicada, Instituto de Matemáticas, Universidade de Santiago de Compostela, Santiago de Compostela, Spain;

Instituto Tecnológico de Matemática Industrial (ITMATI), Santiago de Compostela, Spain

**Zhenning Cai** Department of Mathematics, National University of Singapore, Singapore, Singapore

**Huangxin Chen** School of Mathematical Sciences and Fujian Provincial Key Laboratory on Mathematical Modeling and High Performance Scientific Computing, Xiamen University, Fujian, China

**Albert Cohen** Laboratoire Jacques-Louis Lions, Sorbonne Université, Paris, France

**Carlos Conca** Departamento de Ingeniería Matemática, Facultad de Ciencias Físicas y Matemáticas, Centro de Modelamiento Matemático UMR 2071 CNRS-UChile, Centro de Biotecnología y Bioingeniería, Universidad de Chile, Santiago, Chile

**Wolfgang Dahmen** Mathematics Department, University of South Carolina, Columbia, SC, USA

**Ron DeVore** Department of Mathematics, Texas A&M University, College Station, TX, USA

**Leah Edelstein-Keshet** University of British Columbia, Vancouver, BC, Canada

**Yuwei Fan** Department of Mathematics, Stanford University, Stanford, CA, USA

**Maria Garzon** Department of Applied Mathematics, University of Oviedo, Oviedo, Spain

**Naohiro Horio** Department of Cardiovascular Surgery, Okayama University Hospital, Okayama, Japan

**Viet Q. H. Huynh** Advanced Institute for Materials Research, Tohoku University, Aobaku, Sendai, Japan

**Kristin Lauter** Cryptography and Privacy Research, Microsoft Research, Redmond, USA

**Claude Le Bris** Ecole des Ponts and Inria, Paris, France

**Haitao Leng** School of Mathematical Sciences, South China Normal University, Guangzhou, Guangdong, China

**Ruo Li** CAPT, LMAM and School of Mathematical Sciences, Peking University, Beijing, People's Republic of China

**Koki Otera** Graduate School of Environmental and Life Sciences, Okayama University, Okayama, Japan

**Kazue Sako** Waseda University, Tokyo, Japan

**Robert I. Saye** Mathematics Group, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

**James A. Sethian** Department of Mathematics, University of California, Berkeley, California, USA

**Hiroshi Suito** Advanced Institute for Materials Research, Tohoku University, Aobaku, Sendai, Japan

**Kenji Takizawa** Faculty of Science and Engineering, Waseda University, Shinjuku City, Japan

**Takuya Ueda** Department of Diagnostic Radiology, Tohoku University Hospital, Sendai, Japan

**Dong Wang** School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Guangdong, China

**Xiao-Ping Wang** Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China

**J. A. C. Weideman** Department of Mathematical Sciences, Stellenbosch University, Stellenbosch, South Africa

# **Invited Lectures**

## **Asteroid-Generated Tsunamis: A Review**

**Marsha Berger**

**Abstract** We study the ocean waves caused by an asteroid airburst located over the ocean. The concern is that such waves could damage distant coastal cities. Simple qualitative analysis suggests that the wave energy is proportional to the ocean depth and to the strength and speed of the blast. Computational simulations using GeoClaw and the shallow water equations show that explosions from realistic asteroids do not endanger distant cities. We explore the validity of the shallow water, Boussinesq and linearized Euler equations for modeling these water waves.
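For reference, the shallow water model named in the abstract takes the following standard form in one space dimension, with water column height $h$, depth-averaged velocity $u$, gravitational acceleration $g$ and bathymetry $b$ (this is the textbook form, not an equation quoted from the chapter):

```latex
\begin{aligned}
\partial_t h + \partial_x (hu) &= 0, \\
\partial_t (hu) + \partial_x\!\left(h u^2 + \tfrac{1}{2} g h^2\right) &= -\,g\,h\,\partial_x b .
\end{aligned}
```

The Boussinesq and linearized Euler models mentioned alongside it retain, respectively, dispersive correction terms and vertical structure that the shallow water system neglects.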

#### **1 Introduction**

This talk will review some of the basics behind the simulation of asteroid-generated tsunamis, and how this piece of the Asteroid Threat Assessment Program (ATAP) got its start.

In 1994, the United States Congress asked NASA to identify 90% of asteroids larger than 1 km in diameter that could pose a threat to Earth. This led to the Near-Earth Object (NEO) observations program, which catalogued the objects and tried to determine their characteristics. In 2005, NASA's mission was expanded to track near Earth objects greater than 140 m in diameter. Obviously the largest dinosaur-killing asteroids are the most dangerous. However, the question arises: how small does an asteroid have to be before we don't have to worry about it? Little is known about asteroids smaller than 140 m in diameter, and whether they are safe to ignore. What if one exploded over an ocean? Could it generate a tsunami that would change it from a regional to a more global hazard, one that would threaten coastal populations far away?

As it turns out, in February 2013 an approximately 20-m asteroid exploded about 15 miles above the ground over Chelyabinsk, Russia. This airburst provided an unprecedented opportunity for data collection. Teams of scientists visited, collected

M. Berger (B)

With many thanks to my collaborators Michael Aftosmis, Jonathan Goodman and Randy LeVeque.

Courant Institute, New York University, 251 Mercer St., New York, NY 10012, USA e-mail: berger@cs.nyu.edu

© The Author(s) 2022

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_1

**Fig. 1** Airburst reports from April 1988 to December 2019. Figure taken from https://cneos.jpl.nasa.gov/fireballs

samples of the meteor to determine its composition, analyzed dashboard cameras from Russian cars to determine the trajectory and energy deposition, canvassed the region to see how far away windows broke (evidence of the blast overpressure), etc. [15]. In other words, data was collected that could be used for model validation. The ATAP project started shortly thereafter.

A reader might wonder how often such airbursts really occur. Figure 1 shows that in fact airbursts happen quite regularly. Since most of the world's surface is water, an investigation into airburst-generated tsunamis seems warranted.

In this talk I will focus only on simulations of smaller asteroids that explode before hitting the ground. There is very little literature on the effects of these airbursts. There is some literature on simulations of larger asteroids that do reach the ocean, and sometimes reach the ocean floor [4, 17, 18]. Impact simulations are generally performed using hydrocodes that simulate material deformation and failure, multimaterial phase changes (e.g. water turns into vapor and rises through the atmosphere), sediment excavation from the ocean floor, shock waves traveling through water, etc. A nice discussion can be found in the chapter by Gisler in [6]. These are very expensive calculations, so they tend to be axisymmetric to reduce cost, including the bathymetry.<sup>1</sup> Asteroid impact simulation is a dynamic area that is receiving a lot of recent attention [12, 13, 16].

In the next section we will present our simulations using the shallow water equations modeled with the GeoClaw software package, and describe how GeoClaw was adapted to model asteroid airbursts. We will review our analysis of a model problem that helps explain the simulation results. However, it turns out that airburst-generated tsunamis have smaller length scales than earthquake-generated tsunamis. Hence we will turn to the linearized Euler equations to bring in the effects of compressibility and dispersion. It will turn out that dispersion is a much more important

<sup>1</sup> Bathymetry is underwater topography.

factor at the length scales and pressures of interest, and luckily the shallow water equations seem to overestimate the effect. We will conclude that airburst-generated tsunamis do not pose a global threat. This was the conclusion reached by all participants in the joint NASA-NOAA tsunami workshop in 2016 using a variety of codes and test problems, summarized in [11].

#### **2 Simulations of Airburst-Generated Tsunamis**

#### *2.1 Background*

The simulations we first present use the open-source software package GeoClaw [9]. GeoClaw solves the depth-averaged shallow water equations on bathymetry. It uses a second order finite volume scheme with a robust Riemann solver to deal with wetting and drying [5]. Adaptive mesh refinement is essential for trans-oceanic wave propagation in which coastal inundation must also be resolved. GeoClaw uses patch-based mesh refinement, allowing resolution in deep water with grid cells on the order of kilometers, and on land on the order of meters. Other issues such as well-balancing (an ocean at rest on non-flat bathymetry stays at rest), and a well-balanced and conservative algorithm for adding and removing patches, are also part of GeoClaw. Desktop-level parallelism using OpenMP has also been implemented. There is no data from asteroid-generated tsunamis to use for benchmarking. We mention however that GeoClaw has undergone many benchmarking studies for earthquake-generated tsunamis, most extensively in 2011 [7]. This set of benchmarks was performed to allow GeoClaw to be used in hazard assessment work funded by the U.S. National Tsunami Hazard Mitigation Program.

The shallow water equations can be derived from the incompressible irrotational Euler equations using the long wavelength scaling, by assuming the ratio $\epsilon = h/L \ll 1$. Here, *h* is the depth of the water and *L* is the length scale of interest. This scaling leads to the conclusion that the velocity of the water in the *z* direction only enters at $O(\epsilon)$, and the horizontal velocities are constant in the vertical direction to $O(\epsilon^2)$. Eliminating the need to compute the vertical velocity reduces the three-dimensional simulation to a much more affordable calculation using only the horizontal velocities *u* and *v*.

Ordinarily the pressure only appears as a gradient in the shallow water equations, allowing the value for the pressure itself to be set arbitrarily. In our simulations however we will need to match the pressure at the top of the water column with the atmospheric pressure produced by the asteroid blast wave. Re-deriving the shallow water equations and retaining the pressure produces the following set of equations for simulation:

$$\begin{aligned} h\_t + (hu)\_x + (hv)\_y &= 0\\ (hu)\_t + \left(hu^2 + \frac{1}{2}gh^2\right)\_x + (huv)\_y &= -ghB\_x - \frac{h}{\rho\_w}(p\_e)\_x - Du\\ (hv)\_t + (huv)\_x + \left(hv^2 + \frac{1}{2}gh^2\right)\_y &= -ghB\_y - \frac{h}{\rho\_w}(p\_e)\_y - Dv \end{aligned} \tag{1}$$

The other terms in (1) are *g*, gravity; $p_e$, the external atmospheric pressure at the water surface; and $\rho_w = 1025$ kg/m³, the density of salt water. $B(x, y)$ is the bathymetry (underwater topography, or depth of the ocean floor). Note that the pressure forcing appears in a non-conservative form, as does the bathymetry. In these equations, a flat ocean would have $h(x, y) = -B(x, y)$. This is often described using the water elevation $\eta(x, y) = h + B$, where sealevel is $\eta(x, y) = 0$. In these equations we have neglected the Coriolis force (often considered unimportant for tsunami propagation). The term $D = g M^2 \sqrt{u^2 + v^2} / h^{1/3}$ is the drag, which is important in numerical simulations that include inundation. $M = 0.025$ is the Manning coefficient, which we take to be constant.
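The right-hand side of (1) is simple to evaluate pointwise. The sketch below (an illustration, not GeoClaw's actual code) assembles the momentum source terms for one cell, given finite-difference approximations of the bathymetry and external-pressure gradients:

```python
import numpy as np

# Illustrative sketch: pointwise evaluation of the momentum source terms on the
# right-hand side of Eq. (1). Constants follow the text; the gradient arguments
# are assumed to be supplied by the caller (e.g. centred differences).
G = 9.81          # gravity, m/s^2
RHO_W = 1025.0    # density of salt water, kg/m^3
MANNING = 0.025   # Manning coefficient M (taken constant)

def momentum_sources(h, u, v, dB_dx, dB_dy, dpe_dx, dpe_dy):
    """Return the source terms (S_hu, S_hv) of the momentum equations in (1)."""
    drag = G * MANNING**2 * np.sqrt(u**2 + v**2) / h**(1.0 / 3.0)  # D
    s_hu = -G * h * dB_dx - (h / RHO_W) * dpe_dx - drag * u
    s_hv = -G * h * dB_dy - (h / RHO_W) * dpe_dy - drag * v
    return s_hu, s_hv

# Flat bottom, no pressure forcing, water at rest: both sources vanish.
print(momentum_sources(3000.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
```

Note the drag is quadratic in speed, so it only matters in shallow, fast-moving water near shore.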

To simulate the equation set (1), the external pressure must be known. This is obtained from detailed simulations of an asteroid entering the earth's atmosphere at a given speed, angle, and material composition, performed by others in the ATAP project [1]. The asteroid deposits its energy in the atmosphere, causing a blast wave. The ground pressure $p_e(x, y)$ is extracted from these simulations, and the width and amplitude of a Friedlander profile, an idealized blast wave shape, are fit to the data. This functional form is then used in the simulations for the pressure forcing. For simplicity we use a radially symmetric source term corresponding to a vertical entry angle for the asteroid. (We have also performed anisotropic simulations, with no change to our conclusions.) The blast wave in these simulations travels at 391.5 m/s, which we take to be constant. This is somewhat faster than the speed of sound in air.

Figure 2 shows a typical profile. A Friedlander profile has a characteristic width that describes the distance from the leading shock to the ensuing underpressure. Figure 2 is used in the simulations as follows: At a given time *t* in the simulation, each grid point needs to evaluate the atmospheric pressure. If the leading blast wave travels at speed *s* = 391.5 m/s, then at time *t* it has travelled a distance *d* = 391.5 × *t* meters. If the grid point is farther than *d* from the initial location of the blast wave there is no change to the ambient pressure. If it is less, the pressure profile is evaluated at that distance away and fed to the solver. The blue curve in Fig. 2 shows the profile at 50 s. The amplitude of the overpressure at that time is approximately 100% of ambient pressure. It is zero ahead of the blast, and decays as it gets closer to blast center. These values are used in Eq. (1).
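The lookup just described can be sketched in a few lines. In the sketch below, the single-term Friedlander shape, the peak overpressure `p_max`, and the profile width are stand-in values for illustration; the actual fit in [1] uses a sum of decaying exponentials:

```python
import numpy as np

S_BLAST = 391.5  # blast-wave propagation speed, m/s (taken constant)

def friedlander(d_behind, width):
    """Stand-in Friedlander shape: relative overpressure as a function of
    distance behind the leading shock. Positive at the shock, then crossing
    into the characteristic underpressure. Illustration only."""
    return (1.0 - d_behind / width) * np.exp(-d_behind / width)

def blast_overpressure(r, t, p_max=4.5e5, width=10e3):
    """Overpressure at distance r (m) from the blast center at time t (s):
    zero ahead of the front, otherwise the profile evaluated at the distance
    behind the front. p_max (Pa) and width (m) are hypothetical values."""
    d_front = S_BLAST * t      # distance the leading shock has travelled
    if r > d_front:
        return 0.0             # ambient: the blast has not yet arrived
    return p_max * friedlander(d_front - r, width)

# At t = 50 s the front is at ~19.6 km; a point beyond it sees no overpressure.
print(blast_overpressure(r=25e3, t=50.0))  # 0.0
```

Each grid cell calls such a lookup once per time step, so the pressure forcing adds negligible cost to the solver.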

The simulation in Fig. 2 resulted from a 250MT asteroid. This roughly corresponds to a meteor with a 200 m diameter entering the atmosphere with a speed of 20 km/s. Note that the maximum overpressure of the airburst is approximately 450%. (Explosions are measured in MT (megatons) of TNT, relating the destructive power to that of an equivalent mass of TNT; the same measure is used to quantify nuclear bombs.) For comparison, the explosion of Mount Saint Helens was estimated to be 25–35MT. The largest volcanic explosion ever recorded was Mount Tambora, which was approximately 10–20 Gt and caused global climate change and mass destruction. The airburst over Chelyabinsk was approximately 520 KT. The Tunguska event, the largest airburst of the previous century, is now thought to be about 15–20MT.

**Fig. 2** A typical blast wave profile is drawn at two times. The amplitude is fit with a sum of decaying exponentials and the profile is scaled to get the pressure forcing at a given time. This functional form is then used in numerical simulations

We point out that the length scales of the Friedlander profiles are significantly shorter than those of earthquake-generated tsunamis, which are typically on the order of 50–100 km. We will come back to this point in Sect. 3.

#### *2.2 Analytical and Computational Results for Shallow Water Equations*

In [2], we propose and analyze a one-dimensional model problem that helps describe the results seen in our simulations. The model problem first assumes that the pressure disturbance is a traveling wave and then builds on this to solve the problem where the pressure disturbance starts impulsively at time zero. Of course the actual pressure disturbance is a decaying function that will generate further waves as it changes amplitude, but the initial waves are the strongest and most important.

When the pressure pulse from the airburst hits the water, it causes two distinct waves with two different wave speeds. One will be related to the pressure pulse with speed *sb*, and the other is the gravity wave, moving with speed *sg*. What we call the *response wave* is an instantaneous disturbance of the sea surface that is in direct response to the amplitude of the moving pressure pulse and that propagates at the same speed, *sb* = 391.5 m/s (this is called η above, but we change notation here to indicate it is a response to the pressure forcing).

Our analysis shows the following relationship between the response wave and the pressure disturbance *pe*:

$$h\_r = \frac{h\_0 p\_e}{\rho\_w (s\_b^2 - s\_g^2)}\tag{2}$$

In (2), $h_0$ is the undisturbed height of the water (i.e. when $\eta = 0$). This shows that the response wave is stronger in deeper water (almost linearly, since $s_g$ depends on $h_0$ too). For 4.5 times atmospheric pressure, at a depth of 3 km, the response wave would have an initial height of approximately 10.8 m. This amplitude decays rapidly as the blast wave weakens. Note that this response wave has *positive* amplitude, since $p_e > 0$ and $s_b > s_g$. This is counterintuitive, since one would think that pushing on water would lower its height. With hurricanes, the air pressure disturbance is negative, and hurricanes travel slower than water waves, so again the water height increases, but there it is more intuitive.
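The 10.8 m figure follows directly from Eq. (2); a few lines reproduce it:

```python
import math

# Worked number from Eq. (2): h_r = h0 * p_e / (rho_w * (s_b^2 - s_g^2))
g, rho_w = 9.81, 1025.0
h0 = 3000.0               # ocean depth, m
p_e = 4.5 * 101325.0      # 4.5 atm of overpressure, in Pa
s_b = 391.5               # blast-wave speed, m/s
s_g = math.sqrt(g * h0)   # gravity-wave speed, ~171 m/s

h_r = h0 * p_e / (rho_w * (s_b**2 - s_g**2))
print(f"response wave height: {h_r:.1f} m")  # ~10.8 m, matching the text
```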

There are also *gravity waves*, which move at the slower speed $s_g = \sqrt{gh}$. When $h = 3000$ m, this gravity wave moves at approximately 171 m/s, less than half the speed of the response wave. The initial gravity waves generated can also be estimated by linearizing the model problem and solving the homogeneous equation to get:

$$h(x, t) = h\_r(x - s\_b t) - \left(\frac{s\_b}{s\_g} + 1\right) \frac{h\_r(x - s\_g t)}{2} + \left(\frac{s\_b}{s\_g} - 1\right) \frac{h\_r(x + s\_g t)}{2} \tag{3}$$

The first term in (3) is the response wave traveling at blast wave speed *sb*, and the next two are the gravity waves moving to the right and left with speed *sg*. We see that their amplitude is also a function of the amplitude of the response wave.
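Equation (3) is easy to sanity-check numerically: at $t = 0$ the three terms cancel exactly, so the sea surface starts undisturbed, and at later times the leading term carries the response-wave amplitude. The sketch below uses an illustrative Gaussian response profile (the 5 km width is a hypothetical choice, not a value from the paper):

```python
import numpy as np

# Evaluate the linearized superposition of Eq. (3) for an illustrative
# Gaussian response-wave profile h_r.
g, h0, s_b = 9.81, 3000.0, 391.5
s_g = np.sqrt(g * h0)

def h_r(x):
    return 10.8 * np.exp(-0.5 * (x / 5e3)**2)   # hypothetical 5 km-wide pulse

def eta(x, t):
    r = s_b / s_g
    return (h_r(x - s_b * t)
            - (r + 1) * h_r(x - s_g * t) / 2
            + (r - 1) * h_r(x + s_g * t) / 2)

x = np.linspace(-50e3, 50e3, 1001)
print(np.abs(eta(x, 0.0)).max())   # ~0 (machine precision): terms cancel at t = 0
print(eta(s_b * 100.0, 100.0))     # ~10.8: the response wave at the front
```

The coefficients $-(s_b/s_g + 1)/2$ and $(s_b/s_g - 1)/2$ sum with the leading 1 to zero, which is why the cancellation at $t = 0$ is exact.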

We next show results from two simulations at different distances from shore and ocean depths. More details on these particular simulations are in [2]. The first set of simulations is located off the coast of Westport, Washington. This area has been well-studied because of its proximity to the earthquake-prone M9 Cascadia subduction zone. The blast was located 180 km from shore, about 30 km from the continental shelf, and the ocean was 2575 m deep underneath the blast. Figure 3 shows the region of interest.

Figure 4 shows 3 snapshots at intervals of 25 s after the blast wave. A black circle is drawn indicating the location of the blast; the red just inside the circle is the response wave, and further interior to the circle are the gravity waves. Note that the leading gravity wave is a depression (negative amplitude). Contours of the bathymetry from −1000 to −100 m are drawn to show the location of the continental shelf. Although the colorbar scale is from −1 to 1, the response wave height near the blast is over 10 m.

Figure 5 shows a zoom of the waves approaching shore (2000 s), about to hit the peninsula (3000 s), and mostly reflecting (4000 s), with some smaller waves entering Grays harbor. Note that the landscape is better resolved as the waves approach,

**Fig. 3** The first set of simulations has the blast located 180 km offshore from Westport, in 2575m deep water, indicated by the purple star. The zoom shows the region of interest studied for inundation

**Fig. 4** Westport simulations at intervals of 25 s after the blast. The waves are spreading symmetrically around the blast center. The largest wave is over 10m at the start

**Fig. 5** Selected times as gravity waves approach Westport coastline. The zooms cover a changing region closer and closer to shore. No inundation is observed. Note the colorbar scale is a factor of 5 smaller than in the figure above

indicating that the refinement level has increased. The wave amplitudes have greatly decreased, and no inundation is observed. Note that the colorbar scale (in units of meters) has been reduced by a factor of 5 in these later plots.

Since the first set of results did not show any inundation despite such a large blast, the second set puts the blast much closer to shore. We locate the blast 30 km off the coast of Long Beach, California, an area with a lot of important infrastructure. Figure 6 shows the topography. The water at the center of the blast is 797 m deep.

Figure 7 shows 3 snapshots at intervals of 25 s after the blast wave. Several features are evident. The black circle, which indicates the location of the blast wave at that time, no longer coincides with the leading elevation of the response wave (the red contours). This is because the topography becomes more shallow as the blast wave approaches Catalina Island, so its instantaneous amplitude has decreased, as expected

**Fig. 6** The second set of simulations has the blast located 30 km from Long Beach, in 797m deep water, indicated with the red dot. The zoom shows the region of interest studied for inundation

**Fig. 7** First row shows computed solution for Long Beach simulation at intervals of 25 s after the blast. The black circle indicates the location of the blast wave in air. Bottom row shows zooms near shore at two later times

from Eq. (2). Also notice that the blast wave in the atmosphere jumps over the island, and the response wave reappears when the blast is again over water. Once again we see that the gravity waves are mostly a depression.

With this proximity to shore, the blast wave has not greatly decayed before it hits shore. The blast wave itself, and not the ensuing tsunami, would be the more important cause of casualties and damage. The zooms in Fig. 7 have more refinement than at the earlier times. The breakwater is now resolved, and water only approaches shore through the breakwater gaps or around the edges. But since the port infrastructure is two meters high, there is still no flooding. A very tiny bit of flooding is seen along the river (not visible in these plots).

We performed a number of additional simulations in a variety of locations, bathymetries, and asteroid strengths, including one with one Gt of energy. We have not found any examples where airbursts caused significant onshore inundation. However, in the next section we examine whether the shallow water equations are an appropriate model for airburst-generated tsunamis, and compare the previous results with similar analyses and computations using the linearized Euler equations.

#### **3 The Linearized Euler Equations**

As reviewed earlier, the shallow water equations are a long wavelength approximation to the full 3D equations. Since the length scales of the Friedlander profile are on the order of 10 km, the ratio of water depth to length scale is not that small in a 4 km ocean. Closer to shore the shallow water equations may be more appropriate. The length scales are also important in determining the effect of dispersion, which is not present in the shallow water equations.
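A quick back-of-the-envelope check, using the textbook linear dispersion relation for water waves (standard theory, not a result from this paper), shows how far a 10 km wavelength in a 4 km ocean is from the long-wavelength limit:

```python
import math

# Finite-depth phase speed c(k) = sqrt(g*tanh(k*h)/k) versus the
# shallow-water speed sqrt(g*h), for a 10 km wavelength in a 4 km ocean.
g, h, L = 9.81, 4000.0, 10e3
k = 2 * math.pi / L
c_disp = math.sqrt(g * math.tanh(k * h) / k)   # dispersive phase speed
c_sw = math.sqrt(g * h)                        # shallow-water (kh -> 0) limit
print(f"kh = {k * h:.2f}, c/c_sw = {c_disp / c_sw:.2f}")
```

Here $kh \approx 2.5$, far from the $kh \ll 1$ regime, and the phase speed is only about 63% of the shallow-water speed, so dispersive effects cannot be ignored at these scales.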

To examine this more closely, we compare the results from the previous section using the shallow water equations with those from the linearized Euler equations. This brings in the effects of both compressibility and dispersion. The latter equations have the advantage that the free surface boundary condition of the full Euler equations becomes a simple boundary condition when linearized, so the free water surface and the atmosphere do not have to be tracked or computed. Unfortunately it does require that the vertical direction be discretized along with the two horizontal directions, and so is much more expensive than a depth-averaged equation set.

#### *3.1 Analytical and Computational Results for Linearized Euler*

Again, we first review the results from [2] for our model traveling wave problem but for the linearized Euler equations (which are also derived there). Unlike the shallow water equations, which do not have any dependence on wave length, there is such a dependence in the Euler equations. We first present results for a single frequency *k*, where the length scale *L* = 2π/*k*. We then apply our results to a function with many frequencies. Finally we show some preliminary results of radially symmetric simulations confirming the model problem conclusions.

If we denote the external pressure forcing by $p_e(m) = A_k e^{ikm}$, where $m = x - s_b t$ is the traveling wave variable in our model problem, we can compute the response coefficients as a function of wave number $k$ and amplitude $A_k$, i.e. $h_r(m) = \hat{h}_r e^{ikm}$, and similarly for the velocity $u$ and now the vertical velocity $w$ too. The traveling wave

**Fig. 8** Comparison of response wave amplitudes as a function of length scale for the shallow water and linearized Euler equations. These were evaluated for a 4 km deep ocean, and 1 atm overpressure. At smaller length scales the dominant difference is due to dispersion, not to compressibility

problem can no longer be solved exactly, but can be evaluated numerically. In Fig. 8, we evaluate the solution to the model problem using an ocean depth of 4 km, and an amplitude of 1 atmosphere for the overpressure. We take the speed of sound in water $c_w = 1500$ m/s, and density $\rho_w = 1025$ kg/m³. Figure 8 also evaluates the results for an artificially faster speed $c_w = 10^8$ m/s, in order to approach the incompressible limit.

The green curve in Fig. 8 is the shallow water amplitude of the response wave. It is constant, since as expected there is no dependence on wave number. We can also compute the nonlinear response, which is done in [2] and overlays the linearized response. The blue curve is the linearized Euler result using the real sound speed of water. This does not appear to approach the shallow water curve. The red curve uses the artificially larger sound speed $c_w = 10^8$ m/s; it approaches the incompressible limit and does approach the shallow water curve, giving us more confidence in the results. The difference between the linearized Euler curve and the shallow water curve is roughly 10%. We are calling this the effect due to compressibility. However, at the length scale of interest for airburst-generated tsunamis, the difference between the curves is over a factor of 2. We conclude that dispersion is a much more important effect.

Figure 8 showed the amplitude response due to a single frequency pressure perturbation. In Fig. 9 we evaluate the response to a Gaussian pressure pulse $p_e(m) = \exp(-0.5\,(m/5)^2)$ that includes all frequencies. We take the Fourier transform, multiply each frequency by the Fourier multiplier shown in Fig. 8, and transform back, so this is still a *static* response. The left figure shows results in 4 km deep water, and the right in 1 km deep water. Again we see that compressibility accounts for a smaller portion of the height difference between shallow water and linearized Euler results
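The transform-multiply-invert procedure is a standard FFT exercise. The paper's linearized-Euler multiplier is not reproduced here; as a stand-in for dispersion, the sketch below substitutes the textbook finite-depth gravity-wave speed $\sqrt{g \tanh(kh_0)/k}$ for $s_g$ in Eq. (2), which recovers $\sqrt{gh_0}$ as $k \to 0$. The qualitative conclusion matches Fig. 9: the shallow water response overestimates the dispersive one.

```python
import numpy as np

# FFT the Gaussian pulse, multiply each mode by a k-dependent response
# coefficient, transform back. The dispersive multiplier below is a stand-in
# (Eq. (2) with a finite-depth s_g(k)), not the paper's Euler multiplier.
g, rho_w, h0, s_b = 9.81, 1025.0, 4000.0, 391.5

x = np.linspace(-200e3, 200e3, 4096, endpoint=False)
p_e = 101325.0 * np.exp(-0.5 * (x / 5e3)**2)      # 1 atm Gaussian pulse

k = 2 * np.pi * np.fft.fftfreq(x.size, d=x[1] - x[0])
kk = np.where(k == 0.0, 1e-12, np.abs(k))          # avoid division by zero
sg2 = g * np.tanh(kk * h0) / kk                    # dispersive s_g(k)^2
mult = h0 / (rho_w * (s_b**2 - sg2))               # Eq. (2), per wavenumber

eta = np.fft.ifft(mult * np.fft.fft(p_e)).real    # dispersive static response
eta_sw = h0 * p_e / (rho_w * (s_b**2 - g * h0))   # constant SW response

print(eta_sw.max(), eta.max())   # shallow water peak exceeds dispersive peak
```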

**Fig. 9** Comparison of responses to a Gaussian pressure pulse in 4 km deep water (left) and 1 km deep water (right)

than dispersion. Note also that the Euler results have broadened, an indication of dispersion. The results in shallower water match better, as expected. Luckily, in all cases the shallow water results overestimate the response including compressibility and dispersion.

Finally, in Figs. 10 and 11, taken from [3], we show snapshots from time dependent simulations with the 250Mt airburst and compare linearized Euler (denoted AG, for acoustic with gravity, in the legends), shallow water, and two different Boussinesq<sup>2</sup> models [8, 14]. We thank Popinet for the use of Basilisk in simulations using the Serre-Green-Naghdi (SGN) set of equations, and Jiwan Kim for the use of BoussClaw, which uses the Madsen-Sørensen equation set [10].

We first show results in a 4 km deep flat ocean, then a 1 km deep one. Note that the scales are not the same in the two figures. Also, since the tsunami travels more slowly in shallower water, we show those results only every 100 s. Note that the leading shallow water gravity wave is a depression in both simulations. Also note that the two Boussinesq simulations agree with each other better than with the linearized Euler runs. The SGN simulation is in two space dimensions and plotted as a function of radius, hence is much noisier than the other simulations, which were one-dimensional radially symmetric computations. We point out that Boussinesq waves decay inversely proportionally to the distance traveled, whereas shallow water waves decay inversely to the square root of distance. Finally, all 4 codes show the same response wave behavior as an elevation in sealevel, albeit with different magnitudes.

We do not think that the depth-averaged equations are suitable for simulating the initiation of gravity waves, since there is significant variation in the vertical velocity. It does seem that depth-averaged equations can be used to propagate the waves, once

<sup>2</sup> Generally speaking, the Boussinesq equations keep the next term in the long-wavelength expansion for the shallow water equations. They are depth-averaged, but much more complicated than shallow water since they include dispersive terms with third order derivatives. We do not describe them further.

**Fig. 10** Comparison of initial generation of airburst tsunami using all 4 models in a 4 km deep ocean. Selected frames every 50 s. After 300 s, the SGN and BoussClaw results match linearized Euler in the leading gravity wave, but not (yet) the rest. The SWE model does not generate gravity waves that match at any of the times

initiated by a higher fidelity simulation. This has been demonstrated in [3]. We do not yet know how this translates into shoreline inundation. Preliminary evidence indicates that the shallow water model provides an overestimate of run-up due to airbursts, as it did in predicting the height of the response wave, but we need more evidence for this hypothesis.

**Fig. 11** Comparison of airburst generated tsunamis using all 4 models in a 1 km deep ocean. Selected frames every 100 s. After about 200 s, SGN and BoussClaw match the linearized Euler results in the leading gravity wave, and by 400 s, the next few waves are very similar, though the amplitude is not quite right. The shallow water model still has very different waves

#### **4 Conclusions**

We have presented several numerical simulations of the shallow water equations in response to a 250Mt airburst. The results are further explained using a traveling wave model problem, for both the shallow water and linearized Euler equations. All results show that there is no significant water response (in either the response wave or the gravity waves) to the airburst. The most serious danger from an airburst would be the blast itself, for those close enough to the blast center, rather than the water waves it generates.

We also found that because of the shorter wavelengths of an airburst, the shallow water equations do not provide an accurate simulation of the propagation of these waves, compared to simulations using Boussinesq or linearized Euler models. However, it may be possible to use the shallow water equations to give an estimate of shoreline inundation. This is a matter for future study.

**Acknowledgements** Many thanks to my collaborators on this project: Michael Aftosmis, Jonathan Goodman and Randy LeVeque. This work was supported in part by a subcontract with Science and Technology Corporation (STC) under Contract NNA16BD60C, and by the Bay Area Environmental Research Institute (BAERI) under NASA Contract NNX16AO96A as part of the Asteroid Threat Assessment Project (ATAP).

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Some Case Studies in Environmental and Industrial Mathematics**

**Alfredo Bermúdez**

**Abstract** This presentation deals with four case studies in environmental and industrial mathematics developed by the mathematical engineering research group (mat+i) from the University of Santiago de Compostela and the Technological Institute for Industrial Mathematics (ITMATI). The first case involves environmental fluid mechanics: optimizing the location of submarine outfalls on the coast. This work, related to the shallow water equations with variable depth, led us to develop a theory for the numerical treatment of source terms in nonlinear first order hyperbolic balance laws. More recently, these techniques have been applied to solve the Euler equations with source terms arising in the numerical simulation of gas transportation networks when topography, via the gravity force, is considered in the model. The last two problems concern electromagnetism. One of them is related to nondestructive testing of car parts by using magnetic nanoparticles (the so-called magnetic particle inspection, MPI): mathematical modelling of magnetic hysteresis to simulate demagnetization. Finally, we present a mathematical procedure to reduce the computing time needed to reach the stationary state of an induction electric machine when using transient numerical simulation.

#### **1 Introduction**

Four case studies developed by the Research Group in Mathematical Engineering from the University of Santiago de Compostela (USC) and the Technological Institute for Industrial Mathematics (ITMATI) are considered. Two of them are related to fluid mechanics. The first one was developed in the framework of a contract with the Ministry of Public Works of Galicia and concerns shallow water flows in a domain with variable depth. The second one deals with gas flow in transport networks and has

A. Bermúdez (B)

Departamento de Matemática Aplicada, Instituto de Matemáticas, Universidade de Santiago de Compostela, Lope Gómez de Marzoa s/n, 15782 Santiago de Compostela, Spain e-mail: alfredo.bermudez@usc.es

Instituto Tecnológico de Matemática Industrial (ITMATI), Rúa de Constantino Candeira, 15705 Santiago de Compostela, Spain

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_2

been done for the Reganosa company. From the mathematical point of view both are modelled with systems of nonlinear hyperbolic partial differential equations with source terms and the goal is to set up suitable finite volume discretization of the source terms.

The other two case studies concern electromagnetism. The goal of the first one, which has been financed by the CIE Automotive company, is the numerical simulation of magnetization and demagnetization processes in magnetic particle inspection procedures. Finally, the last case study is related to the numerical solution of electric machines, with optimal design in view. The underlying mathematical problems are, respectively, the mathematical and numerical analysis of models for electromagnetic hysteresis, and methods to determine appropriate initial conditions for transient electromagnetic simulations, in order to attain the steady state as soon as possible.

#### **2 Environmental Flows. The Shallow Water Equations**

The technical goal of this work, commissioned by the Galician government to our research team in the eighties, was to determine the optimal location of submarine outfalls along the coast of the Galician *rias*. For this purpose several steps were done involving modelling, simulation and optimal control:


Regarding the first step, as the shallow water equations form a nonlinear system of hyperbolic partial differential equations, numerical methods developed in the eighties of the last century for the Euler equations can be applied to their numerical solution, namely finite volume methods combined with approximate Riemann solvers. The unexpected problem we found was related to the discretization of the source term which is present in the shallow water equations when the bottom is not flat. In order to give some insight we refer to Fig. 1: we have solved the shallow water equations by using a finite volume scheme with the van Leer Q-scheme as approximate Riemann solver for flux term upwinding, and a *centred* scheme to discretize the source term arising from the non-flat bottom. We have considered a static configuration in a closed channel; more precisely, the initial condition (and hence the solution for all time) corresponds to water at rest. In the left plot one can see the computed water level, which is a fairly good approximation. However, the right plot shows the computed velocity, which varies between around −60 and 80 m/s while the exact velocity is null.

**Fig. 1** Shallow water. Centred discretization of the source term. Computed water level (left) and computed velocity (right). Notice that the zero line is the result of a numerical simulation using [10]

Motivated by this problem, in [10] we developed a general methodology to discretize source terms in nonlinear systems of first-order hyperbolic partial differential equations. In particular, our methods solve the previous static problem exactly. This paper is considered a seminal work in the theory of well-balanced schemes for the numerical solution of conservation laws with source terms, an active field of research in recent years. Moreover, thirty years later, this methodology was applied by our research group to a different problem: the Euler equations with gravity, more specifically, the numerical simulation of gas transport networks on non-flat topography.
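The failure of a centred source discretization on the water-at-rest test, and how a well-balanced treatment cures it, can be reproduced in a much smaller toy configuration. The sketch below is not the Q-scheme upwinding of [10]; it uses a Rusanov flux and, for the well-balanced variant, the hydrostatic reconstruction of Audusse et al. as a stand-in, with all parameter values assumed:

```python
import numpy as np

g = 9.81

def flux(h, q):
    """Physical flux of the 1D shallow water equations (mass, momentum)."""
    return np.array([q, q * q / h + 0.5 * g * h * h])

def rusanov(hL, qL, hR, qR):
    """Rusanov (local Lax-Friedrichs) approximate Riemann solver."""
    a = max(abs(qL / hL) + np.sqrt(g * hL), abs(qR / hR) + np.sqrt(g * hR))
    return 0.5 * (flux(hL, qL) + flux(hR, qR)) - 0.5 * a * np.array([hR - hL, qR - qL])

def step(h, q, b, dx, dt, well_balanced):
    """One explicit finite-volume step; ghost cells by constant extrapolation."""
    he, qe, be = (np.concatenate(([v[0]], v, [v[-1]])) for v in (h, q, b))
    hn, qn = h.copy(), q.copy()
    for i in range(len(h)):
        c = i + 1                                   # index in extended arrays
        if well_balanced:
            # hydrostatic reconstruction of the interface water depths
            bR, bL = max(be[c], be[c + 1]), max(be[c - 1], be[c])
            hcm = max(0.0, he[c] + be[c] - bR)      # left state at i+1/2
            hrp = max(0.0, he[c + 1] + be[c + 1] - bR)
            hlm = max(0.0, he[c - 1] + be[c - 1] - bL)
            hcp = max(0.0, he[c] + be[c] - bL)      # right state at i-1/2
            u = qe / he
            FR = rusanov(hcm, hcm * u[c], hrp, hrp * u[c + 1])
            FL = rusanov(hlm, hlm * u[c - 1], hcp, hcp * u[c])
            # interface corrections restoring exact balance for water at rest
            FR[1] += 0.5 * g * (he[c] ** 2 - hcm ** 2)
            FL[1] += 0.5 * g * (he[c] ** 2 - hcp ** 2)
            src = 0.0
        else:
            FR = rusanov(he[c], qe[c], he[c + 1], qe[c + 1])
            FL = rusanov(he[c - 1], qe[c - 1], he[c], qe[c])
            # centred discretization of the bottom-slope source term
            src = -g * he[c] * (be[c + 1] - be[c - 1]) / (2.0 * dx)
    # update the conserved variables (h, q = h*v)
        hn[i] = he[c] - dt / dx * (FR[0] - FL[0])
        qn[i] = qe[c] - dt / dx * (FR[1] - FL[1]) + dt * src
    return hn, qn

# Water at rest over a bump: the exact solution keeps v = 0 for all time.
N, Ldom = 50, 10.0
dx = Ldom / N
x = (np.arange(N) + 0.5) * dx
b = 0.5 * np.exp(-((x - 5.0) ** 2))     # non-flat bottom (assumed shape)
h0, q0 = 2.0 - b, np.zeros(N)
dt = 0.4 * dx / np.sqrt(g * 2.0)
for wb in (False, True):
    h, q = h0.copy(), q0.copy()
    for _ in range(100):
        h, q = step(h, q, b, dx, dt, wb)
    # spurious velocities appear only in the centred run
    print("well-balanced" if wb else "centred", np.max(np.abs(q / h)))
```

The centred run develops a nonzero velocity field purely from the discretization imbalance, while the well-balanced run preserves the rest state to rounding error, mirroring the behaviour in Fig. 1.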

#### **3 Gas Network Simulation**

This industrial demand from the Reganosa company consisted of writing a software code for the transient numerical simulation of a gas transport network. In Fig. 2 the high-pressure Spanish gas network is shown. Besides the large number of pipes, it includes entry (emission) and exit (consumption) points, underground storages and, more importantly, compression stations. The latter are needed to compensate the pressure drop along the network due to viscous friction of the gas on the pipe walls.

#### *3.1 Mathematical Modelling: Homogeneous Gas Flow in a Pipe*

**Fig. 2** Spanish gas transport network

The mathematical model for gas flow in a pipe consists of the Navier-Stokes equations for compressible flow. More precisely, it involves the mass, momentum and energy conservation laws together with some additional equations: the state equation for real gases and the Darcy-Weisbach law for turbulent friction between the gas and the pipe walls, combined with the Colebrook equation to compute the friction factor. As the pipe length is much larger than its diameter, we can use a 1D model:

$$\begin{aligned}
\frac{\partial\rho}{\partial t}(x,t)+\frac{\partial(\rho v)}{\partial x}(x,t)&=0,\\
\frac{\partial(\rho v)}{\partial t}(x,t)+\frac{\partial(\rho v^{2}+p)}{\partial x}(x,t)&=\underbrace{-\frac{\lambda\,\rho(x,t)}{2D}\,|v(x,t)|\,v(x,t)}_{\text{friction}}\ \underbrace{-\,g\,\rho(x,t)\,h'(x)}_{\text{gravity force}},\\
\frac{\partial(\rho E)}{\partial t}(x,t)+\frac{\partial\bigl((\rho E+p)v\bigr)}{\partial x}(x,t)&=\underbrace{-\,g\,\rho(x,t)\,v(x,t)\,h'(x)}_{\text{power of gravity force}}+\underbrace{\alpha\,\frac{4}{D}\,\bigl(\theta_{ext}(x,t)-\theta(x,t)\bigr)}_{\text{heat exchange}}.
\end{aligned}$$

Thermodynamic equation of state: $p=Z(\theta,p)\,\rho R\theta$.

Caloric equation of state: $e=E-\frac{1}{2}|v|^{2}$ with
$$e=\hat{e}(\theta)=\hat{e}(\theta_0)+\int_{\theta_0}^{\theta}c_v(s)\,ds.$$


#### *3.2 Numerical Solution: One Single Pipe with Homogeneous Gas*

Numerical methodology for solving the compressible Euler equations for homogeneous mixtures of perfect gases without sources has been well established since the eighties of the last century. For instance, one can use a simple first-order method consisting of the explicit Euler scheme for time discretization, a finite volume method for space discretization, and approximate Riemann solvers (e.g., van Leer's Q-scheme) for upwind discretization of the flux term (see, for instance, [24]). However, when source terms are present (e.g., the gravity term with variable height), the numerics are more difficult and, as for the shallow water equations, the use of well-balanced schemes is mandatory. This means that the discretization of the source terms also needs some upwinding. In recent years many papers devoted to the numerical solution of the Euler equations with gravity have been written; let us mention, for instance, [13–15, 23, 25, 27].

In order to highlight the need for an upwind discretization of the source terms, we consider the following very simple test problem: $h(x)$ in the gravity source term is an arbitrary function and we look for a static isothermal solution, i.e., one satisfying $v(x)=0$, $\theta(x)=\theta_{ext}$, $\forall x\in(0,L)$. It is easy to see that the exact solution is given by
$$v(x)=0,\qquad \rho(x)=\rho_0\exp\left(-\frac{g}{R\theta_{ext}}\bigl(h(x)-h_0\bigr)\right),\qquad p(x)=R\theta_{ext}\,\rho_0\exp\left(-\frac{g}{R\theta_{ext}}\bigl(h(x)-h_0\bigr)\right).$$
For the data given in Table 1, the computed mass flow rate is shown in Fig. 3, together with the exact solution, which is null. One can see that the former is very poor, oscillating between around −10 and 10.
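That this density profile indeed balances gravity can be verified symbolically; a minimal sketch with SymPy, assuming an ideal gas (compressibility factor $Z\equiv 1$) so that $p=R\theta\rho$:

```python
import sympy as sp

x = sp.symbols("x")
g, R, theta, rho0, h0 = sp.symbols("g R theta_ext rho_0 h_0", positive=True)
h = sp.Function("h")(x)                 # arbitrary height profile

# exact static isothermal density and pressure (ideal gas, Z = 1)
rho = rho0 * sp.exp(-g * (h - h0) / (R * theta))
p = R * theta * rho

# static momentum balance: dp/dx + g * rho * h'(x) must vanish
residual = sp.simplify(sp.diff(p, x) + g * rho * sp.diff(h, x))
print(residual)   # -> 0
```

The residual vanishes identically for any $h(x)$, which is exactly the balance a well-balanced scheme must reproduce at the discrete level.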

By using the general methodology developed in [10], we proposed in [7] a discretization of the gravity term leading to a well-balanced scheme that reproduces the null solution exactly.


**Table 1** Data for static isothermal test

**Fig. 3** Mass flux (kg/(m² s)). Computed with a centred discretization of the source terms (black) and exact solution (red). The horizontal axis is the distance to the origin of the pipe

#### *3.3 Network with Heterogeneous Gas*

Simulation of a heterogeneous gas flowing in a network is more difficult, and new problems arise: junction modelling and gas quality simulation. These issues have been addressed in [8, 9].

#### *3.4 Experimental Validation in a Real Small Network*

The code has been used for a small gas network and the results have been compared to real measurements. The network can be seen in Fig. 4.

Topography is quite irregular as can be seen in Fig. 5. Results and measurements corresponding to mass flow rate and pressure for some particular nodes are shown in Figs. 6 and 7, respectively.

#### **4 Non-destructive Testing: Magnetic Particle Inspection (MPI)**

**Fig. 4** The Reganosa network (Galicia, Spain)

**Fig. 5** Height function for edge #9

MPI is a non-destructive testing technique to detect near-surface defects in ferromagnetic pieces. The process is as follows: firstly, the workpiece is magnetized. Then, the presence of a surface discontinuity in the material allows the magnetic flux to leak, since air cannot support as much magnetic field per unit volume as metals. In order to identify a leak, ferrous particles, either dry or in a wet suspension, are applied to the workpiece. They are attracted to an area of flux leakage and form what is called an indication, which is evaluated to determine its nature. Since cracks are most easily detected when they are perpendicular to the induced field, two magnetizations are performed: circular and longitudinal. After inspection, a final demagnetization step is required for subsequent processing of the workpiece. In the next subsection we introduce an axisymmetric model for circular magnetization and present some numerical results (Figs. 8 and 9). Further details can be found in Refs. [2, 4–6].

**Fig. 6** Mass flow rate at node **01A**. Blue: real measurement. Red: computed with a homogeneous gas model. Green: computed with a heterogeneous gas model

**Fig. 7** Pressure at node **I-015**. Blue: real measurement. Red: computed with a homogeneous gas model. Green: computed with a heterogeneous gas model

**Fig. 8** Magnetic particle inspection

**Fig. 9** Crack indication. Circular magnetization. Longitudinal magnetization

**Fig. 10** Circular magnetization

#### *4.1 Circular Magnetization. Axisymmetric Model*

Let us introduce a mathematical model for circular magnetization. Thanks to axisymmetry, it can be written on a meridional section (see Fig. 10).

Given *I*(*t*), the magnetizing or demagnetizing current, and an initial condition *H*0, find *H*<sup>θ</sup> in Ω × (*t*0, *T* ] such that

$$\begin{aligned}
\frac{\partial B_{\theta}}{\partial t}\,\mathbf{e}_{\theta}+\mathbf{curl}\left(\frac{1}{\sigma}\,\mathbf{curl}(H_{\theta}\mathbf{e}_{\theta})\right)&=\mathbf{0} &&\text{in } \Omega\times(t_0,T],\\
H_{\theta}(0,z,t)&=0 &&\text{on } (0,L)\times(t_0,T],\\
H_{\theta}(R_S(z),z,t)&=\frac{I(t)}{2\pi R_S(z)} &&\text{on } (0,L)\times(t_0,T],\\
\frac{\partial H_{\theta}}{\partial z}(\rho,z,t)&=0 &&\text{on } (\Gamma_1\cup\Gamma_2)\times(t_0,T],\\
H_{\theta}(\rho,z,t_0)&=H_0(\rho,z) &&\text{in } \Omega.
\end{aligned}$$

and

$$B_{\theta}(\mathbf{x},t)=\mathcal{B}\bigl(H_{\theta}(\mathbf{x},\cdot),\xi(\mathbf{x})\bigr)(t),$$

where $\mathcal{B}$ is a scalar *hysteresis operator* to be defined later.

#### *4.2 Hysteresis Modelling*

Mathematical modelling of hysteresis is now a well established subject (see, for instance, the reference books [11, 12, 17–19, 26]). Let us summarize the main issues of the theory. We consider a system whose state is characterized by two scalar variables, *u* and w, which are assumed to depend continuously on time *t*. In our case *u* = *H*<sup>θ</sup> and w = *B*<sup>θ</sup> . The value of w(*t*) is determined by *u*(*t*) and by the values of *u*(τ ) for τ < *t*. Let us introduce some basic definitions and notations (Fig. 11).

At any instant *t*, w(*t*) depends on the previous evolution of *u*, and on an initial state of the system to be called ξ . We can formalize this as follows:

$$w(t) = \mathcal{F}(u, \xi)(t) \quad \forall t \in [0, T].$$

**Fig. 11** Hysteresis major and minor loops

**Fig. 12** Preisach triangle (left) and an example of Preisach function (right)

Here *F*(·,ξ) represents an operator between suitable spaces of time-dependent functions. Notice that *F* is non-local in time. A particular example of hysteresis operator is the Preisach operator:

$$\begin{aligned}
\mathcal{F}:\mathcal{C}^0([0,T])\times Y&\longrightarrow\mathcal{C}^0([0,T]),\\
[\mathcal{F}(u,\xi)](t)&:=\int_{\mathcal{T}}[h_{\rho}(u,\xi(\rho))](t)\,p(\rho)\,d\rho,
\end{aligned}$$

where $\mathcal{T}$ is the Preisach triangle, $0<p\in L^1(\mathcal{T})$ is the Preisach function, which is determined by physical experiments for each material (see Fig. 12), $h_{\rho}$ is the relay operator (see Fig. 13) and $\xi:\mathcal{T}\to\{-1,1\}$ is a Borel measurable function representing the initial magnetic state.

The classical Preisach model is built with the so-called rate-independent relay: let us fix any pair $\rho:=(\rho_1,\rho_2)\in\mathbb{R}^2$, $\rho_1<\rho_2$. For any continuous function $u:[0,T]\to\mathbb{R}$ and any $\xi\in\{-1,1\}$, we define $h_{\rho}(u,\xi)$ as follows.

Let $t_1<\cdots<t_N<t$ be the instants such that $u(t_i)\in\{\rho_1,\rho_2\}$. If $\{t_i\}=\emptyset$ or $t=0$, then

$$h_{\rho}(u,\xi)(t):=\begin{cases}-1&\text{if } u(t)\le\rho_1,\\ \xi&\text{if } \rho_1<u(t)<\rho_2,\\ 1&\text{if } u(t)\ge\rho_2,\end{cases}$$

else

$$h_{\rho}(u,\xi)(t):=\begin{cases}1&\text{if } u(t_N)=\rho_2,\\ -1&\text{if } u(t_N)=\rho_1.\end{cases}$$

If we split $\mathcal{T}=S_u^{+}(t)\cup S_u^{-}(t)$, where

$$S_u^{\pm}(t)=\bigl\{(\rho_1,\rho_2)\in\mathcal{T}:[h_{\rho}(u,\xi)](t)=\pm 1\bigr\},$$

**Fig. 13** Classical relay operator

**Fig. 14** Input *u*(*t*) (left) and its corresponding splitting of Preisach triangle (right)

then

$$[\mathcal{F}(u,\xi)](t):=\int_{S_u^{+}(t)}p(\rho)\,d\rho-\int_{S_u^{-}(t)}p(\rho)\,d\rho.$$
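The construction above discretizes naturally: replace the Preisach triangle by a finite grid of relays and the integral by a weighted sum. The sketch below is a minimal illustration (grid size, saturation limits and the uniform Preisach function are all assumed, not taken from any measured material):

```python
import numpy as np

class Preisach:
    """Discretized scalar Preisach operator: a uniform grid of relays
    (rho1, rho2) with rho1 < rho2, each weighted by a Preisach function p."""

    def __init__(self, n=60, lo=-1.0, hi=1.0, p=None):
        r = np.linspace(lo, hi, n)
        self.r1, self.r2 = np.meshgrid(r, r, indexing="ij")
        self.tri = self.r1 < self.r2                  # Preisach triangle
        self.p = np.ones_like(self.r1) if p is None else p(self.r1, self.r2)
        self.xi = -np.ones_like(self.r1)              # initial relay states

    def __call__(self, u):
        """Feed one input sample u; return the weighted average of the relays."""
        self.xi[u >= self.r2] = 1.0                   # relays that switch up
        self.xi[u <= self.r1] = -1.0                  # relays that switch down
        t = self.tri
        return float((self.xi[t] * self.p[t]).sum() / self.p[t].sum())

# History dependence: the output at u = 0 differs between the ascending
# branch (coming from negative saturation) and the descending branch.
up = Preisach()
w_up = [up(u) for u in np.linspace(-1.0, 0.0, 101)][-1]
down = Preisach()
for u in np.linspace(-1.0, 1.0, 101):
    down(u)
w_down = [down(u) for u in np.linspace(1.0, 0.0, 101)][-1]
print(w_up, w_down)   # w_up < w_down: the two branches of a hysteresis loop
```

Because the relays are rate-independent, only the sequence of local extrema of the input matters, which is why the same input value can yield two different outputs depending on the past.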

We present some results obtained by solving the above model for a real crankshaft (see Fig. 14 for the input data). Figure 15 shows the remanent magnetization after the circular magnetization process. In turn, Fig. 16 shows the applied demagnetization current and the remanent magnetization after demagnetizing.

#### **5 Accelerated Simulation of Electric Machines**

**Fig. 15** Remanent magnetization

**Fig. 16** Demagnetization current (left) and remanent magnetization after demagnetizing (right)

In the design of electric machines (see Fig. 17), numerical simulation is an important tool. The engineer needs to know the behaviour of the machine in the steady regime; in particular, the torque. In order to get this steady state, finite element methods are used to solve a transient nonlinear system of PDEs derived from Maxwell equations, coupled with electrical circuit equations, starting from an (arbitrary) initial condition until the steady state is reached. The time for this transient model to attain the steady state depends strongly on the choice of the initial condition. When an inappropriate value is prescribed (for instance, when it is set to zero), a very long CPU time is needed to reach the steady-state solution. Therefore, techniques leading to a suitable initial condition are in high demand, and in the literature we can find several approaches to the problem. Let us mention, for instance, *time periodic finite element methods* [21], *time periodic-explicit error correction methods* [16], *time differential correction* [20], and *parareal algorithms* [22]. A common drawback of these methods is the need to choose a suitable time interval in which the solution is assumed to be periodic: the so-called *effective period*. Indeed, the magnetic fields in the rotor and the stator oscillate at different frequencies, and the shortest common time at which both are periodic is generally quite large. However, the periodicity condition has to be imposed on a short time interval for the method to be useful. Our methodology aims to compute a suitable initial condition and has the advantage of using the periodicity property only in the rotor bars, so the above limitation does not apply. Moreover, the computational cost of our approach does not depend on the size of this period, and the number of unknowns is very small in comparison with the previously mentioned methods.

This work has been developed under contract with the company Robert Bosch GmbH from Stuttgart (Stefan Kurz, Marcus Alexander). It has given rise to a Spanish patent. A detailed description of the methodology has been published in papers [1] and [3].

#### *5.1 Description of the New Methodology*

The main lines of the methodology can be described with a toy model. Let us consider a simple series circuit with an inductor and a resistor,

$$L\,I'(t)+R\,I(t)=E(t),$$

**Fig. 18** A quarter of the geometric domain at time *t* = 0 (left) and *t* > 0 (right). Modification of a picture provided by Robert Bosch GmbH

with the electromotive force

$$E(t)=\mathbb{E}\sin(\omega t).$$

The general solution is

$$I(t)=\underbrace{A\,e^{-\frac{R}{L}t}}_{\text{transient part}}+\underbrace{\frac{\mathbb{E}}{|Z(\omega)|}\sin\bigl(\omega t-\varphi(\omega)\bigr)}_{\text{steady solution}},$$

where $Z(\omega)=R+i\omega L\in\mathbb{C}$ is the impedance of the circuit and $\varphi(\omega)$ its argument. In the extreme situation in which $\omega L\gg R$, we have

$$\varphi(\omega)\approx\frac{\pi}{2}\quad\text{and}\quad|Z(\omega)|\approx\omega L,$$

and hence

$$I(t)\approx A\,e^{-\frac{R}{L}t}+\frac{\mathbb{E}}{\omega L}\cos(\omega t).$$

If the equation is solved for I(0) = 0, then the solution is approximately given by

$$I(t)\approx-\frac{\mathbb{E}}{\omega L}\,e^{-\frac{R}{L}t}+\frac{\mathbb{E}}{\omega L}\cos(\omega t),$$
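The effect of the initial condition on the transient is easy to check numerically. The sketch below (all circuit values assumed) integrates the circuit ODE with a classical RK4 scheme, starting either from zero or from the value at $t=0$ of the exact steady solution $\frac{\mathbb{E}}{|Z|}\sin(\omega t-\varphi)$, and measures the deviation from that steady solution:

```python
import numpy as np

# Toy series RL circuit: L I'(t) + R I(t) = E sin(omega t).
L_, R_, E_ = 0.5, 1.0, 10.0          # assumed henries, ohms, volts (omega*L >> R)
w = 2 * np.pi * 50.0                 # 50 Hz source
phi = np.arctan2(w * L_, R_)         # argument of the impedance Z = R + i*omega*L
Zmod = np.hypot(R_, w * L_)          # |Z|
steady = lambda t: E_ / Zmod * np.sin(w * t - phi)   # exact steady solution

def integrate(I0, T=0.2, n=20000):
    """Classical RK4 on the circuit ODE, returning n+1 samples of I."""
    dt = T / n
    t, I = 0.0, I0
    out = [I0]
    f = lambda t, I: (E_ * np.sin(w * t) - R_ * I) / L_
    for _ in range(n):
        k1 = f(t, I)
        k2 = f(t + dt / 2, I + dt / 2 * k1)
        k3 = f(t + dt / 2, I + dt / 2 * k2)
        k4 = f(t + dt, I + dt * k3)
        I += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
        out.append(I)
    return np.array(out)

ts = np.linspace(0.0, 0.2, 20001)
err_zero = np.max(np.abs(integrate(0.0) - steady(ts)))          # large transient
err_good = np.max(np.abs(integrate(steady(0.0)) - steady(ts)))  # transient absent
print(err_zero, err_good)
```

Starting from the steady value at $t=0$ suppresses the transient entirely, whereas a null initial condition leaves a deviation that decays only on the slow time scale $L/R$.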

so it includes a transient part. However, if the equation is solved for

$$I(0)=\frac{\mathbb{E}}{\omega L},$$

then $A=0$ and the transient part is close to zero from the beginning. The important remark is that, if $\frac{RT}{L}\ll 1$, then the above initial condition can be obtained without solving the ODE, as follows:


$$L\,T\,I(0)-L\int_0^T I(t)\,dt=\mathbb{E}\int_0^T(T-s)\sin(\omega s)\,ds=\frac{\mathbb{E}T}{\omega}.$$

• Moreover, since the steady solution is harmonic, $\int_0^T I(t)\,dt=0$ and from the above equation we deduce

$$I(0)=\frac{1}{LT}\,\frac{\mathbb{E}T}{\omega}=\frac{\mathbb{E}}{\omega L},$$

which is the suitable initial condition previously obtained. The interesting feature of this method is that it can be applied in more general settings; in particular, to the model of induction machines with a squirrel cage. In this case, the problem to be solved is the following:

*Given currents along the coil sides* $I_n(t)$, $n=N_b+1,\ldots,N_c$, *and initial currents along the bars* $y_n^0$, $n=1,\ldots,N_b$, *find, for every* $t\in[0,T]$, *currents* $y_n(t)$, $n=1,\ldots,N_b$, *along the bars such that* $y_n(0)=y_n^0$, $n=1,\ldots,N_b$, *and*

$$\begin{aligned}
&\mathcal{R}^b\,\frac{d}{dt}\,\mathcal{F}\bigl(t,\mathbf{y}^b(t)\bigr)+\Bigl(\mathcal{R}^b+\bigl(\mathcal{A}^b\bigr)^{\mathrm{T}}\mathcal{B}^{-1}\mathcal{A}^b\Bigr)\mathbf{y}^b(t)+\lambda(t)\bigl(\mathcal{A}^b\bigr)^{\mathrm{T}}\begin{pmatrix}\mathbf{0}\\ \mathbf{e}\end{pmatrix}=\mathbf{0},\\
&\mathcal{A}^b\,\mathbf{y}^b(t)\cdot\begin{pmatrix}\mathbf{0}\\ \mathbf{e}\end{pmatrix}=0,
\end{aligned}$$

*where* $\mathcal{F}:[0,T]\times\mathbb{R}^{N_b}\longrightarrow\mathbb{R}^{N_b}$ *is the nonlinear operator defined as*

$$\mathcal{F}(t,\mathbf{w}):=\left(\int_{\Omega_1}\sigma A(x,y,t)\,dx\,dy,\;\ldots,\;\int_{\Omega_{N_b}}\sigma A(x,y,t)\,dx\,dy\right)^{\top}\in\mathbb{R}^{N_b},$$

for $t\in[0,T]$, $\mathbf{w}\in\mathbb{R}^{N_b}$, *with* $A(x,y,t)$ *the solution to the following nonlinear magnetostatic problem:*

*Given a fixed* $t\in[0,T]$, *currents along the coil sides* $I_n(t)$, $n=N_b+1,\ldots,N_c$, *and* $\mathbf{w}\in\mathbb{R}^{N_b}$, *find a field* $A(x,y,t)$ *such that*

$$\begin{aligned}
-\operatorname{div}(\nu_0\,\mathbf{grad}\,A)&=0 &&\text{in } \Omega_0^{\mathrm{rot}}\cup r_l\bigl(\Omega_0^{\mathrm{sta}}\bigr),\\
-\operatorname{div}(\nu_0\,\mathbf{grad}\,A)&=\frac{w_n}{\operatorname{meas}(\Omega_n)} &&\text{in } \Omega_n,\ n=1,\ldots,N_b,\\
-\operatorname{div}(\nu_0\,\mathbf{grad}\,A)&=\frac{I_n(t)}{\operatorname{meas}(\Omega_n)} &&\text{in } r_l(\Omega_n),\ n=N_b+1,\ldots,N_c,\\
-\operatorname{div}\bigl(\nu(\cdot,|\mathbf{grad}\,A|)\,\mathbf{grad}\,A\bigr)&=0 &&\text{in } \Omega_{\mathrm{nl}}^{\mathrm{rot}}\cup r_l\bigl(\Omega_{\mathrm{nl}}^{\mathrm{sta}}\bigr),
\end{aligned}$$

*with suitable transmission and boundary conditions*.

#### *5.2 Numerical Experiments with Real Electric Machines*

We present the numerical results obtained for a particular induction machine with a squirrel cage rotor. Firstly, we use our method to get a suitable initial condition. Next, we solve the transient model with this initial condition and compare the time needed to reach the steady state with the one needed when taking a null initial condition. The electric machine we have used for the numerical experiments can be seen in Figs. 18 and 19. For confidentiality reasons, it is a modification of a picture provided by Robert Bosch GmbH. Red, yellow and blue colors correspond to the three different phases. It is composed of 36 slots in the rotor and 48 slots in the stator. It is a three-phase machine having 2 pole pairs with 12 slots per pole. The source currents are characterized by an electrical frequency $f_c$ and an RMS current $I_c$ through each slot. The currents corresponding to each phase of the stator are defined as

$$\begin{aligned}
I_A(t)&=\sqrt{2}\,I_c\cos\bigl(2\pi f_c t\bigr),\\
I_B(t)&=\sqrt{2}\,I_c\cos\Bigl(2\pi f_c t+\frac{2\pi}{3}\Bigr),\\
I_C(t)&=\sqrt{2}\,I_c\cos\Bigl(2\pi f_c t-\frac{2\pi}{3}\Bigr).
\end{aligned}$$
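As a quick sanity check, these three phase currents form a balanced set whose instantaneous sum vanishes; a minimal numeric sketch (the values of $f_c$ and $I_c$ are assumed, not the confidential operating-point data of Table 2):

```python
import numpy as np

# Balanced three-phase source: the instantaneous phase currents sum to zero.
fc, Ic = 50.0, 10.0                      # assumed frequency (Hz) and RMS current (A)
t = np.linspace(0.0, 0.1, 1001)          # a few electrical periods
IA = np.sqrt(2) * Ic * np.cos(2 * np.pi * fc * t)
IB = np.sqrt(2) * Ic * np.cos(2 * np.pi * fc * t + 2 * np.pi / 3)
IC_ = np.sqrt(2) * Ic * np.cos(2 * np.pi * fc * t - 2 * np.pi / 3)
print(np.max(np.abs(IA + IB + IC_)))     # zero up to rounding error
```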

We have considered four operating points corresponding to different electrical sources in the stator and different rotor velocities. They are described in Table 2. The physical time to reach the steady state for the different operating points can be seen in Table 3. Finally, in Fig. 20, the computed torque and current along the transient simulation are shown for operating point #4.

*Notes and Comments*.


**Fig. 19** Transversal section of an induction electric motor with squirrel cage


**Table 3** Time to get the steady state with null initial condition and with the one obtained by the new method


**Fig. 20** Op. Point 4. Torque versus time (left). Current in bar 1 versus time (right)


**Acknowledgements** This work has been partially supported by Robert Bosch GmbH under contract ITMATI-C31-2015, by FEDER and Xunta de Galicia (Spain) under grant GI-1563 ED431C 2017/60, by FEDER/Ministerio de Ciencia, Innovación y Universidades-Agencia Estatal de Investigación under the research project MTM2017-86459-R.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Modelling Our Sense of Smell**

#### **Carlos Conca**

**Abstract** The first step in our sensing of smell is the conversion of chemical odorants into electrical signals. This happens when odorants stimulate ion channels along cilia, which are long thin cylindrical structures in our olfactory system. Determining how the ion channels are distributed along the length of a cilium is beyond current experimental methods. Here we describe how this can be approached as a mathematical inverse problem. Identification of specific functions of receptor neuron arrays is a major challenge today in both mathematics and the biosciences. In this paper, two mathematical models based on integral equations are studied for the inverse problem of determining the distribution of ion channels in cilia of olfactory neurons from experimental data.

#### **1 Introduction**

The first step in sensing smell is the transduction (or conversion) of chemical information into an electrical signal that goes to the brain. Pheromones and odorants, which are small molecules with the chemical characteristics of an odor, are found throughout our environment. The olfactory system (the part of the sensory system we use to smell) performs the task of receiving these odorant molecules in the nasal mucosa and triggering the physical-chemical processes that generate the electric current that travels to the brain; see Fig. 1 and Sect. 1.1.

What happens next is a mystery. Intuition tells us that the electrical wave generated gives rise to an emotion in the brain, which in turn affects our behavior. Of course, the workings of our other four senses are similarly a mystery. And so, we quickly come to perhaps one of the most fundamental questions in neurosciences for the future: How does our consciousness process external stimuli once reduced to electro-chemical waves and, over time, how does this mechanism lead us to become who we are?

C. Conca (B)

Departamento de Ingeniería Matemática, Facultad de Ciencias Físicas y Matemáticas, Centro de Modelamiento Matemático UMR 2071 CNRS-UChile, Centro de Biotecnología y Bioingeniería, Universidad de Chile, Santiago, Chile e-mail: cconca@dim.uchile.cl

<sup>©</sup> The Author(s) 2022

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_3

**Fig. 1** Odorants reaching the nasal mucus (left) and structure of an olfactory receptor neuron (right)

How can we approach this problem with mathematics? Faced with these reflections, applied mathematicians take time to stop and wonder if it is possible to provide such far-reaching phenomena with a mathematical representation that allows us to understand and act. Biology is synonymous with "function", so the study of biological systems should start by understanding the corresponding underlying physiology. Consequently, to obtain a proper mathematical representation of the transduction of an odor into an electrical signal, and before any mathematical intervention, we must first detect which atomic populations are involved in the process and identify their respective functions.

#### *1.1 Transduction of Olfactory Signals*

The molecular machinery that carries out this work is in the olfactory cilia. Cilia are long, thin cylindrical structures that extend from an olfactory receptor neuron into the nasal mucus (Fig. 1).

The transduction of an odor begins with pheromones binding to specific receptors on the external membrane of cilia. When an odorant molecule binds to an olfactory receptor on a cilium membrane, it successively activates an enzyme, which increases the level of a ligand or chemical messenger named cyclic adenosine monophosphate (cAMP) within the cilium. As a result, cAMP molecules diffuse through the interior of the cilium. Some of the cAMP molecules bind to cyclic nucleotide-gated (CNG) ion channels, causing them to open. This allows an influx of positively charged ions into the cilium (mostly Ca2<sup>+</sup> and Na<sup>+</sup>, as illustrated in Fig. 2), which causes the neuron to depolarize, generating an excitatory response. This response is characterized by a voltage difference across the membrane, which in turn initiates the electrical current. This is the overall process that human beings share with all mammals and reptiles to smell and differentiate odors.

**Fig. 2** Signal transduction mechanism for the olfactory system. **a** In the absence of stimulus channels are closed, system is at resting state. **b** Binding of odorants triggers cAMP synthesis and opening of CNG channels, leading to Ca2<sup>+</sup> and Na<sup>+</sup> transport and a Cl<sup>−</sup> flux

#### *1.2 Kleene's Experimental Procedure*

Experimental techniques for isolating a single cilium (from a grass frog) were developed by biochemist and neuroscientist Steven J. Kleene and his research team at the University of Cincinnati in the early 1990s [5, 6]. One olfactory cilium of a receptor neuron is detached at its base and stretched tight into a recording pipette. The cilium is immersed in a cAMP bath. As a result of the phenomenon previously described inside the cilium, the intensity of the current generated is recorded.

Although the properties of a single channel have been described successfully using these experimental techniques, the distribution of these channels along the cilia still remains unknown, and may well turn out to be crucial in determining the kinetics of the neuronal response. Ionic channels, and in particular CNG channels, are called "micro-domains" in biochemistry because of their practically imperceptible size. This makes their experimental description using current technology very difficult.

#### *1.3 An Integral Equation Model*

Given the experimental difficulties, there is a clear opportunity for mathematics to inform biology. Determining the distribution of ion channels along the length of a cilium from experimental measurements of the transmembrane current is usually categorized in physics and mathematics as an inverse problem. Around 2006, a multidisciplinary team (which brought together mathematicians with biochemists and neuroscientists, as well as a chemical engineer) developed and published a first mathematical model [4] to simulate Kleene's experiments. The distribution of CNG channels along the cilium appears in it as the main unknown of a nonlinear integral equation model.

This model gave rise to a simple numerical method for obtaining estimates of the spatial distribution of CNG ion channels. However, specific computations revealed that the mathematical problem is poorly conditioned. This is a general difficulty in inverse problems, where the corresponding mathematical problem is usually ill-posed in the sense of Hadamard (a well-posed problem must have a solution that exists, is unique, and depends continuously on the data), or else unstable with respect to the data. As a consequence, its numerical resolution often results in ill-conditioned approximations.

The essential nonlinearity in the previous model arises from the binding of the channel-activating ligand (cAMP molecules) to the CNG ion channels as the ligand diffuses along the cilium. In 2007, the mathematicians D. A. French and C. W. Groetsch introduced a simplified model in which the binding mechanism is neglected, leading to a linear Fredholm integral equation of the first kind with a diffusive kernel. The inverse mathematical problem consists of determining a density function, say ρ = ρ(*x*) ≥ 0 (representing the distribution of CNG channels), from measurements in time of the transmembrane electrical current, denoted $\mathcal{I}_0[\rho]$. The equation for ρ is the following integral equation: for all *t* ≥ 0,

$$\mathcal{I}_0[\rho](t)=\int_0^L \rho(x)\,P\bigl(c(t,x)\bigr)\,dx, \tag{1}$$

where P is known as the Hill function of exponent *n* > 0 (see Fig. 3). It is defined by:

$$\forall w\ge 0,\qquad P(w)=\frac{w^n}{w^n+K_{1/2}^n}.$$

In this definition, the exponent *n* is an experimentally determined parameter and $K_{1/2}>0$ is a constant which represents the half-bulk concentration (i.e., the ligand concentration for which half the binding sites are occupied); typical values for *n* in humans are *n* ≈ 2. Besides, in the linear integral equation above, *c*(*t*, *x*) denotes the concentration of cAMP that diffuses along the cilium with a diffusivity constant that we denote by *D*; *L* denotes the length of the cilium, which for simplicity is assumed to be one-dimensional. Here, by concentration we mean the molar concentration, i.e., the amount of solute per unit volume of solvent; it is a nonnegative real number.

Hill-type functions are extensively used in biochemistry to model the fraction of ligand bound to a macromolecule as a function of the ligand concentration and, hence, the quantity P(*c*(*t*, *x*)) models the probability of the opening of a CNG channel as a function of the cAMP concentration. The diffusion equation for the concentration of cAMP can be explicitly solved if the length of the cilium *L* is supposed to be infinite. It is given by:

$$c(t, x) = c\_0 \operatorname{erfc}\left(\frac{x}{2\sqrt{Dt}}\right),$$

where *c*<sup>0</sup> > 0 is the maintained concentration of cAMP with which the pipette comes into contact at the open end (*x* = 0) of the cilium (while *x* = *L* is the closed end). Here, erfc is the standard complementary Gauss error function,

$$\operatorname{erfc}(x) := 1 - \frac{2}{\sqrt{\pi}} \int\_0^x e^{-r^2} \,\mathrm{d}r.$$

Accordingly, it is straightforward to check that *c* is decreasing in *x*, increasing in *t*, and that it remains bounded: 0 < *c*(*t*, *x*) ≤ *c*<sub>0</sub> for all (*t*, *x*).
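The two ingredients of the model, the diffusing concentration and the Hill nonlinearity, are simple enough to evaluate directly. The following minimal Python sketch checks the monotonicity and boundedness properties just stated; the parameter values `c0`, `D`, `n`, `K_half` are illustrative assumptions, not taken from the text, and `math.erfc` plays the role of the complementary error function.

```python
import math

# Illustrative parameter values (assumptions for this sketch, not from the text)
c0, D = 1.0, 1.0        # maintained cAMP concentration and diffusivity
n, K_half = 2.0, 0.5    # Hill exponent and half-bulk constant K_{1/2}

def c(t, x):
    """Concentration c(t, x) = c0 * erfc(x / (2 sqrt(D t)))."""
    return c0 * math.erfc(x / (2.0 * math.sqrt(D * t)))

def hill(w):
    """Hill function P(w) = w^n / (w^n + K_half^n): channel-opening probability."""
    return w ** n / (w ** n + K_half ** n)

assert 0.0 < c(1.0, 0.5) <= c0         # bounded by c0
assert c(1.0, 0.5) > c(1.0, 1.0)       # decreasing in x
assert c(2.0, 0.5) > c(1.0, 0.5)       # increasing in t
assert 0.0 <= hill(c(1.0, 0.5)) < 1.0  # probability strictly below saturation
```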

Despite its elegance (by virtue of the simplicity of its formulation), this new model does not overcome the difficulties encountered in its non-linear version. In fact, the mathematical inverse problem associated to model (1) can be shown to be ill-posed. More precisely, since P(*c*(*t*, *x*)) is a smooth mapping, the operator ρ → I<sub>0</sub>[ρ] is compact from L<sup>*p*</sup>(0, *L*) to L<sup>*p*</sup>(0, *T*) for every *L*, *T* > 0, 1 < *p* < ∞. Thus, even if the operator I<sub>0</sub> were injective, its inverse could not be continuous: otherwise the identity map on L<sup>*p*</sup>(0, *L*) would be compact, which is known to be false.
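The compactness argument has a concrete numerical face. In the hedged sketch below (illustrative parameters and a simple trapezoidal discretization of the forward map (1), not the authors' code), an O(1) oscillatory perturbation of the density changes the simulated current by only a few percent: the smooth kernel averages fine-scale features away, which is exactly the ill-conditioning met when trying to invert.

```python
import math

# Hedged numerical sketch of the forward map (1); all values are illustrative.
c0, D, L_cil = 1.0, 1.0, 1.0   # concentration, diffusivity, cilium length
n, K_half = 2.0, 0.5           # Hill exponent and half-bulk constant

def P(w):
    return w ** n / (w ** n + K_half ** n)

def c(t, x):
    return c0 * math.erfc(x / (2.0 * math.sqrt(D * t)))

def current(rho, t, m=1000):
    """Trapezoidal approximation of I0[rho](t) = ∫_0^L rho(x) P(c(t, x)) dx."""
    h = L_cil / m
    total = 0.0
    for i in range(m + 1):
        x = i * h
        w = 0.5 if i in (0, m) else 1.0
        total += w * rho(x) * P(c(t, x))
    return total * h

rho1 = lambda x: 1.0                               # flat channel density
rho2 = lambda x: 1.0 + math.sin(20 * math.pi * x)  # large oscillatory perturbation

# The two currents differ by only a few percent although the densities
# differ by 100%: recovering rho from the current is badly conditioned.
for t in (0.5, 1.0, 2.0):
    i1, i2 = current(rho1, t), current(rho2, t)
    assert abs(i2 - i1) < 0.05 * i1
```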

#### *1.4 Non-diffusive Kernels*

This last result certainly has a more general character. In fact, it is clear from its proof that any model based on a Fredholm integral equation of the first kind with a *smooth* diffusive kernel necessarily renders the problem of recovering the density from measurements of the electrical current ill-posed.

An initial, natural approach to tackling this anomaly in model (1) was developed in Conca et al. [3]. It exploited the fact that the Hill function converges pointwise to a single step function as the exponent *n* goes to +∞; the strategy was to approximate P using a multiple step function.

Based on different assumptions of the spaces where the unknown ρ is sought, theoretical results of identifiability, stability and reconstruction were obtained for the corresponding inverse problem. However, numerical methods for generating estimates of the spatial distribution of ion channels revealed that this class of models is not satisfactory for practical purposes. The only feasible estimates for ρ are obtained for multiple step functions that are very close to a single-step function or, equivalently, for Hill functions with very large exponents, which imply the use of unrealistic models.

Another way to overcome the ill-posedness of the inverse problem in (1) consists of replacing the kernel of the integral equation with a non-smooth variant of the Hill function.

Specifically, let *a* ∈ (0, *c*0) be a given real parameter. A discontinuous version of P is obtained by forcing a saturation state for concentrations higher than *a*. By doing so, one is led to introduce the following disruptive variant of P (shown in Fig. 4):

$$\mathbb{H}(c) = \mathbb{P}(c) \, \mathbb{1}\_{c \le a} + \mathbb{1}\_{a < c \le c\_0},$$

where 1<sub>*J*</sub> denotes the characteristic function of the interval *J*. The mathematical problem that recovers ρ from the electrical current data is therefore modelled by

$$\mathcal{I}\_1[\rho](t) = \int\_0^L \rho(x) \, \mathbb{H}(c(t, x)) \, \mathrm{d}x, \tag{2}$$

where *c*(*t*, *x*) is still defined as before. The introduction of this disruptive Hill function can be understood mathematically as follows: as *t* → ∞, the factor *x*/(2√(*Dt*)) in the complementary error function defining the concentration tends to 0, and consequently *c*(*t*, *x*) tends pointwise to *c*<sub>0</sub>. An inverse mathematical problem and a direct problem are associated with both models (1) and (2). In the inverse problem, the electric current is measured and the unknown is the density ρ of ion channels, while in the direct problem the opposite is true. Since these are Fredholm equations of the first kind, it is natural to tackle them using convolution. Once the variable ρ has been extended to [0, ∞) by zero, the Mellin transform is revealed as being the most appropriate tool for carrying out this task (see the overview section "Mellin Transform").

#### **2 A General Convolution Equation**

The Mellin transform is the appropriate tool to study model (2): it reduces it to a convolution equation of the Mellin type. To do so, the key observation is that H(*c*(*t*, *x*)) can be written in terms of √*t*/*x*. Indeed, defining *G* as

$$G(z) = \mathbb{H}\left(c\_0 \text{erfc}\left(\frac{1}{2\sqrt{D}z}\right)\right),$$

we have I<sub>1</sub>[ρ](*t*) = ∫<sub>0</sub><sup>*L*</sup> ρ(*x*) *G*(√*t*/*x*) d*x*. Thus, by extending ρ by zero to [0, ∞) and rescaling time *t* to *t*<sup>2</sup>, we obtain

$$\mathcal{I}\_1[\rho](t^2) = \int\_0^\infty x\rho(x)\, G\left(\frac{t}{x}\right) \frac{\mathrm{d}x}{x} = \left(x\rho(x)\right) \* G,$$

which is a convolution equation in *x*ρ(*x*).

Taking the Mellin transform on both sides and using its operational properties, we formally obtain

$$\frac{1}{2}\mathcal{M}\mathcal{I}\_1[\rho](s/2) = \mathcal{M}G(s)\,\mathcal{M}\rho(s+1),$$

or equivalently,

$$\mathcal{M}\rho(s+1) = \frac{1}{2} \frac{\mathcal{M}\mathcal{I}\_1[\rho]\,(s/2)}{\mathcal{M}G(s)}. \tag{3}$$

#### **Mellin Transform**

The Finnish mathematician Robert Hjalmar Mellin (1854–1933) gave his name to the so-called Mellin transform, whose definition and properties are recalled below. The interested reader is referred to §2 of [1] or Lindelöf [7] for a summary of his work and proofs of the main results around this transform.

For *q* ∈ ℝ, *q* + *i*ℝ will denote the vertical line {*q* + *it*, *t* ∈ ℝ} of the complex plane having abscissa *q*, and for *p* ∈ ℝ (*p* ≥ 1), L<sup>*p*</sup>([0, ∞), *x<sup>q</sup>*), or simply L<sup>*p*</sup><sub>*q*</sub>, will stand for the Lebesgue space with the weight *x<sup>q</sup>*, i.e.,

$$\mathcal{L}\_q^p = \left\{ f \colon [0,\infty) \to \mathbb{R} \mid \|f\|\_{\mathcal{L}\_q^p} < +\infty \right\},$$

where

$$\|f\|\_{\mathcal{L}\_q^p} = \left(\int\_0^\infty |f(x)|^p\, x^q \,\mathrm{d}x\right)^{1/p}.$$

L<sup>*p*</sup><sub>*q*</sub>, endowed with this norm, is a Banach space.

Let *f* be in L<sup>1</sup>([0, ∞), *x<sup>q</sup>*). The Mellin transform of *f* is a complex-valued function defined on the vertical line *q* + 1 + *i*ℝ by

$$\mathcal{M}f(s) = \int\_0^\infty x^s f(x) \frac{\mathrm{d}x}{x}.$$

From its very definition, it is observed that the Mellin transform maps functions defined on [0, ∞) into functions defined on *q* + 1 + *i*ℝ. As with the Fourier transform, M*f* is continuous whenever *f* is in L<sup>1</sup>([0, ∞), *x<sup>q</sup>*). Specifically, we have

**Theorem 1** (Riemann-Lebesgue) *The Mellin transform is a linear continuous map from* L<sup>1</sup>([0, ∞), *x<sup>q</sup>*) *into* C<sup>0</sup>(*q* + 1 + *i*ℝ; ℂ) ↪ L<sup>∞</sup>(*q* + 1 + *i*ℝ; ℂ); *its operator norm is* 1.

**Proposition 1** *If f is in* L<sup>1</sup><sub>*q*</sub> *for every real number q in* (*a*, *b*)*, then its Mellin transform* M*f*(·) *is holomorphic in the strip S* = {*s* ∈ ℂ | *a* + 1 < Re(*s*) < *b* + 1}.
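As a quick numerical illustration of the definition (an assumption-laden sketch, not part of the original development): for *f*(*x*) = *e*<sup>−*x*</sup> the Mellin transform is the Gamma function, M*f*(*s*) = Γ(*s*), which can be checked on the real axis with a trapezoidal rule after the substitution *x* = *e<sup>u</sup>*. The quadrature window and step below are illustrative choices.

```python
import math

def mellin(f, s, umin=-60.0, umax=30.0, m=1200):
    """Numerical Mellin transform: M f(s) = ∫_0^∞ x^{s-1} f(x) dx.

    The substitution x = e^u turns this into ∫_R e^{s u} f(e^u) du, which the
    trapezoidal rule integrates very accurately for rapidly decaying f."""
    du = (umax - umin) / m
    total = 0.0
    for i in range(m + 1):
        u = umin + i * du
        w = 0.5 if i in (0, m) else 1.0
        total += w * math.exp(s * u) * f(math.exp(u))
    return total * du

# For f(x) = e^{-x} the Mellin transform is the Gamma function: M f(s) = Γ(s).
f = lambda x: math.exp(-x)
for s in (0.5, 1.0, 2.5, 4.0):
    assert abs(mellin(f, s) - math.gamma(s)) < 1e-6
```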

The following table summarizes the main operational properties of the Mellin transform (stated formally; each identity holds whenever both sides are defined):

| Function | Mellin transform |
|---|---|
| *f*(*x*) | M*f*(*s*) |
| *f*(*ax*), *a* > 0 | *a*<sup>−*s*</sup> M*f*(*s*) |
| *x<sup>a</sup> f*(*x*) | M*f*(*s* + *a*) |
| *f*(*x<sup>a</sup>*), *a* ≠ 0 | \|*a*\|<sup>−1</sup> M*f*(*s*/*a*) |
| *f*<sup>(*k*)</sup>(*x*) | (−1)<sup>*k*</sup> (*s* − *k*)<sub>*k*</sub> M*f*(*s* − *k*) |

where, for all *x* ∈ ℝ and *k* ≥ 1, (*x*)<sub>*k*</sub> stands for the so-called Pochhammer symbol, which is defined by

$$(x)\_k = x \cdot (x+1) \cdots (x + k - 1) = \prod\_{j=0}^{k-1} (x + j) \quad \text{if } k \ge 1,$$

and (*x*)<sub>0</sub> = 1 for every *x* ∈ ℝ.

#### *2.1 A Priori Estimates*

Seeking continuity and observability inequalities for model (2) then reduces to finding lower and upper bounds for M*G*(·) in suitable weighted Lebesgue spaces. Doing so, one obtains

**Theorem 2** (A Priori Estimates) *Let k* ∈ ℕ ∪ {0} *and r* ∈ ℝ *be arbitrary. Assume that the Mellin transforms of* ρ *and* I<sub>1</sub>[ρ] *satisfy* (3)*; then*

$$C\_l^k \|\rho\|\_{L^2\_r} \le \|(\mathcal{I}\_1[\rho])^{(k)}\|\_{L^2\_{2k+\frac{r-3}{2}}} \le C\_u^k \|\rho\|\_{L^2\_r},$$

*where*

$$C\_l^k \stackrel{(\text{def})}{=} \sqrt{2} \inf\_{s \in \frac{r-1}{2} + i\mathbb{R}} \left| \left( \frac{s}{2} \right)\_{k} \mathcal{M}G(s) \right|,$$

$$C\_u^k \stackrel{(\text{def})}{=} \sqrt{2} \sup\_{s \in \frac{r-1}{2} + i\mathbb{R}} \left| \left( \frac{s}{2} \right)\_{k} \mathcal{M}G(s) \right|,$$

*and* L<sup>*p*</sup><sub>*q*</sub> = *L<sup>p</sup>*([0, ∞), *x<sup>q</sup>*) *stands for the Lebesgue space with the weight x<sup>q</sup>, p* ≥ 1*, q* ∈ ℝ*.*

*Remark 1* It is worth noting that *C<sup>k</sup><sub>l</sub>*, *C<sup>k</sup><sub>u</sub>* could *a priori* range from 0 to +∞.

*Proof* Using the properties of the Mellin transform in Eq. (3), it follows that

$$(s-k)\_k\, \mathcal{M}\mathcal{I}\_1[\rho](s-k) = 2\,(s-k)\_k\, \mathcal{M}G(2(s-k))\, \mathcal{M}\rho(2(s-k)+1). \tag{4}$$

Thanks to Parseval-Plancherel's isomorphism, for every *s* in *q* + *i* R, we have

$$\begin{aligned} \left\|(\mathcal{I}\_1[\rho])^{(k)}\right\|\_{L^2\_{2q-1}} &= \frac{1}{\sqrt{2\pi}} \left\|(-1)^k (s-k)\_k\, \mathcal{M}\mathcal{I}\_1[\rho](s-k)\right\|\_{L^2(q+i\mathbb{R})} \\ &= \frac{2}{\sqrt{2\pi}} \left\|(s-k)\_k\, \mathcal{M}G(2(s-k))\, \mathcal{M}\rho(2(s-k)+1)\right\|\_{L^2(q+i\mathbb{R})} \\ &= \frac{2}{\sqrt{2\pi}} \left\|(s)\_k\, \mathcal{M}G(2s)\, \mathcal{M}\rho(2s+1)\right\|\_{L^2(q-k+i\mathbb{R})} \\ &= \frac{1}{\sqrt{\pi}} \left\|\left(\frac{s}{2}\right)\_k\, \mathcal{M}G(s)\, \mathcal{M}\rho(s+1)\right\|\_{L^2(2(q-k)+i\mathbb{R})} \end{aligned} \tag{5}$$

As M is an isometry (up to the constant (2π)<sup>−1/2</sup>) from L<sup>2</sup><sub>4(*q*−*k*)+1</sub> onto L<sup>2</sup>(2(*q* − *k*) + 1 + *i*ℝ),

$$\|\mathcal{M}\rho(s+1)\|\_{L^2(2(q-k)+i\mathbb{R})} = \|\mathcal{M}\rho(s)\|\_{L^2(2(q-k)+1+i\mathbb{R})} = \sqrt{2\pi}\, \|\rho\|\_{L^2\_{4(q-k)+1}}. \tag{6}$$

Thanks to (5) and (6) and the definitions of *C<sup>k</sup><sub>l</sub>*, *C<sup>k</sup><sub>u</sub>*, we get

$$C\_l^k \|\rho\|\_{L^2\_{4(q-k)+1}} \le \left\|(\mathcal{I}\_1[\rho])^{(k)}\right\|\_{L^2\_{2q-1}} \le C\_u^k \|\rho\|\_{L^2\_{4(q-k)+1}}.$$

Taking *r* = 4(*q* − *k*) + 1, that is *q* = *k* + (*r* − 1)/4, provides the result.

**Mellin Convolution**

For two given functions *f*, *g*, the *multiplicative convolution f* ∗ *g* is defined as follows

$$(f\*g)(x) = \int\_0^\infty f(y)\, g\left(\frac{x}{y}\right) \frac{\mathrm{d}y}{y}.$$

**Theorem 3** (Mellin Transform of a Convolution) *Whenever this expression is well defined, we have*

$$\mathcal{M}(f \ast \mathbf{g})(\mathbf{s}) = \mathcal{M}f(\mathbf{s})\,\mathcal{M}\mathbf{g}(\mathbf{s})$$
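Theorem 3 can be checked numerically. The sketch below (quadrature windows and step sizes are illustrative choices) computes the multiplicative convolution of *f* = *g* = *e*<sup>−*x*</sup> on a logarithmic grid, takes its Mellin transform at a real point *s*, and compares against M*f*(*s*)<sup>2</sup> = Γ(*s*)<sup>2</sup>.

```python
import math

def mellin(f, s, umin=-40.0, umax=20.0, m=600):
    """M f(s) = ∫_0^∞ x^{s-1} f(x) dx via the substitution x = e^u."""
    du = (umax - umin) / m
    total = 0.0
    for i in range(m + 1):
        u = umin + i * du
        w = 0.5 if i in (0, m) else 1.0
        total += w * math.exp(s * u) * f(math.exp(u))
    return total * du

def mconv(f, g, x, vmin=-40.0, vmax=20.0, m=600):
    """Multiplicative convolution (f*g)(x) = ∫_0^∞ f(y) g(x/y) dy/y, y = e^v."""
    dv = (vmax - vmin) / m
    total = 0.0
    for i in range(m + 1):
        v = vmin + i * dv
        w = 0.5 if i in (0, m) else 1.0
        total += w * f(math.exp(v)) * g(x * math.exp(-v))
    return total * dv

# With f = g = e^{-x} we have M f(s) = Γ(s), so the theorem predicts
# M(f*g)(s) = Γ(s)^2.
f = lambda x: math.exp(-x)
s = 3.0
lhs = mellin(lambda x: mconv(f, f, x), s)
assert abs(lhs - math.gamma(s) ** 2) < 1e-4
```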

Finally, the classical *L*<sup>2</sup>-isometry has its Mellin counterpart.

**Theorem 4** (Parseval-Plancherel's Isomorphism) *The Mellin transform can be extended in a unique manner to a linear isometry* (*up to the constant* (2π)<sup>−1/2</sup>) *from* L<sup>2</sup><sub>2*q*−1</sub> *onto the classical Lebesgue space* L<sup>2</sup>(*q* + *i*ℝ):

$$\mathcal{M} \in \mathcal{L}\left(L\_{2q-1}^2;\, L^2(q+i\,\mathbb{R})\right).$$
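The isometry can also be verified numerically. For ρ(*x*) = *e*<sup>−*x*</sup> and *q* = 1 one has Mρ(1 + *it*) = Γ(1 + *it*), and the sketch below (truncation and discretization parameters are illustrative assumptions) compares ∫<sub>ℝ</sub> |Mρ(*q* + *it*)|<sup>2</sup> d*t* with 2π‖ρ‖<sup>2</sup><sub>L²<sub>2*q*−1</sub></sub> = π/2.

```python
import cmath
import math

def mellin_line(f, q, t, umin=-40.0, umax=20.0, m=500):
    """M f(q + it) = ∫_R e^{(q + it) u} f(e^u) du, via the substitution x = e^u."""
    du = (umax - umin) / m
    total = 0j
    for i in range(m + 1):
        u = umin + i * du
        w = 0.5 if i in (0, m) else 1.0
        total += w * cmath.exp((q + 1j * t) * u) * f(math.exp(u))
    return total * du

rho = lambda x: math.exp(-x)
q = 1.0

# Left side: ∫_R |M rho(q + it)|^2 dt, truncated to |t| <= 15; here the
# integrand equals |Γ(1 + it)|^2 = πt / sinh(πt), which decays exponentially.
T, nt = 15.0, 400
dt = 2 * T / nt
lhs = sum((0.5 if i in (0, nt) else 1.0) * abs(mellin_line(rho, q, -T + i * dt)) ** 2
          for i in range(nt + 1)) * dt

# Right side: 2π ∫_0^∞ rho(x)^2 x^{2q-1} dx = 2π ∫_0^∞ e^{-2x} x dx = π/2.
rhs = math.pi / 2
assert abs(lhs - rhs) < 1e-3 * rhs
```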

#### **3 Observability of CNG Channels**

The *a priori* estimates in the theorem above also make it possible to determine a unique distribution of ion channels along the length of a cilium from measurements in time of the transmembrane electric current.

**Theorem 5** (Existence and uniqueness of ρ) *Let a* > 0 *and r* < 1 *be given. If* I<sub>1</sub> ∈ *L*<sup>2</sup>([0, ∞), *t*<sup>(*r*−3)/2</sup>)*,* I′<sub>1</sub> ∈ *L*<sup>2</sup>([0, ∞), *t*<sup>2+(*r*−3)/2</sup>) *and a is small enough, then there exists a unique* ρ ∈ *L*<sup>2</sup>([0, ∞), *x<sup>r</sup>*) *which satisfies the following stability condition:*

$$\left\|I\_1\right\|\_{L^2([0,\infty),\, t^{\frac{r-3}{2}})} + \left\|I\_1'\right\|\_{L^2([0,\infty),\, t^{2+\frac{r-3}{2}})} \ge C \left\|\rho\right\|\_{L^2\_r},$$

*where C* > 0 *depends only on a and r.*

*Proof* The proof is based on the following technical lemmas and their corollaries.

**Lemma 1** *Let A and B be two elements of* [0, ∞]*, k* ∈ ℕ ∪ {0} *be a nonnegative integer and f a function such that f*<sup>(*j*)</sup> *is in* L<sup>1</sup><sub>*j*</sub>(*A*, *B*) *for every j* = 0, ..., *k. For every real number t, we have*

$$\int\_A^B f(x) x^{it} \, \mathrm{d}x = \sum\_{j=0}^{k-1} (-1)^j Q\_j \left[ x^{j+1} f^{(j)}(x) x^{it} \right]\_A^B + (-1)^k Q\_{k-1} \int\_A^B x^k f^{(k)}(x) x^{it} \, \mathrm{d}x,$$

*where Q<sub>j</sub>* = *Q<sub>j</sub>*(*t*) = ∏<sub>*l*=0</sub><sup>*j*</sup> (1 + *l* + *it*)<sup>−1</sup>*.*

*Proof* We use induction on *k* ∈ ℕ. For *k* = 0, since *Q*<sub>−1</sub> = 1, there is nothing to prove. We assume that the formula is true for an integer *k* ∈ ℕ. As (*k* + 1 + *it*)*Q<sub>k</sub>* = *Q*<sub>*k*−1</sub>, it remains to prove that

$$(k+1+it)\int\_A^B x^k f^{(k)}(x) x^{it} \, \mathrm{d}x = \left[x^{k+1} f^{(k)}(x) x^{it}\right]\_A^B - \int\_A^B x^{k+1} f^{(k+1)}(x) x^{it} \, \mathrm{d}x.$$

As (d/d*x*) *x<sup>it</sup>* = (*it*/*x*) *x<sup>it</sup>*, the previous relation follows by integration by parts. Indeed, we have

$$\begin{aligned} it \int\_A^B x^k f^{(k)}(x) x^{it} \, \mathrm{d}x &= \int\_A^B x^{k+1} f^{(k)}(x) (x^{it})' \, \mathrm{d}x \\ &= \left[ x^{k+1} f^{(k)}(x) x^{it} \right]\_A^B - (k+1) \int\_A^B x^k f^{(k)}(x) x^{it} \, \mathrm{d}x \\ &\quad - \int\_A^B x^{k+1} f^{(k+1)}(x) x^{it} \, \mathrm{d}x. \end{aligned}$$

**Corollary 1** *Let f* : [*A*, *B*] → ℝ *with A*, *B* ∈ [0, ∞] *be a piecewise C*<sup>1</sup> *function. If f is non-negative, f′ is non-positive, f* ∈ L<sup>1</sup>(*A*, *B*)*, f′* ∈ L<sup>1</sup><sub>1</sub>(*A*, *B*) *and for all t* ∈ ℝ*:* [*x f*(*x*)*x<sup>it</sup>*]<sup>*B*</sup><sub>*A*</sub> = 0*, then*

$$\sqrt{1+t^2} \left| \int\_A^B f(x) x^{it} \, \mathrm{d}x \right| \le \int\_A^B f(x) \, \mathrm{d}x.$$

*Proof* From Lemma 1 with *k* = 1 one obtains

$$\forall t \in \mathbb{R}, \quad (1+it) \int\_A^B f(x) x^{it} \, \mathrm{d}x = -\int\_A^B x f'(x) x^{it} \, \mathrm{d}x.$$

As *A*, *B* ≥ 0 and *f′* ≤ 0, using this previous identity twice, for *t* ≠ 0 and for *t* = 0, we get

$$\sqrt{1+t^2} \left| \int\_A^B f(x) x^{it} \, \mathrm{d}x \right| \le \int\_A^B \left| x f'(x) \right| \, \mathrm{d}x = \int\_A^B f(x) \, \mathrm{d}x.$$
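Corollary 1 can be sanity-checked numerically, e.g. with *f*(*x*) = *e*<sup>−*x*</sup> on (0, ∞), for which ∫<sub>0</sub><sup>∞</sup> *f*(*x*)*x<sup>it</sup>* d*x* = Γ(1 + *it*): the hypotheses hold (*f* ≥ 0, *f′* ≤ 0, vanishing boundary term) and ∫<sub>0</sub><sup>∞</sup> *f* = 1. The quadrature parameters in this sketch are illustrative.

```python
import cmath
import math

def osc_integral(t, umin=-40.0, umax=20.0, m=800):
    """∫_0^∞ e^{-x} x^{it} dx (= Γ(1 + it)), computed with x = e^u."""
    du = (umax - umin) / m
    total = 0j
    for i in range(m + 1):
        u = umin + i * du
        w = 0.5 if i in (0, m) else 1.0
        total += w * cmath.exp((1 + 1j * t) * u - math.exp(u))
    return total * du

# f(x) = e^{-x} is non-negative and decreasing, the weighted integrability
# hypotheses hold, and [x f(x) x^{it}]_0^∞ = 0, so Corollary 1 applies with
# A = 0, B = ∞ and ∫_0^∞ f(x) dx = 1.
for t in (0.0, 0.5, 1.0, 3.0, 7.0):
    assert math.sqrt(1 + t * t) * abs(osc_integral(t)) <= 1.0 + 1e-9
```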

**Lemma 2** *Let n*, *K* > 0*, q* ∈ ℝ *and f* = erfc<sup>*n*</sup>/(erfc<sup>*n*</sup> + *K*)*. There exists x<sub>q</sub>* > 0 *such that the function g<sub>q</sub>* : *x* ∈ [*x<sub>q</sub>*, ∞) → *f*(*x*) *x*<sup>*q*−1</sup> *is decreasing. Let q̃* = inf *E<sub>q</sub> where E<sub>q</sub>* = {*c* ≥ 0 | *g′<sub>q</sub>*(*x*) < 0 ∀*x* ≥ *c*}*. The function q* → *q̃ is increasing and q̃* = (*q*/(2*n*))<sup>1/2</sup> + *o*(*q*<sup>1/2</sup>) *as q* → ∞*.*

*Proof* As *f* > 0, the inequality *g′<sub>q</sub>*(*x*) ≤ 0 is equivalent to

$$\frac{f'(\mathbf{x})}{f(\mathbf{x})} \le -\frac{q-1}{\mathbf{x}}\tag{7}$$

Let us compute *f′*/*f*. To do so, let *u* = erfc<sup>*n*</sup>, so that *f* = *u*/(*u* + *K*). We have

$$\frac{f'}{f} = \frac{u'}{u} \frac{K}{u+K} = n \frac{\text{erfc}'}{\text{erfc}} \frac{K}{u+K} \tag{8}$$

Since erfc′(*x*) = −2π<sup>−1/2</sup>*e*<sup>−*x*²</sup>, for *x* large enough, erfc(*x*) = π<sup>−1/2</sup>*x*<sup>−1</sup>*e*<sup>−*x*²</sup> + *o*(*x*<sup>−1</sup>*e*<sup>−*x*²</sup>), and so

$$\frac{f'(\mathbf{x})}{f(\mathbf{x})} = n \frac{\text{erfc}'(\mathbf{x})}{\text{erfc}(\mathbf{x})} (1 + o(1)) = -2n\mathbf{x} + o(\mathbf{x}) \tag{9}$$

This asymptotic expansion proves that the inequality (7) is satisfied for large enough values of *x*. As a consequence, for every *q* in ℝ, the set *E<sub>q</sub>* is not empty, which justifies the definition of *q̃*. Note that the definition of *q̃* implies *g′<sub>q</sub>*(*q̃*) = 0, and hence, thanks to (7), *f′*(*q̃*)/*f*(*q̃*) = −(*q* − 1)/*q̃*. Let *q*<sub>1</sub> ≥ *q*<sub>2</sub> be two real numbers. In order to show that *q̃*<sub>2</sub> ≤ *q̃*<sub>1</sub>, it is enough to prove that *g′*<sub>*q*₁</sub>(*q̃*<sub>2</sub>) ≥ 0. This holds true because

$$\begin{aligned} g\_{q\_1}'(\tilde{q}\_2) &= \tilde{q}\_2^{q\_1 - 2} \left(f'(\tilde{q}\_2)\tilde{q}\_2 + f(\tilde{q}\_2)(q\_1 - 1)\right) \geq \tilde{q}\_2^{q\_1 - 2} \left(f'(\tilde{q}\_2)\tilde{q}\_2 + f(\tilde{q}\_2)(q\_2 - 1)\right) \\ &= \tilde{q}\_2^{q\_1 - q\_2} g\_{q\_2}'(\tilde{q}\_2) = 0. \end{aligned}$$

To find an expansion for *q*˜, let us recall the following classical lower bound on erfc(*x*) for *x* ≥ 0,

$$\frac{1}{x + (x^2 + 2)^{1/2}} \le \frac{1}{2} \pi^{1/2} \exp(x^2) \operatorname{erfc}(x).$$

As the function *u* = erfc<sup>*n*</sup> takes its values in (0, 1], *nK*/(1 + *K*) ≤ *nK*/(*u* + *K*) ≤ *n*. Consequently, the identities (8) yield

$$-n\left(\mathbf{x} + (\mathbf{x}^2 + 2)^{1/2}\right) \le \frac{f'(\mathbf{x})}{f(\mathbf{x})}\tag{10}$$

Let *q* > 1 and set *x<sub>q</sub>* = (*q* − 1)/((2*n*)<sup>1/2</sup>(*n* + *q* − 1)<sup>1/2</sup>). The inequality −(*q* − 1)/*x* ≤ −*n*(*x* + (*x*<sup>2</sup> + 2)<sup>1/2</sup>) is equivalent to *x*(*x* + (*x*<sup>2</sup> + 2)<sup>1/2</sup>) ≤ (*q* − 1)/*n*. A simple computation shows that this inequality is satisfied for *x* = *x<sub>q</sub>* (and becomes an equality). Thanks to (10), we conclude that *x<sub>q</sub>* satisfies *f′*(*x<sub>q</sub>*)/*f*(*x<sub>q</sub>*) ≥ −(*q* − 1)/*x<sub>q</sub>*, which leads to *q̃* ≥ *x<sub>q</sub>*, by definition of *q̃* and by (7). This last inequality implies that *q̃* tends to +∞ as *q* tends to +∞. Finally, from (9), we get the asymptotics for *q̃*, namely

$$-2n\tilde{q} + o(\tilde{q}) = \frac{f'(\tilde{q})}{f(\tilde{q})} = -\frac{q-1}{\tilde{q}}$$

This completes the proof of Lemma 2.
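The asymptotics (9), which drive the whole proof, can be observed numerically at quite moderate values of *x*. A small sketch, with the illustrative (assumed) values *n* = 2 and *K* = 1:

```python
import math

# Illustrative values, assumed for this sketch: n = 2, K = 1.
n, K = 2.0, 1.0

def log_deriv(x):
    """f'/f for f = erfc^n / (erfc^n + K), using erfc'(x) = -(2/√π) e^{-x^2}."""
    u = math.erfc(x) ** n
    dlog_erfc = -2.0 / math.sqrt(math.pi) * math.exp(-x * x) / math.erfc(x)
    return n * dlog_erfc * K / (u + K)

# The ratio to the predicted asymptote -2nx approaches 1 as x grows.
for x in (5.0, 10.0):
    assert abs(log_deriv(x) / (-2.0 * n * x) - 1.0) < 0.05
```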

#### **Proof of Theorem** 5

We are now in a position to conclude the proof of Theorem 5. To do so, we begin by introducing

$$J(x) \stackrel{(\text{def})}{=} \mathbb{H}(c\_0 \operatorname{erfc}(x)) = f(x)\, \mathbb{1}\_{x \ge \alpha} + K\, \mathbb{1}\_{0 < x < \alpha},$$

where *f*(*x*) = erfc(*x*)<sup>*n*</sup>/(erfc(*x*)<sup>*n*</sup> + *c*<sub>0</sub><sup>−*n*</sup>*K*<sup>*n*</sup><sub>1/2</sub>), α = erfc<sup>−1</sup>(*a*/*c*<sub>0</sub>), and *K* = 1. A brief calculation shows that *G* and *J*, and their corresponding Mellin transforms, are related as follows:

$$G(x) = J\left(\frac{1}{2\sqrt{D}\,x}\right) \quad \text{and} \quad \mathcal{M}G(s) = \frac{\mathcal{M}J(-s)}{2^s \sqrt{D^s}}. \tag{11}$$

Thus, in terms of *J* , Eq. (3) becomes

$$\mathcal{M}\rho(s+1) = \frac{2^{s-1}\sqrt{D^s}\; \mathcal{M}\mathcal{I}\_1[\rho]\,(s/2)}{\mathcal{M}J(-s)}. \tag{12}$$

From the estimate for erfc at +∞, given in the proof of Lemma 2, the function *J*<sub>1</sub> is in L<sup>1</sup><sub>*k*</sub> for every *k* > −1. Thus M*J*<sub>1</sub> is holomorphic on the right half-plane, see Proposition 1. Using Lemma 3.2 in [1] on the vertical line (1 − *r*)/2 + *i*ℝ, with (1 − *r*)/2 > 0, one deduces that bounding M*J*(−*s*) amounts to estimating |*s*M*J*<sub>1</sub>(*s*)|, from above or from below, on the vertical lines *q* + *i*ℝ, for *q* > 0. The Mellin transform of *J*<sub>1</sub> at *s* = *q* + *it* is given by

$$\mathcal{M}J\_1(s) = K \int\_0^\alpha x^{s-1} \, \mathrm{d}x + c\_0^n \int\_\alpha^{+\infty} f(x) x^{s-1} \, \mathrm{d}x = K \frac{\alpha^s}{s} + c\_0^n \int\_\alpha^{+\infty} f(x) x^{q-1} x^{it} \, \mathrm{d}x.$$

For any *a* ≥ 0, *q* > 0 and *s* ∈ *q* + *i* R we have

$$|\mathcal{M}J\_1(s)| \le K \frac{\alpha^q}{q} + c\_0^n \int\_\alpha^{+\infty} f(x) x^{q-1} \, \mathrm{d}x,$$

which is finite. Let *q* > 0. According to Lemma 2, the function *x* → *f*(*x*)*x*<sup>*q*−1</sup> is decreasing for *x* ≥ *x*<sub>0</sub>. Let *a* < *c*<sub>0</sub>erfc(*x*<sub>0</sub>), so that α = erfc<sup>−1</sup>(*a*/*c*<sub>0</sub>) ≥ *x*<sub>0</sub>. Let *g*(*x*) = *f*(*x*)*x*<sup>*q*−1</sup>1<sub>*x*≥α</sub>. For every *t* ∈ ℝ, [*x g*(*x*)*x<sup>it</sup>*]<sup>∞</sup><sub>*x*₀</sub> = 0, because *g* vanishes for *x* < α with *x*<sub>0</sub> ≤ α, and *g*(*x*) = π<sup>−*n*/2</sup>*x*<sup>−*n*+*q*−1</sup>*e*<sup>−*nx*²</sup> + *o*(*x*<sup>−*n*+*q*−1</sup>*e*<sup>−*nx*²</sup>) as *x* → ∞. Then Corollary 1 can be applied to the function *g*, with *A* = α, *B* = +∞, for *s* ∈ *q* + *i*ℝ, to give

$$\begin{aligned} |s\mathcal{M}J\_1(s)| &\leq K \left| \alpha^s \right| + c\_0^n \frac{|s|}{\sqrt{1+t^2}} \sqrt{1+t^2} \left| \int\_\alpha^\infty f(x) \, x^{s-1} \, \mathrm{d}x \right| \\ &\leq K\alpha^q + c\_0^n \max(1, q) \int\_\alpha^\infty f(x) x^{q-1} \, \mathrm{d}x < \infty, \end{aligned}$$

because |*s*|/√(1 + *t*<sup>2</sup>) lies in [*q*, 1] or [1, *q*], according to whether *q* ≤ 1 or *q* ≥ 1. For small values of *a*, the first term dominates the second one. The same calculation as above leads to

$$|s\mathcal{M}J\_1(s)| \ge K\alpha^q - c\_0^n \max(1, q) \int\_\alpha^\infty f(x) x^{q-1} \, \mathrm{d}x.$$

This latter expression is equivalent to *K*α*<sup>q</sup>* as α tends to +∞, therefore, it is positive for large values of α.

#### **4 Unstable Identifiability, Non Existence of Observability Inequalities**

Since the French-Groetsch model is also a Fredholm integral equation of the first kind, it is natural to apply a Mellin transform here too. This leads to interesting results: neither an observability inequality nor a proper numerical algorithm for recovering ρ can be established. However, an identifiability result holds whenever the current is measured over an open time interval (see the Identifiability Theorem below).

Defining *G*˜ as

$$\tilde{G}(z) = \mathbb{P}\left(c\_0 \text{erfc}\left(\frac{1}{2\sqrt{D}z}\right)\right).$$

and rescaling time *t* in *t* 2, we obtain a convolution equation very similar to (3):

$$\mathcal{M}\rho(s+1) = \frac{1}{2} \frac{\mathcal{M}\mathcal{I}\_0[\rho]\,(s/2)}{\mathcal{M}\tilde{G}(s)}. \tag{13}$$

A close study of the Mellin transform of *G̃* allows us to establish the following two theorems, which provide information about the behavior of the inverse problem associated with model (1). The proofs of Theorems 3 and 4 require extending the Mellin transform to functions in the Schwartz space and proving that the Mellin transforms of such smooth and rapidly decreasing functions decay faster than polynomials on vertical lines.<sup>1</sup>

 

<sup>1</sup> The interested reader is referred to [1] for details on how this can be done and for detailed proofs of Theorems 3 and 4.

**Theorem 3** (Non Observability) *Let r* < 1 *be fixed. For every non-negative integer k there exists no constant Ck* > 0 *such that the observability inequality:*

$$\left\| \left( \mathcal{I}\_0[\rho] \right)^{(k)} \right\|\_{L^2\left( [0,\infty),\, t^{2k + \frac{r-3}{2}} \right)} \ge C\_k \|\rho\|\_{L^2\_r},$$

*holds for every function* <sup>ρ</sup> <sup>∈</sup> *<sup>L</sup>*2([0,∞), *<sup>x</sup><sup>r</sup>*)*.*

Note that this result shows that I<sub>0</sub> ∈ L(*L*<sup>2</sup><sub>*r*</sub>; *L*<sup>2</sup><sub>(*r*−3)/2</sub>), and that if the inverse problem were identifiable (i.e., I<sub>0</sub> were injective), then I<sub>0</sub><sup>−1</sup> could not be continuous.

**Theorem 4** (Identifiability) *Let r* <sup>&</sup>lt; <sup>0</sup> *and* <sup>ρ</sup> <sup>∈</sup> *<sup>L</sup>*<sup>1</sup>([0,∞), *<sup>x</sup><sup>r</sup>*) *be arbitrary. If there exists a nonempty open subset* U *of* (0,∞) *such that for all t* ∈ U, *I*0[ρ](*t*) = 0*, then* ρ = 0 *almost everywhere on* (0,∞)*.*

The interested reader is referred to [1, §4 and §5] for various numerical experiments associated with the different theoretical results of this paper. In particular, Theorems 5 and 6 are graphically illustrated in the quoted reference with data extracted from laboratory experiments carried out by Chen et al. [2] in the 1990s.

#### **A Path Forward**

The Mellin transform has been successful in mathematically analyzing models (1) and (2), allowing us to answer questions of existence (observability), uniqueness and identifiability of the distribution of ion channels along a cilium, as well as stability issues associated with both direct and inverse problems in these models. However, from a more holistic scientific point of view, not a purely mathematical one, the big question does not seem to be exactly this. Rather, it is about whether, by using and studying these models, Mathematics truly helps to improve our understanding of the olfactory system and, in general terms, the real world. In this sense, Kleene's experiments have been a great contribution, albeit insufficient. Much stronger validation of the models is required, which can only be achieved by forming multidisciplinary teams and designing ad-hoc experiments.

**Acknowledgements** C. C. is partially supported by PFBasal-001 and AFBasal170001 projects, and from the Regional Program STIC-AmSud Project NEMBICA-20-STIC-05.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **State Estimation—The Role of Reduced Models**

**Albert Cohen, Wolfgang Dahmen, and Ron DeVore**

**Abstract** The exploration of complex physical or technological processes usually requires exploiting available information from different sources: (i) *physical laws* often represented as a family of *parameter dependent partial differential equations* and (ii) *data* provided by *measurement devices* or *sensors*. The number of sensors is typically limited and data acquisition may be expensive and in some cases even harmful. This article reviews some recent developments for this "small-data" scenario where inversion is strongly aggravated by the typically large parametric dimensionality. The proposed concepts may be viewed as exploring alternatives to *Bayesian inversion* in favor of more deterministic accuracy quantification related to the required computational complexity. We discuss *optimality criteria* which delineate intrinsic information limits, and highlight the role of *reduced models* for developing efficient computational strategies. In particular, the need to *adapt* the reduced models—not to a specific (possibly noisy) data set but rather to the sensor system—is a central theme. This, in turn, is facilitated by exploiting *geometric* perspectives based on proper *stable variational formulations* of the continuous model.

#### **1 Introduction**

Modern sensor technology and data acquisition capabilities generate an ever increasing wealth of data about virtually every branch of science and social life. Machine learning offers novel techniques for extracting quantifiable information from such large data sets. While machine learning has already had a transformative impact on

A. Cohen
Laboratoire Jacques-Louis Lions, Sorbonne Université, 4, Place Jussieu, 75005 Paris, France
e-mail: cohen@ann.jussieu.fr

W. Dahmen (B)
Mathematics Department, University of South Carolina, 1523 Greene Street, Columbia, SC 29208, USA
e-mail: dahmen@math.sc.edu

R. DeVore
Department of Mathematics, Texas A & M University, College Station, TX 77843-3368, USA

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_4

a diversity of application areas in the "big-data" regime, particularly in image classification and artificial intelligence, it is yet to have a similar impact in many other areas of science.

Utilizing data observations in the analysis of scientific processes differs from traditional learning in that one has the additional information that these processes are described by mathematical models—systems of partial differential equations (PDE) or integral equations—that encode the physical laws that govern the process. Such models, however, are often deficient, inaccurate, incomplete or need to be further calibrated by determining a large number of *parameters* in order to accurately represent an observed process. Typical guiding examples are Darcy's equation for the pressure in ground-water flow or electron impedance tomography. Both are based on second order elliptic equations as core models. The diffusion coefficients in these examples describe permeability or conductivity, respectively. The parametric representations of the coefficients could arise, for instance, from Karhunen-Loève expansions of a random field that represents "unresolvable" features to be captured by the model. In this case the number of parameters could actually be *infinite*.
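As a hedged illustration of such a parametric representation (the basis functions, decay rate and truncation level below are invented for the example, not taken from the references), a truncated expansion with uniformly controlled fluctuations keeps the diffusion coefficient positive, so the core elliptic model stays well defined for every admissible parameter:

```python
import math

a0, d = 2.0, 8   # mean value and truncation level (illustrative choices)

def coefficient(x, y):
    """Truncated KL-style expansion a(x, y) = a0 + Σ_{j=1}^d y_j j^{-2} sin(jπx)."""
    return a0 + sum(y[j - 1] * j ** -2 * math.sin(j * math.pi * x)
                    for j in range(1, d + 1))

# Since Σ_j j^{-2} < π²/6 ≈ 1.645 < a0, the coefficient stays uniformly
# positive for every parameter y ∈ [-1, 1]^d (uniform ellipticity).
y = [(-1.0) ** j for j in range(d)]                  # an extreme parameter choice
vals = [coefficient(i / 100.0, y) for i in range(101)]
assert min(vals) >= a0 - math.pi ** 2 / 6 - 1e-12    # strictly positive
```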

The use of machine learning to describe complex states of interest, or even the underlying laws, through data alone holds little promise here. In fact, data acquisition is often expensive or even harmful, as in applications involving radiation. Thus, severe undersampling poses fundamental obstructions to *state* or *parameter estimation* by processing observational data with standard machine learning techniques alone. It is therefore more natural to try to effectively combine the data with the knowledge of the underlying physical laws represented by parameter-dependent families of PDEs.

Methods that fuse together data-driven and model-based approaches fall roughly into two categories. One prototype of a *data assimilation* scenario arises in meteorology where data are used to stabilize otherwise chaotic dynamical systems, typically with the aid of (stochastic) filtering techniques. A second setting, in line with the above examples, uses an underlying *stable* continuous model to *regularize* otherwise ill-posed estimation tasks in a "small-data" scenario. *Bayesian inversion* is a prominent way of regularizing such problems. It relaxes the estimation task to asking only for *posterior probabilities* of states or parameters to explain given observations.

The present article reviews some recent developments on data-driven state and parameter estimation that can be viewed as seeking alternatives to Bayesian inversion, by placing a stronger focus on deterministic uncertainty quantification and its relation to *computational complexity*. The emphasis is on foundational aspects such as the optimality of algorithms (formulated in an appropriate sense) when treating estimation tasks for "small-data" problems in *high-dimensional parameter* regimes. Central issues concern the role of *reduced modeling* and the exploitation of intrinsic problem metrics provided by the *variational formulation* of the underlying continuous family of PDEs. This is used by the so-called *Parametrized Background Data-Weak* (PBDW) framework, introduced in [20] and further analyzed in [4], to identify a suitable trial (Hilbert) space $\mathbb{U}$ that accommodates the states and eventually also the data. An important point is to distinguish between the *data* and the corresponding *sensors*—here linear functionals in the dual $\mathbb{U}'$ of $\mathbb{U}$—from which the data are generated. This will be seen to open a *geometric* perspective that sheds light on intrinsic estimation limits. Moreover, in the deterministic setting, a pivotal role is played by the so-called *solution manifold*, which is the set of all states that can be attained when the parameters in the PDE traverse the whole parameter domain.

Even with full knowledge of a state in the solution manifold, inferring from it a corresponding parameter is a *nonlinear, severely ill-posed* problem, typically formulated as a *non-convex* optimization problem. State estimation from data, on the other hand, is a *linear* and hence more benign inversion task, suffering under the current premises mainly from severe undersampling. We will, however, indicate how to reduce, under certain circumstances, the latter to the former problem so as to end up with a *convex* optimization problem. This motivates focusing in what follows mainly on state estimation. A central question then becomes how to best invoke knowledge of the solution manifold to regularize the estimation problem without introducing undue bias. Our principal viewpoint is to recast state estimation as an *optimal recovery* problem, which naturally leads one to explore the role and potential of *reduced modeling*.

The layout of the paper is as follows. Section 2 describes the conceptual framework for *state estimation* as an *optimal recovery task*. This formulation allows the identification of lower bounds for the best achievable recovery accuracy.

Section 3 reviews recent developments concerning a certain *affine* recovery scheme and highlights the role of *reduced models* adapted to the recovery task. The overarching theme is to establish certified recovery bounds. When striving for optimality of such affine recovery maps, high parameter dimensionality is identified as a major challenge. We outline a recent remedy that avoids the *Curse of Dimensionality* by trading deterministic accuracy guarantees against analogs that hold with quantifiable high probability.

Even optimal affine reduced models can, in general, not be expected to realize the benchmarks identified in Sect. 2. To put the results in Sect. 3 in proper perspective, we comment in Sect. 4 on ongoing work that uses the results on affine reduced models and corresponding estimators as a central building block for nonlinear estimators. We also indicate briefly some ramifications on parameter estimation.

#### **2 Models and Data**

#### *2.1 The Model*

Technological design or simulating physical processes is often based on continuum models given by a family

$$\mathcal{R}(u, y) = 0, \quad y \in \mathcal{Y}, \tag{2.1}$$

of partial differential equations (PDEs) that depend on parameters $y$ ranging over a parameter domain $\mathcal{Y} \subset \mathbb{R}^{d\_y}$. We will always assume *uniform well-posedness* of (2.1): for each $y \in \mathcal{Y}$, there exists a unique solution $u = u(y)$ in some trial Hilbert space $\mathbb{U}$ which satisfies $\mathcal{R}(u(y), y) = 0$.

Specifically, we consider only linear problems of the form $\mathcal{B}\_y u = f$, that is,

$$\mathcal{R}(u, y) = f - \mathcal{B}\_{y} u. \tag{2.2}$$

Here $f$ belongs to the dual $\mathbb{V}'$ of a suitable *test space* $\mathbb{V}$, and $\mathcal{B}\_y$ is a linear operator from $\mathbb{U}$ to $\mathbb{V}'$ that depends on $y \in \mathcal{Y}$. Uniform well-posedness then means that $\mathcal{B}\_y$ is boundedly invertible with bounds independent of $y$. By the Banach-Nečas-Babuška Theorem, this is equivalent to saying that the bilinear form

$$(u, v) \mapsto b\_{y}(u, v) := (\mathcal{B}\_{y} u)(v) \tag{2.3}$$

satisfies the following *continuity* and *inf-sup conditions*

$$\sup\_{u \in \mathbb{U}} \sup\_{v \in \mathbb{V}} \frac{b\_{y}(u, v)}{\|u\|\_{\mathbb{U}} \|v\|\_{\mathbb{V}}} \leq C\_{b} \quad \text{and} \quad \inf\_{u \in \mathbb{U}} \sup\_{v \in \mathbb{V}} \frac{b\_{y}(u, v)}{\|u\|\_{\mathbb{U}} \|v\|\_{\mathbb{V}}} \geq c\_{b} > 0, \quad y \in \mathcal{Y}, \tag{2.4}$$

together with the property that $b\_y(u, v) = 0$ for all $u \in \mathbb{U}$ implies $v = 0$ (injectivity of $\mathcal{B}\_y^\*$). The relevance of this stability notion lies in the entailed validity of the *error-residual relation*

$$C\_b^{-1} \|f - \mathcal{B}\_{y} v\|\_{\mathbb{V}'} \le \|u(y) - v\|\_{\mathbb{U}} \le c\_b^{-1} \|f - \mathcal{B}\_{y} v\|\_{\mathbb{V}'}, \quad v \in \mathbb{U}, \ y \in \mathcal{Y}, \tag{2.5}$$

where $\|g\|\_{\mathbb{V}'} := \sup\{g(v) : \|v\|\_{\mathbb{V}} = 1\}$. Thus, errors in the trial norm are equivalent to residuals in the dual test norm, which will be exploited in what follows.

For a wide range of problems such as space-time variational formulations, e.g. of parabolic or convection-diffusion problems, indefinite or singularly perturbed problems, the identification of a suitable pair U, V that guarantees stability in the above sense is not entirely straightforward. In particular, trial and test space may have to differ from each other, see e.g. [6, 11, 17, 23] for examples as well as some general principles.

The simplest example, used for illustration purposes, is the *elliptic* family

$$\mathcal{R}(u, \mathbf{y}) = f + \text{div } (a(\mathbf{y})\nabla u),\tag{2.6}$$

set in a domain $\Omega \subset \mathbb{R}^{d\_x}$, where $d\_x \in \{1, 2, 3\}$, with boundary conditions $u|\_{\partial\Omega} = 0$. Uniform well-posedness then follows for $\mathbb{U} = \mathbb{V} = H^1\_0(\Omega)$ if we have, for some fixed constants $0 < r \le R < \infty$, the bounds

$$r \le a(x, y) \le R, \quad (x, y) \in \Omega \times \mathcal{Y}, \tag{2.7}$$

readily implying (2.4).

Aside from well-posedness, a second important structural property of the model (2.1) is *affine parameter dependence*. By this we mean that


$$\mathcal{B}\_{y} u = \mathcal{B}\_0 u + \sum\_{j=1}^{d\_{y}} y\_j \mathcal{B}\_j u, \quad y = (y\_j)\_{j=1,\ldots,d\_{y}} \in \mathcal{Y}, \tag{2.8}$$

where the operators $\mathcal{B}\_j : \mathbb{U} \to \mathbb{V}'$ are *independent* of $y$. In turn, the residual has a similar affine dependence structure

$$\mathcal{R}(u, y) = \mathcal{R}\_0(u) + \sum\_{j=1}^{d\_{y}} y\_j \mathcal{R}\_j u, \quad \mathcal{R}\_0(u) := f - \mathcal{B}\_0 u, \quad \mathcal{R}\_j = -\mathcal{B}\_j. \tag{2.9}$$

For the example (2.6) such a structure is encountered for *affine* parametric representations of the diffusion coefficients

$$a(x, y) = a\_0(x) + \sum\_{j=1}^{d\_{y}} y\_j \theta\_j(x), \quad (x, y) \in \Omega \times \mathcal{Y}, \tag{2.10}$$

i.e., the field $a$ is expanded in terms of some given spatial basis functions $\theta\_j$. As indicated earlier, the pressure equation in Darcy's law for porous media flow is an example of (2.6), where a diffusion coefficient $a(y)$ of the form (2.10) may arise from a stochastic model for permeability via a Karhunen-Loève expansion. In this case (upon proper normalization) $y \in [-1, 1]^{\mathbb{N}}$ has, in principle, *infinitely* many entries, that is, $d\_y = \infty$. However, due to (2.7), the $\theta\_j$ must then exhibit some decay as $j \to \infty$, which means that the parameters become less and less important as $j$ increases. Another example is electrical impedance tomography, involving the same type of elliptic operator, where parametric expansions represent possible variations of conductivity, often modeled as piecewise constants; i.e., the $\theta\_j$ could be characteristic functions subordinate to a partition of $\Omega$. In this case data are acquired through sensors that act through trace functionals, greatly adding to the ill-posedness.
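To make this concrete, a one-dimensional analogue of (2.6) with an affine coefficient of the form (2.10) can be sketched in a few lines. All concrete choices below (sine modes $\theta\_j$ with $j^{-2}$ decay, unit load $f$, a simple finite-difference discretization) are illustrative assumptions; with $a\_0 = 1$ the bounds (2.7) hold for $y \in [-1, 1]^{d\_y}$ since $\sum\_j 0.5\, j^{-2} = 0.5\,\pi^2/6 < 1$.

```python
import numpy as np

def diffusion_coefficient(x, y, a0=1.0):
    # affine-in-y coefficient (2.10): a(x, y) = a0 + sum_j y_j * theta_j(x);
    # the sine modes theta_j(x) = 0.5 * j**-2 * sin(j*pi*x) are an illustrative
    # choice keeping a(x, y) >= a0 - 0.5*pi^2/6 > 0 for y in [-1, 1]^d
    return a0 + sum(yj * 0.5 * j**-2 * np.sin(j * np.pi * x)
                    for j, yj in enumerate(y, start=1))

def solve_diffusion(y, f=1.0, N=200):
    """Finite-difference solve of -(a(x,y) u')' = f on (0,1), u(0) = u(1) = 0."""
    h = 1.0 / N
    a = diffusion_coefficient((np.arange(N) + 0.5) * h, y)  # a at cell midpoints
    main = (a[:-1] + a[1:]) / h**2                          # tridiagonal stiffness
    off = -a[1:-1] / h**2
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.solve(A, np.full(N - 1, f))            # interior values of u(y)
```

For $y = 0$ the scheme reproduces, up to roundoff, the exact solution $x(1-x)/2$ of $-u'' = 1$.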

A central role in the subsequent discussion is played by the solution manifold

$$\mathcal{M} = u(\mathcal{Y}) := \{ u(y) \, : \, y \in \mathcal{Y} \} \tag{2.11}$$

which is the range of the *parameter-to-solution map* $u : y \mapsto u(y)$, i.e., the set of all states that can be attained as $y$ traverses $\mathcal{Y}$. Without further mention, $\mathcal{M}$ will be assumed to be compact, which actually follows under standard assumptions met in all of the above-mentioned examples.

Estimating states in $\mathcal{M}$ or corresponding parameters from measurements requires the efficient approximation of elements in $\mathcal{M}$. A common challenge encountered in all such models lies in the inherent *high-dimensionality* of the states $u = u(\cdot, y)$ as functions of $d\_x$ spatial variables $x \in \Omega$ and $d\_y \gg 1$ parametric variables $y \in \mathcal{Y}$. In particular, when $d\_y = \infty$, any calculation has, of course, to work with finitely many "activated" parameters whose number, however, has to be coordinated with the spatial resolution of a numerical scheme to retain *model-consistency*. It is especially this issue that hinders standard approaches based on *first discretizing* the parametric model, because rigorously balancing spatial and parametric uncertainties then becomes difficult.

What renders such problem scenarios nevertheless numerically tractable is a further property that will be implicitly assumed in what follows, namely that the *Kolmogorov n*-*widths* of the solution manifold

$$d\_n(\mathcal{M})\_{\mathbb{U}} := \inf\_{\dim \mathbb{U}\_n = n} \sup\_{u \in \mathcal{M}} \inf\_{v \in \mathbb{U}\_n} \|u - v\|\_{\mathbb{U}} \tag{2.12}$$

exhibit at least some algebraic decay

$$d\_{\mathfrak{n}}(\mathcal{M})\_{\mathbb{U}} \lessapprox n^{-s} \tag{2.13}$$

for some *s* > 0, see [13] for a comprehensive account.

For instance, this is known to be the case for elliptic models (2.6) with (2.7), as a consequence of results on sparse polynomial approximation of the parameter-to-solution map $y \mapsto u(y)$ established e.g. in [15]. More generally, (2.13) can be established under a general holomorphy property of the parameter-to-solution map, as a consequence of a similar algebraic decay assumed on the $n$-widths of the parameter set, see [14]. For a fixed finite number $d\_y < \infty$ of parameters, under certain structural assumptions on the parameter representations (e.g. piecewise constants on checkerboard partitions), one can even establish (sub-)exponential decay rates, see [2] for more details. Assuming that $s$ in (2.13) has a "substantial" size for any range of $d\_y$ is therefore justified.
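The decay (2.13) can be probed numerically: the singular values of a snapshot matrix measure how well the sampled manifold is captured by $n$-dimensional spaces. A sketch with a synthetic, holomorphic parameter-to-state map (an illustrative stand-in for $u(y)$, not an actual PDE solve):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)

def state(y):
    # synthetic parameter-to-state map, holomorphic in y -- an illustrative
    # stand-in for the solution map u(y), not an actual PDE solve
    return 1.0 / (2.5 + sum(yj * j**-2 * np.sin(j * np.pi * x)
                            for j, yj in enumerate(y, start=1)))

# snapshot matrix from 300 i.i.d. parameter samples with d_y = 6
S = np.column_stack([state(rng.uniform(-1, 1, 6)) for _ in range(300)])
sv = np.linalg.svd(S, compute_uv=False)
# the rapid decay of sv[n] / sv[0] mirrors the assumed n-width decay (2.13)
```

The anisotropy induced by the $j^{-2}$ weights makes the singular values fall off quickly, in line with the sparse-polynomial results cited above.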

In summary, the results discussed below are valid and practically feasible for well-posed linear models (2.4) with affine parameter dependence (2.9) whose solution manifolds have rapidly decaying $n$-widths (2.13).

#### *2.2 The Data*

Suppose we are given data $w = (w\_1, \ldots, w\_m) \in \mathbb{R}^m$ representing observations of an unknown state $u \in \mathbb{U}$, obtained through $m$ linearly independent linear functionals $\ell\_i \in \mathbb{U}'$, i.e.,

$$w\_i = \ell\_i(u), \quad i = 1, \ldots, m. \tag{2.14}$$

Since in real applications data acquisition may be costly or harmful we assume that *m* is *fixed*. The central task to be discussed in what follows is to recover from this information an estimate for the observed unknown state *u*, based on the prior assumption that *u* belongs to M or is close to M. Moreover, to bring out the essence of this estimation task we assume for the moment that the data are noise-free.

Following [4, 20], we first recast the data in a "compliant" metric, by introducing the Riesz representers <sup>ψ</sup>*<sup>i</sup>* <sup>∈</sup> <sup>U</sup>, defined by

$$(\psi\_i, v)\_{\mathbb{U}} = \ell\_i(v), \quad v \in \mathbb{U}, \quad i = 1, \ldots, m.$$

The $\psi\_i$ span the $m$-dimensional subspace $\mathbb{W} \subset \mathbb{U}$, which we refer to as the *measurement space*; the information carried by the $\ell\_i(u)$ is equivalent to that of the orthogonal projection $P\_{\mathbb{W}} u$ of $u$ onto $\mathbb{W}$. The decomposition

$$
u = P\_{\mathbb{W}} u + P\_{\mathbb{W}^\perp} u, \quad u \in \mathbb{U}, \tag{2.15}
$$

thus contains a first term that is "seen" by the sensors and a second (infinite-dimensional) term which cannot be detected. The decomposition (2.15) may be seen as a sensor-induced "coordinate system", thereby opening up a *geometric perspective* that will prove very useful in what follows. State estimation can then be viewed as learning from samples $w := P\_{\mathbb{W}} u$ the unknown "labels" $P\_{\mathbb{W}^\perp} u \in \mathbb{W}^\perp$.
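In a discretized setting, the Riesz representers and the projection $P\_{\mathbb{W}}$ amount to a few lines of linear algebra. A sketch in $\mathbb{R}^N$, where the $\mathbb{U}$-inner product is $\langle u, v\rangle = u^\top M v$ for an SPD Gram matrix $M$ and the sensors are generic linear functionals (all concrete choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 50, 5
C = rng.standard_normal((N, N))
M = C @ C.T + N * np.eye(N)         # SPD Gram matrix of the discrete U-inner product
L = rng.standard_normal((m, N))     # rows L[i]: sensor functionals l_i(v) = L[i] @ v

Psi = np.linalg.solve(M, L.T)       # Riesz representers: M @ psi_i = l_i
G = L @ Psi                         # Gram matrix (psi_i, psi_j)_U = l_i(psi_j)

def project_W(u):
    """U-orthogonal projection onto W = span{psi_i}, using (psi_i, u)_U = l_i(u)."""
    return Psi @ np.linalg.solve(G, L @ u)
```

By construction the sensors cannot distinguish $u$ from $P\_{\mathbb{W}} u$, which is exactly the split (2.15).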

In this article, we are interested in how well we can approximate *u* from the information that *u* ∈ M and *P*W*u* = *w* with *w* given to us. Any such approximation is given by a mapping *<sup>A</sup>* : *<sup>w</sup>* <sup>→</sup> *<sup>A</sup>*(*w*) <sup>∈</sup> <sup>U</sup>. The overall performance of recovery on all of M by the mapping *A* is typically measured in the worst case setting, that is,

$$E\_{\rm wc}(A, \mathcal{M}, \mathcal{W}) = \sup\_{u \in \mathcal{M}} \|u - A(P\_{\mathcal{W}} u)\|\_{\mathbb{U}}.\tag{2.16}$$

The optimal recovery error on M is then defined as

$$E\_{\rm wc}(\mathcal{M}, \mathbb{W}) := \inf\_{A} E\_{\rm wc}(A, \mathcal{M}, \mathbb{W}), \tag{2.17}$$

where the infimum is over all possible recovery maps. Let us observe that the construction of recovery maps can be restricted to be of the form

$$A: \mathcal{w} \to A(\mathcal{w}), \quad A(\mathcal{w}) = \mathcal{w} + B(\mathcal{w}), \quad \text{with } B: \mathcal{W} \to \mathcal{W}^{\perp}. \tag{2.18}$$

Indeed, given any recovery mapping *A*, we can write *A*(*w*) = *P*W*A*(*w*) + *P*<sup>W</sup><sup>⊥</sup> *A*(*w*) and the performance of the recovery can only be improved if we replace the first term by *w*. In other words, *A*(*w*) should belong to the affine space

$$\mathbb{U}\_{\mathfrak{w}} := \mathfrak{w} + \mathbb{W}^{\perp},\tag{2.19}$$

that contains *u*. The mappings *B* are commonly referred to as liftings into W⊥.

#### *2.3 Optimality Criteria and Numerical Recovery*

Finding a best recovery map $A$ attaining (2.17) is known as *optimal recovery*. The best mapping has a well-known simple theoretical description, see e.g. [21], which we now recall. Note first that a precise recovery of the unknown state $u$ from the given information is generally impossible. Indeed, the best we can say about $u$ is that it lies in the *manifold slice*

$$\mathcal{M}\_w := \{ u \in \mathcal{M} : P\_{\mathbb{W}} u = w \} = \mathcal{M} \cap \mathbb{U}\_w, \tag{2.20}$$

which comprises all elements in $\mathcal{M}$ sharing the same measurement $w \in \mathbb{W}$. The Chebyshev ball $B(\mathcal{M}\_w)$ is the smallest ball in $\mathbb{U}$ that contains $\mathcal{M}\_w$. The best recovery algorithm is then given by the mapping

$$A^\*(w) := \text{cen}(\mathcal{M}\_\mathbb{w}),\tag{2.21}$$

that assigns to each $w \in \mathbb{W}$ the center $\text{cen}(\mathcal{M}\_w)$ of $B(\mathcal{M}\_w)$, called the Chebyshev center of $\mathcal{M}\_w$. The radius $\text{rad}(\mathcal{M}\_w)$ of $B(\mathcal{M}\_w)$ is then the best worst case error over the class $\mathcal{M}\_w$. The best worst case error over $\mathcal{M}$, which is achieved by $A^\*$, is thus given by

$$E\_{\rm wc}(\mathcal{M}, W) = E\_{\rm wc}(A^\*, \mathcal{M}, \mathcal{W}) = \max\_{\mathbf{w} \in \mathcal{W}} \text{rad}(\mathcal{M}\_{\mathbf{w}}).\tag{2.22}$$

While the above mapping *A*<sup>∗</sup> gives a nice theoretical description of the optimal recovery algorithm, it is typically not numerically implementable since the Chebyshev center cen(M*w*) is not easily found. Moreover, such an optimal algorithm is highly nonlinear and possibly discontinuous. The purpose of this section is to formulate a more modest goal for the performance of a recovery algorithm with the hope that this more modest goal can be met with a numerically realizable algorithm. The remaining sections of the paper introduce numerically implementable recovery mappings, analyze their performance, and evaluate the numerical cost in constructing these mappings.

The search for a numerically realizable algorithm must of necessity relax the performance criterion. A first possibility is to weaken it to *near best* algorithms. This means that we search for an algorithm $A$ such that

$$E\_{\rm wc}(A, \mathcal{M}, \mathbb{W}) \le C\_0 E\_{\rm wc}(\mathcal{M}, \mathbb{W}),\tag{2.23}$$

with a reasonable value of *C*<sup>0</sup> > 1. For example, any mapping *A* which takes *w* into an element in the Chebyshev ball of M*<sup>w</sup>* is near best with constant *C*<sup>0</sup> = 2. However, finding near best mappings *A* also seems to be numerically out of reach.

In order to formulate a more attainable performance criterion, we return to our earlier observations about uncertainty in both the model class M and in the measurements *w*. The former is a modeling error while the latter is an inherent measurement error. Both of these uncertainties can be quantified by introducing for each ε > 0, the ε-neighborhood of the manifold

$$\mathcal{M}^{\varepsilon} := \{ \boldsymbol{\nu} \in \mathbb{U} : \text{dist}\,(\boldsymbol{\nu}, \mathcal{M})\_{\mathbb{U}} \le \varepsilon \}. \tag{2.24}$$

The uncertainty in the model can be thought of as saying that the sought-after $u$ lies in $\mathcal{M}^\varepsilon$ rather than in $\mathcal{M}$. Also, we may formulate uncertainty (noise) in the measurements as saying that they are not measurements of a $u \in \mathcal{M}$ but rather of some $u \in \mathcal{M}^\varepsilon$. Here the value of $\varepsilon$ quantifies these uncertainties.

Our new goal is to numerically construct a recovery map *A* that is near-optimal on <sup>M</sup>ε, for some given ε > 0. Let us note that <sup>M</sup><sup>ε</sup> is not compact. An algorithm *<sup>A</sup>* is worst-case near optimal for <sup>M</sup><sup>ε</sup> if and only if its performance is bounded by a constant multiple of the diameter

$$\delta\_{\varepsilon}(\mathcal{M}, \mathbb{W}) := \max \{ \|u - \boldsymbol{\nu}\|\_{\mathbb{U}} : u, \boldsymbol{\nu} \in \mathcal{M}^{\varepsilon}, \, P\_{\mathbb{W}}(u - \boldsymbol{\nu}) = 0 \}. \tag{2.25}$$

Notice that $\varepsilon = 0$ gives the performance criterion for near optimal recovery over $\mathcal{M}$. One can show that the function $\varepsilon \mapsto \delta\_\varepsilon(\mathcal{M}, \mathbb{W})$ is monotone non-decreasing in $\varepsilon$, continuous from the right, and $\lim\_{\varepsilon \to 0^+} \delta\_\varepsilon(\mathcal{M}, \mathbb{W}) = \delta\_0(\mathcal{M}, \mathbb{W})$. The speed at which $\delta\_\varepsilon(\mathcal{M}, \mathbb{W})$ approaches $\delta\_0(\mathcal{M}, \mathbb{W})$ reflects the "condition" of the estimation problem, depending on $\mathcal{M}$ and $\mathbb{W}$. While the practical realization of worst-case near-optimality for $\mathcal{M}^\varepsilon$ is already a challenge, quantifying the corresponding computational cost would require assumptions on the condition of the problem.

One central theme, guiding subsequent discussions, is therefore to find recovery maps *A*<sup>ε</sup> that realize an error bound of the form

$$E\_{\rm wc}(A\_{\varepsilon}, \mathcal{M}, \mathcal{W}) \le C\_0 \delta\_{\varepsilon}(\mathcal{M}, \mathcal{W}).\tag{2.26}$$

Any a priori information on measurement accuracy and model bias might be used to choose a viable tolerance ε.

High parametric dimensionality poses particular challenges to estimation tasks when the targeted error bounds are in the above worst case sense. These challenges can be somewhat mitigated by adopting a Bayesian point of view [24]. The prior information on $u$ is then described by a probability distribution $p$ on $\mathbb{U}$, supported on $\mathcal{M}$. Such a measure is typically induced by a probability distribution on $\mathcal{Y}$ that may or may not be known. In the latter case, sampling $\mathcal{M}$, i.e., computing snapshots $u(y^i)$, $i = 1, \ldots, N$, for i.i.d. samples $y^i \in \mathcal{Y}$, provides labeled data $(w\_i, w\_i^\perp) = (P\_{\mathbb{W}} u(y^i), P\_{\mathbb{W}^\perp} u(y^i))$ according to the sensor-based decomposition (2.15). This puts us into the setting of *regression* in machine learning, asking for an estimator that predicts for any new measurement $w \in \mathbb{W}$ its lifting $w^\perp = B(w)$. It is then natural to measure the performance of an algorithm in an averaged sense. The best estimator $A$ that minimizes the mean-square risk

$$E\_{\rm ms}(A, p, \mathcal{W}) = \mathbb{E}(\|\mu - A(P\_{\mathcal{W}}\mu)\|^2) = \int\_{\mathcal{U}} \|\mu - A(P\_{\mathcal{W}}\mu)\|^2 dp(\mu) \tag{2.27}$$

is given by the conditional expectation

$$A(\boldsymbol{w}) = \mathbb{E}(\boldsymbol{u}|P\_{\mathbb{W}}\boldsymbol{u} = \boldsymbol{w}).\tag{2.28}$$

Since always *<sup>E</sup>*ms(*A*, *<sup>p</sup>*,W) <sup>≤</sup> *<sup>E</sup>*wc(*A*,M,W), the optimality benchmarks are somewhat weaker. In the rest of this paper, we adhere to the worst case error in the deterministic setting that only assumes membership of *<sup>u</sup>* to <sup>M</sup> or <sup>M</sup>ε.
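Before doing so, the regression viewpoint just described can be sketched concretely: the simplest lifting $B$ is fitted by linear least squares on labeled snapshots. The sketch below uses the Euclidean inner product, a synthetic two-parameter family standing in for $\mathcal{M}$, and a random measurement space (all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 100)

def state(y):
    # synthetic two-parameter family standing in for the solution manifold M
    return np.sin(np.pi * x * (1 + 0.3 * y[0])) * np.exp(0.2 * y[1] * x)

W = np.linalg.qr(rng.standard_normal((100, 4)))[0]   # ONB of the measurement space
train = np.column_stack([state(rng.uniform(-1, 1, 2)) for _ in range(200)])
Wc = W.T @ train                                     # measurements P_W u (coordinates)
perp = train - W @ Wc                                # labels P_{W-perp} u
B, *_ = np.linalg.lstsq(Wc.T, perp.T, rcond=None)    # linear lifting w -> B(w)

def estimate(w):
    """Recovery map A(w) = w + B(w) of the form (2.18), learned by regression."""
    return W @ w + B.T @ w
```

Since the learned lifting takes values in $\mathbb{W}^\perp$, the estimator is automatically data-consistent, i.e., $P\_{\mathbb{W}} A(w) = w$.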

The following section is concerned with an important *building block* on a pathway towards achieving (2.26) at quantifiable computational cost. This building block, referred to as the *one-space method*, is a linear (affine) scheme which is, in principle, simple and easy to implement numerically. It depends on suitably chosen subspaces. We highlight the *regularizing property* of these subspaces as well as ways to *optimize* them. This will reveal certain intrinsic obstructions caused by *parameter dimensionality*. The one-space method by itself will generally not achieve (2.26) but, as indicated earlier, can be used as a building block in a *nonlinear* recovery scheme that may indeed meet the goal (2.26).

#### **3 The One-Space Method**

#### *3.1 Subspace Regularization*

The one-space method can be viewed as a simple regularizer for state estimation. The resulting recovery map is induced by an $n$-dimensional subspace $\mathbb{U}\_n$ of $\mathbb{U}$ for $n \le m$. Assume that, for each $n \ge 0$, we are given a subspace $\mathbb{U}\_n \subset \mathbb{U}$ of dimension $n$ whose distance from $\mathcal{M}$ can be assessed:

$$\text{dist}(\mathcal{M}, \mathbb{U}\_n)\_{\mathbb{U}} := \max\_{u \in \mathcal{M}} \text{dist}(u, \mathbb{U}\_n)\_{\mathbb{U}} \le \varepsilon\_n. \tag{3.1}$$

Then the cylinder

$$\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n) := \{ u \in \mathbb{U} : \text{dist}(u, \mathbb{U}\_n)\_{\mathbb{U}} \le \varepsilon\_n \} \tag{3.2}$$

contains $\mathcal{M}$, and likewise the cylinder $\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n + \varepsilon)$ contains $\mathcal{M}^\varepsilon$. Our prior assumption that the observed state belongs to $\mathcal{M}$ or $\mathcal{M}^\varepsilon$ can then be relaxed by assuming membership to these larger but simpler sets.

Remarkably, one can quite easily realize an optimal recovery map that meets the relaxed benchmark $E\_{\rm wc}(\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n), \mathbb{W})$: in [4] it was shown that the Chebyshev center of the slice

$$\mathcal{K}\_w(\mathbb{U}\_n, \varepsilon\_n) := \mathcal{K}(\mathbb{U}\_n, \varepsilon\_n) \cap \mathbb{U}\_w, \tag{3.3}$$

is exactly given by the state in U*<sup>w</sup>* that is closest to U*n*, that is

$$u^\* = u^\*(w) := \operatorname\*{argmin}\_{u \in \mathbb{U}\_w} \|u - P\_{\mathbb{U}\_n} u\|\_{\mathbb{U}}. \tag{3.4}$$

This minimizer exists and can be shown to be unique as long as <sup>U</sup>*<sup>n</sup>* <sup>∩</sup> <sup>W</sup><sup>⊥</sup> = {0}. The corresponding optimal recovery map

$$A\_{\mathbb{U}\_n} \colon w \mapsto u^\*(w) \tag{3.5}$$

was first introduced in [20] as the Parametrized Background Data-Weak (PBDW) algorithm, and is referred to as the *one-space* method in [4]. Due to its minimizing property, it is readily checked that this map is linear and can be determined with the aid of the singular value decomposition of the cross-Gramian between any pair of orthonormal bases for $\mathbb{U}\_n$ and $\mathbb{W}$.
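In coordinates, the one-space estimate (3.4) reduces to an ordinary least-squares problem for the cross-Gramian: one can check that $u^\* = v^\* + (w - P\_{\mathbb{W}} v^\*)$, where $v^\* \in \mathbb{U}\_n$ minimizes $\|w - P\_{\mathbb{W}} v\|\_{\mathbb{U}}$. A Euclidean sketch with random orthonormal bases standing in for $\mathbb{U}\_n$ and $\mathbb{W}$ (illustrative choices throughout):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, m = 60, 3, 8
Phi = np.linalg.qr(rng.standard_normal((N, n)))[0]   # ONB of the reduced space U_n
Psi = np.linalg.qr(rng.standard_normal((N, m)))[0]   # ONB of the measurement space W
G = Psi.T @ Phi                                      # m x n cross-Gramian

def pbdw(c):
    """One-space (PBDW) estimate from measurement coordinates c = Psi.T @ u."""
    z, *_ = np.linalg.lstsq(G, c, rcond=None)        # v* = Phi @ z minimizes |c - G z|
    return Phi @ z + Psi @ (c - G @ z)               # u* = v* + (w - P_W v*)

beta = np.linalg.svd(G, compute_uv=False)[-1]        # inf-sup constant (3.7)
mu = 1.0 / beta                                      # stability constant (3.6)
```

States lying exactly in $\mathbb{U}\_n$ are recovered exactly whenever $\beta > 0$, and every estimate reproduces the data, i.e., $P\_{\mathbb{W}} u^\* = w$.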

The worst case error *<sup>E</sup>*wc(K(U*n*, ε*n*),W) can be described more precisely by introducing

$$\mu(\mathbb{U}\_n, \mathbb{W}) := \sup\_{\nu \in \mathbb{U}\_n} \frac{\|\nu\|\_{\mathbb{U}}}{\|P\_{\mathbb{W}}\nu\|\_{\mathbb{U}}} \tag{3.6}$$

which is finite if and only if $\mathbb{U}\_n \cap \mathbb{W}^\perp = \{0\}$. This quantity, also introduced in a related but slightly different context in [1], is therefore related to the angle between the spaces $\mathbb{U}\_n$ and $\mathbb{W}$. It becomes large when $\mathbb{U}\_n$ contains elements that are nearly perpendicular to $\mathbb{W}$. It is actually computable: one has $\mu(\mathbb{U}\_n, \mathbb{W}) = \beta(\mathbb{U}\_n, \mathbb{W})^{-1}$, where

$$\beta(\mathbb{U}\_n, \mathbb{W}) := \inf\_{v \in \mathbb{U}\_n} \sup\_{w \in \mathbb{W}} \frac{\langle v, w \rangle\_{\mathbb{U}}}{\|v\|\_{\mathbb{U}} \|w\|\_{\mathbb{U}}}, \tag{3.7}$$

and β(U*n*,W) is the smallest singular value of the cross-Gramian between any pair of orthonormal bases for W and U*n*. It has been shown in [4, 20] that the worst case error bound over <sup>K</sup>(U*n*, ε*n*) is given by

$$E\_{\rm wc}(A\_{\mathbb{U}\_n}, \mathcal{K}(\mathbb{U}\_n, \varepsilon\_n), \mathbb{W}) = E\_{\rm wc}(\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n), \mathbb{W}) = \mu(\mathbb{U}\_n, \mathbb{W})\varepsilon\_n. \tag{3.8}$$

The quantity μ(U*n*,W) also coincides with the norm of the linear recovery map *A*<sup>U</sup>*<sup>n</sup>* . Relaxing the prior *u* ∈ M by exploiting information on M solely through approximability of <sup>M</sup> by <sup>U</sup>*n*, thus implicitly *regularizes* the estimation task: whenever μ(U*n*,W) is finite, the optimal recovery map *A*<sup>U</sup>*<sup>n</sup>* is bounded and hence Lipschitz.

One important observation is that the map *A*<sup>U</sup>*<sup>n</sup>* is actually independent of ε*n*. In particular, it achieves optimality for the smallest possible containment cylinder

$$\mathcal{K}(\mathbb{U}\_n) := \mathcal{K}(\mathbb{U}\_n, \text{dist}(\mathcal{M}, \mathbb{U}\_n)\_{\mathbb{U}}),\tag{3.9}$$

and therefore, since *<sup>E</sup>*wc(*A*<sup>U</sup>*<sup>n</sup>* ,M,W) <sup>≤</sup> *<sup>E</sup>*wc(*A*<sup>U</sup>*<sup>n</sup>* , <sup>K</sup>(U*n*),W) <sup>=</sup> *<sup>E</sup>*wc(K(U*n*),W),

$$E\_{\rm wc}(A\_{\mathbb{U}\_n}, \mathcal{M}, \mathbb{W}) \le \mu(\mathbb{U}\_n, \mathbb{W}) \text{dist} \,(\mathcal{M}, \mathbb{U}\_n)\_{\mathbb{U}}.\tag{3.10}$$

Likewise, the containment <sup>M</sup><sup>ε</sup> <sup>⊂</sup> <sup>K</sup>(U*n*, dist(M, <sup>U</sup>*n*)<sup>U</sup> <sup>+</sup> ε) implies that

$$E\_{\rm wc}(A\_{\mathbb{U}\_n}, \mathcal{M}^\varepsilon, \mathbb{W}) \le \mu(\mathbb{U}\_n, \mathbb{W}) (\text{dist}\,(\mathcal{M}, \mathbb{U}\_n)\_{\mathbb{U}} + \varepsilon). \tag{3.11}$$

On the other hand, the recovery map $A\_{\mathbb{U}\_n}$ may be far from optimal over the sets $\mathcal{M}$ or $\mathcal{M}^\varepsilon$. This is due to the fact that the cylinders $\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n)$ and $\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n + \varepsilon)$ may be much larger than $\mathcal{M}$ or $\mathcal{M}^\varepsilon$. In particular, it is quite possible that for a particular observation $w$ one has $\text{rad}(\mathcal{M}\_w) \ll \text{rad}(\mathcal{K}\_w(\mathbb{U}\_n, \varepsilon\_n))$. Therefore, we cannot generally expect the one-space method to achieve our goal (2.26). In particular, the condition $n \le m$, which is necessary to avoid $\mu(\mathbb{U}\_n, \mathbb{W}) = \infty$, limits the dimension of an approximating subspace $\mathbb{U}\_n$, and therefore $\varepsilon\_n$ itself is inherently bounded from below. The "dimension budget" has therefore to be used wisely in order to obtain good performance bounds. This typically rules out "generic" approximation spaces such as finite element spaces, and raises the question of which subspace $\mathbb{U}\_n$ yields the best estimator when applying the above method.

#### *3.2 Optimal Affine Recovery*

The results of the previous section raise the question of the best choice of the space $\mathbb{U}\_n$ for the given $\mathcal{M}$. On the one hand, proximity to $\mathcal{M}$ is desirable since $\text{dist}(\mathcal{M}, \mathbb{U}\_n)\_{\mathbb{U}}$ enters the error bound. On the other hand, favoring proximity may increase $\mu(\mathbb{U}\_n, \mathbb{W})$. Before addressing this question systematically, it is important to note that the above results carry over verbatim when $\mathbb{U}\_n$ is replaced by an *affine space* $\mathbb{U}\_n = \bar{u} + \tilde{\mathbb{U}}\_n$, where $\tilde{\mathbb{U}}\_n \subset \mathbb{U}$ is a linear space. This means the reduced model $\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n)$ is of the form

$$
\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n) := \bar{u} + \mathcal{K}(\tilde{\mathbb{U}}\_n, \varepsilon\_n).
$$

The best worst-case recovery bound is now given by

$$E\_{\rm wc}(\mathcal{K}(\mathbb{U}\_n, \varepsilon\_n), \mathbb{W}) = \mu(\widetilde{\mathbb{U}}\_n, \mathbb{W}) \varepsilon\_n. \tag{3.12}$$

Intuitively, this may help to better control the angle between W and U*<sup>n</sup>* by anchoring the affine space at a suitable location (typically near or on M). More importantly, it helps in *localizing* models via parameter domain decompositions that will be discussed later.

The one-space algorithm discussed in the previous section confines the "dimensionality" budget of the approximation spaces $\mathbb{U}_n$ to $n \le m$. In view of (3.10), to obtain good overall estimation accuracy, this space can clearly not be chosen arbitrarily, but should be well adapted both to the solution manifold $\mathcal{M}$ and to the measurement space $\mathbb{W}$, that is, to the given observation functionals giving rise to the data.

A simple way of *adapting* a recovery space to $\mathbb{W}$ is as follows: suppose for a moment that we were able to construct, for $n = 1, \ldots, m$, a hierarchy of spaces $\mathbb{U}_1^{\rm nb} \subset \mathbb{U}_2^{\rm nb} \subset \cdots \subset \mathbb{U}_m^{\rm nb}$ that approximate $\mathcal{M}$ in a *near-best* way, namely

$$\text{dist}\,(\mathcal{M}, \mathbb{U}\_n^{\text{nb}})\_{\mathbb{U}} \le C\, d\_n(\mathcal{M})\_{\mathbb{U}}.\tag{3.13}$$

We may compute along the way the quantities $\mu(\mathbb{U}_j^{\rm nb}, \mathbb{W})$, and then choose

$$n^\* = \operatorname\*{argmin}\_{n \le m} \mu \left( \mathbb{U}\_n^{\text{nb}}, \mathbb{W} \right) \text{dist} \left( \mathcal{M}, \mathbb{U}\_n^{\text{nb}} \right)\_{\mathbb{U}}, \tag{3.14}$$

and take the map $A_{\mathbb{U}_{n^*}^{\rm nb}}$. We sometimes refer to this choice as the *"poor man's algorithm"*. It is not clear, though, whether $\mathbb{U}_{n^*}^{\rm nb}$ is indeed a near-best choice for state recovery by the one-space method. In other words, one may question whether

$$E\_{\rm wc}(A\_{\mathbb{U}\_{n^\*}^{\rm nb}}, \mathcal{M}, \mathbb{W}) \le C \inf\_{\dim \widetilde{\mathbb{U}} \le m} E\_{\rm wc}(A\_{\widetilde{\mathbb{U}}}, \mathcal{M}, \mathbb{W}), \tag{3.15}$$

holds with a uniform constant *C* < ∞. In fact, numerical tests strongly suggest otherwise, which motivated in [12] the following alternative to the poor man's algorithm.
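In implementation terms, the poor man's selection (3.14) is just a scan over precomputed quantities; a sketch with hypothetical values (the numbers are purely illustrative):

```python
import numpy as np

def poor_mans_n_star(mus, dists):
    """Select n* minimizing mu(U_n^nb, W) * dist(M, U_n^nb)_U, cf. (3.14).
    Entry j of each list corresponds to n = j + 1."""
    return int(np.argmin(np.asarray(mus) * np.asarray(dists))) + 1

# Hypothetical precomputed values for n = 1..4: mu grows, dist shrinks.
print(poor_mans_n_star([1.2, 2.0, 5.0, 40.0], [0.5, 0.12, 0.05, 0.01]))  # -> 2
```

The trade-off the text describes is visible here: increasing $n$ decreases the distance term but inflates $\mu$, so the product is minimized at an intermediate dimension.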

Recall that a given linear space $\mathbb{U}_n$ determines the linear recovery map $A_{\mathbb{U}_n}$. Likewise, a given affine space $\mathbb{U}_n$ determines an affine recovery map $A_{\mathbb{U}_n}$. Conversely, it can be checked that an affine recovery map $A$ determines an affine space $\mathbb{U}_n$ that allows one to interpret the recovery scheme as a one-space method, in the sense that $A = A_{\mathbb{U}_n}$. Denoting by $\mathcal{A}$ the class of all affine mappings of the form

$$A(\mathbf{w}) = \mathbf{w} + \mathbf{z} + B\mathbf{w},\tag{3.16}$$

where $z \in \mathbb{W}^\perp$ and $B \in \mathcal{L}(\mathbb{W}, \mathbb{W}^\perp)$ is linear, we might thus as well directly look for a mapping that minimizes

$$E\_{\rm wc}(A, \mathcal{M}, \mathbb{W}) := \sup\_{u \in \mathcal{M}} \|u - A(P\_{\mathbb{W}} u)\|\_{\mathbb{U}} = \sup\_{u \in \mathcal{M}} \|P\_{\mathbb{W}^\perp} u - z - BP\_{\mathbb{W}} u\|\_{\mathbb{U}} =: \mathcal{E}(z, B) \tag{3.17}$$

over $\mathcal{A}$, i.e., over all $(z, B) \in \mathbb{W}^\perp \times \mathcal{L}(\mathbb{W}, \mathbb{W}^\perp)$. It can be shown that a minimizing pair $(z^*, B^*)$ indeed exists, i.e.,

$$\mathcal{E}(z^\*, B^\*) = \min\_{A \in \mathcal{A}} E\_{\text{wc}}(A, \mathcal{M}, \mathbb{W}) =: E\_{\text{wc}, \mathcal{A}}(\mathcal{M}, \mathbb{W}),$$

see [12]. However, the minimization of $E_{\rm wc}(A, \mathcal{M}, \mathbb{W})$ over $(z, B) \in \mathbb{W}^\perp \times \mathcal{L}(\mathbb{W}, \mathbb{W}^\perp)$ is far from practically feasible. In fact, each evaluation of $E_{\rm wc}(A, \mathcal{M}, \mathbb{W})$ requires exploring $\mathcal{M}$, and $B$ can have a range in the infinite-dimensional space $\mathbb{W}^\perp$. In order to arrive at a computationally tractable problem, one needs to (i) replace $\mathcal{M}$ by a finite training set $\widetilde{\mathcal{M}} \subset \mathcal{M}$ forming an $\eta$-net of $\mathcal{M}$, and (ii) replace $\mathbb{W}^\perp$ by the finite-dimensional orthogonal complement

$$
\widetilde{\mathbb{W}}^{\perp} := \mathbb{U}\_L \ominus \mathbb{W} \tag{3.18}
$$

of $\mathbb{W}$ in a finite-dimensional space $\mathbb{U}_L \supset \mathbb{W}$ satisfying $\operatorname{dist}(\mathcal{M}, \mathbb{U}_L)_{\mathbb{U}} \le \delta$.

The resulting optimization problem

$$(\tilde{z}, \tilde{B}) = \operatorname\*{argmin}\_{(z, B) \in \widetilde{\mathbb{W}}^{\perp} \times \mathcal{L}(\mathbb{W}, \widetilde{\mathbb{W}}^{\perp})} \sup\_{u \in \widetilde{\mathcal{M}}} \|P\_{\mathbb{W}^\perp} u - z - BP\_{\mathbb{W}} u\|\_{\mathbb{U}} \tag{3.19}$$

can be solved by primal-dual splitting methods with an $O(1/k)$ convergence rate [12].

Due to the perturbations (i) and (ii) of the ideal minimization problem, the resulting pair $(\tilde{z}, \tilde{B})$ is no longer optimal. However, one can show that

$$E\_{\rm wc}(\widetilde{A}, \mathcal{M}, \mathbb{W}) \le E\_{\rm wc, \mathcal{A}}(\mathcal{M}, \mathbb{W}) + \eta + C\delta,\tag{3.20}$$

where the constant $C$ is the operator norm of the $B$ minimizing (3.17). On the other hand, since the range of any affine mapping $A$ is an affine space of dimension at most $m$, and therefore contained in a linear space of dimension at most $m + 1$, one always has $E_{{\rm wc}, \mathcal{A}}(\mathcal{M}, \mathbb{W}) \ge d_{m+1}(\mathcal{M})_{\mathbb{U}}$. Therefore $(\tilde{z}, \tilde{B})$ satisfies the near-optimal bound

$$E\_{\rm wc}(\widetilde{A}, \mathcal{M}, \mathbb{W}) \lesssim E\_{\rm wc, \mathcal{A}}(\mathcal{M}, \mathbb{W}),\tag{3.21}$$

whenever η and δ are picked such that

$$
\eta \lesssim d\_{m+1}(\mathcal{M})\_{\mathbb{U}}, \quad \text{and} \quad \delta \lesssim d\_{m+1}(\mathcal{M})\_{\mathbb{U}}.\tag{3.22}
$$

The numerical tests in [12] for a model problem of the type (2.6), with piecewise constant checkerboard diffusion coefficients and parameter dimension $d_y$ up to $d_y = 64$, show that this recovery map exhibits significantly better accuracy than the method based on (3.14). It even yields smaller error bounds than the affine mean square estimator (2.27). The following section discusses the numerical cost entailed by conditions like (3.22).

#### *3.3 Rate-Optimal Reduced Bases*

To keep the dimension $L$ of the space $\mathbb{U}_L$ in (3.18) small, a near-best subspace $\mathbb{U}_L^{\rm nb}$ in the sense of (3.13) would be highly desirable. Likewise, the poor man's scheme (3.14) would benefit from such subspaces. Unfortunately, such near-best subspaces are not practically accessible. The *reduced basis method* aims to construct subspaces which come close to near-optimality in a sense that we explain next. The main idea is to generate these subspaces by a sequence of elements picked from the manifold $\mathcal{M}$ itself, by means of a *weak-greedy algorithm* introduced and studied in [8]. In an idealized form, this algorithm proceeds as follows: given a current space $\mathbb{U}_n^{\rm wg} = \operatorname{span}\{u_1, \ldots, u_n\}$, one takes $u_{n+1} = u(y_{n+1})$ such that, for some fixed $\gamma \in\, ]0, 1]$, $\|u_{n+1} - P_{\mathbb{U}_n} u_{n+1}\|_{\mathbb{U}} \ge \gamma \max_{u \in \mathcal{M}} \|u - P_{\mathbb{U}_n} u\|_{\mathbb{U}}$, or equivalently

$$\|u(\mathbf{y}\_{n+1}) - P\_{\mathbb{U}\_n}u(\mathbf{y}\_{n+1})\|\_{\mathbb{U}} \ge \gamma \max\_{\mathbf{y} \in \mathcal{Y}} \|u(\mathbf{y}) - P\_{\mathbb{U}\_n}u(\mathbf{y})\|\_{\mathbb{U}}.\tag{3.23}$$

Then, one defines $\mathbb{U}_{n+1}^{\rm wg} = \operatorname{span}\{u_1, \ldots, u_{n+1}\}$. While the weak-greedy algorithm does, unfortunately, in general not produce spaces satisfying (3.13), it does come close. Namely, it has been shown in [3, 19] that the spaces $\mathbb{U}_n^{\rm wg}$ are *rate-optimal* in the following sense:

(i) For any *s* > 0 one has

$$d\_n(\mathcal{M})\_\mathbb{U} \le C(n+1)^{-s}, \ n \ge 0 \implies \text{dist}\,(\mathcal{M}, \mathbb{U}\_n^{\text{wg}})\_\mathbb{U} \le \widetilde{C}(n+1)^{-s}, \ n \ge 0,\tag{3.24}$$

where $\widetilde{C}$ depends on $C, s, \gamma$.

(ii) For any β > 0, one has

$$d\_n(\mathcal{M})\_\mathbb{U} \le C e^{-cn^\beta}, \ n \ge 0 \implies \text{dist}\,(\mathcal{M}, \mathbb{U}\_n^{\text{wg}})\_\mathbb{U} \le \widetilde{C} e^{-\tilde{c}n^\beta}, \ n \ge 0,\tag{3.25}$$

where the constants $\tilde{c}, \widetilde{C}$ depend on $c, C, \beta, \gamma$.
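In an idealized finite-snapshot setting, the weak-greedy loop (3.23) can be sketched as follows. Here the "surrogate" is the exact projection error (so $\gamma = 1$); names and the discrete setup are illustrative, not from the text:

```python
import numpy as np

def weak_greedy(snapshots, n_max):
    """Greedy reduced-basis construction on a finite snapshot set
    (columns of `snapshots`), using exact projection errors as the
    selection criterion, i.e. gamma = 1 in (3.23)."""
    d = snapshots.shape[0]
    Q = np.zeros((d, 0))                    # orthonormal basis of U_n^wg
    picked, errs = [], []
    for _ in range(n_max):
        residual = snapshots - Q @ (Q.T @ snapshots)   # u - P_{U_n} u
        norms = np.linalg.norm(residual, axis=0)
        j = int(np.argmax(norms))           # greedy pick, cf. (3.23)
        if norms[j] == 0.0:
            break                           # snapshot set already captured
        picked.append(j)
        errs.append(float(norms[j]))
        q = residual[:, j] / norms[j]       # Gram-Schmidt basis update
        Q = np.hstack([Q, q[:, None]])
    return Q, picked, errs

rng = np.random.default_rng(0)
snaps = rng.standard_normal((50, 30))       # columns: snapshots u(y) on a grid
Q, picked, errs = weak_greedy(snaps, 5)
```

In practice, as the text explains next, the exact error is replaced by a computable surrogate $\mathcal{R}(y, \mathbb{U}_n)$, which is what turns this idealized loop into a feasible algorithm.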

In the form described above, the weak-greedy concept seems infeasible since it would, in principle, require computing the solution *u*(*y*) for all values of *y* ∈ Y exactly, exploring the whole exact solution manifold. However, its practical applicability is facilitated when there exists a *tight* surrogate *R*(*y*, U*n*), satisfying

$$c\_{\mathcal{R}}\,\mathcal{R}(\mathbf{y}, \mathbb{U}\_{n}) \le \|u(\mathbf{y}) - P\_{\mathbb{U}\_{n}}u(\mathbf{y})\|\_{\mathbb{U}} = \text{dist}\left(u(\mathbf{y}), \mathbb{U}\_{n}\right)\_{\mathbb{U}} \le C\_{\mathcal{R}}\,\mathcal{R}(\mathbf{y}, \mathbb{U}\_{n}), \quad \mathbf{y} \in \mathcal{Y},\tag{3.26}$$

for uniform constants $0 < c_{\mathcal{R}} \le C_{\mathcal{R}} < \infty$, which can be evaluated at affordable cost. Then, maximization of $\mathcal{R}(y, \mathbb{U}_n)$ over $\mathcal{Y}$ amounts to the weak-greedy step (3.23) with $\gamma := c_{\mathcal{R}}/C_{\mathcal{R}}$. According to [18], the validity of the following two conditions indeed allows one to derive computable surrogates that satisfy (3.26):

(i) the underlying variational formulation is uniformly stable with respect to $y \in \mathcal{Y}$, so that (2.5) holds with constants uniform in $y$;
(ii) the projection $\Pi_{\mathbb{U}_n}$ is a stable (Galerkin) projection onto $\mathbb{U}_n$.

Conditions (i) and (ii) ensure, in view of (2.5), that $\|u(y) - P_{\mathbb{U}_n} u(y)\|_{\mathbb{U}} \sim \|\mathcal{R}(y, \Pi_{\mathbb{U}_n} u(y))\|_{\mathbb{V}'}$ holds uniformly in $y \in \mathcal{Y}$. Thus,

$$\mathcal{R}(\mathbf{y}, \mathbb{U}\_n) := \|\mathcal{R}(\mathbf{y}, \Pi\_{\mathbb{U}\_n} u(\mathbf{y}))\|\_{\mathbb{V}'} = \sup\_{v \in \mathbb{V}} \frac{\mathcal{R}(\mathbf{y}, \Pi\_{\mathbb{U}\_n} u(\mathbf{y}))(v)}{\|v\|\_{\mathbb{V}}} \tag{3.27}$$

satisfies (3.26) and is therefore a tight surrogate for $\operatorname{dist}(\mathcal{M}, \mathbb{U}_n)_{\mathbb{U}}$. In the elliptic case (2.6) under assumption (2.7), (i) and (ii) hold, and the above comments reflect standard practice. For the wider scope of stable but *unsymmetric* variational formulations [6, 16, 23], the inf-sup conditions (2.4) imply (i), but the Galerkin projection in (ii) needs to be replaced by a stable *Petrov-Galerkin* projection with respect to suitable test spaces $\mathbb{V}_n$ accompanying the reduced trial spaces $\mathbb{U}_n$. It has been shown in [18] how to generate such test spaces with the aid of a *double-greedy* strategy, see also [16].

The main pay-off of using the surrogate $\mathcal{R}(y, \mathbb{U}_n)$ is that one no longer needs to compute $u(y)$ but only the low-dimensional projection $\Pi_{\mathbb{U}_n} u(y)$, by solving for each $y$ an $n \times n$ system, which itself can be rapidly assembled thanks to the affine parameter dependence [22]. However, one still faces the problem of its exact maximization over $y \in \mathcal{Y}$. A standard approach is to maximize instead over a *discrete* training set $\widetilde{\mathcal{Y}}_n \subset \mathcal{Y}$, which in turn induces a discretization of the solution manifold

$$
\widetilde{\mathcal{M}}\_n = \{ u(\mathbf{y}) \; : \; \mathbf{y} \in \widetilde{\mathcal{Y}}\_n \}. \tag{3.28}
$$

The resulting weak-greedy algorithm can be shown to remain rate-optimal in the sense of (3.24) and (3.25) if the discretization is fine enough so that $\widetilde{\mathcal{M}}_n$ constitutes an $\varepsilon_n$-approximation net of $\mathcal{M}$, where $\varepsilon_n$ does not exceed $c\,\operatorname{dist}(\mathcal{M}, \mathbb{U}_n^{\rm wg})_{\mathbb{U}}$ for a suitable constant $0 < c < 1$. In the current regime of *large or even infinite parameter dimensionality*, this becomes prohibitive, because $\#\widetilde{\mathcal{Y}}_n$ would then typically scale like $O\big(\varepsilon_n^{-c d_y}\big)$, [10].

As a remedy, it has been proposed in [10] to use training sets $\widetilde{\mathcal{Y}}_n$ generated by *randomly sampling* $\mathcal{Y}$, and to ask that the objective of rate optimality be met with high probability. This turns out to be achievable with training sets of much less prohibitive size. In an informal and simplified manner, the main result can be stated as follows.

**Theorem 1** *Given any target accuracy* $\varepsilon > 0$ *and some* $0 < \eta < 1$*, the weak-greedy reduced basis algorithm based on choosing at each step* $N = N(\varepsilon, \eta) \sim |\ln \eta| + |\ln \varepsilon|$ *randomly chosen training points in* $\mathcal{Y}$ *has the following properties with probability at least* $1 - \eta$*: it terminates with* $\operatorname{dist}(\mathcal{M}, \mathbb{U}_{n(\varepsilon)})_{\mathbb{U}} \le \varepsilon$ *as soon as the maximum of the surrogate over the current training set falls below* $c \varepsilon^{1+a}$ *for some* $c, a > 0$*. Moreover, if* $d_n(\mathcal{M})_{\mathbb{U}} \le C n^{-s}$*, then* $n(\varepsilon) \lesssim \varepsilon^{-\frac{1}{s} - b}$*. The constants* $c, a, b$ *depend on the constants in* (3.26)*, as well as on the rate* $r$ *of polynomial approximability of the parameter-to-solution map* $y \mapsto u(y)$*. The larger* $s$ *and* $r$*, the smaller* $a$ *and* $b$*, and the closer the performance becomes to the ideal one.*

#### **4 Nonlinear Models**

#### *4.1 Piecewise Affine Reduced Models*

As already noted, schemes based on linear or affine reduced models of the form $\mathcal{K}(\mathbb{U}_n, \varepsilon)$ cannot, in general, be expected to realize the benchmark (2.26) discussed earlier in Sect. 2. The convexity of the containment set $\mathcal{K}(\mathbb{U}_n, \varepsilon)$ may cause the reconstruction error to be significantly larger than $\delta_\varepsilon(\mathcal{M}, \mathbb{W})$. Another way of understanding this limitation: in order to make $\varepsilon$ small, one is forced to raise the dimension $n$ of $\mathbb{U}_n$, making the quantity $\mu(\mathbb{U}_n, \mathbb{W})$ larger, and eventually infinite if $n > m$.

To overcome this principal limitation, one needs to resort to *nonlinear* models that better capture the non-convex geometry of $\mathcal{M}$. One natural approach consists in replacing the single space $\mathbb{U}_n$ by a family $(\mathbb{U}^k)_{k=1,\ldots,K}$ of affine spaces

$$\mathbb{U}^{k} = \overline{\boldsymbol{u}}\_{k} + \widetilde{\mathbb{U}}^{k}, \quad \dim(\widetilde{\mathbb{U}}^{k}) = n\_{k} \le m,\tag{4.1}$$

each of which aims to approximate a *portion* $\mathcal{M}_k$ of $\mathcal{M}$ to a prescribed target accuracy while simultaneously controlling $\mu(\widetilde{\mathbb{U}}^k, \mathbb{W})$: fixing $\varepsilon > 0$, we assume that we have at hand a partition of $\mathcal{M}$ into portions

$$\mathcal{M} = \bigcup\_{k=1}^{K} \mathcal{M}\_k \tag{4.2}$$

such that

$$\text{dist}\,(\mathcal{M}\_k, \mathbb{U}^k)\_{\mathbb{U}} \le \varepsilon\_k, \quad \text{and} \quad \mu(\widetilde{\mathbb{U}}^k, \mathbb{W})\varepsilon\_k \le \varepsilon, \quad k = 1, \dots, K. \tag{4.3}$$

One way of obtaining such a partition is through a greedy splitting procedure of the parameter domain $\mathcal{Y} = [-1, 1]^{d_y}$, which is detailed in [9]. The procedure terminates when for each cell $\mathcal{Y}_k$ the corresponding portion $\mathcal{M}_k$ of the manifold can be associated to an affine space $\mathbb{U}^k$ satisfying these properties. We are ensured that this eventually occurs, since for a sufficiently fine cell $\mathcal{Y}_k$ one has $\operatorname{rad}(\mathcal{M}_k) \le \varepsilon$, which means that we could then use a zero-dimensional affine space $\mathbb{U}^k = \{\bar{u}_k\}$, for which we know that $\mu(\mathbb{U}^k, \mathbb{W}) = 1$. In this piecewise affine model, the containment property is now

$$\mathcal{M} \subset \bigcup\_{k=1}^{K} \mathcal{K}(\mathbb{U}^k, \varepsilon\_k), \tag{4.4}$$

and the cardinality *K* of the partition depends on the prescribed ε.

For a given measurement $w \in \mathbb{W}$, we may now compute the state estimates

$$
u\_k^\*(w) = A\_{\mathbb{U}^k}(w), \quad k = 1, \ldots, K,\tag{4.5}
$$

by the affine variant of the one-space method from (3.4). Since $u \in \mathcal{M}_{k_0}$ for some value $k_0$, we are ensured that

$$\|u - u\_{k\_0}^\*(w)\|\_{\mathbb{U}} \le \varepsilon,\tag{4.6}$$

for this particular choice. However *k*<sup>0</sup> is unknown to us and one has to rely on the data *w* in order to decide which one among the affine models is most appropriate for the recovery. One natural *model selection* criterion can be derived if for any *<sup>u</sup>* <sup>∈</sup> <sup>U</sup> we have at our disposal a computable surrogate *S*(*u*) that is equivalent to the distance from *u* to M, that is

$$c\mathcal{S}(\overline{\boldsymbol{u}}) \le \text{dist}\,(\overline{\boldsymbol{u}}, \mathcal{M})\_{\mathbb{U}} \le C\mathcal{S}(\overline{\boldsymbol{u}}), \quad \text{dist}\,(\overline{\boldsymbol{u}}, \mathcal{M})\_{\mathbb{U}} = \min\_{\mathbf{y} \in \mathcal{Y}} \|\overline{\boldsymbol{u}} - \boldsymbol{u}(\mathbf{y})\|\_{\mathbb{U}},\tag{4.7}$$

for some fixed $0 < c \le C$. We give an instance of such a computable surrogate in Sect. 4.2 below. The selection criterion then consists in picking $k^*$ minimizing this surrogate among the available state estimates, that is,

$$u^\*(\boldsymbol{w}) := \boldsymbol{u}\_{k^\*}^\*(\boldsymbol{w}), \quad k^\* = \operatorname\*{argmin}\_{k = 1, \ldots, K} \mathcal{S}(\boldsymbol{u}\_k^\*(\boldsymbol{w})).\tag{4.8}$$

The following result, established in [9], shows that this estimator now realizes the benchmark (2.26) up to a multiplication of $\varepsilon$ by $\kappa := C/c$, where $c, C$ are the constants from (4.7).

**Theorem 2** *Assume that* (4.2) *and* (4.3) *hold. For any* $u \in \mathcal{M}$*, if* $w = P_{\mathbb{W}} u$*, one has*

$$\|u - u^\*(\boldsymbol{w})\|\_{\mathbb{U}} \le \delta\_{\kappa\varepsilon}(\mathcal{M}, \mathbb{W}),\tag{4.9}$$

*where* δε(M,W) *is given by* (2.25)*.*
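The selection rule (4.8) itself is a one-line scan once a surrogate $\mathcal{S}$ is available. The following sketch uses a toy point cloud as a stand-in for $\mathcal{M}$; both the cloud and the candidate estimates are purely illustrative, while a genuine surrogate would be the residual-based (4.13):

```python
import numpy as np

def select_estimate(candidates, surrogate):
    """Model selection (4.8): among the estimates u_k^*(w), k = 1..K,
    return the one with the smallest manifold-distance surrogate S."""
    values = [surrogate(u) for u in candidates]
    k_star = int(np.argmin(values))
    return candidates[k_star], k_star

# Toy illustration: a finite point cloud stands in for the manifold M,
# and S measures the Euclidean distance to it.
cloud = np.array([[0.0, 0.0], [1.0, 1.0]])
S = lambda u: float(np.min(np.linalg.norm(cloud - u, axis=1)))
estimates = [np.array([0.9, 1.1]), np.array([5.0, 5.0])]
u_star, k_star = select_estimate(estimates, S)   # selects k_star = 0
```

The point of Theorem 2 is that this cheap post-selection inherits near-benchmark accuracy, degraded only by the surrogate equivalence constants through $\kappa = C/c$.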

#### *4.2 Approximate Metric Projection and Parameter Estimation*

A practically affordable realization of the surrogate $\mathcal{S}(\bar{u})$, providing a *near-metric projection distance* to $\mathcal{M}$, is a key ingredient of the above nonlinear recovery scheme. Since it has further useful implications, we add a few comments on that matter.

As already observed in (2.5), whenever (2.1) admits a stable variational formulation with respect to a suitable pair $(\mathbb{U}, \mathbb{V})$ of trial and test spaces, the distance of any $\bar{u} \in \mathbb{U}$ to any $u(y) \in \mathcal{M}$ is uniformly equivalent to the residual of the PDE in $\mathbb{V}'$:

$$c\,\|\mathcal{R}(\bar{u}, \mathbf{y})\|\_{\mathbb{V}'} \le \|u(\mathbf{y}) - \bar{u}\|\_{\mathbb{U}} \le C\,\|\mathcal{R}(\bar{u}, \mathbf{y})\|\_{\mathbb{V}'},\tag{4.10}$$

with $c = C_b^{-1}$, $C = c_b^{-1}$ from (2.5). Assume in addition that $\mathcal{R}(\bar{u}, y)$ depends *affinely* on $y \in \mathcal{Y}$, according to (2.9). Then, minimizing $\|\mathcal{R}(\bar{u}, y)\|_{\mathbb{V}'}$ over $y$ is equivalent to solving a *constrained least squares* problem

$$\bar{\mathbf{y}} = \operatorname\*{argmin}\_{\mathbf{y} \in \mathcal{Y}} \|\mathbf{g} - \mathbf{M}\mathbf{y}\|\_2,\tag{4.11}$$

where $\mathbf{M}$ is a matrix of size $d_y \times d_y$ resulting from Riesz-lifts of the functionals $\mathcal{R}_j(\bar{u})$.

The solution to this problem therefore satisfies

$$\|\bar{u} - u(\bar{\mathbf{y}})\|\_{\mathbb{U}} \le \kappa \inf\_{\mathbf{y} \in \mathcal{Y}} \|\bar{u} - u(\mathbf{y})\|\_{\mathbb{U}} = \kappa\, \text{dist}\,(\bar{u}, \mathcal{M})\_{\mathbb{U}},\tag{4.12}$$

where $\kappa = C/c = C_b/c_b$ is the quotient of the equivalence constants in (4.10). The surrogate

$$\mathcal{S}(\bar{u}) := \|\mathcal{R}(\bar{u}, \bar{\mathbf{y}})\|\_{\mathbb{V}'} \tag{4.13}$$

for the metric projection distance of $\bar{u}$ onto $\mathcal{M}$ obviously satisfies (4.7). It is indeed computable at affordable cost using (an approximation to) its Riesz-lifted version $\|e(\bar{u}, y)\|_{\mathbb{V}} = \|\mathcal{R}(\bar{u}, y)\|_{\mathbb{V}'}$ (computed in a discrete space $\mathbb{V}_h \subset \mathbb{V}$), assembled from the Riesz-lifts of the components $\mathcal{R}_j(\bar{u})$ in the affine expansion (2.9); see [9] for details.
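Problem (4.11) is a box-constrained least-squares problem over $\mathcal{Y} = [-1, 1]^{d_y}$. A minimal projected-gradient sketch, an illustrative stand-in rather than the solver used in [9]:

```python
import numpy as np

def box_lsq(M, g, iters=2000):
    """Projected gradient for y_bar = argmin_{y in [-1,1]^dy} ||g - M y||_2,
    a simple stand-in for the constrained least-squares problem (4.11)."""
    y = np.zeros(M.shape[1])
    step = 1.0 / np.linalg.norm(M, 2) ** 2          # 1/L with L = ||M^T M||_2
    for _ in range(iters):
        y = y - step * (M.T @ (M @ y - g))          # gradient of 0.5||g - My||^2
        y = np.clip(y, -1.0, 1.0)                   # project onto Y = [-1, 1]^dy
    return y

# Separable toy check: with M = 2I the constrained minimizer is clip(g/2, -1, 1).
y_bar = box_lsq(np.eye(3) * 2.0, np.array([1.0, 4.0, -4.0]))  # -> [0.5, 1, -1]
```

Each iterate stays in $\mathcal{Y}$, so the returned $\bar{y}$ is always an admissible parameter, which is exactly the property exploited for parameter estimation below.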

Since solving the above problem provides an admissible parameter value $y \in \mathcal{Y}$, this also has immediate bearing on *parameter estimation*. Suppose we wish to estimate from $w = P_{\mathbb{W}} u(y^*)$ the unknown parameter $y^* \in \mathcal{Y}$. Assume further that $A$ is any given linear or nonlinear recovery map. Computing along the above lines

$$\bar{\mathbf{y}}\_{w} = \operatorname\*{argmin}\_{\mathbf{y} \in \mathcal{Y}} \|\mathcal{R}(A(w), \mathbf{y})\|\_{\mathbb{V}'},$$

we have

$$\begin{aligned} \|u(\mathbf{y}^\*) - u(\bar{\mathbf{y}}\_w)\|\_{\mathbb{U}} &\le \|u(\mathbf{y}^\*) - A(\boldsymbol{w})\|\_{\mathbb{U}} + \|A(\boldsymbol{w}) - u(\bar{\mathbf{y}}\_w)\|\_{\mathbb{U}} \\ &\le E\_{\text{wc}}(A, \mathcal{M}, \mathbb{W}) + \kappa \operatorname{dist}(A(\boldsymbol{w}), \mathcal{M})\_{\mathbb{U}} \le (1 + \kappa) E\_{\text{wc}}(A, \mathcal{M}, \mathbb{W}). \end{aligned} \tag{4.14}$$

We consider now the specific elliptic model (2.6) with affine diffusion coefficients $a(y)$ given by (2.10). For this model, it was established in [5] that for strictly positive $f$ and certain regularity assumptions on the $a(y)$ as functions of $x \in \Omega$, parameters may be estimated by states. Specifically, when $a(y) \in H^1(\Omega)$ uniformly in $y \in \mathcal{Y}$, one has an inverse stability estimate of the form

$$\|a(\mathbf{y}) - a(\tilde{\mathbf{y}})\|\_{L\_2(\Omega)} \le C \|u(\mathbf{y}) - u(\tilde{\mathbf{y}})\|\_{\mathbb{U}}^{1/6}.\tag{4.15}$$

Thus, whenever the recovery map *A* satisfies (4.9) for some prescribed ε > 0, we obtain a parameter estimation bound of the form

$$\|a(\mathbf{y}^\*) - a(\bar{\mathbf{y}}\_w)\|\_{L\_2(\Omega)} \le C\, \delta\_{\kappa \varepsilon} (\mathcal{M}, \mathbb{W})^{1/6}.$$

Note that when the basis functions $\theta_j$ are $L_2$-orthogonal, $\|a(y^*) - a(\bar{y}_w)\|_{L_2(\Omega)}$ is equivalent to a (weighted) $\ell_2$ norm of $y^* - \bar{y}_w$.

#### *4.3 Concluding Remarks*

The affine or piecewise affine recovery scheme hinges on the ability to approximate a solution manifold effectively by linear or affine spaces, globally or locally. As explained earlier this is true for problems of elliptic or parabolic type that may include convective terms as long as they are dominated by diffusion. This may however no longer be the case when dealing with pure transport equations or models involving strongly dominating convection.

An interesting alternative would then be to adopt a stochastic model according to (2.27) and (2.28), which allows one to view the construction of the recovery map as a regression problem. In particular, when dealing with transport models, natural candidates for parametrizing a reduced model are *deep neural networks*. However, properly adapting the architecture, regularization, and training principles poses wide open questions addressed in current work in progress.

**Acknowledgements** A. C. was supported by ERC Adv Grant BREAD. W. D. was supported in part by the NSF-grants DMS 17-20297, DMS-2012469, by the Smart State Program of the State of South Carolina, and the Williams-Hedberg Foundation. R. D. was supported by the NSF grant DMS 18-17603. A portion of this work was completed when the authors were supported as visitors to the Isaac Newton Institute of Cambridge University.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Pattern Formation Inside Living Cells**

#### **Leah Edelstein-Keshet**

**Abstract** While most of our tissues appear static, in fact, cell motion comprises an important facet of all life forms, whether in single or multicellular organisms. Amoeboid cells navigate their environment seeking nutrients, whereas collectively, streams of cells move past and through evolving tissue in the development of complex organisms. Cell motion is powered by dynamic changes in the structural proteins (actin) that make up the cytoskeleton, and regulated by a circuit of signaling proteins (GTPases) that control the cytoskeleton growth, disassembly, and active contraction. Interesting mathematical questions we have explored include (1) How do GTPases spontaneously redistribute inside a cell? How does this determine the emergent polarization and directed motion of a cell? (2) How does feedback between actin and these regulatory proteins create dynamic spatial patterns (such as waves) in the cell? (3) How do properties of single cells scale up to cell populations and multicellular tissues given interactions (adhesive, mechanical) between cells? Here I survey mathematical models studied in my group to address such questions. We use reaction-diffusion systems to model GTPase spatiotemporal phenomena in both detailed and toy models (for analytic clarity). We simulate single and multiple cells to visualize model predictions and study emergent patterns of behavior. Finally, we work with experimental biologists to address data-driven questions about specific cell types and conditions.

#### **1 Introduction: Motile Cells and Their Inner Workings**

Many types of cells are endowed with the ability to move purposefully. As an example, neutrophils, shown in Fig. 1a, are white blood cells that make up part of our immune system, in charge of patrolling tissues for pathogens or sites of injury. The motion of unicellular organisms such as bacteria, while interesting in its own right, is governed by distinct mechanisms that will not be discussed here.

L. Edelstein-Keshet (B)

University of British Columbia, Vancouver, BC, Canada e-mail: keshet@math.ubc.ca

<sup>©</sup> The Author(s) 2022

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_5

**Fig. 1 Cell motility and cell polarization: from biology to mathematical model**: **a** A white blood cell (neutrophil) moving between red blood cells (disk-shaped objects) from a 1950s movie clip by David Rogers. The 1D band represents a transect of the cell from front to back. We are concerned with how the cell breaks symmetry and polarizes to define such a front-back axis. **b**, **c** Sketch of a cell in top-down **b** and side **c** views, indicating the same 1D axis. **d** In our mathematical model, we aim to explain how regulatory proteins in the cell (called GTPases) spontaneously polarize and form hot spots of activity that define the front and back of the cell. **e** In our abstract "wave-pinning" model, this same process is depicted as a 1D pattern-formation event, with a wave that stalls to produce a polarized distribution

In a movie dating to the 1950s, David Rogers (then at Vanderbilt University) captured the amoeboid movements of a neutrophil as it navigates between red blood cells (disk-shaped objects in Fig. 1a). In this movie, which can be seen on a popular YouTube site, we see a crawling cell with dynamic shape: a broad front that pushes outwards, and a thin tail that is pulled along as the cell moves. Figure 1b, c are two projections of cell shape (top down in (b) and side view in (c)) that we later utilize in modeling cell polarization.

It is worth pointing out the sizes and timescales that concern us here. In contrast to some papers (e.g. Prof. Marsha Berger's, whose work describes geological size scales and timescales of hours and days [1]), here we deal with the micro-world of cells, whose diameter is on the order of 10–30 µm. The time-scale of relevance is on the order of seconds. As summarized in Table 1, the process of cell polarization, which defines the front and back of the cell and specifies its direction of motion, takes place over seconds across the tiny cell diameter. Also noteworthy is the fact that the production of new copies of proteins (i.e. protein synthesis) does not suffice to explain how protein activity becomes concentrated at some parts of a cell: synthesis takes hours, while the response of a cell to polarizing stimuli is known to take only seconds for fast-moving cells like neutrophils.

Here the purpose is to explain an important first step in cell motility: the symmetry breaking that creates a front and a back in the cell (Fig. 1d), namely the polarization of the cell. But before embarking on the mathematics that describes this process, we first discuss the important cellular components that are involved.


**Table 1** Typical sizes and speeds of cells, and typical time-scales of protein synthesis and activation

Recall that 1 µm = 10<sup>−6</sup> m. *WBC* = white blood cell (neutrophil)

#### *1.1 Actin Powers Cell Motility*

Unlike plants and bacteria, animal cells have no tough outer cell wall. They are enclosed in a lipid membrane that envelopes the interior, which in turn includes the fluid cytosol and many organelles. Most organelles, including the cell's nucleus, are not directly involved in powering cellular motion.

Without some structural components, the cell would be essentially a bag of fluids. An internal "skeleton" (called the *cytoskeleton*) is formed by a meshwork of filamentous actin (F-actin), a dynamic biopolymer protein structure that is assembled at what becomes the cell front. The polymerization of actin leads to protrusion of the cell front [23]. Meanwhile, in association with the motor protein myosin, contraction of actomyosin leads to retraction of the rear portion of the cell [33], Fig. 2a.

Due to the abundance of actin monomers at excess concentration in every cell, actin assembly would be an explosive process were it not tightly controlled by many interacting regulatory cellular proteins. Many of those proteins, discovered and characterized experimentally over the last decades [27, 34], interact with actin to make it branch, to cut or cap its growing ends, to sequester or to recycle its monomeric subunits. Other proteins play the role of master-regulators that control the components of the cytoskeleton [30].

#### *1.2 GTPases Are Master Regulators*

One important class of proteins that regulate the cytoskeleton is the class of Rho GTPases, among which Rac and Rho are well known [3]. In the schematic Fig. 2, GTPases are shown to promote the assembly of filamentous actin, and the activity of myosin contraction. The GTPase Rac does the former, while the GTPase Rho enables the latter. Hence, if we can explain how Rac and Rho activities concentrate at one or another part of the cell, we can also explain the localizations of a front and rear cellular axis, and hence cell polarization. This then, is the main focus of our approach.

**Fig. 2 Schematic diagram of the cell's motility machinery**: **a** Actin filaments (F-actin), represented as blue curves, assemble at what becomes the cell front. Actin polymerization leads to protrusion at the front edge of the cell. In the cell rear, myosin motors (not shown) associate with F-actin to contract and pull up the "tail". Proteins in the class known as Rho GTPases are master regulators. These proteins control where and when actin assembly and myosin contraction take place. GTPases play an essential role in cell polarization. **b** Each GTPase has an active and an inactive state, modeled by the variables *u*, v. Only when bound to the cell membrane (shown in yellow) is the GTPase active. *A*, *I* denote rates of activation and inactivation

Interestingly, proteins in the family of Rho GTPases have a curious life-cycle. They occur in active and inactive forms, with only the active forms exerting the effects mentioned above [8]. Moreover, the active forms are always bound to the fatty membrane that forms the outer cell envelope (shown in yellow in Fig. 2). Hence, the small GTPases spend their cellular lives shuttling between the cell membrane (where part of their structure gets embedded when active) and the cell interior (where they are entirely inactive). This basic idea is illustrated in Fig. 2b. The GTPases act as cellular switches that are "ON" when active and "OFF" otherwise.

A natural question to ask is: what is the functional purpose of the GTPase cycling between the cell's membrane and the cell's interior? As we shall see, mathematics may have something to contribute towards answering such questions. A second question is: what property of the cellular machinery accounts for the spontaneous polarization of the cell? That is, how do GTPases redistribute so that their levels of activity differ between the front and rear of a cell [2]?

#### **2 Mathematical Models**

In our earliest works on cell polarization, we attempted to account for many known features of the GTPase activity and their crosstalk and interactions [6, 18, 20]. Such models were largely computational, as it was a challenge to analyse them mathematically. It was clear that more basic model variants would be useful for mathematical progress to be feasible.

As described in Mori et al. [24, 25], we simplify a very complicated cellular process to allow for mathematical tractability. We thereby hope to identify key elements

**Fig. 3 Model geometry**: The complicated cell geometry is simplified into a 1D domain (transect along the cell diameter) with active and inactive proteins distributed along that axis, but with distinct rates of diffusion, *D*<sub>u</sub> ≪ *D*<sub>v</sub>

that allow for spontaneous cell polarization. First, we consider just one GTPase (say Rac), rather than the entire network (Cdc42, Rac and Rho). We ask which biological attributes account for spontaneous symmetry breaking and polar pattern formation. To investigate this, we construct the following mathematical model.

We define *u*(*t*), v(*t*) to be the concentrations of the active and inactive forms of the GTPase. Then, based on the schematic diagram in Fig. 2b, it follows that

$$\frac{du}{dt} = Av - Iu, \quad \frac{dv}{dt} = -Av + Iu.$$

This is not yet enough, since spatial distribution is a vital aspect. Hence, we require a spatial variable, and need to account for the localization of each of *u*, v. To do so, we also need to define the geometry of interest.

As argued earlier, and noted in Fig. 1, to explain symmetry breaking for polarization, a 1D model along the front-back axis suffices. And while the detailed residence of the proteins on the membrane or cell interior is important, it proves helpful to simplify this too, in the steps shown in Fig. 3. In that figure, we first idealize the cell as a thin sheet of uniform thickness, surrounded top and bottom by a membrane (yellow outline). Zooming in on a small portion of the cell, we might see active (red) and inactive (black) copies of the GTPase associated with the membrane or the fluid cell interior. We homogenize these compartments, treating both *u* and v as dependent variables on a 1D spatial domain 0 ≤ *x* ≤ *L*, where *L* is the cell diameter. We do, however, take into account the very different rates of diffusion of a protein in the membrane (*D*<sub>u</sub> ≈ 0.01−0.1 µm<sup>2</sup>/s) versus the fluid cell interior (*D*<sub>v</sub> ≈ 10 µm<sup>2</sup>/s) [28]. As we shall see, this huge disparity in diffusion plays a significant role.

The model becomes

$$\frac{\partial u}{\partial t} = D\_u \frac{\partial^2 u}{\partial x^2} + Av - Iu,\tag{1a}$$

$$\frac{\partial v}{\partial t} = D\_v \frac{\partial^2 v}{\partial x^2} - Av + I u. \tag{1b}$$

In principle, the rates of activation and inactivation *A*, *I*, are not merely constant. If they were, then Eq. (1) would be linear in *u*, v, and would have fairly uninteresting steady state solutions. Some nonlinearity is essential, and this also requires feedback—something that can only depend on levels of active proteins. (Recall that the inactive GTPases do not participate in any interactions.) We have considered models where many other proteins influence each of the state transitions [14, 18, 21], and in that case, the model would expand in complexity,

$$\frac{\partial u\_1}{\partial t} = D\_u \frac{\partial^2 u\_1}{\partial x^2} + A(u\_1, u\_2, \dots) v\_1 - I(u\_1, u\_2, \dots) u\_1,\tag{2a}$$

$$\frac{\partial v\_1}{\partial t} = D\_v \frac{\partial^2 v\_1}{\partial x^2} - A(u\_1, u\_2, \dots) v\_1 + I(u\_1, u\_2, \dots) u\_1,\tag{2b}$$

$$\frac{\partial u\_2}{\partial t} = \dots \tag{2c}$$

Such examples, considered in the context of biological experiments, are briefly discussed further on, but mathematically, they are harder to analyze.

Our ultimate purpose, mathematically, is to strip away such complexity and focus on the most elementary example, where a single GTPase polarizes on its own. To do so, we considered the version

$$\frac{\partial u}{\partial t} = D\_u \frac{\partial^2 u}{\partial x^2} + A(u)v - I u,\tag{3a}$$

$$\frac{\partial v}{\partial t} = D\_v \frac{\partial^2 v}{\partial x^2} - A(u)v + I u,\tag{3b}$$

with feedback exclusively in the activation rate *A*(*u*) and a constant rate of inactivation *I*. This specific choice is somewhat arbitrary, as shown in [18], since it is possible to obtain essentially the same behaviour with nonlinearity introduced by assuming that *I* = *I*(*u*) with *A* constant, or by other variants where both *A* and *I* depend on *u*. The biological interpretation is somewhat different, since distinct proteins in cells play the role of activating (GEFs) and inactivating (GAPs) the GTPases. In the case of constant *I*, we can rescale time so that *I* = 1. Altogether, then, the single-GTPase system consists of the pair of PDEs

$$\frac{\partial u}{\partial t} = D\_u \frac{\partial^2 u}{\partial x^2} + f(u, v), \tag{4a}$$

$$\frac{\partial v}{\partial t} = D\_v \frac{\partial^2 v}{\partial x^2} - f(u, v), \tag{4b}$$

with

$$f(u,v) = \left(b + \gamma \frac{u^n}{1+u^n}\right)v - u,\tag{4c}$$

where *b* is the basal rate of activation and γ is an additional rate of activation depicting positive feedback from *u* to its own activation. The constant *n* ≥ 2 is the so-called "Hill coefficient". Larger values of *n* result in sharper switching between states.

We also assume Neumann boundary conditions, namely,

$$u\_x(0,t) = 0, \quad u\_x(L,t) = 0, \quad v\_x(0,t) = 0, \quad v\_x(L,t) = 0. \tag{4d}$$

This signifies that no material leaks out of the ends of the 1D domain, i.e. that the cell ends are sealed.

Notably, on the timescale of interest (a few seconds), no protein is made or lost; it is merely exchanged between the active and inactive states (see Table 1). This is captured by the model, since it is easy to see that the total amount of protein in the domain is conserved, that is,

$$\text{Mean total concentration} = \frac{1}{L} \int\_{0}^{L} (u(x, t) + v(x, t))\, dx = \text{constant}. \tag{5}$$
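As a numerical illustration, the full system (4a)–(4d) can be integrated with a simple explicit finite-difference scheme. The sketch below is ours, not the scheme of the cited papers; the parameter values, time step, and initial condition are illustrative choices. The no-flux boundaries are implemented with ghost points, which makes the trapezoid-rule version of the conserved quantity (5) constant up to round-off.

```python
# Minimal explicit finite-difference sketch of the wave-pinning system
# (4a)-(4d).  Parameter values and the initial condition are illustrative
# choices, not necessarily those used in the cited papers.
import numpy as np

def simulate(nx=100, L=10.0, T=50.0, dt=2.5e-4,
             Du=0.1, Dv=10.0, b=0.067, gamma=1.0, n=2):
    dx = L / (nx - 1)
    x = np.linspace(0.0, L, nx)
    u = np.where(x < L / 10, 2.0, 0.05)   # small active region at the left
    v = 2.0 * np.ones(nx)                 # uniform inactive pool

    def lap(w):  # 1D Laplacian with no-flux (Neumann) boundaries, Eq. (4d)
        out = np.empty_like(w)
        out[1:-1] = (w[2:] - 2.0 * w[1:-1] + w[:-2]) / dx**2
        out[0] = 2.0 * (w[1] - w[0]) / dx**2
        out[-1] = 2.0 * (w[-2] - w[-1]) / dx**2
        return out

    for _ in range(int(T / dt)):
        f = (b + gamma * u**n / (1.0 + u**n)) * v - u   # kinetics (4c), I = 1
        u = u + dt * (Du * lap(u) + f)
        v = v + dt * (Dv * lap(v) - f)
    return x, u, v

def mean_total(w, x):
    # trapezoid rule for (1/L) * integral of w over [0, L], as in Eq. (5)
    dx = x[1] - x[0]
    return dx * (w.sum() - 0.5 * (w[0] + w[-1])) / x[-1]

x, u, v = simulate()
# The activity front stalls inside the domain: u remains high near x = 0
# and low near x = L, while the mean total concentration (5) is conserved.
```

With these (illustrative) values, the initial bump launches a front that decelerates and pins in the interior of the domain, rather than sweeping across it or collapsing.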

As shown in [24, 25], the following properties are necessary and sufficient to ensure that a unimodal pattern (depicting a polarized distribution) will exist as a nonuniform steady state of the model:

1. For a range of values of v, the kinetics *f* (*u*, v) are bistable in *u*, with stable steady state levels *u*<sub>a</sub>(v) and *u*<sub>b</sub>(v).
2. The total amount of protein, Eq. (5), is conserved.
3. There is a level v = v∗ in that range at which the Maxwell condition holds, that is, at which the integral

$$\int\_{u\_a}^{u\_b} f(u,v) du$$

   vanishes.
4. The rates of diffusion of *u* and v are sufficiently different: *D*<sub>u</sub> ≪ *D*<sub>v</sub>.

It is interesting to contrast the system (4) with a related one consisting of (4a), (4c) and (4d) but with v ≡ constant, that is, with a single bistable reaction-diffusion equation in one variable, *u*. The latter is known to sustain traveling wave solutions, as

**Fig. 4 Travelling waves versus wave-pinning**: **a** A single reaction-diffusion equation (4a) (for constant v) with kinetics of type (4c) is known to sustain traveling wave solutions for *u*(*x*, *t*). **b** In contrast, the system of Eqs. (4a)–(4d) with conservation and distinct rates of diffusion (*D*<sub>u</sub> ≪ *D*<sub>v</sub>) results in waves that stop inside the domain, a phenomenon we termed "wave-pinning"

shown in Fig. 4a. In contrast, the two-variable system (4a)–(4d) leads to waves that decelerate and stop inside the domain (once the sign condition above is satisfied), as demonstrated in Fig. 4b. We refer to this behaviour as "wave-pinning". We see that Fig. 4a fails to explain polarization, because the entire cell diameter eventually becomes uniformly active. Figure 4b is consistent with polarization, since the two ends of the domain develop distinct levels of activity as time goes by. In this sense, wave-pinning is a simple caricature of cell polarization.

#### *2.1 How Wave-Pinning Works*

Full details of the analysis of such dynamics are described in [25]. Here it suffices to briefly mention the key asymptotic analysis ideas used in establishing the result.

The system (4) is rescaled to exploit the existence of a small parameter

$$
\epsilon^2 = \frac{D\_u}{rL^2},
$$

where *r* is a typical kinetics rate constant with units of 1/time (e.g., *r* = γ ). We then examine the short and intermediate time-scales of the rescaled system.
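Schematically (our sketch, consistent with the definitions above; see [25] for the precise treatment), measuring length in units of *L* and time in units of 1/*r* turns (4a)–(4b) into a slow-fast system:

```latex
\frac{\partial u}{\partial t} = \epsilon^2 \frac{\partial^2 u}{\partial x^2} + \hat{f}(u,v),
\qquad
\frac{\partial v}{\partial t} = D \frac{\partial^2 v}{\partial x^2} - \hat{f}(u,v),
\qquad
D = \frac{D_v}{r L^2} = O(1), \quad \hat{f} = \frac{f}{r},
```

so that *u* can develop sharp interfaces on an O(ε) spatial scale, while v equilibrates rapidly across the whole domain.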

On a short time-scale (*t*<sub>s</sub> = *t*/ε), it can be shown that, to leading order, at various sites in the domain, *u* approaches its steady state values *u*<sub>a</sub>, *u*<sub>b</sub>. This means that the domain is "carved up" into plateaus of high and of low activity levels of *u*, separated by transition layers.

To make progress, we consider the case of a single interface separating a low and a high plateau. Let the position of the interface be φ(*t*). We then seek the intermediate time scale behaviour. We construct an inner and an outer solution near the transition layer and show that, to leading order, the variable v is roughly spatially constant on the two sides of the interface, v ≈ *V*<sub>0</sub>(*t*), while it is depleted in time as *u* evolves.

**Fig. 5 Regimes of wave-pinning**: Wave-pinning, which represents cell polarization, depends on a balance between the total amount of GTPase (5) and the size of the small parameter ε<sup>2</sup> = *D*<sub>u</sub>/(*rL*<sup>2</sup>). If the total amount is too small, the wave of activity collapses, whereas if it is too large, the wave sweeps across the entire domain, and a homogeneous state results. Polarization can also be lost in several ways: (1) If the cell size decreases too much, and hence ε increases, the system leaves the polarization regime. (2) If cell size increases so that the mean total GTPase becomes too "diluted", polarization can also be lost. Image credit: Alexandra Jilkine

Using well-known analysis for wave-speed, we construct the speed of the wave, finding it to be described by a ratio of two integrals

$$\text{speed} = \frac{\int\_{u\_a}^{u\_b} f(u, v) du}{I\_2}.$$

Here *u*<sub>a</sub>, *u*<sub>b</sub> depend on *V*<sub>0</sub>(*t*), and *I*<sub>2</sub> is a strictly positive integral. We argue that the wave stops when the numerator vanishes, which is guaranteed to happen at some point by Condition 3, a Maxwell condition. Indeed, once v is depleted sufficiently, to the level v∗, the integral in the numerator vanishes. Details and discussion of the steps appear in [25]. Regimes of polarization are shown in Fig. 5.
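The form of this expression is the one familiar from classical bistable fronts. Sketching the standard argument: for a front profile *U*(ξ), ξ = *x* − *ct*, travelling with v frozen at *V*<sub>0</sub>, equation (4a) reduces to an ODE; multiplying it by *U*′ and integrating across the front gives

```latex
c \int_{-\infty}^{\infty} (U')^2 \, d\xi = \int_{u_a}^{u_b} f(u, V_0)\, du,
\qquad\text{so}\qquad
c = \frac{\int_{u_a}^{u_b} f(u, V_0)\, du}{I_2},
\quad
I_2 = \int_{-\infty}^{\infty} (U')^2 \, d\xi > 0,
```

since the diffusive term integrates to zero across the front. The sign of the numerator alone therefore decides whether the front advances, retreats, or stalls.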

Intuitively, the result can be explained as follows: at the transition zone, the high *u* plateau activates an adjoining site by virtue of local diffusion and positive feedback. The spread of *u*, however, is at the expense of the inactive form v, which gets depleted as the wave of activity spreads. Once v is sufficiently depleted, the spread of the activity wave can no longer be sustained. At that point, the wave freezes.

It is also interesting to note that the fast diffusion of v means that it acts as a "global messenger" in the sense that it rapidly stores domain-wide information about the level of activity in the cell. Hence, local activation (of *u* by itself) and global depletion (of v) synergize to produce the polarization of activity in the domain.

#### **3 Recent Work: Analysis, Simulation, and Contact with Experiments**

The wave-pinning equations are merely a prototype of the dynamics of a protein in the small GTPase family. Related systems with greater levels of biological detail have also been explored [12, 14, 21]. Indeed, insights by AFM Marée in [20] contributed to the understanding that led to the mathematical treatment of wave-pinning in [24, 25].

#### *3.1 Analysis of Slow-Fast Reaction Diffusion Systems: LPA*

While studying systems of reaction-diffusion equations (RDEs) for cell polarization, we have benefitted from a number of recent methods that provide shortcuts for quick diagnosis of pattern-formation regimes. Among these, the "Local Perturbation Analysis" (LPA) is a method to track local and global variables in RDEs using ODEs that approximate the fate of a small peak of activity (*u*<sub>L</sub>). This method was invented by AFM Marée and V Grieneisen [9, 36], and popularized in several papers [11, 12, 15]. It has helped us to identify approximate regimes where a nonuniform pattern could form by a finite perturbation of a spatially uniform state in a fast-slow reaction-diffusion system.

Figure 6 illustrates a typical LPA bifurcation result, and its interpretation. The method identifies the existence of a spatially uniform global branch (in black), and parameter regimes where this branch is stable (solid) or unstable (dot-dashed curve). Even when the global homogeneous steady state is stable, a polarized pattern can be established with a large enough stimulus. The local variable *u*<sub>L</sub> represents a thin local peak of active *u*. That peak could grow (and lead to a polar pattern) in the regime where the solid red curve is present. The LPA diagram demonstrates that a sufficiently large stimulus peak is needed, that its size has to exceed a threshold (dashed red curve), and that some parameter regimes allow for patterning in response to arbitrarily small stimuli (dot-dashed black curve). The latter regimes can be identified with Turing instabilities. The former regimes are not discoverable by the usual linear stability analysis (LSA) for Turing pattern formation, and detecting them is a helpful aspect of LPA that goes beyond LSA.
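The LPA ODEs for system (4) can be sketched as follows (our minimal version, with illustrative parameter values): in the limit *D*<sub>u</sub> → 0, *D*<sub>v</sub> → ∞, a narrow local peak *u*<sub>L</sub> evolves under the well-mixed global pool without feeding back on it.

```python
# Sketch of the Local Perturbation Analysis (LPA) ODEs for system (4):
# in the limit Du -> 0, Dv -> infinity, a narrow peak of activity u_L
# evolves under the global inactive pool v_G, which it is too thin to
# deplete.  Parameter values are illustrative, not those of the papers.
def f(u, v, b=0.067, gamma=1.0, n=2):
    # kinetics (4c) with I = 1
    return (b + gamma * u**n / (1 + u**n)) * v - u

def lpa(total=2.3, u_peak0=1.5, T=100.0, dt=1e-3):
    uG, vG = 0.05, total - 0.05    # well-mixed global state
    uL = u_peak0                   # local perturbation (narrow peak)
    for _ in range(int(T / dt)):
        duG = f(uG, vG)
        uG += dt * duG
        vG -= dt * duG             # global mass is conserved
        uL += dt * f(uL, vG)       # the peak sees only the global v
    return uG, uL

uG, uL = lpa()
# If uL settles on a higher branch than uG, a finite local perturbation
# can grow, i.e. the parameters lie in a patterning-capable regime.
```

Scanning a model parameter (e.g. *b*) while recording the branches on which *u*<sub>G</sub> and *u*<sub>L</sub> settle reproduces the kind of LPA diagram shown in Fig. 6a.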

In our experience, solving the full PDEs with insights gained from LPA diagrams makes it easier to identify the interesting parameter regimes. Details of the method and its uses have been extensively described in [15]. Other useful shortcuts have included "sharp-switch" approximations (Hill functions replaced by piecewise constant functions), as in [12], and the analysis of plateaus described in [36]. None of these replace the need for simulating the PDEs, but all of them help to build familiarity with the possible behaviours of the reaction-diffusion systems we have investigated. Most recently, Andreas Buttenschön has created full numerical bifurcation software for PDEs that permits much greater accuracy in tracking solution branches

**Fig. 6 Methods of analysis and simulations**: **a** Local perturbation analysis (LPA), a shortcut bifurcation method has helped to detect regimes of patterning in slow-fast reaction-diffusion systems. Here we show an example of how the basal activation rate *b* influences potential regimes of wave-pinning and of Turing-type instability. See text and references [11, 12, 15] for details. **b** A number of methods have been used to simulate polarization in 2D deforming domains representing the "top-down" view of a cell (as in Fig. 1b). From top to bottom: A cellular-Potts model simulation by A. F. M. Mareé of a 2D deforming cell with an internal reaction-diffusion signaling circuit (and an implicit reaction-diffusion solver) that includes GTPases, interacting lipids, actin, and other components [21], the wave-pinning system (4) solved in an immersed-boundary method simulation by Ben Vanderlei [35], by the level set and moving boundary node method by Zajac [7], and using CompuCell3D by undergraduate summer research student Zachary Pellegrin

[4]. The software builds on state-of-the-art well-conditioned collocation techniques to discretize functions and their operators. Solution branches are continued using a matrix-free Newton-Gauss method, for which rigorous convergence estimates are available.

#### *3.2 Simulating the PDEs in Dynamic Cell-Shaped Domains*

So far, the analytical results described pertain to 1D domains that represent a cell transect. It is instructive to ask how the same systems behave in domains whose shape more closely relates to that of cells, and in particular, where the internal chemistry affects (and is affected by) the deforming cell. Based on the fact that cell fragments (radius ≈ 5–10 µm) without a nucleus, and with overall uniform thickness (≈0.2 µm), are capable of motility, we take the liberty of reducing cell shape to its two-dimensional "top-down" projection shown in Fig. 1b, d. We solve the governing equations (4), or more detailed versions, in the 2D domain, and assume that the boundary of the domain is influenced by the local chemical activity level. For example, if *u* represents the level of activity of the GTPase Rac, it causes the boundary to be pushed outwards (via F-actin assembly), whereas Rho has the opposite effect (activating contraction via myosin).

A number of results obtained over the years by group members are illustrated in Fig. 6b. In general, we found that the system that is simplest to understand analytically, (4), is not as robust computationally as other variants. Cross-talk between GTPases results in larger parameter regimes for polarization. As an example, models consisting of four PDEs that describe the mutual antagonism between Rac and Rho [12] lead to greater robustness in 2D computations. An even more detailed variant, which includes several GTPases (Rac, Rho, Cdc42), as well as their effects on actin assembly and myosin contraction, was capable of realistic behaviour such as directed motility (chemotaxis) [20]. The addition of a layer of signaling lipids (phosphoinositides) also permitted a simulated cell to rapidly select one front despite conflicting or competing stimuli [21].

Simulating the reaction-diffusion systems for GTPase signaling in deforming domains also reveals that evolving domain shape and level curves of the chemical system influence one another: the zero-flux boundary conditions impose constraints on the level curves that also accelerate the dynamics of the chemical redistribution when the domain deforms. Such findings were discussed in detail in [21].

For practical reasons, it is harder to simulate the same systems in 3D. However, recent work by the group of Anotida Madzvamuse [5] has extended these results to a coupled bulk-surface wave-pinning computation in a 3D cell-shaped static domain.

#### *3.3 Contact with Biological Experiments*

While details are beyond the scope of this summary, it is worth noting several directions in which the mathematical modeling has contributed to understanding of experimental cell biology.

William Bement (U Wisconsin) studies the patterns of GTPases (Rho and Cdc42) that form spontaneously around sites of laser-inflicted wounds in frog eggs (Xenopus oocytes). The connectivity of these GTPases, and their crosstalk with proteins that activate or inactivate them (e.g. Abr) has been modeled by group members, including Cory Simon, Laura Liao, and William R Holmes. Combining models with experiments has helped to build an understanding of the biology [12, 13, 32].

The polarization of HeLa cells exposed to gradients that stimulate a graded response by the GTPase Rac was studied experimentally by Benjamin Lin, in the lab of Andre Levchenko [19]. A model for Cdc42, Rac, and Rho, interacting with one another and with the phosphoinositides PIP, PIP2 and PIP3, explained the timing and strength of the response, and predicted results of experimental manipulations that affect parts of the crosstalk [14, 19].

Experiments have been carried out on melanoma cells grown on microfabricated surfaces that mimic the natural environment of cells ("extracellular matrix"). JinSeok Park, of the Levchenko Lab at Yale University found three typical motility phenotypes, including persistently polarized, random, and oscillatory front-back cycling,

**Fig. 7 Extensions of the minimal model**: **a** The simplest basic wave-pinning model of Eq. (4) can produce a polarized pattern. **b** When the GTPase promotes assembly of F-actin, which then promotes GTPase inactivation, waves and other exotic dynamics can be observed, provided the negative feedback is on a slow time-scale [10, 22]. In **a**, **b** time increases along the vertical axis and space is on the horizontal axis. **c** Some GTPases cause the cell to spread (Rac) or to shrink (Rho), affecting cell tension. If the tension also affects GTPase activity, interesting dynamics are observed. Shown is a time sequence (left to right) of a "tissue" composed of ≈370 cells, colour coded by their internal GTPase activity. The cell size is correlated to that activity, as described in [37]

depending on levels of adhesion to the substrate, and manipulations that affect activities of the GTPases or their downstream targets. We were able to account for the observed phenotypes by a model for Rac-Rho mutual antagonism, weighted by signals from the extracellular matrix substrate [16, 26, 29].

#### **4 Extending the Minimal Model**

The wave-pinning model has been used as a nucleus from which we have expanded to larger circuits and greater levels of biological detail. We showed that some properties of the system (4) are shared by a circuit of the mutually antagonistic GTPases Rac and Rho [12]. A notable common feature is the existence of parameter regimes in which several states coexist. These include states of uniformly low activity, uniformly high activity, or polarized levels of activity. Which of these develops then depends on initial conditions. A recent contribution [38] extends these findings to more general model variants.

A hallmark of the kinetics we described above is the presence of bistability in some parameter regimes, i.e. the existence of two stable steady states separated by an unstable one. Such systems also display hysteresis, a kind of history-dependence: slowly increasing a parameter results in the sudden appearance of a new steady state at some transition point, but to reverse the process, the same parameter has to be decreased well beyond that transition point. The addition of feedback from a third dynamic variable in such cases is known to produce the possibility of oscillations.
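A well-mixed caricature makes the hysteresis explicit. In this sketch (ours, with illustrative parameter values), v is treated as a slowly tuned control parameter in the kinetics (4c), and the steady state of *u* is tracked while v is swept up and then back down:

```python
# Sketch of hysteresis in the kinetics (4c), treating v as a control
# parameter in a well-mixed (non-spatial) caricature.  Parameter values
# are illustrative choices, not those of the cited papers.
import numpy as np

def f(u, v, b=0.067, gamma=1.0, n=2):
    return (b + gamma * u**n / (1 + u**n)) * v - u

def settle(u, v, T=100.0, dt=0.01):
    # relax du/dt = f(u, v) to its nearby stable steady state
    for _ in range(int(T / dt)):
        u += dt * f(u, v)
    return u

vs = np.linspace(1.5, 2.2, 36)
up = []                 # sweep v upward, starting on the low branch
u = 0.05
for v in vs:
    u = settle(u, v)
    up.append(u)
down = []               # sweep v back down, starting on the high branch
for v in vs[::-1]:
    u = settle(u, v)
    down.append(u)
down = down[::-1]
# On the upward sweep u stays low until the low branch disappears at a
# fold; on the downward sweep it stays high until a *different* (lower)
# fold, so intermediate v values carry two distinct steady states.
```

Comparing `up` and `down` at an intermediate value of v exhibits the two coexisting branches that make history-dependence possible.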

We examined several cases of this type, motivated by biological observations. In one case, we studied feedback from F-actin to the inactivation of a GTPase, as observed, for example, in [31]. Assuming slow negative feedback from F-actin (to the inactivation of the GTPase), as shown in Fig. 7b, leads to interesting dynamics of traveling waves and pulses in the domain [10, 22]. Feedback between the Rac-Rho circuit and the extracellular matrix also results in oscillations, as previously described [16]. More recently, we also modeled the interplay between mechanical tension in the cell and the activity of GTPases, as observed experimentally in [17]. Here we assumed that GTPases such as Rho and Rac can affect cell spreading, which changes the tension on the cell and feeds back to the activation of the GTPase. A typical circuit of this type is shown in Fig. 7c. As expected, such negative feedback is also consistent with regimes of oscillatory dynamics in individual cells, as demonstrated in [37]. Moreover, when cells with such behaviour are coupled to one another in 1D or in 2D (simulations in Fig. 7c), one observes waves of chemical activity coupled to cell-size changes as the "model tissue" undergoes the resulting spatio-temporal dynamics.

#### **5 Discussion**

Cell biology presents an unlimited source of inspiring problems. The links between mathematics and cell biology are relatively recent, and not yet fully recognized. But the need for quantitative methods, computational platforms, and mathematical analysis of cellular phenomena promises to grow with time, presenting many opportunities for young applied mathematicians looking for problems to study.

Here I have mainly described a toy model that we constructed to help us understand cell polarization. The simplicity of the model made it mathematically tractable. Its analysis reveals several insights that were not a priori evident. First, with the right kind of positive feedback, we showed that a single GTPase could, on its own, lead to spontaneous polarization that explains cell directionality. In other words, it is not essential to have networks of such proteins to achieve this cellular process. Second, there is a functional purpose for the curious biology of GTPases: their cycling between membrane and cytosol is not a mere evolutionary artifact. We argue that this transition sets up the differences in diffusion between active and inactive GTPases—a difference that is crucial for polarization to be possible, according to our mathematical model.

The motivation of cell polarity led us to mathematics with a surprising twist, uncovering the phenomenon of decelerating waves and wave-pinning that was not widely recognized before in the literature on reaction-diffusion systems. From this standpoint, we could argue that biology inspires new mathematics. The efforts to understand the models so developed also resulted in a variety of methods that ease the analysis, among them LPA. Extensions of the basic wave-pinning model led to variants with more exotic patterns and waves. These were investigated in various geometries, in single cells, and finally, in interacting groups of cells to identify causes for cell size fluctuations in a tissue and for a variety of emergent phenomena in single and collective cell motility. Finally, developing simple theoretical models and, in parallel, considering biologically-inspired detailed models are not mutually exclusive. Our experience with the former helps us with the latter, and vice versa.

Many still-unanswered questions can be posed. Among these are some of the following: How does the internal GTPase state of a cell affect the outcome of interactions between cells, and how does contact between cells change their GTPase state? What are reasonable ways to model such cell-cell interactions leading to cell adhesion or cell separation? How is cell state coordinated in a multicellular tissue? What aspects of cell adhesion, mechanics, deformation, chemical secretion, and environmental topography (to name a few) affect and are affected by GTPase activities, and how should these be modelled? What methods of analysis can we develop to help with larger, more realistic models that have many interacting components? What aspects of 3D cell shape, and of cell motion in a 3D matrix lead to new phenomena, and what numerical methods should be developed to address such behaviours? Is there a compromise between large-scale computations and mathematical analysis in these more challenging scenarios? In conclusion, the motility and interactions of cells is a rich scientific area calling for investigation by applied mathematicians. Pattern formation inside living cells is merely one facet, while many other fundamental challenges are at hand.

**Acknowledgements** LEK gratefully acknowledges the contributions of many group members to this research over the years. Among these, special thanks go to A. F. M. Mareé, Y. Mori, W. R. Holmes, A. Jilkine, A. T. Dawes, C. Zmurchok, A. Buttenschön and E. G. Rens. LEK is supported by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Private AI: Machine Learning on Encrypted Data**

**Kristin Lauter**

**Abstract** This paper gives an overview of my Invited Plenary Lecture at the International Congress of Industrial and Applied Mathematics (ICIAM) in Valencia in July 2019.

#### **1 Motivation: Privacy in Artificial Intelligence**

These days more and more people are taking advantage of cloud-based artificial intelligence (AI) services on their smart phones to get useful predictions such as weather, directions, or nearby restaurant recommendations based on their location and other personal information and preferences. The AI revolution that we are experiencing in the high tech industry is based on the following value proposition: you input your private data and agree to share it with the cloud service in exchange for some useful prediction or recommendation. In some cases the data may contain extremely personal information, such as your sequenced genome, your health record, or your minute-to-minute location.

This quid pro quo may lead to the unwanted disclosure of sensitive information or an invasion of privacy. Examples during the year of ICIAM 2019 include the case of the Strava fitness app, which revealed the location of U.S. army bases world-wide, or the case of the city of Los Angeles suing IBM's weather company over deceptive use of location data. It is hard to quantify the potential harm from loss of privacy, but employment discrimination or loss of employment due to a confidential health or genomic condition are potential undesirable outcomes. Corporations also have a need to protect their confidential customer and operations data while storing, using, and analyzing it.

To protect privacy, one option is to lock down personal information by encrypting it before uploading it to the cloud. However, traditional encryption schemes do not allow for any computation to be done on encrypted data. In order to make useful

K. Lauter (B)

Cryptography and Privacy Research, Microsoft Research, Redmond, USA e-mail: klauter@microsoft.com

<sup>©</sup> The Author(s) 2022

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_6

predictions, we need a new kind of encryption which maintains the structure of the data when encrypting it so that meaningful computation is possible. Homomorphic encryption allows us to switch the order of encryption and computation: we get the same result if we first encrypt and then compute, as if we first compute and then encrypt.
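To see this commuting property in the simplest possible algebraic setting, consider textbook RSA, which happens to be multiplicatively homomorphic. This toy sketch is ours, purely illustrative: textbook RSA is insecure and is not one of the homomorphic encryption schemes discussed in this paper, which are based on lattice problems.

```python
# Toy illustration of "encrypt then compute" vs "compute then encrypt",
# using textbook RSA, which is multiplicatively homomorphic.  This is
# NOT a real homomorphic encryption scheme and is insecure; it only
# illustrates the algebra of computing on ciphertexts.

p, q = 61, 53                       # toy primes; real keys use thousands of bits
n = p * q                           # public modulus
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 12
# Multiplying ciphertexts equals encrypting the product:
#   Enc(a) * Enc(b) = a^e * b^e = (a*b)^e = Enc(a*b)  (mod n)
assert enc(a * b % n) == enc(a) * enc(b) % n
# So the cloud can multiply encrypted values, and only the key holder
# decrypts the result:
assert dec(enc(a) * enc(b) % n) == a * b
```

Real homomorphic encryption schemes support both additions and multiplications on encrypted data, with randomized, semantically secure ciphertexts; the toy example above has neither property, but it shows how a computation can be carried out without ever decrypting the inputs.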

The first solution for a homomorphic encryption scheme which can process any circuit was proposed in 2009 by Gentry [21]. Since then, many researchers in cryptography have worked hard to find schemes which are both practical and also based on well-known hard math problems. In 2011, my team at Microsoft Research collaborated on the homomorphic encryption schemes [8, 9] and many practical applications and improvements [30], which are now widely used in applications of Homomorphic Encryption. Then in 2016, we had a surprise breakthrough at Microsoft Research with the now widely cited CryptoNets paper [22], which demonstrated for the first time that evaluation of neural network predictions was possible on encrypted data.

Thus began our Private AI project, the topic of my Invited Plenary Lecture at the International Congress of Industrial and Applied Mathematics in Valencia in July 2019. Private AI refers to our Homomorphic Encryption-based tools for protecting the privacy of enterprise, customer, or patient data, while doing Machine Learning (ML)-based AI, both learning classification models and making valuable predictions based on such models.

You may ask, "What is Privacy?" Preserving "Privacy" can mean different things to different people or parties. Researchers in many fields including social science and computer science have formulated and discussed definitions of privacy. My favorite definition of privacy is: a person or party should be able to control how and when their data is used or disclosed. This is exactly what Homomorphic Encryption enables.

#### *1.1 Real-World Applications*

In 2019, the British Royal Society released a report, *Protecting privacy in practice: Privacy Enhancing Technologies in data analysis*. The report covers Homomorphic Encryption (HE) and Secure Multi-Party Computation (MPC), but also technologies not built with cryptography, including Differential Privacy (DP) and secure hardware hybrid solutions. Our homomorphic encryption project was featured as a way to protect "Privacy as a human right" at the Microsoft Build world-wide developers conference in 2018 [39]. Private AI forms one of the pillars of Responsible ML in our collection of Responsible AI research, and Private Prediction notebooks were released in Azure ML at Build 2020.

Over the last 8 years, my team has created demos of Private AI in action, running private analytics services in the Azure cloud. I showed a few of these demos in my talk at ICIAM in Valencia. Our applications include an encrypted fitness app: a cloud service that processes all your workout, fitness, and location data in encrypted form, and displays your summary statistics on your phone after the results of the analysis are decrypted locally. Another application is an encrypted weather prediction app, which takes your encrypted zip-code and returns encrypted versions of the weather at your location, to be decrypted and displayed on your phone. The cloud service never learns your location or what weather data was returned to you. Finally, I showed a private medical diagnosis application, which uploads an encrypted version of your chest X-ray image; the medical condition is diagnosed by running image recognition algorithms on the encrypted image in the cloud, and the result is returned in encrypted form to the doctor.

Over the years, my team<sup>1</sup> has developed other Private AI applications, enabling private predictions such as sentiment analysis in text, cat/dog image classification, heart attack risk based on personal health data, neural net image recognition of hand-written digits, flowering time based on the genome of a flower, and pneumonia mortality risk using intelligible models. All of these operate on encrypted data in the cloud to make predictions, and return encrypted results in a matter of fractions of a second.

Many of these demos and applications have been inspired by collaborations with researchers in Medicine, Genomics, Bioinformatics, and Machine Learning. We have worked together with finance experts and pharmaceutical companies to demonstrate a range of ML algorithms operating on encrypted data. The UK Financial Conduct Authority (FCA) ran an international Hackathon in August 2019 to combat money laundering with encryption technologies by allowing banks to share confidential information with each other. Since 2015, the annual iDASH competition has attracted teams from around the world to submit solutions to the Secure Genome Analysis Competition. Participants include researchers at companies such as Microsoft and IBM, start-up companies, and academics from the U.S., Korea, Japan, Switzerland, Germany, France, etc. The results provide benchmarks for the medical research community of the performance of encryption tools for preserving privacy of health and genomic data.

#### **2 What Is Homomorphic Encryption?**

I could say, "Homomorphic Encryption is encryption which is homomorphic." But that is not very helpful without further explanation. Encryption is one of the building blocks of cryptography: encryption protects the confidentiality of information. In mathematical language, encryption is just a map which transforms plaintexts (unencrypted data) into ciphertexts (encrypted data), according to some recipe. Examples of encryption include block ciphers, which take sequences of bits and process them in blocks, passing each block through an S-box which scrambles it, and iterating that process many times. A more mathematical example is RSA encryption, which raises

<sup>1</sup> My collaborators on the SEAL team include: Kim Laine, Hao Chen, Radames Cruz, Wei Dai, Ran Gilad-Bachrach, Yongsoo Song, Shabnam Erfani, Sreekanth Kannepalli, Jeremy Tieman, Tarun Singh, Hamed Khanpour, Steven Chith, James French, with substantial contributions from interns Gizem Cetin, Kyoohyung Han, Zhicong Huang, Amir Jalali, Rachel Player, Peter Rindal, Yuhou Xia as well.

**Fig. 1** Homomorphic encryption

a message to a certain power modulo a large integer *N*, whose prime factorization is secret, *N* = *p* · *q*, where *p* and *q* are large primes of equal size with certain properties.

A map which is *homomorphic* preserves the structure, in the sense that an operation on plaintexts should correspond to an operation on ciphertexts. In practice that means that switching the order of operations preserves the outcome after decryption: i.e. *encrypt-then-compute* and *compute-then-encrypt* give the same answer. This property is described by the following diagram:

Starting with two pieces of data, *a* and *b*, the functional outcome should be the same when following the arrows in either direction, across and then down (*compute-then-encrypt*), or down and then across (*encrypt-then-compute*): $E(a + b) = E(a) + E(b)$. If this diagram holds for two operations, addition and multiplication, then any circuit of AND and OR gates can be evaluated under the encryption map $E$. It is important to note that homomorphic encryption solutions provide for randomized encryption, which is an important property to protect against so-called dictionary attacks. This means that new randomness is used each time a value is encrypted, and it should not be computationally feasible to detect whether two ciphertexts are the encryption of the same plaintext or not. Thus the ciphertexts in the bottom right corner of the diagram need to be decrypted in order to detect whether they are equal.

The above description gives a mathematical explanation of homomorphic encryption by defining its properties. To return to the motivation of Private AI, another way to describe homomorphic encryption is to explain the functionality that it enables. Figure 2 shows Homer-morphic encryption, where Homer Simpson is a jeweler tasked with making jewelry given some valuable gold. Here the gold represents some private data, and making jewelry is analogous to analyzing the data by applying some AI model. Instead of accessing the gold directly, the gold remains in a locked box, and the owner keeps the key to unlock the box. Homer can only handle the gold through gloves inserted in the box (analogous to handling only encrypted data). When Homer completes his work, the locked box is returned to the owner who unlocks the box to retrieve the jewelry.


To connect to Fig. 1 above, outsourcing sensitive work to an untrusted jeweler (cloud) is like following the arrows down, across, and then up. First the data owner encrypts the data and uploads it to the cloud, then the cloud operates on the encrypted data, then the cloud returns the output to the data owner to decrypt.

#### *2.1 History*

Almost 5 decades ago, we already had an example of encryption which is homomorphic for one operation: the RSA encryption scheme [36]. A message $m$ is encrypted by raising it to the power $e$ modulo $N$ for fixed integers $e$ and $N$. Thus the product of the encryptions of two messages $m\_1$ and $m\_2$ is $m\_1^e m\_2^e = (m\_1 m\_2)^e$, the encryption of their product. It was an open problem for more than thirty years to find an encryption scheme which was homomorphic with respect to two (ring) operations, allowing for the evaluation of any circuit. Boneh-Goh-Nissim [3] proposed a scheme allowing for unlimited additions and one multiplication, using the group of points on an elliptic curve over a finite field, along with the Weil pairing map to the multiplicative group of a finite field.
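The multiplicative homomorphism of RSA is easy to check numerically. A toy sketch with insecure textbook-RSA parameters (tiny primes, no padding; the values are purely illustrative):

```python
# Toy illustration (insecure parameters) of RSA's multiplicative
# homomorphism: E(m1) * E(m2) = E(m1 * m2) mod N.
p, q = 61, 53          # small primes, for illustration only
N = p * q              # public modulus
e = 17                 # public exponent, coprime to (p-1)*(q-1)

def encrypt(m):
    return pow(m, e, N)

m1, m2 = 7, 12
# Multiplying two ciphertexts yields the encryption of the product.
assert (encrypt(m1) * encrypt(m2)) % N == encrypt((m1 * m2) % N)
```

The identity holds for any exponent and modulus, since $(m\_1 m\_2)^e = m\_1^e m\_2^e \bmod N$; note that this homomorphism is for multiplication only, not addition.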

In 2009, Gentry proposed the first fully homomorphic encryption scheme, allowing in theory for evaluation of arbitrary circuits on encrypted data. However it took several years before researchers found schemes which were implementable, relatively practical, and based on known hard mathematical problems. Today all the major homomorphic encryption libraries world-wide implement schemes based on the hardness of lattice problems. A lattice can be thought of as a discrete analogue of a linear subspace of Euclidean space: the set of all integer linear combinations of a collection of basis vectors, with the operations of vector addition, integer scalar multiplication, and inner product; its dimension, *n*, is the number of basis vectors.

#### *2.2 Lattice-Based Solutions*

The high-level idea behind current solutions for homomorphic encryption is as follows. Building on an old and fundamental method of encryption, each message is *blinded*, by adding a random inner product to it: the inner product of a secret vector with a randomly generated vector. Historically, blinding a message with fresh randomness was the idea behind encryption via *one-time pads*, but those did not satisfy the homomorphic property. Taking inner products of vectors is a linear operation, but if homomorphic encryption involved only addition of the inner product, it would be easy to break using linear algebra. Instead, the encryption must also add some freshly generated noise to each blinded message, making it difficult to separate the noise from the secret inner product. The noise, or *error*, is selected from a fairly narrow Gaussian distribution. Thus the hard problem to solve becomes a noisy decoding problem in a linear space, essentially Bounded Distance Decoding (BDD) or a Closest Vector Problem (CVP) in a lattice. Decryption is possible with the secret key, because the decryptor can subtract the secret inner product and then the noise is small and is easy to cancel.
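The blinding idea above can be sketched for a single bit, in the style of LWE-based encryption. This is a toy, insecure illustration under stated assumptions: tiny parameters, and a small uniform error standing in for the narrow Gaussian; the function names are hypothetical.

```python
import random

# Sketch of "blind with a secret inner product, then add noise".
# Toy-sized, insecure parameters for illustration only.
n, q = 64, 2**15
random.seed(0)
s = [random.randrange(q) for _ in range(n)]        # secret vector

def encrypt_bit(m):
    a = [random.randrange(q) for _ in range(n)]    # fresh random vector
    e = random.randint(-4, 4)                      # small noise (toy stand-in
                                                   # for the Gaussian error)
    # Blind the scaled message m * q/2 with the inner product <a, s>.
    b = (sum(ai * si for ai, si in zip(a, s)) + e + m * (q // 2)) % q
    return (a, b)

def decrypt_bit(ct):
    a, b = ct
    # Subtract the secret inner product; what remains is noise + m * q/2.
    d = (b - sum(ai * si for ai, si in zip(a, s))) % q
    return 1 if q // 4 < d < 3 * q // 4 else 0

assert decrypt_bit(encrypt_bit(0)) == 0
assert decrypt_bit(encrypt_bit(1)) == 1
```

Without the noise `e`, the map would be linear in the secret and breakable by Gaussian elimination; the noise is exactly what turns decryption-without-the-key into a bounded distance decoding problem.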

Although the above high-level description was formulated in terms of lattices, in fact the structure that we use in practice is a polynomial ring. A vector in a lattice of *n* dimensions can be thought of as a polynomial of degree less than *n*, where the coordinates of the vector are the coefficients of the polynomial. Any number ring is given as a quotient of <sup>Z</sup>[*x*], the polynomial ring with integer coefficients, by a monic irreducible polynomial *f* (*x*). The ring can be thought of as a lattice in R*<sup>n</sup>* when embedded into Euclidean space via the canonical embedding. To make all objects finite, we consider these polynomial rings modulo a large prime *q*, which is often called the ciphertext modulus.
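The vector–polynomial correspondence can be made concrete for $f(x) = x^n + 1$, the choice used in the BFV scheme below: multiplication becomes a negacyclic convolution, since $x^n$ wraps around to $-1$. A minimal sketch (dimension and values purely illustrative):

```python
# Multiplication in Z[x]/(x^n + 1): polynomials with n coefficients,
# where x^n wraps around to -1 (negacyclic convolution).
def ring_mul(a, b, n):
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                res[k] += ai * bj
            else:
                res[k - n] -= ai * bj   # reduce using x^n = -1
    return res

# (1 + x) * x^3 = x^3 + x^4 = x^3 - 1 in Z[x]/(x^4 + 1)
assert ring_mul([1, 1, 0, 0], [0, 0, 0, 1], 4) == [-1, 0, 0, 1]
```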

#### *2.3 Encoding Data*

When thinking about practical applications, it becomes clear that real data first has to be embedded into the mathematical structure that the encryption map is applied to, the *plaintext space*, before it is encrypted. This encoding procedure must also be homomorphic in order to achieve the desired functionality. The encryption will be applied to the polynomial ring with integer coefficients modulo *q*, so real data must be embedded into this polynomial ring.

In a now widely cited 2011 paper, "Can Homomorphic Encryption be Practical?" ([30, Sect. 4.1]), we introduced a new way of encoding real data in the polynomial space which allowed for efficient arithmetic operations, opening up a new direction of research focusing on practical applications and computations. The encoding technique was simple: embed an integer *m* as a polynomial whose *i*th coefficient is the *i*th bit of the binary expansion of *m* (ordering the bits so that the least significant bit is encoded as the constant term of the polynomial). This allows for direct multiplication of integers, represented as polynomials, instead of encoding and encrypting data bit-by-bit, which requires a deep circuit just to evaluate simple integer multiplication. When using this approach, it is important to keep track of the growth of the size of the output of the computation. To assure correct decryption, we limit the coefficients of the result to be less than *t*. Note that each coefficient was a single bit to start with, and a sum of *k* of them grows to at most *k*. We obtain the correct decryption and decoding as long as *q* > *t* > *k*, so that the result does not wrap around modulo *t*.
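This binary encoding can be sketched as follows. Decoding is just evaluation of the polynomial at 2, so polynomial multiplication corresponds to integer multiplication as long as no coefficient wraps around modulo *t* (function names are illustrative):

```python
# Encode an integer as the polynomial whose i-th coefficient is the
# i-th bit of its binary expansion (least significant bit = constant term).
def encode(m):
    return [(m >> i) & 1 for i in range(m.bit_length())]

def decode(coeffs):
    # Evaluate the polynomial at x = 2.
    return sum(c << i for i, c in enumerate(coeffs))

def poly_mul(a, b):
    res = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            res[i + j] += ai * bj
    return res

# Multiplying the encodings and decoding recovers the integer product,
# provided every coefficient of the product stays below t.
m1, m2 = 13, 25
assert decode(poly_mul(encode(m1), encode(m2))) == m1 * m2
```

Because decoding is a ring homomorphism (evaluation at 2), addition and multiplication of encodings track addition and multiplication of the underlying integers exactly, until a coefficient overflows.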

This encoding of integers as polynomials has two important implications, for performance and for storage overhead. In addition to enabling multiplication of floating point numbers via direct multiplication of ciphertexts (rather than requiring deep circuits to multiply data encoded bit wise), this technique also saves space by packing a large floating point number into a single ciphertext, reducing the storage overhead. These encoding techniques help to squash the circuits to be evaluated, and make the size expansion reasonable. However, they limit the possible computations in interesting ways, and so all computations need to be expressed as polynomials. The key factor in determining the efficiency is the degree of the polynomial to be evaluated.

#### *2.4 Brakerski/Fan-Vercauteren Scheme (BFV)*

For completeness, I will describe one of the most widely used homomorphic encryption schemes, the Brakerski/Fan-Vercauteren Scheme (BFV) [7, 20], using the language of polynomial rings.

#### **2.4.1 Parameters and Notation**

Let $q > t > 1$ be positive integers and $n$ a power of 2. Denote $\Delta = \lfloor q/t \rfloor$. Define

$$R = \mathbb{Z}[x]/(x^n + 1),$$

$$R\_q = R/qR = (\mathbb{Z}/q\mathbb{Z})[x]/(x^n+1),$$

and $R\_t = (\mathbb{Z}/t\mathbb{Z})[x]/(x^n + 1)$, where $\mathbb{Z}[x]$ is the set of polynomials with integer coefficients and $(\mathbb{Z}/q\mathbb{Z})[x]$ is the set of polynomials with integer coefficients in the range $[0, q)$.

In the BFV scheme, plaintexts are elements of $R\_t$, and ciphertexts are elements of $R\_q \times R\_q$. Let $\chi$ denote a narrow (centered) discrete Gaussian error distribution. In practice, most implementations of homomorphic encryption use a Gaussian distribution with standard deviation $\sigma \approx 3.2$. Finally, let $U\_k$ denote the uniform distribution on $\mathbb{Z} \cap [-k/2, k/2)$.

#### **2.4.2 Key Generation**

To generate a public key, pk, and a corresponding secret key, sk, sample $s \leftarrow U\_3^n$, $a \leftarrow U\_q^n$, and $e \leftarrow \chi^n$. Each of $s$, $a$, and $e$ is treated as an element of $R\_q$, where the $n$ coefficients are sampled independently from the given distributions. To form the public key–secret key pair, let

$$\text{pk} = ([-(as + e)]\_q, \, a) \in R\_q^2, \qquad \text{sk} = s,$$

where $[\cdot]\_q$ denotes the (coefficient-wise) reduction modulo $q$.

#### **2.4.3 Encryption**

Let $m \in R\_t$ be a plaintext message. To encrypt $m$ with the public key $\text{pk} = (p\_0, p\_1) \in R\_q^2$, sample $u \leftarrow U\_3^n$ and $e\_1, e\_2 \leftarrow \chi^n$. Consider $u$ and $e\_i$ as elements of $R\_q$ as in key generation, and create the ciphertext

$$\text{ct} = ([\Delta m + p\_0 u + e\_1]\_q, \, [p\_1 u + e\_2]\_q) \in R\_q^2.$$

#### **2.4.4 Decryption**

To decrypt a ciphertext $\text{ct} = (c\_0, c\_1)$ given a secret key $\text{sk} = s$, write

$$\frac{t}{q}(c\_0 + c\_1s) = m + v + bt,$$

where $c\_0 + c\_1 s$ is computed as an integer coefficient polynomial, and scaled by the rational number $t/q$. The polynomial $b$ has integer coefficients, $m$ is the underlying message, and $v$ satisfies $\|v\|\_\infty < 1/2$. Thus decryption is performed by evaluating

$$m = \left\lfloor \frac{t}{q}(c\_0 + c\_1 s) \right\rceil\_t,$$

where $\lfloor \cdot \rceil$ denotes rounding to the nearest integer.

#### **2.4.5 Homomorphic Computation**

Next we see how to enable addition and multiplication of ciphertexts. Addition is easy: we define an operation $\oplus$ between two ciphertexts $\text{ct}\_1 = (c\_0, c\_1)$ and $\text{ct}\_2 = (d\_0, d\_1)$ as follows:

$$\text{ct}\_1 \oplus \text{ct}\_2 = ([c\_0 + d\_0]\_q, \, [c\_1 + d\_1]\_q) \in R\_q^2.$$

Denote this homomorphic sum by $\text{ct}\_{\text{sum}} = (c\_0^{\text{sum}}, c\_1^{\text{sum}})$, and note that if

$$\frac{t}{q}(c\_0 + c\_1s) = m\_1 + v\_1 + b\_1t, \quad \frac{t}{q}(d\_0 + d\_1s) = m\_2 + v\_2 + b\_2t,$$

then

$$\frac{t}{q}(c\_0^{\text{sum}} + c\_1^{\text{sum}}s) = [m\_1 + m\_2]\_t + v\_1 + v\_2 + b\_{\text{sum}}t.$$

As long as $\|v\_1 + v\_2\|\_\infty < 1/2$, the ciphertext $\text{ct}\_{\text{sum}}$ is a correct encryption of $[m\_1 + m\_2]\_t$.

Similarly, there is an operation $\otimes$ between two ciphertexts that results in a ciphertext decrypting to $[m\_1 m\_2]\_t$, as long as $\|v\_1\|\_\infty$ and $\|v\_2\|\_\infty$ are small enough. Since $\otimes$ is more difficult to describe than $\oplus$, we refer the reader to [20] for details.
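The key generation, encryption, decryption, and homomorphic addition formulas above can be sketched end-to-end. This is a toy, insecure illustration under stated assumptions: tiny parameters with $t \mid q$ so that $\Delta = q/t$ exactly, and a small uniform distribution standing in for the Gaussian $\chi$; all names are illustrative.

```python
import random

# Toy, insecure BFV sketch over R_q = (Z/qZ)[x]/(x^n + 1).
n, t = 16, 16
q = t * 2**16                  # choose t | q so Delta = q // t exactly
Delta = q // t

def ring_mul(a, b):            # negacyclic multiplication mod (x^n + 1, q)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] += a[i] * b[j]
            else:
                res[k - n] -= a[i] * b[j]     # x^n = -1
    return [c % q for c in res]

def ring_add(a, b):
    return [(x + y) % q for x, y in zip(a, b)]

def small():                   # toy stand-in for the error distribution chi
    return [random.randint(-1, 1) for _ in range(n)]

def keygen():
    s = small()                                   # secret, coefficients in U_3
    a = [random.randrange(q) for _ in range(n)]   # uniform element of R_q
    e = small()
    p0 = [(-x - y) % q for x, y in zip(ring_mul(a, s), e)]
    return (p0, a), s                             # pk = ([-(a*s + e)]_q, a)

def encrypt(pk, m):            # m: list of n coefficients modulo t
    p0, p1 = pk
    u, e1, e2 = small(), small(), small()
    dm = [(Delta * mi) % q for mi in m]
    c0 = ring_add(ring_add(dm, ring_mul(p0, u)), [x % q for x in e1])
    c1 = ring_add(ring_mul(p1, u), [x % q for x in e2])
    return c0, c1

def decrypt(sk, ct):
    c0, c1 = ct
    raw = ring_add(c0, ring_mul(c1, sk))          # [c0 + c1*s]_q
    out = []
    for c in raw:
        if c > q // 2:
            c -= q                                # center into (-q/2, q/2]
        out.append(round(t * c / q) % t)          # round(t/q * .) mod t
    return out

def add(ct1, ct2):             # homomorphic addition, coefficient-wise
    return ring_add(ct1[0], ct2[0]), ring_add(ct1[1], ct2[1])

random.seed(1)
pk, sk = keygen()
m1 = [3] + [0] * (n - 1)
m2 = [5] + [0] * (n - 1)
ct = add(encrypt(pk, m1), encrypt(pk, m2))
assert decrypt(sk, ct)[0] == 8                    # [m1 + m2]_t
```

Decrypting, the noise term $-eu + e\_1 + e\_2 s$ has infinity norm well below $\Delta/2$ at these parameters, so the rounding step recovers the plaintext sum exactly.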

#### **2.4.6 Noise**

In the decryption formula presented above, the polynomial $v$ with rational coefficients is assumed to have infinity-norm less than 1/2. Otherwise, the plaintext output by decryption will be incorrect. Given a ciphertext $\text{ct} = (c\_0, c\_1)$ which is an encryption of a plaintext $m$, let $v \in \mathbb{Q}[x]/(x^n + 1)$ be such that

$$\frac{t}{q}(c\_0 + c\_1s) = m + v + bt.$$

The infinity norm of the polynomial $v$ is called the noise, and the ciphertext decrypts correctly as long as the noise is less than 1/2.

When operations such as addition and multiplication are applied to encrypted data, the noise in the result may be larger than the noise in the inputs. This noise growth is very small in homomorphic additions, but substantially larger in homomorphic multiplications. Thus, given a specific set of encryption parameters (*n*, *q*, *t*,χ), one can only evaluate computations of a bounded size (or bounded multiplicative depth).

A precise estimate of the noise growth for the YASHE scheme was given in [4] and these estimates were used in [5] to give an algorithm for selecting secure parameters for performing any given computation. Although the specific noise growth estimates needed for this algorithm do depend on which homomorphic encryption scheme is used, the general idea applies to any scheme.

#### *2.5 Other Homomorphic Encryption Schemes*

In 2011, researchers at Microsoft Research and Weizmann Institute published the (BV/BGV [8, 9]) homomorphic encryption scheme which is used by teams around the world today. In 2013, IBM released HELib, a homomorphic encryption library for research purposes, which implemented the BGV scheme. HELib is written in C++ and uses the NTL mathematical library. The Brakerski/Fan-Vercauteren (BFV) scheme described above was proposed in 2012. Alternative schemes with different security and error-growth properties were proposed in 2012 by Lopez-Alt, Tromer, and Vaikuntanathan (LTV [33]), and in 2013 by Bos, Lauter, Loftus, and Naehrig (YASHE [4]). The Cheon-Kim-Kim-Song (CKKS [14]) scheme was introduced in 2016, enabling approximate computation on ciphertexts.

Other schemes [16, 19] for general computation on bits are more efficient for logical tasks such as comparison, which operate bit-by-bit. Current research attempts to make it practical to switch between such schemes to enable both arithmetic and logical operations efficiently ([6]).

#### *2.6 Microsoft SEAL*

Early research prototype libraries were developed by the Microsoft Research (MSR) Cryptography group to demonstrate the performance numbers for initial applications such as those developed in [4, 5, 23, 29]. But due to requests from the biomedical research community, it became clear that it would be very valuable to develop a well-engineered library which would be widely usable by developers to enable privacy solutions. The Simple Encrypted Arithmetic Library (SEAL) [37] was developed in 2015 by the MSR Cryptography group with this goal in mind, and is written in C++. Microsoft SEAL was publicly released in November 2015, and was released open source in November 2018 for commercial use. It has been widely adopted by teams worldwide and is freely available online (http://sealcrypto.org).

Microsoft SEAL aims to be easy to use for non-experts, and at the same time powerful and flexible for expert use. SEAL maintains a delicate balance between usability and performance, but is extremely fast due to high-quality engineering. SEAL is extensively documented, and has no external dependencies. Other publicly available libraries include HELib from IBM, PALISADE by Duality Technologies, and HEAAN from Seoul National University.

#### *2.7 Standardization of Homomorphic Encryption [1]*

When new public key cryptographic primitives are introduced, historically there has been roughly a 10-year lag in adoption across the industry. In 2017, Microsoft Research Outreach and the MSR Cryptography group launched a consortium for advancing the standardization of homomorphic encryption technology, together with our academic partners, researchers from government and military agencies, and partners and customers from various industries: HomomorphicEncryption.org. The first workshop was hosted at Microsoft in July 2017, and developers for all the existing implementations around the world were invited to demo their libraries.

At the July 2017 workshop, we worked in groups to draft three white papers on Security, Applications, and APIs. We then worked with all relevant stakeholders of the HE community to revise the Security white paper [11] into the first draft standard for homomorphic encryption [1]. The Homomorphic Encryption Standard (HES) specifies secure parameters for the use of homomorphic encryption. The draft standard was initially approved by the HomomorphicEncryption.org community at the second workshop at MIT in March 2018, and then was finalized and made publicly available at the third workshop in October 2018 at the University of Toronto [1]. A study group was initiated in 2020 at the ISO, the International Standards Organization, to consider next steps for standardization.

#### **3 What Kind of Computation Can We Do?**

#### *3.1 Statistical Computations*

In early work, we focused on demonstrating the feasibility of statistical computations on health and genomic data, because privacy concerns are obvious in the realm of health and genomic data, and statistical computations are an excellent fit for efficient HE because they have very low depth. We demonstrated HE implementations and performance numbers for statistical computations in genomics such as the chi-square test, the Cochran-Armitage Test for Trend, and Haplotype Estimation Maximization [29]. Next, we focused on string matching, using the Smith-Waterman algorithm for edit distance [15], another task which is frequently performed for genome sequencing and the study of genomic disease.

#### *3.2 Heart Attack Risk*

To demonstrate operations on health data, in 2013 we developed a live demo predicting the risk of having a heart attack based on six health characteristics [5]. We evaluated predictive models developed over decades in the Framingham Heart Study, using the Cox proportional hazards method. I showed the demo live to news reporters at the 2014 AAAS meeting, and our software processed my risk for a heart attack in the cloud, operating on encrypted data, in a fraction of a second.

In 2016, we started a collaboration with Merck to demonstrate the feasibility of evaluating such models on large patient populations. Inspired by our published work on heart attack risk prediction [5], they used SEAL to demonstrate running the heart attack risk prediction on one million patients from an affiliated hospital. Their implementation returned the results for all patients in about 2 h, compared to 10 min for the same computation on unencrypted patient data.

#### *3.3 Cancer Patient Statistics*

In 2017, we began a collaboration with Crayon, a Norwegian company that develops health record systems. The goal of this collaboration was to demonstrate the value of SEAL in a real-world working environment. Crayon reproduced all computations in the 2016 Norwegian Cancer Report using SEAL and operating on encrypted inputs. The report processed the cancer statistics from all cancer patients in Norway collected over roughly the last five decades.

#### *3.4 Genomic Privacy*

Engaging with a community of researchers in bioinformatics and biostatistics who were concerned with patient privacy issues led to a growing interdisciplinary community interested in the development of a range of cryptographic techniques to apply to privacy problems in the health and biological sciences arenas [18]. One measure of the growth of this community over the last five years has been participation in the iDASH Secure Genome Analysis Competition, a series of annual international competitions funded by the National Institutes of Health (NIH) in the U.S. The iDASH competition has included a track on Homomorphic Encryption for the last five years 2015–2019, and our team from MSR submitted winning solutions for the competition in 2015 ([27]) and 2016 ([10]). The tasks were: chi-square test, modified edit distance, database search, training logistic regression models, genotype imputation. Each year, roughly 5–10 teams from research groups around the world submitted solutions for the task, which were bench-marked by the iDASH team. These results provide the biological data science community and NIH with real and evolving measures of the performance and capability of homomorphic encryption to protect the privacy of genomic data sets while in use. Summaries of the competitions are published in [38, 40].

#### *3.5 Machine Learning: Training and Prediction*

The 2013 "ML Confidential" paper [23] was the first to propose *training* ML algorithms on homomorphically encrypted data and to show initial performance numbers for simple models such as linear means classifiers and gradient descent. Training is inherently challenging because of the large and unknown amount of data to be processed.

Prediction tasks, on the other hand, process an input and model of known size, so many can be processed efficiently. For example, in 2016 we developed a demo using SEAL to predict the flowering time for a flower. The model processed 200,000 SNPs from the genome of the flower, and evaluated a Fast Linear Mixed Model (LMM). Including the round-trip communication time with the cloud running the demo as a service in Azure, the prediction was obtained in under a second.

Another demo developed in 2016 using SEAL predicted the mortality risk for pneumonia patients based on 46 characteristics from the medical record for the patient. The model in this case is an example of an intelligible model and consists of 46 degree-4 polynomials to be evaluated on the patient's data. Data from 4,096 patients can be batched together, and the prediction for all 4,096 patients was returned by the cloud service in a few seconds (in 2016).

These two demos evaluated models which were represented by shallow circuits, linear in the first case and degree 4 in the second case. Other models such as deep neural nets (DNNs) are inherently more challenging because the circuits are so deep. To enable efficient solutions for such tasks requires a blend of cryptography and ML research, aimed at designing and testing ways to process data which allow for efficient operations on encrypted data while maintaining accuracy. An example of that was introduced in CryptoNets [22], showing that the activation function in the layers of the neural nets can be approximated with a low-degree polynomial function ($x^2$) without significant loss of accuracy.
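The idea of the $x^2$ activation can be sketched in a few lines: with a squaring activation, a forward pass uses only additions and multiplications, exactly the operations available on homomorphically encrypted data, so the whole network becomes a low-degree polynomial in its inputs. The weights and input below are arbitrary illustrative values, not from CryptoNets.

```python
# A tiny dense network with the CryptoNets-style x^2 activation:
# every step is an addition or a multiplication, hence HE-friendly.
def dense(x, w, b):
    # One fully connected layer: w is a list of rows, b a list of biases.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def square(x):                 # polynomial activation of degree 2
    return [v * v for v in x]

x = [1.0, -2.0]                                    # illustrative input
h = square(dense(x, [[0.5, 0.25], [-1.0, 0.5]], [0.1, 0.0]))
y = dense(h, [[1.0, 1.0]], [0.0])
# y is a degree-2 polynomial in the entries of x.
```

Replacing a ReLU or sigmoid with $x^2$ changes the function the network computes, which is why CryptoNets had to verify empirically that classification accuracy was preserved after retraining with the polynomial activation.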

The CryptoNets paper was the first to show the evaluation of neural net predictions on encrypted data, and used the techniques introduced there to classify hand-written digits from the MNIST [31] data set. Many teams have since worked on improving the performance of CryptoNets, either with hybrid schemes or other optimizations [17, 25, 35]. In 2018, in collaboration with Median Technologies, we demonstrated deep neural net predictions for a medical image recognition task: classification of liver tumors based on medical images.

Returning to the challenge of training ML algorithms, the 2017 iDASH contest task required the teams to train a logistic regression model on encrypted data. The data set provided for the competition was very simple and did not require many iterations to train an effective model (the winning solution used only 7 iterations [26, 28]). The MSR solution [12] computed over 300 iterations and was fully scalable to any arbitrary number of iterations. We also applied our solution to a simplified version of the MNIST data set to demonstrate the performance numbers.

Performance numbers for all computations described here were published at the time of discovery. They would need to be updated now with the latest version of SEAL, or can be estimated. Hardware acceleration techniques using state-of-the-art FPGAs can be used to improve the performance further ([34]).

#### **4 How Do We Assess Security?**

The security of all homomorphic encryption schemes described in this article is based on the mathematics of lattice-based cryptography, and the hardness of well-known lattice problems in high dimensions, problems which have been studied for more than 25 years. Compare this to the age of other public key systems such as RSA (1975) or Elliptic Curve Cryptography ECC (1985). Cryptographic applications of lattice-based cryptography were first proposed by Hoffstein, Pipher, and Silverman [24] in 1996 and led them to launch the company NTRU. New hard problems such as LWE were proposed in the period of 2004–2010, but were reduced to older problems which had been studied already for several decades: the Approximate Shortest Vector Problem (SVP) and Bounded Distance Decoding.

The best known algorithms for attacking the Shortest Vector Problem or the Closest Vector Problem are called lattice basis reduction algorithms, and they have a more than 30-year history, including the LLL algorithm [32]. LLL runs in polynomial time, but only finds an exponentially bad approximation to the shortest vector. More recent improvements, such as BKZ 2.0 [13], involve exponential algorithms such as sieving and enumeration. Hard Lattice Challenges were created by TU Darmstadt and are publicly available online for anyone to try to attack and solve hard lattice problems of larger and larger size for the record.

Homomorphic Encryption scheme parameters are set such that the best known attacks take exponential time (exponential in the dimension of the lattice, $n$, meaning roughly $2^n$ time). These schemes have the advantage that there are no known polynomial time quantum attacks, which means they are good candidates for Post-Quantum Cryptography (PQC) in the ongoing 5-year NIST PQC competition.

Lattice-based cryptography is currently under consideration for standardization in the ongoing NIST PQC Post-Quantum Cryptography competition. Most Homomorphic Encryption deployments use small secrets as an optimization, so it is important to understand the concrete security when sampling the secret from a non-uniform, small distribution. There are numerous heuristics used to estimate the running time and quality of lattice reduction algorithms such as BKZ2.0. The Homomorphic Encryption Standard recommends parameters based on the heuristic running time of the best known attacks, as estimated in the online LWE Estimator [2].

#### **5 Conclusion**

Homomorphic Encryption is a technology which allows meaningful computation on encrypted data, and provides a tool to protect the privacy of data in use. A primary application of Homomorphic Encryption is secure and confidential outsourced storage and computation in the cloud (i.e. a data center). A client encrypts their data locally, stores their encryption key(s) locally, then uploads the encrypted data to the cloud for long-term storage and analysis. The cloud processes the encrypted data without decrypting it, and returns encrypted answers to the client for decryption. The cloud learns nothing about the data other than the size of the encrypted data and the size of the computation. The cloud can process Machine Learning or Artificial Intelligence (ML or AI) computations, either to make predictions based on known models or to train new models, while preserving the client's privacy.

Current solutions for HE are implemented in 5–6 major open source libraries world-wide. The Homomorphic Encryption Standard [1] for using HE securely was approved in 2018 by HomomorphicEncryption.org, an international consortium of researchers in industry, government, and academia.

Today, applied Homomorphic Encryption remains an exciting direction in cryptography research. Several big and small companies, government contractors, and academic research groups are enthusiastic about the possibilities of this technology. With new algorithmic improvements, new schemes, an improved understanding of concrete use-cases, and an active standardization effort, wide-scale deployment of homomorphic encryption seems possible within the next 2–5 years. Small-scale deployment is already happening.

Computational performance, memory overhead, and the limited set of operations available in most libraries remain the main challenges. Most homomorphic encryption schemes are inherently parallelizable, and exploiting that parallelism is important for achieving good performance. Easily parallelizable arithmetic computations therefore seem the most amenable to homomorphic encryption at this time, and it seems plausible that initial wide-scale deployment will be in applications of Machine Learning that enable Private AI.

**Acknowledgements** I would like to gratefully acknowledge the contributions of many people in the achievements, software, demos, standards, assets and impact described in this article. First and foremost, none of this software or applications would exist without my collaborators on the SEAL team, including Kim Laine, Hao Chen, Radames Cruz, Wei Dai, Ran Gilad-Bachrach, Yongsoo Song, John Wernsing, with substantial contributions from interns Gizem Cetin, Kyoohyung Han, Zhicong Huang, Amir Jalali, Rachel Player, Peter Rindal, Yuhou Xia as well. The demos described here were developed largely by our partner engineering team in Foundry 99: Shabnam Erfani, Sreekanth Kannepalli, Steven Chith, James French, Hamed Khanpour, Tarun Singh, Jeremy Tieman. I launched the Homomorphic Encryption Standardization process in collaboration with Kim Laine from my team, with Roy Zimmermann and the support of Microsoft Outreach, and collaborators Kurt Rohloff, Vinod Vaikuntanathan, Shai Halevi, and Jung Hee Cheon, and collectively we now form the Steering Committee of HomomorphicEncryption.org. Finally I would like to thank the organizers of ICIAM 2019 for the invitation to speak and to write this article.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Mathematical Approaches for Contemporary Materials Science: Addressing Defects in the Microstructure**

**Claude Le Bris**

**Abstract** We overview a series of mathematical works that introduce new modeling and computational approaches for non-periodic materials and media. The approaches consider various types of defects embedded in a periodic structure, which can be either deterministic or random in nature. A portfolio of possible computational techniques addressing the identification of the homogenized properties of the material or the determination of the actual multi-scale solution is presented.

#### **1 Introduction**

#### *1.1 Contemporary Materials Science*

The works outlined in the present review have been motivated by the following two-fold observation. In the past couple of decades, what we believe to be the most spectacular changes in materials science are


C. Le Bris

Ecole des Ponts and Inria, Paris, France e-mail: claude.le-bris@enpc.fr

© The Author(s) 2022

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_7

and consist of mono-crystalline grains, each of them possibly of a different crystalline structure, each crystalline structure being itself flawed, sprinkled as it is with defects and dislocations; the imperfections, or violations of periodicity, affect every possible scale, and actually cut across scales.

As a result, the real materials that contemporary materials scientists have to model have a *multi-scale, imperfect, possibly random* nature. Such materials have several characteristic length-scales that possibly differ from one another by orders of magnitude but must be accounted for simultaneously. At possibly each such scale, they have defects. Their qualitative and quantitative response might therefore differ a lot from the idealized scenario long considered.

*Our intent here is to present several mathematical and numerical endeavors that aim to better model, understand and simulate non-periodic multi-scale problems.*

The specific theoretical context in which we develop our discussion is homogenization of simple, second order elliptic equations in divergence form with highly oscillatory coefficients:

$$-\operatorname{div}\left[A\_{\varepsilon}(\mathbf{x})\nabla\boldsymbol{u}^{\varepsilon}\right] = f,\tag{1}$$

in a domain $\mathcal{D} \subset \mathbb{R}^d$, with, say, homogeneous Dirichlet boundary conditions $u^{\varepsilon} = 0$ on $\partial\mathcal{D}$. This particular case is to be thought of as a prototypical case. It is intuitively clear that the same approaches carry over to other settings. Current works are indeed directed toward extending many of the considerations here to other types of equations, as will be clear in the exposition below.

We conclude this introductory section with a quick presentation of the classical theory. The reader familiar with this theory may of course skip the presentation and directly proceed to Sect. 2.

#### *1.2 Basics of Homogenization Theory*

#### **1.2.1 Periodic Homogenization**

To begin with, we recall some well known, basic ingredients of elliptic homogenization theory in the periodic setting, *see* the classical references [8, 29, 42] for more details, or an overview in [1, Chap. 1]. We consider the problem

$$\begin{cases} -\text{div}\left[A\_{per}\left(\frac{x}{\varepsilon}\right)\nabla u^{\varepsilon}\right] = f \quad \text{in} \quad \mathcal{D},\\ u^{\varepsilon} = 0 \quad \text{on} \quad \partial\mathcal{D}, \end{cases} \tag{2}$$

where the matrix $A\_{per}$ is $\mathbb{Z}^d$-periodic, bounded and bounded away from zero, and (for simplicity) symmetric. The corrector problem associated to Eq. 2 reads, for $\mathbf{p}$ fixed in $\mathbb{R}^d$,

$$\begin{cases} -\text{div}\left(A\_{per}(\mathbf{y})\left(\mathbf{p} + \nabla w\_{per,\mathbf{p}}\right)\right) = 0, \\ w\_{per,\mathbf{p}} \text{ is } \mathbb{Z}^d\text{-periodic.} \end{cases} \tag{3}$$

It has a unique solution up to the addition of a constant. This solution is meant to describe prototypical fine oscillations of the exact solution *u*<sup>ε</sup> for ε small. Then, the homogenized coefficients read

$$[A^\*\_{per}]\_{ij} = \int\_{\mathcal{Q}} \mathbf{e}\_i^T A\_{per}(\mathbf{y}) \left(\mathbf{e}\_j + \nabla w\_{per, \mathbf{e}\_j}(\mathbf{y})\right) d\mathbf{y},\tag{4}$$

where $Q$ is the unit cube and $\mathbf{e}\_i$, $1 \le i \le d$, are the canonical vectors of $\mathbb{R}^d$. The main result of periodic homogenization theory for Eq. 2 is that, as $\varepsilon$ vanishes, the solution $u^{\varepsilon}$ to Eq. 2 converges to the solution $u^\*$ to

$$\begin{cases} -\text{div}\left[A\_{per}^\* \nabla u^\*\right] = f \quad \text{in} \quad \mathcal{D},\\ u^\* = 0 \quad \text{on} \quad \partial \mathcal{D}. \end{cases} \tag{5}$$

The convergence holds in $L^2(\mathcal{D})$, and weakly in $H^1\_0(\mathcal{D})$. The correctors $w\_{per,\mathbf{e}\_i}$ may then also be used to "correct" $u^\*$ in order to show that, in the strong topology of $H^1(\mathcal{D})$, $u^{\varepsilon} - u^{\varepsilon,1}$ converges to zero, where $u^{\varepsilon,1}(x) = u^\*(x) + \varepsilon \sum\_{i=1}^{d} \partial\_{x\_i} u^\*(x)\, w\_{per,\mathbf{e}\_i}(x/\varepsilon)$. The rate of convergence may also be made precise.

The practical conclusion is that, at the price of only solving the $d$ periodic problems of Eq. 3, the solution to Eq. 2 can be efficiently approximated for $\varepsilon$ small.
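In dimension $d = 1$ everything above is explicit: the corrector problem can be integrated by hand and Eq. 4 reduces to the harmonic mean of $A\_{per}$. The following illustrative Python sketch (with a hypothetical coefficient $2 + \cos(2\pi y)$, for which the harmonic mean is exactly $\sqrt{3}$) checks that the oscillatory solution approaches the homogenized one as $\varepsilon$ shrinks:

```python
from math import cos, pi

def A_per(y):
    # A smooth, 1-periodic, uniformly positive coefficient (hypothetical
    # choice); for 2 + cos(2*pi*y) the harmonic mean is exactly sqrt(3).
    return 2.0 + cos(2.0 * pi * y)

def midpoint(f, a, b, n=20000):
    # Composite midpoint quadrature rule.
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# 1D homogenized coefficient: the harmonic mean of A_per (Eq. 4 in d = 1).
A_star = 1.0 / midpoint(lambda y: 1.0 / A_per(y), 0.0, 1.0)

def u_eps(x, eps):
    # Exact solution of -(A(x/eps) u')' = 1, u(0) = u(1) = 0:
    # u'(t) = (c - t)/A(t/eps), with c fixed by the condition u(1) = 0.
    invA = lambda t: 1.0 / A_per(t / eps)
    c = midpoint(lambda t: t * invA(t), 0.0, 1.0) / midpoint(invA, 0.0, 1.0)
    return midpoint(lambda t: (c - t) * invA(t), 0.0, x)

def u_star(x):
    # Solution of the homogenized problem -A_star u'' = 1, u(0) = u(1) = 0.
    return x * (1.0 - x) / (2.0 * A_star)

print(abs(u_eps(0.5, 0.01) - u_star(0.5)))  # small: the error is O(eps)
```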

#### **1.2.2 Random Homogenization**

A first option to go beyond the simplistic setting of periodic structures is to consider random structures. Of course, materials are never random in nature: randomness is rather a suitable, practical way to encode our ignorance of, or at best our uncertainty about, the intimate microscopic structure of the material considered.

For homogenization, the random setting is a highly nontrivial extension of the periodic setting. Many questions, in particular for nonlinear equations, remain open in the random case although they are solved and well documented in the periodic case. Fortunately, in the case of linear diffusion equations such as Eq. 1, the state of affairs is that, loosely speaking, all the convergence results still essentially hold true, but (a) they are more difficult to prove and (b) the convergence rates are even more difficult to establish.

To fix ideas, we now give some more formal details on one random case. For brevity, we skip all technicalities related to the definition of the probabilistic setting, which we assume discrete, stationary and ergodic (we refer e.g. to [2] for all details). We now fix $A(\cdot, \omega)$, a square matrix of size $d$, again bounded and bounded away from zero, and symmetric, which is assumed stationary in the sense

$$\forall \mathbf{k} \in \mathbb{Z}^d, \quad A(\mathbf{x} + \mathbf{k}, \omega) = A(\mathbf{x}, \tau\_{\mathbf{k}}\omega) \text{ almost everywhere in } \mathbf{x}, \text{ almost surely} \tag{6}$$

(where $\tau$ is an ergodic group action). This amounts to assuming that the law of $A(\cdot, \omega)$ is $\mathbb{Z}^d$-periodic. Then we consider the boundary value problem

$$\begin{cases} -\text{div}\left(A\left(\frac{x}{\varepsilon},\omega\right)\nabla u^{\varepsilon}\right) = f \quad \text{in} \quad \mathcal{D},\\ u^{\varepsilon} = 0 \quad \text{on} \quad \partial\mathcal{D}. \end{cases} \tag{7}$$

Standard results of random homogenization [8, 29] apply and allow one to find the homogenized problem for Eq. 7. These results generalize the periodic results recalled in Sect. 1.2.1. The solution $u^{\varepsilon}$ to Eq. 7 converges to the solution to Eq. 5, where the homogenized matrix is now defined as:

$$[A^\*]\_{ij} = \mathbb{E}\left(\int\_{\mathcal{Q}} \mathbf{e}\_i^T A \left(\mathbf{y}, \cdot \right) \left(\mathbf{e}\_j + \nabla w\_{\mathbf{e}\_j}(\mathbf{y}, \cdot) \right) \, d\mathbf{y} \right), \tag{8}$$

where for any $\mathbf{p} \in \mathbb{R}^d$, $w\_{\mathbf{p}}$ is the solution (unique up to the addition of a random constant) to

$$\begin{cases} -\text{div}\left[A\left(\mathbf{y},\omega\right)\left(\mathbf{p}+\nabla w\_{\mathbf{p}}(\mathbf{y},\omega)\right)\right] = 0 \quad \text{a.s. on } \mathbb{R}^d, \\ \nabla w\_{\mathbf{p}} \quad \text{is stationary in the sense of Eq. 6,} \\ \mathbb{E}\left(\int\_{\mathcal{Q}} \nabla w\_{\mathbf{p}}(\mathbf{y},\cdot) \, d\mathbf{y}\right) = \mathbf{0}. \end{cases} \tag{9}$$

A striking difference between the random setting and the periodic setting can be observed by comparing Eqs. 3 and 9. In the periodic case, the corrector problem is posed on a *bounded* domain, namely the periodic cell $Q$. In sharp contrast, the corrector problem in Eq. 9 of the random case is posed *on the whole space* $\mathbb{R}^d$, and cannot be reduced, at the theoretical level, to a problem posed on a bounded domain. The fact that the random corrector problem is posed on the entire space has far-reaching consequences both for theory and for numerical practice. To some extent, the unboundedness of the domain on which the corrector problem is posed is a common denominator of all the settings that we will address in the present survey. *This unboundedness of the corrector problem is also a fundamental characteristic feature of the practically relevant problems of materials science*. We cannot emphasize this fact enough.

In order to approximate Eq. 9 numerically, truncations of the problem have to be considered, typically on large domains $Q\_N = [0, N]^d$ with periodic boundary conditions. The actual homogenized coefficients are only captured in the asymptotic regime $Q\_N \to \mathbb{R}^d$. Overall, it is fair to consider that the approach is computationally very expensive, and often actually prohibitively expensive. Therefore, in many practical situations, the size of the "large" domain $Q\_N$ considered is in fact small, and the number of realizations of the random microstructure considered therein to approximate the expectation in Eq. 8 is also dramatically limited. Put differently, *there is a large gap looming between the actual practice and the regime where the theory provides relevant information.*
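The truncation procedure is easy to visualize in dimension 1 with an i.i.d. per-cell coefficient: there, the corrector problem truncated to $Q\_N$ with periodic boundary conditions yields exactly the harmonic mean of the $N$ cell values, which converges to $(\mathbb{E}[1/A])^{-1}$ only as $N \to \infty$. A small illustrative sketch, with toy values:

```python
import random

random.seed(0)

def sample_cells(N):
    # i.i.d. per-cell conductivities: 1 or 4 with equal probability (toy values).
    return [random.choice([1.0, 4.0]) for _ in range(N)]

def A_star_truncated(cells):
    # In 1D, the corrector problem truncated to [0, N] with periodic
    # boundary conditions gives exactly the harmonic mean of the cell values.
    N = len(cells)
    return N / sum(1.0 / a for a in cells)

# Exact homogenized coefficient: (E[1/A])^{-1} = 1/(0.5/1 + 0.5/4) = 1.6.
exact = 1.0 / (0.5 / 1.0 + 0.5 / 4.0)

for N in (10, 100, 10000):
    est = sum(A_star_truncated(sample_cells(N)) for _ in range(50)) / 50
    print(N, est)  # approaches exact = 1.6 as N grows
```

Even in this trivial 1D case, small $N$ gives a systematically biased, fluctuating estimate, which illustrates the gap between practice and the asymptotic regime mentioned above.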

Important theoretical questions about the quality and the rate of the convergence in terms of the truncation size arise: see, in particular, the pioneering works by Bourgeat and Piatnitski [17, 18] and, more broadly and recently, a series of works by F. Otto, A. Gloria, S. Armstrong, Ch. Smart, J.-C. Mourrat and their many collaborators, see e.g. [25, 26] for examples of contributions.

#### **2 A Mathematical Toolbox for "Weakly" Random Problems**

With this section we begin our study of homogenization of non-periodic problems. We have already mentioned that one possible option is the random setting, and we have mentioned the practical difficulties it raises. In many practical situations, however, the real material under consideration is not far from being a periodic material. At zeroth order of approximation, the material can be considered periodic, and it is only at higher order that disorder might play a role. We choose, in this section, to encode this disorder using randomness. When the "material" under study is the geological bedrock, there is of course no reason for this assumption to be valid, and the classical random model of Sect. 1.2.2 might be more relevant. In contrast, the assumption makes a lot of sense when considering manufactured materials, where the defect of periodicity is typically due to flaws in the manufacturing process: the material was *meant* to be periodic, but it is actually not. The practically relevant question is to understand whether or not, despite its smallness, the microscopic amount of randomness might affect the macroscale at first order. Answering this question requires coming up with a modeling strategy for the imperfect material.

Our purpose here is to outline a modeling strategy that accounts for the presence of randomness in a multi-scale computation, but specifically addresses the case when the amount of randomness present in the system is small. In this case, we call the material *weakly random*. The weakly random material is thus considered as a small perturbation of a periodic material. Our purpose is to introduce a toolbox of possible modeling strategies that all keep the computational workload limited (in comparison to a direct attack of the problem as if, as in Sect. 1.2.2, the randomness were *not* small) and that all provide an approximation of the response of the material which one may certify by error estimates.

As mentioned above, the simple diffusion equation Eq. 1 is a perfect prototypical testbed for our toolbox. It is ubiquitous in several, if not all engineering sciences and life sciences. Although we have not developed our theory and computations for other, more general equations and settings, we are convinced that the same line of approach (namely small amount of randomness as compared to a reference periodic setting, plus expansion in the randomness amplitude, and simplified computations) can be useful in many contexts.

#### *2.1 Random Deformations of the Periodic Setting*

A first random setting, which has been introduced and studied in [11] and is not, mathematically, a particular case of the classical stationary setting recalled in Sect. 1.2.2, consists of *random deformations* of a periodic structure. As said above, it is motivated by the consideration of random geometries that have some specific proximity to the periodic setting. The periodic setting is here taken as a reference configuration, somewhat similarly to the classical mathematical formalization of continuum mechanics where a reference configuration is used to define the state of the material under study. Another related idea, in a completely different context, is the consideration of a reference element for finite element computations. The real situation is then seen via a *mapping* from the reference configuration to the actual configuration. Here, this mapping is a *random* mapping (otherwise, one would know everything about the material up to a change of coordinates, and there would be little practical interest in the approach). Assuming some regularity of this mapping induces constraints on the sets of geometries that the microstructures of the material can take. Put differently, the material structure, even though it is not entirely known, is not arbitrarily disordered.

We fix some $\mathbb{Z}^d$-periodic $A\_{per}$, assumed to satisfy the usual properties of boundedness and coerciveness, and we consider the following specific form of the coefficient $A^{\varepsilon}$ in Eq. 1

$$A^{\varepsilon}\left(\mathbf{x}, \omega\right) = A\_{per}\left(\Phi^{-1}\left(\frac{\mathbf{x}}{\varepsilon}, \omega\right)\right), \tag{10}$$

where the function $\Phi(\cdot, \omega)$ is assumed to be, almost surely, a diffeomorphism from $\mathbb{R}^d$ to $\mathbb{R}^d$. The diffeomorphism, called a *random stationary diffeomorphism*, is assumed to additionally satisfy

$$\operatorname{essinf}\_{\omega \in \Omega, \, x \in \mathbb{R}^d} [\det(\nabla \Phi(\mathbf{x}, \omega))] = \nu > 0,\tag{11}$$

$$\operatorname{esssup}\_{\omega \in \Omega, \,\, x \in \mathbb{R}^d} (|\nabla \Phi(\mathbf{x}, \omega)|) = M < \infty,\tag{12}$$

$$\nabla \Phi(\mathbf{x}, \omega) \quad \text{is stationary in the sense of Eq. 6.} \tag{13}$$

Note that the first two assumptions enforce the "homogeneity" of the diffeomorphism: the deformed periodic structure neither implodes nor explodes anywhere.

Homogenization holds for the above problem (the details are made precise in [11]). The homogenized problem again reads as in Eq. 5 with the homogenized matrix given by:

$$\begin{aligned} [A^\*]\_{ij} &= \det \left( \mathbb{E} \left( \int\_{\mathcal{Q}} \nabla \Phi(z, \cdot)\, dz \right) \right)^{-1} \\ &\times \mathbb{E} \left( \int\_{\Phi(\mathcal{Q}, \cdot)} \mathbf{e}\_i^T A\_{per} \left( \Phi^{-1}(\mathbf{y}, \cdot) \right) \left( \mathbf{e}\_j + \nabla w\_{\mathbf{e}\_j}(\mathbf{y}, \cdot) \right) d\mathbf{y} \right), \end{aligned} \tag{14}$$

where for any **<sup>p</sup>** <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* , <sup>w</sup>**<sup>p</sup>** is the solution (unique up to the addition of a random constant and belonging to the suitable functional space) to

$$\begin{cases} -\text{div}\left[A\_{per}\left(\Phi^{-1}(\mathbf{y},\omega)\right)\left(\mathbf{p}+\nabla w\_{\mathbf{p}}\right)\right] = 0 \quad \text{a.s. on } \mathbb{R}^d, \\ w\_{\mathbf{p}}(\mathbf{y},\omega) = \tilde{w}\_{\mathbf{p}}\left(\Phi^{-1}(\mathbf{y},\omega),\omega\right), \quad \nabla \tilde{w}\_{\mathbf{p}} \quad \text{is stationary in the sense of Eq. 6,} \\ \mathbb{E}\left(\int\_{\Phi(\mathcal{Q}, \cdot)} \nabla w\_{\mathbf{p}}(\mathbf{y},\cdot)\, d\mathbf{y}\right) = \mathbf{0}. \end{cases} \tag{15}$$

At first sight, there seems to be no simplification whatsoever in considering the above system Eq. 15, which even looks far more complex than the classical random problem Eq. 9. The key point, though, is that the introduction of a new modeling "parameter", namely the random diffeomorphism $\Phi$, allows one, in some sense, to introduce a distance between the periodic case ($\Phi = \mathrm{Id}$) and the random case ($\Phi \neq \mathrm{Id}$) considered. Our next step consists in proceeding in this direction.

#### *2.2 Small Random Perturbations of the Periodic Setting*

We now superimpose to the setting defined in the previous section the assumption that the material considered is a *small* perturbation of a periodic material. This is formalized upon writing

$$
\Phi(\mathbf{x}, \omega) = \mathbf{x} + \eta \,\Psi(\mathbf{x}, \omega) + O(\eta^2), \tag{16}
$$

where $\Psi$ is any random field such that $\Phi$ is a random stationary diffeomorphism satisfying Eqs. 11-13 for $\eta$ sufficiently small.

It has been shown in [11] that, when $\Phi$ is such a perturbation of the identity map (see Fig. 1), the solution to the corrector problem of Eq. 15 may be expanded in powers of the small parameter $\eta$. It reads $w\_{\mathbf{p}}(x, \omega) = w\_{per,\mathbf{p}}(x) + \eta\, w^1\_{\mathbf{p}}(x, \omega) + O(\eta^2)$, where $w\_{per,\mathbf{p}}$ is the periodic corrector defined in Eq. 3 and where $w^1\_{\mathbf{p}}$ solves

**Fig. 1** Small random deformation of a periodic structure. In the unperturbed periodic environment, the inclusions are circular and periodic. The deformation of each inclusion is performed randomly. *Source* [21]

$$\begin{cases} -\text{div}\left[A\_{per}\nabla w\_{\mathbf{p}}^{1}\right] \\ = \text{div}\left[-A\_{per}\nabla\Psi\,\nabla w\_{per,\mathbf{p}} - (\nabla\Psi^{T} - (\text{div}\,\Psi)\text{Id})\,A\_{per}\left(\mathbf{p} + \nabla w\_{per,\mathbf{p}}\right)\right], \\ \nabla w\_{\mathbf{p}}^{1}\text{ is stationary and }\mathbb{E}\left(\int\_{\mathcal{Q}}\nabla w\_{\mathbf{p}}^{1}\right) = \mathbf{0}. \end{cases} \tag{17}$$

The problem of Eq. 17 in $w^1\_{\mathbf{p}}$ is random in nature, but it is in fact easy to see, taking the expectation, that $\overline{w}^1\_{\mathbf{p}} = \mathbb{E}(w^1\_{\mathbf{p}})$ is periodic and solves the *deterministic* problem

$$-\text{div}\left[ A\_{per} \,\nabla \overline{w}\_{\mathbf{p}}^{1} \right] = \text{div}\left[ -A\_{per} \,\mathbb{E}(\nabla \Psi) \,\nabla w\_{per,\mathbf{p}} - \left( \mathbb{E}(\nabla \Psi^{T}) - \mathbb{E}(\text{div}\,\Psi)\,\text{Id} \right) A\_{per} \left( \mathbf{p} + \nabla w\_{per,\mathbf{p}} \right) \right].$$

This is useful because, on the other hand, the knowledge of $w\_{per,\mathbf{p}}$ and $\overline{w}^1\_{\mathbf{p}}$ suffices to obtain a first-order expansion (in $\eta$) of the homogenized matrix. Indeed, with $A^\*\_{per}$ the periodic homogenized tensor as defined in Eq. 4, and

$$\begin{split} A^1\_{ij} &= -\int\_{\mathcal{Q}} \mathbb{E}(\text{div}\,\Psi) \, [A^\*\_{per}]\_{ij} + \int\_{\mathcal{Q}} (\mathbf{e}\_i + \nabla w^0\_{per,\mathbf{e}\_i})^T A\_{per} \, \mathbf{e}\_j \, \mathbb{E}(\text{div}\,\Psi) \\ &+ \int\_{\mathcal{Q}} \left(\nabla \overline{w}^1\_{\mathbf{e}\_i} - \mathbb{E}(\nabla \Psi) \nabla w^0\_{per,\mathbf{e}\_i}\right)^T A\_{per} \, \mathbf{e}\_j, \end{split}$$

we then have

$$A^\* = A^\*\_{per} + \eta A^1 + O(\eta^2). \tag{18}$$

For $\eta$ sufficiently small, depending on the accuracy expected, the approach therefore provides a computational strategy to *approximately* compute the homogenized tensor that bypasses the classical random problem and only involves (a sequence of) *deterministic*, periodic problems.
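The flavor of the expansion Eq. 18 can be checked in a one-dimensional analogue, where the homogenized coefficient is the harmonic mean and the first-order term is available in closed form. The sketch below uses hypothetical profiles for the unperturbed coefficient and the perturbation (it illustrates the perturbative idea itself, not the deformation setting of Eq. 16), and verifies that the error of the first-order approximation decays like $\eta^2$:

```python
from math import cos, sin, pi

# Hypothetical 1-periodic coefficient and perturbation profile.
A_per = lambda y: 2.0 + cos(2.0 * pi * y)
psi   = lambda y: sin(2.0 * pi * y) ** 2

def mean(f, n=20000):
    # Midpoint average of f over one period [0, 1].
    return sum(f((k + 0.5) / n) for k in range(n)) / n

def A_star(eta):
    # 1D homogenized coefficient of the perturbed material:
    # the harmonic mean of A_per + eta * psi.
    return 1.0 / mean(lambda y: 1.0 / (A_per(y) + eta * psi(y)))

A0 = A_star(0.0)
# First-order coefficient, obtained by differentiating the harmonic mean:
# d/deta (int 1/(A + eta psi))^{-1} at eta = 0 equals A0^2 * int psi/A^2.
A1 = A0 ** 2 * mean(lambda y: psi(y) / A_per(y) ** 2)

for eta in (0.1, 0.01, 0.001):
    err = abs(A_star(eta) - (A0 + eta * A1))
    print(eta, err)  # err shrinks roughly like eta**2
```

Only deterministic, periodic-cell quantities enter the first-order formula, which is exactly the computational gain advertised above.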

#### *2.3 Rare but Possibly Large Random Perturbations*

The previous section has shown that a perturbative approach can be an interesting modeling and computational strategy when the structure of the material is random but "close" to a periodic structure. We now proceed in a similar direction by presenting an alternative perturbative approach, described in full detail in [3, 4]. We consider

$$A\_{\eta}(\mathbf{x}, \omega) = A\_{per}(\mathbf{x}) + b\_{\eta}(\mathbf{x}, \omega) \, C\_{per}(\mathbf{x}), \tag{19}$$

instead of a coefficient $A\_{per}(\Phi^{-1}(\cdot, \omega))$ with $\Phi$ of the form of Eq. 16. In Eq. 19, $A\_{per}$ is again a periodic matrix modeling the unperturbed material, $C\_{per}$ is a periodic matrix modeling the perturbation, and $b\_{\eta}(\cdot, \omega)$ is a random field that is, in some sense, small. Consider then the case

$$b\_{\eta}(\mathbf{x}, \omega) = \sum\_{\mathbf{k} \in \mathbb{Z}^d} \mathbf{1}\_{\{Q + \mathbf{k}\}}(\mathbf{x}) B\_{\eta}^k(\omega), \tag{20}$$

where the $B^{\mathbf{k}}\_{\eta}$ are, say, independent identically distributed random variables. One particularly interesting case (see [3, 4] for this case and others) is when the common law of the $B^{\mathbf{k}}\_{\eta}$ is a Bernoulli law of parameter $\eta$ (see Fig. 2).

We now explain *formally* our approach. The mathematical correctness of the approach has been established in the works [23, 40].

To start with, we notice that in the corrector problem

$$-\operatorname{div}\left[A\_{\eta}\left(\mathbf{y},\omega\right)\left(\mathbf{p}+\nabla w\_{\mathbf{p}}\left(\mathbf{y},\omega\right)\right)\right]=\mathbf{0},\tag{21}$$

the only source of randomness comes from the coefficient $A\_{\eta}(\mathbf{y}, \omega)$. Therefore, in principle, if one knows the law of this coefficient $A\_{\eta}$, one knows the law of the corrector function $w\_{\mathbf{p}}(\mathbf{y}, \omega)$ and may therefore compute the homogenized coefficient $A^\*\_{\eta}$, the latter being a function of this law. When the law of $A\_{\eta}$ is an expansion in terms of a small coefficient, so is the law of $w\_{\mathbf{p}}$. Consequently, $A^\*\_{\eta}$ must be attainable using an expansion.

Heuristically, on the cube $Q\_N$ and at order 1 in $\eta$, the probability of seeing the perfect periodic material (entirely modeled by the matrix $A\_{per}$) is $(1 - \eta)^{N^d} \approx 1 - N^d \eta + O(\eta^2)$, while the probability of seeing the unperturbed material on all cells except one (where the material has matrix $A\_{per} + C\_{per}$) is $N^d (1 - \eta)^{N^d - 1}\eta \approx N^d \eta + O(\eta^2)$. All other configurations, with at least two cells perturbed, contribute at orders $\eta^2$ or higher. This gives the intuition (indeed confirmed by a mathematical proof) that the first-order correction comes from the difference between the material perfectly periodic except on one cell and the perfect material itself: $A^\*\_{\eta} = A^\*\_{per} + \eta A^{1,\*} + o(\eta)$, where $A^\*\_{per}$ is the homogenized matrix for the unperturbed periodic material and

$$A^{1,\*}\,\mathbf{e}\_i = \lim\_{N \to +\infty} \int\_{\mathcal{Q}\_N} \left[ (A\_{per} + \mathbf{1}\_{\mathcal{Q}} C\_{per}) (\nabla w\_{\mathbf{e}\_i}^N + \mathbf{e}\_i) - A\_{per} (\nabla w\_{per,\mathbf{e}\_i} + \mathbf{e}\_i) \right], \tag{22}$$

where $w^N\_{\mathbf{e}\_i}$ solves

$$-\operatorname{div}\left( (A\_{per} + \mathbf{1}\_{\mathcal{Q}} C\_{per}) (\mathbf{e}\_i + \nabla w\_{\mathbf{e}\_i}^N) \right) = 0 \quad \text{in} \quad \mathcal{Q}\_N, \quad w\_{\mathbf{e}\_i}^N \text{ is } \mathcal{Q}\_N\text{-periodic}. \tag{23}$$

Note that the integral appearing on the right-hand side of Eq. 22 is *not* normalized: it *a priori* scales as the volume $N^d$ of $Q\_N$ and has a finite limit only because of cancellation effects between the two terms in the integrand.

This perturbative approach has been extensively tested. It has been observed that the large-$N$ limit is already accurately approximated for moderate values of $N$. As in the previous section (Sect. 2.2), the computational efficiency of the approach is clear: solving the two periodic problems with coefficients $A\_{per}$ and $A\_{per} + \mathbf{1}\_{\mathcal{Q}}C\_{per}$ for a limited size $N$ is much less expensive than solving the original random corrector problem for a much larger size $N$. When the second-order term is needed, configurations with two defects have to be computed. They can all be seen as a family of PDEs parameterized by the geometrical locations of the defects (see again Fig. 2). Reduced basis techniques have been shown to allow for a definite speed-up in this computation, *see* [33].
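A one-dimensional caricature of the Bernoulli-defect model of Eqs. 19-20 makes the first-order correction explicit: with per-cell coefficient $a$, independently replaced by $a + c$ with probability $\eta$, the homogenized coefficient is the harmonic mean $\left((1-\eta)/a + \eta/(a+c)\right)^{-1}$, whose first-order term in $\eta$ is $ac/(a+c)$. A short sketch with toy values:

```python
# 1D caricature of Eqs. 19-20: each unit cell carries coefficient a,
# independently replaced by a + c with probability eta (Bernoulli defects).
# In 1D the homogenized coefficient is the harmonic mean, so it is explicit.
a, c = 1.0, 3.0

def A_star(eta):
    # (E[1/A])^{-1} for the Bernoulli(eta) defect model.
    return 1.0 / ((1.0 - eta) / a + eta / (a + c))

A_star_per = A_star(0.0)                   # unperturbed value, = a
A_one = a * a * (1.0 / a - 1.0 / (a + c))  # first-order term, = a*c/(a+c)

for eta in (0.1, 0.01):
    print(eta, A_star(eta), A_star_per + eta * A_one)
```

The first-order term involves only the unperturbed material and a single perturbed cell, mirroring the one-defect computation of Eqs. 22-23.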

On an abstract level, we note that, in the proposed approach for the "weakly" random regime, the determination of the homogenized tensor for a material containing defects at random locations is reduced to a set of computations of the solutions to corrector problems such as Eq. 23 for materials with defects at *some* particular deterministic locations. This naturally establishes a methodological link with our next section, where we indeed consider materials with *deterministic* defects. The link is actually more than methodological: the theoretical results of Sect. 3, establishing that the corrector problems with deterministic defects are uniquely solvable in a suitable class of functions, are readily useful in the random setting for the foundation of the approach described here in Sect. 2.

#### **3 Deterministic Defects Within an Otherwise Periodic Structure**

We return to the generic multi-scale diffusion equation Eq. 1. Under quite general and mild assumptions on the (possibly matrix-valued) diffusion coefficient $A^{\varepsilon}$ (which need not be of the form $A^{\varepsilon} = A\_{per}(x/\varepsilon)$ or obey any structural assumption of that type), presumably varying at the tiny scale $\varepsilon$, the equation admits a homogenized limit, which is indeed of the same form as Eq. 1, namely Eq. 5. Celebrated results along these lines are due to S. Spagnolo, E. De Giorgi and L. Tartar and their respective collaborators, *see* [42]. The strength of such results is their generality. They are obtained by a compactness argument. Schematically, the sequence of inverse operators $[-\operatorname{div}(A^{\varepsilon}\nabla \cdot)]^{-1}$ is (weakly) compact in the suitable topology, converges up to an extraction, and its limit can be proven to be an operator of the same type, namely $[-\operatorname{div}(A^{\*}\nabla \cdot)]^{-1}$. On the other hand, and precisely because of this generality, not much is known about the limit $A^\*$. This contrasts with periodic homogenization, which is both *explicit* (the limit coefficient $A^\*$ is known by a formula, namely Eq. 4, in terms of the, also known, corrector) and *precise* (the rate of convergence of $u^{\varepsilon}$ to $u^\*$ is known for a large variety of norms). Besides their theoretical interest *per se*, these two combined ingredients allow for envisioning, in practice, a numerical approach for the computation of the homogenized limit, certified by a numerical analysis that guarantees a control of the numerical error committed, as a function of $\varepsilon$ and the discretization parameters.

**Fig. 3 Localized defects in a periodic structure**. Some periodic cells in the center of the domain are perturbed. The error $u^{\varepsilon} - u^{\varepsilon,1}$ is displayed when calculating $u^{\varepsilon,1}$ using (left) the periodic corrector $w\_{per,\mathbf{p}}$ solution to Eq. 3 and (right) the adjusted corrector $w\_{\mathbf{p}}$ solution to Eq. 24. In the former case, the size of the committed error is almost a "defect detector". In the latter case, the error is homogeneous throughout the domain, recovering the quality of the approximation of the unperturbed periodic case. *Source* [12]

The question arises to find settings sufficiently general that still allow for the quality of results of the periodic setting. The past decade has witnessed several mathematical endeavors in this direction. We describe one such endeavor here and give a prototypical example of such a setting, illustrating the novelty of the mathematical questions involved (Fig. 3).

Consider Eq. 1 and assume that $A^\varepsilon = A(\cdot/\varepsilon)$, where the coefficient $A$ models a periodic material perturbed by a localized defect. Mathematically, this setting may be encoded as $A = A\_{per} + \tilde{A}$ with $\tilde{A} \in L^p(\mathbb{R}^d)$ for some $p < +\infty$. Clearly, the presence of this defect does not affect the *macroscopic* behavior, that is, the homogenized equation holds with the *same* homogenized coefficient $A^*$, which actually depends only on averages of $A$ over large, asymptotically infinite volumes, for which the addition of a function such as $\tilde{A}$ does not matter. On the other hand, when it comes to making this limit more precise, one intuitively realizes, zooming in locally on the material, that the corrector equation describing the *microscopic* response of the material reads as

$$-\operatorname{div}(A(\mathbf{e}\_i + \nabla w\_{\mathbf{e}\_i})) = 0. \tag{24}$$

This equation is different from Eq. 3 and, in sharp contrast with Eq. 3 (and similarly to what we observed for Eq. 9 in the random setting), *does not reduce* to an equation set on a bounded domain with periodic boundary conditions. Note that, for the particular choice $\tilde{A} = \mathbf{1}\_Q\, C\_{per}$, Eq. 23 is a particular instance of Eq. 24 when $N = +\infty$. In essence, Eq. 24 is posed on the entire ambient space $\mathbb{R}^d$, a reflection of the fact that, at the microscopic scale, the defect has broken the periodicity of the environment: the local response is affected by the defect and depends on the state of the *whole* microscopic structure. A considerable mathematical difficulty follows. The classical toolbox for the study of the well-posedness of (here linear) equations on bounded domains (the Lax-Milgram Lemma in the coercive case, the Fredholm Alternative, etc.), that is, all the techniques that one way or another rely upon the boundedness of the domain or the compactness of the setting, is now ineffective. Should $A$ be random stationary, then Eq. 24 would read as Eq. 9 and admit an equivalent formulation on the abstract probability space. This would compensate for the lack of compactness, but other significant complications would arise. For Eq. 24, the difficulty must be embraced. A related difficulty is to define the set of admissible functions for solutions, or the variational space in an energetic formulation of the problem. In the specific case $A = A\_{per} + \tilde{A}$ with $\tilde{A} \in L^p(\mathbb{R}^d)$, one seeks the solution to Eq. 24 in the form $w\_{\mathbf{e}\_i} = w\_{per,\mathbf{e}\_i} + \tilde{w}\_{\mathbf{e}\_i}$, that is, *with reference to* the periodic solution $w\_{per,\mathbf{e}\_i}$, somewhat echoing what we achieved in Sect. 2.3. Equation 24 then rewrites as

$$-\text{div}\,(A\,\nabla\tilde{w}\_{\mathbf{e}\_i}) = \text{div}\,(\tilde{f})\,,$$

where ˜*<sup>f</sup>* <sup>∈</sup> *<sup>L</sup> <sup>p</sup>*(R*<sup>d</sup>* ), which, by homogeneity, suggests that the suitable functional space for ∇ ˜<sup>w</sup> is *<sup>L</sup> <sup>p</sup>*(R*<sup>d</sup>* ). The question then arises to know whether the operator [∇] [div(*A* ∇ .)] <sup>−</sup><sup>1</sup> [div] acts continuously in *<sup>L</sup> <sup>p</sup>*(R*<sup>d</sup>* ). The answer depends on the properties of the coefficient *A*. In the present setting, it is positive for all 1 < *p* < +∞. The theoretical analysis to reach this conclusion heavily relies upon the celebrated works [5–7] by M. Avellaneda and F. H. Lin for the periodic case (see also [30, 41]).

The consideration of the one-dimensional version of the problem clearly shows (this particular example is worked out in [12]) that when one considers the specific corrector $w$ solution to

$$-\frac{d}{dy}\left[(a\_{per}+\tilde{a})(y)\left(1+\frac{d}{dy}w(y)\right)\right]=0,$$

instead of the periodic corrector $w\_{per}$ solution to

$$-\frac{d}{dy}\left[a\_{per}(y)\left(1+\frac{d}{dy}w\_{per}(y)\right)\right]=0,$$

then the quality of the (two-scale, first-order) approximation of the solution $u^\varepsilon$ is immediately improved near the defect and at the scale of the defect.
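In one dimension the mechanism is transparent (we sketch it here under our own illustrative choices of coefficients; the example itself is worked out in [12]): the flux $a\,(1 + w')$ is constant, and requiring $w' - w'\_{per}$ to vanish at infinity forces this constant to equal the periodic homogenized coefficient $a^*$, so that the correction $\tilde{w}' = a^*(1/a - 1/a\_{per})$ is localized exactly where the defect $\tilde{a}$ sits:

```python
import numpy as np

# 1D corrector for a periodic coefficient with a localized defect:
#   a(y) = a_per(y) + a_til(y),  a_til localized around y = 0.
a_per = lambda y: 2.0 + np.sin(2.0 * np.pi * y)
a_til = lambda y: 1.5 * np.exp(-y**2)            # localized defect

y = np.linspace(-20.0, 20.0, 400_001)

# Periodic homogenized coefficient a* = harmonic mean of a_per (period 1).
z = (np.arange(10_000) + 0.5) / 10_000
a_star = 1.0 / np.mean(1.0 / a_per(z))

# -( a (1 + w') )' = 0 gives a (1 + w') = const; matching the periodic
# corrector at infinity fixes the constant to a*.  Hence:
w_prime     = a_star / (a_per(y) + a_til(y)) - 1.0   # defect corrector
w_per_prime = a_star / a_per(y) - 1.0                # periodic corrector
diff = w_prime - w_per_prime                         # derivative of w-tilde

near = np.abs(y) < 2.0
far  = np.abs(y) > 10.0
print(np.max(np.abs(diff[near])))   # O(1) near the defect
print(np.max(np.abs(diff[far])))    # negligible far away
```

The correction inherits the decay of $\tilde{a}$, which is the 1D shadow of the $L^p$ theory described above.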

In dimensions two and higher, the proof is more difficult. Under appropriate conditions, the solution $u^\varepsilon$ is well approximated in $H^1$ norm, both at scale one and at scale ε (thus in particular in $L^\infty$ norm), by the first-order expansion

$$u^{\varepsilon,1}(x) = u^\*(x) + \varepsilon \sum\_{i=1}^{d} \partial\_{x\_i} u^\*(x)\, w\_{\mathbf{e}\_i}(x/\varepsilon)$$

constructed using the specific correctors $w\_{\mathbf{e}\_i}$. The latter approximation property does not in general hold true for the periodic first-order approximation

$$u^{\varepsilon,1}\_{per}(x) = u^\*(x) + \varepsilon \sum\_{i=1}^{d} \partial\_{x\_i} u^\*(x)\, w\_{per,\mathbf{e}\_i}(x/\varepsilon)$$

constructed using the periodic correctors $w\_{per,\mathbf{e}\_i}$. One may even make precise the rate of convergence as a function of the small parameter ε, and likewise prove similar convergence in various Sobolev or Hölder norms. The proof of these convergences was first presented in the case $p = 2$ (and in a slightly formal manner) in [12]. All results and extensions are carried out in a series of works [9, 10, 13–15].

The procedure above is not restricted to the linear diffusion problem Eq. 1. One may consider semi-linear equations, quasi-linear equations, systems, etc., and of course everything becomes all the more delicate as the complexity of the equation increases. One such example, namely a Hamilton-Jacobi equation, is the subject of [19] and also of work in progress by the author and his collaborators, see [16, 20, 28].

Various other cases of defects may be considered for homogenization problems that are otherwise "simple". They may formally decay at infinity (like the "localized" functions $\tilde{A}$ manipulated above), or not. In the former case, the problem at infinity (that is, the problem obtained upon translating the equation far away from the defect) is identical to the underlying periodic problem. In the latter case, the situation may depend sensitively upon what the problem "at infinity" looks like. There may even exist several such problems. Another prototypical example is related to the modeling of *grain boundaries* in materials science: two different periodic structures are connected across an interface. The defect is, say, a plane separating the two structures, and at large distances from this interface, different periodic structures are present, depending upon which side of the interface is considered, see [13]. The corresponding mathematical problem is theoretically challenging and practically relevant. In all cases, the purpose is to identify the homogenized, macroscopic limit, while at the same time retaining some of the microscopic features that make the problem relevant.

#### **4 Multi-scale Finite Element Approaches and Nonperiodicity**

Multi-scale Finite Element Methods, abbreviated as MsFEM, have proved to be efficient in a number of contexts. In essence, these approaches are based upon choosing, as the specific finite-dimensional basis in which to expand the numerical solution, a set of functions that are themselves solutions to a highly oscillatory *local* problem, at scale ε, involving the differential operator present in the original equation. This problem-dependent basis set, precomputed in an offline stage, is likely to better encode the fine-scale oscillations of the solution and therefore makes it possible to capture the solution more accurately. Numerical observations along with mathematical arguments show that this is indeed generically the case. The versatility of classical FEM is lost, but MsFEM restores its efficiency for multi-scale problems.

The standard version of the approach was originally introduced by T. Hou and his collaborators (see the textbook [24] for a general introduction). There exist many variants of such a multi-scale approach, within the formalism of MsFEM or beyond it, and many outstanding numerical analysts and computational scientists have contributed to the field. Classical examples include the Variational Multi-scale Method introduced by Hughes et al., the Local Orthogonal Decomposition method of Målqvist and Peterseim, the localization and subspace decomposition method of R. Kornhuber and H. Yserentant, etc. It is not our purpose here to review all these works. We concentrate instead on an issue that is intrinsically related to the context of our discussion, namely the breaking of the periodic structure of a material and its consequence on the accuracy of a dedicated numerical approach.

We recall, on the prototypical multi-scale diffusion problem Eq. 1, that the MsFEM approach, in one of its simplest variants, consists of the following three steps:

1. Introduce a discretization of *D* with a coarse mesh; throughout this article, we work with the P1 Finite Element space

$$V\_H = \text{Span}\left\{\phi\_i^0, \ 1 \le i \le N\_{V\_H}\right\} \subset H\_0^1(\mathcal{D}). \tag{25}$$

2. Solve the local problems (one for each basis function for the coarse mesh)

$$-\operatorname{div}\left(A\_{\varepsilon}\nabla\psi\_{i}^{\varepsilon,\mathbf{K}}\right) = 0 \quad \text{in } \mathbf{K}, \qquad \psi\_{i}^{\varepsilon,\mathbf{K}} = \phi\_{i}^{0} \quad \text{on } \partial\mathbf{K}, \tag{26}$$

on each element **K** of the coarse mesh $\mathcal{T}\_H$, in order to build the multi-scale basis functions. This is typically performed off-line, using a fine mesh $\mathcal{T}\_h$, with $h \ll H$.

3. Apply a standard Galerkin approximation of Eq. 1 on the space

$$\text{Span}\left\{\psi\_i^{\varepsilon}, \ 1 \le i \le N\_{V\_H}\right\} \subset H\_0^1(\mathcal{D}),\tag{27}$$

where $\psi^\varepsilon\_i$ is such that $\psi^\varepsilon\_i\big|\_{\mathbf{K}} = \psi^{\varepsilon,\mathbf{K}}\_i$ for all $\mathbf{K} \in \mathcal{T}\_H$.
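In one dimension, the three steps above can be carried out in closed form, since the local problem Eq. 26 has a constant flux $a\_\varepsilon\psi'$ on each element, so the MsFEM element stiffness reduces to the harmonic average of $a\_\varepsilon$ over the element, whereas standard P1 effectively sees its arithmetic average. The following toy sketch (our own illustration with arbitrarily chosen coefficient, meshes, and a simplified load, not the code of the references) compares the two on the same coarse mesh:

```python
import numpy as np

# Toy 1D MsFEM sketch for -(a_eps u')' = 1 on (0,1), u(0) = u(1) = 0.
eps = 0.02
a = lambda x: 2.0 + np.sin(2.0 * np.pi * x / eps)

# --- reference solution: in 1D, a u' = c - x, with c fixed by u(1) = 0 ---
Nf = 200_000
xf = (np.arange(Nf) + 0.5) / Nf                  # fine midpoints
c = np.sum(xf / a(xf)) / np.sum(1.0 / a(xf))
u_ref = np.concatenate([[0.0], np.cumsum((c - xf) / a(xf)) / Nf])
x_ref = np.arange(Nf + 1) / Nf

M, H = 10, 0.1                                   # coarse mesh
nodes = np.arange(M + 1) * H

def solve_coarse(k):        # tridiagonal Galerkin solve, k = element stiffness
    A = np.zeros((M - 1, M - 1))
    for i in range(M - 1):
        A[i, i] = k[i] + k[i + 1]
        if i > 0:
            A[i, i - 1] = A[i - 1, i] = -k[i]
    b = np.full(M - 1, H)   # load for f = 1 (coarse hats, a simplification)
    return np.concatenate([[0.0], np.linalg.solve(A, b), [0.0]])

m = (np.arange(2000) + 0.5) / 2000               # per-element quadrature
k_ms, k_p1 = np.zeros(M), np.zeros(M)
for j in range(M):
    xq = nodes[j] + H * m
    k_ms[j] = 1.0 / (np.mean(1.0 / a(xq)) * H)   # MsFEM: harmonic average
    k_p1[j] = np.mean(a(xq)) / H                 # standard P1: arithmetic

u_ms, u_p1 = solve_coarse(k_ms), solve_coarse(k_p1)
u_ex = np.interp(nodes, x_ref, u_ref)
err_ms = np.max(np.abs(u_ms - u_ex))
err_p1 = np.max(np.abs(u_p1 - u_ex))
print(f"nodal error, MsFEM: {err_ms:.2e}   standard P1: {err_p1:.2e}")
```

The oscillation is resolved only inside the local solves, yet the MsFEM nodal error is far below that of standard P1, which converges to the wrong (arithmetically averaged) effective problem.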

The error analysis of this MsFEM method has been performed for $A^\varepsilon = A\_{per}(\cdot/\varepsilon)$ with $A\_{per}$ a fixed periodic matrix. Assuming that the basis functions are perfectly determined (that is, $h = 0$), the main error estimate, under the usual regularity assumptions on the data and the mesh, reads

$$\|u^{\varepsilon} - u\_H^{\varepsilon}\|\_{H^1(\mathcal{D})} \le C \left( H + \sqrt{\varepsilon} + \sqrt{\frac{\varepsilon}{H}} \right), \tag{28}$$

where *C* is a constant independent of *H* and ε.

When the coarse mesh size $H$ is close to the scale ε, a so-called resonance phenomenon, encoded in the term $\sqrt{\varepsilon/H}$ in Eq. 28, occurs and deteriorates the numerical solution. The oversampling method is a popular technique to reduce this effect. In short, the approach, which is non-conforming, consists in setting each local problem on a domain slightly larger than the actual element **K** considered, so as to become less sensitive to the arbitrary choice of boundary conditions on that larger domain, and then truncating the functions obtained to the element. This approach significantly improves the results compared to using linear boundary conditions as in Eq. 26. In the periodic case, the following estimate holds

$$\|u^{\varepsilon} - u\_H^{\varepsilon}\|\_{H^1(\mathcal{T}\_H)} \le C \left( H + \sqrt{\varepsilon} + \frac{\varepsilon}{H} \right),$$

where $\|u^{\varepsilon} - u^{\varepsilon}\_H\|\_{H^1(\mathcal{T}\_H)} = \sqrt{\sum\_{\mathbf{K}\in\mathcal{T}\_H} \|u^{\varepsilon} - u^{\varepsilon}\_H\|^2\_{H^1(\mathbf{K})}}$ is the $H^1$ broken norm of $u^{\varepsilon} - u^{\varepsilon}\_H$.

The boundary conditions imposed on ∂**K** in Eq. 26 are the so-called linear boundary conditions. Besides the linear boundary conditions and the oversampling technique we have just mentioned, there are many other possible boundary conditions for the local problems. They may give rise to conforming or non-conforming approximations. The choice strongly affects the overall accuracy. Informally, the whole history of improvements of the original version of MsFEM can be revisited as the history of improvements in the choice of suitable "boundary conditions" for Eq. 26.

The question of how much the choice of boundary conditions for the local problems Eq. 26 alters the overall accuracy is all the more crucial in the context of nonperiodic structures. A prototypical case of the difficulty is that of perforated materials. Consider the Poisson problem set on a domain with perforations of size ε. For a generic mesh, the edges (or, alternately, the facets in a three-dimensional setting) of the mesh may intersect the perforations. It is intuitive that difficulties then arise, since the *postulated* behavior (linear or otherwise) of the basis functions along the edges has little chance of accurately capturing the actual behavior of the exact solution, given the perforations. Of course, one may use oversampling in order to circumvent this difficulty, but then the approach is non-conforming and other difficulties arise, besides the increased computational cost. Alternatively, one may consider meshing the domain in such a way that the edges intersect as few perforations as possible. For a periodic array of perforations, this is a decent solution. But in a non-periodic setting, and this is all the more true for a fully disordered array of perforations, this is impractical. A possible option, introduced in [34] and extended in [35, 38, 39] and other subsequent works by different authors, is to resort to "*weak*" boundary conditions, in the form of Crouzeix-Raviart boundary conditions. The Dirichlet boundary conditions on ∂**K** in Eq. 26 are then replaced by conditions of the type

$$\begin{aligned} \int\_{\text{edge}} \psi\_i^{\varepsilon, \mathbf{K}} &= 0 \quad \text{or} \quad 1, \\\ n\_{\text{edge}} \cdot A\_{\varepsilon} \nabla \psi\_i^{\varepsilon, \mathbf{K}} &= \text{Constant} \end{aligned}$$

on all edges, where the local function $\psi^{\varepsilon,\mathbf{K}}\_i$ is now associated with an edge $i$. For this approach, under technical assumptions, the error estimate is identical to that for linear boundary conditions, namely Eq. 28.

More importantly, upon using such "weak" boundary conditions in the context of a perforated computational domain (and adding other generic ingredients, such as bubble functions), the accuracy, if not improved, is now significantly more robust with respect to intersections between edges and perforations. A "stress test" considering two extreme scenarios illustrates this property: see in [35] the detailed comparison of the results obtained with the MsFEM method and different boundary conditions for the local problems for the shifted meshes of Fig. 4.

Let us conclude this section by emphasizing the formal link between the existence results for the non-periodic corrector $w\_{\mathbf{p}}$ examined in the previous section and the actual local basis functions $\psi^{\varepsilon,\mathbf{K}}\_i$ of the MsFEM approaches discussed here. Up to irrelevant technicalities and details, the corrector and the local functions are, intrinsically, the same mathematical object: both are obtained by zooming in locally and solving the problem at the scale of its heterogeneities.

#### **5 Homogenization Under Partial Information**

One way or another, all the approaches described so far, at both the theoretical and the numerical level, rely on the full knowledge of the coefficient $A^\varepsilon$. It turns out that there are several *practical* contexts where such knowledge is incomplete, or sometimes merely unavailable. From an engineering perspective (think, e.g., of experiments in Mechanics), there are indeed numerous prototypical situations for Eq. 1 where the response $u^\varepsilon$ can be measured for some loadings $f$, but where $A^\varepsilon$ is not completely known, let alone whether it is periodic or not. In these situations, it is thus not possible to use homogenization theory, nor to proceed with any MsFEM-type approach or with the similar approaches mentioned above. Finding a pathway alternative to standard approaches is thus a practically relevant question. We are interested in approaches valid for the different regimes of ε, which make no use of the knowledge of the coefficient $A^\varepsilon$, but only use some responses of the medium obtained for certain given loadings. Questions similar in spirit were addressed two decades ago by Durlofsky. There too, the point is to define an effective coefficient using only outputs of the system. The approaches are however different in practice (see [36] for a detailed discussion).

For simplicity, we restrict ourselves to cases when Eq. 1 admits (possibly up to some extraction) a homogenized limit Eq. 5 where the homogenized matrix coefficient *A*<sup>∗</sup> is deterministic and constant. This restrictive assumption on the class of *A*<sup>∗</sup> (and thus on the structure of the coefficient *A*<sup>ε</sup> in Eq. 1) is useful for our theoretical justifications, but not mandatory for the approach to be applicable.

For any constant matrix $\overline{A}$, we consider generically the problem with constant coefficients

$$-\operatorname{div}\left(\overline{A}\,\nabla\overline{u}\right) = f.\tag{29}$$

**Fig. 4 Two extreme cases of meshes regarding intersections with the perforations:** no intersection at all (top), or as many intersections as possible (bottom). The Crouzeix-Raviart version of MsFEM is, roughly, equally accurate in both situations. *Source* [35]

We investigate, for any value of the parameter ε, how we may define a constant symmetric matrix $\overline{A}$ such that the solution $\overline{u}(\overline{A}, f) = \overline{u}$ to Eq. 29 with matrix $\overline{A}$ best approximates the solution to Eq. 1. The best constant matrix $\overline{A}$ is (temporarily) defined as a minimizer of

$$I\_{\varepsilon} = \inf\_{\text{constant matrix }\overline{A} > 0} \sup\_{\substack{f \in L^2(\mathcal{D}), \\\ \|f\|\_{L^2(\mathcal{D})} = 1}} \left\| u^{\varepsilon}(f) - \overline{u}(\overline{A}, f) \right\|\_{L^2(\mathcal{D})}^2,\qquad(30)$$

where we have explicitly emphasized the dependency upon the right-hand side $f$ of the solutions to Eq. 1 and Eq. 29. The norm in Eq. 30 is an $L^2$ norm (and not, e.g., an $H^1$ norm) because, for sufficiently small ε, we wish the best constant matrix $\overline{A}$ to be close to $A^*$, while $u^\varepsilon$ strongly converges to $u^*$ only in the $L^2$ norm and not in the $H^1$ norm. The key point is that Eq. 30 is based only on the knowledge of the outputs $u^\varepsilon$ (which could, e.g., be experimentally measured), and not on that of $A^\varepsilon$ itself. The theoretical study of the minimization problem Eq. 30 has been carried out in [36]. In particular, it has been proven that, under classical assumptions, the matrices $\overline{A}$ with energy asymptotically close to the infimum $I\_\varepsilon$ all converge to $A^*$ as ε vanishes. In passing, we note that the approach provides, at least in some settings, a characterization of the homogenized matrix which is an alternative to the standard characterization of homogenization theory. To the best of our knowledge, this characterization, although probably known, has never been made explicit in the literature.

In fact (and this does not alter the above theoretical results), the actual minimization problem we use in practice reads

$$I\_{\varepsilon}^{\text{prac}} = \inf\_{\text{constant matrix} \overline{A} > 0} \sup\_{\substack{f \in L^2(\mathcal{D}), \\ \|f\|\_{L^2(\mathcal{D})} = 1}} \left\| -\Delta^{-1} \left( -\text{div}\overline{A} \,\nabla u^{\varepsilon}(f) - f \right) \right\|\_{L^2(\mathcal{D})}^2,\tag{31}$$

where $\Delta^{-1}$ is the inverse Laplacian operator supplied with homogeneous Dirichlet boundary conditions. The function minimized in Eq. 31 is related to that of Eq. 30 through the application, inside the $L^2$ norm of the latter, of the zero-order differential operator $\Delta^{-1}\operatorname{div}(\overline{A}\,\nabla\cdot)$. Note that, in sharp contrast with Eq. 30, the function to minimize in Eq. 31 is now, formally, a second-order polynomial in $\overline{A}$. This property significantly speeds up the computation of the infimum. The specific choice Eq. 31 was suggested to us by Albert Cohen.

Note also that, in practice, we cannot maximize over all right-hand sides $f$ in $L^2(\mathcal{D})$ (with unit norm), and we therefore replace the supremum by a maximization over a finite-dimensional set of thoughtfully selected right-hand sides.
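In one dimension, the quadratic structure of Eq. 31 can be made completely explicit: for a constant scalar $\bar{a}$ and homogeneous Dirichlet conditions, $\Delta^{-1}\big((u^\varepsilon)''\big) = u^\varepsilon$, so the functional reduces to $\sum\_f \|\bar{a}\,u^\varepsilon(f) + \Delta^{-1}f\|^2\_{L^2}$, minimized in closed form. The toy sketch below (our own illustration, not the code of [36, 37]; coefficient and right-hand sides are arbitrary choices) first synthesizes the outputs $u^\varepsilon(f)$, then identifies $\bar{a}$ using those outputs only:

```python
import numpy as np

# Toy 1D identification: recover a constant effective coefficient from
# the outputs u_eps(f) alone (a* = sqrt(3) for this coefficient).
eps, N = 0.02, 20_000
h = 1.0 / N
x = np.arange(1, N) / N                              # interior nodes
a = lambda t: 2.0 + np.sin(2.0 * np.pi * t / eps)    # oscillatory coefficient

def solve_tridiag(lo, dg, up, rhs):                  # Thomas algorithm
    n = len(dg); dg, rhs = dg.copy(), rhs.copy()
    for i in range(1, n):
        w = lo[i - 1] / dg[i - 1]
        dg[i] -= w * up[i - 1]
        rhs[i] -= w * rhs[i - 1]
    u = np.empty(n); u[-1] = rhs[-1] / dg[-1]
    for i in range(n - 2, -1, -1):
        u[i] = (rhs[i] - up[i] * u[i + 1]) / dg[i]
    return u

am = a((np.arange(N) + 0.5) * h)                     # a at cell midpoints

def u_eps(f):            # finite differences for -(a u')' = f, u(0)=u(1)=0
    return solve_tridiag(-am[1:-1], am[:-1] + am[1:], -am[1:-1], f * h * h)

def inv_lap(f):          # Delta^{-1} f with homogeneous Dirichlet conditions
    off = np.ones(N - 2)
    return solve_tridiag(off, -2.0 * np.ones(N - 1), off, f * h * h)

# The sup over f is replaced by a small family of selected right-hand sides.
fs = [np.ones(N - 1), np.sin(np.pi * x), x]
us = [u_eps(f) for f in fs]
vs = [inv_lap(f) for f in fs]

# The functional is sum_f || abar * u_eps(f) + inv_lap(f) ||^2, a quadratic
# polynomial in abar, hence a closed-form minimizer:
abar = -sum(np.dot(u, v) for u, v in zip(us, vs)) / sum(np.dot(u, u) for u in us)
print(f"identified abar = {abar:.4f}, homogenized a* = {np.sqrt(3.0):.4f}")
```

Even though the synthetic data are generated with the oscillatory coefficient, the identification step touches only $u^\varepsilon(f)$ and $f$, which is precisely the point of the approach.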

In [36, 37], we have presented a series of numerical experiments using the above approach. Our tests have established that the approach is in particular able to accurately identify the homogenized matrix $A^*$ in the periodic case (with a computational time that is much larger than that of the classical approach, but this is not the point). More importantly, it is also able to complete this task in the random case (where the classical approach can be prohibitively expensive). Finally, since no particular structure of the coefficient $A^\varepsilon$ is used, it may be applied to a large variety of non-periodic structures.

**Fig. 5 Homogenization approach within an Arlequin-type coupling**: The fine-scale highly oscillatory model and the coarse-grained model (tentatively identical to the homogenized model) co-exist in an overlap region. The three regions described in the body of our text are displayed, along with the fine and coarse meshes. *Source* [27]

A remark is in order: in both the periodic and the random homogenization cases, the classical approach computes the homogenized coefficients by first approximating the corrector function. A fair comparison between the approaches can therefore only be achieved if the above approach also provides some approximation of the corrector function. This is indeed the case: the latter function can also be obtained in our approach, at a reduced additional computational cost, as demonstrated in [36].

A variant of the above approach, originally introduced in [22], is currently under investigation in [27]. The purpose of this variant is also to approximate $A^*$ without explicitly using $A^\varepsilon$, and to achieve this in a robust, engineering-type manner. In a nutshell, the approach consists in considering a domain divided into three regions, see Fig. 5. The inner region and the outer region respectively contain only the oscillatory model of Eq. 1 and the tentative homogenized model of Eq. 29. In between these two regions, an overlap region where both models co-exist is used for a smooth coupling. Specifically, the coupling is performed using an Arlequin-type approach (see again [22]), but this is not mandatory for the approach to work. A linear Dirichlet boundary condition, say $u = x\_1$, is imposed on the external surface of the domain. It intuitively plays the role of the right-hand side function $f$ in Eq. 31. For fixed, presumably small, ε, one then solves the minimization problem

$$J\_{\varepsilon} = \inf\_{\text{constant matrix }\overline{A} > 0} \left\| \nabla\left(u(\overline{A}) - x\_1\right) \right\|\_{L^2(\mathcal{D})}^2. \tag{32}$$

In the limit of vanishing ε, it is established that $J\_\varepsilon$ also vanishes and that the minimizer satisfies $\overline{A}\,\mathbf{e}\_1 = A^*\,\mathbf{e}\_1$, where $\mathbf{e}\_1 = \nabla(x\_1)$ is the first canonical vector of the ambient space $\mathbb{R}^d$. Repeating this procedure along each dimension of $\mathbb{R}^d$ eventually allows one to identify the matrix $A^*$. Several computational improvements of the original approach are introduced in [27]. A numerical analysis is also presented.

**Acknowledgements** The author wishes to thank his many collaborators on the topics overviewed in the present article, and in particular: Xavier Blanc, Pierre Cardaliaguet, Olga Gorynina, Rémi Goudey, Ulrich Hetmaniuk, Frédéric Legoll, Pierre-Louis Lions, Alexei Lozinski, Panagiotis Souganidis, Sylvain Wolf. The research of the author is partially supported by ONR under grant N00014-20-1-2691 and by EOARD under grant FA8655-20-1-7043.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Hyperbolic Model Reduction for Kinetic Equations**

**Zhenning Cai, Yuwei Fan, and Ruo Li**

**Abstract** We give a brief historical review of moment model reduction for kinetic equations, in particular Grad's moment method for the Boltzmann equation. We focus on the hyperbolicity of the reduced model, which is essential for the existence of its classical solution as a Cauchy problem. We then introduce the theory of the framework we developed in the past years, which preserves the hyperbolic nature of kinetic equations with great universality. Some of the latest progress on the comparison between models with and without hyperbolicity is presented to validate the hyperbolic moment models for rarefied gases.

#### **1 Historical Overview**

The moment methods are a general class of modeling methodologies for kinetic equations. We would like to start this paper with a historical review of this topic. However, due to the huge number of references, a thorough overview would be lengthy and tedious. Therefore, in this section, we restrict ourselves to the methods related to the hyperbolicity of moment models. Even so, our review in the following paragraphs does not exhaust the contributions made over the course of this history.

According to Sir J. H. Jeans [29], the kinetic picture of a gas is "a crowd of molecules, each moving on its own independent path, entirely uncontrolled by forces from the other molecules, although its path may be abruptly altered as regards both speed and direction, whenever it collides with another molecule or strikes the boundary of the containing vessel." In order to describe the evolution of non-equilibrium gases using the phase-space distribution function, the Boltzmann equation was proposed [1] as a non-linear seven-dimensional partial differential equation. The independent variables of the distribution function include the time, the spatial coordinates, and the velocity.

Z. Cai
Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076, Singapore
e-mail: matcz@nus.edu.sg

Y. Fan
Department of Mathematics, Stanford University, Stanford, CA 94305, USA
e-mail: ywfan@stanford.edu

R. Li (B)
CAPT, LMAM and School of Mathematical Sciences, Peking University, Beijing 100871, People's Republic of China
e-mail: rli@math.pku.edu.cn

In most cases, the full Boltzmann equation cannot be solved even numerically. One has to characterize the motion of the gas by resorting to various approximation methods describing the evolution of macroscopic quantities. One successful way to find approximate solutions is the Chapman-Enskog method [15, 18], which uses a power series expansion around the Maxwellian to describe slightly non-equilibrium gases. The method assumes that the distribution function can be approximated up to any precision using only equilibrium variables and their derivatives. Alternatively, Grad's moment method [24] was developed in the late 1940s. In this method, transport equations for macroscopic averages are obtained by taking velocity moments of the Boltzmann equation. The difficulty of this method is that the governing equations for the components of the *n*th velocity moment also depend on components of the (*n* + 1)th moment. Therefore, one has to use a certain closure relation to obtain a closed system after the truncation.
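The closure difficulty is visible already for the collisionless transport equation $\partial\_t f + v\,\partial\_x f = 0$ in one space dimension: the moments $M\_n = \int v^n f\,dv$ satisfy $\partial\_t M\_n + \partial\_x M\_{n+1} = 0$, so the equation for the $n$th moment always involves the $(n+1)$th. A small numerical check of this fact (our own illustration, with an arbitrary initial distribution):

```python
import numpy as np

# Free transport df/dt + v df/dx = 0 has exact solution f(t,x,v) = f0(x - v t, v).
# Its velocity moments M_n = int v^n f dv satisfy  d/dt M_n + d/dx M_{n+1} = 0:
# the n-th equation involves the (n+1)-th moment -- the closure problem.
f0 = lambda x, v: np.exp(-x**2) * np.exp(-v**2 / 2.0) * (1.0 + 0.3 * v)

x = np.linspace(-8.0, 8.0, 801)[:, None]     # spatial grid (column)
v = np.linspace(-8.0, 8.0, 801)[None, :]     # velocity grid (row)
dx = x[1, 0] - x[0, 0]
dv = v[0, 1] - v[0, 0]

def moment(n, t):
    return np.sum(v**n * f0(x - v * t, v), axis=1) * dv   # M_n(t, x)

dt = 1e-4
rels = []
for n in range(4):
    dMn_dt = (moment(n, dt) - moment(n, -dt)) / (2.0 * dt)   # d/dt M_n at t = 0
    dMnp1_dx = np.gradient(moment(n + 1, 0.0), dx)           # d/dx M_{n+1}
    res = np.max(np.abs(dMn_dt + dMnp1_dx))
    rels.append(res / np.max(np.abs(dMnp1_dx)))
    print(f"n = {n}: relative residual {rels[-1]:.1e}")
```

The residuals are at the level of the finite-difference error, confirming that no finite set of moment equations closes by itself.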

Among the models given by Grad's method [24], Grad's 13-moment system is the most basic one beyond the Navier-Stokes equations, as any of Grad's models with fewer moments includes neither the stress tensor nor heat transfer. In [23], it was commented that Grad's moment method could be regarded as mathematically equivalent to the Chapman-Enskog method in certain cases. Thus the derivation of Grad's 13-moment system can be regarded as an application of perturbation theory to the Boltzmann equation around the equilibrium. It is therefore natural to hope that the 13-moment system is valid in the vicinity of equilibrium, although it was not expected to be valid far away from the equilibrium distribution [25]. However, due to its complex mathematical expression, it is not even easy to check whether the system is hyperbolic, as pointed out in [2]. As late as 1993, it was eventually verified in [35, 36] that the 1D reduction of Grad's 13-moment equations is hyperbolic around the equilibrium.

In 1958, Grad wrote the article "Principles of the kinetic theory of gases" in the Encyclopedia of Physics [26], where he classified his own method among the "more practical expansion techniques". However, successful applications of the 13-moment system were hardly seen in the two decades following Grad's classical paper of 1949, as mentioned in the comments by Cercignani [14]. One possible reason was found by Grad himself in [25], where it was pointed out that there may be unphysical sub-shocks in a shock profile for Mach numbers greater than a critical value. However, the appearance of sub-shocks gives no hint of the underlying reason why Grad's moment method does not work for slow flows. Nevertheless, Grad's moment method was still proclaimed to "open a new era in gas kinetic theory" [27].

In our paper [5], it was found, astonishingly, that in the 3D case the equilibrium is NOT an interior point of the hyperbolicity region of Grad's 13-moment model. Consequently, even if the distribution function is arbitrarily close to the local equilibrium, the local existence of the solution of the 13-moment system, as a Cauchy problem for a first-order quasi-linear partial differential system without analytic data, cannot be guaranteed. The defects of the 13-moment model due to the lack of hyperbolicity had never before been recognized as so severe a problem. The absence of hyperbolicity around local equilibrium is a candidate reason to explain the overall failure of Grad's moment method.

After being discovered, the lack of hyperbolicity is well accepted as a deficiency of Grad's moment method, which makes the application of the moment method severely restricted. "There has been persistent efforts to impose hyperbolicity on Grad's moment closure by various regularizations" [39], and lots of progress has been made in the past decades. For example, Levermore investigated the maximum entropy method and showed in [33] that the moment system obtained with such a method possesses global hyperbolicity. Unfortunately, it is difficult to put it into practice due to the lack of a finite analytical expression, and the equilibrium lies on the boundary of the realizability domain for any moment system containing heat flux [30]. Based on Levermore's 14-moment closure, an affordable 14-moment closure is proposed in [34] as an approximation, which extends the hyperbolicity region to a great extent. Let us mention that actually in [5], we also derived a 13-moment system with hyperbolicity around the equilibrium.

Gaining hyperbolicity even around the equilibrium looks highly non-trivial, but the situation changed not long ago. Besides the achievement of local hyperbolicity around the equilibrium, the study of globally hyperbolic moment systems with large numbers of moments was also very successful in the past years. In the 1D case, with both spatial and velocity variables being scalar, a globally hyperbolic moment system was derived in [3] by regularization. Motivated by this work, another type of globally hyperbolic moment system was then derived in [31] using a different strategy. The model in [3] is obtained by modifying only the last equation, while the model in [31] revises only the last two equations of Grad's original system. The characteristic fields of these models (genuine nonlinearity, linear degeneracy, and some properties of shocks, contact discontinuities, and rarefaction waves) can be fully clarified, which shows that the wave structures are formally a natural extension of those of the Euler equations.

In [4], the regularization method of [3] was extended to multi-dimensional cases. Here the word "multi-dimensional" means that the dimensions of the spatial coordinates and the velocity are arbitrary positive integers and may differ. Multi-dimensional models with global hyperbolicity, based on a Hermite expansion of the distribution function up to any degree, were systematically proposed in [4]. The wave speeds and the characteristic fields can be clarified, too. Later on, a multi-dimensional model with global hyperbolicity for an anisotropic weight function was derived in [20].

Achieving global hyperbolicity was definitely encouraging, yet how the regularization worked in the aforementioned cases remained a mystery to us. In particular, the method cannot be applied to moment systems based on a spherical harmonic expansion of the distribution function, such as Grad's 13-moment system. As we pointed out, hyperbolicity is essential for a moment model, while it is hard to obtain by a direct moment expansion of kinetic equations. To overcome this problem, we developed in [6] a systematic framework to perform moment model reduction that preserves global hyperbolicity. The framework works not only for models based on Hermite expansions of the distribution function in the Boltzmann equation, but also for an arbitrary ansatz of the distribution function. In fact, the framework even applies to kinetic equations in a fairly general form.

The framework developed in [6] was further presented in the language of projection operators in [19], where the underlying mechanism of how the hyperbolicity is preserved during the model reduction procedure was further clarified. This is the basic idea of our discussion in the next section.

#### **2 Theoretical Framework**

In this section, we briefly review the framework of [19] to construct globally hyperbolic moment systems from kinetic equations, as well as its variants and some further developments. To make the statements precise, we first give the definition of hyperbolicity:

**Definition 1** The first-order system of equations

$$\frac{\partial \boldsymbol{w}}{\partial t} + \sum_{d=1}^{D} \mathbf{A}_d(\boldsymbol{w}) \frac{\partial \boldsymbol{w}}{\partial x_d} = 0, \quad \boldsymbol{w} \in G$$

is hyperbolic at $\boldsymbol{w}_0$ if, for any unit vector $\boldsymbol{n} \in \mathbb{R}^D$, the matrix $\sum_{d=1}^{D} n_d \mathbf{A}_d(\boldsymbol{w}_0)$ is real diagonalizable; the system is called globally hyperbolic if it is hyperbolic for any $\boldsymbol{w} \in G$.

Based on this definition, the analysis of the hyperbolicity of moment systems reduces to a problem of linear algebra: the analysis of the real diagonalizability of the coefficient matrices. Without knowing the exact values of the matrix entries, the real diagonalizability of a matrix has to be established through sufficient conditions, such as:

**Condition 1** *All its eigenvalues are real and it has n linearly independent eigenvectors.*

**Condition 2** *All the eigenvalues of the matrix are real and distinct.*

**Condition 3** *The matrix is symmetric or similar to a symmetric matrix.*
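These conditions can also be checked numerically for a given coefficient matrix. Below is a minimal Python sketch (the helper name and tolerances are our own choices) that tests Condition 1 directly, together with two small matrices illustrating the symmetric case (Condition 3) and a non-diagonalizable Jordan block:

```python
import numpy as np

def is_real_diagonalizable(A, tol=1e-10):
    """Numerically test Condition 1: all eigenvalues real and a full set
    of linearly independent eigenvectors (illustrative helper)."""
    eigvals, eigvecs = np.linalg.eig(A)
    if np.max(np.abs(np.imag(eigvals))) > tol:
        return False                       # complex eigenvalues
    # diagonalizable <=> the eigenvector matrix has full rank
    return np.linalg.matrix_rank(eigvecs, tol=1e-8) == A.shape[0]

# A symmetric matrix always passes (this is Condition 3) ...
S = np.array([[2.0, 1.0], [1.0, 3.0]])
# ... while a Jordan block has real eigenvalues but only one eigenvector.
J = np.array([[1.0, 1.0], [0.0, 1.0]])
```

Such a numerical test is only indicative for matrices with nearly coinciding eigenvalues, which is precisely why analytic sufficient conditions are preferred in proofs.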

Grad [24] investigated the characteristic structure of the 1D reduction of his 13-moment system, whose hyperbolicity was further studied in [36] based on Condition 2. Afterwards, this condition was adopted in the proof of the hyperbolicity of the regularized moment system for the 1D case in [3]. It is worth noting that using Condition 2 usually requires computing the characteristic polynomial of the coefficient matrix of the moment system, which may be complicated or even impractical for large moment systems. Even if the characteristic polynomial is computed, showing that the eigenvalues are real and distinct is still highly nontrivial. This severely restricts the use of this condition in kinetic model reduction.

To study the hyperbolicity in multi-dimensional cases, we applied Condition 1 in [5] to show that Grad's 13-moment system loses hyperbolicity even in an arbitrarily small neighborhood of the equilibrium, and in [4] to prove the global hyperbolicity of the regularized moment system in the multi-dimensional case. Due to the requirement on the eigenvectors, both proofs based on Condition 1 are complicated and tedious. By contrast, it is much easier to check Condition 3, based on which Levermore provided a concise and clear proof of the hyperbolicity of the maximum entropy moment system in [33]. In [19], we re-studied the hyperbolicity of the regularized moment systems of [3, 4] based on Condition 3 and then generalized it into a framework. Below we start our discussion with a review of these hyperbolic moment systems.

#### *2.1 Review of Globally Hyperbolic Moment System*

Let us consider the Boltzmann equation:

$$
\frac{\partial f}{\partial t} + \sum_{d=1}^{D} v_d \frac{\partial f}{\partial x_d} = \mathcal{Q}(f),
\tag{1}
$$

and denote the *local equilibrium* by $f_{eq}$, which satisfies $\mathcal{Q}(f_{eq}) = 0$ and $f_{eq} > 0$. The key idea of Grad's moment method is to expand the distribution function as

$$f(t, \boldsymbol{x}, \boldsymbol{v}) = \sum_{|\alpha| \le M} f_{eq}(t, \boldsymbol{x}, \boldsymbol{v}) f_{\alpha}(t, \boldsymbol{x}) He_{\alpha}(t, \boldsymbol{x}, \boldsymbol{v}) = \sum_{|\alpha| \le M} f_{\alpha}(t, \boldsymbol{x}) \mathcal{H}_{\alpha}(t, \boldsymbol{x}, \boldsymbol{v}) \tag{2}$$

for a given integer $M \ge 2$, where for the multi-dimensional index $\alpha \in \mathbb{N}^D$, $|\alpha| = \sum_{d=1}^{D} \alpha_d$, and the basis function $\mathcal{H}_\alpha$ is defined by $\mathcal{H}_\alpha = f_{eq} He_\alpha$, with $He_\alpha$ being the orthonormal polynomials of $\boldsymbol{v}$ with weight function $f_{eq}$. When $f_{eq}$ is the local Maxwellian, $He_\alpha$ can be obtained by translation and scaling of Hermite polynomials. Grad's moment system can then be obtained by substituting the expansion into the Boltzmann equation and matching the coefficients of $\mathcal{H}_\alpha$ with $|\alpha| \le M$. To describe this procedure clearly, we assume that the distribution function $f$ is defined on a space $\mathbb{H}$ spanned by the basis functions $\mathcal{H}_\alpha$ for all $\alpha \in \mathbb{N}^D$, and we let $\mathbb{H}_M := \operatorname{span}\{\mathcal{H}_\alpha : |\alpha| \le M\}$ be the subspace for our model reduction. Then one can introduce the projection from $\mathbb{H}$ to $\mathbb{H}_M$ as

$$\mathcal{P}f = \sum\_{|\alpha| \le M} f\_{\alpha} \mathcal{H}\_{\alpha} \text{ with } f\_{\alpha} = \langle f, \mathcal{H}\_{\alpha} \rangle,\tag{3}$$

where the inner product is defined as $\langle f, g \rangle = \int_{\mathbb{R}^D} f g / f_{eq} \, \mathrm{d}\boldsymbol{v}$. The projection exactly reproduces Grad's expansion (2) and provides a tool to study the operators on the space $\mathbb{H}_M$. For example, matching the coefficients of the basis $\mathcal{H}_\alpha$ with $|\alpha| \le M$ can be understood as projecting the system onto the space $\mathbb{H}_M$. Hence, Grad's moment system can be written as

$$\mathcal{P}\frac{\partial \mathcal{P}f}{\partial t} + \sum_{d=1}^{D} \mathcal{P}v_d \frac{\partial \mathcal{P}f}{\partial x_d} = \mathcal{P}\mathcal{Q}(\mathcal{P}f). \tag{4}$$
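To make the projection (3) concrete, the following sketch computes the coefficients $f_\alpha = \langle f, \mathcal{H}_\alpha \rangle$ in the 1D case, assuming the weight $f_{eq}$ is the standardized Maxwellian ($\rho = 1$, $u = 0$, $\theta = 1$), for which the orthonormal polynomials are the probabilists' Hermite polynomials divided by $\sqrt{n!}$. The function name and quadrature order are our own illustrative choices:

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt

def projection_coefficients(f, M, nquad=60):
    """Coefficients f_alpha = <f, H_alpha> of the projection (3), 1D case.

    Assumes the weight f_eq is the standardized Maxwellian (rho = 1,
    u = 0, theta = 1), whose orthonormal polynomials are He_n / sqrt(n!).
    The integrals are evaluated by Gauss-Hermite_e quadrature.
    """
    x, w = He.hermegauss(nquad)   # nodes/weights for the weight exp(-v^2/2)
    coeffs = []
    for n in range(M + 1):
        He_n = He.hermeval(x, [0.0] * n + [1.0]) / sqrt(factorial(n))
        # f_alpha = int f(v) He_n(v) dv
        #         = int [f(v) exp(v^2/2)] He_n(v) exp(-v^2/2) dv
        coeffs.append(np.sum(w * f(x) * np.exp(x**2 / 2) * He_n))
    return np.array(coeffs)

# Example: a shifted Maxwellian, for which f_0 = 1 (density) and
# f_1 = 0.5 (momentum).
f = lambda v: np.exp(-(v - 0.5)**2 / 2) / np.sqrt(2 * np.pi)
```

For a general local Maxwellian one would first translate and scale the velocity variable, as noted above.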

Let $\mathcal{H}$ be the vector whose components are all the basis functions $\mathcal{H}_\alpha$ with $|\alpha| \le M$, listed in a given order. Since $\mathcal{P}f$ is a function in $\mathbb{H}_M$, one can collect all the independent variables in $\mathcal{P}f$ into a vector $\boldsymbol{w}$ whose length equals the dimension of $\mathbb{H}_M$. Thanks to the definition of the projection operator $\mathcal{P}$, there exist square matrices $\mathbf{D}$ and $\mathbf{B}_d$, $d = 1, \dots, D$, such that

$$\mathcal{P}\frac{\partial \mathcal{P}f}{\partial t} = \mathcal{H}^T \mathbf{D} \frac{\partial \mathbf{w}}{\partial t}, \quad \mathcal{P}v\_d \frac{\partial \mathcal{P}f}{\partial \mathbf{x}\_d} = \mathcal{H}^T \mathbf{B}\_d \frac{\partial \mathbf{w}}{\partial \mathbf{x}\_d}. \tag{5}$$

Accordingly, letting $\boldsymbol{Q}$ be the vector such that $\mathcal{P}\mathcal{Q}(\mathcal{P}f) = \mathcal{H}^T \boldsymbol{Q}$, one can rewrite Grad's moment system as

$$\mathbf{D}\frac{\partial \boldsymbol{w}}{\partial t} + \sum_{d=1}^{D} \mathbf{B}_d \frac{\partial \boldsymbol{w}}{\partial x_d} = \boldsymbol{Q}.\tag{6}$$

Actually, the system (6) is the vector form of (4) in $\mathbb{H}_M$ with respect to the basis $\mathcal{H}_\alpha$. By comparing these equations, we have the following correspondences:

$$\mathbf{w} \leftrightarrow \mathcal{P}f, \quad \mathbf{D}\frac{\partial}{\partial t} \leftrightarrow \mathcal{P}\frac{\partial}{\partial t}, \quad \mathbf{B}\_d \frac{\partial}{\partial \mathbf{x}\_d} \leftrightarrow \mathcal{P}v\_d \frac{\partial}{\partial \mathbf{x}\_d}, \quad \mathbf{Q} \leftrightarrow \mathcal{P}\mathcal{Q}(\mathcal{P}f). \tag{7}$$

Furthermore, the procedure to derive Grad's moment system can be diagrammed as in Fig. 1a. It was noticed in [19] that the time derivative and the spatial derivative are treated differently in this process: a projection operator is applied directly to the time derivative, while for the spatial derivative, the projection operator appears only after multiplication by the velocity $v$. This difference causes the loss of hyperbolicity. Based on this observation, we drew a key conclusion in [19]: one should add a projection operator right in front of the spatial derivative to regain hyperbolicity, as illustrated in Fig. 1b. The corresponding moment system is

**Fig. 1** Diagram of the procedure of Grad's and regularized moment system

$$\mathcal{P}\frac{\partial \mathcal{P}f}{\partial t} + \sum\_{d=1}^{D} \mathcal{P}v\_d \mathcal{P}\frac{\partial \mathcal{P}f}{\partial \mathbf{x}\_d} = \mathcal{P}\mathcal{Q}(\mathcal{P}f),\tag{8}$$

where the additional projection operator is labeled in red. Using (5), one can claim that there exist the square matrices **M***<sup>d</sup>* , *d* = 1,..., *D* such that

$$\mathcal{P}v\_d \mathcal{P} \frac{\partial \mathcal{P}f}{\partial \mathbf{x}\_d} = \mathcal{H}^T \mathbf{M}\_d \mathbf{D} \frac{\partial \mathbf{w}}{\partial \mathbf{x}\_d},\tag{9}$$

and obtain the vector form of the regularized moment system as

$$\mathbf{D}\frac{\partial \mathbf{w}}{\partial t} + \sum\_{d=1}^{D} \mathbf{M}\_d \mathbf{D} \frac{\partial \mathbf{w}}{\partial x\_d} = \mathbf{Q}.\tag{10}$$

Similar to (7), we have one more correspondence:

$$\mathbf{M}\_d \leftrightarrow \mathcal{P}v\_d,\tag{11}$$

that is to say, the matrices $\mathbf{M}_d$ are the representations of the operators $\mathcal{P}v_d$ on $\mathbb{H}_M$. It is not difficult to check that the matrices $\mathbf{M}_d$ are symmetric due to the orthonormality of the basis $\mathcal{H}_\alpha$, so that any linear combination of the matrices $\mathbf{M}_d$ is real diagonalizable. One can also check that the matrix $\mathbf{D}$ is invertible. Hence $\mathbf{D}^{-1}\mathbf{M}_d\mathbf{D}$ is similar to $\mathbf{M}_d$, so that the system (10) is globally hyperbolic. Moreover, if one multiplies both sides of (10) by $\mathbf{D}^T$, the resulting system

$$\mathbf{D}^T \mathbf{D} \frac{\partial \boldsymbol{w}}{\partial t} + \sum_{d=1}^D \mathbf{D}^T \mathbf{M}_d \mathbf{D} \frac{\partial \boldsymbol{w}}{\partial x_d} = \mathbf{D}^T \boldsymbol{Q} \tag{12}$$

turns out to be a symmetric hyperbolic system of balance laws.
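In the 1D Hermite case, the symmetry of $\mathbf{M}_d$ can be made completely explicit: the recurrence $v\,He_n = He_{n+1} + n\,He_{n-1}$ implies that, in the orthonormal basis, the truncated multiplication operator $\mathcal{P}v\mathcal{P}$ is a symmetric tridiagonal matrix. A minimal sketch of this construction (for the standardized Maxwellian weight; the function name is ours):

```python
import numpy as np

def multiplication_matrix(M):
    """Matrix of P v P on H_M in the 1D case (standardized Maxwellian
    weight, orthonormal Hermite basis). The recurrence
    v He_n = He_{n+1} + n He_{n-1} gives, after orthonormalization,
    a symmetric tridiagonal matrix with off-diagonal entries sqrt(n+1)."""
    off = np.sqrt(np.arange(1.0, M + 1.0))
    return np.diag(off, 1) + np.diag(off, -1)

Mmat = multiplication_matrix(4)       # (M+1) x (M+1), here 5 x 5
lam = np.linalg.eigvalsh(Mmat)        # symmetric => all eigenvalues real
```

By the Golub-Welsch theorem, the eigenvalues of this Jacobi-type matrix are the nodes of the $(M+1)$-point Gauss-Hermite quadrature, hence real and distinct, consistent with the global hyperbolicity of (10).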

#### *2.2 Hyperbolic Regularization Framework*

So far, the hyperbolicity of (10) has been proved using Condition 3. Looking back at the whole procedure, one finds that the key point of the hyperbolic regularization is the extra projection operator in front of the spatial differentiation operator in (8). Moreover, the underlying mechanism for obtaining hyperbolicity can be extended to much more general cases. For example, the radiative transfer equation has the form

$$\begin{aligned} \frac{\partial f(t, \boldsymbol{x}, \theta, \varphi)}{\partial t} + \boldsymbol{\xi}(\theta, \varphi) \cdot \nabla_{\boldsymbol{x}} f(t, \boldsymbol{x}, \theta, \varphi) &= Q(f)(t, \boldsymbol{x}, \theta, \varphi), \\ \boldsymbol{x} \in \mathbb{R}^3, \quad \theta \in [0, \pi), \quad \varphi \in [0, 2\pi), \end{aligned}$$

where the velocity is given by $\boldsymbol{\xi}(\theta, \varphi) = (\sin\theta\cos\varphi, \sin\theta\sin\varphi, \cos\theta)^T$. To derive reduced models, one can replace the local equilibrium $f_{eq}$ in (2) by a nonnegative weight function $\omega$, and correspondingly, the orthogonal polynomials $He_\alpha$ should be replaced by the orthogonal basis functions $\phi_\alpha$ of the $L^2$ space weighted by $\omega$, so that the basis functions $\mathcal{H}_\alpha$ become $\Phi_\alpha := \omega \phi_\alpha$. By letting $\mathbb{H}_M := \operatorname{span}\{\Phi_\alpha : |\alpha| \le M\}$, one can similarly define the projection operator $\mathcal{P}$ as in (3). As an extension of the globally hyperbolic moment system, we obtain

$$\mathcal{P}\frac{\partial\mathcal{P}f}{\partial t} + \sum\_{d=1}^{D} \mathcal{P}\xi\_d(\theta,\varphi)\mathcal{P}\frac{\partial\mathcal{P}f}{\partial x\_d} = \mathcal{P}\mathcal{Q}(\mathcal{P}f). \tag{13}$$

Again, if the corresponding matrix **D** as in (6) is invertible, the resulting moment system is globally hyperbolic. We refer the readers to [6, 19, 21] for more details of such applications in radiative transfer equations.

This framework provides a concise and clear procedure for deriving hyperbolic moment systems from a broad range of kinetic equations. It has been applied in many fields, including anisotropic hyperbolic moment systems for the Boltzmann equation [20], semiconductor device simulation [7], plasma simulation [11], density functional theory [8], quantum gas kinetic theory [16], and the rarefied relativistic Boltzmann equation [32].

#### *2.3 Further Progress*

The above framework provides an approach to handling the hyperbolicity of moment systems. However, hyperbolicity is not the only property of concern. It is often required to preserve hyperbolicity and other properties at the same time in model reduction. Below we list some recent attempts in this direction.

One of the interesting properties is to recover the asymptotic limits of the kinetic equations. For example, the first-order asymptotic hydrodynamic limit of the Boltzmann equation is the Navier-Stokes equations, and therefore it is desirable that the moment equations preserve this limit. For the classical Boltzmann equation, most moment systems automatically preserve the Navier-Stokes limit if the stress tensor and heat flux are included. However, for the quantum Boltzmann equation, the equilibrium has a very special form, so that the moment system derived directly from the framework, taking the equilibrium as the weight function, violates the Navier-Stokes limit [16]. In this case, the authors of [16] proposed a method called *local linearization* to regularize the moment system. Specifically, we assume the Grad-type system has the form (6) and define $\hat{\mathbf{M}}_d(\boldsymbol{w}) = \mathbf{B}_d(\boldsymbol{w})\mathbf{D}(\boldsymbol{w})^{-1}$. In the regularization, the matrix $\hat{\mathbf{M}}_d(\boldsymbol{w})$ is replaced by $\mathbf{M}_d := \hat{\mathbf{M}}_d(\boldsymbol{w}_{eq})$, with $\boldsymbol{w}_{eq}$ being the local equilibrium of the state $\boldsymbol{w}$. Such a method allows us to acquire both hyperbolicity and the Navier-Stokes limit simultaneously. The symmetry of $\mathbf{M}_d$ is thereby lost, so that one has to use Condition 1 to prove hyperbolicity.
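The algebra of the local linearization can be sketched abstractly. In the snippet below, `B` and `D` are toy stand-in callables, not taken from any concrete moment system; the point is only that the quasi-linear coefficient matrix of the regularized system, $\mathbf{D}(\boldsymbol{w})^{-1}\hat{\mathbf{M}}_d(\boldsymbol{w}_{eq})\mathbf{D}(\boldsymbol{w})$, is similar to $\hat{\mathbf{M}}_d(\boldsymbol{w}_{eq})$, so its spectrum does not depend on how far $\boldsymbol{w}$ is from equilibrium:

```python
import numpy as np

# Toy stand-ins for D(w) and B_d(w); NOT from any concrete moment
# system, only shaped so that D(w) is invertible.
D = lambda w: np.diag([1.0, w[0]])
B = lambda w: np.array([[0.0, 1.0], [w[0], 0.0]])

def regularized_B(B, D, w, w_eq):
    """Local linearization (sketch): replace Mhat_d(w) = B_d(w) D(w)^{-1}
    by its value at the local equilibrium w_eq, so the regularized
    system reads D(w) w_t + [Mhat_d(w_eq) D(w)] w_x = Q."""
    M_eq = B(w_eq) @ np.linalg.inv(D(w_eq))
    return M_eq @ D(w)
```

Hyperbolicity of the regularized system therefore reduces to the real diagonalizability of $\hat{\mathbf{M}}_d(\boldsymbol{w}_{eq})$, which is where Condition 1 enters.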

Another relevant work is the nonlinear moment system for the radiative transfer equation in [21, 22]. In order to retain the diffusion limit (the analogue of the Navier-Stokes limit for the Boltzmann equation), the authors pointed out that the projection operators in (13) at different places do not have to be the same, and revised (13) to

$$
\tilde{\mathcal{P}}\frac{\partial \mathcal{P}f}{\partial t} + \sum\_{d=1}^{D} \tilde{\mathcal{P}}\xi\_d(\theta, \varphi) \tilde{\mathcal{P}}\frac{\partial \mathcal{P}f}{\partial x\_d} = \tilde{\mathcal{P}}\mathcal{Q}(\mathcal{P}f). \tag{14}
$$

The operators*<sup>P</sup>* and*P*˜ are orthogonal projections onto different subspaces of <sup>H</sup>. By a careful choice of the subspace for the operator *P*˜, the diffusion limit can be achieved, and meanwhile, the symmetry of**M** corresponding to that in (10) is preserved, leading again to global hyperbolicity. This generalization has broadened the application the hyperbolic regularization framework and also permits us to take more properties of the kinetic equations into account.

Besides the hyperbolicity of the convection term, one may also be interested in the well-posedness of the complete moment system including the collision term. One related property is Yong's first stability condition [38], which comprises constraints on the convection term, the collision term, and the coupling of the two. This stability condition was shown to be critical for the existence of solutions in [37]. In [17], the authors studied multiple Grad-type moment systems and confirmed that all of them satisfy Yong's first stability condition.

Under this concise and flexible framework, one may wonder what is sacrificed for the hyperbolicity. By writing out the equations, one immediately observes that the balance-law form is destroyed by the hyperbolic regularization. A natural question is: how should discontinuities in the solution be defined? More generally, one may ask: what is the effect of such a regularization on the accuracy of the model? In the following section, we provide some clues using numerical experiments.

#### **3 Numerical Validation**

The application of the framework in gas kinetic theory has been investigated in a number of works [3, 9, 10, 12], where many one- and two-dimensional examples were studied numerically to show the validity of hyperbolic moment equations. However, these globally hyperbolic models, as an improvement of Grad's original models, have never been compared with Grad's models in terms of modeling accuracy. The only direct comparison in the literature is in [10], where, for a shock tube problem with a density ratio of 7.0, the simulation of Grad's moment equations breaks down while the corresponding hyperbolic moment equations remain stable. Without running the same problem in a regime where both models work and comparing the results, one could question whether accuracy is lost when fixing the hyperbolicity. Such doubt may arise since the globally hyperbolic models can be considered a partial linearization of Grad's models around the local Maxwellian.

In this section, we make such a straightforward comparison using the same numerical examples for both methods. For simplicity, we only consider one-dimensional physics, for which both $x$ and $v$ are scalars. In this case, the characteristic polynomial of the Jacobian of the flux function has an explicit formula [3], so that the hyperbolicity of Grad's equations can be easily checked. The underlying kinetic equation used in our tests is the Boltzmann-BGK equation with a constant relaxation time

$$\frac{\partial f}{\partial t} + v \frac{\partial f}{\partial x} = \frac{1}{Kn}(f\_{eq} - f). \tag{15}$$

The ansatz for the distribution function is given by (3), so that (4) stands for Grad's moment system and (8) for the hyperbolic moment system. Below we use two benchmark tests to show the performance of both types of models. In general, both Grad's moment equations and the hyperbolic moment equations are solved by a first-order finite volume method with the local Lax-Friedrichs numerical flux. Time splitting is applied to solve the advection part and the collision part separately, and for each part the forward Euler method is applied. The CFL condition is used to determine the time step, with the Courant number chosen as 0.9. For Grad's moment method, the maximum characteristic speed is obtained by finding the roots of the characteristic polynomial of the Jacobian, whose explicit expression is given in [3]. For the hyperbolic moment method, the maximum characteristic speeds have been computed in [3]. The explicit form of the hyperbolic moment system (given in [3]) shows that its last equation contains a non-conservative product, which is discretized by a central difference. In all the numerical examples, the number of grid cells is 1000 unless otherwise specified. A convergence test shows that for smooth solutions, this resolution provides solutions sufficiently close to those on a much finer grid, so that the difference is invisible to the naked eye. When exhibiting the numerical results, we mainly focus on the equilibrium variables, namely the density $\rho$, velocity $u$, and temperature $\theta$, defined by

$$\begin{aligned} \rho(t,x) &= \int\_{\mathbb{R}} f(t,x,v) \, \mathrm{d}v, \\ u(t,x) &= \frac{1}{\rho(t,x)} \int\_{\mathbb{R}} v f(t,x,v) \, \mathrm{d}v, \\ \theta(t,x) &= \frac{1}{\rho(t,x)} \int\_{\mathbb{R}} [v - u(t,x)]^2 f(t,x,v) \, \mathrm{d}v. \end{aligned}$$
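These moments can be evaluated from a discretized distribution function by simple quadrature. A minimal sketch on a uniform velocity grid (the grid bounds, resolution, and function name are our own choices):

```python
import numpy as np

def equilibrium_variables(f, v):
    """Density, velocity and temperature of a 1D distribution sampled
    on a uniform velocity grid v (midpoint-type rule for the integrals)."""
    dv = v[1] - v[0]
    rho = np.sum(f) * dv
    u = np.sum(v * f) * dv / rho
    theta = np.sum((v - u)**2 * f) * dv / rho
    return rho, u, theta

# Sanity check on a Maxwellian with rho = 2, u = 0.3, theta = 1.5:
v = np.linspace(-10.0, 10.0, 2001)
f = 2.0 / np.sqrt(2 * np.pi * 1.5) * np.exp(-(v - 0.3)**2 / (2 * 1.5))
```

The velocity grid must be wide enough that the truncated Gaussian tails are negligible; otherwise the recovered moments are biased.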

#### *3.1 Shock Structure*

The structure of plane shock waves is frequently used as a benchmark in gas kinetic theory. It shows that the physical shock, which appears as a discontinuity in the Euler equations, is actually a smooth transition from one state to another. The computational domain is (−∞, +∞), so no boundary condition is involved, and the initial data are

$$f(0, x, v) = \begin{cases} \dfrac{\rho_l}{\sqrt{2\pi\theta_l}} \exp\left(-\dfrac{(v - u_l)^2}{2\theta_l}\right), & \text{if } x < 0, \\[2ex] \dfrac{\rho_r}{\sqrt{2\pi\theta_r}} \exp\left(-\dfrac{(v - u_r)^2}{2\theta_r}\right), & \text{if } x > 0, \end{cases} \tag{16}$$

where all the equilibrium variables are determined by the Mach number *Ma*:

$$\rho_l = 1, \quad u_l = \sqrt{3}Ma, \quad \theta_l = 1, \qquad \rho_r = \frac{2Ma^2}{Ma^2 + 1}, \quad u_r = \frac{\sqrt{3}Ma}{\rho_r}, \quad \theta_r = \frac{3Ma^2 - 1}{2\rho_r}.$$
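These states can be evaluated directly from the Mach number; a small sketch (the helper name is ours):

```python
from math import sqrt

def shock_states(Ma):
    """Left/right equilibrium states of the plane shock,
    parameterized by the Mach number as in (16)."""
    rho_l, u_l, theta_l = 1.0, sqrt(3.0) * Ma, 1.0
    rho_r = 2.0 * Ma**2 / (Ma**2 + 1.0)
    u_r = sqrt(3.0) * Ma / rho_r
    theta_r = (3.0 * Ma**2 - 1.0) / (2.0 * rho_r)
    return (rho_l, u_l, theta_l), (rho_r, u_r, theta_r)
```

Note that the mass flux $\rho u$ is the same on both sides, and for $Ma = 2$ the downstream temperature is $\theta_r = 55/16$, the value discussed in Sect. 3.1.5.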

**Fig. 2** Left: The comparison of shock structures of two solutions with Mach number 1.4 and $M = 4$. Right: The green area is the hyperbolicity region (horizontal axis: $\hat{f}_{M-1}$, vertical axis: $\hat{f}_M$), and the red loop is the parametric curve $(\hat{f}_{M-1}, \hat{f}_M)$ with parameter $x$

We are interested in the steady-state of this problem. Since the parameter *Kn* only introduces a uniform spatial scaling, it does not affect the shock structure. Therefore we simply set it to be 1. Numerically, we set the computational domain to be [−30, 30]. The boundary condition is provided by the ghost-cell method, and the distribution functions on the ghost cells are set to be the two states defined in (16).
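The advection step of the scheme described above (first-order finite volume, local Lax-Friedrichs flux, forward Euler, Courant number 0.9, ghost cells at the domain ends) can be sketched generically. Here `flux` and `max_speed` are user-supplied callables standing in for the moment-system flux and a characteristic-speed bound; the conservative update below would still need the non-conservative correction of the last equation as well as the collision step of the splitting:

```python
import numpy as np

def llf_advection_step(w, flux, max_speed, dx, courant=0.9):
    """One forward-Euler step of a first-order finite-volume scheme with
    the local Lax-Friedrichs flux for w_t + F(w)_x = 0 (sketch).
    w: (n_cells, n_comp) cell averages, ghost cells at w[0] and w[-1];
    flux, max_speed: user-supplied F(w) and a bound on the wave speed."""
    s = np.array([max_speed(wi) for wi in w])     # per-cell speed bound
    dt = courant * dx / s.max()                   # CFL condition
    F = np.array([flux(wi) for wi in w])
    a = np.maximum(s[:-1], s[1:])[:, None]        # local interface speeds
    Fh = 0.5 * (F[:-1] + F[1:]) - 0.5 * a * (w[1:] - w[:-1])
    w_new = w.copy()
    w_new[1:-1] -= dt / dx * (Fh[1:] - Fh[:-1])   # update interior cells
    return w_new, dt
```

For the shock-structure problem, the ghost cells simply carry the two constant states of (16) throughout the computation.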

#### **3.1.1 Case 1:** *Ma* **= 1***.***4 and** *M* **= 4**

In this case, both Grad's system and the hyperbolic moment system work due to the relatively small Mach number. The numerical results are shown in Fig. 2. By convention, we plot the normalized density, velocity, and temperature defined by

$$
\bar{\rho}(\mathbf{x}) = \frac{\rho(\mathbf{x}) - \rho\_l}{\rho\_r - \rho\_l}, \quad \bar{u}(\mathbf{x}) = \frac{u(\mathbf{x}) - u\_r}{u\_l - u\_r}, \quad \bar{\theta}(\mathbf{x}) = \frac{\theta(\mathbf{x}) - \theta\_l}{\theta\_r - \theta\_l}.
$$

so that the values of all variables generally lie within the range [0, 1], unless a temperature overshoot is observed.

Figure 2b shows the hyperbolicity region of Grad's moment equations. It has been proven in [3] that for the one-dimensional physics, the hyperbolicity region can be characterized by the following two dimensionless quantities:

$$
\hat{f}\_{M-1} = \frac{f\_{M-1}}{\rho \theta^{(M-1)/2}}, \qquad \hat{f}\_M = \frac{f\_M}{\rho \theta^{M/2}},
$$

where *fM* and *fM*−<sup>1</sup> are the last two coefficients in the expansion (3). The red curve in Fig. 2b provides the trajectory of Grad's solution in this diagram. It can be seen that for such a small Mach number, the whole solution is well inside the hyperbolicity

**Fig. 3** Left: The comparison of shock structures of two solutions with Mach number 2.0 and $M = 4$. Right: The green area is the hyperbolicity region (horizontal axis: $\hat{f}_{M-1}$, vertical axis: $\hat{f}_M$), and the red loop is the parametric curve $(\hat{f}_{M-1}, \hat{f}_M)$ with parameter $x$

region, so that the simulation of Grad's moment equations is stable. Figure 2a shows that both methods provide smooth shock structures, and the predictions of all the equilibrium variables are similar. This example confirms the applicability of both systems in weakly non-equilibrium regimes. Note that for one-dimensional physics, Grad's equations do not suffer from the loss of hyperbolicity near equilibrium.

#### **3.1.2 Case 2:** *Ma* **= 2***.***0 and** *M* **= 4**

Now we increase the Mach number to introduce stronger non-equilibrium. The same plots are provided in Fig. 3. In this example, despite the numerical diffusion, discontinuities can be identified without difficulty in the numerical solutions. These discontinuities, also known as subshocks, appear due to the insufficient characteristic speed in front of the shock wave, meaning that both systems are insufficient to describe the physics. To capture these discontinuities, 8000 grid cells are used in the spatial discretization. This example shows significantly different shock structures predicted by the two methods. For Grad's moment equations, the subshock is located near $x = -7$, while for the hyperbolic moment equations, the subshock appears near $x = -5$. The wave structures also differ considerably. Focusing on the high-density region, we find that the solution of the hyperbolic moment equations is smoother, suggesting a possibly better description of the physics.

Here we remind the reader that the wave structure of the hyperbolic moment equations may depend on the numerical method, due to their non-conservative nature. The locations and strengths of the subshock may change when different shock conditions are used. However, we would argue that it is meaningless to justify any solution with subshocks for the hyperbolic moment equations, since a subshock is unphysical and should not appear in the solution of the Boltzmann equation. In practice, the appearance of discontinuous solutions indicates an inadequate truncation of the series, which motivates increasing $M$ to obtain more reliable solutions without subshocks.

Figure 3b shows that Grad's solution still lies within the hyperbolicity region, although the curve is already quite close to the boundary. This example shows that Grad's moment method may lose its validity even inside its hyperbolicity region.

#### **3.1.3 Case 3:** *Ma* **= 2***.***0 and** *M* **= 6**

Now we increase $M$ and carry out the simulation again for Mach number 2.0. The results are given in Fig. 4. While hoping that a larger $M$ would provide a better solution, we actually observe that Grad's moment equations lead to computational failure. The numerical solution just before the computation breaks down is plotted in Fig. 4a. Figure 4b clearly shows that this is caused by the loss of hyperbolicity. We believe that this implies the non-existence of the solution.

In contrast, the simulation of the hyperbolic moment equations is still stable. As expected, it provides a smooth shock structure and improves the result predicted with $M = 4$.

#### **3.1.4 Case 4:** *Ma* **= 1***.***7 and** *M* **= 6**

In this example, we decrease the Mach number so that the shock structure of Grad's equations can be computed. Figure 5a shows that the results of the two systems generally agree with each other, but the hyperbolic moment equations provide smoother solutions than Grad's system, so they are likely to be more accurate. Thus, despite the higher nonlinearity of Grad's system, it does not necessarily provide better solutions.

Interestingly, when looking at the phase diagram plotted in Fig. 5b, we see that Grad's solution has moved outside the hyperbolicity region. Why the solution is still stable remains to be studied. Here we conjecture that the collision term and the numerical diffusion help stabilize the numerical solution during the evolution, and that for the steady-state equations, solutions of non-hyperbolic equations may still exist. Nevertheless, all the above numerical tests show the superiority of the hyperbolic moment equations in both accuracy and stability.

#### **3.1.5 Case 5:** *Ma* **= 2***.***0 and** *M* **= 10**

In this example, we would like to show the failure of both systems for a larger $M$. In Fig. 6, we plot the results at $t = 0.8$, where both numerical solutions contain negative temperatures. The reason for this phenomenon has been explained in [28]: it lies in the divergence of the approximation (3) as $M$ tends to infinity. It is

**Fig. 5** Left: The comparison of shock structures of two solutions with Mach number 1.7 and $M = 6$. Right: The green area is the hyperbolicity region (horizontal axis: $\hat{f}_{M-1}$, vertical axis: $\hat{f}_M$), and the red loop is the parametric curve $(\hat{f}_{M-1}, \hat{f}_M)$ with parameter $x$

rigorously shown in [13] that when $\theta_r > 2\theta_l$, for the solution of the steady-state BGK equation, the limit of $\mathcal{P}f$ (see (3)) as $M \to \infty$ does not exist. Here, for $Ma = 2.0$, the temperature behind the shock wave is $\theta_r = 55/16 > 2 = 2\theta_l$. Thus for a large $M$, the divergence leads to a poor approximation of the distribution function, which is reflected as a negative temperature in the numerical results. This divergence issue is independent of the subshock and the hyperbolicity, and should be regarded as a defect of both systems. Work on fixing this issue is ongoing.

#### *3.2 Fourier Flow*

In this test, we are interested in the performance of both methods with wall boundary conditions. The fluid under consideration lies between two fully diffusive walls located at $x = -1/2$ and $x = 1/2$. For the Boltzmann-BGK equation (15), the boundary condition is

$$\begin{aligned} f(t, -1/2, v) &= \frac{\rho\_l}{\sqrt{2\pi\theta\_l}} \exp\left(-\frac{v^2}{2\theta\_l}\right), \qquad v > 0, \\\ f(t, 1/2, v) &= \frac{\rho\_r}{\sqrt{2\pi\theta\_r}} \exp\left(-\frac{v^2}{2\theta\_r}\right), \qquad v < 0, \end{aligned}$$

where $\theta_{l,r}$ stands for the temperatures of the walls, and $\rho_{l,r}$ is chosen such that

$$\int\_{\mathbb{R}} v f(t, \pm 1/2, v) \,\mathrm{d}v = 0.$$
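This zero-net-flux condition determines $\rho_{l,r}$ explicitly: the outgoing half-Maxwellian at the left wall carries mass flux $\rho_l \sqrt{\theta_l / (2\pi)}$, which must cancel the incoming flux of $f$. A minimal sketch on a uniform velocity grid (the discretization and helper name are our own choices):

```python
import numpy as np

def wall_density(f, v, theta_w):
    """Density rho_l of the half-Maxwellian emitted by the left wall,
    chosen so the net mass flux through the wall vanishes (sketch on a
    uniform velocity grid; f is the distribution arriving at the wall)."""
    dv = v[1] - v[0]
    incoming = np.sum(v[v < 0] * f[v < 0]) * dv   # flux into the wall (< 0)
    # outgoing flux of the wall Maxwellian: rho_l * sqrt(theta_w / (2*pi))
    return -incoming / np.sqrt(theta_w / (2 * np.pi))

# Sanity check: if the incoming flow is itself the wall Maxwellian with
# unit density, the emitted density must also be (approximately) 1.
v = np.linspace(-10.0, 10.0, 2001)
f_in = np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)
```

The mirrored condition holds at the right wall, with the integral taken over $v > 0$.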

Following [24], the boundary conditions of moment equations can be derived by taking odd moments of the diffusive boundary condition. We choose the initial condition as

$$f(0, x, v) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{v^2}{2}\right) \tag{17}$$

for all *x*. Again, we are concerned only with the steady state of the solution.

In our numerical experiments, we choose *Kn* = 0.3, θ*<sup>l</sup>* = 1 and *M* = 11. Two test cases with θ*<sup>r</sup>* = 1.9 and θ*<sup>r</sup>* = 2.7 are considered. For the smaller temperature ratio θ*<sup>r</sup>* = 1.9, the numerical results are given in Fig. 7, where the two solutions mostly agree with each other. The reference solution, computed using the discrete velocity model, is also provided in Fig. 7a. It can be seen that both models provide reasonable approximations to the reference solution. The good behavior of Grad's solutions can also be predicted from the phase diagram in Fig. 7b, from which one can observe that the whole solution lies in the central area of the hyperbolicity region.

For θ*<sup>r</sup>* = 2.7, the results are plotted in Fig. 8. In this case, if we start the simulation of Grad's equations from the initial data (17), the computation breaks down due to the loss of hyperbolicity during the evolution. Therefore, we first run the simulation for the hyperbolic moment equations from the initial data (17) and evolve the solution to the steady state. Afterward, this steady-state solution serves as the initial data for Grad's equations. Although the steady-state solution of Grad's equations can be found using this technique, its approximation is poorer than that of the hyperbolic moment equations. The phase diagram (Fig. 8b) shows that the solution near the left wall is outside the hyperbolicity region, so that the validity of the boundary conditions on the left wall becomes unclear. In contrast, the hyperbolic moment equations still provide a reliable approximation despite the high temperature ratio.

**Fig. 7** Left: Steady Fourier flow for θ*<sup>r</sup>* = 1.9 (left vertical axis: ρ, right vertical axis: θ). Right: The green area is the hyperbolicity region (horizontal axis: *f̂*<sub>*M*−1</sub>, vertical axis: *f̂*<sub>*M*</sub>), and the red line is the parametric curve (*f̂*<sub>*M*−1</sub>, *f̂*<sub>*M*</sub>) with parameter *x*

**Fig. 8** Left: Steady Fourier flow for θ*<sup>r</sup>* = 2.7 (left vertical axis: ρ, right vertical axis: θ). Right: The green area is the hyperbolicity region (horizontal axis: *f̂*<sub>*M*−1</sub>, vertical axis: *f̂*<sub>*M*</sub>), and the red line is the parametric curve (*f̂*<sub>*M*−1</sub>, *f̂*<sub>*M*</sub>) with parameter *x*

#### *3.3 A Summary of Numerical Experiments*

In all the above numerical experiments, we see that despite the loss of some nonlinearity, the hyperbolicity fix does not lose accuracy in any of the tests. In regimes with moderate non-equilibrium effects, Grad's equations may provide solutions outside the hyperbolicity region without numerical instability. In this situation, our experiments show that the hyperbolicity fix is likely to improve the accuracy of the model. It has also been demonstrated that other issues, such as subshocks and divergence, are not related to the hyperbolicity, and have to be addressed independently.

#### **4 Conclusion**

The loss of hyperbolicity, one of the major obstacles to model reduction in gas kinetic theory, has largely been overcome by research in recent years. With the handy framework introduced in Sect. 2, we can safely move our focus in model reduction to other properties, such as the asymptotic limit, stability, and convergence. Our numerical experiments show that the hyperbolic regularization does not harm the accuracy of the model. It is our hope that this framework can inspire further ideas in the development of dimensionality reduction, even beyond kinetic theory.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Cryptography and Digital Transformation**

**Kazue Sako**

**Abstract** Cryptography is implemented using discrete mathematics, with security defined in terms of complexity theory. In this article, we review some cryptographic primitives for encryption, signing messages and interactive proofs. By combining cryptographic primitives, we can design and digitally implement various services with desired features in security, privacy and fairness. We discuss some examples, such as electronic voting and cryptocurrencies.

#### **1 Digital Transformation**

Research in mathematics and cryptography will play a big role in shaping a better digitalized society in the coming years. There is an immense expectation that information and communications technology, known as ICT, will transform our lives to be more efficient, more productive and more functional. However, this is the bright side of digital transformation. We also need to take care to transform 'correctly', so that we do not suffer unexpected consequences.

One evident characteristic of ICT is that it frees us from physical constraints. Digital data have little weight, so we can make a thousand copies and send them a thousand miles at once. While this characteristic brings benefits, it also brings threats to our lives. We need alternative ways to impose 'constraints' on those who are willing to harm us, and one promising approach to creating such constraints is the use of cryptography.

Cryptography started as a way to conceal information. We can design cryptographic algorithms for which it is computationally infeasible to recover the message without knowledge of the decryption key. There are rigorous mathematical proofs that guarantee that this characteristic indeed holds, based on some hard problems, such as factorization. This computational difficulty serves as an alternative constraint in the digital world.

K. Sako (B)

Waseda University, Tokyo, Japan e-mail: kazuesako@aoni.waseda.jp

<sup>©</sup> The Author(s) 2022

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_9

In this article, we provide two examples of using cryptography to implement secure digital systems. One is the digitalization of voting systems, and the other is the digitalization of a payment system, called Bitcoin. Before these two examples, we review some cryptographic primitives, namely encryption schemes, digital signature schemes and interactive proofs.

#### **2 Cryptographic Foundations**

In this section, we will introduce three fundamental notions in cryptography. They are Encryption Schemes, Digital Signature Schemes and Interactive Proofs.

#### *2.1 Encryption Schemes*

We begin by introducing two types of encryption schemes, distinguished by how keys are used. The first type, called symmetric-key encryption schemes, uses the same key for both encryption and decryption. Encryption schemes of this type have existed since the age of Gaius Julius Caesar. The newer type is called public-key (or asymmetric-key) encryption schemes, where different keys are used for encryption and decryption. Moreover, the key used to encrypt data can be made public (Fig. 1).

Let us briefly discuss a mathematical model defining encryption schemes and their security.

**Fig. 1** Two types of encryption schemes

Encryption schemes, either symmetric or asymmetric, can be modeled as three non-deterministic functions, namely KeyGeneration, Encryption and Decryption, with a security parameter *k*. KeyGeneration, on input *k*, outputs a key pair EncKey and DecKey. (In the case of symmetric-key encryption schemes, EncKey = DecKey holds.) The Encryption function, given a message *m* from its domain and EncKey, outputs a ciphertext *c*.

$$c = \text{Encryption}(k, m, \text{EncKey})$$

Similarly, the Decryption function, given a ciphertext *c* from its domain and DecKey, outputs a message *m*′.

$$m' = \text{Decryption}(k, c, \text{DecKey})$$

A triplet of nondeterministic functions (KeyGeneration, Encryption, Decryption) is called an encryption scheme if and only if, for any *k*, for any output (EncKey, DecKey) of KeyGeneration on input *k*, and for any message *m* in its domain,

$$m = \text{Decryption}(k, \text{Encryption}(k, m, \text{EncKey}), \text{DecKey})$$

holds.
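As a minimal, deliberately toy instance of this triplet (our own sketch, not a scheme from the text), the following Python code implements a one-time-pad style symmetric scheme, in which EncKey = DecKey and the correctness property above can be checked directly:

```python
import secrets

def key_generation(k):
    key = secrets.token_bytes(k)     # symmetric: EncKey = DecKey
    return key, key

def encryption(k, m, enc_key):
    # One-time pad: XOR each message byte with a key byte.
    assert len(m) <= len(enc_key), "message must not exceed key length"
    return bytes(a ^ b for a, b in zip(m, enc_key))

def decryption(k, c, dec_key):
    # XOR is its own inverse, so decryption repeats the same operation.
    return bytes(a ^ b for a, b in zip(c, dec_key))

# The defining property: Decryption(k, Encryption(k, m, EncKey), DecKey) = m.
enc_key, dec_key = key_generation(16)
m = b"attack at dawn"
assert decryption(16, encryption(16, m, enc_key), dec_key) == m
```

A fresh key per message makes this scheme information-theoretically secure; reusing the key breaks it immediately.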

As seen in the definition, even an Encryption function that returns *m* itself as *c* yields an encryption scheme. So we need to define what property makes an encryption scheme secure. Cryptographers have studied various ways to do this. A fundamental formulation is the following: given any two messages *m*<sub>1</sub> and *m*<sub>2</sub>, and a ciphertext *c* of either *m*<sub>1</sub> or *m*<sub>2</sub>, the encryption scheme is secure if no one can guess which message the ciphertext *c* decrypts to with probability noticeably more than one half. To be rigorous, this must be defined in an asymptotic manner: for any ε > 0, if we choose a large enough *k*, the probability of guessing correctly cannot exceed 1/2 + ε. We note that in asymmetric encryption schemes, guessing remains hard even for those who know the EncKey that was used to create *c*. There are various other security definitions for encryption schemes, both stronger and weaker [1].

To prove the security of a concrete encryption scheme, we assume the existence of one-way functions or the hardness of problems like factorization.

#### *2.2 Digital Signature Schemes*

Digital Signature Schemes are another exciting tool related to Public Key Encryption Schemes. If we have two related keys, PubKey and PrivKey, where one can publish PubKey without compromising the secrecy of PrivKey, we can construct a scheme that serves as a digital signature. A person signs a message with PrivKey and outputs a signature sig. Anyone can verify whether or not the signature was generated using the key corresponding to PubKey, by performing Verification (Fig. 2).

**Fig. 2** Digital signature schemes

Similarly, a digital signature scheme is modeled by three nondeterministic functions (KeyGen, Gen-SIG, Verify). KeyGen, on input a security parameter *k*, outputs a key pair PrivKey and PubKey. The Gen-SIG function, given a message *m* from its domain and PrivKey, outputs a signature sig.

$$\text{sig} = \text{Gen-SIG}(k, m, \text{PrivKey})$$

The Verify function, given a signature sig from its domain, the message *m* and PubKey, outputs either OK or NG.

$$\text{OK/NG} = \text{Verify}(k, \text{sig}, m, \text{PubKey})$$

A triplet of nondeterministic functions (KeyGen, Gen-SIG, Verify) is called a signature scheme if and only if, for any *k*, for any output (PrivKey, PubKey) of KeyGen on input *k*, and for any message *m* in its domain,

$$\text{OK} = \text{Verify}(k, \text{Gen-SIG}(k, m, \text{PrivKey}), m, \text{PubKey})$$

holds.
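A self-contained toy instance of such a triplet is the hash-based one-time signature of Lamport, which is not mentioned in the text but illustrates the (KeyGen, Gen-SIG, Verify) interface with nothing more than SHA-256:

```python
import hashlib
import secrets

def H(data):
    return hashlib.sha256(data).digest()

def key_gen(k=256):
    # PrivKey: two random preimages per message bit; PubKey: their hashes.
    priv = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(k)]
    pub = [(H(x0), H(x1)) for x0, x1 in priv]
    return priv, pub

def msg_bits(m, k=256):
    d = int.from_bytes(H(m), "big")
    return [(d >> i) & 1 for i in range(k)]

def gen_sig(m, priv):
    # Reveal one preimage per bit of the hashed message.
    return [pair[b] for pair, b in zip(priv, msg_bits(m))]

def verify(sig, m, pub):
    # OK iff every revealed value hashes to the committed public value.
    return all(H(s) == pair[b] for s, pair, b in zip(sig, pub, msg_bits(m)))

priv, pub = key_gen()
sig = gen_sig(b"hello", priv)
assert verify(sig, b"hello", pub)        # OK
assert not verify(sig, b"hello!", pub)   # NG: any other message fails
```

Each key pair must sign only one message; signing a second message reveals additional preimages and destroys the security of the scheme.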

For the security of signature schemes, we want to claim that only a person who knows PrivKey can generate a sig corresponding to *m* for which the Verify function outputs OK. For this purpose, we call a signature scheme secure if, given any algorithm that can generate signatures for which Verify outputs OK, we can use that algorithm to 'extract' PrivKey. For the sake of space, please refer to [1] for a more rigorous definition of the security of digital signature schemes.

**Fig. 3** Interactive proofs

#### *2.3 Interactive Proofs*

The last primitive we discuss in this article is Interactive Proofs. In mathematics, when we say Proof, it is usually something that can be written down on paper, and those who have seen the Proof can verify the correctness of its claim. So the script of the Proof is non-interactive; the Prover alone generates the script of the Proof. Also, the script of the Proof is transferable, in that any party who has seen the Proof can verify that the claim is correct.

Instead, there are protocols where the Prover and the Verifier talk interactively, and at the end the Verifier is persuaded that the Claim is correct. These are called Interactive Proofs (Fig. 3). Interactive proofs can provide a further characteristic: the Verifier learns nothing from the interaction except that the Claim is correct. That is, the Verifier gains no knowledge, or zero knowledge, by engaging in the proof protocol. Such protocols are called Zero Knowledge Interactive Proofs, and they are frequently used in cryptographic protocols. Because the Verifier has learned no new knowledge, he cannot prove to a third party that the Claim the Prover proved is correct.
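Such an interactive proof can be sketched with the classical Schnorr-style identification protocol, which is not described in the text but illustrates the idea: the Prover convinces the Verifier that it knows the discrete logarithm *a* of *y* = *g<sup>a</sup>* mod *p* without revealing *a*. The tiny parameters below are for illustration only:

```python
import secrets

p, g = 467, 2      # toy parameters; real protocols use large prime-order groups

def prover_commit():
    r = secrets.randbelow(p - 1)
    return r, pow(g, r, p)            # keep r secret, announce t = g^r

def prover_respond(r, a, c):
    return (r + c * a) % (p - 1)      # response reveals nothing about a by itself

def verifier_check(y, t, c, s):
    # Accept iff g^s == t * y^c (mod p), which holds when s = r + c*a.
    return pow(g, s, p) == (t * pow(y, c, p)) % p

a = secrets.randbelow(p - 1)          # Prover's secret
y = pow(g, a, p)                      # public claim: "I know log_g(y)"

for _ in range(20):                   # repetition shrinks a cheater's chance
    r, t = prover_commit()
    c = secrets.randbelow(2)          # Verifier's random one-bit challenge
    s = prover_respond(r, a, c)
    assert verifier_check(y, t, c, s)
```

Each round, a Prover who does not know *a* can answer at most one of the two possible challenges, so twenty rounds leave a cheating probability of about 2<sup>−20</sup>.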

#### **3 Digitalizing Voting**

In this section we discuss how the voting procedure can be securely digitalized using cryptography. Typically, the process of designing cryptographic protocols consists of clarifying the purpose and modeling its features, then designing the protocol, and finally verifying that the designed protocol meets the previously set goal.

**Fig. 4** Model of electronic voting

#### *3.1 Requirements for Voting*

So let us clarify the purpose of voting and its desired properties. Here, we assume there is a list of legitimate voters with their respective public keys, and a Tallying Authority. Each legitimate voter casts either a yes or a no vote, and the Tallying Authority wants to obtain a correct count of the votes (Fig. 4). The three main requirements we need to meet are the following:

1. Only the votes of legitimate voters are counted, and each voter votes at most once.
2. The tally correctly reflects all the votes cast.
3. The votes remain anonymous.
#### *3.2 Designing Voting Protocol*

It seems these three requirements are hard to achieve simultaneously. If we let all legitimate voters sign their votes, then the first requirement can be met. However, if the votes are signed with the voters' keys, the votes are not anonymous, which conflicts with the third requirement. If we make all votes anonymous, then we cannot verify whether the votes are from legitimate voters or, even if they are, whether someone voted more than once. Moreover, we cannot verify whether the Tallying Authority simply neglected some of the anonymous votes when counting the tally.

There are several ideas for meeting all three seemingly conflicting requirements. In this subsection, we discuss one such idea, which uses shuffling [2].

**Fig. 5** Overview of voting protocol using shuffling

The underlying idea comes from how we meet these requirements with paper ballots. In some jurisdictions, a voter fills in a paper ballot and puts it in a blank envelope. The voter then puts this blank envelope in a larger envelope signed with the voter's name, and hands the envelope to the Tallying Authority. The Tallying Authority can verify that the voter is a legitimate voter and has handed in exactly one envelope, but because the ballot is inside an envelope, the Authority cannot learn the vote. How about counting? On the day of counting the votes, all the outer envelopes are removed, leaving the ballots still sealed in blank inner envelopes. All the blank envelopes are thrown on a table and shuffled manually, so that no one learns which inner envelope came from which outer envelope. After adequate shuffling has been performed, the inner envelopes are opened and the ballots within are counted. The whole procedure is supervised by an observer, so that the Tallying Authority cannot cheat while shuffling or opening the envelopes. This trick can also be used in digitalization (Fig. 5).

So we encrypt the ballot using a public key of the system to mimic the blank inner envelope. As the outer envelope, the voter signs the encrypted ballot and casts it to the Tallying Authority. From the signature on the encrypted ballot, the Authority learns that the ballot is from a legitimate voter and that the same voter has not voted more than once, but the ballot itself cannot be seen, as it is encrypted. Then the Authority removes the digital signature part and 'shuffles' the encrypted ballots. After the encrypted ballots have been well mixed, that is, after it has been made difficult to match each encrypted ballot to its submitter, the ballots are decrypted to enable tallying. This way, we can ensure that only legitimate voters' votes are counted, and only once, and the Authority does not learn the vote of each voter as long as the decryption keys are kept safe. To ensure that the Authority performed the tallying correctly, the Authority provides Zero Knowledge Interactive Proofs to prove that it has

**Fig. 6** Permutation is not shuffling

followed the procedure correctly and that the result of the tally is trustworthy. In the next subsection, we discuss in more detail how we 'shuffle' digital data.

#### *3.3 Shuffling Encrypted Data Using Probabilistic Encryption*

If 'shuffling digital data' were simply changing the location of some digital data, then even after shuffling it would be easy to spot which digital data came from whom, by matching the bit patterns (Fig. 6).

So in digital shuffling, we need to change the look of the digital data. For this purpose, we use a public-key encryption scheme that is probabilistic [3]. That is, the encryption function is non-deterministic, so there are many ciphertexts that decrypt to the same message. Changing 'the look' of encrypted digital data then means replacing the encrypted data with other encrypted data that decrypts to the same message. Figure 7 illustrates such a shuffling procedure. First, the list of encrypted ballots is permuted. Then each encrypted ballot is replaced with another encryption of the same content. Looking at the input list and the output list, it is difficult to trace which ballot was shuffled to which position.

An example of a probabilistic encryption scheme that offers this characteristic is ElGamal encryption [4]. Here we provide an overview of the scheme. ElGamal encryption is based on the assumption that, given a prime *p*, a generator *g* of Z*<sub>p</sub>* and *y* = *g<sup>a</sup>* mod *p*, it is difficult to compute *a* from (*p*, *g*, *y*) for a randomly chosen *y* in Z*<sub>p</sub>*. This is called the Discrete Logarithm Problem. The KeyGeneration function for ElGamal encryption generates *p* of length *k* (the security parameter), *g*, and *y* = *g<sup>a</sup>* mod *p* for a

**Fig. 7** Shuffling procedure

randomly chosen *a*. The public key is (*p*, *g*, *y*), and the exponent *a* serves as the secret key. The Encryption function, on input a message *m* in Z*<sub>p</sub>* and the public key (*p*, *g*, *y*), generates a random number *r* and outputs

$$(c\_1, c\_2) = (g^r \bmod p, \; m \cdot y^r \bmod p)$$

as a ciphertext of *m*. On input (*c*<sub>1</sub>, *c*<sub>2</sub>) and the secret key *a*, the Decryption function computes *c*<sub>2</sub>/(*c*<sub>1</sub>)*<sup>a</sup>* mod *p*, which equals the message *m* if the ciphertext was correctly conveyed. In order to change the look of (*c*<sub>1</sub>, *c*<sub>2</sub>),

$$(d\_1, d\_2) = (c\_1 \cdot g^s \bmod p, \; c\_2 \cdot y^s \bmod p)$$

for a randomly chosen *s*, provides another, different-looking ciphertext that also decrypts to the message *m*. It is interesting that this transformation can be performed without knowledge of the secret key.
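The scheme above, together with its re-randomization step and a ballot shuffle, can be sketched in a few lines of Python. The prime below is a toy parameter chosen only for illustration; real deployments use keys of thousands of bits:

```python
import random
import secrets

p, g = 467, 2      # toy prime and generator; far too small for real security

def key_generation():
    a = secrets.randbelow(p - 2) + 1       # secret exponent a
    return pow(g, a, p), a                 # public y = g^a mod p, secret a

def encryption(m, y):
    r = secrets.randbelow(p - 2) + 1       # fresh randomness per ciphertext
    return pow(g, r, p), (m * pow(y, r, p)) % p

def decryption(c, a):
    c1, c2 = c
    # c2 / c1^a mod p, with the inverse computed via Fermat's little theorem.
    return (c2 * pow(pow(c1, a, p), p - 2, p)) % p

def rerandomize(c, y):
    # (d1, d2) = (c1 * g^s, c2 * y^s): a new-looking ciphertext of the same
    # message, computable without the secret key.
    s = secrets.randbelow(p - 2) + 1
    c1, c2 = c
    return (c1 * pow(g, s, p)) % p, (c2 * pow(y, s, p)) % p

y, a = key_generation()
ballots = [encryption(v, y) for v in (1, 2, 1)]   # encode yes = 1, no = 2
# Shuffle: permute the list, then re-randomize every ciphertext.
mixed = [rerandomize(c, y) for c in random.sample(ballots, len(ballots))]
assert sorted(decryption(c, a) for c in mixed) == [1, 1, 2]
```

Because `rerandomize` in effect multiplies by a fresh encryption of 1, tracing a ballot through the shuffle requires breaking the encryption, while the holder of the secret key still recovers the same tally.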

#### **4 Bitcoin Blockchain**

Perhaps one of the most impressive digital transformations achieved through cryptography is the digitalization of 'money' called Bitcoin [5]. There are many prepaid electronic money systems today, like PayPay, but each is restricted to one currency and is run by an accountable organization that operates the system. Satoshi Nakamoto designed a system in which algorithms alone ensure the correctness of money transfers, excluding the existence of a centralized authority. We provide below an overview of his design. We note that some details are omitted for the sake of simplicity.

**Fig. 8** Data managers and transaction logs

#### *4.1 Modeling Blockchain*

Blockchain is the technology used to manage transaction data in Bitcoin. Users of Bitcoin issue transaction data, typically saying 'send *x* Bitcoin from my account *yyy* to the address *zzz*.' The transaction is accepted if the message was indeed sent by the owner of the account *yyy* and there are indeed *x* Bitcoin left in the account. The transaction log implies that after the transaction has been accepted, *x* Bitcoin should be subtracted from the account *yyy* and added to the account *zzz*. Unlike previous systems, where one organization keeps a record of all the transactions, in Bitcoin there are multiple voluntary 'Data managers', known as Full Nodes, connected in a peer-to-peer fashion. When a user issues a transaction, Data managers check its correctness and propagate the transaction to other Data managers. The ideal goal is that all the Data managers keep these transaction logs in a consistent way (Fig. 8). However, since transaction logs are created by various account holders internet-wide and communication through the peer-to-peer network may not always be perfect, there is no guarantee that the list of logs is consistent among all the Data managers. So the big problem Satoshi had to solve was how to synchronize the transaction logs among the Data managers while they are connected over an asynchronous peer-to-peer network.

**Fig. 9** Crypto puzzles for synchronization

#### *4.2 Crypto Puzzle for Synchronization*

A core idea behind synchronization is to restrict the frequency at which transactions are distributed. If distribution happens infrequently, say once every 10 min or so, that provides enough time for the peer-to-peer network to share the same data. To achieve this, the Bitcoin blockchain is designed so that a bulk of transaction logs is bundled into a block, and the block cannot be distributed among Data Managers unless accompanied by the solution to a certain crypto puzzle related to the content of that block. This crypto puzzle is designed so that the puzzle for any block can be solved with high probability, but solving it is time consuming. We note that while the puzzle is hard to solve, it is easy for other Data managers to verify that a solution is correct (Fig. 9).

To define the crypto puzzle, we use a mathematical function called a Hash Function. A Hash Function deterministically maps an arbitrarily long input string to a fixed-length integer of, say, 256 bits. The output is called a hashed value. With a cryptographically secure hash function, it is computationally difficult to find two different inputs that map to the same hashed value. There are known algorithms that are believed to achieve this property, such as SHA-256 [6].

Let us assume a Data Manager wants to add a bulk of data *D*1,..., *Dn* on top of the latest block data Bn. The puzzle is to find a string str that satisfies the following inequality.

$$\text{Hash}(\text{Hash}(\text{Bn}) \parallel D\_1 \parallel \dots \parallel D\_n \parallel \text{str}) < 2^{\text{Bn}(k)}$$

where ∥ represents concatenation of strings and Bn(*k*) is an integer determined by the previous block Bn, called the difficulty. A typical output of a hash function is an integer of length 256 bits, so if Bn(*k*) is about 60, one needs to try many possible strings str to check whether the inequality holds. The difficulty is designed so that this trial-and-error process takes 10 min on average to find the desired string str.
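This trial-and-error search can be sketched as follows. The function names and the toy difficulty are our own illustrative choices; the target here is set vastly easier than Bitcoin's, so the search finishes in a fraction of a second:

```python
import hashlib
from itertools import count

def hash_int(data):
    # Interpret the SHA-256 digest as a 256-bit integer.
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def solve_puzzle(prev_block, txs, target_bits):
    # Find str with Hash(Hash(Bn) || D1 || ... || Dn || str) < 2**target_bits.
    prefix = hashlib.sha256(prev_block).digest() + b"".join(txs)
    for nonce in count():
        s = str(nonce).encode()
        if hash_int(prefix + s) < 2 ** target_bits:
            return s

def verify_solution(prev_block, txs, s, target_bits):
    # Verification costs a single hash evaluation.
    prefix = hashlib.sha256(prev_block).digest() + b"".join(txs)
    return hash_int(prefix + s) < 2 ** target_bits

prev = b"genesis block"
txs = [b"alice->bob:3", b"bob->carol:1"]
s = solve_puzzle(prev, txs, target_bits=240)   # ~2**16 hashes on average
assert verify_solution(prev, txs, s, 240)
```

Lowering `target_bits` by one doubles the expected work for the solver, while the verifier's cost stays at one hash; this asymmetry is what makes the puzzle useful for rate-limiting block distribution.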

The list of data *D*1,..., *Dn*, accompanied by the correct puzzle solution str, is then propagated as a new block among the Data Managers. Other Data Managers who receive the block verify the correctness of the solution. If it is correct, they add this block on top of the previous blocks, extending the chain of stored data. Then they try to solve the next puzzle, based on the new block, with other transaction logs that have not yet been stored in the blockchain.

#### *4.3 Incentives for Data Managers*

We conclude this overview of the Bitcoin blockchain by mentioning why the Data managers spend their computational effort on solving otherwise meaningless puzzles. The Data managers are awarded Bitcoin if they solve the puzzle and their block is followed by future blocks. Their incentive to receive this award plays a central role in keeping the data consistent among Data managers, and deters them from behaving maliciously.

#### **5 Concluding Remarks**

In this article we have discussed some examples of securely implementing current social activities in the cyber world using cryptography. We have shown how some cryptographic primitives are defined mathematically. The procedure for designing secure protocols begins with clarifying the goal and requirements, and then designing the protocol to meet those criteria. Although these examples show that cryptography is a promising approach, we still lack the technology to model and mathematically evaluate an overall system for digital transformation. The author sincerely hopes that this article will encourage researchers in mathematics, cryptography and information technology to get together and share their strengths toward the goal of making our digital society a more secure and fair place.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Efficient Algorithms for Tracking Moving Interfaces in Industrial Applications: Inkjet Plotters, Electrojetting, Industrial Foams, and Rotary Bell Painting**

**Maria Garzon, Robert I. Saye, and James A. Sethian**

**Abstract** Moving interfaces are key components of many dynamic industrial processes, in which complex interface physics determine much of the underlying action and performance. Level set methods, and their descendents, have been valuable in providing robust mathematical formulations and numerical algorithms for tracking the dynamics of these evolving interfaces. In manufacturing applications, these methods have shed light on a variety of industrial processes, including the design of industrial inkjet plotters, the mechanics of electrojetting, shape and evolution in industrial foams, and rotary bell devices in automotive painting. In this review, we discuss some of those applications, illustrating shared algorithmic challenges, and show how to tailor these methods to meet those challenges.

Moving interfaces are key components of many dynamic industrial processes, and their dynamics are critical to the underlying physics. Examples include turbines, flames and combustion, plastic injection molding, microfluidics, and pumping. In each of these examples, complex physics at the interface, such as between a fluid and a moving wall, or through a membrane or a transition region, determines much of the underlying action and performance (Fig. 1).

One approach to propagating interfaces is given by "level set methods". These algorithms track interfaces in multiple dimensions, couple the driving physics with the interface in a natural way, and smoothly handle topological change due to merging and breaking. They accurately and robustly compute high-order solutions

M. Garzon
Department of Applied Mathematics, University of Oviedo, Oviedo, Spain e-mail: maria@uniovi.es

R. I. Saye
Mathematics Group, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA e-mail: rsaye@lbl.gov

J. A. Sethian (B) Department of Mathematics, University of California, Berkeley, California 94720, USA e-mail: sethian@math.berkeley.edu

© The Author(s) 2022 T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_10

**Fig. 1** Examples of industrial interfaces

to moving interface problems, and are easily discretized using standard techniques, such as finite difference, finite element, and discontinuous Galerkin methods.

This paper reviews the application of these methods to several industrial problems, drawing from multiple sources [10–17, 26–29, 42–44] to discuss the design of industrial inkjet plotters, jetting and electrojetting devices, industrial foams, and rotary bell spray devices. Rather than focusing extensively on the equations or the algorithms, we provide an overview of the approaches, with an emphasis on the results. References are provided for more in-depth discussions.

#### **1 Modeling Interface Evolution Using Level Set Methods**

Level set methods, introduced in [19], have been used in a large number of applications to track moving interfaces. They are based on a general mathematical theory as well as a robust numerical methodology, which relies on exchanging the typical Lagrangian perspective on front propagation, in which the front is explicitly tracked, for an Eulerian view in which the moving interface is embedded as a particular level set of a higher-dimensional function posed in a fixed coordinate system. The motion of the interface then corresponds to evolving this higher-dimensional function according to a Hamilton-Jacobi-type initial value partial differential equation.

A brief summary is as follows. Consider a moving interface Γ(*t*) of dimension *N* − 1. We restrict ourselves to interfaces which are closed and simple, and which separate the domain into an "inside" and an "outside". We recast the problem by implicitly defining the moving interface Γ(*t*) as the zero level set of the solution to an evolving level set function φ(*x*, *t*), φ : ℝ*<sup>N</sup>* × [0, ∞) → ℝ, which satisfies a time-dependent partial differential equation. There are many ways to initialize this implicit function: one approach is to let φ(*x*, *t* = 0) be the signed distance from the interface Γ(*t* = 0), linking the interface to the zero level set.

We assume that the underlying physics specifies a speed *F* normal to the interface at every point on the interface. Constructing this speed function typically involves solving complex physics both on and off the interface.

Thus, there are two embeddings. First, the interface itself is embedded and implicitly defined through a higher-dimensional function φ. Second, to move the other level sets, we embed the speed *F* in a higher-dimensional function, known in the literature as the "extension velocity" *Fext*, which reduces to the given speed on the zero level set corresponding to the interface.

#### *1.1 Equations of Motion*

Here, we review the basic ideas behind the derivation and implementations of level set methods. We follow the derivation and discussion in [35, 36].

We wish to produce an Eulerian formulation for the motion of a hypersurface Γ representing the interface and propagating along its normal direction with speed *F*, where *F* can be a function of various arguments. Let ±*d*(*x*) be the signed distance from the point *x* ∈ ℝ*<sup>N</sup>* to the interface at time *t* = 0. Define a function φ(*x*, *t* = 0) by the equation

$$
\phi(\mathbf{x}, t=0) = \pm d(\mathbf{x}).\tag{1}
$$

Requiring that the zero level set of the evolving φ (see Fig. 2, left) always match the propagating hypersurface means that

$$
\phi(\mathbf{x}(t), t) = 0. \tag{2}
$$

**Fig. 2** Left: Implicit embedding of level set function. Right: Topological change

By the chain rule, φ*<sub>t</sub>* + ∇φ(*x*(*t*), *t*) · *x*′(*t*) = 0. Since *x*′(*t*) · **n** = *F<sub>ext</sub>*, where **n** = ∇φ/|∇φ| is the unit normal and *F<sub>ext</sub>* is the extension velocity, this yields an evolution equation for φ, namely,

$$\phi_t + F_{ext}|\nabla\phi| = 0, \quad \text{given} \quad \phi(\mathbf{x}, t=0). \tag{3}$$

This is the level set equation introduced by Osher and Sethian [19]. Propagating fronts can develop shocks and rarefactions in the gradient, corresponding to corners and fans in the evolving interface, and numerical techniques designed for hyperbolic conservation laws can be exploited to construct schemes which produce the correct, physically reasonable entropy solution, see [32–34].
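In the spirit of those entropy-satisfying schemes, a first-order Godunov-type upwind update for φ*<sub>t</sub>* + *F*|∇φ| = 0 with constant speed *F* can be sketched as follows; the uniform grid, edge padding at the boundary, and constant *F* are simplifying assumptions for illustration.

```python
import numpy as np

def upwind_step(phi, F, dx, dt):
    """One first-order upwind step for phi_t + F |grad phi| = 0, F constant.
    One-sided differences are combined so the scheme selects the
    entropy-satisfying (viscosity) solution at corners of the front."""
    p = np.pad(phi, 1, mode="edge")
    Dmx = (p[1:-1, 1:-1] - p[:-2, 1:-1]) / dx   # backward difference in x
    Dpx = (p[2:, 1:-1] - p[1:-1, 1:-1]) / dx    # forward difference in x
    Dmy = (p[1:-1, 1:-1] - p[1:-1, :-2]) / dx   # backward difference in y
    Dpy = (p[1:-1, 2:] - p[1:-1, 1:-1]) / dx    # forward difference in y
    if F > 0:
        grad = np.sqrt(np.maximum(Dmx, 0.0)**2 + np.minimum(Dpx, 0.0)**2
                       + np.maximum(Dmy, 0.0)**2 + np.minimum(Dpy, 0.0)**2)
    else:
        grad = np.sqrt(np.minimum(Dmx, 0.0)**2 + np.maximum(Dpx, 0.0)**2
                       + np.minimum(Dmy, 0.0)**2 + np.maximum(Dpy, 0.0)**2)
    return phi - dt * F * grad
```

For a signed distance function, |∇φ| ≈ 1, so each step lowers φ by roughly *F*·Δ*t* and the zero level set expands outward at speed *F*.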

There are several advantages to this approach. First, the formulation works in any number of dimensions. Second, topological changes are handled without special attention: fronts split and merge. Third, geometric quantities along the interface can be calculated by taking advantage of the embedding and computing quantities in the fixed Eulerian setting. Fourth, this formulation naturally lends itself to numerical approximations, for example, through finite difference or finite element formulations on the fixed background mesh.

#### *1.2 Computational Advances*

Since its introduction, a large number of computational advances have been developed to make this approach efficient, accurate, and economical. These include


A large number of reviews have appeared over the years, containing these and many related ideas. We refer the interested reader to [20, 30, 35, 36, 38–40].

#### **2 Industrial Printing**

#### *2.1 Physical Problem and Modeling Goals*

Industrial inkjet printing involves ejecting ink housed in a well through a narrow nozzle, which is then deposited on a material. The ink in the bath is expelled by an electro-actuator mechanism at the bottom, which quickly propels ink through the nozzle. The shape of the nozzle, the force and timing of the actuator, and the properties of the ink are instrumental in determining the ultimate shape, delivery, and performance of the printing device.

This is a two-phase incompressible fluid flow problem, with the interface separating air and ink. Depending on the composition of the ink, the flow can be either Newtonian or visco-elastic. Boundary conditions include both no-slip and no-flow at solid walls, and triple points where air-ink boundaries meet solid nozzle walls are subject to the usual critical contact angle dynamics controlling slipping. While a common use of inkjet printing is commercial home printing, over the past two decades a large number of sophisticated industrial applications have appeared, ranging from printing integrated circuits and the manufacture of display devices through to the construction of tissue scaffolding and layered manufacturing.

The goal of numerical simulation is to identify and optimize key aspects of the process, including


#### *2.2 Equations of Motion and Computational Challenges*

We solve for incompressible flow in a non-rectangular geometry, with no-slip and no-flow on walls, with air satisfying Newtonian flow and ink satisfying a visco-elastic Oldroyd-B model. The equations of motion [42–44], are given by

$$\rho_1 \frac{D\mathbf{u}_1}{Dt} = -\nabla p_1 + \nabla \cdot (2\mu_1 \mathcal{D}_1) + \nabla \cdot \boldsymbol{\tau}_1 \,, \qquad \nabla \cdot \mathbf{u}_1 = 0 \,,$$

$$\frac{D\boldsymbol{\tau}_1}{Dt} = \boldsymbol{\tau}_1 \cdot (\nabla \mathbf{u}_1) + (\nabla \mathbf{u}_1)^T \cdot \boldsymbol{\tau}_1 - \frac{1}{\lambda_1} \left(\boldsymbol{\tau}_1 - 2\mu_{p1} \mathcal{D}_1\right) \qquad \text{(Ink)} \tag{4}$$

$$\rho_2 \frac{D\mathbf{u}_2}{Dt} = -\nabla p_2 + \nabla \cdot (2\mu_2 \mathcal{D}_2) \,, \qquad \nabla \cdot \mathbf{u}_2 = 0 \qquad \text{(Air)} \tag{5}$$

$$\mathcal{D}_i = \frac{1}{2} \left[ \nabla \mathbf{u}_i + (\nabla \mathbf{u}_i)^T \right], \quad \mathbf{u}_i = u_i \mathbf{e}_r + v_i \mathbf{e}_z \,, \qquad i = 1, 2 \tag{6}$$

where, for the ink, **τ**<sub>1</sub> is the viscoelastic stress tensor, λ<sub>1</sub> is the viscoelastic relaxation time, μ<sub>*p*1</sub> is the solute dynamic viscosity, and subscript 2 refers to the (Newtonian) air.

We use a level set method to track the air-ink interface, starting with the initial pressure disturbance in the reservoir: the fluid then moves through the nozzle, is ejected into the ambient air, and may then separate into one or more droplets. We compute an approximate solution to the incompressible Navier-Stokes equations given above

**Fig. 3** Left: Experimental profiles, showing ejected ink and satellite formation; note the formation of the trailing satellite droplet as the initial bubble stretches and changes topology. Right: simulation of full ejection cycle (taken from [43]). Inflow pressure comes from an equivalent circuit model which describes the cartridge, supply channel, vibration plate, PZT actuator, and applied voltage. The fluid is an Epson dye-based ink, with critical advancing contact angle θ<sub>*a*</sub> = 70° and receding contact angle θ<sub>*r*</sub> = 30°, and with ρ<sub>1</sub> = 1070 kg/m³, μ<sub>1</sub> = 3.34 × 10⁻³ kg/(m·s), and σ = 0.032 kg/s². The nozzle geometry has diameter 26 µm at the opening and 65 µm at the bottom

in both phases simultaneously, with surface tension terms mollified onto the right-hand side as a forcing term. Thus, the solution accounts for the ink velocity, the air-ink interface, and the currents induced in the air by the fluid ejection. We use a second order projection method [7–9] on a body-fitted, logically rectangular mesh. Calculations are performed both in axisymmetric two dimensions and in full three-dimensional regimes. For details, see [42–44]. Figure 3 shows the results of both an experiment and a simulation.
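Schematically, each time step of such a projection method splits the update into a provisional momentum step and a pressure projection enforcing incompressibility. Stated here as standard background (not as the specific discretization of [7–9]), the structure is

$$\mathbf{u}^{*} = \mathbf{u}^{n} + \Delta t \left[ -(\mathbf{u}^{n}\cdot\nabla)\mathbf{u}^{n} + \frac{1}{\rho}\nabla\cdot(2\mu\mathcal{D}) + \mathbf{f}_{st} \right], \qquad \Delta p = \frac{\rho}{\Delta t}\,\nabla\cdot\mathbf{u}^{*}, \qquad \mathbf{u}^{n+1} = \mathbf{u}^{*} - \frac{\Delta t}{\rho}\,\nabla p,$$

where **f**<sub>*st*</sub> denotes the mollified surface tension forcing; after the projection, ∇ · **u**<sup>*n*+1</sup> = 0 holds discretely.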

#### **3 Droplet Formation and Electro-jetting**

#### *3.1 Physical Problem and Modeling Goals*

A large number of industrial problems involve microjetting and droplet dynamics, in which small droplets move through small structures and transport key materials, for example in the deposition of evaporative substances, the delivery of biological materials, and substance separation.

Part of the challenge in computing these problems stems from the critical role of surface tension and shear forces, which often drive topological change, breakage, and merger in the evolving droplets. Level set methods, because of their ability to handle these structural changes, are particularly well-suited for computing droplet dynamics. Here, we summarize work on microjetting dynamics first presented in [10], see also [11–17].

Consider the dynamics of a thin tube of fluid as it pinches off due to surface tension effects at a narrowing neck of the fluid (see Fig. 5), where mean curvature drives the interface inward until it breaks into two separate lobes of fluid. The pinch-off dynamics reveal considerable intricacy: as the droplet breaks, rapidly moving capillary waves on the surface cause instabilities and oscillations in the fluid lobes.

#### *3.2 Equations of Motion*

Following the arguments in [11, 12], we model the fluid as incompressible and irrotational with a potential flow formulation. Euler's equation gives

$$\nabla \cdot \mathbf{u} = 0 \text{ in } \Omega(t) \tag{7}$$

$$\mathbf{u}_t + \mathbf{u} \cdot \nabla \mathbf{u} = \frac{-\nabla p}{\rho} \quad \text{in } \Omega(t), \text{ with surface tension forces acting on } \Gamma_t(\mathbf{s}). \tag{8}$$

Assuming irrotationality (∇ × **u** = 0), the problem can then be written in terms of a fluid velocity potential **u** = ∇ψ, namely

$$
\Delta \psi = 0 \text{ in } \Omega(t) \tag{9}
$$

$$
\psi\_t + \frac{1}{2} (\nabla \psi \cdot \nabla \psi) + \frac{p - p\_a}{\rho} = 0 \quad \text{on } \Gamma\_t(\mathbf{s}), \tag{10}
$$

where *pa* is the atmospheric pressure and ρ is the fluid density.

As shown in [11, 12], this can be reformulated as

$$\mathbf{u} = \nabla \psi \quad \text{in} \quad \Omega(t), \quad \Delta \psi = 0 \quad \text{in} \quad \Omega(t) \tag{11}$$

$$\frac{D\psi}{Dt} = \frac{1}{2} (\nabla\psi \cdot \nabla\psi) - \frac{\gamma}{\rho} \left(\frac{1}{R_1} + \frac{1}{R_2}\right) \text{ on } \Gamma_t(\mathbf{s})\,,\tag{12}$$

where Ω(*t*) is the fluid tube, Γ<sub>*t*</sub>(*s*) is the boundary of the tube, *R*<sub>1</sub> and *R*<sub>2</sub> are the principal radii of curvature, and γ is the surface tension.
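The step from (10) to (12) follows by applying the material derivative *D*/*Dt* = ∂/∂*t* + ∇ψ · ∇ along the free surface and substituting the Young-Laplace pressure jump *p* − *p<sub>a</sub>* = γ(1/*R*<sub>1</sub> + 1/*R*<sub>2</sub>):

$$\frac{D\psi}{Dt} = \psi_t + \nabla\psi\cdot\nabla\psi = -\frac{1}{2}(\nabla\psi\cdot\nabla\psi) - \frac{p-p_a}{\rho} + \nabla\psi\cdot\nabla\psi = \frac{1}{2}(\nabla\psi\cdot\nabla\psi) - \frac{\gamma}{\rho}\left(\frac{1}{R_1}+\frac{1}{R_2}\right).$$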

Although the potential ψ is only defined on the interface, our plan is to build an extension of both the potential and the interface to all of space, so that we can then employ the level set methodology. This embedded implicit formulation allows calculation of the fluid interface motion through pinch-off, and can compute the dynamics of the split fluid lobes.

These embeddings produce a new set of equations, namely

For details about the derivation of these equations, see [10–12].

#### *3.3 Computational Challenges*

The computational challenges that stem from these equations of motion lie in part in the delicate, sharp singularity at pinch-off. The curvature becomes very large, and as soon as pinch-off occurs, the two pieces of the neck retract very quickly. Constructing correct extension values for the velocity and the potential requires care as well.

We solve these equations through a time-cycle. Given values for the embedded implicit potential and level set function on a fixed background mesh, we construct the zero level set corresponding to the interface, place boundary element nodes on that interface, and then employ a boundary element method to find the new potential and associated velocity field, suitably extended. These nodes are then discarded, and the discrete grid values for the level set function, potential, and velocity are updated.

#### *3.4 Example Results*

Extensive numerical experiments are given in [12, 14]: the self-similar behavior of some variables near pinch-off time is checked within the computations and the computed scaling exponents agree with experimental and theoretical reported values. Here we review those results. Figure 4 shows a snapshot after pinch-off, revealing capillary surface waves on the undulating surface. Figure 5 shows the fine-scale structure of droplet dynamics after pinch-off.

#### *3.5 Charged Droplet Separation*

The above situation becomes considerably more complicated when the droplets are electrically charged, in which the droplet motion is driven by a background electrical field. Applications include electrospray ionization, electrospinning to produce fibers by drawing charged threads of polymers, particle deposition for nanostructures, drug delivery systems, and electrostatic rotary bell painting.

**Fig. 4** Droplet dynamics. Left, experiment taken from [41]. Right, level set calculation of surface capillary waves, taken from [12]

**Fig. 5** Simulation: fine-scale structure of droplet dynamics after pinch-off [12]

**Fig. 6** Experimental profile of electrically charged droplet motion [18]

The fundamental mechanism relies on the motion of an electrically conductive liquid in an electric field. The shape of the droplet starts to deform under the action of the electric field; afterwards, the competition between inertial, surface tension, and electric forces drives the dynamics, see Fig. 6.


**Fig. 7** Equations for electrically charged droplet motion. Note: the symbol used for the velocity potential in the equations shown differs from the ψ used in the main text

#### *3.6 Equations of Motion and Computational Challenges*

The equations of motion are the previous potential formulation for droplet hydrodynamic motion, plus electrodynamics. We assume a perfectly conducting fluid and an unbounded dielectric medium exposed to an external uniform force field. Model equations from [16] are shown in Fig. 7.

Algorithmic challenges include accurate and reliable computation of the electric field and handling sharp breakup and fast ejection.

#### *3.7 Example Results*

We show a numerical simulation [16] of a free charged droplet carrying a charge above the critical value, reproducing experimental results before and after jet emission. Figure 8 shows the focused droplet end from which tiny charged droplets are ejected.

#### **4 Industrial Foams**

#### *4.1 Physical Problem and Modeling Goals*

Many problems involve the interaction of multiply-connected regions moving together. These include the mechanics and architecture of liquid foams, such as polyurethane and colloidal mixtures, and of solid foams, such as wood and bone.

The industrial applications of these problems are manifold. Liquid foams are key ingredients in industrial manufacturing, used in fire retardants and in froth flotation for separating substances. Solidification of liquid foams produces solid foams, which have remarkable compressive strength because of their pore-like internal structure; applications include lightweight bicycle helmets and automotive shock absorbers.

**Fig. 8** Time evolution of electrically charged droplet motion, from [16]

**Fig. 9** Examples of multiphase problems

In such problems, multiple domains share walls meeting at multiple junctions. Boundaries move under forces which depend on both local and global geometric properties, such as surface tension and volume constraints, as well as long-range physical forces, including incompressible flow, membrane permeability, and elasticity.

Foam modeling is made challenging by the vast range of space and time scales involved [6]. Consider an open, half-empty bottle of beer. It may seem that nothing is happening in the collection of interconnected bubbles near the top, but currents in the lamellae separating the air pockets show slow but steady drainage. It can take tens to hundreds of seconds for the lamella fluid to drain and then rupture, triggering a lamella explosion that retracts at hundreds of centimeters per second, after which the imbalanced configuration rights itself into a new stable structure in less than a second. Spatially, membranes are barely micrometers thick, while large gas pockets can span many millimeters or centimeters. All told, the biggest and smallest scales differ by roughly six orders of magnitude in space and time.

Another example comes from metal grain coarsening, in which surface energy, often associated with temperature changes, drives a system to larger structures. A third example comes from foam-formed fiber networks, found both in industrial materials such as paper and in biological materials such as plant cells and tissues (Fig. 9).

In all of these engineering problems, understanding such factors as pocket formation and distribution, tensile strength, and foam architecture is a key part of producing mechanisms to optimize foam performance.

#### *4.2 Computational Challenges*

Producing good mathematical models and numerical algorithms that capture the motion of these interfaces is challenging, especially at junctions where multiple interfaces meet, and when topological connections change. Methods have been proposed, including front tracking, volume of fluid, variational, and level set methods. It has remained a challenge to robustly and accurately handle the wide range of possible motions of an evolving, highly complex, multiply-connected interface separating a large number of phases under time-resolved physics.

The problem is exacerbated by the nature of the mathematical components that contribute to the dynamics, including: velocities dependent on such factors as curvature, normal directions and anisotropy; the solution of complex PDEs with jump conditions, source terms, and prescribed values at the interface and internal boundary conditions; area and volume-dependent integrals over phases; thermal effects and diffusion within phases; and balance of forces at complex junctions.

From a numerical perspective, some of the challenges stem from the vast time and space scales involved. Using the fine spatial resolution needed to resolve the physics along interfaces is often impractical in the bulk phases. Sharp resolution of the interface and of front-driven physical quantities located on the interface is required as input to the bulk PDEs. Accurately resolving interface junctures is critical in order to provide reliable values for the balance of forces at junctions.

All told, these lead to formidable numerical modeling challenges.

#### *4.3 Voronoi Implicit Interface Methods*

Voronoi Implicit Interface Methods (VIIM), introduced in [26], provide an accurate, robust, and reliable way to track multiphase physics in problems with a large number of interacting phases. They work in any number of space dimensions, represent the complete phase structure by a single function value plus indicator at each discretized element of the computational domain, couple easily to complex physics, and handle topological change, merger, breakage, and phase extinction in a natural manner. The underlying equations of motion that represent the evolving interface and complex physics may be approximated in either a finite difference or finite element framework. These equations couple level set methods for an evolving initial value Hamilton-Jacobi-type partial differential equation to a computational geometry-based Eikonal equation to produce a faithful phase representation. Here, we provide a brief review of the methods. For details, see [26].

The starting point is to consider a collection of non-overlapping phases which divide up the domain. The "interface" consists of places where these phases meet. In two dimensions, the simplest example is a single curve separating two phases. More complex structures might have multiple closed curves, each surrounding a separate phase, which meet in triple points or higher-order junctions. In three dimensions, the situation is far more complex.

The Voronoi Implicit Interface Method begins by characterizing the entire system through an implicit representation. For each point *x* in the plane, define φ(*x*) as the distance to the closest interface. Additionally, define χ(*x*) as an integer-valued function which indicates the phase. By construction, the interface representing all possible boundaries is given as the zero level set {φ(*x*) = 0} of this unsigned distance function, and the indicator function reveals the type of phase.

Thus, for example, if φ(*x*) = 5 and χ (*x*) = 4, then we know that the point *x* is located in phase 4, and the closest interface point is located a distance 5 away.
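In a simple one-dimensional toy geometry (an assumption for illustration, not part of [26]), the pair (φ, χ) can be computed directly:

```python
import numpy as np

def phase_representation(X, interfaces):
    """Unsigned distance phi and integer phase indicator chi for points X,
    given sorted interface positions; phase k occupies the k-th gap."""
    X = np.asarray(X, dtype=float)
    pts = np.asarray(interfaces, dtype=float)
    phi = np.min(np.abs(X[..., None] - pts), axis=-1)  # distance to nearest interface
    chi = np.searchsorted(pts, X) + 1                  # which phase contains the point
    return phi, chi
```

For instance, with interfaces at 0.3 and 0.7, the point *x* = 0.9 lies in phase 3, a distance 0.2 from the nearest interface.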

Starting with this unsigned distance function representation, we execute a two-step process. With interface speed *F* in the normal direction:

• Advance φ through *k* time steps using the standard level set methodology. That is, produce φ*<sup>n</sup>*+<sup>1</sup> from φ*<sup>n</sup>* by solving a discrete approximation to

$$\phi_t + F|\nabla \phi| = 0.$$

• Use the level sets of this time-advanced solution to reconstruct a new unsigned distance function. This is done by first computing the Voronoi interface from the level sets: this corresponds to the set of all points equidistant from at least two of the level sets from different phases, and closer to those than to any other phase. This Voronoi interface is then used to rebuild the unsigned distance function.

These two steps give the method its name: "Implicit Interface" because of the level set step for the time evolution, and "Voronoi" because of the reconstruction step used to rebuild the unsigned distance function and characteristic indicator function.
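A much-simplified stand-in for the reconstruction step can be sketched as follows: given per-phase distances, assign each point to its closest phase and approximate the rebuilt unsigned distance by half the gap between the two closest phases, so that points equidistant from two phases (the Voronoi interface) get φ = 0. The full reconstruction in [26] works with the level sets of the evolved function and an Eikonal solver; this sketch only conveys the idea.

```python
import numpy as np

def voronoi_rebuild(dists):
    """dists[k] holds the distance from each grid point to phase k.
    Returns an approximate unsigned distance to the Voronoi interface
    and the integer phase indicator."""
    D = np.stack(dists)
    chi = np.argmin(D, axis=0) + 1            # closest phase wins
    Dsorted = np.sort(D, axis=0)
    phi = 0.5 * (Dsorted[1] - Dsorted[0])     # zero exactly on the Voronoi interface
    return phi, chi
```

In one dimension with two phases seeded at 0 and 1, the Voronoi interface sits at 0.5, and `phi` measures distance to it.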

There are several things to note:


For details, see [26, 27].

**Fig. 10** Collapse of a foam cluster, visualized with thin-film interference taken from [29]

#### *4.4 Application of VIIM to Foam Dynamics*

Here, we review some current work applying VIIM to tracking the evolution of liquid foams. The vast time and space scales mean that one cannot compute over all scales simultaneously. Instead, we use a scale-separation model which allows us to divide the foam physics into three distinct stages.

We characterize the foam structures as represented by thin, interconnected membranes (lamellae) each surrounding pockets of air, and containing fluid. Membranes can share common walls, and fluid in each lamella drains toward common, shared Plateau borders that form a network of triple junctions and quadruple points. This drainage is slow, and once a membrane becomes too thin, it ruptures, causing the large air pockets to be out of macroscopic balance, which then readjust according to the equations of incompressible flow driven by interfacial forces along the lamellae.

These events can be thought of as taking place over different scales. The macroscopic air-fluid incompressible flow phase takes place over the whole domain, and evolves to an equilibrium relatively quickly. The lamellae drainage phase is slow, but takes place only over the very thin membrane walls. Rupture occurs very quickly.

In [28, 29], these three phases were used to develop a mathematical model and numerical simulation framework for foam evolution. During the macroscopic phase, a second order projection method is used to solve the incompressible Navier-Stokes equations on a rectangular mesh, with the interface transmitting its influence to the right-hand side through a mollified surface tension term. The individual lamellae are advanced under the incompressible flow by the Voronoi Implicit Interface Method, with the internal liquid transported by the method of characteristics. When the macroscopic motion has nearly ceased, the model enters a different phase and assumes that the multi-phase configuration has essentially reached equilibrium; a fourth order PDE is then solved for thin film drainage, approximated on a discretized finite element triangulation. The final phase results from membrane rupture, idealized as the instantaneous disappearance of a lamella when a user-chosen minimal thickness is reached, which redistributes the lamella liquid mass and sends the configuration into macroscopic disequilibrium.
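For the drainage stage, a generic one-dimensional lubrication-type thin-film equation *h<sub>t</sub>* = −(*h*³*h<sub>xxx</sub>*)*<sub>x</sub>* conveys the structure of such a fourth-order PDE. The explicit finite-difference step below, with periodic boundaries, is an illustrative sketch only; the actual solver in [28, 29] is a finite element discretization on the lamella triangulation.

```python
import numpy as np

def thin_film_step(h, dx, dt):
    """One explicit conservative step of h_t = -(h^3 h_xxx)_x, periodic in x.
    Fluxes are evaluated at cell faces so total mass is conserved."""
    hxx = (np.roll(h, -1) - 2.0 * h + np.roll(h, 1)) / dx**2
    hxxx_face = (np.roll(hxx, -1) - hxx) / dx      # h_xxx at face i+1/2
    h_face = 0.5 * (h + np.roll(h, -1))            # h at face i+1/2
    q = h_face**3 * hxxx_face                      # flux q = h^3 h_xxx
    return h - dt * (q - np.roll(q, 1)) / dx       # h^{n+1} = h - dt * q_x
```

Note the severe explicit stability restriction, Δ*t* ∼ Δ*x*⁴, which is one reason implicit solvers are preferred in practice.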

#### *4.5 Example Results*

An example of the complete dynamics developed in the multi-scale foam model is shown in Fig. 10, which shows the time evolution of a bubble cluster, starting from 26 separate bubbles and ending up in a single bubble. The bubble colors are computed from thin film interference determined by the computed fluid thickness in the lamellae.

#### **5 Rotary Bell Painting in the Automotive Industry**

In manufacturing settings, paints are frequently applied by an electrostatic rotary bell atomizer. Paint flows to a cup rotating at 10,000–70,000 rpm and is driven by centrifugal forces to form thin sheets and tendrils at the cup edge, where it then tears apart into dispersed droplets. Vortical structures generated by shaping air currents are key to shearing these sheets and transporting paint droplets. Advantages of this manufacturing process include the ability to paint at high volume and to achieve uniform consistency in the paint application.

Schematic of paint flow and air currents [21] in rotary bell atomizing applications

Understanding the generation, size distribution, delivery, and adhesion of these paint droplets is a problem of considerable importance. For example, (a) much of the energy involved in automotive assembly is associated with the paint process; (b) a significant amount of paint does not attach to the cars and ends up as pollutants; and (c) 10–20% of automobiles need to be repainted due to aberrations in the process.

The goal of computational modeling of the rotary bell delivery system includes


#### *5.1 Computational Challenges*

The computational challenges posed by the painting delivery mechanism are formidable. The range of physical parameters is substantial: droplet sizes range from 5 to 100 µm, the films are 10–50 µm thick, while the rotary bell diameter is on the order of centimeters. The cup edge moves at roughly 200 m/s, droplets break up over microseconds, whereas gathering droplet statistics requires milliseconds. As such, modeling requires tracking droplets across a wide range of length scales, the paint fluid mechanics is subject to high centrifugal and Coriolis forces, and the impact of highly vortical air structures on film sheeting requires careful resolution.

From a computational point of view, these translate into daunting challenges:


• Mass conservation is important: tracking and accurately accounting for small droplets is critical, since all the paint ultimately breaks into such small objects.

These translate into several modeling/mathematical/algorithmic/numerical challenges which must be tackled in order to build a workable approach, including:


#### *5.2 Level Set Methods and High-Order Multiphase Flow*

The central problem in applying level set methods is that the equations of motion need to include jump conditions at the air-paint interface, e.g., droplet boundaries. The usual level set approach of "smearing" forces to a background mesh in order to provide source terms to the incompressible Navier-Stokes equations is problematic. The droplets can be so small, and the density/viscosity jumps so large and sharp, that this mollified approach does not provide the required accuracy.

Instead, we make use of an algorithmic technology building on implicitly-defined meshes [22–24]. There are several ideas at work in this approach:

• First, two-phase incompressible flow is solved using a discontinuous Galerkin (DG) approach, with a level set method used to track paint-air interfaces.

**Fig. 11** Implicitly defined meshes using multi-phase cell merging. Left: Phase cells, defined by the intersection of each phase (blue and green) with the cells of a background Cartesian/quadtree grid, are classified according to whether they fall entirely within one phase, entirely outside the domain, or according to whether they have a small or large volume fraction. Right: Small cells are merged with neighboring cells in the same phase to form a finite element mesh composed of standard rectangular elements and elements with curved, implicitly defined boundaries. Figures adapted from [23, 24]


**Adaptivity**: The next issue stems from the fact that there is a wide range of physical space scales involved in the process. The paint comes off the bell as a very thin film, and then breaks into small bubbles; as such, computing on a uniform mesh is impractical. Instead, we employ adaptively refined meshes wherein the mesh resolution adapts to such triggers as: (a) the distance to liquid-gas interface; (b) amount of curvature of interface; (c) the thickness of droplets, tendrils, films; and (d) the proximity to bell cup. See, for example, Fig. 12.

**High performance computing**: The above calculations are complex, and the time step, spatial resolution, and physics make it impossible to model the entire bell. With a numerical framework targeting high performance computing facilities, using massively parallel MPI and OpenMP techniques, we can conduct high-resolution in-depth studies of rotary bell atomization on small wedges, about 5 degrees in angle, using tens of thousands of cores. In Fig. 13 we present one result from a large family of parameter studies. For further details, see [31].

**Fig. 12** Adaptively refined meshing in the rotary bell atomizing problem

**Fig. 13** Three-dimensional model results of rotary bell atomization for time- and spatially-varying inflow film thickness, high mesh resolution, and shaping air currents simulating nozzle inlets. In each of the nine panels, two viewpoints at the same time frame are given: a top-down perspective and a side-on view to show the vertical drifting of the shedding droplets, being pushed upwards by the shaping air currents. The liquid surface is colored copper, with the bell cup situated beneath

#### **6 Conclusions and Summary**

We have tried to review a few examples in which the interface dynamics are a profound contributor to the efficiency of the industrial processes, and have focused on the application of level set methods for interface tracking to these problems. We have considered only a few contributions and works, and refer the interested reader to the referenced review articles.

**Acknowledgements** This work was supported in part by the Applied Mathematics Program of the U.S. DOE Office of Advanced Scientific Computing Research under contract number DE-AC02- 05CH11231.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Numerical Study for Blood Flows in Thoracic Aorta**

**Hiroshi Suito, Koki Otera, Viet Q. H. Huynh, Kenji Takizawa, Naohiro Horio, and Takuya Ueda**

**Abstract** Numerical simulations for blood flows related to cardiovascular diseases are presented. Differences in vessel morphologies produce different flow characteristics, stress distributions, and ultimately different outcomes. Some examples illustrating the effects of curvature and torsion on blood flows are presented both for simplified and patient-specific simulations. The goal of this study is to understand relationships between geometrical characteristics of blood vessels and blood flow behaviors.

#### **1 Introduction**

In aging societies, cardiovascular conditions such as aortic aneurysms and aortic dissections persist as life-threatening diseases. Moreover, congenital diseases such as hypoplastic left heart syndrome constitute an important issue for our society. In recent years, patient-specific simulations have become common in the biomedical engineering field. Several mathematical viewpoints are expected to be added and to play important roles in this context. For instance, geometrical characterization of blood vessels, which vary widely among individuals, provides useful information to medical sciences. Differences in blood vessel morphology give rise to different flow

H. Suito (B) · V. Q. H. Huynh

Advanced Institute for Materials Research, Tohoku University, 2-1-1 Katahira,Aobaku, Sendai 980-8577, Japan

e-mail: hiroshi.suito@tohoku.ac.jp

K. Otera Graduate School of Environmental and Life Sciences, Okayama University, Okayama, Japan

K. Takizawa Faculty of Science and Engineering, Waseda University, Shinjuku City, Japan

#### N. Horio Department of Cardiovascular Surgery, Okayama University Hospital, Okayama, Japan

T. Ueda

Department of Diagnostic Radiology, Tohoku University Hospital, Sendai, Japan

© The Author(s) 2022 T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_11

characteristics, which cause different stress distributions and outcomes. Therefore, characterization of these vessels' respective morphologies represents an important clinical question. Our objective in this study is to understand possible mechanisms connecting geometrical characteristics and stress distributions through flow behaviors. The studies presented in this paper are parts of a CREST [1] framework supported by the Japan Science and Technology Agency in a strategic area for promoting collaboration between mathematical science and other scientific fields.

#### **2 Numerical Methods and Results**

#### *2.1 Governing Equations*

We adopted incompressible Navier–Stokes equations as governing equations.

$$\begin{cases} \frac{\partial u_i}{\partial t} + u_j \frac{\partial u_i}{\partial x_j} = -\frac{1}{\rho} \frac{\partial p}{\partial x_i} + \nu \frac{\partial}{\partial x_j} \left( \frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i} \right), \\ \frac{\partial u_j}{\partial x_j} = 0 \end{cases} \quad \text{in } \Omega \times (0, T) \text{ .} \quad (1)$$

In these equations, *t*, *u<sub>i</sub>* (*i* = 1, 2, 3), *p*, ρ, and ν respectively represent time, velocity, pressure, density, and the kinematic viscosity of blood. We assumed that blood can be regarded as a Newtonian fluid in large arteries. Numerical results obtained with different numerical methods are presented in the following subsections. A finite difference method is used in Sect. 2.2, applied to blood flows in a thoracic aorta and to flows in simple spiral tubes to examine torsion effects. A finite element method is then applied in Sect. 2.3, where fluid–structure interaction (FSI) is considered and flow mechanisms in a configuration after Norwood surgery are examined.

#### *2.2 Finite Difference Approximation*

#### **2.2.1 Visualization of Flows in a Thoracic Aorta**

Effects of curvature on flows in curved tubes have been discussed extensively in earlier studies [2–4]. When a tube is curved, a centrifugal force acts away from the center of curvature, with a magnitude depending on the axial velocity component. Consequently, a secondary flow occurs on the cross-section and forms a set of twin vortices called Dean's vortices, which play an important role in blood flow through the aortic arch, where a strong curvature exists.

Figure 1 presents streamlines that can be visualized based on numerical results obtained through an earlier study [5]. We assumed a blood vessel as a rigid body and applied finite-difference method on a centerline-fitted curvilinear coordinate system, where the centerlines and cross-sections were extracted from patient-specific CT scans of patients with aortic aneurysms. Incompressible Navier–Stokes equations were solved numerically with a boundary condition for the inflow velocity profile given by a phase-contrast MRI measurement.

Figure 1a presents streamlines through the whole thoracic aorta at the peak systolic phase. Circulation in the aneurysm is apparent. Figure 1b shows the Dean's vortices on the aortic arch superimposed on the main axial flow. In Fig. 1c, a spiral flow is apparent in the descending aorta.

Helicity, *u* · (∇ × *u*), identifies swirling flow regions, with its sign indicating the sense of rotation. Figure 2a depicts helicity isosurfaces of a positive and a negative value, which show Dean's vortices generated at the aortic arch and subsequently flowing down to the descending aorta. In Fig. 2b, an isosurface of the second largest eigenvalue λ<sub>2</sub> of *S*<sup>2</sup> + Ω<sup>2</sup>, where *S* and Ω respectively represent the symmetric and antisymmetric parts of the velocity gradient tensor, also shows a swirling flow region [6]. Enstrophy, |∇ × *u*|<sup>2</sup>, exhibits the strength of vorticity in Fig. 2c. In Fig. 2b, c, the colors of the isosurfaces show λ<sub>2</sub> values.
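As a pointwise illustration of these quantities, the following sketch (ours, not from the chapter; the rigid-body-rotation test field is an assumed example) evaluates helicity, enstrophy, and the λ<sub>2</sub> indicator from a velocity vector and the velocity gradient tensor:

```python
import numpy as np

def vortex_indicators(u, J):
    """Helicity, enstrophy, and lambda_2 at a point, given the velocity u
    and the velocity gradient tensor J[i, j] = du_i/dx_j."""
    S = 0.5 * (J + J.T)                      # symmetric part (strain rate)
    W = 0.5 * (J - J.T)                      # antisymmetric part (rotation)
    omega = np.array([J[2, 1] - J[1, 2],     # vorticity, curl(u)
                      J[0, 2] - J[2, 0],
                      J[1, 0] - J[0, 1]])
    helicity = float(u @ omega)              # u . (curl u)
    enstrophy = float(omega @ omega)         # |curl u|^2
    lam2 = np.linalg.eigvalsh(S @ S + W @ W)[1]  # middle (second) eigenvalue
    return helicity, enstrophy, lam2

# Rigid-body rotation u = (-y, x, 0): a textbook swirling flow.
J = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 0.0]])
u = np.array([-0.5, 0.2, 0.0])  # velocity at the point (x, y, z) = (0.2, 0.5, 0)
h, e, l2 = vortex_indicators(u, J)
# curl(u) = (0, 0, 2): enstrophy is 4, helicity vanishes for this field,
# and lam2 = -1 < 0 flags the point as lying in a vortex core.
```

In a finite-difference computation the same evaluation would simply be applied at every grid point, with `J` assembled from discrete velocity derivatives.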

#### **2.2.2 Effects of Torsion in Simple Spiral Tubes**

We also examined the effects of torsion using a pulsating flow in simple spiral tubes, as shown in [5]. The torsion of a three-dimensional curve is defined through the Frenet–Serret formula shown below.

**Fig. 1** Instantaneous streamlines

**Fig. 2** Several fluid dynamics quantities

**Fig. 3** Secondary flows in a zero-torsion tube

$$
\frac{d}{ds} \begin{pmatrix} t \\ n \\ b \end{pmatrix} = \begin{pmatrix} 0 & \chi & 0 \\ -\chi & 0 & \tau \\ 0 & -\tau & 0 \end{pmatrix} \begin{pmatrix} t \\ n \\ b \end{pmatrix} . \tag{2}
$$

Therein, χ and τ respectively represent curvature and torsion, where *t*, *n*, and *b* respectively denote the tangential, normal, and bi-normal vectors.
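For a discrete centerline, the curvature and torsion appearing in (2) can be estimated with the standard formulas χ = |**r**′ × **r**″|/|**r**′|³ and τ = ((**r**′ × **r**″) · **r**‴)/|**r**′ × **r**″|². A sketch (ours, with an assumed helical test curve, for which χ = *a*/(*a*² + *b*²) and τ = *b*/(*a*² + *b*²)):

```python
import numpy as np

def curvature_torsion(r, dt):
    """Pointwise curvature and torsion of a discrete curve r(t) (n x 3),
    using central finite differences for the derivatives."""
    d1 = np.gradient(r, dt, axis=0)
    d2 = np.gradient(d1, dt, axis=0)
    d3 = np.gradient(d2, dt, axis=0)
    c = np.cross(d1, d2)                               # r' x r''
    chi = np.linalg.norm(c, axis=1) / np.linalg.norm(d1, axis=1) ** 3
    tor = np.einsum('ij,ij->i', c, d3) / np.einsum('ij,ij->i', c, c)
    return chi, tor

# Helix (a cos t, a sin t, b t): chi = a/(a^2 + b^2), tau = b/(a^2 + b^2).
a, b = 1.0, 0.5
t = np.linspace(0.0, 4.0 * np.pi, 2001)
r = np.stack([a * np.cos(t), a * np.sin(t), b * t], axis=1)
chi, tor = curvature_torsion(r, t[1] - t[0])
mid = len(t) // 2
# Away from the endpoints, chi[mid] is close to 0.8 and tor[mid] to 0.4.
```

Applied to centerlines extracted from CT scans, such estimates give the χ and τ distributions that characterize patient-specific vessel geometry.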

Figures 3 and 4 portray secondary flows, obtained by subtracting the main axial flow from the total flow velocities at the peak systolic, late systolic, and late diastolic phases, respectively, for the zero-torsion and nonzero-torsion cases. When the torsion is zero, the secondary flow is invariably symmetric. However, when the torsion is not zero, merging phenomena occur; one large vortex persists in the diastolic phase. This difference brings about differences in the torque exerted on the vessel walls.

**Fig. 4** Secondary flows in a nonzero-torsion tube

#### *2.3 Finite Element Approximation*

#### **2.3.1 Torsion Effects on Flows in the Thoracic Aorta**

Next we consider fluid–structure interaction (FSI) to examine torsion effects using patient-specific morphologies [7]. Here, the FSI analysis is handled with the Sequentially-Coupled Arterial FSI (SCAFSI) technique [8] because the class of FSI problems considered here has temporally periodic dynamics. The fluid mechanics equations are solved using the Space–Time Variational Multiscale (ST-VMS) method [9–11]. First, we carry out a structural mechanics computation to assess arterial deformation under an observed blood pressure profile in a cardiac cycle. Then we apply a fluid mechanics computation over a mesh that moves to follow the lumen as the artery deforms. These steps are iterated, with the stress obtained in the fluid mechanics computation used for the next structural mechanics computation. To assess torsion effects, a torsion-free model geometry is generated by projecting the original centerline onto its averaged plane of curvature, as presented in Fig. 5.

Figure 6 presents secondary flows. On the left-hand side (projected shape), symmetric Dean's vortices are apparent, although they are not visible on the right-hand side (original shape), similarly to the simple spiral tubes in Fig. 4.

Next we compare the wall shear stress (WSS) patterns corresponding to the projected and the original geometries to examine the influence of torsion. Figure 7 presents WSS at the peak systolic phase. In the projected, torsion-free shape, a high-WSS region is apparent at the aortic arch, resulting from the strong Dean's twin vortices, whereas it is not apparent there in the original shape with torsion.

#### **2.3.2 Flow Mechanism in Morphology After Norwood Surgery**

This subsection presents examples of patient-specific blood flow simulations at an anastomosis site after Norwood surgery for hypoplastic left heart syndrome. Our target is the geometry surrounding an anastomosis site of the aortic arch and pulmonary artery after Norwood surgery, which is one step taken during surgeries for

**Fig. 5** Projected and original shapes

**Fig. 6** Secondary flows in projected and original shapes

hypoplastic left heart syndrome. The target geometry was extracted from a CT scan, with boundary conditions obtained from ultrasound measurements. Here, we again adopt the rigid-body assumption, i.e., we do not consider fluid–structure interactions. The SUPG/PSPG-stabilized finite element formulation with P1/P1 elements is used.

Figure 8a portrays instantaneous streamlines at the peak systolic phase, whereas Fig. 8b depicts the energy-dissipation distribution. Energy dissipation is a clinically important quantity because it imposes a load on the heart directly [12]. In Fig. 8b, high energy dissipation is apparent at the anastomosis site, which can be understood

**Fig. 8** Streamlines at an anastomosis site after Norwood surgery

straightforwardly because the velocity is extremely high there. Although high energy dissipation is also apparent in the descending aorta, it cannot be explained so straightforwardly. This dissipation apparently derives from the spiral flow there, which is generated at the aortic arch immediately after blood passes out of the narrow anastomosis channel, as shown in Fig. 9. Here, a relation can be found between morphology and energy-dissipation patterns through flow structures.
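For an incompressible Newtonian fluid, the pointwise viscous dissipation rate is Φ = 2μ *S* : *S*, with *S* the strain-rate tensor; integrating Φ over the domain gives the kind of energy dissipation shown in Fig. 8b. A minimal sketch (ours, with assumed illustrative parameter values; μ ≈ 3.5 mPa·s is a commonly quoted blood viscosity):

```python
import numpy as np

def dissipation_rate(J, mu):
    """Pointwise viscous dissipation Phi = 2 * mu * S : S for an
    incompressible Newtonian fluid, given J[i, j] = du_i/dx_j."""
    S = 0.5 * (J + J.T)
    return 2.0 * mu * np.sum(S * S)

# Simple shear u = (gdot * y, 0, 0): the exact dissipation is mu * gdot^2.
mu, gdot = 3.5e-3, 100.0  # Pa*s, 1/s (assumed values for illustration)
J = np.array([[0.0, gdot, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
phi = dissipation_rate(J, mu)  # = mu * gdot**2 = 35.0 W/m^3
```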

**Fig. 9** Front and back views of streamlines

#### **3 Conclusions**

We have presented some relations between the geometrical characteristics of blood vessels and flow behaviors. These relations are expected to explain how and why vessel morphologies affect WSS distributions and energy dissipation. As described in Sect. 2.2, vessel curvature induces Dean's vortices as a secondary flow through centrifugal force, thereby creating strong WSS there. Moreover, Dean's vortices show different behaviors depending on the existence of torsion. In the example of a morphology after Norwood surgery, the energy dissipation pattern in the descending aorta can be explained through flow structures. As a next step, predictions based on the geometrical characteristics of blood vessels are expected to contribute to better risk assessment and surgery planning through mathematical modeling and numerical simulation.

**Acknowledgements** This work was supported by JST CREST Grant Number JPMJCR15D1, Japan.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **An Iterative Thresholding Method for Topology Optimization for the Navier–Stokes Flow**

**Haitao Leng, Dong Wang, Huangxin Chen, and Xiao-Ping Wang**

**Abstract** We develop an efficient iterative thresholding method for topology optimization for the Navier–Stokes flow. The method is proposed to minimize an objective energy functional which consists of the potential power in the fluid and a fluid-solid interface perimeter penalization. The perimeter is approximated by a nonlocal energy, subject to a fluid volume constraint and the incompressible Navier–Stokes equation. The method is an iterative scheme which alternates two steps: (1) solving a system containing the Brinkman equation and an adjoint system, and (2) convolution and thresholding. Various numerical experiments in both two and three dimensions are given to show the performance of the proposed method.

#### **1 Introduction**

Topology optimization was originally developed for the optimal design in structural mechanics ([3, 4, 6]). Nowadays it has attracted much attention due to its wide application in the fields of industry problems such as optimization of transport vehicles, biomechanical structure, etc. So far, the density method [5, 31] has been well devel-

D. Wang

School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, Guangdong, China e-mail: wangdong@cuhk.edu.cn

H. Chen

X.-P. Wang (B) Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China e-mail: mawang@ust.hk

© The Author(s) 2022 T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_12

H. Leng

School of Mathematical Sciences, South China Normal University, Guangzhou 510631, Guangdong, China e-mail: htleng@m.scnu.edu.cn

School of Mathematical Sciences and Fujian Provincial Key Laboratory on Mathematical Modeling and High Performance Scientific Computing, Xiamen University, Fujian 361005, China e-mail: chx@xmu.edu.cn

oped for implementation of topology optimization. It was originally developed for the design of stiffness and compliant mechanism [32, 33] and has been applied in various physical problems such as acoustics, electromagnetics, fluid flow, and thermal problems [7, 11, 15, 24, 34]. In fluid mechanics, the concept of density method was first developed by Borrvall and Petersson [7] for topology optimization for the Stokes flow. Then it was extended to the Darcy-Stokes flow [21, 43], the Navier– Stokes flow [12, 18, 20, 27, 36, 47], the non-Newtonian flow [30], the turbulent flow [13], and more complicated fluidic devices [1, 25, 26]. Approaches using the topological sensitivity analysis (providing an asymptotic expansion of a shape function with respect to the size of a small inclusion inserted inside the domain) can also be used for shape optimization for Stokes flows [22] and Navier–Stokes flows [2]. Generally, the discrete optimization problem for the topology optimization was solved by the method of moving asymptotes (MMA) [35], level set based methods [8, 36, 47] and phase field based methods [18].

The threshold dynamics method developed by Merriman, Bence and Osher (MBO) [23] is an efficient method for approximating the mean curvature flow. In this method, the interface is implicitly represented by the characteristic functions of the domains. It alternates two simple steps: convolution between the characteristic functions and a heat kernel and point-wise thresholding. Recently, Esedoglu and Otto generalized the original MBO method to multiphase problems with arbitrary surface tensions [17]. The method has attracted much attention and it has been extended to many other applications, such as image processing [16, 37, 39], wetting dynamics [38, 44, 45], and target-valued problems [28, 29, 40–42].

In this paper we extend the iterative thresholding method developed in [9] to topology optimization for the Navier–Stokes flow. The porous medium approach based on the density method is utilized in the algorithm, and a Darcy term is introduced into the Navier–Stokes equation to "interpolate" between the Navier–Stokes equation in the fluid region and the Darcy flow through a porous medium (a weakened solid region with low permeability) (i.e., Brinkman equation). Then the total energy consists of the potential power in the fluid, the perimeter regularization, and a Darcy term. The perimeter term is computed based on the convolution between the heat kernel and the characteristic functions of regions. There are two steps per iteration in the proposed algorithm. The first step is to solve the Brinkman equation and an adjoint system, which can both be efficiently solved using the mixed finite element method. The second step is to update the fluid-solid regions by a simple convolution and thresholding step. The convolution can be efficiently computed on a uniform grid by the fast Fourier transform (FFT) with the computational complexity *O*(*N* log *N*). A variety of numerical experiments in both two and three dimensions are shown to verify the efficiency of the proposed algorithm. In addition, numerical results indicate that the total energy decays.

The paper is organized as follows. In Sect. 2, we introduce the mathematical model, the approximation to the model, and the derivation of the iterative thresholding method. The numerical implementation is discussed in Sect. 3. We verify the performance through extensive numerical experiments in Sect. 4. We draw some conclusions in Sect. 5.

#### **2 Derivation of the Method**

#### *2.1 The Mathematical Model*

In this section, we consider the mathematical model for topology optimization for the Navier–Stokes flow. Denote by Ω ⊂ ℝ<sup>d</sup> (*d* = 2, 3) the computational domain, which is fixed throughout the optimization, and assume that Ω is a bounded Lipschitz domain with outer unit normal **n** such that ℝ<sup>d</sup> \ Ω is connected. Furthermore, we denote by Ω<sub>0</sub> ⊂ Ω the domain of the fluid, which is a Caccioppoli set whose boundary is measurable and has an (at least locally) finite measure, and by Ω \ Ω<sub>0</sub> the domain of the solid. Our goal is to determine an optimal shape of Ω<sub>0</sub> that minimizes the following objective functional, consisting of the total potential power and a perimeter regularization term,

$$\min_{(\Omega_0,\mathbf{u})} J_0(\Omega_0,\mathbf{u}) = \int_{\Omega} \frac{\mu}{2} |\nabla \mathbf{u}|^2 d\mathbf{x} + \gamma |\Gamma| \tag{1}$$

subject to

$$\nabla \cdot \mathbf{u} = 0,\ \text{ in }\ \Omega\_0,\tag{2a}$$

$$(\mathbf{u} \cdot \nabla)\mathbf{u} + \nabla p - \nabla \cdot (\mu \nabla \mathbf{u}) = 0, \quad \text{in } \Omega\_0,\tag{2b}$$

$$\mathbf{u} = 0,\quad\text{in }\Omega \backslash \overline{\Omega}\_0 \quad\text{and on }\partial\Omega\_0,\tag{2c}$$

$$\mathbf{u}|\_{\partial\Omega} = \mathbf{u}\_D,\quad\text{on}\quad\partial\Omega,\tag{2d}$$

$$|\Omega_0| = \beta|\Omega| \text{ with a fixed parameter } \beta \in (0,1). \tag{2e}$$

Here, **u** : Ω → ℝ<sup>d</sup> is the velocity, μ is the viscosity of the fluid, *p* is the pressure, **u**<sub>D</sub> : ∂Ω → ℝ<sup>d</sup> is a given function, |Γ| is the perimeter of the boundary Γ (i.e., Γ = ∂Ω<sub>0</sub>), and γ > 0 is a weighting parameter.

#### *2.2 The Relaxation and Approximation of the Problem*

Since the goal is to minimize the objective functional (1) subject to the constraints (2) with respect to the fluid-solid interface, it is necessary to have a proper representation of that interface. Motivated by [9, 17, 37, 44], in this paper we use the characteristic function χ<sub>1</sub> of the fluid domain Ω<sub>0</sub> to implicitly represent the fluid-solid interface, i.e.,

$$\chi\_1(\mathbf{x}) := \begin{cases} 1, & \text{if } \mathbf{x} \in \Omega\_0, \\ 0, & \text{otherwise.} \end{cases}$$

We denote by χ<sub>2</sub>(**x**) = 1 − χ<sub>1</sub>(**x**) the characteristic function of Ω \ Ω<sub>0</sub>. Then, the interface is implicitly represented by χ<sub>1</sub> and χ<sub>2</sub>. Under this representation, |Γ| can be approximated by

$$|\Gamma| \approx \sqrt{\frac{\pi}{\tau}} \int_{\Omega} \chi_1 G_{\tau} * \chi_2 \, d\mathbf{x} = \sqrt{\frac{\pi}{\tau}} \int_{\Omega} \chi_1 G_{\tau} * (1 - \chi_1) \, d\mathbf{x},\tag{3}$$

where $G_{\tau}(\mathbf{x}) = \frac{1}{(4\pi\tau)^{d/2}} \exp\left(-\frac{|\mathbf{x}|^2}{4\tau}\right)$ (*d* = 2, 3) is the Gaussian kernel and ∗ denotes the convolution [17].
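On a periodic grid, the nonlocal perimeter (3) is straightforward to evaluate by computing G<sub>τ</sub> ∗ χ<sub>2</sub> spectrally, since the Fourier symbol of G<sub>τ</sub> is exp(−τ|ξ|²). A sketch (ours, with assumed grid and τ values, not from the paper) for a disk, whose exact perimeter is π/2:

```python
import numpy as np

# Assumed discretization parameters (for illustration only).
N, tau = 256, 4.0e-4
h = 1.0 / N

# Characteristic function chi_1 of a disk of radius 0.25 in the unit box.
x = (np.arange(N) + 0.5) * h
X, Y = np.meshgrid(x, x, indexing='ij')
chi1 = ((X - 0.5) ** 2 + (Y - 0.5) ** 2 < 0.25 ** 2).astype(float)

# G_tau * chi_2 via FFT, using the exact Gaussian symbol exp(-tau |xi|^2).
k = 2.0 * np.pi * np.fft.fftfreq(N, d=h)
KX, KY = np.meshgrid(k, k, indexing='ij')
conv = np.fft.ifft2(np.fft.fft2(1.0 - chi1) * np.exp(-tau * (KX**2 + KY**2))).real

# sqrt(pi/tau) * integral of chi_1 (G_tau * chi_2), cf. (3).
perimeter = np.sqrt(np.pi / tau) * np.sum(chi1 * conv) * h ** 2
# perimeter comes out close to pi/2 for this disk.
```

The spectral convolution costs *O*(*N* log *N*) per evaluation, which is what keeps the convolution-and-thresholding step of the method cheap.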

Similar to [9], to avoid solving the Navier–Stokes equation in a changing domain at each iteration, the porous medium approach [18] is utilized to "interpolate" between the Navier–Stokes equation in the fluid region (i.e., {**x** | χ<sub>1</sub>(**x**) = 1}) and **u** = 0 in the solid region (i.e., {**x** | χ<sub>2</sub>(**x**) = 1}) by introducing an additional penalization term, α(**x**)**u**, as follows:

$$
\nabla \cdot \mathbf{u} = 0,\text{ in }\Omega,\tag{4a}
$$

$$(\mathbf{u} \cdot \nabla)\mathbf{u} + \nabla p - \nabla \cdot (\mu \nabla \mathbf{u}) + \alpha(\mathbf{x})\mathbf{u} = 0,\text{ in } \Omega,\tag{4b}$$

$$\mathbf{u}|\_{\partial\Omega} = \mathbf{u}\_D,\quad\text{on}\quad\partial\Omega,\tag{4c}$$

$$\int\_{\Omega} \chi\_1 d\mathbf{x} = \beta |\Omega|. \tag{4d}$$

Accordingly, the original objective functional (1) can be approximated by adding a Darcy penalty term as follows:

$$J^{\tau}(\chi, \mathbf{u}) = \int_{\Omega} \left( \frac{\mu}{2} |\nabla \mathbf{u}|^2 + \frac{\alpha(\mathbf{x})}{2} |\mathbf{u}|^2 \right) d\mathbf{x} + \gamma \sqrt{\frac{\pi}{\tau}} \int_{\Omega} \chi G_{\tau} * (1 - \chi) \, d\mathbf{x} \tag{5}$$

where χ denotes the characteristic function of the solid domain, i.e., χ = χ2.

Now, we discuss the computation of α in the current representation of the interface (i.e., using characteristic functions). Theoretically, α should be large enough in the solid domain to penalize the condition **u** = 0 and close to 0 in the fluid domain so that **u** satisfies the Navier–Stokes equation. For numerical considerations, we relax α to a smooth function which undergoes rapid changes across the interface. We use the 0.5 level set of φ = *G*<sub>τ</sub> ∗ χ to approximate the position of the interface; such a φ is a smooth function taking values in [0, 1] that changes from 0 to 1 across an *O*(√τ) transition region. Thus, we compute α by

$$
\alpha(\mathbf{x}) = \bar{\alpha}\varphi = \bar{\alpha}G\_{\mathfrak{r}} \* \chi \tag{6}
$$

where ᾱ is a sufficiently large constant; thus, by the porous medium approach, we can solve the system (4) in the fixed domain Ω.
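A hypothetical one-dimensional check of this construction (ours, not from the paper): for a half-line solid χ = 𝟙<sub>{x>0}</sub>, the convolution G<sub>τ</sub> ∗ χ has a closed form, equals 0.5 exactly on the interface, and passes from ≈0 to ≈1 across an *O*(√τ) layer:

```python
from math import erf, sqrt

def phi_halfline(x, tau):
    """G_tau * chi in 1D for chi = 1_{x > 0}:
    0.5 * (1 + erf(x / (2 * sqrt(tau))))."""
    return 0.5 * (1.0 + erf(x / (2.0 * sqrt(tau))))

tau = 1.0e-4  # assumed value for illustration
# phi_halfline(0.0, tau) = 0.5 on the interface; the 0.05-to-0.95 transition
# occupies a layer of width about 4.65 * sqrt(tau), so alpha = alpha_bar * phi
# is close to alpha_bar in the solid and close to 0 in the fluid.
```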

Finally, using (6), we arrive at the following formulation of the problem:

$$\min_{\chi, \mathbf{u}} J^{\tau}(\chi, \mathbf{u}) = \int_{\Omega} \left( \frac{\mu}{2} |\nabla \mathbf{u}|^2 + \frac{\bar{\alpha}}{2} (G_{\tau} * \chi) |\mathbf{u}|^2 + \gamma \sqrt{\frac{\pi}{\tau}} \chi G_{\tau} * (1 - \chi) \right) d\mathbf{x} \quad (7)$$

subject to

$$\chi \in \mathcal{B} := \{ \chi \in BV(\Omega) \mid \chi(\mathbf{x}) \in \{0, 1\} \ a.e., \text{ and } \int_{\Omega} (1 - \chi) d\mathbf{x} = \beta |\Omega| \} \quad (8a)$$

$$
\nabla \cdot \mathbf{u} = 0, \quad \text{in } \Omega,\tag{8b}
$$

$$(\mathbf{u} \cdot \nabla)\mathbf{u} + \nabla p - \nabla \cdot (\mu \nabla \mathbf{u}) + (\bar{\alpha} G_{\tau} * \chi)\mathbf{u} = 0, \quad \text{in } \Omega,\tag{8c}$$

$$\mathbf{u}|\_{\partial\Omega} = \mathbf{u}\_D,\quad\text{on}\quad\partial\Omega.\tag{8d}$$

#### *2.3 Derivation of the Method*

In this section, we will derive an iterative scheme to find the approximate solution for (7) and (8). Denote

$$\mathbf{U} := \{ \mathbf{u} \in H^1(\Omega) | \nabla \cdot \mathbf{u} = 0, \mathbf{u}|\_{\partial \Omega} = \mathbf{u}\_D \} \quad \text{and} \quad \mathbf{V} := \{ \mathbf{v} \in H^1\_0(\Omega) | \nabla \cdot \mathbf{v} = 0 \}.$$

To derive the first order necessary optimality conditions for a solution (χ<sub>τ</sub>, **u**<sub>τ</sub>) of (7) and (8), we introduce the Lagrangian *E*<sup>τ</sup> : *B* × **U** × **V** → ℝ by

$$\mathcal{E}^{\tau}(\chi, \mathbf{u}, \tilde{\mathbf{u}}) := J^{\tau}(\chi, \mathbf{u}) + \int_{\Omega} (\mathbf{u} \cdot \nabla) \mathbf{u} \cdot \tilde{\mathbf{u}} + \mu \nabla \mathbf{u} \cdot \nabla \tilde{\mathbf{u}} + (\bar{\alpha} G_{\tau} * \chi) \mathbf{u} \cdot \tilde{\mathbf{u}} \, d\mathbf{x}$$

where the pressure term is not shown because ∇ · **ũ** = 0. The variational inequality is formally derived by

$$\left\langle \frac{\delta \mathcal{E}^{\tau}}{\delta \chi} (\chi\_{\tau}, \mathbf{u}\_{\tau}, \tilde{\mathbf{u}}\_{\tau}), \chi - \chi\_{\tau} \right\rangle \ge 0, \quad \forall \, \chi \in \mathcal{B} \tag{9}$$

and the adjoint equation can be deduced by

$$\left\langle \frac{\delta \mathcal{E}^{\tau}}{\delta \mathbf{u}} (\chi\_{\tau}, \mathbf{u}\_{\tau}, \tilde{\mathbf{u}}\_{\tau}), \mathbf{v} \right\rangle = 0, \quad \forall \, \mathbf{v} \in \mathbf{V} \tag{10}$$

where ⟨·, ·⟩ denotes the *L*<sup>2</sup>-inner product.


To be specific, assume (χ<sub>τ</sub>, **u**<sub>τ</sub>) ∈ *B* × **U** is a minimizer of (7) and (8); then the following inequality is fulfilled:

$$\left\langle \frac{\bar{\alpha}}{2} G_{\tau} * |\mathbf{u}_{\tau}|^{2} + \gamma \sqrt{\frac{\pi}{\tau}} G_{\tau} * (1 - 2\chi_{\tau}) + \bar{\alpha} G_{\tau} * (\mathbf{u}_{\tau} \cdot \tilde{\mathbf{u}}_{\tau}), \chi - \chi_{\tau} \right\rangle \ge 0, \ \forall \ \chi \in \mathcal{B} \tag{11}$$

where **ũ**<sub>τ</sub> is the solution to the following adjoint system at (**u**<sub>τ</sub>, χ<sub>τ</sub>):

$$- (\mathbf{u}_{\tau} \cdot \nabla) \mathbf{u}_{\tau} - (\mathbf{u}_{\tau} \cdot \nabla) \tilde{\mathbf{u}} + (\nabla \mathbf{u}_{\tau})^T \tilde{\mathbf{u}} + \nabla \tilde{p} - \nabla \cdot (\mu \nabla \tilde{\mathbf{u}}) + (\bar{\alpha} G_{\tau} * \chi_{\tau}) \tilde{\mathbf{u}} = 0,\tag{12a}$$

$$\nabla \cdot \tilde{\mathbf{u}} = 0,\tag{12b}$$

$$
\tilde{\mathbf{u}}|\_{\partial\Omega} = \mathbf{0}.\tag{12c}
$$

Here, *p̃* is the pressure associated with the adjoint system.

Based on the first order necessary optimality condition, to solve (7) and (8), we use an iterative scheme to decrease the value of the objective functional with **u** satisfying (8) and **ũ** satisfying (12). Assume the *k*-th iterate χ<sup>k</sup> is given; we compute (**u**<sup>k</sup>, **ũ**<sup>k</sup>) by solving the following system

$$\begin{cases} \nabla \cdot \mathbf{u} = 0, \\ \nabla \cdot \tilde{\mathbf{u}} = 0, \\ (\mathbf{u} \cdot \nabla)\mathbf{u} + \nabla p - \nabla \cdot (\mu \nabla \mathbf{u}) + (\bar{\alpha} G_{\tau} * \chi^{k}) \mathbf{u} = \mathbf{f}, \\ -(\mathbf{u} \cdot \nabla)\mathbf{u} - (\mathbf{u} \cdot \nabla)\tilde{\mathbf{u}} + (\nabla \mathbf{u})^{T} \tilde{\mathbf{u}} + \nabla \tilde{p} - \nabla \cdot (\mu \nabla \tilde{\mathbf{u}}) + (\bar{\alpha} G_{\tau} * \chi^{k}) \tilde{\mathbf{u}} = 0, \\ \mathbf{u}|_{\partial \Omega} = \mathbf{u}_{D}, \\ \tilde{\mathbf{u}}|_{\partial \Omega} = 0. \end{cases} \tag{13}$$

After (**u**<sup>k</sup>, **ũ**<sup>k</sup>) are solved from (13), χ<sup>k+1</sup> is updated through

$$\chi^{k+1} = \arg\min\_{\chi \in \mathcal{B}} \mathcal{E}^{\tau}(\chi, \mathbf{u}^k, \tilde{\mathbf{u}}^k). \tag{14}$$

Rewrite the objective functional *E*<sup>τ</sup>(χ, **u**<sup>k</sup>, **ũ**<sup>k</sup>) as *Ẽ*<sup>τ,k</sup>(χ):

$$\begin{split} \tilde{\mathcal{E}}^{\tau,k}(\chi) := \mathcal{E}^{\tau}(\chi, \mathbf{u}^{k}, \tilde{\mathbf{u}}^{k}) &= \int_{\Omega} \frac{\bar{\alpha}}{2} \chi G_{\tau} * |\mathbf{u}^{k}|^{2} d\mathbf{x} + \gamma \sqrt{\frac{\pi}{\tau}} \int_{\Omega} \chi G_{\tau} * (1 - \chi) d\mathbf{x} \\ &\quad + \int_{\Omega} \bar{\alpha} \chi G_{\tau} * (\mathbf{u}^{k} \cdot \tilde{\mathbf{u}}^{k}) d\mathbf{x} + \mathcal{N}(\mathbf{u}^{k}, \tilde{\mathbf{u}}^{k}), \end{split} \tag{15}$$

where $\mathcal{N}(\mathbf{u}^k, \tilde{\mathbf{u}}^k)$ contains all the other terms in *E*<sup>τ</sup>(χ, **u**<sup>k</sup>, **ũ**<sup>k</sup>), which are independent of χ. The only remaining problem is to minimize *Ẽ*<sup>τ,k</sup>(χ) over *B*, i.e., to find χ<sup>k+1</sup> such that

$$\chi^{k+1} = \arg\min\_{\chi \in \mathcal{B}} \tilde{\mathcal{E}}^{\tau,k}(\chi). \tag{16}$$

We first relax (16) to a problem defined on a convex admissible set by finding *r*<sup>k+1</sup> such that

$$r^{k+1} = \arg\min\_{r \in \mathcal{H}} \tilde{\mathcal{E}}^{\mathfrak{r},k}(r), \tag{17}$$

where *H* is the convex hull of *B*:

$$\mathcal{H} := \{ r \in BV(\Omega) \mid r(\mathbf{x}) \in [0, 1] \ a.e., \text{ and} \int_{\Omega} r \, d\mathbf{x} = (1 - \beta)|\Omega| \}.$$

The following lemma holds similarly to that in [9]; we refer to [9] for the details of the proof. Thus, we can solve the relaxed problem (17) instead of (16).

**Lemma 2.1** *Let* $\mathbf{u} \in H^1_{\mathbf{u}_D}(\Omega, \mathbb{R}^d)$ *be a given function. Then we have*

$$\arg\min_{r\in\mathcal{H}}\tilde{\mathcal{E}}^{\tau,k}(r) = \arg\min_{r\in\mathcal{B}}\tilde{\mathcal{E}}^{\tau,k}(r).$$

Next we show that (17) can be solved by a thresholding step. Because $\tilde{\mathcal{E}}^{\tau,k}(r)$ is quadratic and concave in *r*, we first linearize it at *r*<sup>k</sup>:

$$
\tilde{\mathcal{E}}^{\tau,k}(r) \approx \tilde{\mathcal{E}}^{\tau,k}(r^k) + \mathcal{L}\_{r^k}^{\tau,k}(r - r^k),
$$

where

$$\begin{aligned} \mathcal{L}_{r^k}^{\tau,k}(r) &= \int_{\Omega} \left( \gamma \sqrt{\frac{\pi}{\tau}} r \, G_{\tau} * (1 - 2r^k) + r \frac{\bar{\alpha}}{2} G_{\tau} * |\mathbf{u}^k|^2 + r \bar{\alpha} G_{\tau} * (\mathbf{u}^k \cdot \tilde{\mathbf{u}}^k) \right) d\mathbf{x} \\ &= \int_{\Omega} r \phi \, d\mathbf{x} \end{aligned}$$

where $\phi = \gamma \sqrt{\frac{\pi}{\tau}}\, G_{\tau} * (1 - 2r^k) + \frac{\bar{\alpha}}{2} G_{\tau} * |\mathbf{u}^k|^2 + \bar{\alpha} G_{\tau} * (\mathbf{u}^k \cdot \tilde{\mathbf{u}}^k)$. Then (17) can be approximately solved by

$$\chi^{k+1} = \arg\min\_{r \in \mathcal{H}} \mathcal{L}\_{r^k}^{\tau,k}(r) = \arg\min\_{r \in \mathcal{H}} \int\_{\Omega} r\phi d\mathbf{x}.\tag{18}$$

Then we have the following lemma, as in [9]; one can refer to [9] for the details of the proof.

**Lemma 2.2** *Let* $\phi = \gamma \sqrt{\frac{\pi}{\tau}}\, G_{\tau} * (1 - 2\chi^k) + \frac{\bar{\alpha}}{2} G_{\tau} * |\mathbf{u}^k|^2 + \bar{\alpha} G_{\tau} * (\mathbf{u}^k \cdot \tilde{\mathbf{u}}^k)$ *and*

$$D_2^{k+1} = \{\mathbf{x} \in \Omega \mid \phi < \delta\}$$

*for some* δ *such that* $|D_2^{k+1}| = (1 - \beta)|\Omega|$*. Then, with* $\chi^{k+1} = \chi_{D_2^{k+1}}$*, we have*

$$\mathcal{L}_{\chi^k}^{\tau,k}(\chi^{k+1}) \le \mathcal{L}_{\chi^k}^{\tau,k}(\chi^k) \text{ for all } \tau > 0.$$

The above lemma shows that (18) can be solved by

$$\begin{cases} \chi^{k+1}(\mathbf{x}) = 1, & \text{if } \phi(\mathbf{x}) < \delta, \\ \chi^{k+1}(\mathbf{x}) = 0, & \text{otherwise}, \end{cases}$$

where δ is chosen as a constant such that $\int_{\Omega} \chi^{k+1} d\mathbf{x} = (1 - \beta)|\Omega|$.

To determine the value of δ, one can treat $\int_{\Omega} \chi^{k+1} d\mathbf{x} - (1 - \beta)|\Omega|$ as a function of δ (i.e., $f(\delta) = \int_{\Omega} \chi^{k+1} d\mathbf{x} - (1 - \beta)|\Omega|$) and use an iterative method (e.g., the bisection method or Newton's method) to find the root of *f*(δ) = 0. For a uniform discretization of Ω, a more efficient approach is the quick-sort technique proposed in [44]. Assume we have a uniform discretization of Ω with grid size *h*; then $\int_{\Omega} \chi^{k+1} d\mathbf{x}$ can be approximated by *mh*<sup>d</sup>, where *m* is the number of grid points at which χ<sup>k+1</sup> = 1. Assuming (1 − β)|Ω| is approximated by *Mh*<sup>d</sup>, we sort the values of φ in ascending order and simply set χ<sup>k+1</sup> = 1 at the first *M* points.
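The sorting step above can be sketched as follows (our illustration; the array of φ values and the target count *M* of points with χ<sup>k+1</sup> = 1 are assumed inputs):

```python
import numpy as np

def threshold_with_volume(phi, M):
    """Volume-preserving thresholding: chi = 1 at the M grid points with
    the smallest values of phi (the threshold delta is then implicit)."""
    chi = np.zeros(phi.size)
    order = np.argsort(phi, axis=None)  # indices of phi in ascending order
    chi[order[:M]] = 1.0
    return chi.reshape(phi.shape)

phi = np.array([[0.3, -1.2],
                [0.7,  0.1]])
chi_next = threshold_with_volume(phi, 2)
# The two smallest values (-1.2 and 0.1) receive chi = 1:
# chi_next = [[0., 1.], [0., 1.]]
```

Since only a sort and an index assignment are needed, this step costs *O*(*N* log *N*) and preserves the volume constraint exactly at every iteration.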

Now, we arrive at Algorithm 1.

**Remark 2.1** We remark that Step 2 in Algorithm 1 decreases the energy, which can be proved similarly to [9], i.e.,

$$J^{\tau}(\chi^{k+1}, \mathbf{u}^k) \le J^{\tau}(\chi^k, \mathbf{u}^k).$$

In Step 1, however, we do not have

$$J^{\tau}(\chi^k, \mathbf{u}^k) \le J^{\tau}(\chi^k, \mathbf{u}^{k-1})$$

because this step can be interpreted as a projection step, which could increase the value of the energy. However, in the numerical experiments in Sect. 4, we checked the energy curves for all displayed examples; all of them indicate that the algorithm has an energy-decaying property.

**Remark 2.2** In the implementation, the stopping criterion is χ<sup>k+1</sup> = χ<sup>k</sup> at each grid point. It is easy to see that the stationary solution (obtained from Algorithm 1) satisfies the first order necessary optimality conditions (8), (9), and (10).

**Algorithm 1** An iterative thresholding method for topology optimization for the Navier-Stokes flow

**Input:** Discretize Ω uniformly into a grid *T<sub>h</sub>* with grid size *h* and set *M* = (1 − β)|Ω|/*h*<sup>d</sup>. Set τ > 0, ᾱ > 0, *k* = 0, a tolerance parameter *tol* > 0, and give the initial guess χ<sup>0</sup> ∈ *B*.

**Iterative solution:**

**Step 1. Given** $\chi^k$, **update $\mathbf{u}$ and $\tilde{\mathbf{u}}$.** Solve the following system

$$\begin{cases}
\nabla \cdot \mathbf{u} = 0, \quad \nabla \cdot \tilde{\mathbf{u}} = 0,\\
(\mathbf{u} \cdot \nabla)\mathbf{u} + \nabla p - \nabla \cdot (\mu \nabla \mathbf{u}) + \bar\alpha\, (G_\tau * \chi^k)\,\mathbf{u} = \mathbf{f},\\
-(\mathbf{u} \cdot \nabla)\mathbf{u} - (\mathbf{u} \cdot \nabla)\tilde{\mathbf{u}} + (\nabla \mathbf{u})^T \tilde{\mathbf{u}} + \nabla \tilde{p} - \nabla \cdot (\mu \nabla \tilde{\mathbf{u}}) + \bar\alpha\, (G_\tau * \chi^k)\,\tilde{\mathbf{u}} = 0,\\
\mathbf{u}|_{\partial\Omega} = \mathbf{u}_D, \quad \tilde{\mathbf{u}}|_{\partial\Omega} = 0,
\end{cases}$$

to obtain $\mathbf{u}^k$ and $\tilde{\mathbf{u}}^k$.

**Step 2. Update** χ**.** Evaluate

$$
\phi = \gamma \sqrt{\frac{\pi}{\tau}}\, G_\tau * (1 - 2\chi^k) + \frac{\bar{\alpha}}{2}\, G_\tau * |\mathbf{u}^k|^2 + \bar{\alpha}\, G_\tau * (\mathbf{u}^k \cdot \tilde{\mathbf{u}}^k),
$$

sort the values of φ in ascending order, and set $\chi^{k+1} = 1$ on the first *M* points.

**Step 3.** Compute $e_\chi^k = \|\chi^{k+1} - \chi^k\|_2$. If $e_\chi^k \le tol$, stop the iteration and go to the output step. Otherwise, let $k+1 \to k$ and continue the iteration.

**Output:** $(\chi, \mathbf{u})$ that approximately solves (7) subject to (8)(a-d).

#### **3 Numerical Implementation**

Now we illustrate the implementation of Algorithm 1, focusing on Step 1. The Navier-Stokes equations with a Darcy penalty term and the adjoint problem (13) are solved by the mixed finite element method, and the standard Taylor-Hood finite element space is used for the discretization. Let $\mathcal{T}_h$ be a uniform grid of the domain Ω, and let $N_h$ be the set of all vertices of $\mathcal{T}_h$. For a given $\chi^h \in B^h$, where $B^h$ is the discrete version of *B* defined on $N_h$, we introduce the Taylor-Hood finite element space

$$\begin{aligned} V\_h &:= \{ \mathbf{v} \in H^1(\Omega, \mathbb{R}^d) \mid \mathbf{v}|\_K \in [P\_2(K)]^d, \ K \in \mathcal{T}\_h \}, \\ \mathcal{Q}\_h &:= \{ q \in L^2(\Omega, \mathbb{R}) \mid \int\_{\Omega} q \, d\mathbf{x} = 0, \ q|\_K \in P\_1(K), \ K \in \mathcal{T}\_h \}. \end{aligned}$$

Let $V_h^D := \{\mathbf{v} \in V_h \mid \mathbf{v}|_{\partial\Omega} = \mathbf{u}_D^h\}$, where $\mathbf{u}_D^h$ is a suitable approximation of the Dirichlet boundary condition $\mathbf{u}_D$ on the boundary edges/faces of $\mathcal{T}_h$. For the solution of (13), find $(\mathbf{u}_h, p_h) \in V_h^D \times Q_h$ such that

$$\begin{aligned} ( (\mathbf{u}_h \cdot \nabla) \mathbf{u}_h, \mathbf{v}_h) - (p_h, \nabla \cdot \mathbf{v}_h) + (\mu \nabla \mathbf{u}_h, \nabla \mathbf{v}_h) + (\alpha (\overline{\chi}_h) \mathbf{u}_h, \mathbf{v}_h) &= 0, & \forall \, \mathbf{v}_h \in V_h^0, \\ (\nabla \cdot \mathbf{u}_h, q_h) &= 0, & \forall \, q_h \in Q_h, \end{aligned}$$

and $(\tilde{\mathbf{u}}_h, \tilde{p}_h) \in V_h^0 \times Q_h$ such that

$$\begin{aligned} -( (\mathbf{u}\_h \cdot \nabla) \tilde{\mathbf{u}}\_h, \mathbf{v}\_h) + ( (\nabla \mathbf{u}\_h)^T \tilde{\mathbf{u}}\_h, \mathbf{v}\_h) - (\tilde{p}\_h, \nabla \cdot \mathbf{v}\_h) + (\mu \nabla \tilde{\mathbf{u}}\_h, \nabla \mathbf{v}\_h) + (\alpha (\overline{\chi}\_h) \tilde{\mathbf{u}}\_h, \mathbf{v}\_h) \\ = ( (\mathbf{u}\_h \cdot \nabla) \mathbf{u}\_h, \mathbf{v}\_h), \qquad \forall \ \mathbf{v}\_h \in V\_h^0, \\ (\nabla \cdot \tilde{\mathbf{u}}\_h, q\_h) = 0, \qquad \forall \ q\_h \in \mathcal{Q}\_h, \end{aligned}$$

where $V_h^0 = V_h \cap H_0^1(\Omega, \mathbb{R}^d)$. All the above systems are solved by the standard Newton iteration, and each linearized system is solved by the generalized minimal residual method (GMRES).

We also note that the above bilinear forms can be straightforwardly extended to the problem with both a Dirichlet boundary $\Gamma_D$ and a Neumann boundary $\Gamma_N$, where $\Gamma_D \cap \Gamma_N = \emptyset$, $\Gamma_D \cup \Gamma_N = \partial\Omega$, and $(\mu\nabla\mathbf{u} - pI)\cdot \mathbf{n}|_{\Gamma_N} = \mathbf{g}$.

When $\mathbf{u}_h$ and $\tilde{\mathbf{u}}_h$ are obtained, we can use the FFT to compute $\phi^h$ at each node of $N_h$ as follows:

$$\phi^h = \gamma\sqrt{\frac{\pi}{\tau}}\, G_\tau * (1 - 2\overline{\chi}^h) + \frac{\bar{\alpha}}{2}\, G_\tau * (|\mathbf{u}_h|^2 + 2\mathbf{u}_h \cdot \tilde{\mathbf{u}}_h).$$

We can then use $\phi^h$ to update the indicator function $\chi^h$ by the thresholding strategy presented in Step 2 of Algorithm 1.
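On a uniform periodic grid, the convolutions with the Gaussian kernel $G_\tau$ reduce to one multiplication in Fourier space, since the heat kernel acts as the multiplier $e^{-\tau|\xi|^2}$. A minimal two-dimensional sketch (the grid size, domain length, and function names are illustrative):

```python
import numpy as np

def gaussian_convolve(chi, tau, L=1.0):
    """Evaluate G_tau * chi on a periodic uniform grid via the FFT.

    The heat kernel G_tau(x) = (4*pi*tau)^(-d/2) exp(-|x|^2/(4*tau))
    has Fourier multiplier exp(-tau*|xi|^2) in angular frequency xi.
    """
    n = chi.shape[0]
    xi = 2.0*np.pi*np.fft.fftfreq(n, d=L/n)        # angular frequencies
    XI1, XI2 = np.meshgrid(xi, xi, indexing="ij")
    multiplier = np.exp(-tau*(XI1**2 + XI2**2))
    return np.real(np.fft.ifft2(multiplier*np.fft.fft2(chi)))

chi = np.zeros((64, 64))
chi[20:40, 20:40] = 1.0                            # indicator of a square
smoothed = gaussian_convolve(chi, tau=1e-3)
```

The multiplier equals 1 at the zero frequency, so the mean of $\chi$ (hence the volume fraction) is preserved exactly by the convolution.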

**Remark 3.1** Similar to the adaptive-in-time strategy used in [9], we can turn Algorithm 1 into an adaptive algorithm by adjusting τ during the iterations. We set a threshold value $\tau_t$ and a tolerance $e_t$; if $e_\chi^k \le e_t$ and $\tau > \tau_t$, we let $\tau_{\mathrm{new}} = \eta\tau$ with $\eta \in (0,1)$ and update $\tau := \tau_{\mathrm{new}}$ in the next iteration. Otherwise, τ is not updated, and the iteration continues with the same τ.
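The update rule of Remark 3.1 amounts to a few lines (a sketch; the variable names are ours):

```python
def update_tau(tau, e_chi, e_t, tau_t, eta=0.5):
    """Shrink tau by a factor eta once the iterates nearly stagnate
    (e_chi <= e_t), but never below the threshold tau_t."""
    if e_chi <= e_t and tau > tau_t:
        return eta*tau
    return tau

tau = update_tau(0.01, e_chi=1e-6, e_t=1e-5, tau_t=1e-4)   # tau is halved
```

Shrinking τ only after the iterates stagnate lets the early iterations take large, cheap steps, while the later ones resolve the interface more sharply.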

#### **4 Numerical Experiments**

In this section, we perform extensive numerical experiments to demonstrate the efficiency of our new algorithm with an adaptive strategy for the choice of τ. We choose η = 0.5 in the update of τ. When no confusion can arise, we still denote by τ its initial value in the following. Also, we denote the Reynolds number by $Re = \frac{1}{\mu}$.

#### *4.1 Two Dimensional Results*

In this section, we test the performance of the proposed algorithm on two dimensional problems with several different design domains, as displayed in Fig. 1. For most examples in this section, we impose a Dirichlet boundary condition with a parabolic profile, where the magnitude of the velocity is set as $|\mathbf{u}_D| = \bar{g}\,\bigl(1 - 4\bigl(\frac{t-a}{l}\bigr)^2\bigr)$

**Fig. 1** Design domains of two dimensional examples

with $t \in [a - \frac{l}{2}, a + \frac{l}{2}]$, where *l* is the length of the section of the boundary on which the inflow/outflow velocity is imposed, and $\bar{g}$ is the prescribed velocity at the midpoint *a* of the flow profile. The directions of the inflow/outflow velocities are illustrated in the design domain of each example.

**Example 1** In this example, we consider the design of a bend, which has been tested by the level set method in [10, 14, 19]. The design domain is presented in Fig. 1a. Let $\bar{g}$ be 1 at both the inlet and the outlet, and set the fluid fraction as β = 0.08π. Here, we use our algorithm to obtain the optimal design on a 128 × 128 grid. We assume the initial distribution χ = 0 in the whole domain, and set the parameter $\bar\alpha = 1.5\mu \times 10^4$ throughout this example.

The boundary conditions in this example are slightly different from those in [10, 19], but the same as in [14]. On the 128 × 128 grid, we first test the example for different Reynolds numbers, with the other parameters set as τ = 0.001 and γ = 0.0001. The optimal design results together with the velocity field and the energy decaying curve are displayed in Fig. 2 for the cases of Re = 10, 100, and 1000,

**Fig. 2** (Example 1) Left to right: Optimal results and the corresponding energy decaying curve for the cases of Re = 10, 100, and 1000. The parameters are set as τ = 0.001, γ = 0.0001 and <sup>α</sup>¯ <sup>=</sup> <sup>1</sup>.5<sup>µ</sup> <sup>×</sup> <sup>10</sup><sup>4</sup>

**Fig. 3** (Example 1) Plots of energy curves for <sup>α</sup>¯ <sup>=</sup> <sup>1</sup>.5<sup>µ</sup> <sup>×</sup> <sup>10</sup><sup>4</sup> and Re <sup>=</sup> 10. Left: For fixed γ = 0.0001, energy curves for the cases of τ = 0.02, 0.005, 0.001. Right: For fixed τ = 0.001, energy curves for the cases of γ = 0.0005, 0.0001, 0.00005

separately. It was mentioned in [46] that the radius of curvature of the fluid domain decreases as the Reynolds number increases. This phenomenon can also be observed in Fig. 2, and the optimal results are consistent with those obtained by the level set methods in [10, 14, 19].

Furthermore, we numerically check the sensitivity of the energy decaying property with respect to τ and γ. In Fig. 3, we display the energy decaying curves for different choices of τ and γ with fixed Re = 10. We observe that the energy converges to almost the same value. In addition, the final design results are also identical to the leftmost one in Fig. 2.

**Fig. 4** (Example 2) Left to right: Optimal results and energy curves for β = 0.5 and β = 0.4

**Example 2** We test the example presented in Fig. 1b which has one parabolic inlet and four parabolic outlets. We assume *g*¯ = 3, *l* = 0.2 and *a* = 0.8 on the inlet boundary *x* = 0. For the four outlets, we let (*g*¯,*l*, *a*) = (1, 0.1, 0.8), (1, 0.1, 0.65), (1, 0.2, 0.7) and (1, 0.2, 0.25) on *y* = 0, *y* = 1, *x* = 1 and *x* = 1, respectively. This example has been tested by the phase field method in [18] with the same boundary conditions. Here, we use our algorithm to obtain the final optimal result on a 256 × 256 grid. Throughout this example, we set τ = 0.001, γ = 0.01, α¯ = 1.5µ × 10<sup>4</sup> and Re = 10.

For the initial distribution $\chi = 1 - \chi_{\{(x,y)\,:\,x\in(0,1),\,y\in(\frac{1}{6},\frac{5}{6})\}}$, we test this example for different fluid fractions β. For the left graph of Fig. 4 with β = 0.5, we obtain the optimal result after 40 iterations. For the right graph of Fig. 4 with β = 0.4, the optimal result is obtained after 38 iterations. We find that the final result in Fig. 4 has a treelike structure, consistent with that obtained using the phase field method in [18]. The energy decaying curves for different fluid fractions β are also displayed in Fig. 4.

**Example 3** In this example, we consider the minimization of the power dissipation in a four terminal device. We set *g*¯ = 1 for the two inflows and homogeneous Neumann boundaries on parts of the top and bottom boundaries with centers [0.5, 0] and

**Fig. 5** (Example 3) Left to right: Optimal results and energy curves on a 128 × 128 grid and <sup>256</sup> <sup>×</sup> 256 grid. The parameters are set as <sup>τ</sup> <sup>=</sup> <sup>0</sup>.001, <sup>γ</sup> <sup>=</sup> <sup>0</sup>.0001, <sup>α</sup>¯ <sup>=</sup> <sup>2</sup>.5<sup>µ</sup> <sup>×</sup> <sup>10</sup><sup>4</sup> and Re <sup>=</sup> <sup>1</sup>

[0.5, 1] (see Fig. 1c). The fluid fraction is defined as β = 0.4. Here, we utilize our algorithm to achieve the optimal configurations on 128 × 128 and 256 × 256 grids.

We test the case for τ = 0.001, γ = 0.0001, $\bar\alpha = 2.5\mu \times 10^4$ and Re = 1 on 128 × 128 and 256 × 256 grids. The initial distribution is set as $\chi = 1 - \chi_{\{(x,y)\,:\,x\in(0,1),\,y\in(\frac{1}{3},\frac{2}{3})\}}$. In Fig. 5, we observe that the final optimal configuration is consistent with the result obtained using the level set method in [10]. The final results on the two grids are almost the same, which indicates that our algorithm is grid-independent for this example. Furthermore, the energy decaying property can be observed in Fig. 5.

**Example 4** In this example, we consider a three terminal device on the design domain displayed in Fig. 1d. We set $\bar{g} = 1$ on the two inflows and impose the homogeneous Neumann boundary condition on the outflow. The fluid fraction is set as β = 0.3, and we test this example on a 128 × 128 grid for τ = 0.0005, γ = 0.0002 and $\bar\alpha = 1.5\mu \times 10^4$.

In this example, we study the dependence of the optimal configurations on the choice of Reynolds number. Based on the initial distribution $\chi = 1 - \chi_{\{(x,y)\,:\,x\in(0,1),\,y\in(\frac{1}{5},\frac{4}{5})\}}$,

**Fig. 6** (Example 4) Left to right: Optimal configurations and energy decaying curves for Re = 20 and 500

the final optimal design results with the velocity fields for Re = 20 and 500 are displayed in Fig. 6. We observe that the parts of the optimal configuration gradually separate from each other as the Reynolds number increases. The energy decaying curves are also displayed; the iteration converges in about 20 steps for Re = 20 and 25 steps for Re = 500, respectively.

#### *4.2 Three Dimensional Results*

In this section, we show the performance of the algorithm on several three dimensional problems for different design domains in Fig. 7. In the following examples, the magnitude of the velocity for the Dirichlet boundary condition on a slice is set as

$$|\mathbf{u}_D| = \bar{g} \left( 1 - \frac{(s_1 - a)^2 + (s_2 - b)^2}{l^2} \right),$$

**Fig. 7** Design domains of three dimensional examples

where $\bar{g}$ is the prescribed velocity at the center (*a*, *b*) of the circle on which the inflow/outflow velocity is imposed, *l* is the radius of the circle, and $(s_1, s_2)$ are Cartesian coordinates on the slice.

(a) The design domain of Example 5. (b) The design domain of Example 6.

**Example 5** In this example, we consider the multi-outlet problem in Fig. 7a. For the inflow, we set $\bar{g} = 1$, $l = 0.2$, and $(a,b) = (\frac{1}{2}, \frac{1}{2})$ on the *x* = 0 plane. For the outflows, we set $l = 0.1$, $\bar{g} = 1$, and $(a,b) = (0.8, 0.5)$ on the *y* = 0, *y* = 1, *z* = 0, and *z* = 1 planes. Throughout this example, we choose the initial distribution with the fluid domain in the region $\{(x,y,z) : x \in (0,1),\, y \in (0,1),\, z \in (\frac{1}{3}, \frac{2}{3})\}$, and set β = 0.2, $\bar\alpha = 2.5\mu \times 10^4$ and Re = 20.

We first test the case for τ = 0.005 and γ = 0.0001 on 32 × 32 × 32 and 85 × 85 × 85 grids. The optimal results in the left graphs of Fig. 8 are consistent with those obtained using the level set method in [10]. In addition, from the energy decaying curves in Fig. 8, we observe that the iteration converges in about 20 steps and 30 steps on the coarse and fine grids, respectively. In Fig. 9, we display the slices of the 85 × 85 × 85 result on the *z* = 0.5 and *y* = 0.5 planes.

Next, we compute the result for different τ and γ on the 32 × 32 × 32 grid. The energy curves for γ = 0.0001 and τ = 0.01, 0.005, 0.001 are displayed in the left graph of Fig. 10, and the energy curves for τ = 0.005 and γ = 0.001, 0.0005, 0.0001 are displayed in the right graph of Fig. 10. We observe that the energy converges to almost the same value for different γ and τ .

**Example 6** Here, we consider an example with two inlets and four outlets. The design domain is shown in Fig. 7b. For the two inflows, let $\bar{g} = 2$, $l = 0.05$ and $(a,b) = (0.5, 0.5)$ on the *x* = 0 and *x* = 1 planes, respectively. For the four outflows, we set $\bar{g} = 1$, $l = 0.05$ and $(a,b) = (0.5, 0.5)$ on the *y* = 0, *y* = 1, *z* = 0 and *z* = 1 planes, respectively. In this example, we use our algorithm to obtain the final optimal result for τ = 0.001, γ = 0.0001, $\bar\alpha = 2.5\mu \times 10^4$ and Re = 1. The initial distribution of the fluid region is set as $\{(x,y,z) : x \in (0,1),\, y \in (0,1),\, z \in (\frac{1}{6}, \frac{5}{6})\}$.

**Fig. 8** (Example 5) Left to right: Optimal configurations on different grids (top: 32 × 32 × 32, bottom: 85 × 85 × 85) and energy curves. The parameters are set as τ = 0.005, γ = 0.0001, α¯ = <sup>2</sup>.5<sup>µ</sup> <sup>×</sup> <sup>10</sup><sup>4</sup> and Re <sup>=</sup> <sup>20</sup>

**Fig. 9** (Example 5) The slices on the 85 × 85 × 85 grid for τ = 0.005, γ = 0.0001, α¯ = 2.5µ × <sup>10</sup><sup>4</sup> and Re <sup>=</sup> 20. Left: The slice on *<sup>z</sup>* <sup>=</sup> <sup>0</sup>.5 plane. Right: The slice on *<sup>y</sup>* <sup>=</sup> <sup>0</sup>.5 plane

**Fig. 10** (Example 5) Plots of energy curves for <sup>α</sup>¯ <sup>=</sup> <sup>2</sup>.5<sup>µ</sup> <sup>×</sup> <sup>10</sup><sup>4</sup> and Re <sup>=</sup> 20. Left: For fixed γ = 0.0001, energy curves for the cases of τ = 0.01, 0.005, 0.001. Right: For fixed τ = 0.005, energy curves for the cases of γ = 0.0005, 0.0001, 0.00005

**Fig. 11** (Example 6) Left to right: Optimal configurations on the different grids (top: 64 × 64 × 64, bottom: 90 × 90 × 90) and energy decaying curves. The fluid fraction is β = 0.18

For the fluid fraction β = 0.1, we compute optimal configurations on 64 × 64 × 64 and 90 × 90 × 90 grids. The final results on the coarse and fine grids, with the corresponding energy decaying curves, are displayed in Fig. 11. We observe that the interface is smoother on the fine mesh, and the iteration converges in 25 and 30 steps on the coarse and fine grids, respectively.

**Fig. 12** (Example 6) Left to right: Optimal configurations for different β (top: β = 0.1, bottom: β = 0.18), energy decaying curves, and slices on *y* = 0.5 plane

Based on the 64 × 64 × 64 grid, we check the dependence of the results on the choice of β. In Fig. 12, we display the results, energy decaying curves, and slices on the *y* = 0.5 plane for the optimal shapes obtained with β = 0.1 and 0.18. The iteration converges in about 25 steps and 20 steps for β = 0.1 and 0.18, respectively. From Fig. 12, we can observe that the solid domain in the center shrinks as β increases.

#### **5 Conclusion**

In this paper, we present an efficient threshold dynamics method for topology optimization for Navier–Stokes flow, extending our previous work [9] to the Navier–Stokes setting. We aim to minimize a total energy functional that consists of the potential power and the perimeter, the latter approximated by a nonlocal energy. Different from the algorithm in [9], during each iteration we need to solve not only the Brinkman equation but also an adjoint problem by the mixed finite element method. The indicator functions of the fluid-solid regions are then updated by a thresholding step based on convolutions evaluated by the FFT. A simple adaptive-in-time strategy is used to accelerate the convergence of the algorithm. Numerical examples are presented to verify the efficiency of the new algorithm, and the total energy decaying property of the proposed algorithm is observed numerically. The proposed algorithm is simple and easy to implement. For all the numerical experiments that we have performed, the proposed algorithm finds an optimal shape, and the numerical results are relatively insensitive to the initial guesses and parameters.

**Acknowledgements** This research was supported in part by the Hong Kong Research Grants Council (GRF grants 16324416, 16303318 and 16305819). The work of H. Leng was supported by the NSF of China (Grant No. 12001209). The work of D. Wang was supported by the University Development Fund of the Chinese University of Hong Kong, Shenzhen (UDF 01001803). The work of H. Chen was supported by the NSF of China (Grant No. 11771363, 91630204, 51661135011), the Fundamental Research Funds for the Central Universities (Grant No. 20720180003).

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Dynamics of Complex Singularities of Nonlinear PDEs**

## **Analysis and Computation**

#### **J. A. C. Weideman**

**Abstract** Solutions to nonlinear evolution equations exhibit a wide range of interesting phenomena such as shocks, solitons, recurrence, and blow-up. As an aid to understanding some of these features, the solutions can be viewed as analytic functions of a complex space variable. The dynamics of poles and branch point singularities in the complex plane can often be associated with the aforementioned features of the solution. Some of the computational and analytical results in this area are surveyed here. This includes a first attempt at computing the poles in the famous Zabusky–Kruskal experiment that led to the discovery of the soliton.

#### **1 Introduction**

Ever since Kruskal [22] remarked that soliton motion may be thought of as a "parade of poles," the study of complex pole dynamics in nonlinear wave equations has been an active research field. This paper is an overview of the field, using some of the well-known model problems, including the Korteweg–De Vries equation that prompted Kruskal's remark. The plan is to take these equations, some of them dissipative and others dispersive, and start them all with the same set of initial and boundary conditions. Using analysis where we can and numerical computation otherwise, we shall then track the evolution of the complex singularities. The singularity dynamics of the various equations will be contrasted, and also connected to the typical nonlinear features associated with these equations such as shock formation, soliton motion, finite time blow-up, and recurrence. Here, a particular interest is the entry of the singularities when the initial condition has no singularities in the finite complex plane.

J. A. C. Weideman (B)

Department of Mathematical Sciences, Stellenbosch University, Stellenbosch, South Africa e-mail: weideman@sun.ac.za

<sup>©</sup> The Author(s) 2022

T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7\_13

We consider equations of the form

$$
u_t + u u_x = L(u), \quad t > 0, \ -\pi \le x < \pi,\tag{1}
$$

and assume 2π-periodic solutions in the space variable, *x*. The linear operator on the right can be any one of

$$L(u) = \nu \, u_{xx} \qquad\text{ (Burgers)},\tag{2}$$

$$L(u) = -\nu \, u\_{xxx} \quad \text{ (Korteweg-De Vries)}, \tag{3}$$

$$L(u) = \nu \, H\{u_{xx}\} \quad (\text{Benjamin–Ono}), \tag{4}$$

where ν is a nonnegative constant and *H* denotes the periodic Hilbert transform, defined below. As initial condition we consider

$$u(x,0) = -\sin(x),\tag{5}$$

the particular form of which allows us to make connections to several works of historical interest, namely papers by Cole [10], Hopf [21], Platzman [27], and Zabusky and Kruskal [39].

The numerical procedure we follow is similar to the one proposed in [35]. The first step involves a Fourier spectral method in space and a numerical integrator in time to compute the solution on [−π, π]×[0, *T* ]. The second step is to continue the solution at any time *t* in [0, *T* ] into the complex *x*-plane. For the continuation we use a Fourier–Padé method, although other possibilities are considered as well.
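The first step can be sketched for the viscous Burgers equation as follows (a Fourier spectral discretization in space with classical RK4 in time; the resolution, viscosity, and step size are our illustrative choices, and dealiasing is omitted):

```python
import numpy as np

N = 128                                   # number of Fourier modes
x = 2.0*np.pi*np.arange(N)/N - np.pi      # periodic grid on [-pi, pi)
kx = np.fft.fftfreq(N, d=1.0/N)           # integer wavenumbers
nu = 0.1

def rhs(u):
    """Right-hand side of u_t = -u*u_x + nu*u_xx, evaluated spectrally."""
    u_hat = np.fft.fft(u)
    ux = np.real(np.fft.ifft(1j*kx*u_hat))
    uxx = np.real(np.fft.ifft(-(kx**2)*u_hat))
    return -u*ux + nu*uxx

u, dt = -np.sin(x), 1e-3                  # initial condition (5)
for _ in range(1000):                     # integrate to t = 1 with RK4
    s1 = rhs(u)
    s2 = rhs(u + 0.5*dt*s1)
    s3 = rhs(u + 0.5*dt*s2)
    s4 = rhs(u + dt*s3)
    u = u + (dt/6.0)*(s1 + 2.0*s2 + 2.0*s3 + s4)
```

The Fourier coefficients of `u` at any time level are then the raw input for the Fourier–Padé continuation into the complex plane.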

In order to identify and display poles and branch points in the complex plane, we shall plot what is called the "analytical landscape" in [34]. With the solution *f*(*z*) expressed in polar form $re^{i\theta}$, the software of [34] can be used to generate a 3D plot in which the horizontal axes represent the real and imaginary components of $z = x + iy$, the height represents the modulus *r*, and colour represents the phase $e^{i\theta}$. The two examples in Fig. 1 should clarify this visualization.

The outline of the paper is as follows: The inviscid Burgers equation and its viscous counterpart are discussed, respectively, in Sects. 2 and 3. Here, analysis provides the exact locations of the branch point singularities in the inviscid case and approximate locations of the poles in the case of small viscosity. For the other PDEs considered here, namely Benjamin-Ono (BO) in Sect. 4 and Korteweg-de Vries (KdV) in Sect. 5, analytical results are harder to come by and we resort to the numerical procedure mentioned above. The nonlinear Schrödinger equation (NLS) also makes an appearance in our discussion of recurrence in Sect. 6. In the final section we discuss the details of the numerical methods employed in the earlier sections.

Novel results presented here include the pole dynamics of the BO, KdV, and NLS equations. Related studies of KdV were undertaken in [7, 17], but these authors did not consider the Zabusky–Kruskal experiment which is our focus here. Pole behaviour in KdV and NLS was also discussed in the papers [11, 22] and [9, 23], respectively, but those analyses were based on cases where explicit solutions are

**Fig. 1** Analytical landscapes of the functions *<sup>f</sup>* (*z*) <sup>=</sup> <sup>1</sup>/*z*<sup>2</sup> (top left), and *<sup>f</sup>* (*z*) <sup>=</sup> *<sup>z</sup>*1/<sup>2</sup> (top right). The height represents the modulus and the colour represents the phase, as defined by the NIST standard colour wheel (bottom); see [13]. For details about the software used to produce these figures, see [34]

available. Moreover, in those papers the poles were already present in the initial condition. Here, our interest is in the situation where the singularities are "born" at infinity.

Although this paper focuses only on simple model equations such as (1)–(4), pole dynamics have been studied in more complex models, particularly in the water wave context. Among the many references are [3, 7, 14].

#### **2 The Inviscid Burgers Equation**

The inviscid Burgers equation, $u_t + uu_x = 0$, subject to the initial condition (5), develops a shock at $(x, t) = (0, 1)$, as can be verified by the method of characteristics. It also admits an explicit Fourier series solution [27]

$$u(x,t) = -2\sum\_{k=1}^{\infty} c\_k(t)\sin(kx), \qquad c\_k(t) := \frac{J\_k(kt)}{kt},\tag{6}$$

valid for 0 < *t* < 1. The $J_k$ are the Bessel functions of the first kind. This series is of limited use for numerical purposes, however, particularly for continuation into the complex plane. When truncated, it becomes an entire function and will not reveal

**Fig. 2** Solution to the inviscid Burgers equation as computed by applying Newton iteration to the implicit solution formula (7). The four frames correspond to $t = \frac{1}{4}, \frac{1}{2}, \frac{3}{4}$, and 1 (in the usual order). The thicker black curve is the real-valued solution on the real axis, displaying the typical steepening of the curve until the shock forms in the last frame. The solution in the upper half-plane is displayed in the format of Fig. 1. The solution in the lower half-plane is not shown because of symmetry. The black dot represents a branch point singularity that travels along the imaginary axis according to (9). By referring to the colour wheel of Fig. 1, one can see that on the imaginary axis there is no jump in phase between the origin and the branch point (in some printed versions the abrupt change in phase may appear to be discontinuous, but it is not). From the branch point to $+i\infty$, however, there is a phase jump consistent with a singularity of quadratic type

much singularity information other than perhaps the location and type of the singularity nearest to the real axis [26, 32].
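On the real axis, though, the truncated series is perfectly usable; as a sanity check it can be evaluated with standard Bessel routines and tested against the implicit relation $u = -\sin(x - ut)$ (a sketch; the truncation level *K* is our choice):

```python
import numpy as np
from scipy.special import jv   # Bessel function J_k of the first kind

def u_inviscid(x, t, K=200):
    """Evaluate the Fourier-Bessel series (6), valid for 0 < t < 1."""
    k = np.arange(1, K + 1)
    ck = jv(k, k*t)/(k*t)                      # c_k(t) = J_k(kt)/(kt)
    return -2.0*np.sum(ck*np.sin(np.outer(x, k)), axis=1)

x = np.linspace(-3.0, 3.0, 25)
t = 0.5
u = u_inviscid(x, t)
residual = np.max(np.abs(u + np.sin(x - u*t)))  # should be ~ 0 for 0 < t < 1
```

The coefficients $J_k(kt)$ decay exponentially in *k* for fixed $t < 1$, so a modest truncation already reproduces the implicit solution to high accuracy on the real axis.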

Instead, for numerical purposes we shall use the implicit solution formula

$$u = f(x - ut), \qquad f(x) = -\sin(x).\tag{7}$$

This transcendental equation can be solved by Newton iteration for values of *x* in the complex plane. One can start at a small time increment, say $t = \Delta t$, use $u = f(x)$ as initial guess, and iterate until convergence. Then *t* is incremented to $2\Delta t$, the initial guess is updated to the current solution, and the process is repeated. Figure 2 shows the corresponding solutions in the visualization format described in the introduction.
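This continuation can be sketched in NumPy as follows (the step count and tolerance are our choices; the routine accepts complex *x* as well):

```python
import numpy as np

def burgers_implicit(x, t, n_steps=20, newton_tol=1e-12):
    """Solve u = -sin(x - u*t) by Newton iteration, continuing in t."""
    dt = t/n_steps
    u = -np.sin(x)                        # exact solution at t = 0
    for j in range(1, n_steps + 1):
        tj = j*dt
        for _ in range(50):
            g = u + np.sin(x - u*tj)      # residual of u - f(x - u t)
            gp = 1.0 - tj*np.cos(x - u*tj)
            du = g/gp                     # Newton correction
            u = u - du
            if np.max(np.abs(du)) < newton_tol:
                break
    return u

x = np.linspace(-3.0, 3.0, 7)
u = burgers_implicit(x, 0.5)
```

Using the previous time level as the initial guess keeps the iterates on the correct solution branch, which matters once *x* is complex and several branches coexist.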

The figure shows one member of a conjugate pair of branch point singularities, born at $+i\infty$, which travels down the positive imaginary axis and meets its conjugate partner (not shown) at $(x, t) = (0, 1)$ when the shock occurs. This behaviour was first reported in [5, 6], where a cubic polynomial was used as initial condition (similar to the first two terms in the Taylor expansion of (5)). In the cubic case, Eq. (7) can be solved explicitly by Cardano's formula, which enabled a complete description of the singularity dynamics, as summarized in [5, 6, 28, 29]. In our case, the initial condition is trigonometric, and therefore Cardano's formula is not applicable. It is nevertheless possible to find the singularity locations and their type explicitly.

The singularity location, say *z* = *zs*, and the corresponding solution value, say *u* = *us*, are defined by the simultaneous equations

$$
u_s = f(z_s - u_s t), \qquad 1 = -t\, f'(z_s - u_s t), \tag{8}
$$

the latter equation representing the vanishing Jacobian of the mapping; see for example [26]. With *f* (*x*) defined by (5), the solution is, for 0 < *t* < 1,

$$z\_s = \pm i \left( \sqrt{1 - t^2} - \tanh^{-1} \sqrt{1 - t^2} \right), \quad u\_s = \pm i \, t^{-1} \sqrt{1 - t^2}. \tag{9}$$

These formulas are consistent with the solution shown in Fig. 2. A graph of the singularity location as a function of time is shown as the dashed curve in Fig. 3 of the next section.
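One can also verify (9) by direct substitution into (8) with $f(x) = -\sin(x)$; a numerical check at a sample time:

```python
import cmath

def singularity(t):
    """One branch of (9): z_s = i(sqrt(1-t^2) - atanh sqrt(1-t^2)),
    paired with u_s = i*sqrt(1-t^2)/t."""
    w = cmath.sqrt(1 - t**2)
    z_s = 1j*(w - cmath.atanh(w))
    u_s = 1j*w/t
    return z_s, u_s

f = lambda z: -cmath.sin(z)
fp = lambda z: -cmath.cos(z)

t = 0.6
z_s, u_s = singularity(t)
r1 = u_s - f(z_s - u_s*t)        # first condition in (8): should vanish
r2 = 1 + t*fp(z_s - u_s*t)       # second condition 1 = -t f'(.): should vanish
```

Both residuals vanish because $z_s - u_s t = -i\,\mathrm{atanh}\sqrt{1-t^2}$, so $\cos(z_s - u_s t) = 1/t$ and $-\sin(z_s - u_s t) = i\sqrt{1-t^2}/t$.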

Further analysis shows that the singularity is of quadratic type, consistent with the phase colours in Fig. 2 and in agreement with the analysis of [5, 6, 28, 29] for the cubic initial condition. When *t* = 1, i.e., at the time the shock occurs, the singularity type changes from quadratic to cubic. The Riemann surface structure associated with this is discussed in [5, 6], in connection with the cubic initial condition.

#### **3 The Viscous Burgers Equation**

When viscosity is added, i.e., ν > 0 in the Burgers equation (1)-(2), shock formation does not occur. In the complex plane interpretation this means the singularities do not reach the real axis. Moreover, they become strings of poles rather than the branch points observed in the previous section. The poles travel in conjugate pairs from $\pm i\infty$, with a rapid approach towards the real axis, before turning around. They retrace their steps along the imaginary axis at a more leisurely pace, and eventually recede back to infinity, which ultimately leads to the zero steady state solution.<sup>1</sup>

Analogously to (6), the Burgers equation subject to the initial condition (5) has an explicit series solution, this time not a Fourier series but a ratio of two such series:

$$u(x,t) = -2\nu \frac{\theta_x}{\theta}, \quad \theta(x,t) := I_0\left(\frac{1}{2\nu}\right) + 2\sum_{k=1}^{\infty}(-1)^k I_k\left(\frac{1}{2\nu}\right)e^{-\nu k^2 t}\cos(kx). \tag{10}$$

<sup>1</sup> A movie of the pole dynamics of this solution and some of the other solutions in this paper can be found on the author's web page [36].

**Fig. 3** Left: Solution of the viscous Burgers equation (2), with ν = 0.1, *t* = 1, as computed from the series solution formula (10). Right: The locations on the positive imaginary axis of the first four poles as a function of time. The dash-dot curve is the location of the branch-point singularity when ν = 0, as given by formula (9) (the pole curves approach the dash-dot curve asymptotically as *t* → 0<sup>+</sup> but could not be computed reliably for small values of *t* because of ill-conditioning, hence the gaps)

The $I_k$ are the modified Bessel functions of the first kind. This solution is derived from the famous Hopf–Cole transformation; in fact, the above series is a special case of one of the examples presented in the original paper of Cole [10]. Presumably the solutions (6) and (10) can be connected in the limit $\nu \to 0^+$, but we found no such reference in the literature.
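The series (10) is straightforward to evaluate numerically. The following sketch (function name and truncation index `K` are ours; `K` should be chosen so the neglected terms are negligible) uses `scipy.special.iv` for the modified Bessel functions:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function I_k of the first kind

def burgers_series(x, t, nu, K=60):
    """Evaluate u = -2*nu*theta_x/theta from the truncated series (10).

    x is an array of evaluation points; K is the truncation index.
    """
    k = np.arange(1, K + 1)
    a = 2 * (-1.0) ** k * iv(k, 1.0 / (2 * nu)) * np.exp(-nu * k**2 * t)
    theta = iv(0, 1.0 / (2 * nu)) + np.sum(a * np.cos(np.outer(x, k)), axis=1)
    theta_x = np.sum(-a * k * np.sin(np.outer(x, k)), axis=1)
    return -2 * nu * theta_x / theta
```

At $t = 0$ the heat factors equal one and the series sums to $\theta = e^{-\cos(x)/(2\nu)}$, so the formula returns the initial condition $-\sin(x)$, a convenient sanity check.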

The pole locations in Fig. 3 can be computed from the series solution (10). For asymptotic estimates, however, a better representation is the integral form [10, 21]:

$$u(x,t) = \frac{\int_{-\infty}^{\infty} \frac{x-s}{t} \exp\left(\frac{1}{2\nu}F(x,s,t)\right)ds}{\int_{-\infty}^{\infty} \exp\left(\frac{1}{2\nu}F(x,s,t)\right)ds}.\tag{11}$$

In the case of the initial condition (5) the function *F* is defined by

$$F(x, s, t) = 1 - \cos(s) - \frac{(x - s)^2}{2t}. \tag{12}$$

To estimate the pole locations in the viscous Burgers equation one can analyze the denominator of the formula in (11). Looking for poles on the positive imaginary axis, we define, for $y > 0$,

$$D(y, t) = \int_{-\infty}^{\infty} \exp\left(\frac{1}{2\nu} F(yi, s, t)\right) ds.\tag{13}$$

A saddle point method can be used to estimate this integral when $0 < \nu \ll 1$. We present an informal analysis here, focussed on an explanation of the situation shown in Fig. 3. A more comprehensive analysis (for the cubic initial condition) can be found in [28].

Figure 4 shows level curves of the real and imaginary parts of *F*(*iy*,*s*, *t*) in the complex *s*-plane, with *y* = 1 and *t* = 1. The figure reveals three saddle points, two in the upper half-plane and one in the lower half-plane. The contour of integration in (13) is accordingly deformed into the upper half-plane, in order to pass through the two saddle points.

To estimate the saddle point contributions, we differentiate the function $F$ in (13) with respect to $s$ (and suppress the dependence on $y$ and $t$),

$$F'(s) = \sin(s) - \frac{(s - yi)}{t}, \qquad F''(s) = \cos(s) - \frac{1}{t}. \tag{14}$$

The saddle points are defined by $F'(s) = 0$, i.e.,

$$
s - yi - t\sin(s) = 0.\tag{15}
$$

No explicit solution of this equation seems to exist, but it can be checked that for *t* = 1 and all *y* > 0 there is precisely one root on the negative imaginary axis, and two roots in the upper half-plane, symmetrically located with respect to the imaginary axis. The configuration shown in Fig. 4 can therefore be taken as representative of all *y* > 0, except that the saddle points coalesce at the origin as *y* → 0+.
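In practice the roots of (15) are found iteratively. A minimal sketch (naming ours) applies Newton's method in the complex plane; the starting guess selects which root is found, e.g. a guess on the negative imaginary axis converges to the lower saddle point for $t = 1$, $y = 1$:

```python
import numpy as np

def saddle_point(s0, y, t, tol=1e-13, maxit=50):
    """Newton iteration for the saddle point equation (15):
    s - i*y - t*sin(s) = 0, starting from the complex guess s0."""
    s = complex(s0)
    for _ in range(maxit):
        # Newton step: f(s) / f'(s) with f(s) = s - i*y - t*sin(s)
        step = (s - 1j * y - t * np.sin(s)) / (1.0 - t * np.cos(s))
        s -= step
        if abs(step) < tol:
            break
    return s
```

For the root on the negative imaginary axis, writing $s = -iv$ reduces (15) to the real equation $t\sinh v = v + y$, which can be used to generate starting guesses.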

We label the roots in the first and second quadrants as $s_1$ and $s_2$, respectively, with $s_2 = -\overline{s}_1$. The corresponding saddle point contributions are $D_1$ and $D_2$, where

$$D\_j = 2\sqrt{\frac{\pi\nu}{|F''(s\_j)|}} \exp\left(\frac{1}{2\nu}F(s\_j) - \frac{1}{2}i(\theta\_j \pm \pi)\right),\tag{16}$$

where the upper (resp., lower) sign choice refers to $j = 1$ (resp., $j = 2$). The quantities $\theta_j$ are defined by $F''(s_j) = |F''(s_j)|e^{i\theta_j}$.

The approximation to the denominator integral (13) is now given by $D \sim D_1 + D_2$ as $\nu \to 0^+$. After using the symmetry relationships between $s_1$ and $s_2$ noted above, as well as the fact that $|F''(s_1)| = |F''(s_2)|$, this becomes

$$D \sim 4\sqrt{\frac{\pi\nu}{|F''(s\_1)|}}e^{\frac{1}{2\nu}\lambda\_1}\sin\left(\frac{\mu\_1}{2\nu} - \frac{1}{2}\theta\_1\right), \quad F(s\_1) := \lambda\_1 + \mu\_1 i. \tag{17}$$

In the second frame of Fig. 4 the graph of this function is shown as a function of *y*. In comparison with a high-accuracy quadrature approximation of the integral (13), the approximation (17) is seen to be quite accurate. The exception is for small values of *y*, because of the coalescence of the saddle points mentioned above.

**Fig. 4** Saddle point analysis for the viscous Burgers equation shown in Fig. 3. Left: The dots are saddle points of *F*(*yi*,*s*, *t*), with *y* = 1, *t* = 1. The colour represents level curves of the real part of *F*(*yi*,*s*, *t*), and the dash-dot curves are level curves of the imaginary part. For the saddle point analysis the path of integration in (13), i.e., the real line, is deformed into the dash-dot curve in the upper half-plane that defines the steepest descent direction. The main contributions to the integral come from the regions in the neighbourhood of the saddle points. Right: The function *D*(*y*, 1), computed by numerical integration of (13) (solid curve), in comparison with the saddle point approximation (17) (dash-dot curve). The zeros of this function define the locations of the poles seen in Fig. 3

**Table 1** Left: Pole locations on the positive imaginary axis for the solution shown in Fig. 3, i.e., *t* = 1 and ν = 0.1. The 'exact' values were computed by numerical quadrature of (13) and root finding, both processes executed to high precision. The estimated values were computed by a numerical solution of the two equations (15) and (18). Right: Turning points of the poles, i.e., the coordinates of the local minima in the right frame of Fig. 3. This was computed by a numerical solution of the two equations (15) and (18) in combination with a minimization procedure with objective function *y*


Approximate pole locations can be computed as the zeros of (17), i.e.,

$$
\mu\_1 - \nu \theta\_1 = 2\nu k \pi, \quad k = 1, 2, \dots, \tag{18}
$$

which is solved simultaneously with the saddle point equation (15). In Table 1 we compare this estimate with the actual pole locations.

The equations (15)–(18) can be used as a basis for further analysis, both theoretical and numerical, of the pole locations. For example, by solving these equations numerically and simultaneously minimizing over $y$, the closest distance any particular pole gets to the real axis can be computed. These results are also summarized in Table 1.

**Fig. 5** Finite time blow-up in the Burgers equation (2) with ν = 0.1, subject to the complex initial condition (19). The poles approach the origin from the positive imaginary direction, as can be seen in the left frame, which corresponds to *t* = 0.7. In the right frame the leading pole has reached the real axis, roughly at *t* = 1, which results in a blow-up (note that there is no upper/lower half-plane symmetry as was the case in Fig. 2, so we show both half-planes in this figure)

To conclude this section on the Burgers equation we mention a lesser-known fact, namely, that nonlinear blow-up is possible with complex initial data. For example, Fig. 5 shows the blow-up in the solution corresponding to the complex Fourier mode initial condition

$$u(x,0) = -\sin(x) - i\cos(x).\tag{19}$$

Features such as the blow-up time or the minimum value of ν that allows blow-up can be analyzed by the saddle point method outlined above, but we shall not pursue this here.

When dispersion replaces diffusion in (1), the poles drift away from the imaginary axis. The pole behaviour is more complicated than in the Burgers case and the bigger the dispersive effects, the more intricate the behaviour. For this reason we tackle the less famous BO equation first, before getting to the more celebrated KdV equation.

#### **4 The Benjamin-Ono Equation**

The periodic Hilbert transform *H* in (4) can be defined as a convolution integral involving a cotangent kernel [19, Ch. 14], or, equivalently, in terms of Fourier series

$$u(x,t) = \sum_{k=-\infty}^{\infty} c_k(t)e^{ikx} \quad \Rightarrow \quad H\{u_{xx}\} = \sum_{k=-\infty}^{\infty} (-i)\,\text{sgn}(k)\, k^2 c_k(t) e^{ikx}. \tag{20}$$
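The Fourier multiplier in (20) is applied in a few lines with the FFT. A sketch (function name ours), acting on equispaced samples of a $2\pi$-periodic function:

```python
import numpy as np

def bo_dispersive_term(u):
    """Apply the Fourier multiplier -i*sgn(k)*k**2 of eq. (20), i.e. H{u_xx},
    to equispaced samples of a 2*pi-periodic function, via the FFT."""
    n = u.size
    k = np.fft.fftfreq(n, d=1.0 / n)           # integer wave numbers
    return np.fft.ifft(-1j * np.sign(k) * k**2 * np.fft.fft(u))
```

As a check, for $u = \cos x$ the multiplier gives $H\{u_{xx}\} = \sin x$, which the code reproduces to rounding error.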

When the nonlinear term in (1) is absent, both the BO and KdV equations are linear dispersive wave equations. They admit travelling wave solutions $u(x, t) = e^{i(kx - \omega(k)t)}$ with dispersion relations $\omega = -\nu\,\mathrm{sgn}(k)k^2$ and $\omega = -\nu k^3$, respectively. The quadratic vs cubic dependence on the wave number $k$ makes dispersive effects in the BO equation less pronounced than in the KdV equation.

**Fig. 6** Solutions to the Benjamin-Ono equation (1) and (4), corresponding to the initial condition (5), with ν = 0.1. The pole dynamics of this solution can be seen in Fig. 7

With the nonlinear term in (1) present, both the BO and KdV equations are completely integrable and solvable, in principle, by the inverse scattering transform [1]. For arbitrary initial conditions and particularly with periodic boundary conditions, however, it is unlikely that all steps of the procedure can be completed successfully to obtain explicit solutions. Numerical methods will therefore be used to study singularity dynamics. As mentioned in the introduction, this consists of a standard method of lines procedure to obtain the solution on the real axis, followed by numerical analytical continuation into the complex plane by means of a Fourier-Padé method. Details are postponed to Sect. 7. Our choice of a Padé based method stems from the fact that singularities in both BO and KdV (next section) are expected to be poles. This is related to the complete integrability of these equations and the Painlevé property as discussed in [1, Sect. 2].

Figure 6 shows the solution on the real axis for the BO equation. Like diffusion, dispersion prevents shocks, but the mechanism is different: oscillations appear and separate into travelling wave solutions. In the case of KdV, this behaviour gave rise to the numerical discovery of the soliton, as discussed in Sect. 5. In the present example, about eight such solitons can be seen, perhaps most clearly identifiable in the pole parade shown in Fig. 7.

**Fig. 7** Pole locations of a subset of the solutions of the BO equation shown in Fig. 6. Each soliton in that figure can be associated with a pair of conjugate simple poles in the complex plane. The poles that exit on the left re-enter on the right because of the periodic boundary conditions

The initial pole behaviour is very similar to that observed in the Burgers equation, namely, the poles are born at infinity and start to travel in conjugate pairs towards the imaginary axes. Unlike the Burgers case, however, the poles do not remain on the imaginary axes but veer off into the left half-plane. Eight pairs can eventually be associated with the solitons shown in Fig. 6.

In the absence of readily computable error estimates for our procedure we have used the following strategy to validate the results. Poles of the BO equation are simple, each with residue $\pm 2i\nu$; see for example [8]. The order and residue of each pole can be checked by contour integration on a small circle surrounding its location [35].<sup>2</sup> Using this technique, spurious poles and other numerical artifacts can be identified (one example of which is the slight irregularity near $-3 + 0.8i$ in the third frame of Fig. 7).
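The contour-integration check is simple to implement: the trapezoid rule on a small circle around the candidate location converges rapidly, and for a simple pole it recovers the residue essentially exactly. A sketch (names and default radius are ours):

```python
import numpy as np

def residue(f, z0, radius=0.01, n=64):
    """Residue of f at z0 via the trapezoid rule applied to the contour
    integral (1/(2*pi*i)) times the integral of f over a small circle.
    f is any callable analytic on and near the circle (except at z0)."""
    theta = 2 * np.pi * np.arange(n) / n
    z = z0 + radius * np.exp(1j * theta)
    dz = 1j * radius * np.exp(1j * theta)      # dz/dtheta
    # trapezoid rule on a periodic integrand: plain mean times 2*pi
    return np.mean(f(z) * dz) / 1j
```

A computed pole of the BO solution passes the test if the returned value is close to $\pm 2i\nu$; integrating $(z - z_0)^m f(z)$ instead yields higher moments, which checks the order.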

#### **5 The Korteweg-De Vries Equation**

In the case of KdV, the qualitative behaviour of the solutions is similar to that of the BO equation. The dispersion prevents shock formation in the solution by breaking it up into a number of solitons, which is the famous discovery of Zabusky and Kruskal [39]. The iconic figure from that paper is reprinted in Fig. 8. In the left frame of Fig. 9 we reproduce that solution, but rescaled to the domain [−π, π] in order to facilitate comparisons with the other solutions shown in this paper.

The initial behaviour is the same as for the other equations we have seen thus far, namely, there are poles that enter from infinity and travel towards the real axis in conjugate pairs, roughly similar to the first two frames in Fig. 7. As was the case for the BO equation, dispersion causes the poles to drift into the left half-plane and eventually re-enter in the right half-plane because of periodicity. The eight solitons marked in the Zabusky–Kruskal figure are clearly identifiable in the pole plot of Fig. 9, with the poles closer to the real axis corresponding to the taller solitons.

We have used the same strategy mentioned at the end of Sect. 4 for validation of Fig. 9. In the case of KdV the poles are locally of the form $-12\nu/(z - z_0)^2$. The phase information of Fig. 9, when viewed in colour, makes it clear that the computed poles are indeed of order two, and contour integration confirmed the strength coefficient of $-12\nu$.

It should be noted, however, that numerical analytical continuation is inherently ill-conditioned as one goes further into the complex plane, and that puts some limitations on our investigations. Two examples are as follows:

First, for $t \ll 1$ we found that the Fourier-Padé based method was not able to produce the theoretical pole information accurately, presumably because of the distance between the real axis and the nearest singularity. Therefore no figures of this initial phase of the evolution are presented here. Second, in the literature the existence of 'hidden solitons' in the Zabusky–Kruskal experiment is mentioned; see [12] (and the references therein). In order to investigate these hidden solitons, the solution of Fig. 9 has to be continued much farther into the complex plane. Because of spurious poles and the ill-conditioning alluded to above, our efforts at tracking these hidden solitons were inconclusive. Both of these investigations are offered as a challenge to computational mathematicians.

Here are two suggestions for such investigations. First, for KdV it is recommended that the equation be transformed into the potential KdV equation, by

<sup>2</sup> The order of a pole can also be confirmed visually by examining the phase information in the pole plots.

**Fig. 8** The iconic figure of soliton formation in the KdV equation. The initial condition is *u*(*x*, 0) = cos(π*x*) on [0, <sup>2</sup>], with <sup>ν</sup> <sup>=</sup> <sup>0</sup>.0222. Reprinted, with permission, from [39]. Copyright (1965) by the American Physical Society

**Fig. 9** Left: the Zabusky–Kruskal solution shown in Fig. 8, after rescaling to [−π, π]. Right: the corresponding poles in the complex plane

the substitution $u = v_x$; see [22]. This equation has simple poles, which makes it better suited for approximation by Padé methods. Second, the use of multi-precision arithmetic is advisable. Here, everything was done in IEEE double precision, mainly because of the speed it offers to create animations of the pole parades [36].

#### **6 Recurrence**

Historically, the discovery of the soliton in [39] overshadowed the fact that the objective of that paper was something else entirely, namely, the verification of the recurrence phenomenon previously discovered by Fermi, Pasta, Ulam, and Tsingou (FPUT) in yet another celebrated numerical experiment [16].<sup>3</sup> In short, this means that if a nonlinear system is started in a low mode configuration such as the initial condition (5), then higher modes are created by the nonlinear interaction, causing an energy cascade from low modes to high. The upshot of the FPUT experiment was that this process does not continue indefinitely, but eventually reverses, with most of the energy flowing back to the low modes. The effect of this is that the initial condition is reconstructed briefly—approximately so and with a shift in phase—after a certain period of time.

Numerical experiments with KdV such as those reported in Sect. 5 do not reveal the recurrence behaviour in the pole dynamics. Had true recurrence occurred, the poles would have retraced their steps back along the imaginary axes out to infinity or would have cancelled somehow. The most we could observe at the purported recurrence time was a slight widening of the strip of analyticity around the real axis. This lack of a clear recurrence can be attributed to the fact that the phenomenon is rather weak in KdV, as discussed in detail in [20].

For a more convincing demonstration of recurrence one has to look outside the family (1)–(4). Perhaps the best PDE for this purpose is the NLS equation

$$iu_t + u_{xx} + \nu|u|^2 u = 0,\tag{21}$$

where the solution, *u*(*x*, *t*), is complex-valued. We shall consider ν > 0 (known as the focussing case) and continue to work with 2π-periodic boundary conditions. It will be necessary, however, to modify our initial condition to have nonzero mean, so we consider

$$u(x,0) = 1 + \epsilon \cos x.\tag{22}$$

The corresponding solution is an $\epsilon$-perturbation of the $x$-independent solution $u = e^{i\nu t}$. Linearisation about this solution shows that the side-bands $e^{\pm inx}$ grow exponentially for all integers $n$ satisfying [37, 38]

$$0 < n^2 < 2\nu. \tag{23}$$

That is, for $\nu < \frac{1}{2}$ there is no instability, for $\frac{1}{2} < \nu < 2$ a single pair of side-bands is unstable, a double pair for $2 < \nu < \frac{9}{2}$, and so on. The instability is named after Benjamin and Feir, who derived it not via the NLS but directly from the water wave setting [4]. The growth does not continue unboundedly but subsides, and recurrences occur at periodic time intervals. The connection between Benjamin-Feir instability and FPUT recurrence was pointed out in [38].
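Condition (23) is easy to tabulate. A minimal sketch (function name ours) lists the unstable side-band pairs for a given $\nu$:

```python
import numpy as np

def unstable_sidebands(nu):
    """Integers n with 0 < n**2 < 2*nu, i.e. the unstable side-band
    pairs e^{+-inx} of condition (23)."""
    nmax = int(np.ceil(np.sqrt(2.0 * nu)))
    return [n for n in range(1, nmax + 1) if n * n < 2.0 * nu]
```

For instance, $\nu = 3$ (the case of Fig. 10) gives the two unstable pairs $n = 1, 2$, while $\nu = 0.4$ gives none.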

The growth and recurrence pattern for a special case with two unstable modes can be seen in Fig. 10. In frames 2, 3 and 7, 8 the unstable mode $e^{\pm ix}$ dominates, while $e^{\pm 2ix}$ dominates in frames 4, 5, and 6. An almost perfect recurrence occurs in frame 9, after which time the process continues periodically.

<sup>3</sup> Since the mid-2000s it has been recognized that Mary Tsingou deserves credit for her computations, and so the FPU experiment was renamed FPUT.

**Fig. 10** Solutions to the nonlinear Schrödinger equation (21) corresponding to the initial condition (22), with $\nu = 3$, $\epsilon = 0.1$. The unstable modes $e^{\pm ix}$ and $e^{\pm 2ix}$ take turns in dominating the solution, with a near perfect recurrence at *t* = 5. The pole dynamics of the first phase of this solution can be seen in Fig. 11

Pole locations of some of the solutions in Fig. 10 can be seen in Fig. 11. The first unstable mode is controlled by a conjugate pair of simple poles on the imaginary axis. The second is controlled by two pairs of conjugate poles, each pair symmetrically located with respect to the imaginary axis. The first frame shows the initial onset, with the poles on the imaginary axis leading the procession. The second frame is roughly where the first mode reaches its maximum growth, which corresponds to the point at which the poles reach their minimum distance to the real axis. In the third frame, these poles are receding back along the imaginary axes and are overtaken by the approaching secondary sets of poles. The last frame shows a situation where the second mode has become dominant. At the recurrence time, all of these poles will have receded back to infinity.

#### **7 Numerical Tools**

In this final section we review some of the numerical techniques that can be used in this field. Our discussion, which focuses primarily on Padé approximation and its variants, is by no means exhaustive. For other approaches, including tracking the

**Fig. 11** Pole locations of a subset of the solutions of the NLS equation shown in Fig. 10. In the first two frames the unstable mode $e^{\pm ix}$ dominates, while $e^{\pm 2ix}$ dominates in the last two frames. This is determined by which pairs of poles are closest to the real axis

poles through the numerical solution of certain dynamical systems, we refer to [7, 26, 32, 33].

We limit the discussion to 2π-periodic solutions that admit a Fourier series expansion of the form

$$u(x,t) = \sum\_{k=-\infty}^{\infty} c\_k(t)e^{ikx}, \quad -\pi \le x < \pi. \tag{24}$$

In some rare cases the coefficients $c_k(t)$ are known explicitly; cf. (6). Otherwise, the $c_k(t)$ can be computed numerically by a Fourier spectral method and the method of lines [35]. In order to do this step as accurately as possible, it is necessary to truncate the Fourier series to a large number of terms (here we used $|k| \le 256$ or 512), and also use small error tolerances in the time-integration (here on the order of $10^{-12}$ in the stiff integrator ode15s in MATLAB).
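A minimal sketch of this procedure for the viscous Burgers equation, assuming the form $u_t + uu_x = \nu u_{xx}$ with $u(x, 0) = -\sin(x)$ (grid size, tolerances, and the BDF integrator below are illustrative stand-ins for the $|k| \le 256$ or 512 and ode15s settings described above):

```python
import numpy as np
from scipy.integrate import solve_ivp

def solve_burgers(nu=0.1, t_end=1.0, n=128):
    """Fourier spectral method of lines for u_t + u*u_x = nu*u_xx,
    2*pi-periodic, with u(x, 0) = -sin(x). Returns the grid and u(x, t_end)."""
    x = 2 * np.pi * np.arange(n) / n - np.pi
    k = np.fft.fftfreq(n, d=1.0 / n)            # integer wave numbers

    def rhs(t, u):
        c = np.fft.fft(u)
        ux = np.fft.ifft(1j * k * c).real       # u_x by spectral differentiation
        uxx = np.fft.ifft(-(k**2) * c).real     # u_xx
        return -u * ux + nu * uxx

    sol = solve_ivp(rhs, (0.0, t_end), -np.sin(x), method="BDF",
                    rtol=1e-10, atol=1e-12)
    return x, sol.y[:, -1]
```

The Fourier coefficients of the computed solution then serve as input to the analytical continuation step described next.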

When truncated, the series (24) becomes an entire function and will not reveal much singularity information other than perhaps the width of the strip of analyticity around the real axis [32]. A more suitable representation is obtained by converting the truncated series to Fourier-Padé form. For a fixed value of $t$ (suppressed for now in the notation) we convert the series to Taylor-plus-Laurent form by the substitution $z = e^{ix}$:

$$u(x) = \sum_{k=-\infty}^{\infty} c_k e^{ikx} = \sum_{k=0}^{\infty} c_k z^k + \sum_{k=0}^{\infty} c_{-k} (1/z)^k. \tag{25}$$

(It is necessary to redefine $c_0 \to c_0/2$.) Each term on the right can be converted to a type $(N, N)$ rational form as follows. Consider the first term and define

$$f(z) = \sum\_{k=0}^{\infty} c\_k z^k, \quad p(z) = \sum\_{k=0}^{N} a\_k z^k, \quad q(z) = \sum\_{k=0}^{N} b\_k z^k. \tag{26}$$

One then requires that

$$f(z) \approx \frac{p(z)}{q(z)} \quad \Rightarrow \quad p(z) - q(z)f(z) = \mathcal{O}(z^{2N+1}).\tag{27}$$

The latter equation can be set up as a linear system to solve for the coefficients $a_k$ and $b_k$ (after fixing one coefficient, typically $b_0 = 1$). The second term on the right in (25) can be converted to rational form in the same way, which then gives the approximation to $u(x)$ as the ratio of two Fourier series. The pole plots in Sects. 4, 5 and 6 were all computed using this Fourier-Padé approach.
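The linear system implied by (27) can be set up directly from the Taylor coefficients: matching the powers $z^{N+1}, \dots, z^{2N}$ gives $N$ equations for $b_1, \dots, b_N$, after which the $a_k$ follow by convolution. A sketch (function name ours; a dense solve is used for brevity, though structured or least-squares formulations are more robust in practice):

```python
import numpy as np

def pade(c, N):
    """Type (N, N) Pade coefficients (a for p, b for q, with b[0] = 1)
    from the Taylor coefficients c[0], ..., c[2N] of f, following (26)-(27)."""
    c = np.asarray(c, dtype=complex)
    # equations for powers z^(N+1), ..., z^(2N): sum_j b_j c_(m-j) = -c_m
    C = np.array([[c[N + 1 + i - j] for j in range(1, N + 1)]
                  for i in range(N)])
    b = np.concatenate(([1.0], np.linalg.solve(C, -c[N + 1:2 * N + 1])))
    # numerator coefficients by convolution: a_m = sum_j b_j c_(m-j)
    a = np.array([sum(b[j] * c[m - j] for j in range(min(m, N) + 1))
                  for m in range(N + 1)])
    return a, b
```

As a check, the geometric series $f(z) = \sum_k (z/2)^k$ yields the exact rational function $1/(1 - z/2)$, so the type $(1,1)$ approximant places its pole at $z = 2$.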

A promising alternative to the Padé approach to rational approximation is the so-called AAA method, recently proposed in [24], with subsequent extensions to the periodic case [25]. It is not implemented in coefficient space like (24)–(26), but rather uses function values, easily obtained from (26) by an inverse discrete Fourier transform. The representation is the barycentric formula for trigonometric functions [18]

$$u(x) = \frac{\sum_{k=1}^{M} (-1)^{k} \csc\left(\frac{1}{2}(x - x_k)\right) u_k}{\sum_{k=1}^{M} (-1)^{k} \csc\left(\frac{1}{2}(x - x_k)\right)},\tag{28}$$

applicable when $M$ is odd (a similar formula holds for even $M$). When $x_k = -\pi + (k-1)2\pi/M$ (i.e., evenly spaced nodes in $[-\pi, \pi)$) and $u_k = u(x_k)$, then $u(x)$ is identical to the series (24) when truncated to $|k| \le N$, where $2N + 1 = M$.
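For reference, formula (28) is only a few lines of code. The sketch below (naming ours, with no safeguard for evaluation points coinciding with a node) reproduces the trigonometric interpolant in the equispaced case:

```python
import numpy as np

def trig_bary(x, xk, uk):
    """Evaluate the odd-M trigonometric barycentric formula (28) at the
    points in the array x, given nodes xk and data uk."""
    M = xk.size
    signs = (-1.0) ** np.arange(1, M + 1)       # (-1)^k for k = 1, ..., M
    w = signs / np.sin(0.5 * (x[:, None] - xk[None, :]))   # csc terms
    return (w @ uk) / w.sum(axis=1)
```

With $M = 5$ equispaced nodes and data sampled from $\cos x$, the formula returns $\cos x$ to rounding error, since the interpolant is exact for trigonometric polynomials of degree $\le 2$.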

In the AAA algorithm the so-called support points *xk* are not chosen to be equidistant, which changes the formula (28) from a truncated Fourier series to a rational form. The choice of the *xk* proceeds adaptively so as to avoid exponential instabilities.

In preliminary numerical tests the trigonometric AAA algorithm was competitive with the Fourier-Padé method described above. But further experimentation is needed to decide the winner in this particular application field.

Neither of these two methods, however, can give much information on branch point singularities. One way of introducing branches into the approximant is quadratic Padé approximation [30], which is a special case of Hermite-Padé approximation [2]. Define a polynomial $r(z)$ similar to $p(z)$ and $q(z)$ in (26), and in analogy with the rightmost expression in (27) define

$$p(z) + q(z)f(z) + r(z) \left(f(z)\right)^2 = \mathcal{O}(z^{3N+2}).\tag{29}$$

Dropping the order term on the right yields

$$f(z) \approx \frac{-q(z) \pm \sqrt{q(z)^2 - 4p(z)r(z)}}{2\, r(z)},\tag{30}$$

and when this is used to approximate the two terms on the right of (25) a two-valued approximant to $u(x)$ is obtained. Cubic and higher order approximants can be defined analogously, but will not be considered here.
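The conditions (29) form a homogeneous linear system: matching the coefficients of $z^0, \dots, z^{3N+1}$ gives $3N+2$ equations in the $3N+3$ unknown coefficients of $p$, $q$, $r$, so a null vector supplies the approximant, which is then evaluated via (30). A sketch (naming ours; see [15] for numerically careful formulations):

```python
import numpy as np

def quad_pade(c, N):
    """Type (N, N, N) quadratic Pade: coefficient vectors (a, b, r) of
    polynomials p, q, r with p + q*f + r*f**2 = O(z**(3N+2)) as in (29),
    given the Taylor coefficients c[0], ..., c[3N+1] of f."""
    m = 3 * N + 2
    c = np.asarray(c[:m], dtype=complex)
    d = np.convolve(c, c)[:m]                  # Taylor coefficients of f**2
    A = np.zeros((m, 3 * (N + 1)), dtype=complex)
    for i in range(m):
        if i <= N:
            A[i, i] = 1.0                      # p(z) contribution
        for j in range(min(i, N) + 1):
            A[i, N + 1 + j] = c[i - j]         # q(z)*f(z) contribution
            A[i, 2 * N + 2 + j] = d[i - j]     # r(z)*f(z)**2 contribution
    v = np.linalg.svd(A)[2][-1]                # right null vector
    return v[:N + 1], v[N + 1:2 * N + 2], v[2 * N + 2:]

def quad_eval(a, b, r, z, branch=1):
    """Evaluate the two-valued approximant (30) at z; branch = +1 or -1."""
    p, q, rr = (np.polyval(w[::-1], z) for w in (a, b, r))
    return (-q + branch * np.sqrt(q * q - 4 * p * rr)) / (2 * rr)
```

For a function like $f(z) = \sqrt{1+z}$, which satisfies a quadratic polynomial relation exactly, the approximant recovers $f$ (on one of its two branches) to rounding error, branch point included.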

Recall that Fig. 2 showed a solution of the inviscid Burgers equation with a branch point singularity. To test how accurately this singularity can be approximated by these methods, we solved the equation numerically as described below eq. (24). (We refrained from using the explicit series (6), which is too special.) The numerical solution (24) was then continued into the complex plane using the Fourier-Padé and quadratic Fourier-Padé approximations. Although we have a large number of Fourier coefficients available, we found that best results are obtained if only a fraction of those are used in the Padé approximations. For the results shown here, we used only *N* = 35 terms in the series for *f* (*z*) in (26), which translates into a type (17, 17) linear Fourier-Padé approximant, and type (11, 11, 11) in the quadratic case.

The results are shown in Fig. 12. The middle figure is the reference solution, computed to high accuracy by the Newton iteration described in Sect. 2. On the left is the approximation obtained by the linear Fourier-Padé approximant. Away from the imaginary axis the approximation is good, but it is poor on the axis itself. In the absence of branches in the approximant, a series of poles and zeros (the latter not clearly visible) appears as a proxy for the jump in phase. The fact that alternating poles and zeros 'fall in the shadow' of the branch point is a well-known phenomenon in standard Padé approximation [31], and is evidently also present in the trigonometric case.<sup>4</sup> On the other hand, the quadratic Fourier-Padé approximant shown on the right is virtually indistinguishable from the reference solution.

The relative errors in these two approximations are shown in Fig. 13. The linear approximant has low accuracy near the imaginary axis because of the spurious poles mentioned above. By contrast, the quadratic approximant maintains high accuracy, even on the imaginary axis. If one takes the solution generated by the Newton method as exact, the quadratic approximant yields more than five decimal digits of accuracy in almost the whole domain shown in Fig. 13.

Further discussion of numerical aspects of quadratic Padé approximation, including their computation and conditioning, can be found in [15].

<sup>4</sup> Comparing the left frames of Figs. 3 and 12 is interesting. Both solutions can be viewed as a perturbation of the multivalued solution shown in Fig. 2. In Fig. 3 the perturbation is caused by a small amount of diffusion, while in Fig. 12 it is caused by numerical approximation. In both cases the proximity of the multivalued solution is revealed by a sequence of zeros and poles along the phase discontinuity.

**Fig. 12** Approximation of a branch point singularity in the inviscid Burgers equation, at *t* = 0.75. Left: a type (17, 17) linear Padé approximation. Middle: reference solution computed by Newton iteration from (7). Right: a type (11, 11, 11) quadratic Padé approximation

**Fig. 13** Relative errors in the approximation of the branch point singularity of Fig. 12. Left: the linear Padé approximation. Right: the quadratic Padé approximation. Bottom: the colour bar, on a log10 scale, so each change in shade represents roughly one decimal digit of accuracy

**Acknowledgements** This paper is an extended version of an invited talk given at ICIAM2019. The author is grateful to the ICIAM secretariat for the invitation. During the preparation the practical assistance and support of Marco Fasondini and Nick Hale were invaluable, as was Nick Trefethen in the role of sounding board. Numerous other people responded to emails and provided input. The software of Elias Wegert [34] made experimentation a pleasure. The author would like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the programme Complex analysis: techniques, applications and computations while this paper was written. This work was supported by: EPSRC grant number EP/R014604/1. A grant from the H.B. Thom foundation of Stellenbosch University is also gratefully acknowledged.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


© The Editor(s) (if applicable) and Author(s) 2022 T. Chacón Rebollo et al. (eds.), *Recent Advances in Industrial and Applied Mathematics*, ICIAM 2019 SEMA SIMAI Springer Series 1, https://doi.org/10.1007/978-3-030-86236-7