**Artifcial Intelligence: Foundations, Theory, and Algorithms**

## Gerhard Paaß Sven Giesselbach

# Foundation Models for Natural Language Processing

Pre-trained Language Models Integrating Media

## **Artificial Intelligence: Foundations, Theory, and Algorithms**

#### **Series Editors**

Barry O'Sullivan, Dep. of Computer Science, University College Cork, Cork, Ireland

Michael Wooldridge, Department of Computer Science, University of Oxford, Oxford, UK

*Artificial Intelligence: Foundations, Theory and Algorithms* fosters the dissemination of knowledge, technologies and methodologies that advance developments in artificial intelligence (AI) and its broad applications. It brings together the latest developments in all areas of this multidisciplinary topic, ranging from theories and algorithms to various important applications. The intended readership includes research students and researchers in computer science, computer engineering, electrical engineering, data science, and related areas seeking a convenient way to track the latest findings on the foundations, methodologies, and key applications of artificial intelligence.

This series provides a publication and communication platform for all AI topics, including but not limited to:


This series includes monographs, introductory and advanced textbooks, state-ofthe-art collections, and handbooks. Furthermore, it supports Open Access publication mode.

Gerhard Paaß • Sven Giesselbach

# Foundation Models for Natural Language Processing

Pre-trained Language Models Integrating Media

Gerhard Paaß Knowledge Discovery Department, Team NLU Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS) Sankt Augustin, Nordrhein-Westfalen Germany

Sven Giesselbach Knowledge Discovery Department, Team NLU Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS) Sankt Augustin, Nordrhein-Westfalen Germany

This work was supported by Bundesministerium für Bildung und Forschung (ML2R (01IS18038B))

ISSN 2365-3051 ISSN 2365-306X (electronic) Artificial Intelligence: Foundations, Theory, and Algorithms ISBN 978-3-031-23189-6 ISBN 978-3-031-23190-2 (eBook) https://doi.org/10.1007/978-3-031-23190-2

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

#### **Foreword**

Artificial Intelligence ("AI") and Machine Learning, in particular, have been in the center of interest for science, business, and society alike for several years now, and for many, they might seem like an old friend whose capabilities we have come to know and appreciate. After all, Machine Learning-based AI seems to be almost everywhere now. Machine Learning algorithms give us recommendations when we look at our timeline in social media, when we listen to music or watch movies. They are able to transcribe our speech and answer simple questions when we talk to the digital assistants on our mobile phones. AI systems sometimes produce better diagnoses than human doctors in certain cases, and behind the scenes, they run many of today's digital systems in business administration, production, and logistics. Perhaps some of us are even using the Machine Learning-powered capabilities of semi-autonomous driving in the latest automobiles.

As impressive as these applications are—yet another revolution is already on its way. A new wave of AI technology is about to completely change our conception of the capabilities of artificially intelligent systems: *Foundation Models*. While up to now, AI systems were usually built by training learning algorithms on datasets specifically constructed for a particular task at hand, researchers and engineers are now using the almost limitless supply of available data, documents, and images on the Internet to train models relatively independently of the possible tasks for which they might be used later on. Using large document sets with trillions of words, and incorporating hundreds of billions of parameters, such deep network models construct a re-representation of their inputs and store them in a way that later allows them to be used for different tasks such as question/answering and even inference. Such models already produce results that were unimaginable before, and will lead to AI systems that are significantly more flexible, dramatically more powerful, and ultimately closer to a truly general AI.

This book constitutes an excellent and in-depth introduction to the topic of Foundation Models, containing details about the major classes of such models and their use with text, speech, images, and video. It can thus serve as an overview for

those interested in entering the area, as well as a more detailed reference for those interested in learning more about individual approaches. May this book contribute to making Foundation Models accessible to an even wider audience, and thus help to further spread and develop this exciting technology!

July 2022

Bonn, Germany Prof. Dr. Stefan Wrobel

#### **Preface**

Forty years ago, when Deep Neural Networks were proposed, they were intended as a general-purpose computational device that would mimic the workings of the brain. However, due to the insufficient power of computers at that time, they could only be applied to small problems and disappeared from the focus of scientific research.

It was only about 10 years ago that a variant, Convolutional Neural Networks, succeeded in identifying objects in images better than other methods. This was based on the availability of a very large training set of manually annotated images, the high computing power of graphic processing units, and the efficiency of new optimization techniques. Shortly thereafter, many specialized models could improve performance in other areas, for example, recurrent neural networks for predicting sequences or reinforcement learning models for controlling video games. However, the results of these deep neural networks were mediocre in most cases and usually could not match human performance.

The field of language processing could particularly benefit from the idea that the meaning of each word was represented by a long vector, an embedding. Five years ago, this approach was decisively improved by Google engineers. They correlated these embeddings with the embeddings of the other words, which enabled them to compute new embeddings in the next layer, which adapt the embedding of a word to the context. For example, the word "bank" is usually a financial institution near the word "money" and a "sloping land" in the neighborhood of "river". This operation was called self-attention and enabled the models to acquire an unprecedented amount of semantic information. Instead of processing a text word by word, all words were correlated at once, which increases the processing speed.

These models can be used as language models that predict the next word given the previous words of a text. They do not require human annotations and can be trained on plain text, e.g. from the Internet. It turned out that the larger these models become and the more training text they process, the better they perform. A milestone was the GPT-3 model, which has 175 billion parameters and was trained on 570 GB of text. It was able to generate syntactically and semantically convincing text passages that were almost indistinguishable from human-generated texts.

Further experiments showed that these models can also be applied to other types of sequences besides text, e.g. pictures, videos, sound recordings, or sequences of molecules. Each time, small input patches are represented by embeddings and the relationship of the patches is acquired by self-attention. Since this can be done for different media at the same time, the embeddings act as a common cross-media representation. While earlier deep neural networks were designed for one task, these models can be applied to a variety of tasks and are therefore often called "Foundation Models". They offer the perspective of capturing text, speech, images, and sensory impressions of the environment with a single high-performance model, coming close to the original vision of Neural Networks.

The purpose of this book is to describe language models pre-trained on extensive training data. If these models have a sufficient number of parameters, they are called Foundation Models, which can perform new task simply by instruction and, moreover, can handle different media types. In particular, the technical vocabulary but also concepts, methods, and network architectures are introduced. Further, approaches to improve the models are presented and the performance but also the weaknesses of the models are discussed. An extensive section of the book provides an overview of the application of Foundation Models to various language processing tasks. Finally, the capabilities of the Foundation Models in cross-media processing are presented.

The book enables researchers and decision-makers familiar with the fundamentals of text and media processing to participate in the design of language models and Foundation Models and to better evaluate model properties in terms of their impact. For data analysts, students, engineers, and researchers, the book provides an ideal introduction to more advanced literature.

#### **Acknowledgments**

This book was only made possible by the motivating and professionally stimulating environment of the Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS in Sankt Augustin. We would like to thank all colleagues and people from our personal environment who supported us in this book project—be it through professional discussions, proofreading of individual chapters, and helpful comments: Katharina Beckh, Ewald Bindereif, Eduardo Brito, Nilesh Chakraborty, Heike Horstmann, Birgit Kirsch, Katrin Klug, and Najmeh Mousavi. Special thanks go to Heike Horstmann, who provided valuable advice on the structure of the book and organized the open-source publication of the book despite many administrative difficulties.

This research has been funded by the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01IS18038B). This generous support has given us the time we needed to study Foundation Models extensively. The stimulating discussions with colleagues at the research center brought many aspects of the topic to our attention.

Preface ix

But the biggest thanks go to our families, who gave us the necessary space during the long time of writing. In particular, I, Gerhard Paaß, would like to thank my wife Margret Paaß, whose patience and encouragement played a major role in the success of this book, and who was an indispensable help from the planning stage to the correction of the galley proofs. Without your encouragement and support, we would not have been able to produce this book. Thank you very much for all your support!

Sankt Augustin, Germany Gerhard Paaß July 2022 Sven Giesselbach

#### **Contents**








#### **About the Authors**

**Dr. Gerhard Paaß** graduated in mathematics and computer science at the university of Bonn and wrote his doctoral thesis on the forecasting accuracy of economic models. He joined the Gesellschaft für Mathematik und Datenverarbeitung (GMD), today's Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, in Sankt Augustin. He has been project leader of a number of research projects on uncertain reasoning, multimedia neural networks, and prediction uncertainty, and founded the text mining group of IAIS. Dr. Paaß worked in the context of many research stays at universities abroad (China, USA, Australia, Japan). He is the author of numerous publications and has received several best paper awards in the field of AI. In addition, he has been active as a lecturer for many years and, within the framework of the Fraunhofer Big Data and Artificial Intelligence Alliance, has played a very significant role in defining the new job description of the Data Scientist and successfully establishing it in Germany as well. He recently wrote a book on "Artificial Intelligence" in German, which will soon be published in English. As Lead Scientist at Fraunhofer IAIS, Dr. Paaß has contributed to the development of numerous curricula in this field.

**Sven Giesselbach** is the team leader of the Natural Language Understanding (NLU) team at the Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), where he has specialized in Artificial Intelligence and Natural Language Processing. His team develops solutions in the areas of medical, legal, and general document understanding. Sven Giesselbach is also part of the Competence Center for Machine Learning Rhine-Ruhr (ML2R), where he works as a research scientist and investigates Informed Machine Learning, a paradigm in which knowledge is injected into machine learning models, in conjunction with language modeling. He has published more than 10 papers on natural language processing and understanding which focus on the creation of application-ready NLU systems and how to integrate expert knowledge in various stages of the solution design. Sven Giesselbach led the development of the Natural Language Understanding Showroom, a platform for showcasing state-of-the-art Natural Language Understanding models. He regularly gives talks about NLU at summer schools, conferences, and AI-Meetups.

### **Chapter 1 Introduction**

**Abstract** With the development of efficient Deep Learning models about a decade ago, many Deep Neural Networks have been used to solve pattern recognition tasks such as natural language processing and image recognition. An advantage of these models is that they automatically create features arranged in layers which represent the content and do not require manually constructed features. These models rely on Machine Learning employing statistical techniques to give machines the capability to 'learn' from data without being given explicit instructions on what to do. Deep Learning models transform the input in layers step by step in such a way that complex patterns in the data can be recognized. This chapter first describes how a text is pre-processed and partitioned into tokens, which form the basis for natural language processing. Then we outline a number of classical Machine Learning models, which are often used as modules in advanced models. Examples include the logistic classifier model, fully connected layers, recurrent neural networks and convolutional neural networks.

**Keywords** Natural language processing · Text preprocessing · Vector space model · Static embeddings · Recurrent networks · Convolutional networks

#### **1.1 Scope of the Book**

With the development of efficient Deep Learning models about a decade ago, many Deep Neural Networks have been used to solve pattern recognition tasks such as *natural language processing* (*NLP*) and image processing. Typically, the models have to capture the meaning of a text or an image and make an appropriate decision. Alternatively they can generate a new text or image according to the task at hand. An advantage of these models is that they create intermediate features arranged in layers and do not require manually constructed features. *Deep Neural Networks* such as Convolutional Neural Networks (CNNs) [32] and Recurrent Neural Networks (RNNs) [65] use low-dimensional dense vectors as a kind of distributed representation to express the syntactic and semantic features of language.

All these models can be considered as *Artificial Intelligence* (*AI*) Systems. AI is a broad research field aimed at creating intelligent machines, acting similar to humans and animals having natural intelligence. It captures the field's longterm goal of building machines that mimic and then surpass the full spectrum of human cognition. *Machine Learning (ML)* is a subfield of artificial intelligence that employs statistical techniques to give machines the capability to 'learn' from data without being given explicit instructions on what to do. This process is also called 'training', whereby a 'learning algorithm' gradually improves the model's performance on a given task. *Deep Learning* is an area of ML in which an input is transformed in layers step by step in such a way that complex patterns in the data can be recognized. The adjective 'deep' refers to the large number of layers in modern ML models that help to learn expressive representations of data to achieve better performance.

In contrast to computer vision, the size of *annotated* training data for NLP applications was rather small, comprising only a few thousand sentences (except for machine translation). The main reason for this was the high cost of manual annotation. To avoid overfitting, i.e. overadapting models to random fluctuations, only relatively small models could be trained, which did not yield high performance. In the last 5 years, new NLP methods have been developed based on the *Transformer*  introduced by Vaswani et al. [67]. They represent the meaning of each word by a vector of real numbers called *embedding*. Between these embeddings various kinds of "attentions" can be computed, which can be considered as a sort of "correlation" between different words. In higher layers of the network, attention computations are used to generate new embeddings that can capture subtle nuances in the meaning of words. In particular, they can grasp different meanings of the same word that arise from context. A key advantage of these models is that they can be trained with unannotated text, which is almost infinitely available, and overfitting is not a problem.

Currently, there is a rapid development of new methods in the research field, which makes many approaches from earlier years obsolete. These models are usually trained in two steps: In a first *pre-training* step, they are trained on a large text corpus containing billions of words without any annotations. A typical pretraining task is to predict single words in the text that have been masked in the input. In this way, the model learns fine subtleties of natural language syntax and semantics. Because enough data is available, the models can be extended to many layers with millions or billions of parameters.

In a second *fine-tuning* step, the model is trained on a small annotated training set. In this way, the model can be adapted to new specific tasks. Since the finetuning data is very small compared to the pre-training data and the model has a high capacity with many millions of parameters, it can be adapted to the finetuning task without losing the stored information about the language structure. It was demonstrated that this idea can be applied to most NLP tasks, leading to unprecedented performance gains in semantic understanding. This *transfer learning*  allows knowledge from the pre-training phase to be transferred to the fine-tuned model. These models are referred to as *Pre-trained Language Models* (*PLM*).

In the last years the number of parameters of these PLMs was systematically enlarged together with more training data. It turned out that in contrast to conventional wisdom the performance of these models got better and better without suffering from overfitting. Models with billions of parameters are able to generate syntactically correct and semantically consistent fluent text if prompted with some starting text. They can answer questions and react meaningful to different types of prompts.

Moreover, the same PLM architecture can simultaneously be pre-trained with different types of sequences, e.g. tokens in a text, image patches in a picture, sound snippet of speech, image patch sequences in video frames, DNA snippets, etc. They are able to process these media types simultaneously and establish connections between the different modalities. They can be adapted via natural language prompts to perform acceptably on a wide variety of tasks, even though they have not been explicitly trained on these tasks. Because of this flexibility, these models are promising candidates to develop overarching applications. Therefore, large PLMs with billions of parameters are often called *Foundation Models* [9].

This book is intended to provide an up-to-date overview of the current Pre-trained Language Models and Foundation Models, with a focus on applications in NLP:


There are a number of previous surveys of Deep Learning and NLP [1–4, 10, 15, 16, 27, 39, 50, 53, 54, 59, 66]. The surveys of Han et al. [22], Lin et al. [41], and Kalyan et al. [31] are the most up-to-date and comprehensive. Jurafsky and Martin [30] prepare an up-to-date book on this field. In addition, there are numerous surveys for specific model variants or application areas. Where appropriate, we provide references to these surveys. New terminology is usually printed in *italics* and models in **bold**.

The rest of this chapter introduces text preprocessing and *classical NLP models*, which in part are reused inside PLMs. The second chapter describes the main architectures of *Pre-trained Language Models*, which are currently the workhorses of NLP. The third chapter considers a large number of *PLM variants* that extend the capabilities of the basic models. The fourth chapter describes the information captured by PLMs and Foundation Models and analyses their syntactic skills, world knowledge, and reasoning capabilities.

The remainder of the book considers various application domains and identifies PLMs and Foundation Models that currently provide the best results in each domain at a reasonable cost. The fifth chapter reviews *information extraction*  methods that automatically identify structured information and language features in text documents, e.g. for relation extraction. The sixth chapter deals with *natural language generation* approaches that automatically generate new text in natural language, usually in response to a prompt. The seventh chapter is devoted to models for analyzing and creating *multimodal content* that typically integrate content understanding and production across two or more modalities, such as text, speech, image, video, etc. The general trend is that more data, computational power, and larger parameter sets lead to better performance. This is explained in the last *summary* chapter, which also considers social and ethical aspects of Foundation Models and summarizes possible further developments.

#### **1.2 Preprocessing of Text**

The first step in preprocessing is to extract the actual text. For each type of text document, e.g. pdf, html, xml, docx, ePUB, there are specific parsers, which resolve the text into characters, words, and formatting information. Usually, the layout and formatting information is removed.

Then, the extracted text is routinely divided into *tokens*, i.e. words, numbers, and punctuation marks. This process is not trivial, as text usually contains special units like phone numbers or email addresses that must be handled in a special way. Some text mining tasks require the splitting of text into sentences. Tokenizers and sentence splitters for different languages have been developed in the past decades and can be included from many programming toolboxes, e.g. *Spacy* [64].

In the past, many preprocessing methods aimed at generating new relevant features (part-of-speech tags, syntax parse trees) and removing unnecessary tokens (stemming, stop word removal, lemmatization). In most cases, this is no longer necessary with modern approaches that internally automatically derive the features relevant for the task at hand.

In an optional final step, the word-tokens can be further subdivided and rearranged. A simple technique creates *character n-grams* (i.e. all sequences of *n* adjacent characters in a word) as additional features. Alternatively, *word n-grams*  can be formed consisting of *n* consecutive words.

Currently, the most popular approach tries to limit the number of different words in a vocabulary. A common choice is *byte-pair encoding* [19]. This method first selects all characters as tokens. Then, successively the most frequent token pair is merged into a new token and all instances of the token pair are replaced by the new token. This is repeated until a vocabulary of prescribed size is obtained. Note that new words can always be represented by a sequence of vocabulary tokens and characters. Common words end up being a part of the vocabulary, while rarer words are split into components, which often retain some linguistic meaning. In this way, out-of-vocabulary words are avoided.

The *WordPiece* [69] algorithm also starts by selecting all characters of the collection as tokens. Then it assumes that the text corpus has been generated by randomly sampling tokens according to their observed frequencies. It merges tokens *a* and *b* (inside words) in such a way that the likelihood of the training data is maximally increased [60]. There is a fast variant whose computational complexity is linear in the input length [63]. *SentencePiece* [35] is a package containing several subword tokenizers and can also be applied to all Asian languages. All the approaches effectively interpolate between word level inputs for frequent words and character level inputs for infrequent words.

Often the language of the input text has to be determined [29, 57]. Most *language identification methods* extract character *n*-grams from the input text and evaluate their relative frequencies. Some methods can be applied to texts containing different languages at the same time [42, 71]. To filter out offensive words from a text, one can use lists of such toxic words in different languages [62].

#### **1.3 Vector Space Models and Document Classification**

To apply Machine Learning to documents, their text has to be transformed into scalars, vectors, matrices, or higher-dimensional arrangements of numbers, which are collectively called *tensors*. In the previous section, text documents in a corpus were converted into a sequence of tokens by preprocessing. These tokens now have to be translated into tensors.

The *bag-of-words* representation describes a given text document *d* by a vector *x* of token counts. The *vocabulary* is a list of all different tokens contained in the collection of training documents, the *training corpus*. Ignoring the order of tokens, this bag-of-words vector records how often each token of the vocabulary appears in document *d*. Note that most vector entries will be zero, as each document will only contain a small fraction of vocabulary tokens. The vector of counts may be modified to emphasize tokens with high information content, e.g. by using the *tf-idf* statistic [43]. Table 1.1 summarizes different representations for documents used for NLP.

*Document classification* methods aim to categorize text documents according to their content [33, 61]. An important example is the logistic classifier, which uses a bag-of-words vector *x* as input and predicts the probability of each of the *k* possible output classes *y* ∈ {1*,...,k*}. More precisely, there is a random variable *Y* which may take the values 1*,...,k*. To predict the output class *y* from the input *x*, a score vector is first generated as

$$
\mathfrak{u} = A\mathfrak{x} + \mathfrak{b} \tag{1.1}
$$


**Table 1.1** Representations for documents used in NLP Models.

using an *affine transformation* of the input *x*. Here, the vector *x* is transformed by a *linear transformation Ax* and then a *bias* vector *b* is added. The resulting *score vector u* of length *k* is then transformed to a probability distribution over the *k* classes by the *softmax function* 

$$\text{softmax}(\mu\_1, \dots, \mu\_k) = \frac{(\exp(\mu\_1), \dots, \exp(\mu\_k))}{\exp(\mu\_1) + \dots + \exp(\mu\_k)},\tag{1.2}$$

$$p(Y=m|\mathbf{x};A,\mathbf{b}) = \text{softmax}(A\mathbf{x}+\mathbf{b}).\tag{1.3}$$

Since the softmax function converts any vector into a probability vector, we obtain the conditional probability of output class *m* as a function of input *x*. The function

$$\text{LRM}(\mathbf{x}) = \text{softmax}(A\mathbf{x} + \mathbf{b}) \tag{1.4}$$

is called a *logistic classifier* model [48] with parameter vector *w* = *vec(A, b)*. In general, a function mapping the input *x* to the output *y* or a probability distribution over the output is called a *model f (x*; *w)*.

The model is trained using *training data T r* = {*(x*[1] *, y*[1] *), . . . , (x*[*N*] *, y*[*N*] *)*}, whose *examples (x*[*i*] *, y*[*i*] *)* have to be independent and identically distributed (*i.i.d.*). The task is to adjust the parameters *w* such that the predicted probability *p(Y* =*m*|*x*; *w)* is maximized. Following the *Maximum Likelihood principle*, this can be achieved by modifying the parameter vector *w* such that the complete training data has a maximal probability [24, p. 31]

$$\max\_{\mathbf{w}} p(\mathbf{y}^{[1]}|\mathbf{x}^{[1]};\mathbf{w}) \ast \cdots \ast p(\mathbf{y}^{[N]}|\mathbf{x}^{[N]};\mathbf{w}).\tag{1.5}$$

Transforming the expression by log and multiplying by −1*.*0 gives the *classification loss* function *L*MC*(w)*, also called *maximum entropy loss*.

$$L\_{\rm MC}(\mathfrak{w}) = -\left[ \log p(\mathbf{y}^{[1]}|\mathbf{x}^{[1]}; \mathfrak{w}) + \dots + \log p(\mathbf{y}^{[N]}|\mathbf{x}^{[N]}; \mathfrak{w}) \right]. \tag{1.6}$$

To optimize the loss function, its gradient is computed and minimized by stochastic gradient descent or another optimizer (c.f. Sect. 2.4.1).

The performance of classifiers is measured on separate *test data* by accuracy, precision, recall, F1-value, etc. [21, p. 410f]. Because the bag-of-words representation ignores important word order information, document classification by a logistic classifier is less commonly used today. However, this model is still a component in most Deep Learning architectures.

#### **1.4 Nonlinear Classifiers**

It turns out that the logistic classifier partitions the input space by linear hyperplanes that are not able to solve more complex classification tasks, e.g., the XOR problem [47]. An alternative is to generate an internal *hidden vector h* by an additional *affine transformation A*1*x* + *b*<sup>1</sup> followed by a monotonically non-decreasing nonlinear *activation function g* and use this hidden vector as input for the logistic classifier to predict the random variable *Y*

$$h = \mathbf{g}(A\_{\parallel}\mathbf{x} + \mathbf{b}\_{\parallel}),\tag{1.7}$$

$$p(Y = m | \mathbf{x}; \; \mathbf{w}) = \text{softmax}(A\_2 \mathbf{h} + \mathbf{b}\_2),\tag{1.8}$$

where the parameters of this model can be collected in a parameter vector *w* = *vec(A*1*, b*1*, A*2*, b*2*)*. The form of the nonlinear activation function *g* is quite arbitrary, often tanh*(x)* or a *rectified linear unit* ReLU*(x)* = max*(*0*, x)* is used. FCL*(x)* = *g(A*1*x* + *b*1*)* is called a *fully connected layer*.

This model (Fig. 1.1) is able to solve any classification problem arbitrarily well, provided the length of *h* is large enough [21, p. 192]. By prepending more fully connected layers to the network we get a *Deep Neural Network*, which needs

**Fig. 1.1** A neural network for classification transforms the input by layers with affine transformations and nonlinear activation functions, e.g. ReLU. The final layer usually is a logistic classifier

fewer parameters than a shallow network to approximate more complex functions. Historically, it has been called *Multilayer Perceptron* (MLP). Liang et al. [40] show that for a large class of piecewise smooth functions, the sizes of hidden vectors needed by a shallow network to approximate a function is exponentially larger than the corresponding number of neurons needed by a deep network for a given degree of function approximation.

The *support vector machine* [14] follows a different approach and tries to create a hyperplane, which is located between the training examples of the two classes in the input space. In addition, this hyperplane should have a large distance (*margin*) to the examples. This model reduces overfitting and usually has a high classification accuracy, even if the number of input variables is high, e.g. for document classification [28]. It was extended to different kernel loss criteria, e.g. graph kernels [56] which include grammatical features. Besides SVM, many alternative classifiers are used, such as random forests [24, p.588f] and gradient boosted trees [24, p.360], which are among the most popular classifiers.

For these conventional classifiers the analyst usually has to construct input features manually. Modern classifiers for text analysis are able to create relevant features automatically (Sect. 2.1). For the training of NLP models there exist three main paradigms:


#### **1.5 Generating Static Word Embeddings**

One problem with bag-of word representations is that frequency vectors of tokens are unable to capture relationships between words, such as synonymy and homonymy, and give no indication of their semantic similarity. An alternative are more expressive representations of words and documents based on the idea of *distributional semantics* [58], popularized by Zellig Harris [23] and John Firth [18]. According to Firth "a word is characterized by the company it keeps". This states that words occurring in the same neighborhood tend to have similar meanings.

**Fig. 1.2** Word2vec predicts the words in the neighborhood of a central word by logistic classifier *L*. The input to *L* is the embedding of the central word. By training with a large set of documents, the parameters of *L* as well as the embeddings are learned [54, p. 2]

Based on this idea each word can be characterized by a *demb*-dimensional vector *emb(*word*)* <sup>∈</sup> <sup>R</sup>*demb* ,a *word embedding*. Usually, a value between 100 and 1000 is chosen for *demb*. These embeddings have to be created such that words that occur in similar contexts have embeddings with a small vector distance, such as the Euclidean distance. A document then can be represented by a sequence of such embeddings. It turns out that words usually have a similar meaning, if their embeddings have a low distance. Embeddings can be used as input for downstream text mining tasks, e.g. sentiment analysis. Goldberg [20] gives an excellent introduction to static word embeddings. The embeddings are called *static embeddings* as each word has a single embedding independent of the context.

There are a number of different approaches to generate word embeddings in an unsupervised way. Collobert et al. [13] show that word embeddings obtained by predicting neighbor words can be used to improve the performance of downstream tasks such as named entity recognition and semantic role labeling.

**Word2vec** [45] predicts the words in the neighborhood of a central word with an extremely simple model. As shown in Fig. 1.2 it uses the embedding vector of the central word as input for a logistic classifier (1.3) to infer the probabilities of words in the neighborhood of about five to seven positions. The training target is to forecast all neighboring words in the training set with a high probability. For training, Word2Vec repeats this prediction for all words of a corpus, and the parameters of the logistic classifier as well as the values of the embeddings are optimized by stochastic gradient descent to improve the prediction of neighboring words.

The vocabulary of a text collection contains *k* different words, e.g. *k* = 100*,*000. To predict the probability of the *i*-th word by softmax (1.2), *k* exponential terms exp*(ui)* have to be computed. To avoid this effort, the fraction is approximated as

$$\frac{\exp(u\_l)}{\exp(u\_1) + \dots + \exp(u\_k)} \approx \frac{\exp(u\_l)}{\exp(u\_l) + \sum\_{j \in S} \exp(u\_j)},\tag{1.9}$$

where *S* is a small sample of, say, 10 randomly selected indices of words. This technique is called *noise contrastive estimation* [21, p. 612]. There are several variants available, which are used for almost all classification tasks involving softmax computations with many classes. Since stochastic gradient descent works with noisy gradients, the additional noise introduced by the approximation of the softmax function is not harmful and can even help the model escape local minima. The shallow architecture of Word2Vec proved to be far more efficient than previous architectures for representation learning.

Word2Vec embeddings have been used for many downstream tasks, e.g. document classification. In addition, words with a similar meaning may be detected by simply searching for words whose embeddings have a small Euclidean distance to the embedding of a target word. The closest neighbors of "neutron", for example, are "neutrons", "protons", "deuterium", "positron", and "decay". In this way, synonyms can be revealed. Projections of embeddings on two dimensions may be used for the exploratory analysis of the content of a corpus. **GloVe** generates similar embedding vectors using aggregated global word-word co-occurrence statistics from a corpus [51].

It turns out that differences between the embeddings often have an interpretation. For example, the result of *emb(*Germany*)* − *emb(*Berlin*)* + *emb(*Paris*)* has *emb(*France*)* as its nearest neighbor with respect to Euclidean distance. This property is called *analogy* and holds for a majority of examples of many relations such as capital-country, currency-country, etc. [45].

**FastText** [8] representations enrich static word embeddings by using subword information. Character *n*-grams of a given length range, e.g., 3–6, are extracted from each word. Then, embedding vectors are defined for the words as well as their character *n*-grams. To train the embeddings all word and character *n*gram embeddings in the neighborhood of a central word are averaged, and the probabilities of the central word and its character *n*-grams are predicted by a logistic classifier. To improve the probability prediction, the parameters of the model are optimized by stochastic gradient descent. This is repeated for all words in a training corpus. After training, unseen words can be reconstructed using only their *n*-gram embeddings. *Starspace* [68] was introduced as a generalization of FastText. It allows embedding arbitrary entities (such as authors, products) by analyzing texts related to them and evaluating graph structures. An alternative are *spherical embeddings*, where unsupervised word and paragraph embeddings are constrained to a hypersphere [44].

#### **1.6 Recurrent Neural Networks**

*Recurrent Neural Networks* were developed to model sequences *v*1*,...,vT* of varying length *T* , for example the tokens of a text document. Consider the task to predict the next token *vt*+<sup>1</sup> given the previous tokens *(v*1*,...,vt)*. As proposed by Bengio et al. [6] each token *vt* is represented by an embedding vector *x<sup>t</sup>* = *emb(vt)*

**Fig. 1.3** The RNN starts on the left side and successively predicts the probability of the next token with the previous tokens as conditions using a logistic classifier *L*. The hidden vector *h<sup>t</sup>* stores information about the tokens that occur before position *t*

indicating the meaning of *vt* . The previous tokens are characterized by a hidden vector *h<sup>t</sup>* , which describes the state of the subsequence *(v*1*,...,vt*−1*)*. The RNN is a function RNN*(ht, xt)* predicting the next hidden vector *ht*+<sup>1</sup> by

$$\mathbf{h}\_{l+1} = \text{RNN}(\mathbf{h}\_l, \mathbf{x}\_l). \tag{1.10}$$

Subsequently, a *logistic classifier* (1.3) with parameters *H* and *g* predicts a probability vector for the next token *vt*+<sup>1</sup> using the information contained in *ht*+1,

$$p(V\_{l+1}|v\_l, \dots, v\_l) = \text{softmax}(H \* h\_{l+1} + \mathbf{g}),\tag{1.11}$$

as shown in Fig. 1.3. Here *Vt* is the random variable of possible tokens at position *t*. According to the definition of the conditional probability the joint probability of the whole sequence can be factorized as

$$p(v\_1, \ldots, v\_T) = p(V\_T = v\_T | v\_1, \ldots, v\_{T-1}) \* \cdots \* p(V\_2 = v\_2 | v\_1) \* p(V\_1 = v\_1). \tag{1.12}$$

A model that either computes the joint probability or the conditional probability of natural language texts is called *language model* as it potentially covers all information about the language. A language model sequentially predicting the next word by the conditional probability is often referred to *autoregressive language model*. According to (1.12), the observed tokens *(v*1*,...,vt)* can be used as input to predict the probability of the next token *Vt*+1. The product of these probabilities yields the correct joint probability of the observed token sequence *(v*1*,...,vT )*. The same model RNN*(h, x)* is repeatedly applied and generates a sequence of hidden vectors *h<sup>t</sup>* .A *simple RNN* just consists of a single *fully connected layer* 

$$\text{RNN}(\boldsymbol{h}\_{I}, \mathbf{x}\_{I}) = \tanh\left(\boldsymbol{A} \* \begin{bmatrix} \boldsymbol{h}\_{I} \\ \boldsymbol{x}\_{I} \end{bmatrix} + \boldsymbol{b}\right). \tag{1.13}$$

The probabilities of the predicted words *v*1*,...,vT* depend on the parameters *w* = *vec(H, g, A, b, emb(v*1*), . . . , emb(vT ))*. To improve these probabilities, we may use the stochastic gradient descent optimizer (Sect. 2.4.1) and adapt the unknown parameters in *w*. Note that this also includes the estimation of new token embeddings *emb(vt)*. A recent overview is given in [70, Ch. 8–9].

It turns out that this model has difficulties to reconstruct the relation between distant sequence elements, since gradients tend to vanish or "explode" as the sequences get longer. Therefore, new RNN types have been developed, e.g. the *Long Short-Term Memory* (LSTM) [26] and the *Gated Recurrent Unit* (GRU) [11], which capture long-range dependencies in the sequence much better.

Besides predicting the next word in a sequence, RNNs have been successfully applied to predict properties of sequence elements, e.g. named entity recognition [36] and relation extraction [38]. For these applications *bidirectional RNNs* have been developed, consisting of a forward and a backward language model. The *forward language model* starts at the beginning of a text and predicts the next token, while the *backward language model* starts at the end of a text and predicts the previous token. Bidirectional LSTMs are also called *biLSTMs*. In addition, *multilayer RNNs* were proposed [72], where the hidden vector generated by the RNN-cell in one layer is used as the input to the RNN-cell in the next layer, and the last layer provides the prediction of the current task.

*Machine translation* from one language to another is an important application of RNNs [5]. In this process, an input sentence first is encoded by an *encoder* RNN as a hidden vector *h<sup>T</sup>* . This hidden vector is in turn used by a second *decoder* RNN as an initial hidden vector to generate the words of the target language sentence. However, RNNs still have difficulties to capture relationships over long distances between sequence elements because RNNs do not cover direct relations between distant sequence elements.

*Attention* was first used in the context of machine translation to communicate information over long distances. It computes the correlation between hidden vectors of the decoder RNN and hidden vectors of the encoder RNN at different positions. This correlation is used to build a *context vector* as a weighted average of relevant encoder hidden vectors. Then, this context vector is exploited to improve the final translation result [5]. The resulting translations were much better than those with the original RNN. We will see in later sections that attention is a fundamental principle to construct better NLP model.

**ELMo** [52] generates embeddings with bidirectional LSTM language models in several layers. The model is pre-trained as forward and backward language model with a large non-annotated text corpus. During fine-tuning, averages of the hidden vectors are used to predict the properties of words based on an annotated training set. These language models take into account the words before and after a position, and thus employ contextual representations for the word in the central position. For a variety of tasks such as sentiment analysis, question answering, and textual entailment, ELMo was able to improve SOTA performance.

#### **1.7 Convolutional Neural Networks**

*Convolutional Neural Networks* (*CNNs*) [37] are widely known for their success in the image domain. They start with a small quadratic arrangement of parameters called *filter kernel*, which is moved over the input pixel matrix of the image. The values of the filter kernel are multiplied with the underlying pixel values and generate an output value. This is repeated for every position of the input pixel matrix. During training the parameters of a filter kernel are automatically tuned such that they can detect local image patterns such as blobs or lines. Each layer of the network, which is also called *convolution layer*, consists of many filter kernels and a network contains a number of convolution layers. Interspersed *max pooling*  layers perform a local aggregation of pixels by maximum. The final layer of a Convolutional Neural Network usually is a fully connected layer with a softmax classifier.

Their breakthrough was *AlexNet* [34], which receives the RGB pixel matrix of an image as input and is tasked with assigning a content class to the image. This model won the 2012 *ImageNet* competition, where images had to be assigned to one of 1000 classes, and demonstrated the superior performance of Deep Neural Networks. Even earlier the deep CNN of Cire¸san et al. [12] achieved SOTA performance on a number of image classification benchmarks. A highly successful CNN is *ResNet*  [25] which employs a so-called *residual connection* working as a bypass. It can circumvent many layers in the beginning of the training and is the key to training neural networks with many hundred layers. It resulted in image classifiers which have a higher accuracy than humans.

While Recurrent Neural Networks were regarded as the best way to process sequential input such as text, some CNN-based architectures were introduced, which achieved high performance on some NLP tasks. Kim [32] proposed a rather shallow CNN for sentence classification. It contains an embedding layer, a convolutional layer, a max-pooling layer, and a fully connected layer with softmax output. *1-D convolutions* were applied to the embeddings of the input words, basically combining the information stored in adjacent words, treating them as *n*-grams. The embeddings are processed by a moving average with trainable weights. Using this architecture for classification proved to be very efficient, having a similar performance as recurrent architectures that are more difficult to train.

Another interesting CNN architecture is *wavenet* [49], a deeper network used mainly for text-to-speech synthesis. It consists of multiple convolutional layers stacked on top of each other, with its main ingredient being *dilated causal convolutions*. Causal means that the convolutions at position *t* can only utilize prior information *x*1*,..., xt*−1. Dilated means that the convolutions can skip input values with a certain step size *k*, i.e. that in some layer the features at position *t* are predicted using information from positions *t,t* − *k,t* − 2*k,....* This step size *k* is doubled in each successive layer, yielding dilations of size *k*0*, k*1*, k*2*,....* In this way, very long time spans can be included in the prediction. This model architecture has been shown to give very good results for text-to-speech synthesis.

#### **1.8 Summary**

Classical NLP has a long history, and machine learning models have been used in the field for several decades. They all require some preprocessing steps to generate words or tokens from the input text. Tokens are particularly valuable because they form a dictionary of finite size and allow arbitrary words to be represented by combination. Therefore, they are used by most PLMs. Early document representations like bag-of-words are now obsolete because they ignore sequence information. Nevertheless, classifiers based on them like logistic classifiers and fully connected layers, are important building blocks of PLMs.

The concept of static word embeddings initiated the revolution in NLP, which is based on contextual word embeddings. These ideas are elaborated in the next chapter. Recurrent neural networks have been used to implement the first successful language models, but were completely superseded by attention-based models. Convolutional neural networks for image processing are still employed in many applications. PLMs today often have a similar performance on image data, and sometimes CNNs are combined with PLMs to exploit their respective strengths, as discussed in Chap. 7.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### **Chapter 2 Pre-trained Language Models**

**Abstract** This chapter presents the main architecture types of attention-based language models, which describe the distribution of tokens in texts: Autoencoders similar to BERT receive an input text and produce a contextual embedding for each token. Autoregressive language models similar to GPT receive a subsequence of tokens as input. They produce a contextual embedding for each token and predict the next token. In this way, all tokens of a text can successively be generated. Transformer Encoder-Decoders have the task to translate an input sequence to another sequence, e.g. for language translation. First they generate a contextual embedding for each input token by an autoencoder. Then these embeddings are used as input to an autoregressive language model, which sequentially generates the output sequence tokens. These models are usually pre-trained on a large general training set and often fine-tuned for a specific task. Therefore, they are collectively called Pre-trained Language Models (PLM). When the number of parameters of these models gets large, they often can be instructed by prompts and are called Foundation Models. In further sections we described details on optimization and regularization methods used for training. Finally, we analyze the uncertainty of model predictions and how predictions may be explained.

**Keywords** BERT · Language model · GPT-2 · Transformer · Pre-training · Fine-tuning · Sequence-to-sequence model

A model that either computes the joint probability or the conditional probability of natural language texts is called a *language model* as it potentially covers all information about the language. In this chapter, we present the main architecture types of attention-based *language models* (*LMs*), which process texts consisting of sequences of *tokens*, i.e. words, numbers, punctuation, etc.:

• *Autoencoders* (*AE*) receive an input text and produce a contextual embedding for each token. These models are also called *BERT models* and are described in Sect. 2.1.


In this chapter, we focus on NLP, where we consider sequences of text tokens. Historically, the transformer encoder-decoder was developed in 2017 by Vaswani et al. [141] to perform translation of text into another language. The *autoencoder*  [39] and the *autoregressive language model* [118] are the encoder-part and the decoder-part of this transformer encoder-decoder and were proposed later. As they are conceptually simpler, they are introduced in preceding sections. A final section (Sect. 2.4) describes methods for optimizing models during training, determining a model architecture, and estimating the uncertainty of model predictions.

It turned out that the models can first be trained on a large training set of general text documents and are able to acquire the distribution of tokens in correct and fluent language. Subsequently, they can be adapted to a specific task, e.g. by fine-tuning with a small supervised classification task. Therefore, the models are called *Pretrained Language models*.

As we will see later, all models can be applied to arbitrary sequences, e.g. musical notes, sound, speech, images, or even videos. When the number of parameters of these models gets large, they often can be instructed by prompts and are called *Foundation Models*.

#### **2.1 BERT: Self-Attention and Contextual Embeddings**

Common words often have a large number of different meanings. For the word "bank", for instance, the lexical database WordNet [94] lists 18 different senses from "sloping land" to "financial institution". In a simple embedding of the word "bank" introduced in Sect. 1.5 all these meanings are conflated. As a consequence, the interpretation of text based on these embeddings is flawed.

As an alternative, *contextual embeddings* or contextualized embeddings were developed, where the details of a word embedding depend on the word itself as well as on the neighboring words occurring in the specific document. Consequently, each occurrence of the same word in the text has a different embedding depending on the context. Starting with the Transformer [141], a number of approaches have been designed to generate these contextual embeddings, which are generally trained in an unsupervised manner using a large corpus of documents.

**BERT** (Bidirectional Encoder Representations from Transformers) was proposed by Devlin et al. [39] and is the most important approach for generating contextual embeddings. BERT is based on the concept of attention [8] and on prior work by Vaswani et al. [141]. The notion of **attention** is inspired by a brain mechanism that tends to focus on distinctive parts of memory when processing large amounts of information. The details of the computations are explained by Rush [126].

#### *2.1.1 BERT Input Embeddings and Self-Attention*

As input BERT takes some text which is converted to tokens, e.g. by the Wordpiece tokenizer (Sect. 1.2) with a vocabulary of a selected size, e.g. 30,000. This means that frequent words like "dog" are represented by a token of their own, but more rare words like "playing" are split into several tokens, e.g. "play" and "##ing", where "##" indicates that the token is part of a word. As all characters are retained as tokens, arbitrary words may be represented by a few tokens. In addition, there are special tokens like [CLS] at the first position of the input text and two "[SEP]" tokens marking the end of text segments. Finally, during training, there are [MASK] tokens as explained later. Each token is represented by a *token embedding*, a vector of fixed length *demb*, e.g. *demb* = 768. Input sequences of variable length are padded to the maximal length with a special padding token.

Since all token embeddings are processed simultaneously, the tokens need an indication of their position in the input text. Therefore, each position is marked with *position embeddings* of the same length as the token embeddings, which encode the position index. The BERT paper encodes the position number by trainable embeddings, which are added to the input token embeddings [39]. Finally, BERT compares the first and second input segment. Therefore, the algorithm needs the information, which token belongs to the first and second segment. This is also encoded by a trainable segment embedding added to the token and position embedding. The sum of all embeddings is used as *input embedding* for BERT. An example is shown in Fig. 2.1.

#### **Self-Attention to Generate Contextual Embeddings**

BERT starts with input embeddings *x<sup>t</sup>* of length *demb* for each token *vt* of the input sequence *v*1*,...,vT* . These embeddings are transformed by linear mappings to socalled *query-vectors q<sup>t</sup>* , *key-vectors k<sup>t</sup>* and *value-vectors v<sup>t</sup>* . These are computed by multiplying *<sup>x</sup><sup>t</sup>* with the matrices *<sup>W</sup>(q)*, *<sup>W</sup>(k)*, and *<sup>W</sup>(v)* with dimensions *demb* <sup>×</sup> *dq* , *demb* × *dq* and *demb* × *dv* respectively

$$\mathbf{q}\_{\mathsf{I}}^{\mathsf{T}} = \mathbf{x}\_{\mathsf{I}}^{\mathsf{T}} \mathbf{W}^{(q)} \qquad \mathbf{k}\_{\mathsf{I}}^{\mathsf{T}} = \mathbf{x}\_{\mathsf{I}}^{\mathsf{T}} \mathbf{W}^{(k)} \qquad \mathbf{v}\_{\mathsf{I}}^{\mathsf{T}} = \mathbf{x}\_{\mathsf{I}}^{\mathsf{T}} \mathbf{W}^{(v)}.\tag{2.1}$$


**Fig. 2.1** The input of the BERT model consist of a sequence of embeddings corresponding to the input tokens. Each token is represented by a sum consisting of the embedding of the token text, the embedding of its segment indicator and an embedding of its position [39]

Note that the query- and key-vectors have the same length. Then scalar products *q*- *<sup>r</sup> k<sup>t</sup>* between the query-vector *q<sup>r</sup>* of a target token *vr* and the key-vectors *k<sup>t</sup>* of all tokens of the sequence are computed:

$$(\alpha\_{r,1}, \ldots, \alpha\_{r,T}) = \text{softmax}\left(\frac{\mathbf{q}\_r^\mathsf{T}\mathbf{k}\_1}{\sqrt{d\_k}}, \ldots, \frac{\mathbf{q}\_r^\mathsf{T}\mathbf{k}\_T}{\sqrt{d\_k}}\right). \tag{2.2}$$

Each scalar product yields a real-valued *association score (q*- *<sup>r</sup> <sup>k</sup>t)/*√*dk* between the tokens, which depends on the matrices *W(q)* and *W(k)*. This association score is called *scaled dot-product attention*. It is normalized to a probability score *αr,t* by the softmax function. The factor 1*/* <sup>√</sup>*dk* avoids large values, where the softmax function has only tiny gradients. With these weights a weighted average of the value vectors *v<sup>t</sup>* of all sequence elements is formed yielding the new embedding *x*˘*<sup>r</sup>* of length *dv* for the target token *vr*:

$$
\check{\mathfrak{x}}\_{\boldsymbol{r}} = \boldsymbol{\alpha}\_{\boldsymbol{r},\boldsymbol{1}} \* \boldsymbol{\mathfrak{v}}\_{\boldsymbol{1}} + \dots + \boldsymbol{\alpha}\_{\boldsymbol{r},\boldsymbol{T}} \* \boldsymbol{\mathfrak{v}}\_{\boldsymbol{T}}.\tag{2.3}
$$

This algorithm is called *self-attention* and was first proposed by Vaswani et al. [141]. Figure 2.2 shows the computations for the *r*-th token "mouse". Note that the resulting embedding is a *contextual embedding* as it includes information about all words in the input text. A component of *v<sup>t</sup>* gets a high weight whenever the scalar product *q*- *<sup>r</sup> k<sup>t</sup>* is large. It measures a specific form of a correlation between *x<sup>r</sup>* and *x<sup>t</sup>* and is maximal if the vector *x*- *<sup>r</sup> W(q)* points in the same direction as *x*- *<sup>t</sup> W(k)*.

The self-attention mechanism in general is non-symmetric, as the matrices *W(q)* and *W(k)* are different. If token *vi* has a high attention to token *vj* (i.e. *q*- *<sup>i</sup> k<sup>j</sup>* is large), this does not necessarily mean that *vj* will highly attend to token *vi* (i.e. *q*- *<sup>j</sup> k<sup>i</sup>* also is large). The influence of *vi* on the contextual embedding of *vj* therefore is different from the influence of *vj* on the contextual embedding of *vi*. Consider the following example text "Fred gave roses to Mary". Here the word "gave" has different relations to the remaining words. "Fred" is the person who is performing the giving, "roses" are the objects been given, and "Mary" is the recipient of the given objects. Obviously these semantic role relations are nonsymmetric. Therefore, they can be captured with the different matrices *W(q)* and *W(k)* and can be encoded in the embeddings.

**Fig. 2.2** Computation of a contextual embedding for a single token "mouse" by self-attention. By including the embedding of "cheese", the embedding of mouse can be shifted to the meaning of "rodent" and away from "computer pointing device". Such an embedding is computed for every word of the input sequence

Self-attention allows for shorter computation paths and provides direct avenues to compare distant elements in the input sequence, such as a pronoun and its antecedent in a sentence. The multiplicative interaction involved in attention provides a more flexible alternative to the inflexible fixed-weight computation of MLPs and CNNs by dynamically adjusting the computation to the input at hand. This is especially useful for language modeling, where, for instance, the sentence "She ate the ice-cream with the *X*" is processed. While a feed-forward network would always process it in the same way, an attention-based model could adapt its computation to the input and update the contextual embedding of the word "ate" if *X* is "spoon", or update the embedding of "ice-cream" if *X* refers to "strawberries" [17].

In practice all query, key, and value vectors are computed in parallel by *Q* = *XW(q)*, *<sup>K</sup>* <sup>=</sup> *XW(k)*, *<sup>V</sup>* <sup>=</sup> *XW(v)*, where *<sup>X</sup>* is the *<sup>T</sup>* <sup>×</sup> *demb* matrix of input embeddings [141]. The query-vectors *q<sup>t</sup>* , key-vectors *k<sup>t</sup>* and value vectors *v<sup>t</sup>* are the rows of *Q*, *K*, *V* respectively. Then the self-attention output matrix ATTL(X) is calculated by one large matrix expression

$$\check{X} = \operatorname{ATT}(X) = \operatorname{ATT}(\underline{\mathcal{Q}}, \mathcal{K}, V) = \operatorname{softmax}\left(\frac{\mathcal{Q}K^{\mathsf{T}}}{\sqrt{d\_k}}\right)V,\tag{2.4}$$

resulting in a *T* × *dv*-matrix *X*˘ . Its *r*-th row contains the new embedding *x*˘*<sup>r</sup>* of the *r*-th token *vr*.

A number of alternative compatibility measures instead of the scaled dot-product attention (2.2) have been proposed. They are, however, rarely used in PLMs, as described in the surveys [27, 46].

It turns out that a single self-attention module is not sufficient to characterize the tokens. Therefore, in a layer *d*head parallel self-attentions are computed with different matrices *W(q) <sup>m</sup>* , *W(k) <sup>m</sup>* , and *W(v) <sup>m</sup>* , *m* = 1*,...,d*head, yielding partial new embeddings

$$
\check{X}\_m = \text{ATTL}(X\,\textbf{W}\_m^{(q)}, X\,\textbf{W}\_m^{(k)}, X\,\textbf{W}\_m^{(v)}).\tag{2.5}
$$

The emerging partial embeddings *x*˘ *m,t* for a token *vt* are able to concentrate on complementary semantic aspects, which develop during training.

The BERTBASE model has *d*head=12 of these parallel *attention heads*. The lengths of these head embeddings are only a fraction *demb/d*head of the original length *demb*. The resulting embeddings are concatenated and multiplied with a *(d*head <sup>∗</sup> *dv)* <sup>×</sup> *demb*-matrix *<sup>W</sup>(o)* yielding the matrix of intermediate embeddings

$$
\check{X} = \left[ \check{X}\_1, \dots, \check{X}\_{d\_{\text{head}}} \right] W\_0,\tag{2.6}
$$

where *W*<sup>0</sup> is a parameter matrix. If the length of the input embeddings is *demb*, the length of the query, key, and value vector is chosen as *dk* = *dv* = *demb/d*head. Therefore, the concatenation again creates a *T* ×*demb* matrix *X*˘ . This setup is called *multi-head self-attention*. Because of the reduced dimension of the individual heads, the total computational cost is similar to that of a single-head attention with full dimensionality.

Subsequently, each row of *X*˘ , the intermediate embedding vectors *x*˘ - *<sup>t</sup>* , is converted by a *fully connected layer* FCL with a ReLU activation followed by another linear transformation [141]

$$\tilde{\mathbf{x}}\_{l}^{\mathsf{T}} = \mathrm{FCL}(\check{\mathbf{x}}\_{l}) = \mathrm{ReLU}(\check{\mathbf{x}}\_{l}^{\mathsf{T}} \ast \mathbf{W}\_{l} + \mathbf{b}\_{1}^{\mathsf{T}}) \ast \mathbf{W}\_{2} + \mathbf{b}\_{2}^{\mathsf{T}}.\tag{2.7}$$

The matrices *W*0*, W*1*, W*<sup>2</sup> and the vectors *b*1*, b*<sup>2</sup> are parameters. These transformations are the same for each token *vt* of the sequence yielding the embedding *x*˜*<sup>t</sup>* .

To improve training speed, *residual connections* are added as a "bypass", which simply copy the input. They were shown to be extremely helpful for the optimization of multi-layer image classifiers [54]. In addition, *layer normalization* [6] is used for regularization (Sect. 2.4.2), as shown in Fig. 2.3. Together the multi-head selfattention (2.5), the concatenation (2.6), and the fully connected layer (2.7) form an *encoder block*.

This procedure is repeated for a number of *k* layers with different encoder blocks, using the output embeddings of one block as input embeddings of the next block. This setup is shown in Fig. 2.4. The embeddings *x*˜ *k,t* of the last encoder block

**Fig. 2.3** Multi-head self-attention computes self-attentions for each layer *l* and head *m* with different matrices *W(q) l,m*, *W(k) l,m*, and *W(v) l,m*. In this way, different aspects of the association between token pairs, e.g. "mouse" and "cheese", can be computed. The resulting embeddings are concatenated and transformed by a feedforward network. In addition, residual connections and layer normalization improve training convergence [39]

**Fig. 2.4** Parallel computation of contextual embeddings in each encoder block by BERT. The output embeddings of an encoder block are used as input embeddings of the next encoder block. Finally, masked tokens are predicted by a logistic classifier *L* using the corresponding contextual embedding of the last encoder block as input

provides the desired contextual embeddings. The structure of an encoder block overcomes the limitations of RNNs (namely the sequential nature of RNNs) by allowing each token in the input sequence to directly determine associations with every other token in the sequence. BERTBASE has *k*=12 encoder blocks. It was developed at Google by Devlin et al. [39]. More details on the implementation of self-attention can be found in these papers [38, 41, 126].

#### *2.1.2 Training BERT by Predicting Masked Tokens*

The BERT model has a large number of unknown parameters. These parameters are trained in a two-step procedure.


The performance on the fine-tuning task is much better than without pre-training because the model can use the knowledge acquired during pre-training through *transfer learning*.

To pre-train the model parameters, a training task is designed: the *masked language model* (*MLM*). Roughly 15% of the input tokens in the training documents are selected for prediction, which is performed by a logistic classifier (Sect. 1.3)

$$p(V\_l | v\_l, \dots, v\_{l-1}, v\_{l+1}, \dots, v\_T) = \text{softmax}(A\tilde{\mathbf{x}}\_{k,l} + \mathbf{b}),\tag{2.8}$$

receiving the embedding *x*˜ *k,t* of the last layer at position *t* as input to predict the random variable *Vt* of possible tokens at position *t*. This approach avoids cycles where words can indirectly "see themselves".

The tokens to be predicted have to be changed, as otherwise the prediction would be trivial. Therefore, a token selected for prediction is replaced by:


The second and third variants were introduced, as there is a discrepancy between pre-training and the subsequent fine-tuning, were there is no [MASK] token. The authors mitigate this issue by occasionally replacing [MASK] with the original token, or by sampling from the vocabulary. Note that in 1.5% of the cases a random token is inserted. This occasional noise encourages BERT to be less biased towards the masked token (especially when the label token remains unchanged) in its bidirectional context encoding. To predict the masked token, BERT has to concentrate all knowledge about this token in the corresponding output embedding of the last layer, which is the input to the logistic classifier. Therefore, it is often called an *autoencoder*, which generates extremely rich output embeddings.

In addition to predicting the masked tokens, BERT also has to predict, whether the next sentence is a randomly chosen sentence or the actual following sentence (*next sentence prediction*). This requires BERT to consider the relation between two consecutive pieces of text. Again a logistic classifier receiving the embedding of the first [CLS] token is used for this classification. However, this task did not have a major impact on BERT's performance, as BERT simply learned if the topics of both sentences are similar [158].

In Fig. 2.4 the task is to predict a high probability of the token "likes" for the input text "The mouse [MASK] cheese". At the beginning of the training this probability will be very small (≈ 1*/*no. of tokens). By backpropagation for each unknown parameter the derivative can be determined, indicating how the parameters should be changed to increase the probability of "likes". The unknown parameters of BERT comprise the input embeddings for each token of the vocabulary, the position embeddings for each position, matrices*W(q) l,m*, *W(k) l,m*, *W(v) l,m* for each layer *l* and attention head *m* (2.4), the parameters of the fully connected layers (2.7) as well as *A, b* of the logistic classifier (2.8). BERT uses the Adam algorithm [69] for stochastic gradient descent.

The BERTBASE model has a hidden size of *demb*=768, *k*=12 encoder blocks each with *d*head=12 attention heads, and a total of 110 million parameters. The BERTLARGE model has a hidden size of *demb*=1024, and *k*=24 encoder blocks each with *d*head=16 attention heads and a total of 340 million parameters [39]. The English Wikipedia and a book corpus with 3.3 billion words were encoded by the WordPiece tokenizer [154] with a vocabulary of 30,000 tokens and used to pre-train BERT. No annotations of the texts by humans were required, so the training is selfsupervised. The pre-training took 4 days on 64 TPU chips, which are very fast GPU chips allowing parallel processing. Fine-tuning can be done on a single Graphical Processing Unit (GPU).

To predict the masked tokens, the model has to learn many types of language understanding features: syntax ([MASK] is a good position for a verb), semantics (e.g. the mouse prefers cheese), pragmatics, coreference, etc. Note that the computations can be processed in parallel for each token of the input sequence, eliminating the sequential dependency in Recurrent Neural Networks. This parallelism enables BERT and related models to leverage the full power of modern SIMD (single instruction multiple data) hardware accelerators like GPUs/TPUs, thereby facilitating training of NLP models on datasets of unprecedented size. Reconstructing missing tokens in a sentence has long been used in psychology. Therefore, predicting masked tokens is also called a *cloze task* from 'closure' in Gestalt theory (a school of psychology).

It turns out that BERT achieves excellent results for the prediction of the masked tokens, and that additional encoder blocks markedly increase the accuracy. For example, BERT is able to predict the original words (or parts of words) with an accuracy of 45.9%, although in many cases several values are valid at the target position [125]. In contrast to conventional language models, the MLM takes into account the tokens before and after the masked target token. Hence, it is called a *bidirectional encoder*. In addition, self-attention directly provides the relation to distant tokens without recurrent model application. Finally, self-attention is fast, as it can be computed in parallel for all input tokens of an encoder block.

#### *2.1.3 Fine-Tuning BERT to Downstream Tasks*

Neural networks have already been pre-trained many years ago [16], but the success of pre-training has become more evident in recent years. During pre-training BERT learns general syntactic and semantic properties of the language. This can be exploited for a special training task during subsequent *fine-tuning* with a modified training task. This approach is also called *transfer learning* as the knowledge acquired during pre-training is transferred to a related application. In contrast to other models, BERT requires minimal architecture changes for a wide range of natural language processing tasks. At the time of its publication, BERT improved the SOTA on various natural language processing tasks.

Usually, a fine-tuning task requires a classification, solved by applying a logistic classifier *L* to the output embedding *x*˜ *k,*<sup>1</sup> of the [CLS] token at position 1 of BERT's last encoder block. There are different types of fine-tuning tasks, as shown in Fig. 2.5.



**Fig. 2.5** For fine-tuning, BERT is enhanced with an additional layer containing one or more logistic classifiers *L* using the embeddings of the last layer as inputs. This setup may be employed for text classification and comparison of texts with the embedding of [CLS] as input of the logistic classifier. For sequence tagging, *L* predicts a class for each sequence token. For span prediction, two logistic classifiers *L*<sup>1</sup> and *L*<sup>2</sup> predict the start and end of the answer phrase [39]

• *Span prediction* tags a short sequence of tokens within a text. An example is *question answering*. The input to BERT consists of a question followed by "[SEP]" and a context text, which is assumed to contain the answer. Here two different logistic classifiers *L* and *L*˜ are applied to every token output embedding *x*˜ *k,t* of the context and generate the probability that the answer to the question starts/ends at the specific position. The valid span (i.e. the end is not before the start) with the highest sum of start/end scores is selected as the answer. An example is the input "[CLS] When did Caesar die ? [SEP] . . . On the Ides of March, 44 BC, Caesar was assassinated by a group of rebellious senators ...", where the answer to the question is the span "Idesstart of March, 44 BCend". Span prediction may be applied to a number of similar tasks.

Therefore, BERT just needs an extra layer with one or more logistic classifiers for fine-tuning. During fine-tuning with a downstream application, parameters of the logistic models are learned from scratch and usually all parameters in the pre-trained BERT model are adapted. The parameters for the logistic classifiers of the masked language model and the next sentence prediction are not used during fine-tuning.

#### *2.1.4 Visualizing Attentions and Embeddings*

According to Bengio et al. [14], a good representation of language should capture the implicit linguistic rules and common sense knowledge contained in text data, such as lexical meanings, syntactic relations, semantic roles, and the pragmatics of language use. The contextual word embeddings of BERT can be seen as a big step in this direction. They may be used to disambiguate different meanings of the same word.

The self-attention mechanism of BERT computes a large number of "associations" between tokens and merges embeddings according to the strengths of these associations. If *x*1*,..., x<sup>T</sup>* are the embeddings of the input tokens *v*1*,...,vT* , the associations *q*- *<sup>r</sup> k<sup>t</sup>* are determined between the query *q*- *<sup>r</sup>* <sup>=</sup> *<sup>x</sup>*- *<sup>r</sup> W(q)* and the key *k*- *<sup>t</sup>* <sup>=</sup> *<sup>x</sup>*- *<sup>t</sup> <sup>W</sup>(k)* vectors (2.1). Then a sum of value vectors *v*- *<sup>t</sup>* <sup>=</sup> *<sup>x</sup>*- *<sup>t</sup> W(v)* weighted with the normalized associations is formed yielding the new embeddings (2.3).

This is repeated with different matrices *W(q) l,m, <sup>W</sup>(k) l,m, <sup>W</sup>(v) l,m* in *m* self-attention heads and *l* layers. Each layer and head the new embeddings thus captures different aspects of the relations between the embeddings of each layer. For BERTBASE we have *l* = 12 layers and *m* = 12 bidirectional self-attention heads in each layer yielding 144 different "associations" or self-attentions. For the input sentence "The girl and the boy went home. She entered the door." Figure 2.6 shows on the left side the strength of associations for one of the 144 self-attention heads. Between every pair of tokens of the sentence an attention value is calculated and its strength is symbolized by lines of different widths. We see that the pronoun "she" is strongly associated with "the girl". In the subsequent calculations (c.f. Fig. 2.2) the word "she" is disambiguated by merging its embedding with the embeddings of "the" and "girl" generating a new *contextual embedding* of "she", which includes its relation to "girl". On the right side of the figure the input "The girl and the boy went home. He entered the door." is processed. Then the model creates an association of "boy" with "he".

**Fig. 2.6** Visualization of a specific self-attention in the fifth layer of a BERT model with BERTviz [142]. If the next sentence contains the pronoun "she" this is associated with "the girl". If this pronoun is changed to "he" it is related to "the boy". Image created with BERTviz [142], with kind permission of the author

**Fig. 2.7** Visualization of some of the 144 self-attention patterns computed for the sentence "[CLS] the cat sat on the mat [SEP] the cat lay on the rug[SEP]" with BERTviz. Image reprinted with kind permission of the author [142]

Figure 2.7 shows a subset of the self-attention patterns for the sentence "[CLS] the cat sat on the mat [SEP] the cat lay on the rug [SEP]". The self-attention patterns are automatically optimized in such a way that they jointly lead to an optimal prediction of the masked tokens. It can be seen that the special tokens [CLS] and [SEP] often are prominent targets of attentions. They usually function as representatives of the whole sentence [124]. Note, however, that in a multilayer PLM the embeddings generated by different heads are concatenated and transformed by a nonlinear transformation. Therefore, the attention patterns of a single head do not contain the complete information [124]. Whenever the matrices are randomly initialized, the self-attention patterns will be completely different, if the training is restarted with new random parameter values. However, the overall pattern of attentions between tokens will be similar.

Figure 2.10 shows on the left side a plot of six different senses of the token embeddings of "bank" in the *Senseval-3 dataset* projected to two dimensions by *T-SNE* [140]. The different senses are identified by different colors and form wellseparated clusters of their own. Senses which are difficult to distinguish, like "bank building" and "financial institution" show a strong overlap [153]. The graphic

demonstrates that BERT embeddings have the ability to distinguish different senses of words which are observed frequently enough.

There is an ongoing discussion on the inner workings of self attention.Tay et al [134] empirically evaluated the importance of the dot product *q*- *<sup>r</sup> k<sup>s</sup>* on natural language processing tasks and concluded that query-key interaction is "useful but not that important". Consequently they derived alternative formulae, which in some cases worked well and failed in others. A survey of attention approaches is provided by de Santana Correia et al. [37]. There are a number of different attention mechanisms computing the association between embedding vectors [50, 61, 104, 151]. However, most current large-scale models still use the original scaled dot-product attention with minor variations, such as other activation functions and regularizers (c.f. Sect. 3.1.4).

The fully connected layers FCL*(x*˘*t)* in (2.7) contain 2/3 of the parameters of BERT, but their role in the network has hardly been discussed. Geva et al. [49] show that fully connected layers operate as key-value memories, where each key is correlated with text patterns in the training samples, and each value induces a distribution over the output vocabulary. For a key the authors retrieve the training inputs, which yield the highest activation of the key. Experts were able to assign one or more interpretations to each key. Usually lower fully connected layers were associated with shallow patterns often sharing the last word. The upper layers are characterized by more semantic patterns that describe similar contexts. The authors demonstrate that the output of a feed-forward layer is a composition of its memories.

#### *2.1.5 Natural Language Understanding by BERT*

An outstanding goal of PLMs is *Natural Language Understanding* (*NLU*). This cannot be evaluated against a single task, but requires a set of benchmarks covering different areas to assess the ability of machines to understand natural language text and acquire linguistic, common sense, and world knowledge. Therefore, PLMs are fine-tuned to corresponding real-world downstream tasks.

**GLUE** [146] is a prominent benchmark for NLU. It is a collection of nine NLU tasks with public training data, and an evaluation server using private test data. Its benchmarks cover a number of different aspects, which can be formulated as classification problems:


Each task can be posed as *text classification* or *text pair classification* problem. The performance of a model is summarized in a single average value, which has the value 87.1 for human annotators [145]. Usually, there is an online leaderboard where the performance of the different models are recorded. A very large repository of leaderboards is on the PapersWithCode website [109]. Table 2.1 describes the tasks by examples and reports the performance of BERTLARGE. BERT was able to lift the SOTA of average accuracy from 75.2 to 82.1%. This is a remarkable increase, although the value is still far below the human performance of 87.1 with much room for improvement. Recent benchmark results for NLU are described in Sect. 4.1 for the more demanding SuperGLUE and other benchmarks.

#### **BERT's Performance on Other Fine-Tuning Tasks**

The pre-training data is sufficient to adapt the large number of BERT parameters and learn very detailed peculiarities about language. The amount of training data for pre-training usually is much higher than for fine-tuning. Fine-tuning usually only requires two or three passes through the fine-tuning training data. Therefore, the stochastic gradient optimizer changes most parameters only slightly and sticks relatively close to the optimal pre-training parameters. Consequently, the model is usually capable to preserve its information about general language and to combine it with the information about the fine-tuning task.

Because BERT can reuse its general knowledge about language acquired during pre-training, it produces excellent results even with small fine-tuning training data [39].


From these experiments a large body of evidence has been collected demonstrating the strengths and weaknesses of BERT [124]. This is discussed in Sect. 4.2.

In summary, the advent of the BERT model marks a new era of NLP. It combines two pre-training tasks, i.e., predicting masked tokens and determining whether the second sentence matches the first sentence. Transfer learning with unsupervised pretraining and supervised fine-tuning becomes the new standard.

**Table 2.1** GLUE language understanding tasks. BERTLARGE was trained for three epochs on the fine-tuning datasets [38]. The performance of the resulting models is printed in the last column yielding an average value of 82.1


#### *2.1.6 Computational Complexity*

It is instructive to illustrate the computational effort required to train PLMs. Its growth determines the time needed to train larger models that can massively improve the quality of language representation. Assume *D* is the size of the hidden embeddings and the input sequence has length *T* , then the intermediate dimension of the fully connected layer FCL is set to 4*D* and the dimension of the keys and values are set to *D/H* as in Vaswani et al. [141]. Then according to Lin et al. [81] we get the following computational complexities and parameters counts of self-attention and the position-wise FCL (2.7):


As long as the input sequence length *T* is small, the hidden dimension *D* mainly determines the complexity of self-attention and position-wise FCL. The main limiting factor is the FCL. But when the input sequences become longer, the sequence length *T* gradually dominates the complexity of these modules, so that self-attention becomes the bottleneck of the PLM. Moreover, the computation of self-attention requires that an attention score matrix of size *T* × *T* is stored, which prevents the computation for long input sequences. Therefore, modifications reducing the computational effort for long input sequences are required.

To connect all input embeddings with each other, we could employ different modules. Fully connected layers require *T* ∗ *T* networks between the different embeddings. Convolutional layers with a kernel width *K* do not connect all pairs and therefore need *O(*log*K(T ))* layers in the case of dilated convolutions. RNNs have to apply a network *T* times. This leads to the following complexities per layer [81, 141]


The last line describes a restricted self-attention, where self-attention only considers a neighborhood of size *R* to reduce computational effort. Obviously the computational complexity per layer is a limiting factor. In addition, computation for recurrent layers need to be sequential and cannot be parallelized, as shown in the column for sequential operations. The last column shows the path length, i.e. the number of computations to communicate information between far-away positions. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. Here self-attention has a definite advantage compared to all other layer types. Section 3.2 discusses advanced approaches to process input sequences of larger length. In conclusion, BERT requires less computational effort than alternative layer types.

#### *2.1.7 Summary*

*BERT* is an autoencoder model whose main task is to derive context-sensitive embeddings for tokens. In a preliminary step, tokens are generated from the words and letters of the training data in such a way that most frequent words are tokens and arbitrary words can be composed of tokens. Each token is encoded by an input embedding. To mark the position of each input token, a position embedding is added to the input embedding.

In each layer of BERT, the lower layer embeddings are transformed by selfattention to a new embedding. Self-attention involves the computation of scalar products between linear transformations of embeddings. In this way, the embeddings in the next layer can adapt to tokens from the context, and the embeddings become context-sensitive. The operation is performed in parallel for several attention heads involving different linear projections. The heads can compute associations in parallel with respect to different semantic features. The resulting partial embeddings are concatenated to a new embedding. In addition to selfattention heads, each encoder block contains a fully connected layer as well as normalization operations.

The original BERT model consists of six encoder blocks and generates a final embedding for each input token. BERT is pre-trained on a very large document collection. The main pre-training task is to predict words from the input sequence, which have been replaced by a [MASK] token. This is done by using the last layer embedding of the token as input to a logistic classifier, which predicts the probabilities of tokens for this position. During pre-training the model parameters are optimized by stochastic gradient descent. This forces the model to collect all available information about that token in the output embedding. The first input token is the [CLS] token. During pre-training, it is used for next sentence prediction, where a logistic classifier with the [CLS]-embedding as input has to decide, if the first and second sentence of the input sequence belong together or not.

Typically, the pre-trained model is fine-tuned for a specific task using a small annotated training dataset. An example is the supervised classification task of whether the input text expresses a positive, negative or neutral sentiment. Again a logistic classifier with the [CLS]-embedding as input has to determine the probability of the three sentiments. During pre-training all parameters of the model are adjusted slightly. It turns out that this transfer learning approach has a much higher accuracy than supervised training only on the small training dataset, since the model can use knowledge about language acquired during pre-training.

Experiments show that BERT is able to raise the SOTA considerably in many language understanding tasks, e.g. the GLUE benchmark. Other applications are named entity recognition, where names of persons, locations, etc. have to be identified in a text, or question answering, where the answer to a question has to be extracted from a paragraph. An analysis of computational complexity shows that BERT requires less computational effort than alternative layer types. Overall, BERT is the workhorse of natural language processing and is used in different variants to solve language understanding problems. Its encoder blocks are reused in many other models.

Chapter 3 describes ways to improve the performance of BERT models, especially by designing new pre-training tasks (Sect. 3.1.1). In Chap. 4 the knowledge acquired by BERT models is discussed. In the Chaps. 5–7, we describe a number of applications of BERT models such as relation extraction (Sect. 5.4) or document retrieval (Sect. 6.1).

#### **2.2 GPT: Autoregressive Language Models**

#### *2.2.1 The Task of Autoregressive Language Models*

To capture the information in natural language texts the conditional probability of tokens can be described by a language model. These *autoregressive language models* aim to predict the probability of the next token in a text given the previous tokens. If *Vt*+<sup>1</sup> is a random variable whose values are the possible tokens *vt*+<sup>1</sup> at position *t* + 1, we have to calculate the conditional probability distribution *p(Vt*+1|*v*1*,...,vt)*. According to the definition of conditional probability the probability of the complete text *v*1*,...,vT* can be computed as

$$p(V\_{\mathbf{l}} = v\_{\mathbf{l}}, \dots, V\_{T} = v\_{T}) = p(V\_{T} = v\_{T} | v\_{\mathbf{l}}, \dots, v\_{T-\mathbf{l}}) \* \dots \* p(V\_{\mathbf{l}} = v\_{\mathbf{l}}).\tag{2.9}$$

Therefore, the conditional probability can represent all information about valid sentences, including adequate and bad usage of language. Qudar et al. [115] provide a recent survey of language models.

In Sect. 1.6, we used RNNs to build language models. However, these had problems determining long-range interactions between tokens. As an alternative, we can employ self-attention to infer contextual embeddings of the past tokens *v*1*,...,vt* and predict the next token *vt*+<sup>1</sup> based on these embeddings.

Consequently, we need to restrict self-attention to the tokens *v*1*,...,vt* . This is the approach taken by the **Generative Pre-trained Transformer** (**GPT**) [116, 118]. Before training, the text is transformed to tokens, e.g. by byte-pair encoding (Sect. 1.2). On input, these tokens are represented by token embeddings and position embeddings (Sect. 2.1.1). During training the GPT-model performs the selfattention computations described in Sect. 2.1.1 in the same way as for BERT. For

predicting the probabilities of different tokens at position *t* + 1, the self-attentions are restricted to previous tokens *v*1*,...,vt* and their embeddings. The probability of the possible next tokens at position *t* + 1 is computed by a logistic classifier

$$p(V\_{l+1}|v\_l, \dots, v\_l) = \text{softmax}(A\tilde{\mathbf{x}}\_{k,l} + \mathbf{b}),\tag{2.10}$$

which takes as input the embedding *x*˜ *k,t* of the last layer *k* at position *t* to predict the random variable *Vt*+<sup>1</sup> of possible tokens at position *t* +1 (Fig. 2.8). This approach is called *masked self-attention* or *causal self-attention* because the prediction depends only on past tokens. Since GPT generates the tokens by sequentially applying the same model, it is called an *autoregressive language model*.

#### *2.2.2 Training GPT by Predicting the Next Token*

The training objective is adapted to the language modeling task of GPT. Figure 2.8 shows the range of computations for two consecutive tokens. By *teacher forcing* the model uses the observed tokens *v*1*,...,vt* up to position *t* to compute self-attentions and predict the token probabilities for the next token *vt*+1. This is justified by the factorization (2.9) of the full distribution. Note that the contextual embedding of a token *vs*, *s<t*, changes each time when a new token *vt*+<sup>1</sup>*, vt*+2*,...* is taken into account in the masked self-attention. As GPT considers only the tokens before the target token *vt*+1, it is called an *unidirectional encoder*. An intuitive high-level overview over GPT is given by Alammar [3].

During training the model parameters have to be changed by optimization such that the probabilities of observed documents (2.9) get maximal. By this *Maximum Likelihood estimation* (*MLE*) the parameters can be optimized for a large corpus of documents. To avoid numerical problems this is solved by maximizing the *loglikelihood*, sum of logarithms of (2.9)

$$\log p(v\_1, \dots, v\_T) = \log p(v\_T|v\_1, \dots, v\_{T-1}) + \dots + \log p(v\_2|v\_1) + \log p(v\_l). \tag{2.11}$$

Alternatively we can minimize the negative log-likelihood − log *p(v*1*,...,vT )*.

GPT-2 can process an input sequence of 1024 tokens with an embedding size of 1024. In its medium version it has 345M parameters and contains 24 layers, each with 12 attention heads. For the training with gradient descent a batch size of 512 was utilized. The model was trained on 40 GB of text crawled from Reddit, a social media platform. Only texts that were well rated by other users were included, resulting in a higher quality data set. The larger model was trained on 256 cloud TPU v3 cores. The training duration was not disclosed, nor the exact details of training.

The quality of a language model may be measured by the probability *p(v*1*,...,vT )* of a given text collection *v*1*,...,vT* . If we normalize its inverse

**Fig. 2.8** The input of the GPT model are the embeddings of tokens *v*1*,...,vt* up to position *t*. GPT computes contextual self-embeddings of these tokens in different layers and uses the output embedding of the last token *vt* ="to" in the highest layer to predict the probabilities of possible tokens at position *t* + 1 with a logistic classifier *L*. This probability should be high for the actually observed token "new" (left). Then the observed token *vt*+<sup>1</sup> ="new" is appended to the input sequence and included in the self-attention computation for predicting the probabilities of possible tokens at position *t* + 2, which should be high for "york" (right)

by the number *T* of tokens we get the *perplexity* [28]

$$ppl(v\_1, \ldots, v\_T) := p(v\_1, \ldots, v\_T)^{-\frac{1}{T}}.\tag{2.12}$$

A low perplexity indicates a high probability of the text. If we assume that the conditional probabilities *p(vt*|*v*1*,...,vt*−1*)* are identical for all *t*, we get *ppl(v*1*,...,vT )* = 1*/p(vt*|*v*1*,...,vt*−1*)*, i.e. the inverse probability of the next token. GPT-2 was able to substantially reduce the perplexity on a number of benchmark data sets, e.g. from 46.5 to 35.8 for the *Penn Treebank corpus* [117] meaning that the actual words in the texts were predicted with higher probability.

#### **Visualizing GPT Embeddings**

Kehlbeck et al. [66] investigated the relative location of embeddings in multivariate space for both BERT and GPT-2, each with 12 layers. They calculated 3-D projections using both *principal component analysis (PCA)* [111] and UMAP [89]. The latter can preserve the local structure of neighbors, but—differently to PCA—is unable to correctly maintain the global structure of the data. These 3d-scatterplots can be interactively manipulated on the website [66]. It turns out that GPT-2 forms two separate clusters: There is a small cluster containing just all tokens at position 0, while the embeddings at other positions form ribbon-like structures in the second cluster.

Careful investigations have indicated that most embedding vectors are located in a narrow cone, leading to high cosine similarities between them [25]. The authors identify isolated clusters and low dimensional manifolds in the contextual embedding space. Kehlbeck et al. [66] show that tokens with the same part-ofspeech tag form ribbon-like structures in the projections (Fig. 2.9 left). Function words are all located on a tight circular structure, whereas content words like nouns

**Fig. 2.9** Visualization of embeddings with PCA together with the corresponding part-of speech tags. On the left side are GPT-2 embeddings of layer 0 of tokens of positions *>* 0 which form ribbon-like structures for the different POS tags, with function words close to the top. On the right side the embeddings of BERT for layer 0 are shown. Image reprinted with kind permission of the author [66]

and verbs are located in other elongated structures and have overlap with other POStags. The embeddings generated by BERT form one or more clusters (Fig. 2.9 right). They are quite separated for function words, but show some overlap for content words like nouns, verbs, or adjectives.

The GPT-2 embeddings of content words like "banks" and "material" at positions *>* 0 form elongated band-structures, as shown in the right part of Fig. 2.10. For higher layers the PCA projections get more diffuse. The user can read the token context by pointing to each dot.

Token-based *self-similarity* is the mean cosine similarity of the same token found in different sentences. In BERT as well as GPT-2, the self-similarity is higher for content than function words [66]. This may indicate that function words have more diverse semantic roles in different contexts. It is interesting to evaluate the 10 nearest neighbors of a token with respect to cosine similarity. In the lower layers, for both models the nearest tokens were in most cases the same tokens, except for a few content words. In the higher layers this changed and different tokens were the nearest tokens. This shows that more and more context is included in the embeddings of higher layers.

The authors also investigated the embeddings generated by a number of other PLM types. They find that their structure is very different as they form different clusters and manifolds. They argue that this structure has to be taken into account for new applications of the models.

**Fig. 2.10** Plot of BERT-embeddings of different senses of "bank" projected to two dimensions by T-SNE (left). The legend contains a short description of the respective WordNet sense and the frequency of occurrence in the training data. Image[153]. The right side shows PCA projections of the embeddings of "banks" (lower strip) and "material" (middle strip) as well as other words computed for different contexts. Image interactively generated, printed with kind permission of the authors [66]

#### *2.2.3 Generating a Sequence of Words*

After training the GPT model can predict the probabilities of the tokens at the next position *t* + 1 given the previous tokens *v*1*,...,vt* . To generate a text we have to select a sequence of tokens according to these probabilities.


There are also strategies which explicitly avoid previously generated tokens by reducing the corresponding scores in the update formula [67]. Both top-*k* and top-*p* sampling usually generate plausible token sequences and are actually employed to generate texts.

There are a number of approaches to improve token selection. Meister et al. [90] found that human-produced text tends to have evenly distribution of "surprise". This means that the next token should on average not be too rare and not be too frequent. They propose a number of sampling criteria, e.g. a variance regularizer.

Martins et al. [86] argue that softmax-generated output distributions are unrealistic, as they assign a positive probability to every output token. They propose the *Entmax transformation* which generates a sparse probability distribution from the computed scores, where part of the probabilities are exactly zero. The Entmax transformation can be controlled by a parameter *α* ≥ 1. For *α* = 1 we get softmax and *α* = ∞ recovers arg max. For intermediate values ∞ *>α>* 1*.*0 some tokens get exactly zero probability. Entmax losses are convex and differentiable and therefore may be trained by backpropagation. As in top-*p* sampling and in opposition to top-*k* sampling, Entmax sampling considers a varying number of tokens depending on the context. Experiments show that Entmax leads to better perplexities and less repetitions than other approaches. Compared with top-*p* sampling it has a higher variation in the number of tokens considered.

Khandelwal et al. [68] try to improve the estimated probabilities of the language model by statistics of token *n*-grams. They perform a nearest neighbor search on the last tokens already processed. As distance measure they use the distances of the pretrained embedding space. From the retrieved nearest neighbors they get additional evidence on the probable next token, which is merged with the token probabilities of the language model. In this way, they are able to improve the perplexity of language models. The approach is particularly helpful in predicting rare patterns, e.g. factual knowledge.

Yang et al. [157] analyze the properties of the softmax function. They find that the standard softmax does not have enough capacity to model natural language, as it restricts the rank of the mapping to probabilities. They propose to predict probabilities by a *Mixture of Softmaxes*, a convex combination of different logistic classifiers, which is more expressive than a single softmax. The authors show that this modification yields better perplexities in language modeling and also improves the performance of other transformer architectures [101].

#### *2.2.4 The Advanced Language Model GPT-2*

**GPT-2** [118] is the first language model, which is able to generate documents of grammatically correct and semantically plausible text. Its largest version has 48 encoder blocks with 1.5B parameters and covers sequences of 1600 tokens. Given an initial text the model adapts to the style and content of this text and generates an answer, which often cannot be distinguished from human-generated continuations. Longer generated texts, however, sometimes tend to be repetitive and less coherent.

For GPT-2 top-*k* truncated sampling was used to generate the example text [117] shown in Fig. 2.11. As can be seen there are no syntax errors and the generated content is plausible. The authors remark that one in two trials were of high quality.

**Fig. 2.11** Given the input text, GPT-2 generates a continuation by top-*k* sampling [117]. Quoted with kind permission of the authors

The model adapts to the style and content of the input text. This allows the user to generate realistic and coherent continuations about a topic they like. Obviously the topic has to be mentioned in the Reddit training data, which covers a broad spectrum of themes such as news, music, games, sports, science, cooking, and pets.

The model was able to solve many tasks better than previous models without being trained on the specific task. This type of learning is called *zero-shot learning*. For example, GPT-2 had a perplexity of 35.8 on the test set of the Penn Treebank compared to the inferior prior SOTA of 46.5 [117]. This was achieved without training GPT-2 on the *Penn Treebank corpus* [135].

#### *2.2.5 Fine-Tuning GPT*

By fine-tuning, GPT-2 may be adapted to new types of text, for example new genres of text. To create song lyrics, for example, St-Amant [4] uses a dataset of 12,500 English rock song lyrics and fine-tunes GPT-2 for 5 epochs. Then the model is able to continue the lyrics of pop songs, which had not been seen by the model during training. The model had a high BLEU score of 68 when applied to song lyrics. Another experiment describes the generation of poetry [19].

Similar to BERT, a pre-trained GPT-2 can also be modified to perform a classification task. An example is fine-tuning to the classification of the sentiment of a document as positive or negative. Radford et al. [116] encode the classification task as a text with specific tokens and a final end token [END]. Then the model has to predict the sequence. The embedding of [END] in the highest layer is used as input to a logistic classifier, which is trained to predict the probability of classes. The authors found that including language modeling (2.11) of the fine-tuning data as an auxiliary objective to fine-tuning improved generalization and accelerated convergence. They were able to improve the score on GLUE (Sect. 2.1.5) from 68.9 to 72.8 and achieved SOTA in 7 out of 8 GLUE tasks for natural language understanding. The results show that language models capture relevant information about syntax and semantics.

However, GPT operates from left to right when predicting the next token. In the sentences "I went to the bank to deposit cash" and "I went to the bank to sit down", it will create the same context-sensitive embedding for "bank" when predicting "sit" or "deposit", although the meaning of the token "bank" is different in both contexts. In contrast, BERT is bidirectional and takes into account all tokens of the text when predicting masked tokens. This fact explains why BERT for some tasks shows a better performance.

#### *2.2.6 Summary*

GPT has an architecture similar to a BERT model that generates the tokens of a sentence one by one. It starts with an input sequence of tokens, which can be empty. Tokens are encoded as a sum of token embeddings and position embeddings. GPT uses the same encoder blocks as BERT, but the computations are masked, i.e. restricted to the already generated tokens. For these tokens the model produces contextual embeddings in several layers. The embedding of the last token in the top layer is entered into a logistic classifier and this calculates the probability of the tokens for the next position. Subsequently, the observed token is appended to the input at the next position and the computations are repeated for the next but one position. Therefore, GPT is called an autoregressive language model.

During training the parameters are changed by stochastic gradient descent in such a way that the model predicts high probabilities of the observed tokens in the training data. The maximum likelihood criterion is used, which optimizes the probability of the input data. When the model has been trained on a large text dataset it can be applied. Conditional to a start text it can sequentially compute the probability of the next token. Then a new token can be selected according to the probabilities.

If all alternative tokens are taken into account, rare tokens are often selected. Usually, the number of eligible tokens is restricted to *k* high-probability tokens (top-*k* sampling) or only high-probability tokens are included up to a prescribed probability sum *p* (top-*p* sampling). In this way, much better texts are generated. Advanced language models like GPT-2 have billions of parameters and are able to generate plausible stories without syntactic errors.

GPT models can also be fine-tuned. A first type of fine-tuning adapts the model to a specific text genre, e.g. poetry. Alternatively, GPT can be used as a classifier, where the output embedding of the most recently generated token for an input text is input to a logistic classifier. With this approach, GPT-2 was able to improve SOTA for most natural language understanding task in the GLUE benchmark. This shows that GPT-2 has acquired a comprehensive knowledge about language. However, since self-attention is only aware of past tokens, models like BERT are potentially better as they can take into account all input tokens during computations.

Chapter 3 discusses how to improve the performance of GPT models, in particular by using more parameters (Sect. 3.1.2). These large models with billions of parameters can be instructed to perform a number of tasks without fine-tuning (Sect. 3.6.3). In the Chaps. 5–7, we describe a number of applications of GPTmodels such as question-answering (Sect. 6.2.3), story generation (Sect. 6.5), or image generation from text (Sect. 7.2.6).

#### **2.3 Transformer: Sequence-to-Sequence Translation**

#### *2.3.1 The Transformer Architecture*

Translation models based on Recurrent Neural Networks (Sect. 1.6) have a major limitation caused by the sequential nature of RNNs. The number of operations required to determine the relation between tokens *vs* and *vt* grows with the distance *t* − *s* between positions. The model has to store the relations between all tokens simultaneously in a vector, making it difficult to learn complex dependencies between distant positions.

The *Transformer* [141]—similar to RNN-translation models—is based on an encoder and a decoder module (Fig. 2.13). The encoder is very similar to BERT, while the decoder resembles GPT. It is a *sequence-to-sequence model* (*Seq2seq*), which translates a source text of the input language to a target text in the target language. Instead of relating distant tokens by a large number of computation steps, it directly computes the self-attention between these token in parallel in one step.

The *encoder* generates contextual embeddings *x*˜ <sup>1</sup>*,..., x*˜ *<sup>T</sup>*src of the source text tokens *v*1*,...,vT*src with exactly the same architecture as the BERT model (Fig. 2.4). The original transformer [141] uses 6 encoder blocks. The generated embeddings of the last layer are denoted as *x*˘ <sup>1</sup>*,..., x*˘ *<sup>T</sup>*src .

The transformer *decoder* step by step computes the probability distributions *p(St*|*s*1*,...,st*−1*, v*1*,...,vT*src *)* of target tokens *st* similar to the Recurrent Neural Network. Note that the source tokens *vi* as well as observed target tokens *sj* are taken as conditions. By the definition of conditional probability this yields the total probability of the output distribution

$$p(S\_l = s\_1, \ldots, S\_T = s\_T | \upsilon\_1, \ldots, \upsilon\_{T\_{\text{sw}}}) \tag{2.13}$$

$$p = p(S\_T = s\_T | s\_1, \ldots, s\_{T-1}, \upsilon\_1, \ldots, \upsilon\_{T\_{\text{sw}}}) \cdot \cdot p(S\_l = s\_1 | \upsilon\_1, \ldots, \upsilon\_{T\_{\text{sw}}}),$$

where *St* is a random variable with the possible target tokens *st* at position *t* as its values. This probability is maximized during training.

**Fig. 2.12** The transformer [141] uses *k* encoder blocks with the same architecture as in BERT (Fig. 2.4) to generate contextual embeddings of all tokens of the input text. The decoder block is an autoregressive language model (Fig. 2.8) and sequentially predicts the next token in the target language. Each encoder block contains a multi-head self-attention for the current sequence of output tokens. By cross-attention the information from the input sequence is included. The calculations are repeated for all current input tokens and are very similar to the self-attention computations. The resulting vector is transformed by a fully connected layer yielding the embeddings of that layer

We denote the already translated tokens by *s*0*, s*1*,...,st*−<sup>1</sup> were *s*<sup>0</sup> is the token "[BOS]" indicating the beginning of the output text. The decoder first computes a self-attention for these tokens using the formula (2.4) as for BERT. As only part of the target tokens are covered and the rest is 'masked', this layer is called *masked multi-head self-attention* yielding intermediate contextual embeddings *s*˜0*,s*˜1*,...,s*˜*t*−<sup>1</sup> for the target tokens *s*0*, s*1*,...,st*−1.

#### **Cross-Attention**

Then the decoder performs a *cross-attention* CATL*(V*˜ *, X*˘ *)* with the input text embeddings of the highest encoder block (Fig. 2.12). Here the query-vectors are computed for the embeddings of the target tokens *S*˜*<sup>t</sup>* = *(s*˜0*,s*˜1*,...,s*˜*t*−1) provided by the respective decoder block. The key and value vectors are computed for the embeddings *X*˘ = *x*˘ <sup>1</sup>*,..., x*˘ *<sup>T</sup>*src of the last encoder block. Note that cross attention employs the same Eq. (2.4) with matrices *W(q), W(k), W(v)* as the BERT selfattentions. This is done in parallel and called *multi-head cross-attention*. In this

**Fig. 2.13** The transformer [141] uses an encoder with the same architecture as BERT to generate embeddings of all tokens of the input sentence. Each encoder block performs multi-head selfattention of the input sequence followed by a fully connected layer (FCL) . The decoder is similar to a GPT model and sequentially predicts the next token in the target language. Each encoder block contains a multi-head cross-attention including the final embeddings of the encoder. Using the last output embedding of the final decoder block, a logistic classifier *L* predicts probabilities of the next token of the output sentence

way, information from the source text is taken into account. Subsequently, the embeddings computed by different heads are concatenated (2.6) and the result is transformed by a fully connected layer with ReLU activation (2.7). In addition, residual "bypass" connections are used as well as layer normalization [6] for regularization. The output of the fully connected layer yields a new 'output' embedding *s*˜0*,...,s*˜*t*−<sup>1</sup> for the target tokens *s*1*,...,st*−1. Together these layers are called a *decoder block* (Fig. 2.13).

The next decoder block gets the computed token output embeddings of the previous block as input and computes a new embedding of the target tokens *s*1*,...,st*−1. The decoder consists of several decoder blocks (6 in the original model). Using the output embedding *s*˘*t*−<sup>1</sup> of the righmost token *st*−<sup>1</sup> in the last decoder block, the token probabilities *p(St* = *st*|*s*1*,...,st*−1*, v*1*,...,vT*src *)* of the next token *st* of the target text at position *t* are predicted by a logistic classifier, e.g. for the token "Maus" in Fig. 2.13.

Note that for the prediction of a further token at position *t* + 1 the observed token *st* is added to the computation (2.13) of the self-attentions in the decoder. Hence, the decoder embeddings change and all decoder computations have to be repeated. In this respect the model still works in a recursive way. Nevertheless, all self-attentions and cross-attentions in each layer are computed in parallel. However, the computations for the encoder are only performed once.

Sequences of variable length are padded with a special token up to the maximal length. This is done for the input and the output sequence. If a sequence is very short, a lot of space is wasted. Therefore, the sequence length may be varied in different mini-batches called buckets in the training data.

The transformer has a large set of parameters. First it requires embeddings of the input and target token vocabularies. Then there are the *W(q), W(k), W(v)* matrices for the multi-head self-attention, the masked multi-head self-attention and the multihead cross-attention of the different heads and layers. In addition, the parameters of the fully connected networks and the final logistic classifier have to be specified. While the base model had an input sequence length of 512 and 65M parameters, the big model had an input sequence length of 1024 and 213M parameters [141]. The values of all these parameters are optimized during training.

The training data consists of pairs of an input sentence and the corresponding target sentence. Training aims to generate the target tokens with maximal probability for the given input tokens to maximize the joint conditional probability (2.13) of the output sequence by stochastic gradient descent. In our example in Fig. 2.13 for the given input text "The mouse likes cheese" the product of conditional probabilities of the output tokens "Die Maus mag Käse" has to be maximized. The original model [141], for instance, used 36M sentences of the WMT English-French benchmark data encoded as 32,000 wordpiece tokens. Both the encoder and decoder are trained simultaneously by stochastic gradient descent end-to-end, requiring 3.5 days with 8 GPUs.

Cross-attention is the central part of the transformer, where the information from the input sentence is related to the translated output sentence. In Fig. 2.14 a German input sentence is displayed together with its English translation. Both sentences are tokenized by byte-pair encoding, where the beginning of a word is indicated by "\_". Below the strength of cross-attentions between the input tokens and output tokens is depicted for two different heads. Obviously the first input token "\_The" has a special role.

#### *2.3.2 Decoding a Translation to Generate the Words*

After training, the Transformer is able to predict the probabilities of output tokens for an input sentence. For a practical translation, however, it is necessary to generate an explicit sequence of output tokens. Computing the output sequence with maximal probability is computationally hard, as then all output possible sequences have to be considered. Therefore, an approximate solution is obtained using greedy decoding or beam search.

**Greedy decoding** simply picks the most likely token with the highest probability at each decoding step until the end-of-sentence token is generated. The problem with this approach is that once the output is chosen at any time step *t*, it is impossible to

**Fig. 2.14** An English input sentence tokenized by Byte-Pair encoding and the translated tokenized German output sentence. Below are two cross-attention graphs from different heads of the 4-th decoder layer [126]. Dark values indicate a low cross-attention score. Image source: [126]

go back and change the selection. In practice there are often problems with greedy decoding, as the available probable continuation tokens may not fit to a previously assigned token. As the decision cannot be revised, this may lead to suboptimal generated translations.

**Beam search** [52] keeps a fixed number *k* of possible translations *s*1*,...,st* of growing length (Fig. 2.15). At each step each translation of length *t* is enlarged by *k* different tokens at position *t* +1 with the highest conditional probabilities *p(St*+<sup>1</sup> = *st*+1|*s*1*,...,st, v*1*,...,vT*src *)*. From these *k*∗*k* token sequences only the *k* sequences with largest total probabilities *p(s*1*,...,st*+1|*v*1*,...,vT*src *)* are retained. A complete translation (containing the end-of-sentence token) is added to the final candidate list. The algorithm then picks the translation with the highest probability (normalized by the number of target words) from this list. For *k* = 1 beam search reduces to greedy decoding. In practice, the translation quality obtained via beam search (size of 4) is significantly better than that obtained via greedy decoding. Larger beam sizes often lead to suboptimal solutions [31]. However, beam search is computationally very expensive (25%–50% slower depending on the base architecture and the beam size) in comparison to greedy decoding [29].

**Fig. 2.15** Beam search is a technique for decoding a language model and producing text. At every step, the algorithm keeps track of the *k* most probable partial translations (bold margin). The score of each translation is equal to its log probability. The beam search continues until it reaches the end token for every branch [78]

#### *2.3.3 Evaluation of a Translation*

Traditionally, evaluation is done by comparing one or more reference translations to the generated translation, as described in the survey [127]. There are a number of automatic evaluation metrics:

**BLEU** compares counts of 1-grams to 4-grams of tokens. The BLEU metric ranges from 0 to 1, where 1 means an identical output with the reference. Although BLEU correlates well with human judgment [110], it relies on precision alone and does not take into account recall—the proportion of the matched *n*-grams out of the total number of *n*-grams in the reference translation.

**ROUGE** [80] unlike BLEU is a recall-based measure and determines which fraction of the words or n-grams in the reference text appear in the generated text. It determines, among other things, the overlap of unigrams or bigrams as well as the longest common subsequence between a pair of texts. Different versions are used: ROUGE-1 measures the overlap of unigram (single words) between the pair of texts. ROUGE-2 determines the overlap of bigrams (two-words sequences) between the pair of texts. ROUGE-L: measures the length of the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both texts. This length is divided by the number of words in the reference text.

**METEOR** [75] was proposed to address the deficits of BLEU. It performs a wordto-word alignment between the translation output and a given reference translation. The alignments are produced via a sequence of word-mapping modules. These check, if the words are exactly the same, same after they are stemmed using the Porter stemmer, and if they are synonyms of each other. After obtaining the final alignment, METEOR computes an F-value, which is a parameterized harmonic mean of unigram precision and recall. METEOR has also demonstrated to have a high level of correlation with human judgment, often even better than BLEU.

**BERTscore** [164] takes into account synonyms and measures the similarity of embeddings between the translation and the reference. It computes the cosine similarity between all token embeddings of both texts. Then a greedy matching approach is used to determine assignments of tokens. The maximum assignment similarity is used as BERTscore.

For high-quality translations, however, there is a noticeable difference between human judgment and automatic evaluation. Therefore, most high-end comparisons today use human experts to assess the quality of translation and other text generation methods. Since the transformer was proposed by Vaswani et al. [141] in 2017, its variants were able to raise the SOTA in language translation performance, e.g. for translation on WMT2014 English-French from 37.5 to 46.4 BLEU score.

The transformer architecture was analyzed theoretically. Yun et al. [160, 161] showed that transformers are expressive enough to capture all continuous sequence to sequence functions with a compact domain. Pérez et al. [112] derived that the full transformer is Turing complete, i.e. can simulate a full Turing machine.

#### *2.3.4 Pre-trained Language Models and Foundation Models*

A model *language model* either computes the joint probability or the conditional probability of natural language texts and potentially includes all information about the language. BERT is an *autoencoder* language models containing encoder blocks to generate contextual embeddings of tokens. GPT is an *autoregressive language models* which predicts the next token of a sequence and restricts self-attention to tokens which already have been generated. *Transformers* (or *Transformer encoder-decoders*) use a transformer encoder to convert the input text to contextual embeddings and generate the translated text with an autoregressive transformer decoder utilizing the encoder embeddings as inputs (Fig. 2.16). These models are the backbone of modern NLP and are collectively called *Pre-trained Language Models*  (*PLM*).

All these models, especially BERT and GPT, are initialized via pre-training on a large corpus of text documents. During pre-training, parts of the input are hidden from the model, and the model is trained to reconstruct these parts. This has proven to be extremely effective in building strong representations of language and in finding parameter initializations for highly expressive NLP models that can be adapted to specific tasks. Finally, these models provide probability distributions over language that we can sample from.

Most network types have some built-in assumptions called *inductive bias*. Convolutional networks have local kernel functions that are shifted over the input matrix

**Fig. 2.16** Autoencoders like BERT (left) and autoregressive LMs like GPT-2 (middle) use transformer blocks to generate contextual embeddings of tokens. The transformer (right) combines a transformer encoder and an autoregressive transformer decoder to produce a translation. All models predict the probability of tokens with a logistic classifier *L*. Collectively these models are called Pre-trained Language Models (PLMs)

**Fig. 2.17** Timeline for the development of embeddings, pre-training and fine-tuning

and therefore have an inductive bias of translation invariance and locality. Recurrent networks apply the same network to each input position and have a temporal invariance and locality. The BERT architecture makes only few assumptions about the structural dependency in data. The GPT model is similar to the RNN as it assumes a Markovian structure of dependencies to the next token. As a consequence, PLMs often require more training data to learn the interactions between different data points, but can later represent these interactions more accurately than other model types.

Historically, learned embedding vectors were used as representations of words for downstream tasks (Fig. 2.17). As early as 2003 Bengio et al. [15] proposed a distributed vector representation of words to predict the next word by a recurrent model. In 2011 Collobert et al. [32] successfully employed word embeddings for part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. In 2013 Mikolov et al. [93] derived their word embeddings using a logistic classifier. In 2015 Dai et al. [33] trained embeddings with an RNN language model in a self-supervised way and later applied it to text classification. In 2017 McCann et al. [87] pre-trained multilayer LSTMs for translation computing contextualized word vectors, which are later used for various classification tasks.

In the same year Vaswani et al. [141] developed the attention-only transformer for language translation. In 2018 Howard et al. [59] pre-trained a language model (ULMFiT), and demonstrated the effectiveness of fine-tuning to different target tasks by updating the full (pre-trained) model for each task. In the same year Howard et al. [116] used a pre-trained autoregressive part of the transformer [141] to solve a large number of text understanding problems by fine-tuned models. At the same time Devlin et al. [39] pre-trained the autoencoder using the masked language model objective and adapted this BERT model to many downstream tasks by fine-tuning. In 2019 Radford et al. [118] presented the GPT-2 language model, which was able to generate semantically convincing texts. Brown et al. [21] proposed the GPT-3 model, which could be instructed to solve NLP-tasks by a task description and some examples. In 2021 Ramesh et al. [121] applied language modeling to text and pictures and were able to create impressive pictures from textual descriptions. Borgeaud et al. [18] presented the Retro model that answers questions by retrieving information from a text collection of 2 trillion tokens and composes an answer in natural language.

Almost all state-of-the-art NLP models are now adapted from one of a few Pretrained Language Models, such as BERT, GPT-2, T5, etc. PLMs are becoming larger and more powerful, leading to new breakthroughs and attracting more and more research attention. Due to the huge increase in performance, some research groups have suggested that large-scale PLMs should be called *Foundation Models*, as they constitute a 'foundational' breakthrough technology that can potentially impact many types of applications [17, p. 3]. In this book, we reserve the term 'Foundation Models' for large Pre-trained Language Models with more than a billion parameters, since these models are able of generating fluent text, can potentially handle different media, and can usually be instructed by prompts to perform specific tasks.

If one of these models is improved, this high degree of homogeneity can lead to immediate benefits for many NLP applications. On the other hand all systems could share the same problematic biases present in a few basic models. As we will see in later chapters PLM-based sequence modeling approaches are now applied to text (Sect. 2.2), speech (Sect. 7.1), images (Sect. 7.2), videos (Sect. 7.3), computer code (Sect. 6.5.6), and control (Sect. 7.4). These overarching capabilities of Foundation Models are depicted in Fig. 2.18.

The next Sect. 2.4 discusses some common techniques for optimizing and regularizing pre-trained language models. In addition, some approaches to modify the architecture of these networks are presented. In Chap. 3 we present a number of approaches to improve the capabilities of PLMs, especially by modifying the training tasks (Sect. 3.1.3). In the Chaps. 5–7 we discuss a number of applications of PLMs. Chapter 5 covers traditional NLP tasks like named entity recognition and relation extraction, where PLMs currently perform best. Most important applications of Foundation Models are on the one hand text generation and related tasks like question-answering and dialog systems, which are introduced in Chap. 6. On the other hand Foundation Models can simultaneously process different media and perform tasks like image captioning, object detection in images, image generation following a text description, video interpretation, or computer game control, which

**Fig. 2.18** A Foundation Model can integrate the information in the data from different modalities. Subsequently it can be adapted, e.g. by fine-tuning, to a wide range of downstream tasks [17, p. 6]. Credits for image parts in Table A.1

are discussed in Chap. 7. Because of the potential social and societal consequences of such Foundation Models, it is particularly important that researchers in this field keep society's values and human rights in mind when developing and applying these models. These aspects are summarized in Sect. 8.2.

#### **Available Implementations**


#### *2.3.5 Summary*

A transformer is a sequence-to-sequence model, which translates a source text of the input language into a target text in the target language. It consists of an encoder with the same architecture as an autoencoder BERT model that computes contextual embeddings of tokens of the source text. The decoder resembles an autoregressive GPT model and sequentially generates the tokens of the target text. Internally, contextual embeddings of the target tokens are computed in the different layers. Each decoder block has an additional cross-attention module in which the query vectors are taken from the embeddings of the target tokens and the key and value vectors are computed for the embeddings of the source tokens of the last layer. In this way, the information from the source text is communicated to the decoder. The embedding of the last token in the top layer is entered into a logistic classifier and this calculates the probability of the tokens for the next position. Subsequently, the observed token at the next position is appended to the target input and the computations are repeated for the next but one position.

During training the parameters of the transformer are adapted by stochastic gradient descent in such a way that the model assigns high probabilities to the observed target tokens of the translation in the training data. When the model has been trained on a large text dataset it can be applied for translation. Conditional on an input text, it can sequentially compute the probability of the next token of the translation.

During application of a trained model either the token with the maximal probability is selected or several alternatives are generated by beam search and the final output sequence with maximal probability is chosen. The evaluation of the translations quality is difficult as different translations may be correct. A number of metrics, e.g. BLEU, have been developed, which compare the machine translation to one or more reference translations by comparing the number of common word *n*-grams with *n* = 1*,...,* 4. Often the results are assessed by human raters. The transformer was able to generate better translation than prior models. In the meantime the translation quality for a number of language pairs is on par with human translators.

In the previous sections, we discussed *autoencoder BERT* models, *autoregressive GPT* models and the *encoder-decoder Transformers*. Collectively these models are called *pre-trained language models*, as transfer learning with a pre-training step using a large training set and a subsequent fine-tuning step is a core approach for all three variants. The self-attention and cross-attention modules are central building blocks used by all three models. Despite the development of many variations in recent years, the original architecture developed by Vaswani et al. [141] is still commonly employed.

It turns out that these models can be applied not only to text, but to various types of sequences, such as images, speech, and videos. In addition, they may be instructed to perform various tasks by simple prompts. Therefore, large PLMs are also called *Foundation Models*, as they are expected to play a crucial role in the future development of text and multimedia systems.

#### **2.4 Training and Assessment of Pre-trained Language Models**

This section describes some techniques required to train and apply PLMs.


Approaches to solving these problems are discussed in this section. PLMs are usually specified in one of the current Deep Learning frameworks. Most popular are *TensorFlow* provided from Google [137] and *PyTorch* from Meta [114]. Both are based on the Python programming language and include language elements to specify a network, train it in parallel on dedicated hardware, and to deploy it to different environments. A newcomer is the *JAX* framework [22], which is especially flexible for rapid experimentation. It has a compiler for linear algebra to accelerate computations for machine learning research.

#### *2.4.1 Optimization of PLMs*

#### **Basics of PLM Optimization**

For the i.i.d. training sample *T r* = {*(x*[1] *, y*[1] *), . . . , (x*[*N*] *, y*[*N*] *)*} parameter optimization for Deep Neural Networks aims to find a model that minimizes the loss function *L(x*[*i*] *, y*[*i*] ; *w)*

$$\min\_{\mathbf{w}} L(\mathbf{w}) = L(\mathbf{x}^{[1]}, \mathbf{y}^{[1]}; \mathbf{w}) + \dots + L(\mathbf{x}^{[N]}, \mathbf{y}^{[N]}; \mathbf{w}). \tag{2.14}$$

First-order optimization methods, also known as gradient-based optimization, are based on first-order derivatives. A requirement is that the loss function *L(w)* is smooth, i.e. is continuous and in addition differentiable at almost all parameter values *<sup>w</sup>* <sup>=</sup> *(w*1*,...,wk)*. Then the partial derivatives *∂L(w) ∂wj* of *L(w)* with respect to any component *wj* of *w* can be computed at almost all points. The *gradient* of *L(w)* in a specific point *w* is the vector

$$\frac{\partial L(\mathfrak{w})}{\partial \mathfrak{w}} = \left(\frac{\partial L(\mathfrak{w})}{\partial w\_1}, \dots, \frac{\partial L(\mathfrak{w})}{\partial w\_k}\right)^\mathsf{T}.\tag{2.15}$$

**Fig. 2.19** On all points of a grid the negative gradients are computed for this two-dimensional function *L(w)* (left). The gradient descent algorithm follows the negative gradients and approaches the local minima (right). The blue lines are the paths taken during minimization. Image credits in Table A.1

The gradient points into the direction, where *L(w)* in point *w* has its steepest ascent. Consequently, the direction of the steepest descent is in the opposite direction −*∂L(w) <sup>∂</sup><sup>w</sup>* . The batch *gradient descent* algorithm therefore changes the current parameter *w(t)* in the direction of the negative gradient to get closer to the minimum

$$
\omega\_{(l+1)} = \omega\_{(l)} - \lambda \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}}.\tag{2.16}
$$

The *learning rate λ* determines the step-size or how much to move in each iteration until an optimal value is reached. As the gradient is usually different for each parameter *w(t)* it has to be recomputed for every new parameter vector (Fig. 2.19). The iteration process is repeated until the derivative becomes close to zero. A zero gradient indicates a *local minimum* or a *saddle point* [51, p. 79]. In practical applications it is sufficient to repeat the optimization beginning with different *w*values and stop, if the derivative is close to zero.

Deep Neural Networks often require many millions of training examples. The repeated computation of the gradient for all these examples is extremely costly. The **Stochastic Gradient Descent** (*SGD*) algorithm does not use the entire dataset but rather computes the gradient only for a small *mini-batch* of *m* training examples at a time. In general, a mini-batch has sizes *m* ranging from 32 up to 1024, with even higher values for recent extremely large models. Subsequently, the parameters of the model are changed according to (2.16).

For each iteration a new mini-batch is selected randomly from the training data. According to the law of large numbers the gradients computed from these minibatches fluctuate around the true gradient for the whole training set. Therefore, the mini-batch gradient on average indicates an adequate direction for changing the parameters. Mertikopoulos et al. [91] show that by iteratively reducing the learning rate to 0, the SGD exhibits almost sure convergence, avoids spurious critical points such as saddle points (with probability 1), and stabilizes quickly at local minima. There are a number of variations of the SGD algorithm, which are described below [65].

An important step of optimization is the *initialization of parameters*. Their initial values can determine whether the algorithm converges at all and how fast the optimization approaches the optimum. To break symmetry, the initial parameters must be random. Furthermore, the mean and variance of the parameters in each layer are set such that the resulting outputs of the layer have a well-behaved distribution, e.g. expectation 0.0 and variance 1.0. In addition, all gradients also should have such a benign distribution to avoid exploding or vanishing gradients. All Deep Learning software frameworks contain suitable initialization routines. A thorough introduction is given by Goodfellow et al. [51, p. 292].

#### **Variants of Stochastic Gradient Descent**

**Momentum** is a method that helps SGD to increase the rate of convergence in the relevant direction and reduce oscillations. Basically a moving average *u(t)* of recent gradients with a parameter *γ* ≈ 0*.*9 is computed and the parameter update is performed with this average by

$$\mathfrak{u}\_{(l)} = \chi \mathfrak{u}\_{(l-l)} - \lambda \frac{\partial L(\mathfrak{w})}{\partial \mathfrak{w}} \qquad \text{where} \qquad \mathfrak{w}\_{(l)} = \mathfrak{w}\_{(l-l)} - \mathfrak{u}\_{(l)}.\tag{2.17}$$

Note that in addition to the parameter vector *w(t)* the moving average *u(t)* of the same length has to be stored requiring the same memory as the parameter vector *w*. This can consume a large additional memory size if the number of parameters approaches the billions. In recent years a number of further optimizers were developed [65]:


Due to the extremely large number of parameters of PLMs second order optimization methods like *Conjugate Gradient* or *Quasi-Newton* are rarely employed. As the number of second order derivatives grows quadratically, only crude approximations may be used. An example is Adam, as described before.

An important architectural addition to PLMs to improve training are *residual connections*, which were proposed by Vaswani et al. [141] for the Transformer. Residual connections have been shown to be very successful for image classification networks such as ResNet [54] and allowed training networks with several hundred layers. The identity shortcuts skip blocks of layers to preserve features. Zhang et al. [163] analyze the representational power of networks containing residual connections.

#### **Parallel Training for Large Models**

Recently, there have been suggestions to reduce the optimization effort by employing larger mini-batches. You et al. [159] propose the **LAMB optimizer** with layerwise adaptive learning rates to accelerate training of PLMs using large minibatches. They prove the convergence of their approach to a stationary point in a general nonconvex setting. Their empirical results demonstrate the superior performance of LAMB. It is possible to reduce the BERT training time from 3 days to just 76 min with very little hyperparameter tuning and batch sizes of 32,868 without any degradation of performance. The LAMB program code is available online [97]. In addition, the memory requirements of the optimization may be reduced [119] to enable parallelization of models resulting in a higher training speed.

Large models such as GPT-3 have many billion parameters that no longer fit into the memory of a single computational device, e.g. a GPU. Therefore, the computations have to be distributed among several GPUs. There are different parallelization techniques [156]:


The implementation of a parallelization strategy for a model is a tedious process. Support is given by the **DeepSpeed** library [122] that makes distributed training easy, efficient, and effective. Recently the **GSPMD** system [156] was developed which automates this process and is able to combine different parallelism paradigms in a unified way. GSPMD infers the distribution of computations to a network of GPUs based on limited user annotations to the model definition. It was, for instance, applied to distribute models with 1 trillion parameters on 2048 GPUs.

#### *2.4.2 Regularization of Pre-trained Language Models*

If a model contains too many parameters it can nearly perfectly adapt to the training data by optimization, reflecting nearly all details of the training data. During this *overfitting* the model learns the random variations expressed in the training data and deviates from the mean underlying distribution. Consequently, it has usually a lower performance on test data and a larger *generalization error*. To avoid this phenomenon, the representational capacity of the model has to be reduced by *regularization methods*, which often have the same effect as reducing the number of parameters. Well known approaches for Deep Learning models are the *L*<sup>2</sup> regularization and *L*<sup>1</sup> regularization penalizing large parameter values, or *Dropout* temporarily setting randomly selected hidden variables to 0. A survey of regularization strategies for Deep Neural Networks is given by Moradi et al. [96].

The training of PLMs is often non-trivial. One problem is the occurrence of vanishing or exploding gradients, which is connected to the problem of the vanishing or exploding variance of input values of different layers [55]. *Batch normalization* normalizes the values of the components of hidden units to mean 0.0 and variance 1.0 and thus reduces the variation of input values. For a mini-batch of training cases the component values are aggregated to compute a mean and variance, which are then used to normalize the input of that component on each training case [62]. It can be shown that batch normalization makes hidden representations increasingly orthogonal across layers of a Deep Neural Network [35].

In their paper on the Transformer, Vaswani et al. [141] use a variant called *layer normalization* [6] for regularization. The authors compute the mean and variance of the different components of hidden units for each training example and use this to normalize the input to mean 0.0 and variance 1.0. In addition, they apply dropout to the output of self-attention. Finally, they use *label smoothing* [133] where the loss function is reformulated such that the observed tokens are not certain but alternative tokens may be possible with a small probability. This is a form of regularization which makes optimization easier. The RMSNorm [162] is a variant of the layer normalization, which only normalizes the input by division with the root-meansquare error without shifting the mean. In experiments, it compares favorably with the layer normalization [101].

#### *2.4.3 Neural Architecture Search*

The structure of the self-attention block was manually designed, and it is not clear, whether it is optimal in all cases. Therefore, there are some approaches to generate the architecture of PLMs in an automatic way called *Neural Architecture Search* (*NAS*). A survey is provided by He et al. [56], who argue that currently the contributions of architecture search to NLP tasks are minor. Zöller [166] evaluate architecture search for machine learning models.

Wang et al. [149] propose an architecture search space with flexible encoderdecoder attentions and heterogeneous layers. The architecture search produces several transformer versions and finally concentrates on hardware restrictions to adapt the computations to processors at hand. The authors report a speedup of 3 and a size reduction factor of 3.7 with no performance loss. For relation classification Zhu et al. [165] design a comprehensive search space. They explore the search space by reinforcement learning strategy and yield models which have a better performance.

Architecture search may also be formulated as a ranking task. **RankNAS** [60] solves this by a series of binary classification problems. The authors investigate translation and language models. For translation the usual encoder-decoder is included in a super-net, where each of the 1023 subnetworks is a unique architecture. The importance of an architectural feature (e.g., the number of layers) is measured by the increase in the model error after permuting the feature. The authors use an evolutionary optimization strategy and evaluate their approach on translation (WMT2014 En-De). They get increases in BLEU-values at a fraction of cost of other approaches.

Recently differentiable architecture search has been developed, which embeds architecture search in a continuous search space and finds the optimal architecture by gradient descent. This leads to an efficient search process that is orders of magnitude faster than the discrete counterparts. This idea is applied by Fan et al. [43], who propose a gradient-based NAS algorithm for machine translation. They explore attention modules and recurrent units, automatically discovering architectures with better performances. The topology of the connection among different units is learned in an end-to-end manner. On a number of benchmarks they were able to improve the performance of the Transformer, e.g. from 28.8 to 30.1 BLEU scores for the WMT2014 English-to-German translation. There are other successful architecture search approaches for neural translation [130], named entity recognition [64], and image classification models [34, 147, 148], which may possibly be applied to other NLP tasks.

#### *2.4.4 The Uncertainty of Model Predictions*

Variations in the outcome of a PLM can have two main sources:

• *Epistemic uncertainty* reflects our limited knowledge about the real world. The real world situation corresponding to the training set can change causing a distribution shift. Moreover, the collected documents can have biases or errors and cover unwanted types of content. It is clear that the structure of the real world and the PLM differ. Therefore, a PLM can only approximate the correct conditional probabilities of language. This type of uncertainty is often called *structural uncertainty* and is difficult to estimate.

• *Aleatoric uncertainty* is caused by random variations which can be assessed more easily. The training data is usually a sample of the underlying data in the population and therefore affected by the sampling variation. If a model is randomly re-initialized, it generates a completely different set of parameter values which leads to different predictions. Finally, language models predict probabilities of tokens and the generation of new tokens is also affected by uncertainty. The Bayesian framework offers a well-founded tool to assess this type of uncertainty in Deep Learning [44].

A recent survey of methods for estimating the model uncertainty is provided by Gawlikowski et al.[47]. We will describe three approaches to capture model uncertainty: Bayesian statistics, a Dirichlet distributions, and ensemble distributions.

#### **Bayesian Neural Networks**

*Bayesian Neural Networks* directly represent the uncertainty of the estimated parameters *w* = *(w*1*,...,wdw )* by the *posterior distribution* 

$$p(\mathfrak{w}|X,Y) \propto p(\mathfrak{y}|X,\mathfrak{w})p(\mathfrak{w}).\tag{2.18}$$

Here *X* and *Y* are the observed inputs and outputs in the training set and *p(Y*|*X, w)* is the *likelihood*, i.e. the probability of the outputs given *X* and a parameter vector *w*. The *prior distribution p(w)* describes the distribution of parameters before data is available. The distribution of predictions for a new input *x*˜ is given by

$$p(\tilde{\mathbf{y}}|\tilde{\mathbf{x}},X,Y) = \int p(\tilde{\mathbf{y}}|\tilde{\mathbf{x}},\mathbf{w})p(\mathbf{w}|X,Y)d\mathbf{w}.\tag{2.19}$$

The integral usually cannot be solved analytically and has to be approximated. Often a *Monte Carlo* approximation is used, which infers the integral by a sum over different parameter values *w*[*i*] distributed according to the posterior distribution *p(w*|*X, <sup>Y</sup>)*. If *<sup>y</sup>*˜[*i*] <sup>=</sup> *f (x*˜*, <sup>w</sup>*[*i*] *)* is a deterministic network predicting the output for a parameter *w*[*i*] and input *x*˜, the resulting sample *y*˜[1] *,..., <sup>y</sup>*˜[*k*] can be considered as a sample of the output distribution *p(y*˜|*x*˜*, X, Y)* [108].

Bayesian predictive distributions can be approximated in different ways:

• *Sampling approaches* use a *Markov Chain Monte Carlo* algorithm to generate parameter values distributed according to the posterior distributions, from which realizations can be sampled [102]. Markov Chain Monte Carlo defines a sampling strategy, where first a new parameter value *w* is randomly generated and then the algorithm computes the probability to accept *w*, or to keep the previous parameter value. Welling et al. [150] combined this approach with stochastic gradient descent and demonstrated that Bayesian inference on Deep Neural Networks can be done by a noisy SGD. A review of the favorable convergence properties has been given by Nemeth et al. [103]. Practical evaluations of this technique are performed by Wenzel et al. [152].

• *Variational inference* approximates the posterior distribution by a product *q(w)* of simpler distributions, which are easier to evaluate [9]. Using multiple GPUs and practical tricks, such as data augmentation, momentum initialization and learning rate scheduling, and learning rate scheduling, Osawa et al. [105] demonstrated that variational inference can be scaled up to ImageNet size datasets and architectures.

It can be shown [45] that dropout regularization (Sect. 2.4.2) can be considered as approximate variational inference. Hence, the predictive uncertainty can be estimated by employing dropout not only during training, but also at test time. A variant called *Drop connect* randomly removes incoming activations of a node, instead of dropping an activation for all following nodes. This approach yields a more reliable uncertainty estimate and can even be combined with the original dropout technique [88].

• *Laplace approximation* considers the logarithm of the posterior distribution around a local mode *w*ˆ and approximate it by a normal distribution *N (w*ˆ *,*[*H* + *βI* ] <sup>−</sup>1*)* over the network weights [9]. *H* is the Hessian, the matrix of second derivatives, of log *p(w*|*X, Y)*. This approximation may be computed for already trained networks and can be applied to Deep Neural Networks [76]. A problem is the large number of coefficients of *H*, which limits the computations to elements on the diagonal. Extensions have been proposed by George et al. [48].

#### **Estimating Uncertainty by a Single Deterministic Model**

Most PLMs predict tokens by a discrete probability distribution. If the softmax function is used to compute these probabilities, the optimization over the training set usually leads to very extreme probabilities close to 0 or 1. The network is often overconfident and generates inaccurate uncertainty estimates. To assess uncertainty, the difference between the estimated distribution and the actual distribution has to be described. If *v*1*,...,vdv* is the vocabulary of tokens and *π* a discrete distribution over these tokens, then we can use the *Dirichlet distribution p(π*|*α(x))* to characterize a distribution over these discrete distributions. The vector *α* depends on the input *x* and has a component *αi* for each *vi*. The sum *<sup>i</sup> αi* characterizes the variance. If it gets larger, the estimate for the probability of *vi* has a lower variance.

Malinin et al. [85] use the expected divergence between the empirical distribution and the predicted distribution to estimate the *p(π*|*α(x))* for a given input *x*. In the region of the training data the network is trained to minimize the expected *Kullback-Leibler* (*KL*) divergence between the predictions of in-distribution data and a lowvariance Dirichlet distribution. In the region of out-of-distribution data a Dirichlet distribution with a higher variance is estimated. The distribution over the outputs can be interpreted as a quantification of the model uncertainty, trying to emulate the behavior of a Bayesian modeling of the network parameters [44].

Liu et al. [83] argue that the distance between training data elements is relevant for prediction uncertainty. To avoid that the layers of a network cause a high distortion of the distances of the input space, the authors propose a spectral normalization. This **SNGP** approach limits the distance *h(x*[1] *)* <sup>−</sup> *h(x*[2] *)* compared to *<sup>x</sup>*[1] <sup>−</sup> *<sup>x</sup>*[2] , where *x*[1] and *x*[2] are two inputs and *h(x)* is a deep feature extractor. Then they pass *h(x)* into a distance-aware *Gaussian Process* output layer. The Gaussian Process posterior is approximated by a Laplace approximation, which can be predicted by a deterministic Deep Neural Network.

The authors evaluate SNGP on BERTBASE to decide, if a natural utterance input is covered by the training data (so that it can be handled by the model) or outside. The model is only trained on in-domain data, and their predictive accuracy is evaluated on in-domain and out-of-domain data. While ensemble techniques have a slightly higher prediction accuracy, SNGP has a better calibration of probabilities and out-of-distribution detection. An implementation of the approach is available [138].

A number of alternative approaches are described in [47, p. 10f], which also discuss mixtures of Dirichlet distributions to characterize predictive uncertainty. In general single deterministic methods are computational less demanding in training and evaluation compared to other approaches. However, they rely on a single network configuration and may be very sensitive to the underlying network structure and the training data.

#### **Representing the Predictive Distribution by Ensembles**

It is possible to emulate the sampling variability of a training set by resampling methods. A well-founded approach is *bagging*, where *nb* samples of size *n* are drawn with replacement from a training set of *n* elements [20, 107]. For the *i*-th sample a model may be trained yielding a parameter *w*<sup>ˆ</sup> [*i*] . Then the distribution of predictions *f (x, <sup>w</sup>*<sup>ˆ</sup> [*i*] *)* represent the uncertainty in the model prediction for an input *x*, and it can be shown that their mean value <sup>1</sup> *nb <sup>i</sup> f (x, <sup>w</sup>*<sup>ˆ</sup> [*i*] *)* has a lower variance than the original model prediction [73]. In contrast to many approximate methods, ensemble approaches may take into account different local maxima of the likelihood function and may cover different network architectures. There are other methods to introduce data variation, e.g. random parameter initialization or random data augmentation. A survey on ensemble methods is provided by Dong et al. [40].

Besides the improvement in the accuracy, ensembles are widely used for representing prediction uncertainty of Deep Neural Networks [73]. In empirical investigations, the approach was at least as reliable as Bayesian approaches (Monte Carlo Dropout, Probabilistic Backpropagation) [73]. Reordering the training data and a random parameter initialization induces enough variability in the models for the prediction of uncertainty, while bagging may reduce the reliability of uncertainty estimation [77]. Compared to Monte Carlo Dropout, ensembles yield more reliable and better calibrated prediction uncertainties and are applicable to real-world training data [13, 53]. Already for a relatively small ensemble size of five, deep ensembles seem to perform best and are more robust to data set shifts than the compared methods [106].

Although PLMs have been adapted as a standard solution for most NLP tasks, the majority of existing models is unable to estimate the uncertainty associated with their predictions. This seems to be mainly caused by the high computational effort of uncertainty estimation approaches. In addition, the concept of uncertainty of a predicted probability distribution is difficult to communicate. However, it is extremely important to get a diagnosis, when a PLM is given an input outside the support of its training data, as then the predictions get unreliable.

Among the discussed approaches the ensemble methods seem to be most reliable. However, they require a very high computational effort. New algorithms like SNGP are very promising. More research is needed to reduce this effort or develop alternative approaches. Recently benchmark repositories and datasets have been developed to provide high-quality implementations of standard and SOTA methods and describe best practices for uncertainty and robustness benchmarking [99].

#### **Implementations**

Uncertainty Baselines [10, 98] provide a collection high-quality implementations of standard and state-of-the-art methods for uncertainty assessment.

#### *2.4.5 Explaining Model Predictions*

PLMs such as BERT are considered as black box models, as it is hard to understand, what they really learn and what determines their outputs. Hence, a lot of research goes into investigating the behavior of these models. There are three main reasons to explain the model predictions. *Trust* in the model predictions is needed, i.e. that the model generates reliable answers for the problem at hand and can be deployed in real-world applications. *Causality* asserts that the change of input attributes leads to sensible changes in the model predictions. *Understanding* of the model enables domain experts to compare the model prediction to the existing domain knowledge. This is a prerequisite for the ability to adjust the prediction model by incorporating domain knowledge.

Explanations can also be used to debug a model. A striking example was an image classification, where a horse was not detected by its shape, but by a label in the image [74]. Explanations are most important for critical decisions that involve humans or can cause high damage. Examples are health care, the judicial system, banking, or self-driving cars.

Explanation methods roughly can be grouped into local explanations or global explanations. A local explanation provides information or justification for the model's prediction for a specific input *x*, whereas global explanations cover the model in general. A large majority of models aims at local explanations, as these may be used to justify specific predictions. Surveys on methods for the explanation of PLMs are provided by Danilevsky et al. [36], Burkart and Huber [23], Xu et al. [155], Bauckhage et al. [11], Tjoa and Guan [139], and Belle and Papantonis [12]. Molnar [95] devotes a whole book to this topic and Bommasani et al. [17, p. 125] provide a recent overview. For language models different types of explanation can be used:


Other approaches like a sequence of reasoning steps or rule invocations are unusable for PLMs with many millions of parameters.

The self-attention mechanism is the central function unit of PLMs. **BertViz** [144] is a visualization tool that allows users to explore the strength of attention between different tokens for the heads and layers in a PLM and allows users to get a quick overview of relevant attention heads. However, Jain et al. [63] demonstrate that attention does not correlate with feature importance methods and counterfactual changes of attention do not lead to corresponding changes in prediction. This may, for instance, be caused by the concatenation of head outputs and their subsequent processing by a fully connected nonlinear layer. Attentions are noisy predictors of the overall importance of components, but are not good at identifying the importance of features [129].

#### **Linear Local Approximations**

An important concept is the contribution of input *xi* towards an output *yj* , e.g. a class probability. Gradient-based explanations estimate the contribution of input *xi* towards an output *yj* , e.g. a class probability, by computing the partial derivative *∂yj /∂xi*. This derivative is often called *saliency* and can be interpreted as linear approximation to the prediction function at input *x*. **LIME** [123] defines a local linear regression model around a single input *x*. Because of correlation of features, the coefficients of the input features depend on the presence or absence of the other input features. The **SHAP** approach therefore determines the influence of a feature


**Fig. 2.20** Contributions for the question classification task (left). Red marks positive influence, blue negative, and black tokens are neutral. Contributions for the task of translating "good morning ladies and gentlemen" to the German "Guten Morgen Damen und Herren" are shown on the right side [132]. Words are tokenized to word pieces

by the average influence of the feature for all combinations of other features [84]. The authors show the favorable theoretical properties of this approach and derive several efficient computation strategies.

#### **Nonlinear Local Approximations**

Sundararajan et al. [132] formulate two basic requirements for this type of explanation. *Sensitivity*: if the inputs *x*[1] and *x*[2] differ in just one feature and lead to different predictions, then the differing feature should be given a non-zero contribution. *Implementation invariance*: i.e., the attributions are always identical for two functionally equivalent networks. As the prediction functions are usually nonlinear, gradient-based methods violate both requirements and may focus on irrelevant attributes.

**Integrated Gradients** [132] generates an approximation to the prediction function *<sup>F</sup>* : <sup>R</sup>*<sup>n</sup>* → [0*,* <sup>1</sup>], which captures nonlinear dependencies. To assess the difference from baseline input *x*[1] to another input *x*[2] , the authors compute the mean value of gradients *∂F(x)/∂x* of the output with respect to inputs along the line from *x*[1] to *x*[2] by an integral. It can be shown that this approach meets the above requirements. The authors apply the approach to question classification according to the type of the answer (Fig. 2.20). The baseline input is the all zero embedding vector. Another application considers neural machine translation. Here the output probability of every output token is attributed to the input tokens. As baseline all tokens were zeroed except the start and end markers. A similar analysis is based on a Taylor expansion of the prediction function [7].

Liu et al. [82] propose a generative explanation framework which simultaneously learns to make classification decisions and generate fine-grained explanations for them. In order to reach a good connection between classification and explanation they introduce a classifier that is trained on their explanation. For product reviews they, for instance, generate the following positive explanations "excellent picture,

attractive glass-backed screen, hdr10 and dolby vision" and negative reasons "very expensive". The authors introduce an explanation factor, which represents the distance between the probabilities of the classifier trained on the explanations vs. the classifier trained on the original input and the gold labels. They optimize their models with minimum risk training.

#### **Explanation by Retrieval**

Recently, Deep Learning models have been playing an increasingly important role in science and technology. The algorithms developed by Facebook are able to predict user preferences better than any psychologist [24, 71]. AlphaFold, developed by DeepMind, makes the most accurate predictions of protein structures based on their amino acids [131]. And the PaLM and Retro models are capable of generating stories in fluent English, the latter with the knowledge of the Internet as background. However, none of the programs were actually able to justify their decisions and cannot indicate why a particular sequence was generated or on what information a decision was based on.

In 2008, Anderson [5] predicted the end of theory-based science. In his view, theories are an oversimplification of reality, and the vast amount of accumulated data contains knowledge in a much more detailed form, so theories are no longer necessary. This is also the problem of *Explainable AI*, which aims to explain the decisions of Deep Learning models. It is always faced with a trade-off where predictive accuracy must be sacrificed in order to interpret the model output.

As large autoregressive language models are combined with retrieval components, document retrieval can be used not only to incorporate more accurate knowledge into the language generation process, but also to support the generated answers by authoritative citations. Metzler et al. [92] argues that future PLMs should justify created text by referring to supporting documents in the training data or background document collection. To implement this approach Nakano et al. [100] combine *GPT-3* with the search engine *BING* to enhance language generation for question-answering by retrieved documents. Their **WebGPT** [100] first creates a text in natural language (Sect. 6.2.3). After that, it enhances the generated sentences by different references to the found documents, similar to the way a scientist expands his texts by references. By this procedure WebGPT is able to justify and explain the created answer. This could be a way to make the generated text more trustworthy. Note that the advanced dialog model **LaMDA** can include links to external documents supporting an answer (Sect. 6.6.3).

#### **Explanation by Generating a Chain of Thought**

Large autoregressive PLMs like GPT-3 are able to produce a very convincing continuation of a start text, and, for instance, generate the answer for a question. It turned out that their ability to generate the correct answer could drastically be

**Fig. 2.21** Explaining by a chain of thoughts. The first box contains two examples of thought chains, which are used for every query. This chain-of-thought prompt was input to the PaLM model together with the input query, and the model output was generated by PaLM [30, p. 38]

improved by giving a few examples with a chain of thought (Sect. 3.6.4) for deriving the correct answer. This has been demonstrated for the PaLM language model [30].

A generated *thought chain* can be used for other purposes. First, it can be checked whether the model produces the correct answer for the "right reasons", rather than just exploiting superficial statistical correlations. In addition, the explanation can potentially be shown to an end-user of the system to increase or decrease their confidence in a given prediction. Finally, for some queries (e.g., explaining a joke), the explanation itself is the desired output [30].

Figure 2.21 contains a few-shot query and the resulting answer. For application only a few example chains of thought are necessary, which can be reused. To generate the best answer for the question greedy decoding has to be used, yielding the optimal prediction. As PaLM shows, the enumeration of argument steps works empirically. However, a sound theory of how models actually use such arguments internally is still lacking. Further, it is not known under which circumstances the derivation of such a chain of thoughts succeeds. It should be investigated to what extent the reasoning of a model corresponds to the reasoning steps performed by humans.

#### **Implementations**

Ecco [2] and BertViz [143] are tools to visualize the attentions and embeddings of PLMs. An implementation and a tutorial on integrated gradients is available for TensorFlow [136]. Captum [26, 70] is an open-source library to generate interpretations and explanations for the predictions of PyTorch models containing most of the approaches discussed above. Transformers-interpret [113] is an alternative opensource model explainability tool for the Hugging Face package.

#### *2.4.6 Summary*

Similar to other large neural networks, PLMs are optimized with simple stochastic gradient descent optimizers that are able to approach the region of minimal cost even for huge models with billions of parameters and terabytes of training data. This requires parallel training on computing networks which can be controlled by suitable software libraries. There are many recipes in the literature for setting hyperparameters such as batch size and learning rate schedules. Important ingredients are residual connections to be able to optimize networks with many layers and regularization modules to keep parameters in a manageable range.

Neural architecture search is a way to improve performance and reduce memory requirements of networks. A number of approaches have been proposed that significantly speed up training. Some methods provide models with better performance and lower memory footprint. There are new differential methods that have the potential to derive better architectures with little effort.

PLMs aim to capture relations between language concepts and can only do so approximately. Therefore, it is important to evaluate their inherent uncertainty. Three different approaches to analyze the uncertainty are described. Among these, ensemble methods appear to be the most reliable, but involve a high computational cost. New algorithms such as SNGP, which are based on a single model, are very promising.

To enable a user to decide whether a model result makes sense, it is necessary to explain how the result was obtained. Explanations can be provided by showing the importance of features for a result, by exploring the PLM by related examples or by approximating the PLM with a simple model. Some libraries are available that allow routine use of these methods. A new way of explaining texts generated by PLMs is to enhance the texts with appropriate citations of relevant supporting documents. Finally, a PLM can be instructed by chain-of-thought prompts to provide an explanation for the model response. This type of explanation is particularly easy to understand and can reflect the essential parts of a chain of arguments.

The next chapter discusses approaches to improve the three basic PLM types by new pre-training tasks or architectural changes. The fourth chapter examines the knowledge, which can be acquired by PLMs and that can be used to interpret text and to generate new texts.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 3 Improving Pre-trained Language Models**

**Abstract** This chapter describes a number of different approaches to improve the performance of Pre-trained Language Models (PLMs), i.e. variants of BERT, autoregressive language models similar to GPT, and sequence-to-sequence models like Transformers. First we may modify the pre-training tasks to learn as much as possible about the syntax and semantics of language. Then we can extend the length of the input sequence to be able to process longer inputs. Multilingual models are simultaneously trained with text in different languages. Most important is the inclusion of further knowledge into the PLM to produce better predictions. It turns out that by increasing the number of parameters, the size of the training data and the computing effort the performance of the models can always be increased. There are a number of different fine-tuning strategies which allow the model to be adapted to special tasks. In addition, models may be instructed by few-shot prompts to solve specific tasks. This is especially rewarding for larger PLMs, which therefore are called Foundation Models.

**Keywords** Pre-training objective · Input size · Multilingual model · Long dependencies · Additional knowledge · Fine-tuning

This chapter describes a number of different approaches to improve the performance of *Pre-trained Language Models* (PLMs), i.e. variants of BERT, autoregressive language models similar to GPT, and sequence-to-sequence models like Transformers. When these models have a large number of parameters, they can be instructed by input prompts to solve new tasks and are called *Foundation Models*.


as then the number of parameters grows quadratically. In Sect. 3.2, alternatives for establishing sparse attention patterns for remote tokens are explored.


Note that nearly all proposals may be combined for most model types, resulting in the vast number of model variants that is currently discussed.

#### **3.1 Modifying Pre-training Objectives**

The basic BERT model [49] has two pre-training tasks: the prediction of masked tokens with the masked language model (MLM) and next sentence prediction (NSP) (Sect. 2.1). These tasks were chosen heuristically and there are many plausible loss functions and architectures. Researchers have investigated many alternative training objectives, model structures, and attention mechanisms. In this section, the most promising of these variations of the BERT and Transformer architecture are discussed and their relative merits are compared.

An important question is the level of aggregation of the input sequence. Here subword tokens are standard. One option is to use raw letters as input. However, this may lead to a high computational burden, as the computational cost of selfattention grows quadratically with the size of the input. Another option is the use of domain-adapted knowledge to model the input sequence by learned tokenizations or patch embeddings (e.g. for image representation, Sect. 7.2). These methods reduce the input complexity, but may potentially ignore useful information in the input [19].

#### *3.1.1 Autoencoders Similar to BERT*

To improve BERT's performance a number of alternatives to capture knowledge from the unlabeled data were proposed:


The details are given in the following paragraphs. Popular loss functions are defined in Table 3.1. A list of prominent autoencoders is provided in Table 3.2. They can be compared by their performance on natural language understanding tasks (Sect. 2.1.5) like GLUE [218].

**RoBERTa** [127] is an enhanced BERT model boosted by tweaking parts of the pre-training process. The authors improved the BERTBASE architecture by the following changes: (1) Instead of using the same mask for all epochs, they replicate training sequences with different masks. (2) They remove the Next-Sentence-Prediction objective and found that performance is best, when all sentences in a batch are from the same document. (3) Larger batches with larger step sizes increase perplexity for both the masked language model task and downstream task performance. (4) A 10-fold increase of training data to 160 GB, which is used in large batches. The resulting model achieves an impressive SOTA result of 88.5 on *GLUE* (language understanding [217]), and the reading comprehension tasks *RACE*  and *SQuAD* [173].

**SpanBERT** [98] introduces a span-level pre-training approach. Rather than masking single tokens during pre-training, spans of one or more complete words are masked covering about 15% of the tokens. A new span-boundary objective (SBO) is introduced, where tokens inside of the masked span are predicted, using only representations of the tokens just outside the boundaries of the span combined with positional information. The details are shown in Fig. 3.1. SBO is used together with the usual MLM objective. Finally, the authors omit the next sentence prediction task as in [127] and only use single text fragments/sentences for training. The authors find that masking random spans is more effective than masking linguistic units. SpanBERT has the same configuration as BERTLARGE and is pre-trained on the


**Table 3.1** Loss functions for PLMs. A sequence is denoted by *x* = *(x*1*,...,xT )* and *z* = *(z*1*,...,zR)* is a related sequence, e.g. a translation

BooksCorpus and the English Wikipedia. SpanBERT achieves a new SOTA of 79.6% F1 on the *OntoNotes coreference task* [164], which requires identifying pronouns and the corresponding nouns or two phrases referring to the same thing (Sect. 5.4.1).

**Table 3.2** Autoencoders similar to BERT. The pre-training and fine-tuning loss functions are defined in Table 3.1. The benchmark figures are only a hint, as they depend on the number of parameters and the computing effort


**Fig. 3.1** SpanBERT [98] concatenates the embeddings outside the border of a span with a position embedding. With this input a 2-layer model predicts the probabilities of masked tokens

**StructBERT** [223] enhances the original BERT MLM objective by the task to predict the order of shuffled token triples. In addition, the order of three sentences has to be detected. Using models with the same number of parameters, StructBERT can increase the SOTA on GLUE in comparison to BERT and RoBERTa to 83.9 and 89.0, respectively.

**Electra** [39] proposes a new pre-training task called *replaced token detection*  (RTD). In the paper a generator network, trained with a masked language model loss, is combined with a discriminator network. Some tokens in the input sequence are replaced with plausible alternatives which are generated by a small language model (about 1*/*4 of the size of the discriminator). The discriminator network has to predict for every token, whether it is a replacement or not. This corruption procedure solves a mismatch in BERT, where MASK tokens appear in pre-training but not in fine-tuning. The model learns from all input tokens instead of just the small masked subset, making it more computationally efficient than e.g. BERT and RoBERTa, while performing better on several tasks, e.g. 89.4% on the GLUE language understanding task.

**ALBERT** (a lite BERT) [113] uses two parameter-reduction techniques to tackle the huge memory consumption of BERT and its slow training speed. The first tweak is untying the dimensionality of the WordPiece embeddings from the hidden layer size of BERT. Instead of using a single embedding matrix *M*, the authors factorize *M* = *A* ∗ *B*, such that the joint number of parameters in *A* and *B* is much lower than the number of parameters in *M*. The second tweak is sharing all parameters across all layers of BERT, which is shown to stabilize training and keep the number of parameters fixed even if more layers are added. In addition to the two tweaks, a new sentence order prediction (SOP) is introduced. Specifically, the model has to predict if the order of two sentences is correct or reversed. The authors report that this task improves accuracy compared to BERT's NSP task, which could be solved by comparing the topics of the two sentences. It is still unclear, however, if this is the best way to incorporate text structure in training. ALBERT achieved new SOTA results on GLUE and SQuAD.

**XLNet** solves an autoregressive pre-training task instead of predicting masked words [240]. This addresses the problem that BERT's [MASK] token only appears during pre-training and not in fine-tuning. The words in a sequence, e.g. "The 1 mouse 2 likes 3 cheese 4", are reordered together with their position information (indices) by a random permutation, e.g. "cheese 4 The 1 likes 3 mouse 2". The task is to successively predict the tokens in the permuted sequence similarly to a GPT language model. The model has to predict, e.g. *p*(mouse|2, cheese 4, The 1, likes 3). Note that the model must additionally know the position, here 2, of the word to be predicted. The transformer, however, mixes the position information with the content information by forming a sum. Hence, the position information is inseparable from the token embedding.

Therefore, the authors decided to compute an additional self-attention embedding called *query stream*, which as query only receives the target position and then can compute the attention with the key and value vectors (Sect. 2.1.1). The resulting embedding encodes the position of the token to be predicted and correlations to other tokens, but has no information on the content of that token. This information can be added as input to the model. The normal self-attention and the query stream have the same parameter matrices *Q* (query),*K* (key), *V* (value). To save training effort, XLNet only predicts a few tokens at the end of the permuted sequence. In addition, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL (Sect. 3.2.2) into pre-training, which empirically improves the performance especially for tasks involving a longer text sequence.

When a token is predicted information about tokens before and after it may be used. Therefore, the model is a bidirectional encoder. With BERT, if the two tokens "New" and "York" are masked, both words are predicted independently, ignoring valuable information. In contrast, XLNet properly handles the dependence of masked tokens. XLNet was able to outperform BERT and RoBERTa on many tasks, e.g. the GLUE language understanding tasks, reading comprehension tasks like SQuAD (Sect. 2.1.5), text classification tasks such as *IMDB* (movie review classification) [130].

**Product Keys** [112] replace the dot-product attention by a nearest neighbor search. A query *<sup>q</sup><sup>r</sup>* is split into two sub-queries *<sup>q</sup>*[1] *<sup>r</sup>* and *q*[2] *<sup>r</sup>* . For each sub-query the *k* closest sub-keys *k*[1] *<sup>i</sup>* and *<sup>k</sup>*[2] *<sup>j</sup>* are selected. From the *<sup>k</sup>*<sup>2</sup> combinations of sub-keys the highest dot products can be efficiently computed and the *k* highest combinations are selected. The results are normalized with the softmax function and used for the computation of a weighted sum of value vectors. During optimization only the *k* optimal keys are affected reducing the training effort. The approach allows very large transformers to be defined with only a minimal computational overhead. With 12 layers the authors achieve the same performance as a 24 layer BERT model using only half of the computation time. In a comprehensive comparison of transformer architectures [142] the approach yields an increase for SuperGLUE NLU task (Sect. 4.1.2) from 71.7% for the standard T5 model to 75.2%.

**DeBERTa** [76] uses a *disentangled attention* mechanism, where each word is represented by two different types of vectors encoding content and position. The attention weights between tokens are computed using different matrices for content and relative position. In addition, DeBERTa includes absolute word positions in the last layer to capture different syntactic roles in the sentence. During finetuning the model employs an "adversarial" training approach, where embeddings are normalized to probability vectors. Then the model is trained to be robust against small perturbations of embeddings. According to the authors, this improves the performance of fine-tuned models. The large version of the model with 1.5B parameters has superior performance in several application areas, e.g. in natural language understanding (Sect. 4.1.2), where DeBERTa surpasses the human performance on the *SuperGLUE benchmark* [219] for the first time, increasing the macro-average score to 89.9%.

Bengio et al. [12] argue that representations, e.g. embeddings, should be *disentangled* and should represent different content aspects, e.g. syntax, style, semantics, in different parts of the embedding vector. Locatello et al. [129] have proven that this is not possible in an unsupervised way. Hence, some explicit supervision or prior information has to be used to generate interpretable subvectors of embeddings.

**DeBERTaV3** [75] substitutes the MLM loss of DeBERTa with the replaced token detection (RTD) of Electra (Sect. 3.1.1). In addition, a new gradient-disentangled embedding sharing method is employed that improves both training efficiency and the quality of the pre-trained model. Its largest version has a 128k-token vocabulary, 24 layers, and 304M parameters. For the GLUE benchmark with fine-tuning, the model increases the score by 1.4% to a new SOTA of 91.4%. The multi-language version of the model mDeBERTaBASE outperforms XLM-RBASE by 3.6% in terms of the cross lingual transfer accuracy on the *XNLI* task (Sect. 3.3.1).

#### *3.1.2 Autoregressive Language Models Similar to GPT*

By increasing the number of parameters and the training set size the capabilities of GPT models can be markedly improved. An overview is given in Table 3.3.

**GPT-3** [25] is a language model with extreme dimensions. Its largest version has 96 layers, 96 attention heads, 175 billion parameters and covers sequences of length 2048. It was trained on a text collection of books, Wikipedia and web pages of about 500 billion tokens. The details of the architecture are not known yet. GPT-3 is structurally similar to GPT-2, and therefore its higher level of accuracy is attributed to its increased capacity and higher number of parameters. The model achieved an unprecedented performance in language modeling, question answering, etc. Some results are compiled in Table 3.4 and many more in the paper [25].


**Table 3.3** Autoregressive language models (LM) similar to GPT. 'Details' provides the number of parameters and specific features. The 'benchmark' figures are only a hint, as they depend on the selected number of parameters and the computing effort. Best benchmark value printed in bold

**Table 3.4** Comparing different versions of PaLM, GPT-3, Chinchilla, Gopher, OPT, GLaM, and BLOOM on a number of popular benchmarks covering text completion, pronoun coreference, common sense reasoning and question answering (QA) [22, 25, 35, 51]. FLOPS measures the computational effort in floating point operations per second. Best benchmark values printed in bold


GPT-3 is able to generate fluent texts and covers a huge amount of world knowledge, as the example in Fig. 3.2 shows. Examples of generated texts can be found in many locations [23, 149]. The amount and quality of knowledge captured by PLMs is discussed in Chap. 4. In contrast to other language models, GPT-3 can be instructed by a few sentences to perform quite arbitrary tasks (few-shot learning). This is a very simple way to use GPT-3 to solve quite specific tasks such as translating into another language, summarizing a document, correcting grammar, writing an essay on a given topic, etc. Details are discussed in Sect. 3.6.3.

At the end of 2021 OpenAI provided an API to fine-tune GPT-3 with user-specific data [123]. In this way, the model can be adapted to a specific domain language and, in addition, be prepared to perform specific classification tasks. In general, this yields higher quality results than prompt design. In addition, no few-shot examples are necessary anymore. Details of fine-tuning GPT-3 are discussed in Sect. 3.6.2. Table 3.4 compares GPT-3 with other more recent language models on a number of popular benchmarks. There is a clear advantage of the new PaLM model.

**Fig. 3.2** Text generated by GPT-3 in response to an input. Quoted with kind permission of the authors [25, p. 28]

**GPT-J-6B** is an open-source GPT model with 28 layers, 16 heads, a context size of 2048, and 6B parameters [221]. It has a similar performance as the GPT-3 version with 6.7B parameters. There is an interactive web demo where users can enter their prompts and a continuation text is generated [220]. **GPT-Neo** [16] is another free version of GPT with 2.7B parameters. It was trained on the *Pile*, a 825 GB data set containing data from 22 diverse sources, including academic sources (e.g. ArXiv), Internet webpages (e.g. StackExchange), dialogs from subtitles, GitHub, etc. It outperforms the GPT-3 version with the same parameter size on some natural language understanding tasks [89]. Recently, **GPT-NeoX-20B** [215] was released. It has 44 layers, an internal vector dimension of 6144, 64 heads and uses batches of size 3.1M for training. In the LAMBADA benchmark (Sect. 4.1.3) with the task of predicting the missing last word of the last sentence of each passage, it achieves an accuracy of 72.0%. This value is close to GPT-3 with 75.2%.

**Megatron-LM** [193] scale language models such as GPT-2 and BERT efficiently by introducing intra-layer model parallelism. The authors place self-attention heads as well as feed-forward layers on different GPUs, reducing the memory burden of a single GPU. They present a GPT-variant with 8.3B parameters and a 3.9B parameter model similar to BERT. Highlights of the approach include 76% scaling efficiency when using 512 GPUs. Their GPT model reduces the *WikiText-103* [134] SOTA perplexity from 15.8 to 10.8 and their BERT model increases RACE (reading comprehension) [110] accuracy to 90.9%.

**Jurassic-1** [122] is an autoregressive language model similar to GPT-3 with 178B parameters. The authors chose a token vocabulary of 256k instead of 50k for GPT-3, which also included frequent multi-word expressions such as named entities and common phrases. The training text could be represented with 28% fewer tokens than GPT-3. Hence, the model can process queries up to 1.4× faster when using the same architecture. The model used a maximal sequence length of 2048 tokens. In spite of the larger vocabulary only 2% of all parameters were required for the input embeddings. The model was trained on 300B tokens drawn from public text corpora using a final batch size of 3.2M tokens.

**PanGu-***α* [248] is a model of Huawei similar to GPT-3 with up to 200B parameters. It was trained on 1.1TB Chinese text, and was applied to a large number of tasks in zero-shot, one-shot, and few-shot settings without any fine-tuning. The model has a performance comparable to GPT-3.

**OPT-175B** (Open Pre-trained Transformer) [253] is a suite of 8 GPT models with 125M to 175B parameters developed by Meta. It was trained on publicly available datasets with 180B tokens. The largest models has 96 layers, each with 96 heads. Although OPT-175B has the same parameter count as GPT-3, its training required only 1/7th of computing effort of GPT-3. The model was evaluated on 16 NLP tasks and showed approximately the same performance as GPT-3 (Table 3.4). All trained models up to 30B parameters are freely available. The large 175B parameter model is only available to academic researchers upon request to discourage the production of fake news. The model can be trained and deployed on only 16 NVIDIA V100 GPUs. Some benchmark results are provided in Table 3.4.

**BLOOM** [139] is an autoregressive large language model with 176B parameters. It has 70 layers with 112 attention-heads per layer and 2048 token sequence length. It was developed by the BigScience initiative of over 1000 AI researchers to provide a free large language model for everyone who wants to try. Its training data covers 46 natural languages (English 30%, Chinese 16%, French 12%, Spanish 11%, ...) and 11% code (java, php, . . . ) with 350B tokens. The 176B BLOOM model has been trained using the Megatron-DeepSpeed library [26] offering different types of parallelism. The model can be evaluated on 8 large GPUs. Hence, BLOOM is one of the largest trained model available for research purposes. Some benchmark results are provided in Table 3.4.

**Gopher** [168] employed the GPT-2 architecture with two modifications. For regularization the authors used RMSNorm (Sect. 2.4.2) instead of LayerNorm and they employed the relative positional encoding scheme [44] instead of absolute positional encoding. Gopher has 80 layers with 128 attention heads and 280B parameters. All models were trained on 300B tokens with a context window of 2048 tokens and a batch size of up to 6M tokens. For the large models a 16 bit float numbers was used to reduce memory and increase training throughput.

Six model versions with different numbers of parameters were trained to assess the effect of model size. The authors present a comprehensive evaluation on 152 tasks described in Table 4.3. Gopher shows an improvement on 100 of 124 tasks. One of these is the *LAMBADA benchmark* [154] where Gopher generates a zero-shot score of 74.5, which is only slightly below the value 76.6 of *MT-NLG* model with 530B parameters [106]. For instance Gopher achieves SOTA for all 12 benchmarks on humanities covering areas like econometrics and psychology surpassing the best supervised results for 11 benchmarks. Some results are provided in Table 3.4 while Sect. 4.1.4 describes more details.

**Chinchilla** [83] is a mid-size encoder model with 70B parameters, which has the same compute budget as the larger Gopher model, but four times as much data. Chinchilla consistently has a better performance than Gopher (Table 3.4) and significantly outperforms GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large set of downstream evaluation tasks. For every doubling of model size the number of training tokens should also be doubled. This is a much larger scaling rate than that predicted by Kaplan et al. [102] in Sect. 3.5.1.

**Turing-NLG** [179] introduces an autoregressive language model with 78 transformer layers, a hidden vector-size of 4256, 28 attention heads and 17B parameters. As a model with more than 1.3B parameters cannot fit into a single GPU with 32 GB memory it must be parallelized, or broken into pieces, across multiple GPUs. Turing-NLG leverages a SOTA Deep Learning hardware with high communication bandwidth, the Megatron-LM framework, and the DeepSpeed library, which further optimizes the training speed and reduces the resources needed. The model achieved SOTA performance on language modeling tasks and also proved to be effective for zero-shot question answering and abstractive summarization.

Its successor **MT-NLG** [4] is a 105-layer encoder model with 530B parameters and was trained across 280 GPUs with a huge batch size of 1920. Similar to GPT-3 it improves performance on zero-, one- and few-shot tasks. For the *LAMBADA benchmark* [154], for example, the model has to predict the last word of paragraph (Sect. 4.1.3). On this benchmark MT-NLG improves the few-shot accuracy of GPT-3 (86.4%) to the SOTA 87.2%.

**PaLM** [35] is an autoregressive language model developed by Google with 540B parameters. It has 118 layers, 48 heads and an input sequence length of 2048. There are also smaller versions with 8B and 62B parameters. It uses a standard autoregressive decoder with SwiGLU activation function and shared query-value projections for the heads of a layer, which improves autoregressive decoding speed. The model is trained on a high-quality dataset with 780B tokens, where sloppy and toxic language have been filtered. Each training example is used only once. The training set contains social media conversation (50%), multilingual web pages (27%), books (13%), source code files (5%), multilingual Wikipedia articles (4%), and news articles (1%). Training required 3072 TPU chips for 1368 h, resulting in a total emission that is 50% higher than the emissions for a direct round-trip flight in an aircraft between San Francisco and New York [35, p. 18].

PaLM was evaluated on hundreds of natural language inference, mathematical, reasoning and knowledge intensive tasks and achieved SOTA accuracy in the large

**Fig. 3.3** Evaluation of PaLM, GPT-3, Gopher, and Chinchilla (left). Previous models were only evaluated on a subset of tasks, so this graph shows the aggregated results on the 58 tasks where all three models have been evaluated [35]. The medium accuracy of PaLM is better than the average performance of humans. The right side shows the results for four specific BIG-tasks. A detailed comparison between the performance of three PaLM models of different size as well as human levels is presented in [35, p. 15f]

majority of benchmarks, e.g. in 28 of 29 most widely evaluated English language understanding benchmarks (cf. Table 3.4). This demonstrates that the scaling effects continue to hold for large Foundation Models. Figure 3.3 shows the results on BIGbench data compared to prior models. PaLM 540B 5-shot outperforms the prior SOTA on 44 out of the 58 common tasks, and on average is significantly better than the other models (Gopher, Chinchilla, GPT-3). Moreover, PaLM 540B 5-shot achieves a higher score than the average score of the humans asked to solve the same tasks. When fine-tuned on SuperGLUE, the model outperforms the best decoderonly model and is competitive with encoder-decoder models, which in general perform better for fine-tuning. A significant number of tasks showed discontinuous improvements from model scale, meaning that the performance improvement from the smaller version to the largest model was higher than expected.

PaLM has been fine-tuned on program code documents. The resulting model is called *PaLM-Coder* [35, p.23]. The quality of the code is measured by the pass@*k* metric, in which for each problem in the test set, *k* samples of source code are generated by PaLM-Coder, and a problem is counted as solved if any sample solves the problem. PaLM-Coder is able to solve a number of benchmark tasks with about a pass@1-value of about 50. There is an elaborate evaluation of the properties of the PaLM-Coder model.


**Fig. 3.4** Few-shot example of a chain-of-thought prompt for a common sense question-answering task [35, p. 38]. The same two example chains of thought were combined with different prompts requiring an answer

For about a quarter of tasks the authors observe a discontinuous jump in accuracy, if the model is increased from 58B to 540B parameters, far exceeding the 'power law' postulated by Kaplan et al. [102] (Sect. 3.5.1). Examples are 'english proverbs' and 'logical sequence' shown in Fig. 3.3. This suggests that new abilities of PLMs can evolve when the model reaches a sufficient size, and that these abilities also develop beyond the model sizes studied so far.

The training data contains 22% multilingual documents. For translation between different languages, the few-shot PaLM model comes close to or even exceeds the fine-tuned SOTA. For English-French translation, Palm 540B few-shot achieves 44.0 BLEU compared to a SOTA of 45.6. For German-English, PaLM 540B few-shot reaches 47.5 BLEU vs. a 45.6 BLEU SOTA. For other tasks like summarization and question answering, Palm 540B few-shot comes close to the fine-tuned models, and can outperform them in a few cases.

Reasoning with a number of intermediate steps was always difficult for language models. Recently chain-of-thought prompting (Sect. 3.6.4) was proposed which adds intermediate reasoning steps [226] into the few-shot prompts (Fig. 3.4). Following this recipe, the PaLM model similarly produces its own intermediate steps for a multistep problem before giving the final answer. This leads to a boost in performance for a number of benchmark tasks. Using this technique PaLM is even able to explain jokes, as Fig. 3.5 demonstrates.

**Fig. 3.5** By using thought-chain-prompts PaLM can explain jokes [35]

#### *3.1.3 Transformer Encoder-Decoders*

The Transformer encoder-decoder [212] was pre-trained with a translation task (Sect. 2.3). To improve performance a number of alternatives were proposed:


The details are given in the following paragraphs. A representative list of transformer encoder-decoders is provided in Table 3.5.

**MASS** [196] is based on the transformer architecture. In contrast to the original transformer, a sequence of consecutive tokens in the encoder is masked and the decoder's task is to predict the masked tokens recursively (Fig. 3.6). Therefore, MASS can jointly train the encoder and decoder to develop the capability of extracting embeddings and language modeling. MASS is fine-tuned on language generation tasks such as neural machine translation, summarization and conversational response generation. It shows significant performance improvements compared to prior transformer architectures.

**BART** [119] uses a standard Transformer-based encoder-decoder architecture. The pre-training task is to recover text corrupted by a number of different approaches (Fig. 3.6): predict masked tokens as with BERT; predict deleted tokens and their positions, predict the missing tokens replaced by a single mask, reconstruct a permuted sentence as with XLNet, and find the beginning of a rotated document. BART was fine-tuned on a number of tasks like GLUE, SQuAD, summarization, and machine translation. BART achieved the best performance with the prediction of missing tokens replaced by a single mask. A large version of BART was trained

**Table 3.5** Transformer encoder-decoders. The pre-training and fine-tuning loss functions are defined in Table 3.1. Benchmarks: En-De WMT2014 English-to-German BLEU, GLUE Sect. 4.1.1 accuracy, SuperGLUE Sect. 4.1.2 accuracy, TriviaQA [99] Sect. 6.2.1 accuracy, Penn Treebank [136] perplexity. The benchmark figures are only a hint, as they depend on the number of parameters and the computing effort



**Fig. 3.6** Different pre-training tasks to restore corrupted text by the transformer. Span masking is the task for MASS [196]. BART uses all tasks from token masking to document rotation [119]

with a hidden size of 1024 and 12 encoder and decoder layers with a similar dataset as used by RoBERTa. The resulting performance was similar to that of RoBERTa. For abstractive summarization, e.g. on the *CNN/Daily Mail benchmark* [78], BART achieves SOTA.

**Fig. 3.7** Every task in T5 is expressed as a translation task, where the type of the task is a prefix to the input text (on the left) and the model produces the corresponding output (right) . Adapted from [170, p.3] with kind permission of the authors

**PEGASUS** [251] proposed pre-training large Transformer-based Seq2seq models on massive text corpora with a new objective: *gap-sentences generation*, where sentences instead of tokens are masked or removed. The model has to generate these modified parts as a one sentence output. On 12 document summarization tasks the model achieves SOTA performance.

**T5** [170] is based on the standard transformer architecture. Pre-training is performed on a huge training set by restoring corrupted texts, which is formulated as a sequence-to-sequence tasks. The comparison of different pre-training tasks listed in Fig. 3.6 found that, similar to BART, text infilling achieves the best results. If the original text is "Thank you for inviting me to your party last week ." the model receives the input "Thank you [X] me to your party [Y] week ." with masked phrases and has to generate the output "[X] for inviting [Y] last [Z]" to reconstruct the masked phrases.

*Salient span masking* [72] was especially effective. To focus on relevant phrases a BERT-tagger was trained to recognize named entities (person names, locations, etc. Sect. 2.1.3), and dates were identified by regular expressions. If the model had to recreate these spans the model performance was significantly increased. By predicting the omitted tokens, the model is able to collect an enormous amount of information on syntactic and semantic knowledge. Extensive comparisons show that the sequence-to-sequence architecture yields better results than other architectures, e.g. autoregressive language models.

T5 is pre-trained on a multitask mixture of unsupervised and supervised tasks using a training dataset of 750 GB of cleaned English web text. Its largest version has 24 layers, 128 attention heads, and 11B parameters. For each task the data is converted into a text-to-text format (Fig. 3.7). The model achieves SOTA results on many benchmarks, for example summarization, question answering, text classification, and more. The results for GLUE is 90.3% [11].

**Primer** [195] proposes two modifications of the original self-attention architecture. First the ReLU activation function is squared. In addition, a convolution layer is added after each of the multi-head projections for query *Q*, key *K*, and value *V* . For the original T5 architecture this reduces the training cost by a factor 4.

**UniLM2** [8] simultaneously pre-trains a bidirectional language models and a sequence-to-sequence model for language generation. The model parameters are shared between the two tasks, and the encoding results of the context tokens are reused. The model uses two mask types, one for bidirectional masking similar to BERT and pseudo masks for language modeling. With special self-attention masks and position embeddings, the model can perform both language modeling tasks in one forward pass without redundant computation of context. The model beats BARTBASE for reading comprehension on SQuAD 1.1 and T5BASE for abstractive summarization on CNN/Daily Mail.

**GLM** (General Language Model) [54, 55] is a successor of UniLM2 aiming to combine the different learning paradigms of BERT, GPT and the transformer. For pre-training GLM has the task to generate multiple text spans in an autoregressive way basically using the GPT architecture. From the input text *x* = *(x*1*,...,xT )* a number *m* spans *xi*<sup>1</sup> *,...,xi*1+*li* are sampled. Each span is replaced with a single [MASK] token yielding the corrupted input *x*corrupt. The model then successively generates the tokens of the spans having access to the corrupted input and the already generated tokens of the spans (Fig. 3.8). Within the input text all tokens are connected by self attention while in the output section a masked self-attention is used. Each span is finished by an [END] token. To identify the positions of generated tokens two positions are encoded by embeddings: the input position and the position within a span. Note that the mask prediction can be done in arbitrary sequence and the model has to predict the length of the spans during reconstruction.

For fine-tuning, text classification tasks are converted to word predictions. To assess the sentence "The waiters were friendly." in a sentiment classification task

**Fig. 3.8** During pre-training GLM has the task to reconstruct masked single words or multi-word phrases. The position of generated words in the text and in the masks are indicated by position embeddings, which are added to the token embeddings. The generated answers are terminated by an [END] token [54]

the input is extended to "The waiters were friendly. It's really [MASK]." where [MASK] has to be replaced by "good" or "bad". For a text generation task a [MASK] token is appended to the input text. Then the model generates the continuation as the output text in an autoregressive way. In contrast to BERT the model observes the dependency between masked tokens yielding more consistent predictions. In comparison to XLNet no additional attention for position encoding is needed reducing the computational requirements. Compared to T5, GLM predicts the spans in arbitrary order and requires fewer extra tokens.

To evaluate the model performance, Du et al. [54] train GLMBASE and GLMLARGE with the same training data and parameter counts (110M and 340M) as BERTBASE and BERTLARGE. For both model configurations, GLM outperforms BERT on SuperGLUE (Sect. 4.1.2), e.g. GLMLARGE has an average score of 77.0 compared to 72.0 for BERTLARGE. On a larger pre-training dataset for a model with the same size as RoBERTa they yield an average SuperGLUE score of 82.9 compared to 81.5 for RoBERTa. They show that by multitask learning, a single model with the same parameters can simultaneously achieve higher accuracy in NLU, generating text given an input, and solve other tasks such as summarization [53].

Larger models like **GLaM** [51] and **WuDao-2.0** [257] have a mixture-of-experts architecture and are described in Sect. 3.5.2.

#### *3.1.4 Systematic Comparison of Transformer Variants*

As an example of a fair comparison of architectural features, we report the following experimental analysis of PLMs, where Narang et al. [142] evaluated the effect of a number of transformer modifications. The following transformer features were investigated:


The authors evaluated the variants in two settings: Transfer learning based on the T5 transformer (Sect. 3.1.3) and supervised machine translation on the *WMT2014 En-De* [17]. With some caution, the results can also be applied to other types of PLMs like BERT and GPT.

Each architecture variant of T5 was pre-trained on the *C4 dataset* [171] of 806 GB using the "span corruption" masked language modeling objective. Subsequently, T5 was fine-tuned on three tasks: the *SuperGLUE* language understanding task [219], the *XSum* abstractive summarization dataset [143], and the *WebQuestions benchmark* [13], where no additional knowledge was provided as background information. The computing effort and the number of parameters for each model was fixed to the same level. An exception was an architecture with significantly fewer parameters, which was trained for longer.

Several *activation functions* achieve a better performance compared to the ReLU activation, especially *SwiGLU* and *GEGLU*, which are *gated linear units*  (GLU) forming a product with another activation [189]. The improvement can be observed for pre-training, fine-tuning, and supervised training without affecting the computation time. For SuperGLUE, for instance, an increase from 71.7% to about 76.0% can be observed. Replacing *layer normalization* with *RMS normalization*  [249] causes performance gains for all tasks. The SuperGLUE score, for example, was improved from 71.7% to 75.5%. In addition, the training speed was higher.

As expected, increasing the depth of a models usually led to a better performance even if the number of parameters is kept constant. On SuperGLUE the model with 18 layers achieved a score of 76.5% compared to 71.7% for the base model. Similar improvements can be observed for WebQuestions and translation, while there were no improvements for the summarization task. This is in line with theoretical results (Sect. 3.5.1). A drawback is that deeper models require more computation time.

Architectures, which share parameters in different layers, usually lead to a decreased performance. The effect of using the same embeddings for encoders and decoders is mixed. Factorization of embeddings into a matrix product usually cause inferior results. If a *Mixture of Softmaxes* [239] is used to predict the output probabilities, the performance usually is better, e.g. an increase to 76.8% for SuperGLUE. However, this approach requires up to 40% more computation effort.

Of the architectural variants evaluated, two combinations of the *Synthesizers* with dot-product attention (Sect. 3.2.2) perform better than the standard Transformer. The Synthesizers do not compute a "correlation" of embeddings but determine the attention weights from a single embedding or randomly. Switch Transformer, Mixture-of-experts, and Product key memories all have significantly more parameters than the baseline transformer but are able to improve performance. The *Switch*  transformer ([56] Sect. 3.5.2) has many more parameters than the base T5 model. To reach the same performance as Switch, T5 needs seven times more training FLOPS (floating point operations per second). The *Mixture-of-experts* model [116] distributes computations to 2 expert models in both the encoder and the decoder. *Product key memory* ([112] Sect. 3.1.1) replaces the dot-product attention by a nearest neighbor search.

For all other 12 architectures, there were no improvements over the standard transformer [142]. This is different to the findings of the papers proposing the models. A reason seems to be that changes of the transformer architecture are difficult to transfer to other code bases and applications. Therefore, the authors propose to try out new modifications on different low-level implementations. In addition, a new approach should be evaluated on a variety of downstream applications including transfer learning, supervised learning, and language modeling. *Hyperparameter*  optimization should be kept fixed to assure the robustness of the approach. Finally, the mean and standard deviation of results should be reported to avoid the selection of a single best result.

#### *3.1.5 Summary*

The modification of pre-training tasks has a profound influence on the performance of PLMs. Many different types of pre-training losses have been evaluated, such as masked phrase prediction, replaced token detection, or sentence order recognition. According to the benchmarks, the prediction of permuted tokens by XLNET is especially rewarding because XLNET takes into account the dependency between masked tokens. In addition, DeBERTa's disentangled token and position embeddings are able to boost the performance in downstream classifiers. With respect to applications, autoencoders like BERT are particular important for information extraction in Chap. 5.

For autoregressive PLMs like GPT, a number of variants with larger model size and larger training data have been presented. However, in most cases, the pre-training tasks were not changed. The training of the larger models required improvements in the parallel computing infrastructure and resulted in an unprecedented performance in text generation. By creating custom start texts (prompting), the models can solve a large number of specific tasks with very high accuracy without further fine-tuning (Sect. 3.6.3). The amount and quality of knowledge captured by PLMs is surprisingly high and is discussed in Chap. 4. In terms of applications, autoregressive PLMs are used in particular for text (Chap. 6) and image generation (Sect. 7.2). Because of their versatility and the tremendous increase in performance, recent large-scale PLMs are called *Foundation Models*.

Encoder-decoder transformers were introduced for translating a text from one language to another. A number of new pre-training tasks were evaluated for these models. Some of them are similar to the tasks for autoencoders, such as predicting masked spans or inserting omitted tokens. Others were adapted to the inputoutput architecture, e.g. the reconstruction of sentence permutations and document rotations. Here BART and T5 achieved the best performances in the GLUE and SuperGLUE natural language understanding tasks. By creating additional synthetic training examples, the performance of T5 and other models can be increased (Sect. 3.6.6).

A systematic comparison of transformer architectures demonstrated that several architectural changes increased performance. The SwiGLU and GEGLU activation function instead of ReLU increased accuracy for SuperGLUE by more than 4%. Similar gains were observed when using RMS normalization instead of layer normalization. Increasing the model depth resulted in better performance even when the number of parameters was held constant. Synthesizers, mixtures-of-experts, and Product keys replacing scalar products by *k*-means clustering also performed better than the standard transformer.

T5 and GLM demonstrate that transformers, controlled by instructive prompts, can be used to solve arbitrary problems of text classification, text generation, and text translation. They thus combine the capabilities of BERT, GPT, and translation models. Transformers are used extensively in complex text generation tasks, e.g. machine translation (Sect. 6.3), dialog (Sect. 6.6), and image generation (Sect. 7.2).

#### **3.2 Capturing Longer Dependencies**

A well-known concern with self-attention is the quadratic time and memory complexity, which can hinder the scalability of the model in many settings (Sect. 2.1.6). If the sequence length *T* is increased to 2*T* then four times as many associations (attentions) between tokens have to be computed. This limits the direct applicability of models when a task requires larger contexts, such as answering questions or summarizing a document. Moreover, a larger memory is required to store the attentions for training. Therefore, a number of concepts have been proposed to cover long sequences without excessive computational and memory demands.


Surveys of techniques for enlarging the input sequence are provided by Tay et al. [207] and Fournier et al. [59].

#### *3.2.1 Sparse Attention Matrices*

**BigBird** [247] reduces the number of attention computations by omitting entries according to some pre-determined pattern from the matrix of attention relations.

**Fig. 3.9** Attention mechanism used in BigBird [247] to compute the association between input tokens. Matrix indicating attention between pairs of tokens: attentions between sequence neighbors (left), global attentions to a few tokens (second left), random attentions (third from left), the combined BigBird attentions (right). White blocks indicate omitted attention pairs

BigBird extends transformer-based models, e.g. BERT, and uses a set of *g global tokens* attending on all tokens of the sequence. In addition, each token *vt* attends to a set of *nl* local *neighboring tokens* and to a set of *nr random tokens*. The resulting association matrices are shown in Fig. 3.9. If the numbers *g*, *nl*, and *nr* do not increase with sequence length *T* the number of attentions grows linearly with *T* .

The model is constructed in such a way that the length of the path between arbitrary token pairs along intermediate tokens is kept small, as in a small-world graph. The authors prove that their model allows to express all continuous sequenceto-sequence functions with only *O(T )* inner products (Table 3.6). In addition, they show that under standard assumptions BigBird is Turing complete, i.e. can perform arbitrary computations (see also [246]). The BigBird attention module can be used in BERT, autoregressive language models, and Transformer architectures. In a number of applications BigBird using a sequence length of 4096 is able to improve the SOTA, e.g. for question answering requiring multi-hop reasoning from the given evidences. Note that BigBird without random attention performed better than BigBird with random attention in a set of experiments.

Prior models using these concepts were the *Sparse Transformer* [33] and the *Longformer* [10], which similarly to WaveNet [148] employ strided or "dilated" neighborhoods. Here not all adjacent neighbors are attended by a token, but only every *d*-th neighbor with *d >* 1. If *k* layers are used, this construction covers *d<sup>k</sup>* neighbors and thus allows associations over large distances. The **Extended Transformer Construction** (ETC) model [3] generalizes the idea of global tokens, which can communicate associations between far-away tokens of the whole sequence.

**GPT-3** [25] (Sect. 3.1.2) is a recent language model with 96 layers, 96 attention heads, 175 billion parameters covering sequences of length 2048. To cope with the excessive sequence length the authors used "alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer" [33]. The details of the architecture are not yet known. The model achieved an unprecedented performance in language modeling, question answering, etc., which is discussed in Sect. 3.6.3.


**Table 3.6** Important models with sparse self-attention for long dependencies. *T* is the sequence length, *g* number of global tokens, *k* is window size. (cf. [207])

#### *3.2.2 Hashing and Low-Rank Approximations*

The **Reformer** [108] introduces locality-sensitive hashing to cluster tokens with similar key/query vectors. This approach hashes similar input items into the same "buckets" with high probability. For each cluster the same query/key parameters are used. In this way, tokens are aggregated in a data-driven fashion. In a similar way, the *Routing Transformer* [180] clusters tokens by *k*-means clustering.

**Transformer-XL** [44] reuses computation results from prior segments of a sequence. With this recurrence mechanism applied to every two consecutive segments of a corpus, it essentially creates a segment-level recurrence in the hidden states. With multiple layers, the effective context being utilized can go way beyond just two segments. A similar approach is used by the *Compressive Transformer*  [169]. *Segatron* is a variant that encodes a paragraph index in a document, a sentence index in a paragraph, and token index in a sentence as embeddings to be added to the token embedding. This modification leads to a better perplexity in language modeling.

The **Performer** [34] reduces the computational load by employing low rank approximations of the self-attention matrix. It uses a random kernel with positive orthogonal random features to compute the self-attention. By orthogonality, the authors avoid computing the full square matrix of products, since the dot product of orthogonal features is 0. Hence, computation requirements grow linearly with sequence length. The authors are able to prove that their model allows nearlyunbiased estimation of the full attention matrix as well as uniform convergence and lower variance of the approximation.

The **Linear Transformer** [105] also uses a kernel-based formulation of selfattention reducing complexity to linear. For predicting the future elements from past inputs, the authors are able to construct an iterative algorithm similar to RNNs that is dramatically faster than standard transformers. The model has been shown to improve inference speeds up to three orders of magnitude without much loss in predictive performance.

The **Transformer-LS** (Long-Short Transformer) [258] has a local sliding window attention between neighboring tokens and a long-range attention with dynamic projections to represent relationships between distant tokens. The dynamic low-rank projections depends on the content of the input sequence. The authors claim that the approach is more robust against insertion, deletion, paraphrasing, etc. The scheme achieves SOTA perplexities in language modeling for different benchmarks, e.g. 0.99 for enwik8 and SOTA results as vision transformer on ImageNet.

The **Combiner** [174] represents groups of embeddings by key vectors. The probability that a given token *vt* attends to a token *vs* is described by a product, where *vt* first attends to the key vector that represents a group of locations containing *vs* multiplied by the probability of choosing *vs* within that group. In this way, the Combiner can be applied to sequences of length up to 12,000. The approach is able to achieve SOTA perplexity on large benchmarks. In addition, it improves the average performance on the *Long Range Arena benchmark* [209] specifically focused on evaluating model quality for long documents.

The **Synthesizer** [206] replaces the pairwise dot products of attention with "synthesizing functions" that learn attention matrices, which may or may not depend on the input tokens (cf. Sect. 3.1.4). In the Dense Synthesizer, each token embedding *xi*, *i* = 1*,...,T* , in a layer is projected to a vector of the length *T* using a two-layered nonlinear feed-forward network with a ReLU activation. The values of this vector are used as weights to determine the mixture of values to form the output embedding. Hence, no "correlations" between embeddings are computed to determine their similarity, as it is done for the standard self-attention. There is an extreme variant, where the mixing proportions are set randomly. Nevertheless, on multiple tasks such as machine translation, language modeling, dialogue generation, masked language modeling and document classification, this "synthetic" attention demonstrates competitive performance compared to vanilla self-attention. The combination of Random Synthesizers with normal dot-product attention is able to beat T5 on several benchmarks.

The **Perceiver** [93] defines an asymmetric attention mechanism iteratively converting the long input sequence *x*1*,..., x<sup>T</sup>* (e.g. the 50k pixels of an image) into a shorter sequence of latent units *u*1*,..., u<sup>n</sup>* (e.g. *n* = 512) that form a bottleneck through which the inputs must pass (Fig. 3.10). With cross-attention (Sect. 2.3.1) the *Q*-transformed latent sequence embeddings *Qu<sup>i</sup>* and the *K*-transformed long input sequence embeddings *Kx<sup>j</sup>* form a scalar product *(Qui)*-*(Kx<sup>j</sup> )*. It is used as a weight for the *V* -transformed long sequence embedding *V x<sup>j</sup>* to generate the new short embeddings. The Perceiver is basically a BERT model with a sequence

**Fig. 3.10** If the input sequence is too long, a short latent sequence is defined by the Perceiver. By

cross-attention between the long sequence and the latent sequence the information is compressed. A standard transformer block computes the self-attentions between the latent sequence elements, which in the end generates a classification [93]

length of *n* instead of *T* , which avoids that the computing effort scales quadratically with the input length. The iterative approach enables the model to devote its limited capacity to the most relevant inputs. In experiments the Perceiver was able to beat the leading ResNet-50 CNN with respect to image classification [93]. *Perceiver IO*  [92] projects the resulting *n* output embeddings of a Perceiver to a larger sequence of output embeddings by another cross-attention operation, which, for instance, gets the position embeddings of output elements as query vectors. The *Perceiver AR*  [73] extends the Perceiver to generate an output sequentially similar to the encoderdecoder transformer.

**S4** [68] is a Structured State Space Sequence model based on the Kalman filter for the observation of a state model with errors [101]. A continuous state space model is defined by

$$\mathbf{x}'(t) = \mathbf{A}\mathbf{x}(t) + \mathbf{B}\mathbf{u}(t) \qquad \mathbf{y}(t) = \mathbf{C}\mathbf{x}\_l + \mathbf{D}\mathbf{u}(t),\tag{3.1}$$

which maps an input signal *u(t)* to output *y(t)* through a latent state *x(t)*. The authors reparametrize the matrices *A* and decompose them as the sum of a low-rank and skew-symmetric term. Moreover, they compute its generating function of the associated infinite sequence truncated to some length *L* in frequency space. The

low-rank term can be corrected by the Woodbury identity for matrix inversion. The skew-symmetric term can be diagonalized and can be reduced to a Cauchy kernel [153].

The *A* matrix is initialized with an special upper-triangular "HIPPO" matrix that allows the state *x(t)* to memorize the history of the input *u(t)*. The authors prove that in complex space C the corresponding state-space model can be expressed by matrices *(-* − *PQ***∗***, B, C)* for some diagonal matrix  and vectors *P, Q, B, C* ∈ C. These are the 5*N* trainable parameters of S4, where *N* is the state dimension. Overall, S4 defines a sequence-to-sequence map of shape (batch size, sequence length, hidden dimension), in the same way as related sequence models such as Transformers, RNNs, and CNNs. For sequence length *L* this requires a computing effort of ∼*O(N* + *L)* and *O(N* + *L)* memory space, which is close to the lowest value for sequence models. Gu et al. [69] provide a detailed exposition and implementation of the S4 model.

In empirical evaluations it turned out that S4 for an input length of 1024 is 1.6 times faster than the standard transformer and requires only 43% of its memory. For an input length of 4096, S4 is 5 times faster and requires just 9% of the memory of the standard transformer. For the benchmarks of the *Long Range Arena benchmark*  S4 increased SOTA average accuracy from 59.4% to 80.5% (Table 3.7). Moreover, S4 was able to solve the extremely challenging Path-X task that involves reasoning over sequences of length 16k where all previous models have failed. Finally, S4 was able to perform raw speech signal classification on sequences of length 16k and achieves a new SOTA of 98.3% accuracy. S4 involves a genuine breakthrough in long range sequence processing. In addition, S4 is better in long-range *time-series forecasting*, e.g. reducing Mean Square Error by 37% when forecasting 30 days of weather data. *DSS* [70] is a variant of S4 that is simpler to formulate and achieves a slightly lower performance.

#### *3.2.3 Comparisons of Transformers with Long Input Sequences*

The *Long Range Arena* [209] aims to evaluate the performance on tasks with long input sequences from 1k to 16k tokens. It contains six different benchmark datasets covering text, images, mathematical expressions, and visual spatial reasoning. The tasks include ListOps (computations in a list-notation), text classification (classify IMDB reviews using character sequences), document retrieval (based on document embeddings), image classification (based on a sequence of pixels), and pathfinder (detection of circles) in two versions. The authors evaluate nine transformer architectures with the ability to process long inputs.

The results are shown in Table 3.7. For the hierarchically structured data of ListOps, it turns out that kernel-based approaches, for instance the Performer and the Linear Transformer, are not appropriate. For text classification, kernel-based


**Table 3.7** Accuracy results for the Long-Range Arena Benchmark. The best score is printed in bold, results improving the standard transformer are underlined (cf. [209])

methods perform particularly well. For image classification most models do well, except for the Reformer. The pathfinder task is solved by all models with an acceptable performance, with the Performer doing best. However, all models except S4 fail on the extended Pathfinder task and are not able to find a solution. In terms of all benchmarks, S4 is the best model by a wide margin.

With respect to speed, the Performer was best, being 5.7 times faster than the standard transformer on sequences of length 4k. Memory consumption ranged from 9.5 GB for the standard transformer to about 1.1 GB for the Linear Transformer. All other models except the Synthesizer require less than 3 GB with S4 doing well in both aspects.

#### *3.2.4 Summary*

There are a variety of proposals for PLMs to efficiently process long input sequences. Often a sparse attention matrix is employed, where only a part of the possible attentions is used to establish the connection between far-away positions. Usually, full attention is computed for near positions. Some tokens have a global attention to communicate information between positions not connected directly. A prominent example is BigBird, which adds random attentions. Its computational effort only grows linearly with input size and it still can perform arbitrary sequence computations. There are other architectures like the Performer and the Linear Transformer, which also exhibit linear growth.

Some architectures either approximate the attention matrices by low-rank factorizations or aggregate tokens, which express similar content (Reformer, Combiner). Another approach is to use a recurrence mechanism such that computations are reduced for far-away tokens (Transformer-XL, Linear Transformer, Transformer-LS, Perceiver). An alternative is the factorization of the self-attention matrix (Performer) or its replacement with simpler computations (Synthesizer). Recently, the S4 model has been proposed that applies a state-space model to long-range prediction. It uses an architecture based on complex number computations, which is completely different from the usual transformer setup. It outperforms all prior models by a large margin and is efficient in terms of computation time and memory.

The performance of these approaches was evaluated with six different benchmarks of the Long Range Arena. It turned out that S4 beats the other models with respect to all benchmarks. All approaches were able to reduce memory consumption compared to the standard transformer. The larger input length allow new applications, e.g. in raw speech processing, image processing or genomics [247].

#### **3.3 Multilingual Pre-trained Language Models**

There are more than 7100 languages in the world [9], and each language can express almost all facts and concepts. Therefore, PLMs should also be able to generate consistent representations for concepts in different languages. Languages differ to some extent in the basic word order of verbs, subjects, and objects in simple declarative sentences. English, German, French, and Mandarin, for example, are SVO languages (subject-verb-object) [100]. Here, the verb is usually placed between the subject and the object. Hindi and Japanese, on the other hand, are SOV languages, meaning that the verb is placed at the end of the main clause. Irish and Arabic, on the other hand, are VSO languages. Two languages that have the same basic word order often have other similarities. For example, VO languages generally have prepositions, while OV languages generally have postpositions. Also, there may be a lexical gap in one language, where no word or phrase can express the exact meaning of a word in the other language. An example is the word "Schadenfreude" in German, which roughly translates to "have joy because some other person has bad luck". More such differences are discussed by Jurafsky and Martin [100].

To gain cross-lingual language understanding, a PLM has to be trained with more than one language and has to capture their structural differences. During training, PLMs can establish an alignment between concepts in different languages.


Training a language model with several languages in parallel can improve the performance—especially for languages with little training data. This could already be demonstrated for static word embeddings [194].

#### *3.3.1 Autoencoder Models*

**mBERT** (multilingual BERT) [48] is a standard BERT model. It has been pretrained with the MLM loss on non-parallel Wikipedia texts from 104 languages and has a shared token vocabulary of 110k WordPiece tokens for all languages. This implies that Chinese is effectively character-tokenized. Each training sample is a document in one language, and there are no cross-lingual dictionaries or training criteria. To demonstrate its properties the model was fine-tuned to a multilingual version *XNLI* [40] of the Natural Language Inference (NLI) benchmark, i.e. the task to predict, whether the first sentence entails the second. It turns out that mBERT may be fine-tuned with a single language on NLI and still yields good test results on related languages [40, 232].

The results for 6 languages [111] are shown in Table 3.8. Compared to finetuning XNLI with all languages, there is only a small drop in accuracy for related languages, e.g. Spanish and German, if the fine-tuning is done with XNLI in English and the evaluation in the other language. For the other languages the reduction of performance is larger, but the results are still good. There is even a transfer of information between languages with different scripts, e.g. for Arabic and Urdu. The authors also consider the embeddings of a word and its translation. It turns out that the cosine similarity between a word and its translation is 0.55, although there is no alignment between languages.

Karthikeyan et al. [104] investigate the factors for the success of mBERT. They find that mBERT has cross-lingual capabilities even if there is absolutely no overlap in the token vocabulary. Moreover, a higher number of identical tokens in both vocabularies contributes little to the performance improvements. Comparing different language pairs the authors show that a large network depth and a high total number of parameters of a bilingual BERT are crucial for both monolingual and cross-lingual performance, whereas the number of attention heads is not a significant factor. On the other hand, the structural similarity of the source and target language, i.e. word order and frequency of words, has a large influence on cross-lingual performance.

**XLM** [111] improves the transfer of knowledge between different languages by using translated sentences from different language pairs during pre-training. The authors concatenate a sentence with its translations to another language for

**Table 3.8** Cross-lingual natural language inference (XNLI) [40] test accuracy for 6 languages. Fine-tuning with XNLI for all languages is compared to fine-tuning with XNLI only for English. Results for mBERT [48] and XLM [111]


**Fig. 3.11** The translation language modeling (TLM) task is applied to pairs of translated sentences. To predict a masked English word, the model can attend to both the English sentence and its French translation, and is thus encouraged to align English and French representations [111]

training and introduce a new *translation language modeling* (*TLM*) objective for improving cross-lingual pre-training. To predict masked words in the input sentence, the algorithm can attend to the words in the translated sentence. In this way, the model learns to correlate words from different languages. An example is shown in Fig. 3.11. As shown in Table 3.8, XLM has a much higher cross-lingual accuracy for XNLI compared to mBERT. The transfer from a model fine-tuned in English to other languages incurs only a small loss. The experiments show that TLM is able to increase the XNLI accuracy for 3.6% on average. The model was also evaluated for unsupervised machine translation from German and other languages to English, yielding a very good performance (cf. Sect. 6.3).

**Unicoder** [88] is an improved XLM model with three additional training tasks. Cross-lingual word alignment learns to associate the corresponding words in translated sentences. Cross-lingual paraphrase detection takes two sentences from different languages as input and classifies whether they have the same meaning. The document-level cross-lingual masked language model applies the MLM task to documents where part of the sentences are replaced by their translations. On XNLI the authors report an average accuracy improvement of 1.8%.

**XLM-R** is an optimized version of XLM [41]. It is based on RoBERTa and trained on a huge multilingual CommonCrawl dataset of 2.5TB covering 100 languages with a common vocabulary of 250k tokens. It increased the SOTA on the XNLI-score to 79.2%. For cross-lingual question answering, models are finetuned on the English SQuAD dataset and evaluated on 7 other languages. XLM-R improves the F1 score on this SQuAD version by 9.1%–70.7%. It outperforms mBERT on cross-lingual classification by up to 23% accuracy on low-resource languages. The performance of XLM-R is nearly as good as that of strong monolingual models.

These results support the observation that the performance of PLMs can be improved by training on large volumes of text [102]. More languages lead to better cross-lingual performance on low-resource languages under the condition that the model capacity is large enough. Combined with the approach of Aghajanyan et al. [2], which avoids too large changes in representation during fine-tuning (Sect. 3.6), the XLM-RLARGE model increases the SOTA in XNLI to 81.4%. If an additional criterion of separating semantically-equivalent sentences in different languages from other sentences is added to XLM-R, the accuracy on semantic tasks is increased [228]. Even larger models like *XLM-R*XXL [66] with 10.7B parameters were pre-trained on CC-100, which consists of 167B tokens of non-parallel text also covering low-resource languages, and increased the XNLI performance by 2.4%.

**RemBERT** [37] redistributes the parameters of multilingual models. First the authors showed that using different input and output embeddings in state-of-the-art pre-trained language models improved model performance. Then they demonstrated that assigning more parameters to the output embeddings increased model accuracy, which was maintained during fine-tuning. As a consequence Transformer representations were more general and more transferable to other tasks and languages. The *Xtreme* collection [86] is a multitask benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. RemBERT outperformed XLM-R on Xtreme, despite being trained only on a smaller subset of training data and ten additional languages.

PLMs like BERT generate contextual token embeddings. However, the user often needs contextual *embeddings for passage* or sentences to compare their content. **LaBSE** [57] is a language-agnostic generator of passage embeddings, where source and target sentences are encoded separately using a shared BERTbased encoder. The representations of [CLS] in the final layer were taken as the *sentence embeddings* for each input. LaBSE combined a masked language model (MLM) and a translation language model (TLM) loss with a margin criterion. This criterion computes the cosine distance cos*(x, y)* between the passage embeddings *x* and the embedding *y* of its correct translation. Then it is required that *cos(x, y)*−*m* is larger than cos*(x, yi)*, where *m* is a positive margin and the *y<sup>i</sup>* are embeddings of arbitrary other passages. LaBSE was trained using 17B monolingual sentences and 6B bilingual translated sentences. The resulting sentence embeddings markedly improve the retrieval accuracy SOTA of sentences in cross-lingual information retrieval (cf. Sect. 6.1). The code and pre-trained models are available.

#### *3.3.2 Seq2seq Transformer Models*

**mT5** is a multilingual version of the T5 Seq2seq transformer (Sect. 3.1.3) with up to 13B parameters [236]. It was pre-trained using a training dataset of web pages covering 101 languages with about 48B tokens and a common vocabulary of 250k tokens. For pre-training, the model had to predict masked phrases in monolingual documents in the same way as T5. Similar to T5 the model may be instructed to perform different tasks by a prefix, e.g. "summarize". These tasks were trained by fine-tuning on the corresponding datasets.

For the *XNLI benchmark* [40] the model has to decide, if the first sentence entails the second sentence. When the model is fine-tuned on XNLI with English data and performance is measured for 15 languages, accuracy is 84.8% compared to 65.4% for mBERT, 69.1% for XLM, and 79.2% for XLM-R. Although the texts in the different languages are not parallel, the model is able to exploit structural similarities between languages to solve the task. The code of this model is available at [235]. Similar models are used for multilingual translation (Sect. 6.3). **mT6** [31] enhances the training of mT5 with pairs of translated sentences and defines new training tasks. Experimental results show that mT6 has improved cross-lingual capabilities compared to mT5. A further improvement is **Switch** [56] with a *mixture-of-experts*  (*MoE*) architecture of mT5 requiring only one fifth of the training time of mT5 while yielding a performance gain across all 101 languages (Sect. 3.5.2).

**mBART** [126] is a multilingual encoder-decoder based on the BART model (Sect. 3.1.3). The input texts are corrupted by masking phrases and permuting sentences, and a single Transformer model is pre-trained to recover the corrupted text. This is performed for the training documents covering 25 languages. Subsequently, the pre-trained model is fine-tuned with a translation task between a single language pair. In addition, *back-translation* may be used, where another model is trained to translate the target sentence back to the source language and an additional loss encourages to reconstruct the source sentence. mBART adds a language symbol both to the end of the encoder input and the beginning of the decoder input. This enables models to know the languages to be encoded and generated. It turns out that pre-training improves translation, especially for languages with little parallel training data. In addition, back-translation markedly ameliorates the translation results. Many experiments are performed to analyze the effect of different algorithmic features. Pre-training is especially important if complete documents are translated instead of single sentences.

mBART may also be used for *unsupervised machine translation*, where no parallel text of any kind is used. Here the authors initialize the model with pretrained weights and then learn to predict the monolingual sentences from the source sentences generated by back-translation. The results for languages with similar structure are very good, e.g. for En-De mBART achieves a BLEU-value of 29.8, which is close to the supervised value of 30.9. Note that mBART has a similar performance as MASS (Sect. 3.1.3). For dissimilar pairs of languages, e.g. English-Nepali, mBART has reasonable results where other approaches fail.

**MARGE** [118] is a multilingual Seq2seq model that is trained to reconstruct a document *x* in one language by retrieving documents *z*1*,...,zk* in other languages. It was trained with texts in 26 languages from Wikipedia and CC-News. A document was encoded by the output embedding of the first token of a Transformer [212]. A retrieval model scores the relevance *f (x, zj )* of the target document *x* to each evidence document *zj* by embedding each document and computing their cosine similarities. A transformer receives the embedded texts of *z*1*,...,zk* and auxiliary relevance scores *f (x, zj )* from retrieval as input and is trained to generate the target document *x* as output. The similarity score is used to weight the cross-attention from the decoder to the encoder, so that the decoder will pay more attention to more relevant evidence documents. The models jointly learn to do retrieval and reconstruction, given only a random initialization. In a zero-shot setting the model can do document translation with BLEU scores of up to 35.8 in the *WMT2019 De-En benchmark*, as well as abstractive summarization, question answering and paraphrasing. Fine-tuning gives additional strong performance on a range of tasks in many languages, showing that MARGE is a generally applicable pre-training method.

**XLNG** [32] pre-trains the same Seq2seq model simultaneously using an MLM and a translation TLM loss (Table 3.1). The pre-training objective generates embeddings for different languages in a common space, enabling zero-shot crosslingual transfer. In the fine-tuning stage monolingual data is used to train the pre-trained model on natural language generation tasks. In this way, the model trained in a single language can directly solve the corresponding task in other languages. The model outperforms methods based on machine translation for zeroshot cross-lingual question generation and abstractive summarization. In addition, this approach improves performance for languages with little training data by leveraging data from resource-rich languages.

#### *3.3.3 Autoregressive Language Models*

Generative models like GPT-3 are trained on huge collections of documents which usually contain texts from different languages. By this training data, the model also acquires the knowledge about these languages and generates joint contextual representations of meanings. As described in Sect. 3.6.3, it is able to translate between languages if given an appropriate prompt and some examples (few-shot learning). On WMT2016 En→De, for instance, GPT-3 achieves a few-shot BLEU of 29.7 compared to a supervised SOTA of 41.2, whereas in the De→En direction GPT-3 outperforms the current SOTA of 40.2 BLEU with 40.6 BLEU [25].

Winata et al. [231] evaluate in detail the multilingual capabilities of GPT-2, GPTNEO and T5 with 1.6B, 6B, and 3B parameters respectively. The models are able to use the context from English to predict the answer in non-English languages. The authors find that the largest model GPTNEO always performs best on a set of multilingual benchmarks. The performance depends on the language pair. The models, for instance, achieve higher performance for En→Es than for the other two target languages (De and Fr). For the *MultiNLU benchmark* [187] the error 12.1% of the SOTA model fully trained on the target language is not much lower than the error of 17.3% for few-shot prompts of GPTNEO.

#### *3.3.4 Summary*

Machine translation is one of the most widely used applications of NLP. Languages have both structural and lexical differences that make translation difficult. The joint processing of multiple languages must take these differences into account.

When BERT is trained with documents from multiple languages, it is able to transfer knowledge between languages, e.g. solve language inference tasks, even if it has no access to parallel texts. Knowledge transfer is improved in XLM by using the translation language modeling loss, such that translated sentences are employed to reconstruct masked tokens. There are a number of improved versions of XLM that are able to increase the accuracy of cross-language inference.

Encoder-decoder models such as T5 can be generalized to multiple languages and induce powerful multilingual embeddings. mT5 can be controlled by a prefix and solves various task like translation, summarization, and language inference. mT6 and Switch are more effective variants of mT5. mBART is pre-trained by recovering corrupted text in different languages. It can even be used for unsupervised machine translation. XNLG generates joint embeddings in a multilingual space and MARGE leverages retrieval of background documents to reconstruct a target document. Both models are able to perform multiple tasks such as abstractive summarization, question answering, and paraphrasing. Note, however that specialized models are used for translating single language pairs (Sect. 6.3.1).

Autoregressive language models such as GPT-3 are trained on huge corpora, which also contain multilingual documents. Therefore, these models can also be instructed by few-shot learning to perform multilingual tasks such as translations or question answering. However, performance is usually not as good as for dedicated, fine-tuned models.

#### **3.4 Additional Knowledge for Pre-trained Language Models**

During unsupervised pre-training, PLMs like BERT and GPT2 are forced to predict missing words from the context. They are optimized to predict either the next word in a sequence or some masked words (e.g. "Einstein was [MASK] in the city of Ulm." ). Trained on this task, they obviously gather knowledge about real-world facts and relations from the training data. PLMs do surprisingly well in reproducing facts and relations based on unsupervised training. In Sect. 4.2 we discuss, what knowledge is covered by standard PLMs. It turns out, however that due to the still limited number of parameters only a fraction of knowledge contained in the training data can be remembered by a PLM. In addition, events that occurred after the training are missed.

**Fig. 3.12** A PLM gets an input text and collects additional knowledge from different sources. This knowledge may be added beforehand or can be retrieved on demand. Subsequently, an output is generated using the additional knowledge

This section presents methods for extending factual knowledge in PLMs, either during training or on the fly during actual model usage Fig. 3.12.A *Knowledge Base* (*KB*) describes knowledge about the world, e.g. by entities and their relations. We outline a number of different approaches with which information in KBs or other knowledge sources such as text collections can be incorporated into PLMs (Table 3.9):


**Table 3.9** Models integrating additional knowledge (cf. [166, p. 10]). Benchmarks: GLUE natural language understanding Sect. 4.1.1, TACRED relation extraction Sect. 5.4.2 [199], TriviaQA question answering Sect. 6.2.1 [99], English all word WSD [14], Nat. Quest question answering [109] Sect. 6.1.2


Enhancing Logical Consistency: PLMs sometimes do not generate logically consistent content. By additional fine-tuning tasks a model can be trained to respect logical consistency.

Surveys of methods to incorporate domain knowledge into Deep Neural Networks are given by Dash et al. [45] and Yu et al. [243].

#### *3.4.1 Exploiting Knowledge Base Embeddings*

Typically, *Knowledge Bases* are graph structures where the nodes correspond to entities and the edges represent *relations* connecting the entities. Many large-scale KBs, such as *WordNet* [137], *YAGO* [200], *Freebase* [18], *DBpedia* [15], and *DiffBot*  [77] have been released in recent years with millions of entities. Figure 3.13 shows a small subset of the WordNet hierarchy. In most cases a KB can be described by triples *(h, r, t)*, where *h* and *t* are entities in a set *E*, and *r* is a relation holding between these entities. To assess the semantic contents of a KB, it was proposed to encode its entities as well as its relations as embeddings in a low-dimensional space, allowing to determine the similarity of entities and relations [43]. Subsequently, these embeddings can be used to disambiguate entities (entity linking, Sect. 5.3.3), or predict new relations (Sect. 5.4).

For the embeddings *emb(*word*)* of words generated by Word2Vec [135] it turned out that relations between entities often are represented in the space of word embeddings as vector differences between entity embeddings (Sect. 1.5). An example is the relation between a country and its capital, for which we have approximately *emb(*Germany*)* − *emb(*Berlin*)* ≈ *emb(*France*)* − *emb(*Paris*)* .

The **TransE** model [20] is built on this pattern. TransE adapts the embeddings in such a way that whenever *(h, r, t)* holds and *emb(h)* and *emb(t)* are the embeddings of *h* and *t*, then equation *emb(h)* + *emb(r)* ≈ *emb(t)* should be approximately valid for some vector *emb(r)*, which is considered as the embedding of the relation *r*. Consequently, for all triples *(h, r, t)* in the set *S* of correct triples the *TransE-loss* 

**Fig. 3.13** Small part of the WordNet knowledge base describing the relations between English words. It contains synsets of word with approximately the same meaning, which are related by the hypernym (is-a) meronym (has-part) and member-of relations [137]

**Fig. 3.14** KEPLER [224] trains a conventional BERT-like model by the MLM-loss. For a knowledge base with text entries it generates entity embeddings using the special *<S>* token and encodes relations by the TransE-loss. Both loss functions are added during training

*fr(h, t)* = *emb(h)* + *emb(r)* − *emb(t)* 2 <sup>2</sup> should become 0. The TransE-model uses the hinge loss to approximate this goal, which modifies the embeddings in such a way that *fr(h, t)* for correct relation triples gets lower than *fr(h,*˜ *t)*˜ for randomly selected incorrect triples *(h, r,* ˜ *t)*˜ . The models and embeddings are trained with relations from WordNet and Freebase.

There are a number of more elaborate models to encode relations from KBs, as described in the surveys [43, 94]. *TransH* overcomes TransE's inability to model complex relations, and *TransD* aims to reduce the parameters by proposing two different mapping matrices for head and tail. But these alternatives are rarely used for contextual embeddings. Another method for KB representation is tensor factorization [144, 145]. This approach, however, is not based on word embeddings and therefore mainly used for KB completion and not to enhance PLMs.

In the rest of the section we describe approaches, which merge KB-embeddings usually computed by TransE and token embeddings generated by language models. A difficulty is to establish a relation between the token embeddings and the entities, which usually contain several tokens.

**KEPLER** [224] consists of a BERT-like language model generating token embeddings by the MLM objective. In addition, it computes embeddings for entities from descriptive text in the KB using a special token "*<S>*" at the beginning of the input text. This token is trained to produce an embedding of the named entity argument of the relation, e.g. for the input "*<S>* Johannes Kepler" in Fig. 3.14. In this way, the arguments *h* and *t* of the relation are embedded. The embedding of the relation *r* is either a parameter to be trained, or it may be determined by the text verbalizing the relation. These embeddings are fed into the TransE loss and used as an extra training criterion in addition to MLM (Fig. 3.14). In a number of language understanding tasks the approach is able to achieve good results. On the relation extraction benchmark *TACRED* [254] the approach reaches 71.5% F1-value.

**KnowBERT** [157] explicitly models entity spans in the input text and uses an entity linker to retrieve precomputed entity embeddings from a KB to form knowledge enhanced entity-span representations. The KB-embeddings are precomputed with a loss function similar to TransE. Projection mappings are used to transform LM-embeddings to KB-embeddings and vice versa. Information from the best matching KB-embeddings is averaged and retransformed to enhance the LM-embeddings. These computations form an additional layer of BERT. Wikipedia and WordNet were used as KBs. To test KnowBERT's ability to retrieve facts from the KB, a relation was formulated and one argument of the relation was masked. KnowBERT reaches a *mean reciprocal rank* (*MRR*) of 0.31, indicating that on average the correct entity appeared on rank 3, whereas for BERT it shows up on rank 9. Hence, the model generates better answers than BERT, but is only approximately able to reproduce the relations of the KB. However, it often leads to improvements in downstream tasks.

**ERNIE-THU** [255] relates named entities in a KB to the named entities in a document in a similar way, and transforms embeddings between these two spaces. *E-BERT* [162] is similar in spirit to KnowBert, but it requires no expensive further pre-training of the BERT encoder. *Facts as Experts* [213] also links factual information and entities using embeddings, and in this way can inject new information into the model.

In summary the methods presented in this section directly infuse domain-specific knowledge expressed by relation embeddings into token embeddings of PLMs. There are, however, a number of disadvantages. The KB entity embeddings are separately pre-trained with some knowledge embedding models (e.g., TransE [20]) and fixed during training of the PLMs. Thus KB-embedding and token embeddings are not learned simultaneously. Moreover, the KB entity embeddings often cannot fully capture the rich contextual and relational information of an entity in the KB. Furthermore, they are static and do not depend on the context. In addition, they rely to a great extent on the performance of the linking algorithm and on the reliability of graph embeddings. This means that in general other approaches perform better, e.g. for relation extraction (Sect. 5.4).

#### *3.4.2 Pre-trained Language Models for Graph Learning*

Relations between objects and concepts can be joined in a graph and provide a uniform representation for the relatedness of many items. Using the structure of a graph many properties of nodes can be predicted. In recent years there was a great effort to design models which can capture the composition of a graph and predict its parts, e.g. *node2vec* [67] or *graph convolutional networks* [107]. However, the node representations obtained by such deep models tend to be oversmoothed and also become very vague. PLMs potentially are able to improve the representation by self-attention over long distances. Xia et al. [233] provide a survey on PLMs for graphs. Nodes and edges are characterized by different feature and position embeddings, and are processed with different types of PLMs. Prominent applications are *recommender systems* exploiting user-product graphs and *drug discovery* evaluating molecule structures.

**Graph-BERT** [250] is trained on sample nodes taken from a large graph together with their context. These samples are drawn using the closeness according to the PageRank algorithm [24] and contain no direct link information. Nodes are characterized by feature embeddings, embeddings based on the PageRank information, and hop-based distance embeddings. These embeddings are summarized and form the input of a BERT model. The model is pre-trained to reconstruct the information of masked nodes and to predict the relation between two nodes by evaluating their cosine similarity. The model is fine-tuned for node classification and graph clustering. Graph-BERT achieves the second-best accuracies for node classification on three graph benchmarks [128, p. 16].

**GPT-GNN** [87] proposes an autoregressive PLM to perform an iterative reconstruction on given graphs. The method assumes a random order on the edges and nodes. Given the edges and nodes up to a specific position, it predicts the properties of the next nodes/edges. GPT-GNN generates one masked node and its edges at a time and optimizes the parameterized models via maximizing the likelihood of the node and edges generated in the current iteration. Then, it iteratively generates nodes and edges until all masked nodes are generated. The model is trained on a graph of 178M scientific papers with their features, the venue and the authors, and on a graph with 83M Amazon reviews, users and products. On both benchmarks the model has the best accuracies.

**MPG** [120] consists of a BERT model encoding node and edge features. As a pre-training task, the model has to learn whether two graphs divided into two halves actually belong together or whether the halves are a random pair. The model is applied to the modeling of molecules and achieves SOTA results on a range of 14 benchmarks, especially drug discovery.

**GraphFormers** [238] jointly models a graph structure together with sequences of words. Each node of the graph contains a text. A center node and its neighbors are tokenized into sequences of tokens. The model has special transformer layers for computing the embeddings of text tokens and for the derivation of node embeddings by aggregating the corresponding text embeddings. The model is pre-trained with the task to predict, if two nodes are linked or not. GraphFormers is tested on three benchmark tasks, e.g. a graph with scientific papers characterized by their titles and their citation graph. The model consistently outperforms all prior approaches in the prediction of links.

#### *3.4.3 Textual Encoding of Tables*

Tabular data probably makes up the majority of all business and administrative data today. Examples are retail transactions, official statistics, processing data from industrial applications, etc. A survey on the interpretation of tables on the web is provided by de Alwis et al. [46]. Previous work often relies on manually selected features, cannot handle the flexible schemas in web tables, and does not generalize well across tasks.

**Fig. 3.15** Learning table relations with TURL [47]. On the left side the table caption and the column headers are trained. On the right side the row markers together with input entities (cells in a specific row) are processed


**Fig. 3.16** TaBERT [241] encodes the rows of a table as text in a special format. The "context" contains corresponding text. Each table cell is represented as (column header, column value type, value). Here the first table row is encoded by the line starting with [CLS]

**TURL** [47] characterizes a relational table by the table caption *C* (a short text, may be enhanced by section title), column headers *hi* (a sequence of tokens) describing the table scheme *H* = {*h*1*,...,hm*} and cell values, where each cell may represent an entity, e.g. a person. Cells in the same row share some relation, and cells in the same column share another relation. This requires a structureaware attention mechanism implemented by a visibility matrix, which restricts the attention to specific columns and rows.

TURL is pre-trained according to the masked language model loss on a large unstructured dataset consisting of the table captions and headers. Subsequently, the relation between entities in the same row or column can be learned. Entities in a table are masked, and the model has the task to predict them based on the table context and the visibility matrix. By this target TURL can learn factual relations from the table and encode them into entity embeddings (Fig. 3.15).

The model is trained on 570k tables extracted from Wikipedia. All columns containing at least one linked cell are marked as entity columns. After fine-tuning, the model is able to predict the masked contents of table cells in the test set with precision of 54.8%, beating competing approaches. An ablation study shows that the visibility attention matrix is essential for achieving a high performance.

**TaBERT** [241] aims to include both, natural language text and structured table data. TaBERT is trained on 26.6M tables and surrounding text from English Wikipedia and the WDC WebTable Corpus [115]. Each table cell is described as (column header, column value type, value). Subsequently, the table rows are encoded as text, as shown in Fig. 3.16. For pre-training 20% of the columns of a table are randomly selected and the model has to predict the masked column names and types. In addition, the cell values are reconstructed according to a special scheme. The model is fine-tuned on the *WikiTableQuestions benchmark* [155], which contains questions requiring compositional, multi-hop reasoning over a series of entries in the given table. To reduce effort only table rows containing query tokens are encoded. TaBERT is able to increase the SOTA accuracy on this benchmark to 51.8%. The authors show that their table cell encoding is more effective than alternatives. **RPT** [205] proposes a similar scheme for table encoding. **BRIDGE**  [124] is a system for *semantic parsing*, which converts information from text and tables to an SQL query extracting information from a database.

**Tapas** [81] is a variant of BERT optimized for table processing. The table is flattened row-by-row, tokenized and enhanced with position embeddings. Following embeddings are added: a row id embedding, a column id embedding, and a rank embedding indicating the rank in the sorted sequence, e.g. for numbers. The model is pre-trained on 6.2M table-text pairs from the English Wikipedia with the task to restore words in both table and text that have been replaced with a mask. The model can do this with relatively high accuracy (71.4% accuracy on a test set).

During fine-tuning the model learns to answer questions from a table, e.g. "Which wrestler had the most number of reigns?" for a table with wrestling results. [CLS] and a query are prepended to the flattened table and both parts are distinguished by an additional segment embedding. The model has two output types: (1) a score for each table cell with the probability that this cell will be part of the answer and (2) a probability of the result type (none, count, sum, average) for [CLS] to produce the final answer. Together the result indicates which operation should be performed over which table cells to generate the final answer. On several benchmarks Tapas reaches SOTA results, e.g. improving from 55.1% to 67.2% for *SQA* benchmark [90]. The source code and pre-trained models are available at Hugging Face.

The results show that the models described above are able to extract information from tables and answer question about the table content. This makes it possible to use a large source of information, since tables are ubiquitous in text documents and web pages. In principle, the approach can also be used by large Foundation Models to include table information in the text they generate.

**TableGPT** [63] generate a text from a table using the GPT-2 language model. It enhances GPT-2 for table-to-text generation with two auxiliary tasks, table structure reconstruction and content matching, for improving text fidelity.

#### *3.4.4 Textual Encoding of Knowledge Base Relations*

A number of proposals try to verbalize KB-relations as text. In this way, KBrelations may be directly incorporated in the training text of the language models.

**WKLM** [234] randomly replaces a fraction of the entity mentions in the original document with names of other entities of the same type. The model is trained to

**Fig. 3.17** CoLAKE [202] identifies entities and encodes them with specific embeddings. Type embeddings distinguish words, entities and relations. The input embeddings are the sum of token/entity, position, and type embeddings. For all entities in the input text relations are extracted from the Knowledge Base and appended after "[SEP]", e.g. mother(Harry Potter, Lily Potter). A masking mechanism ensures that relation elements can attend only to their corresponding elements in the input text. During pre-training the model has to predict masked tokens and entities

distinguish the correct entity mention from the randomly chosen ones. In addition, the model has to predict masked token. The types of entities are obtained from Wikidata [214]. In this way, the model can better capture entity information from natural language and yields better results for entity-related NLP tasks. WKLM is able to predict relation arguments much better than BERT. In question answering (SQuAD and open domain, Sect. 6.2) the model is also able to reach SOTA results. Similar approaches [191, 203, 234] propose entity and phrase masking and replacement schemes.

**CoLAKE** [202] extracts the knowledge context of an entity from large-scale knowledge bases. The model links entity mentions to the underlying entities in a KB by an entity linker. The mention nodes are then replaced by their linked entities. The CoLAKE model is initialized with the RoBERTaBASE model. It is trained on Wikipedia with 3 million entity embeddings and 822 relation embeddings aligned to the *Wikidata5M* KB [224] on 26M training samples. The example input "[CLS] Harry Potter points his wand at Lord Voldemort [SEP]" is shown in Fig. 3.17. The type of inputs (word, entity, relation) is encoded as type embeddings and added to the token and position embeddings. To introduce a relation from the KB, e.g. "(Harry Potter, mother, Lily Potter)", the relation node "mother" and the entity node "Lily Potter" are introduced with the position embeddings 2 and 3, as the first relation argument "Harry Potter" is located at position 1. Self attention is computed between text inputs. There is a masking mechanism restricting the self-attention for relation elements, e.g. to the pairs "(Harry Potter, mother)" as well as "(mother, Lily Potter)" in our example.

During pre-training about 15% of the input elements (words, entities, relations) are masked and have to be predicted by the model. As entity nodes simultaneously appear in the input text and the knowledge base this helps to align the representations of language and relations. Masking relation nodes helps CoLAKE to learn contextualized representation for relations. On the language understanding tasks of GLUE the CoLAKE model achieves a similar average of 86.3 as RoBERTa. An alternative task consist of the completion of relation triplets *(h, r, t)* using a sentence describing the relation. It turns out that CoLAKE is much better than its competitors, e.g. the correct relation is inferred from two entities in 72.1% of the cases.

**LUKE** [237] treats words and entities in a given text as independent tokens, and outputs contextualized representations of both. The model is based on BERT and trained to predict randomly masked words and entities in a large entityannotated corpus derived from Wikipedia. It contains an entity-aware self-attention mechanism that is an extension of BERT's self-attention. It takes into account embeddings indicating if a token represents text or an entity. LUKE yields SOTA results in relation classification, entity typing and NER. **K-adapter** [222] is a related approach using RoBERTa (Sect. 3.1.1) as fixed background model and building several independent "Adapters" to include knowledge from different KBs.

**EWISER** [14] similarly targets word sense disambiguation (WSD). Starting with BERT embeddings, it computes scores for WordNet synsets (sets of words with similar meaning). Exploiting the interdependence of the synset graph the approach computes final scores that a word belongs to a synset. It achieves a new SOTA on a number of WSD benchmarks (Sect. 5.2).

**PET** (Pattern-Exploiting Training) [184] as an alternative constructs an additional training set using only a few labeled examples. Consider a 5-star scale rating for a restaurant in the Yelp dataset [185]. The authors add text to the reviews to express the ratings, e.g. "All in all it was great". Using this approach the authors convert the Yelp dataset to a task for predicting masked words, e.g. "All in all it was [MASK]". However, they provide the verbalized labels only for a small number of examples. Subsequently, they predict the best class for the non-labeled examples and train the model with the predicted classes as well as the language modeling loss to avoid *catastrophic forgetting*. This can be done in several iterations. Although only a few labels have been used, the model performs better on Yelp than standard supervised approaches. The SuperGLUE benchmark data covers eight challenging NLP tasks. With just 32 labeled examples the PET approach trained according to the above schema yields a better average (75.4%) than GPT-3 (71.8%) with the same number of few-shot examples. This shows that good results can be achieved with a small model (223M) and only few labeled examples. Note that the fine-trained SOTA for SuperGLUE is 90.4% using T5 and Meena.

**TeKGen** [1] is a data-to-text sequence-to-sequence model to verbalize a complete KB. It is applied to the English *Wikidata knowledge base* [214] with ≈ 6M entities and about 1500 relations. The model starts with a large training corpus of heuristically aligned Wikipedia text and Wikidata triples. Relations sharing a common entity subject are converted to the input subject relation1 object1, ..., relation*n* object*<sup>n</sup>* for the T5 transformer (Sect. 3.1.3). As an example "To kill a Mockingbird, author: Harper Lee, publication date: 11 July 1960" is translated to "To Kill a Mockingbird is a novel by Harper Lee published in 1960." The T5 model is fine-tuned and subjected to an addition check to generate good verbalizations.

The resulting dataset of verbalized triples was used in a question answering task. It was able to increase the accuracy in the *Natural QuestionsNatural Questions (NQ) benchmark* [109] (Sect. 6.1.2) from 38.8% to 41.5%. **KGPT** [30] in a similar way converts structural knowledge into the serialized text and lets model learn knowledge-text alignments.

In summary these methods transform KB relations into text, e.g. as complete sentences expressing relations or as concatenated triples (e.g., [head text, relation text, tail text]) into LMs for training or fine-tuning. This text is transformed into contextual embeddings and the model is trained to detect the underlying relation. The drawback is that focusing on knowledge base completion tends to over-adapt the models to this specific task, which comes at the cost of generalization.

#### *3.4.5 Enhancing Pre-trained Language Models by Retrieved Texts*

An *open domain question answering* system has the task of answering questions not restricted to a specific domain [27]. Consider the following example from the *TriviaQA benchmark* [99]. "Question: The Dodecanese Campaign of WWII that was an attempt by the Allied forces to capture islands in the Aegean Sea was the inspiration for which acclaimed 1961 commando film?" "Answer: The Guns of Navarone". It is not plausible that the model can reproduce such a specific response from the knowledge stored in its parameters, even if it was present in the data before training. Therefore, it would be desirable for the system to be able to gather additional evidence by a *retriever* collecting relevant documents from a large text repository. Subsequently, it has to align the retrieved information with the question and generate an answer by another PLM, a *reader*. New web search techniques can be used for this approach. They are based on comparing embeddings for words or passages consisting of several sentences. There are numerous applications such as question answering, summarization, and dialog systems. In Sect. 6.1 this is discussed in more detail. Recent surveys are provided by Zhu et al. [259] and Yu et al. [244].

**DPR** (Dense Passage Retriever) [103] employs a PLM to encode KB-passages *di*, e.g. from Wikipedia, as embeddings *emb(di)*. This can be achieved by finetuning a BERT model to encode passages by the embedding of the token [CLS]. These embeddings can be stored in an index for fast access. Then the DPR *retriever*  processes the query sequence *x* by another BERT model and generates the query embedding *emb(x)*. A number of *k* = 100 passages *dj* with maximal inner product *emb(x)emb(dj )* is retrieved by a *nearest-neighbor search*. Both BERT encoders can be trained together to generate appropriate embeddings using weak supervision in the form of question-answer pairs (cf. Sect. 6.1.5). If, for instance, the query is "Who is the bad guy in lord of the rings", the algorithm can retrieve "Sala Baker is best known for portraying the villain Sauron in the Lord of the Rings trilogy",

because "bad guy" and "villain" have similar embeddings. Therefore, DPR can find passages with similar meaning, expressed with different words. Karpukhin et al. [103], for instance, show that already with 1000 training examples the dense retriever is better than the classical keyword search. For 40k training examples the top-20 retrieved passages contain the correct answer in about 79% of the time, while this value is only 59% for the classical retrieval. An in-depth discussion is given in Sect. 6.1.5.

The DPR *reader* is another BERT model. Similar to BERT's text pair classification, it is fine-tuned to predict a probability for each retrieved passage that this passage contains the correct answer. In addition, it selects a span of tokens by span prediction, which probably provides the answer. In the example it selects "Sala Baker" as the answer. Together both components form a *retriever-reader architecture*, which recently became popular. The approach can be easily applied to KBs with billions of passages [103, 201]. On the *Natural Questions* [109] it yields a test set accuracy of 41.5%.

**DensePhrases** is a different system creating embeddings for phrases of up to 20 words in the KB, which are computed without knowing the query [114]. The processing of the retrieved phrases directly yields the answer without much computational effort. Using careful workflow optimization the authors achieve near-SOTA results with a much lower processing time than dense passage retrieval systems, e.g. a test set accuracy of 40.9% on Natural Questions.

**FiD** (Fusion in Decoder) [91] employs DPR as retriever. In the reader step it uses the special tokens "question:", "title:", and "context:". These tokens mark the question, the retrieved passage title and the passage text and are concatenated forming the input. Subsequently, these *k* retrieved triples are fed one-by-one into a transformer encoder like T5 [170] (770M parameters), which independently processes each triples by the encoder. Only in the decoder the passages are handled jointly and the text of the answer is generated. This approach drastically reduces the computational effort. The transformer is fine-tuned on a QA-task. The architecture of the model is shown in Fig. 3.18. Raffel et al. [170] provided evidence that

**Fig. 3.18** A retrieval enhanced language model [91] encodes the query and the KB passages as embeddings and uses a pre-trained retriever to find passages corresponding to the query. The reader is a Seq2seq model (T5) combining the query and the passages to generate the answer. This model setup is fine-tuned with different benchmark datasets

generative models like T5 are even competitive for QA-tasks such as SQuAD [173], where answers are spans in a given document.

The system achieves a test set exact match accuracy of 51.4% on the Natural Questions benchmark compared to 41.5% for DPR. The *TriviaQA* benchmark [99] contains a set of trivia questions with answers that were originally scraped from the Web. On this benchmark the model yields SOTA results with 80.1% exact match accuracy [211]. This is better than the accuracy of other much larger models, like GPT3 with 175B parameters (71.2% EM), or T5 without retrieval and 11B parameters (60.5% EM). It turns out that increasing the number of retrieved passages strongly enhances the answer quality.

There are a number of new approaches to augment PLMs with text from an external KB. In Sect. 6.1 we describe different PLMs for retrieval that can be used by web search engines. In Sect. 6.2 we investigate systems for question answering that often employ a PLM-based retrieval mechanism and an additional PLM to generate the answer text. It combines the query, the knowledge acquired during training, as well as the information in the retrieved documents.

In summary, combining language models with retrieval is currently the most efficient way to incorporate additional information into PLMs. The new information is focused on the current query and thus very informative. The retrieval model can access semantically related passages within fractions of a second using new approximate open-source nearest neighbor index structures. By relying on embeddings, synonyms and paraphrases can be found and the meaning of words can be disambiguated. In addition, the underlying knowledge bases can be updated on the fly to keep the information current.

#### *3.4.6 Summary*

The knowledge covered by the textual training data can be leveraged in various ways to improve the performance of PLMs. Entities and relations from a knowledge base can be represented by embeddings, e.g. by TransE. However, the utilization of these embeddings for PLMs is not very efficient and error-prone. A more promising alternative is the direct use of table content or knowledge base relations by specialized PLMs, which capture relationships between entities and table cells by specific self-attention patterns. Similar to Graph-CNNs PLMs have been directly used to acquire the relationship between the nodes of a graph by encoding the features of links by embeddings in a BERT-like model. Along this line a promising way to transfer relational knowledge from a graph to a language model is proposed by GraphFormers.

A very simple and efficient approach of incorporating tables and knowledge bases in PLMs is the creation of text that expresses the information content. This can be used by the PLM either as conditioning text or during training. However, the most promising way to include knowledge is *retrieval*, since most information is stored in the form of unstructured text on the Web or databases. Here, the retriever-reader architecture emerged as an effective way to collect relevant passages. Subsequently, the PLM generates new text by combining the internal knowledge, the start text, and the retrieved passages.

Much effort was devoted to the extension of the length of input sequences (Sect. 3.2). This was mainly achieved by sparse attention patterns reducing the increase in computational effort from quadratic to linear with S4 as a leading approach. Nevertheless, larger input sequences still have limited range of context both within the same sample and outside of it.

In contrast, retrieval can cover an indefinite context within the same sample by gathering appropriate passages, even if there is no simultaneous attention over the whole context. In addition, retrieval can access relevant information in huge document collections. Either the highly developed traditional keyword search engines may be used. Alternatively dense retrieval may be employed which compares embeddings of the query and passages using approximate nearest neighbor search over an index. It turns out that relatively small retrieval-based models outperform large Foundation Models like GPT-3. FiD, for example, achieves an exact match accuracy of 51.4% on the Natural Questions benchmark compared to 29.9% for GPT-3. Retrieval is extensively used by recent models such as WebGPT and Retro.

#### **3.5 Changing Model Size**

The size of a model, especially its number of parameters, has a marked influence on the performance of the model, its memory requirements and the computational resources required for training. In the first section we discuss that models with more parameters potentially have a better performance. This, however, requires a larger computational effort during training and model utilization. An alternative are mixture-of-experts models, which define a number of parallel model structures which selectively compute a solution. This is described in the second section.

As initial versions of successful models often are extremely large, a variety of model compression and acceleration techniques have been developed. They reduce memory requirements and training time without noticeable degradation of accuracy, and allow the models to be deployed on low resource computing devices, such as cell phones. There are three main techniques for model size reduction [65]—parameter compression and reduction, low-rank factorization, and knowledge distillation which are outlined in the subsequent sections.

#### *3.5.1 Larger Models Usually Have a better Performance*

As a rule for machine learning, the number of parameters of a model should be limited to avoid *overfitting*, i.e. adapting to random fluctuations in the data. It turned out that this does not hold for PLMs if the amount of training data and the number of model parameters are increased simultaneously. Larger PLMs have been shown to have better performance on NLP tasks, which is underscored by theoretical work on PLMs [19, p. 117]. The benefits of increasing the number of parameters come from two factors: additional computations at training and inference time, and increased memorization of the training data. Kaplan et al. [102] empirically investigated in detail the dependency between the number of model parameters *R* (excluding embeddings), the size *N* of the training data, and the amount of computing effort *C* used for training. They evaluated a large number of models and draw the following conclusions:


These findings show that the success of larger PLMs is a systematic feature. A larger number of model parameters is much more sample efficient than thought before, when overfitting was a major problem for smaller training tasks. This also explains the success of large models like T5, BigBird, or GPT-3. Hernandez et al. [80] investigate empirical scaling laws for the transfer from pre-training to finetuning. Figure 3.20 plots the training efforts of some Deep Learning models during the last two decades.

**Fig. 3.19** A series of language model training runs with varying model sizes [102]. The left graph shows that larger models require fewer samples to reach a fixed test loss. The right graph demonstrates that the model size should grow with compute budget. Image reprinted with kind permission of the authors [102, p. 4]

**Fig. 3.20** Number of parameters for Deep Learning Models since 2017 [188]. Note that the parameter scale is logarithmic. The number of parameters roughly increased from 100M up to 1000B

#### *3.5.2 Mixture-of-Experts Models*

As discussed above a model with more parameters usually can achieve a better performance. A simple way to increase the number of parameters without a higher training effort is a **mixture-of-experts** architecture. It was already proposed in the nineties by Nowlan et al. [147] and has a strong resemblance to decision tree models [152]. It consists of a single gating module and a number of expert modules with identical architecture but different parameters. Each expert specializes in only a subset of the data, and the gating module assigns each input to the appropriate experts. Specifically, the gating network computes a probability distribution over the experts indicating how well each expert is able to process the incoming input. A reduction in computational effort can be achieved, if only a few expert modules are actually used. The model is trained by stochastic gradient descent, which can compute the parameter gradient despite the discontinuities if some expert is exchanged. Increasing the number of experts keeps the computational cost constant because the model always selects the same small number of experts for each input, regardless of the total number of experts. The architecture enables massive models and is particularly efficient for distributed systems where the experts are spread across different computational devices.

Clark et al. [38] analyze the theoretical properties of such *routing networks*, where each input is processed only by subnetworks with a fraction of the network's parameters.The authors analyze three different architectures and get the following results.


The analysis is based on the evaluation of several magnitudes of size, including models with hundreds of experts and hundreds of billions of parameters.

**GLaM** [51] is an autoregressive *mixture-of-experts* (*MoE*) model with up to 1200B parameters. It replaces the fully connected layer of every second encoder block (Sect. 2.1.1) with 64 copies having different parameters. For each embedding, a gating module selects two of these 64 fully connected layer for processing. The architecture is shown in Fig. 3.21. The model was trained on a huge collection of 1.6T tokens documents and quality-checked web pages. It has approximately 7 times more parameters than GPT-3 but requires only 1/3 of its training effort. In this way, the model has many more parameters increasing its representational capacity. As for a given input token, only two expert models are used, the computational effort for training and application is lower. The zero-shot and one-shot performance is better than for GPT-3 on 29 NLP tasks. Some results are compared to those of other models in Tables 3.3 and 3.4. GLaM is remarkable as it requires only 1/3 of the training effort of GPT-3 but it achieves a similar or better performance than GPT-3 on NLP tasks.

**WuDao-2.0** [175, 178, 257] is a recent giant autoregressive language model with 1750B parameters, ten times larger than GPT-3. It has *mixture-of-experts* layers, where a gating network selects a submodule for processing based on the input. WuDao-2.0 uses the *FastMoE* library [74] and employs the GLM 2.0 architecture (Sect. 3.1.3) combining the different learning paradigms of BERT, GPT and the encoder-decoder transformer [175].

The training data consist of 1.2TB Chinese text, 2.5TB Chinese graphic data and 1.2TB English text data from the *Pile* corpus [61]. The *Cogview* model is used for

**Fig. 3.21** Architecture of GLaM [51]. For each input token, e.g., "likes", the gating module dynamically selects two most relevant experts out of 64 available experts. This is indicated by the blue grid. The weighted average of the outputs from these two experts' feedforward models is then passed to the next encoder block. For the other inputs different experts are selected. A mixture-of-experts layer is used in every second encoder block

the joint processing of images Sect. 7.2. In addition, WuDao-2.0 can learn on the fly, draw pictures and compose poetry. These capabilities are a significant difference to GPT-3.

The published performance claims are impressive. On the LAMA benchmark for measuring world knowledge [158] it scores higher than AutoPrompt [192]. For the *SuperGLUE* few-shot natural language understanding task [219] it achieves SOTA and surpasses GPT-3. For the Lambada benchmark (Sect. 4.1.3), where the last word of a paragraph has to be predicted, it yields better results than Microsoft Turing NLG. In addition, it increases SOTA for a number of text-graphics tasks (Sect. 7.2.8).

**Switch** [56] is a variant of the transformer encoder-decoder T5 (Sect. 3.1.3). It has a *mixture-of-experts* architecture, which replaces the fully connected layer of each encoder block with *k* = 128 copies having different parameters. There is a simple linear gating network, which selects one of the 128 single fully connected layers (the experts) per token. Hence, the number of parameters is drastically increased with approximately constant computational effort. For this architecture a gradient can be computed and the model may be optimized using a number of specific strategies and a special TensorFlow version. It turns out that Switch achieves the same loss level compared to the standard T5 version with 1/7 of the computing time. On a number of fine-tuning tasks the large Switch model with 1600B parameters and 2048 experts yields better results than T5-large (Sect. 3.1.3) with 13B parameters requiring a quarter of the computational training effort.

As an alternative to the gating network in the mixtures-of-experts architecture, it is possible to use hash values to activate different parts of the network. **Token Switch** [177] computes a hash value for each input token and routes the generated embeddings of each token to different feedforward networks based on the hash values. The authors show that their approach compares favorable to Switch and works well on comprehensive language modeling tasks.

**ST-MoE-32B** [261] is a mixture-of-experts model with 269B parameters and a comparable training cost of a 32B dense model. The authors modify the routing algorithm which dispatches token embeddings to one or two experts, and resolve instability issues. The model is similar to a T5-Large encoder-decoder [170]. The ST-MoE-32B has 32 experts with an expert layer frequency of 1/4, such that every fourth feedforward layer of T5 is replaced by an MoE layer. The authors use the *GEGLU* activation function, which contains multiplicative elements [142]

$$FFN\_{GEGLU}(\mathbf{x}, W, V, \mathbf{b}, \mathbf{c}) = GELU(\mathbf{x}W + \mathbf{b}) \odot (\mathbf{x}V + \mathbf{c}).\tag{3.2}$$

The authors compare a large number of variants and hyperparameters to improve training.

The model achieves SOTA in many transfer learning benchmarks, e.g. for SuperGLUE with an average accuracy of 93.2% beating the PaLM LM with 540B parameters. Other SOTA results were reached for summarization (XSum [143] with 27.1 ROUGE-2, CNN/Daily Mail [78] with 21.7 ROUGE-2), closed book question answering (WebQA [13] 47.4% exact match, Natural Questions [109] 41.9% exact match), and adversarially constructed tasks for common sense reasoning (Winogrande [182] 96.6%, ANLI R3 [146] 74.4%).

#### *3.5.3 Parameter Compression and Reduction*

*Model quantization* is a parameter reduction technique, where parameters are stored in low precision and therefore the computations in PLMs are also less precise. Conventional models normally use parameters of 32 bits or 16 bits, while parameters after quantization can have 8 bits or even 1 or 2 bits. **Q-BERT** [190], for example, quantizes Transformer models to ultra-low precision. This reduces the model size 13-fold while only loosing 2.3% performance. The authors avoid the naive approach of simply reducing weight precision, but use additional training steps to adjust the quantized weights and allow higher precision for more "sensitive" parameters. Other authors propose to delete parameters with small values [64]. ALBERT [113] uses the same weights across all layers and achieves a significant parameter reduction. Nevertheless, ALBERT has the same or better performance compared to BERT.

Another approach aims to reduce the number of parameters, e.g. by removing attention heads. It was shown that most attention heads focus only on nearly identical positional relations and can be replaced with fixed attention patterns [172]. It turned out that high performance is possible with only 1–2 attention heads per encoder unit instead of the 16 attention heads of the original model. A detailed overview on parameter compression techniques is provided by Ganesh et al. [60].

Another method to reduce model parameters is model pruning, which cuts off irrelevant parts in PLMs to achieve a smaller memory footprint and faster execution without compromising performance. It could be shown, for example that some attention heads of the transformer may be removed with little impact on the accuracy [256]. Other researchers prune the weights of attention layers and linear layers to reduce the number of parameters without reducing the accuracy [29, 64]. Note that model pruning does not always lead to speedups, as sparse computations may be hard to parallelize on GPUs.

#### *3.5.4 Low-Rank Factorization*

This technique employs matrix and tensor decomposition to reduce the number of parameters of full rank parameter matrices and already has been discussed in Sect. 3.2.2 for the extension of the input sequence length. Examples are the Performer [34] and the Linear Transformer [105] (Sect. 3.2.2). As an alternative, ALBERT (Sect. 3.1.1) approximates the embedding matrix as a product of two smaller matrices.

#### *3.5.5 Knowledge Distillation*

In machine learning the knowledge distillation approach [82] transfers knowledge from a large *teacher model* to a smaller *student model*. The large model can often be trained successfully to approximate a functional relation without using its full representational capacity. To reduce the high computational and memory requirements during application, a smaller model is trained to imitate the large model without sacrificing accuracy.

The advantage of this approach is that the student model may be trained to approximate *internal activations* of the teacher model. Often the target probabilities generated by the teacher model are used to train the student network . Typically the outputs of the teacher model for an input *x* is *z(x)*, which can be translated to a probability by a scaled softmax

$$\mathbf{y}(\mathbf{x}|\tau) = \frac{[\exp(z\_1(\mathbf{x})/\tau), \dots, \exp(z\_k(\mathbf{x}))/\tau]}{\exp(z\_1(\mathbf{x})/\tau) + \dots + \exp(z\_k(\mathbf{x})/\tau)},\tag{3.3}$$

where *y(x*|*τ )* is a probability vector and *τ* is a parameter called *temperature*, which for a standard softmax is normally set to 1.0. The student model is trained to imitate the probabilities *y*ˆ*(x*|*τ )* generated by the teacher model by minimizing *cross entropy* 

$$E(\mathbf{y}|\tau) = -\sum\_{j=1}^{k} \hat{\mathbf{y}}\_{j}(\mathbf{x}|\tau) \log \mathbf{y}\_{j}(\mathbf{x}|\tau),\tag{3.4}$$

where *y(x*|*τ )* is the output probability vector of the student model. If observed values are available the probabilities of the teacher model *yj (x*|*τ )* may be replaced by 1.0 for the observed class and 0.0 otherwise. During training the temperature may be varied. A high temperature avoids extreme probability values and reduces the gradients. This may lead to a faster convergence in the beginning of the optimization.

**DistilBERT** [183] uses MLM cross-entropy loss to predict token probabilities and in addition the cosine similarity between the embedding matrices of the teacher and student networks to train a smaller BERT model. It utilizes knowledge distillation during pre-training to reduce the size of BERT by 40% while retaining 99% of its original capabilities and making the inference 60% faster. *MobileBERT* [204] is based on a specific large BERT model and transfers information about multi-headattention as well as the resulting embeddings. Experiments show that MobileBERT is 4.3× smaller and 5.5× faster than BERT while achieving competitive results on well-known benchmarks.

**TinyBERT** [97] proposes distillation of a BERT model during pre-training and fine-tuning. The model is adapted to: (1) the output of the embedding of selected layers; (2) the hidden states and attention matrices derived from selected Transformer layers; (3) the logit outputs of the prediction layer. As distillation is also performed during fine-tuning the model can be better adapted to the finetuned BERT. On a number of benchmarks TinyBERT is on par with BERTBASE and outperforms DistilBERT.

Note that the knowledge distillation methods discussed above require the data used for pre-training the teacher model, which is often not released because of data copyright. It has not yet been evaluated whether distillation is also feasible with new data. The training time for knowledge distillation is high, because the teacher model needs to perform a forward prediction over the entire pre-training data to generate activation values or intermediate representations.

Rogers et al. [176] list a large number of size reduction studies for BERT and report parameter size and computing time reduction as well as the resulting performance. For a number of approaches there is a marked reduction in memory and computing effort with nearly identical performance.

#### *3.5.6 Summary*

The number of model parameters, the size of the training data and the amount of computation effort for training are the determining factors for the performance of a model. Kaplan et al. [102] show by experiments that increasing parameter count and training set size reliably lead to a better performance and provide a detailed formula for the dependency. If a fixed compute budget is available, one should use a very large model and much data.

Mixtures-of-experts follow this approach by increasing the number of parameters without requiring more computational effort. By routing inputs to specific subnetworks they are able to increase performance compared to monolithic networks. Examples are GLaM, WuDao-2.0, and Switch. However, these networks have hundreds of billions of parameters and require a specific parallel computational infrastructure.

Often the trained networks are too large and have to be reduced to fit to smaller computing devices. A viable approach is low-precision computation, which reduces memory requirements for parameter storing. Low-Rank factorization of matrices also has a lower memory footprint as a side effect. Finally, knowledge distillation may be employed to create a student model which imitates the inner working of a large trained teacher network. DistilBERT, for example, was able to reduce the memory size by 40%, kept 99% of the original performance and was 60% faster. There are a number of other size reduction approaches with similar results.

#### **3.6 Fine-Tuning for Specific Applications**

Self-supervised pre-training of language models on large text collections and subsequent fine-tuning them to solve specific tasks has become the standard paradigm in natural language processing and understanding. It has been shown that pre-trained language models such as BERT are excellent for generalization and can easily be fine-tuned to multiple tasks. However, sometimes simple fine-tuning to a domainspecific task is not sufficient, and other transfer learning approaches have to be used to better adapt models to domain-shift in the data [166]. There are a number of surveys covering transfer learning in depth [230, 252, 260]

Fine-tuning updates all the model layers, including the embedding layer, but there are larger changes in the higher layers [133]. First, we discuss whether fine-tuning can destroy the knowledge gained during pre-training. *Standard fine-tuning* adapts a large pre-trained PLM with many parameters to a relatively small fine-tuning training data set with little computational effort. We investigate whether *overfitting*  occurs during this phase. Subsequent sections introduce different approaches for fine-tuning:


#### *3.6.1 Properties of Fine-Tuning*

Fine-tuning of PLMs is commonly employed to adapt a pre-trained model to a specific task by supervised training. This adaption of the model from a source task to a related target task is also called *transfer learning*. Transfer learning is especially rewarding if we have abundant training data for self-supervised learning—as it is typical for non-annotated text—and only little annotated data for the target task. A survey of transfer learning is provided by Zhuang et al. [260]. Fine-tuning has a number of advantages:


Autoencoder models like BERT are typically fine-tuned for classification tasks, where the logistic classifiers for masked language modeling and next sentence prediction have to be removed. Using the [CLS] token or other tokens as input, new logistic classifier models as well as all model parameters are trained end-to-end with the new task for a few epochs (Sect. 2.1.3). Compared to pre-training, finetuning is relatively inexpensive. Usually, only a small fraction of the pre-training effort is required to achieve good results.

Tripuraneni et al. [210] have theoretically proven that transfer learning requires far less data than learn tasks in isolation. They prove that transfer learning improves if the task diversity is enhanced. Bansal et al. [7] investigate the theoretical properties of fine-tuning a classifier using pre-trained embeddings. The authors prove that these classifiers have a smaller generalization gap between their train and test accuracy, than standard classifiers.

#### **Catastrophic Forgetting**

The question is whether fine-tuning can destroy the original capabilities of the model. This means, after fine-tuning a pre-trained model for a few epochs, it could lose predictive performance available after pre-training. A possible reason can be *catastrophic forgetting*, where all parameters are adapted to a new learning task while forgetting learned content.

Merchant et al. [133] fine-tune BERTBASE with three different tasks: (1) MNLI sentence pair classification task [229] measuring if the first sentence entails the second; (2) SQuAD question answering [173], where the answer to a question has to be marked in a text; (3) Dependency Parsing [50] to capture the syntactic structure of sentences. Then they investigate the performance of a number of probing classifiers before and after fine-tuning. The results demonstrate that the fine-tuned models only show a small decrease in the accuracy to detect linguistic concepts. The reduction cause by the MNLI task in most cases is less than 1%, while higher differences (less than 3%) are observed for SQuAD and dependency parsing. Therefore, catastrophic forgetting cannot be observed. The authors state that fine-tuning primarily changes the top layers of BERT, with dependency parsing also affecting deeper layers. More detailed results are provided by Wallat et al. [216].

Fine-tuning only benefits from the pre-training, if there are similarities between the two tasks. Hence, pre-training should have a loss function which enforces the learning of semantics at word, phrase and document level. In addition, its training documents should originate from a domain close to the fine-tuning task. Otherwise the vocabulary may not include many domain-specific words. As a result, domainspecific words are split into a number of tokens which hinders model learning and degrades its performance in downstream tasks. In the next sections we will discuss alternative training regimes which improve BERT's capabilities.

#### **Fine-Tuning and Overfitting**

During pre-training BERT's parameters are adapted to the pre-training data, acquiring universal language representations. As pre-training provides a good initialization, it avoids overfitting on the small fine-tuning datasets, if the fine-tuning error is not minimized too much.

Since PLMs have a very large number of parameters, there is the risk of overfitting on the fine-tuning data. As a result, generalization from unseen data can be poor and counterstrategies may be required. D'Amour [42] present a comprehensive discussion of this *underspecification* phenomenon. Jiang et al. [95] introduces a form of regularization, which makes the model invariant to small perturbations of the input, inducing smoothness in the local neighborhood. They develop a class of Bregman proximal point optimization methods, which penalize large updates of the model at each iteration. Aghajanyan et al. [2] introduce the notion of representational collapse, stating that fine-tuned models lose their ability to generalize. They propose fine-tuning optimization based on trust-region theory, which alleviates representational collapse at a fraction of the cost of other recently proposed fine-tuning methods and, for instance, improves the best known results on fine-tuning RoBERTa on GLUE.

Fine-tuning the same model with multiple random seeds can lead to large variance in task performance. Most papers argue that this effect is caused by *catastrophic forgetting* and the small size of the fine-tuning datasets. However, Mosbach et al. [140] show that often fine-tuning has an optimization problem due to vanishing gradients. In addition, it can often occur that a model does not generalize well, although it has the same fine-tuning loss as a successful model. This is an indication for the underspecification mention above. The authors recommend to use small learning rates with bias correction to avoid vanishing gradients early in training. In addition, they propose to use more iterations for fine-tuning. More recipes to improve fine-tuning are provided by Rogers et al. [176].

#### *3.6.2 Fine-Tuning Variants*

#### **Fine-Tuning in Two Stages**

The intermediate training set should be closer to the final task. Although this approach can increase performance in some cases, an experimental evaluation demonstrates a decrease in performance in 44% of the cases [163]. An intermediate training with a task requiring high-level inference and reasoning abilities tend to work best, as was shown in a large experiment [165]. However, the authors also observe catastrophic forgetting of the pre-trained abilities. Gururangan et al. [71] have shown that a second phase of pre-training, using domain-specific data, leads to significant performance gains, both in high- and low-resource settings. In addition, pre-training on tasks-specific unlabeled data improves performance on various tasks and domains.

#### **Fine-Tuning for Multiple Tasks**

For each task, a task-specific layer is added to the underlying pre-trained model. Then the model is simultaneously trained with all tasks. However, it sometimes happens that performance does not increase compared to standard fine-tuning [141], perhaps because of contradicting requirements of tasks. As an alternative, a subset of fine-tuning tasks from the available datasets may be selected based on similarity measures [131].

**HyperGrid** [208] is a multitask learning approach evaluated on the T5 model. It learns grid-wise projections that help to specialize regions in weight matrices for different tasks. As an example, a single model is simultaneously adapted to all GLUE and SuperGLUE tasks at once. In spite of the multitude of tasks, the model has a slightly better performance on SuperGLUE than the single models.

#### **Meta-Learning to Accelerate Fine-Tuning**

During fine-tuning a pre-trained PLM is adapted to a new NLP task. It is usually trained for two or three epochs on a labeled fine-tuning dataset. Although this is much faster than pre-training the model on a large training corpus it still requires a lot of effort. To reduce this effort researchers tried to prepare the pre-trained model to fine-tuning by *meta-learning*. A survey of meta-learning is provided by Yin [242].

Usually, there is a set T of related fine-tuning tasks *Ti*. During meta-training a task *Ti* is sampled from a distribution *p(*T*)*. Then the model is trained with *K* training samples from *T* train *<sup>i</sup>* and then tested on the validation set of *<sup>T</sup>* val *<sup>i</sup>* . The validation error of *Ti* is utilized as the training error of the meta-learning framework for the current iteration. The **MAML** algorithm [58] follows this pattern:


This scheme was applied to BERT [6]. The authors generate a large, rich, metalearning task distribution from unlabeled text by gathering tokens-to-be masked from a few vocabulary terms. On 17 NLP tasks, they show that this type of metatraining leads to better few-shot generalization than language-model pre-training followed by fine-tuning. Chen et al. [28] provide data-dependent generalization bounds for these approaches.

#### **Fine-Tuning a Frozen Model by Adapters**

A downside of fine-tuning for task-adoption is that new model parameters are needed for every task. *Task adapters* [84] aim to mitigate this problem. The authors introduce adapter layers, which are inserted in a encoder block after the multi-head attention and the feedforward layer (2.7). Now, to fine-tune transformer models to new tasks, instead of relearning all parameters, all weights of the network are frozen except for the adapter layers and the normalization layers. On tasks like GLUE this yields a significant reduction of parameters that need to be trained while preserving model quality.

Rather than having multiple adapters for different tasks, Stickland et al. [197] propose training a multitasking version of BERT that can be used for several tasks simultaneously. They add low-dimensional projected attention layers as bypass to BERT encoder blocks, which connect the input to layer-norm layers and the subsequent layer-norm layers. They sample data from the different tasks during training proportionally to the sizes of the respective training sets and use an annealing mechanism to converge towards equally distributed training samples by the end of the training. Their results surpass the results of a BERTBASE model.

**MAD-X** [160] is a framework to adapt multilingual models to arbitrary languages and tasks. The authors introduce language- and task-specific adapters, which consist of a linear down-projection to a small vector, a ReLU activation and a linear up-projection. The language specific adapters are trained with an MLM objective, while the rest of the model is frozen. The task-specific adapters are trained with the task-specific data, fixing the rest of the parameters. Finally, invertible adapters are added after the input embedding layer and before the output embedding layer to mitigate differences between the multilingual vocabulary and the target language vocabulary. MAD-X achieves SOTA for NER and common sense reasoning for a set of different languages.

**LoRA** [85] freezes the weights of the pre-trained model and adds trainable bypasses to the model, which consist of trainable matrix transformations to a short vector and to the full rank. This drastically reduces the number of trainable parameters (1/30 for GPT-3 and 1/100 for GPT-2) while achieving better results than with traditional fine-tuning on many NLP tasks. *AdapterHub* [161] is a repository for adapters that as of writing contains around 380 adapters. AdapterHub is built on the Hugging Face transformer library for compatibility with existing transformer models.

#### **Fine-Tuning GPT-3**

GPT-3 is an extremely powerful Foundation Model, but it is not publicly available (Sect. 3.1.2). By using the API for fine-tuning GPT-3 with user-specific data [123], the model can be adapted to specific domain languages and particular tasks. This typically yields a higher quality than few-shot examples and prompt design described below. To fine-tune the 175B parameter model on a 1M token file for four epochs OpenAI charges about \$120. The fine-tuning can be used in a number of ways [123]:


It can be assumed that GPT-3 and other Foundation Models like PaLM fine-tuned in this way will increase SOTA in many areas due to their comprehensive knowledge about language.

#### *3.6.3 Creating Few-Shot Prompts*

For *zero-shot learning* the model just gets a task description or *prompt*, e.g. "Translate English to French: cheese =*>*", and directly generates the answer

**Fig. 3.22** The accuracy of few-shot learning of GPT-3 is increased by extending the model size as well as the number of presented examples [25]. The task is to remove random symbols from a word. A natural language description of the task can support the model especially in the one-shot regime. Image reprinted with kind permission of the authors [25, p. 4]

"fromage". For *one-shot* or *few-shot learning* the model receives a task description as well as one or more examples, e.g. "Translate English to French: sea otter =*>* loutre de mer; cheese =*>*", which helps the model to find the answer "fromage". This happens without training, the parameters of the model are not changed, and the model creates the answer based on the knowledge acquired during pre-training.

In this way, GPT-3 can be instructed by natural language prompts to generate short stories, songs, answers to questions, press releases, technical manuals, and more [181]. It can adapt its output texts to specific styles, personalities or ideologies. Here are some of the recommended prompts used for few-shot learning [150]:


Figure 3.22 shows the accuracy of "few-shot learning" for different GPT-3 model sizes and different numbers of given examples.

In a comprehensive survey Liu et al. [125] compile approaches to prompt design to create prompts for language models that reliably generate the desired response. For example, when we want to recognize the sentiment of the text "I missed the bus today.", we may insert the prompt "I felt so ", and use the language model to replace the blank. There are two types of prompts: *cloze prompts* [159], which fill in the blanks of a textual string by an autoencoder model similar to BERT, and *prefix prompts* [117], which continue a text by an autoregressive language model.

For prompt mining [96], for instance, a large number of sentences with phrases *x* and *y* are collected. Subsequently, prompts are generated using the words between *x* and *y*, or on the dependency path generated by parser. Another approach is based on paraphrasing existing prompts, for instance by translation to another language and back-translation. The probability of desired answers may be increased by gradient-based search [192] as demonstrated with the *AutoPrompt* model. Alternative approaches are described in [62, 245]. It should be noted, however, that the output of a model instructed with few-shot prompts can be easily altered if an adversary adds some new prompts [79].

Instead of improving prompt tokens, which generate a desired output by the language model, one can optimize the input embeddings of some "virtual" tokens, such that the desired answer is created. The embeddings of this "continuous" prompt can be optimized by gradient descent while keeping the parameters of the language model fixed [121]. Lester et al. [117] apply this approach with a continuous prompt sequence of 100 tokens to the T5 transformer. On the *SuperGLUE* benchmark they achieve the same performance of 90.5% as for fine-tuning T5. This demonstrates that prompt tuning becomes competitive with fine-tuning and is much better than few-shot instructions. Note that the effort for prompt tuning is much lower than for fine-tuning, as the number of parameters is much smaller. It would be interesting to see this technique applied to recent autoregressive models like GPT-3 or PaLM.

#### *3.6.4 Thought Chains for Few-Shot Learning of Reasoning*

To improve the reasoning capabilities of language models, prompts can contain a *chain of thought*, a sequence of short sentences that imitate the reasoning process a person might have when answering a question [226]. Two examples are shown in Fig. 2.21. The idea is that a chain of thought allows language models to split a multistep problem into intermediate steps that are solved one at a time, rather than solving an entire multistep problem in a single pass.

The approach has a number of advantages. First, the chain-of-thought approach enables a model to decompose complex reasoning tasks into simpler intermediate steps, which can be solved by the model. To solve an entire class of problems, only a few chains of thought need to be provided. Second, when a model performs the intermediate steps, it is easier to check where the model has introduced an error. This may give a clue how to improve the chain of thought. Chain of thought reasoning can be applied to symbolic manipulation, common sense reasoning and math tasks, and is potentially applicable to any task that humans can solve via language.

Prompts also do not need to be restricted to input-output pairs or explanations and can cover many arguments, including things to avoid, rules of thumb, reasoning chains, positive or negative examples. Mishra et al. [138] consider instructions for crowdworkers, which contain very detailed prescriptions how to solve a task. They compile a dataset of tasks, instructions and generated input-output pairs. Subsequently, they investigate how well models are able to generalize to similar tasks. The results show that PLMs benefit from instructions when evaluated in terms of generalization to unseen tasks (19% improvement). However, there is much room for improvement.

Du et al. [52] investigate few-shot learning theoretically. They investigate the case that a model is pre-trained on a number of tasks with a large training set and subsequently fine-tuned on a related task. They theoretically derive bounds on the required sample size for the fine-tuning task, which can be reduced when there is a good common representation.

#### *3.6.5 Fine-Tuning Models to Execute Instructions*

Instead of querying autoregressive PLMs by few-shot instructions it is possible to fine-tune these models to execute instructions without additional examples.

**InstructGPT** [151] is a new version of GPT-3. It is optimized to follow instructions instead of predicting the probable next words. Instead of needing a series of examples, GPT-3 now directly executes an instruction, e.g. "Write a short story about the moon and the stars:", and the model generates a plausible story. In a first trial a dataset of 13k pairs of instructions and completions was collected to adapt GPT-3. GPT-3 was fine-tuned using this data. However, the model did not adequately match the intended human preferences. Therefore, the model was modified using a different training approach.

To adjust GPT-3 a *reinforcement learning* approach with human feedback was used. The *proximal policy optimization* (PPO) [186] follows the policy gradient pattern. It approximates the conditional distribution *π(at*|*st*; *w)* of actions *at* ∈ A at step *t* conditional to the current observation *st* ∈ S about the state of the environment and a vector *w* of parameters. In usual reinforcement learning, the environment generates a reward and the algorithm tries to maximize the weighted sum of rewards. The gradient for this optimization (policy gradient) can be easily computed from the model. PPO computes an update at each step that minimizes the cost function while ensuring the deviation from the previous policy is relatively small [186].

The algorithm needs a numeric score to measure the quality of each generated sequence. To reduce the data necessary for optimization, a human can express preferences [198] between trajectories *τ* = *(y, x)* for pairs of instructions *x* and generated text *y*. Informally, the goal is to produce trajectories which are preferred by the human, while querying the human as little as possible. To achieve this goal, a reward function *r(y, <sup>x</sup>)* <sup>∈</sup> <sup>R</sup> is postulated [36] with the property that *(y*[1] *, x*[1] *)* is preferred to *(y*[2] *, x*[2] *)* if *r(y*[1] *, x*[1] *) > r(y*[2] *, x*[2] *)*. The original policy *π(at*|*st*; *w)* induces a conditional distribution *π(y*|*x*; *w)*. To construct this,


**Fig. 3.23** InstructGPT is trained in three steps [151, p. 3]. First GPT-3 is fine-tuned on instructions and the corresponding completions. Then a reward model is generated by optimizing the selection of a completion for an instruction. Finally, a policy is trained to generate token by token of the answer with maximal reward. Credits for image parts in Table A.1

the reward function *r(y, x)* is approximated by a deep neural network *r(*ˆ *y, x*; *u)* with parameter *u*. The network is trained by three alternating steps (Fig. 3.23):


For a set of 33k instructions, a *reward model r(*ˆ *y, x*; *u)* was built with 6B parameters, where *x* is the instruction and *y* a completion [198]. It selects the best completion from a small set of proposed completions. Proximal policy optimization (PPO) was used as reinforcement model [151, p. 41]. To avoid catastrophic forgetting (Sect. 3.6.1), pre-training samples were mixed into fine-tuning.

The reward model was then applied to create a final model by another reinforcement learning step. During this process, InstructGPT generates a completion for an instruction. The reward model calculates a reward and the policy is updated to approximate the preferences encoded in the reward model. By mimicking human utterances, the model implicitly learns human intentions and preferences. This process is called *alignment to human preferences* and is extensively discussed by Askell et al. [5].

#### **InstructGPT Results**

The GPT-3 model with 175B parameters fined-tuned in a supervised way to the 13k instruction-completion examples was taken as the base model called SFT. The final completions were again scored by human raters [151]. The InstructGPT completions were preferred to the standard GPT-3 output in 85% of cases and to few-shot-GPT-3 in 71% of cases.

Specifically, raters found that InstructGPT attempts to follow the correct instruction in 92% of cases, compared to 85% for SFT and 75% for few-shot GPT-3 [151, p. 53]. In addition, InstructGPT follows explicit constraints in 50% of the cases, compared to 43% for SFT and 34% for SFT and 28% for few-shot GPT-3. Hallucinations were observed for 20% of the cases for InstructGPT compared to 16% for SFT and 50% for few-shot GPT-3. Finally, the raters found that the language use is appropriate for a customer assistant in 92% of the cases for InstructGPT, about 90% for SFT and about 85% for GPT-3 few-shot. InstructGPT was also evaluated on a few natural language benchmarks where it achieved very similar results to GPT-3 [151, p. 56].

It turned out that InstructGPT is able to generalize to unseen labeler preferences. Thus, InstructGPT does not simply adapt to the preferences of a few training labelers. In addition, InstructGPT produces slightly less toxic language than standard GPT-3. However, InstructGPT still makes simple mistakes, e.g., given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true. Note that the results depend on the subjective preferences of the labelers.

Comparisons between alternatives are not necessarily the most effective approach to generate an improvement signal. For example, one could ask labelers to edit model responses to make them better, or generate critiques of model responses in natural language. There is also a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting humancomputer interaction problem. The authors note that the cost of aligning GPT-3 to human preferences described above is just 1.6% of the cost spent to train GPT-3. Therefore, it seems to make sense to put more effort into alignment than into the mere enlargement of the models.

The results show that the InstructGPT techniques potentially make language models more helpful, truthful, and harmless. In a way InstructGPT works like an intelligent assistant for speech generation and information provision. However, the model is currently not fit for use in safety-critical applications, because failures cannot be ruled out. What is still missing is a comprehensive evaluation similar to Gopher or PaLM (Sect. 3.1.2) that shows the real utility of this approach. It can be expected that the combination of this approach with retrieval techniques as used for WebGPT (Sect. 6.2.3) and Retro (Sect. 6.2.3) will increase the performance, reliability, and correctness of InstructGPT.

**Fig. 3.24** FLAN instruction tuning fine-tunes a pre-trained language models on a set of tasks with instructions of ten different templates (left). The trained model can be applied to unseen tasks by formulating prompts according to these templates (right). Image adapted from [227, p. 1] with kind permission of the authors

#### **Instruction Tuning with FLAN**

**FLAN** [227] uses instruction tuning to improve the ability of the language model to respond to natural language prompts. The language model has to learn through supervision to perform tasks described by prompts, and to follow instructions, even for unfamiliar tasks (Fig. 3.24). The authors group 62 publicly available NLP datasets into twelve task clusters, e.g. "sentiment" "natural language inference", "summarization", etc. For each of the datasets they compose ten templates describing the task in natural language. Then an existing language model is fine-tuned to provide better answers to the prompts.

The approach was applied to a LaMDA-PT language model with 137B parameters using retrieval and filters (Sect. 6.6.3). For 18 NLI tasks the FLAN model was compared to LaMDA-PT 137B, GPT-3 175B, and GLaM 64B. In 14 of 18 cases FLAN substantially improved the performance of its unmodified counterpart and achieved better results than the competitors, while in 4 cases it was surpassed by GLaM [227]. FLAN even outperforms few-shot GPT-3 by a large margin on a number of tasks.

#### *3.6.6 Generating Labeled Data by Foundation Models*

The performance of GPT-3 and other Foundation Models in few-shot learning enables the generation of new high-quality training data for other models. By *Unsupervised Data Generation* (*UDG*) the creation of fine-tuning data for models of downstream tasks is possible that would otherwise be produced by manual human annotation. This approach is similar to Sect. 4.2.3.


**Fig. 3.25** New data can be generated by GPT-3 and other Foundation Models using the few-shot UDG strategy. Here the prompts for two examples, Amazon reviews and Copa common sense reasoning, and the generated answers are shown [225]

The idea for data generation is to utilize the language model to learn the inputlabel relation based on the task description and a few sample input-label pairs [225]. Instead of generating and predicting a label for a classification task the language model has to create the input text using the output class and a task description as input. For a classification task like product reviews on Amazon, the approach is able to produce 10k new examples for each class, covering a much larger spectrum as the currently available labeled data. It turns out that up to 32 few-shot examples still increase the quality of the generated training data. Examples are shown in Fig. 3.25. The authors use an additional module to filter out noisy examples. In this approach, a given training example is removed if the trained classifier does not match its label with high probability.

The T5-XXL encoder-decoder model fine-tuned on SuperGLUE data enhanced with UDG data is able to improve the overall accuracy on the SuperGLUE task for natural language understanding to 90.4% and is even able to beat DeBERTa with 90.3%. Moreover, the approach achieves very high performance scores on a list of text classification and sentiment analysis tasks [225].

#### *3.6.7 Summary*

When pre-training Foundation Models on a big text collection and subsequent supervised fine-tuning on a small labeled dataset, PLMs achieved unprecedented performance on many NLP tasks. Fine-tuning has been shown to change model parameters only slightly and, in general, no catastrophic forgetting occurs. Usually, no overfitting is observed if fine-tuning is stopped after a few epochs. If necessary, there are some approaches to avoid overfitting.

Fine-tuning can be performed in different ways. It has been suggested to use an intermediate fine-tuning with a more related dataset before the final fine-tuning on the small dataset takes place. The results of such approaches have been mixed. Also, simultaneous fine-tuning to several tasks is possible. In some cases, it could improve performance. As an alternative, there are strategies to accelerate fine-tuning by meta-learning. To avoid that the full model is changed adapter layers can be defined, and only their parameters are adapted. This can drastically reduce the number of trainable parameters and nevertheless lead to good performance on the fine-tuning tasks. Finally, fine-tuning APIs have been recently provided for proprietary models like GPT-3.

Foundation Models like GPT-3 and PaLM can be instructed by prompts to solve specific tasks without training. A large number of different prompts has been collected to order the model to complete a task. InstructGPT is a new version of GPT-3 that directly takes instructions and provides the answers for a large spectrum of tasks. The model was customized to carry out the instructions by adapting to user judgments through reinforcement learning. Instruction tuning is a variant, where a Foundation Model is fine-tuned to provide improved answers to instructions for a number of tasks. It turns out that afterwards the model generates better answers even for unseen tasks.

Finally, big language models may be employed to generate high-quality training data for fine-tuning. Again, the few-shot learning technique is used to generate input texts for specific learning tasks. In this way, the scarce training data can be expanded and better fine-tuning results can be achieved.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 4 Knowledge Acquired by Foundation Models**

**Abstract** During pre-training, a Foundation Model is trained on an extensive collection of documents and learns the distribution of words in correct and fluent language. In this chapter, we investigate the knowledge acquired by PLMs and the larger Foundation Models. We first discuss the application of Foundation Models to specific benchmarks to test knowledge in a large number of areas and examine if the models are able to derive correct conclusions from the content. Another group of tests assesses Foundation Models by completing text and by applying specific probing classifiers that consider syntactic knowledge, semantic knowledge, and logical reasoning separately. Finally, we investigate if the benchmarks are reliable and reproducible, i.e. whether they actually test the targeted properties and yield the same performance values when repeated by other researchers.

**Keywords** Knowledge in foundation models · Common Sense knowledge · Logical coherence · Benchmark collections · Reproducibility

During pre-training, Pre-trained Language Models (PLMs) and the larger Foundation Models are trained on an extensive collection of documents and learn the distribution of words in correct and fluent language. During fine-tuning, the models are adapted to a specific task using the knowledge from the pre-training and requiring only a small set of manually labeled fine-tuning data. In this chapter, we investigate the knowledge acquired by these models by different types of tests:


demonstrate the achievements and deficits in different areas and for different model architectures.

• Finally, we investigate if the benchmarks are reliable, i.e. actually test the targeted properties (Sect. 4.3). Moreover, we analyze if published benchmark results are reproducible and yield the same performance values if they are repeated by other researchers.

#### **4.1 Benchmark Collections**

In order to arrive at quantitative measures of common sense knowledge and commonsense reasoning, the community has compiled a number of benchmarks. These allow a standardized comparison of different aspects of natural language understanding and provide comparable scores for the strength and weaknesses of different PLMs. Benchmarks have been a key driver for the development of language models. A comprehensive collection of benchmarks and the corresponding leaderboards are provided by PapersWithCode [45]. A survey of actual benchmarks is given by Storks et al. [62].

A fair comparison of model architectures requires that the number of parameters, the size of the training data, and the computing effort for training are similar. This has been extensively discussed in Sect. 3.5.1. Therefore, many authors conduct extensive ablation studies to adjust their training resources to a standard, e.g. to BERT as a "benchmark model". This is really important, as it helps the reader to get an intuition for the impact of pre-training resources. Nevertheless, comparability is often hampered by two problems:


Therefore, statements like "Model architecture A is superior to model architecture B on performing task X." in general are not valid, but have to be qualified [2], e.g. "Model architecture *A* is superior to model architecture *B* on performing task *X*, when pre-trained on a small/large corpus of low/high quality data from domain *Y* with computing effort *Z*."

#### *4.1.1 The GLUE Benchmark Collection*

To test the ability of PLMs to capture the content of a document, the GLUE (Sect. 2.1.5) set of benchmarks has been developed. This is a collection of 9 benchmarks testing different aspects of *Natural Language Understanding* (*NLU*). The joint performance is measured by a single score, which has the value 87.1 for

human annotators. The tasks are described in detail by examples in Table 2.1. It turns out that variants of BERT fine-tuned to the different GLUE-tasks can yield better results than people. The results are determined for the large variants of the models and shown in Table 4.1.

In the past years GLUE was routinely employed to demonstrate the NLU capabilities of PLMs. Currently, the best average value of 91.4 after fine-tuning was reached by DeBERTaV3 [18] (Sect. 3.1.1). It uses separate embeddings for content and position and employs a corresponding disentangled attention mechanism. There are only three tasks where PLMs are worse than humans, but only by a small margin. Note that ensembles of several models often yield slightly better results. Nangia et al. [42] also measures the performance of human teams of 5 people. The numbers are not comparable as cases were excluded when the teams arrived at split judgment. Newer models such as PaLM use SuperGLUE instead of GLUE because GLUE is considered too simple.

#### *4.1.2 SuperGLUE: An Advanced Version of GLUE*

Due to the progress in the last years, PLMs have reached human performance in most tasks and the GLUE is no longer able to discriminate between models. Therefore, the authors of GLUE proposed a more demanding test suite called **SuperGLUE** [68] as an advanced version of GLUE with eight challenging tasks. The tasks are similar to GLUE with longer contexts to consider.


The performance again is measured by a single average score with a value of 89.8 for human annotators [66].



*GPT-3* [7] is a huge language model (Sect. 3.1.2), which can be instructed to perform a task without fine-tuning (Sect. 3.2). With this few-shot learning GPT-3 achieved an average SuperGLUE score of only 71.8 as shown in Table 4.2. Obviously fine-tuning the specific tasks seems to be important. Recently a fine-tuned DeBERTa ensemble (Sect. 3.1.1) surpassed human performance on SuperGLUE with an average score of 90.3. The most difficult task is a comparison of word senses in two sentences (WiC), where an accuracy of about 77% can be reached. The autoregressive LM *PaLM* 540B was fine-tuned on SuperGLUE and achieved an average of 90.4% on the test set [9, p. 13]. The best average of 91.2% was obtained by the *ST-MoE*32B mixture-of-experts model (Sect. 3.5.2) with 269B parameters [73]. This shows that Foundation Models are able to analyze complex text semantics.

GLUE and SuperGLUE have been criticized, as the answers of the posed problems always can be reduced to a classification task and the systems do not have to formulate an answer in natural language. In addition, it turns out that the performance of PLMs is not very stable. It has been shown that the prediction of current models often change in an inconsistent way, if some words are replaced [51]. If, for instance, in a sentiment analysis the input "I love the flight" is classified as positive, then "I didn't love the flight" should not be classified as neutral. Ribeiro et al. [51] show that inconsistencies like this often occur. They developed the **CheckList** system (Sect. 4.3.1), which automatically generates test examples for probing a model.

#### *4.1.3 Text Completion Benchmarks*

The task of an autoregressive language models is the reliable generation of the next word in a text. This has to obey grammatical correctness as well as semantic consistency. The *LAMBADA benchmark* [44] is a good test to demonstrate this ability. It consists of about 10,000 passages from the BooksCorpus containing unpublished novels. The task is to predict the missing last word of the last sentence of each passage. Examples were filtered by humans to ensure that models need to take into account the full passage of at least 50 tokens to induce the final word.

An example is the passage "Both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen, and I was glad to relax against the tree's rough, brittle bark and begin my breakfast of buttery, toasted bread and fresh fruit. Even the water was tasty, it was so clean and cold. It almost made up for the lack of .", where "coffee" is the missing target word to be predicted. Examples which could be easily predicted by simpler language models were omitted. Examples were only selected, if the target word could be predicted by humans from the full passage but not from the last sentence.

The GPT-3175B autoregressive language model [48] predicted the last word with 76.2% [7, p. 12]. PaLM540B with few-shot instructions could increase the accuracy to 89.7 [9, p. 79]. This means that in nearly nine of ten cases, the predicted word was exactly the missing word in the test data.


Another relevant benchmark for language modeling is *WikiText-103* [38] of 28k articles from Wikipedia with 103M tokens. If large Foundation Models are applied to this corpus the following perplexities result: GPT-21.7B 17.5 [48], Megatron-LM 10.8 [58], Gopher280B 8.1 [49, p. 61]. Recently a small Retro1.8B model with retrieval could reduce this perplexity to 3.9 [49, p. 12]. Note that there might be a partial overlap of Wikitext 103 with Retro's training data not caught by deduplication.

#### *4.1.4 Large Benchmark Collections*

Recently large autoregressive language models like GPT-3, Gopher, and PaLM have been developed, which are trained on extremely large document collections with hundreds of billions of tokens. The models should perform well across a wide range of tasks. Therefore, instead of the limited GLUE benchmarks a large number of benchmarks covering many aspects of possible applications are used to evaluate their performance.

A frequent opinion is that current benchmarks are insufficient and "saturate", "have artifacts", and are "overfitted by researchers". Bowman et al. [5] argue that "evaluation for many natural language understanding (NLU) tasks is broken". They complain that there are systems at the top of the leaderboards which fail in simple test cases (cf. [51]). As a consequence they formulate four requirements on new benchmarks:


They summarize some promising developments that could support these challenges, including data collection involving both crowdworkers and domain experts, and larger-scale data validation.

To address this criticism, two comprehensive collections of benchmarks have been defined. The *Massive Multitask Language Understanding* (MMLU) benchmark [20] emulates human exams with multiple choice questions, each with four responses. In addition to logical and mathematical reasoning it tests a model's ability across a wide range of academic subjects from computer science to history and law. The other collection is the *BIG-bench* collaborative benchmark [1, 60], designed to evaluate language interpretation aspects like reading comprehension, question answering, world understanding, etc. Both benchmark collections include more than a hundred tasks.


**Table 4.3** Groups of evaluation benchmarks for Gopher and related models [49, p. 8]

The *Gopher* model with 280B parameters together with alternatives like GPT-3, Jurassic-1, and Megatron-Turing NLG (all discussed in Sect. 3.1.2) were tested on these and other benchmarks. Note that this was done with a total of 152 benchmarks described in Table 4.3. Gopher shows an improvement on 100 of 124 tasks (81%) compared to the previous SOTA scores. In language modeling (next word prediction) Gopher improves SOTA for 10 of 19 benchmarks. Note that all benchmark results were not obtained after fine-tuning but by zero-shot or few-shot learning.

The distribution Gopher accuracies for thematic groups are shown in Fig. 4.1. Gopher is able to increase SOTA for 4 out of 7 math tasks, 5 out of 9 common sense tasks, 9 out of 12 logical reasoning tasks, 22 out of 24 fact checking and general knowledge tasks, all 24 STEM (Science Technology Engineering Mathematics) and medicine tasks, all 15 humanities and ethics task, and 10 out of 11 reading comprehension tasks. The average accuracies for common sense and general knowledge are about 50%, indicating that some knowledge exists but can be improved. Among these tests were benchmarks on logical reasoning, which, for instance, include "Formal Fallacies Syllogisms Negation" or "Logical Fallacy Detection". Only two of the 19 benchmarks achieved an accuracy of more than 60% [49, p. 58], indicating that even for this large model correct reasoning is a major obstacle. Obviously this spectrum of evaluation gives a deep insight into the capabilities of the compared models. It can be expected that the new Retro model (Sect. 6.2.3), which performs retrieval during language generation, will improve these results.

The *PaLM* autoregressive language model with 580B parameters [9, p. 15] recently was evaluated with the BIG-bench benchmark. On the 150 tasks, PaLM with 5-shot prompts achieved an normalized average score of 46%, which was better than the average human score of 39%. However, the best human experts have a score of about 77%. The detailed results for the different BIG benchmark areas are not yet available. On a subset of 58 BIG-tasks, which were also used by prior models, PaLM obtained a 5-shot normalized score of about 55%, again above the human average of 49%, outperforming Chinchilla (47%) and Gopher (30%). GPT-3 achieved a 1 shot performance of 16% on the 58 tasks. In general Foundation Models like Gopher and PaLM with several hundred billion parameters have 'dramatically better' results

**Fig. 4.1** Accuracies in percent of different groups covering 152 different benchmarks evaluated for the Gopher model [49, p. 57]. The 25% and 75% percentiles are given by the box, and the inner line is the median. The outside lines indicate variability outside the upper and lower quartiles

on BIG than smaller models, even if the model architecture is not fundamentally different [1]. In this respect Foundation Models show a qualitatively new behavior.

Researchers at Google have proposed to use the BIG-bench benchmark with currently 200 tasks as a replacement for the Turing test for "intelligence" [61]. In this way the knowledge of an AI-System can be checked at a large scale. Recently, a Google engineer published a dialog [31] with the LaMDA language model (Sect. 6.6.3). In his view this indicates that LaMDA is "sentient". However, this aspect of human intelligence is not checked by knowledge and reasoning tests such as BIG and requires the development of new types of tests.

#### *4.1.5 Summary*

Benchmark collections are a popular way to demonstrate the superiority of a Pretrained Language Model for specific tasks. To show the merits of an architecture, however, also the number of parameters, the size of training data, and the computing

effort has to be reported and compared, because these numbers also affect the model performance.

The GLUE benchmark collection of nine language understanding tasks has demonstrated the considerable progress of PLMs during the last years. It tests the ability of PLMs to detect paraphrases, coreference relations, logical entailments and grammatical correctness. Meanwhile, the average accuracy exceeds the average human performance. The similar, more challenging SuperGLUE benchmark suite has been introduced, where human performance is also surpassed. For autoregressive language models the LAMBADA benchmark requires an impressive ability to determine the most probable last word of a paragraph. Current models like PaLM are able to predict the last word with an accuracy of nearly 90% demonstrating its ability to capture the flow of arguments.

Foundation Models are usually tested by extensive standardized test collections covering many aspects like common sense knowledge, emotional intelligence, logical reasoning, or social sciences. Recent Foundation Models like Gopher and PaLM, with several hundred billion parameters, have been able to achieve performance better than that the human average and 'dramatically better' than smaller models. However, these models still have a lower accuracy than human experts. Although the benchmarks are very expressive, they do not take into account the societal impact of the models and are unable to detect features like self-awareness and sentience.

#### **4.2 Evaluating Knowledge by Probing Classifiers**

In this section, we examine the extent to which PLMs acquire different types of knowledge. We discuss the covered knowledge for the small BERT model and later review the improvements for foundation models such as GPT-3 and PaLM. First, we consider their syntactic knowledge of correct language. Then, we investigate how much common sense knowledge is represented by PLMs. Finally, we explore whether the output produced by PLMs is logically consistent.

#### *4.2.1 BERT's Syntactic Knowledge*

We discuss the syntactic knowledge incorporated in PLMs using BERT as an example. In the course of pre-training BERT is able to capture *syntactic knowledge* [54]. Embeddings can encode information about parts of speech, syntactic phrases and syntactic roles. Probing classifiers can predict part-of-speech tags and supersense information with an accuracy of 85% [33]. Obviously, this information has to be encoded in BERT's final embeddings. BERT also has knowledge of subject-verb agreement [17] and semantic roles [14]. It is also possible to extract dependency trees and syntactic constituency trees from BERT [21, 23, 27]. While probing indicates that the information can be extracted from the representation, it can be

shown [13] that in some cases the features are not used for prediction. According to an empirical evaluation PLMs encode linguistic information with phrase features in the bottom layers, syntactic features in the middle layers and semantic features in the top layers [23].

However, BERT's syntactic knowledge is incomplete and there is, for example, evidence that BERT often does not capture *negations*. For instance, BERTLARGE is able to determine the correct supersense, e.g. "bird" in the masked sentence "A robin is a [MASK]" with high probability [14]. On the other hand, the model predicts "robin", "bird", "penguin", "man", "fly" with maximum probabilities for the mask in "A robin is not a [MASK]", effectively ignoring the negation.

Some benchmarks described in Sect. 4.1 check the syntactic knowledge of PLMs. An example is the GLUE's CoLA task testing the grammatical correctness of sentences, which is the most difficult task of GLUE where the best models only yield about 75% correct answers (Table 4.1). SuperGLUE (Sect. 4.1.2) is a benchmark, which requires syntactic knowledge, e.g. for the textual entailment task COPA and the coreference resolution task WSC. While the fine-tuned BERT gets an average score of 69.0 the fine-tuned PaLM540B achieves an average of 91.4 (Table 4.2). Large foundation models such as PaLM, which has more than 1000 times as many parameters as BERT, are obviously able to capture syntactical knowledge much better than the 'small' BERT.

#### *4.2.2 Common Sense Knowledge*

*World knowledge*, also called *common sense knowledge*, consists of facts about our every day world, such as "fire is hot". A simple method of checking world knowledge is to query BERT with cloze statements, for example, "Einstein was born in [MASK]". BERT acquires some *semantic knowledge* about semantic roles and encodes information about entity types and relations [54]. For instance, in the sentence "to tip a [MASK]" the token "waiter" gets a high probability for the position of [MASK]. Petroni et al. [46] and Zhou et al. [72] experimented with such queries and concluded that BERT contains world knowledge competitive with traditional supervised information extraction methods. It has been shown that BERT's contextual embeddings make up clusters corresponding to *word senses* [56]. This explains why BERT is quite capable of word sense disambiguation (Fig. 2.10).

Petroni et al. [46] remark that certain types of factual knowledge are learned much more easily than others by the standard language model pre-training approaches. They state that without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge. In addition, BERT also does remarkably well on open-domain question answering against a supervised baseline. These capabilities of BERT are a great achievement.

The language model GPT-3 has one hundred times more parameters than BERT and a dramatically better common sense knowledge. This, for example, can be seen

from its answers (A) to the questions (Q): "Q: Are there any animals with three legs?" "A: No, there are no animals with three legs." or "Q: Which is heavier, a football player or a car?" "A: A car is heavier than a football player." [29]. In an initial experiment eighty persons were asked to assess, if short 200 word articles were written by humans or GPT-3. The persons judged incorrectly 48% of the time, doing only slightly better than random guessing [7].

However, the semantic knowledge of PLMs is not perfect. BERT, for instance, has difficulties with the representation of numbers and often has problems with the replacement of *named entities* (*NE*s), e.g. person names or location names. For example, replacing names in the coreference task changes 85% of coreference assignments of expressions that refer to the same entity [3]. Obviously the pretrained version of BERT struggles to generalize the relations involving one named entity to other named entities of the same type. Moreover, BERT has problems to transfer knowledge based on the roles or types of objects. In addition, it is possible to mislead BERT by adding some content to a cloze query. An example is the word "Talk" in "Talk? Birds can [MASK]". A human would ignore "Talk?" and use his world knowledge to generate a result like "fly". In contrast, PLMs can be misled and produce the wrong answer "talk" for the mask [26].

A related phenomenon is the invariance to *paraphrases*. Elazar et al. [12] generate a high-quality set of 328 paraphrases to express 38 relations. Examples are "X originally aired on [MASK]" and "X premiered on [MASK]", which should give the same prediction for [MASK], if "X" is replaced by some TV series like "Seinfeld". Although the models in about 60% of the cases have access to the required knowledge to fill the mask correctly, BERTLARGE yields a consistency in paraphrases in only 48.7% of the cases. This indicates that not every fact present in the training data is encoded in the parameters and that the model does not always detect the equivalence of paraphrases. The model variants RoBERTa and ALBERT achieve a lower consistency, although they are superior to BERT in other tasks.

It is instructive to consider the influence of word order on the performance of BERT. Word order is taken into account by specific position embeddings, which are added to the token embeddings. It turns out, however that masked language models like BERT still achieve a high accuracy, if word positions are permuted. For pre-training Sinha et al. [59] perform sentence permutations, where each word in a sentence is randomly placed at a different position. The model was fine-tuned on GLUE, a set of classification tasks for natural language understanding (Sect. 2.1.5). If we ignore the CoLA-task, which checks linguistic acceptability, the model on average only looses 3.4% accuracy if the word order is permuted compared to the original RoBERTa results (88.7% on average). The authors conclude that BERT-like models achieve high performance on downstream tasks almost entirely by exploiting higher-order word co-occurrence statistics.

Another aspect of common sense knowledge is time. When a PLM is applied to new documents it often does not know the meaning of new named entities and concepts [30]. Often, the model cannot infer the time and region of a document and may not be able to correctly combine facts from documents that originate from different time periods or geographical regions. A benchmark for assessing

the temporal reasoning capabilities of PLMs in dialogs shows that BERT and T5 have major deficits on this task [47]. In summary it can be expected that the new Retro (Sect. 6.2.3) or WebGPT (Sect. 6.2.3) models, which perform retrieval during language generation, will considerably mitigate the problems discussed in this section.

To be able to check a multitude of different knowledge types in a standardized way large benchmarks like BIG-bench have been developed (Sect. 4.1.4). It comprises benchmarks on common sense, emotional intelligence, ethics, fact checking, general knowledge, humanities, mathematics, medicine, reading comprehension, science and social sciences. Figure 4.1 shows the performance of the Gopher model with 280B parameters on these benchmark groups. On most groups more than 50% accuracy was achieved. The PaLM model with 540B parameters was able to improve these performance figures. On about 2*/*3 of these tasks PaLM using 5-shot prompts achieves a better performance than average humans [9, p. 17]. This indicates that PaLM has a much better common sense knowledge than earlier models. Nevertheless, PaLM surpasses the performance of human experts only in a small fraction of cases suggesting further headroom for improvement.

An interesting idea is to use large pre-trained multilingual language models as a multilingual knowledge base [25]. The authors evaluate this for mBERT (Sect. 3.3.1), a standard BERT model, which has been pre-trained with the MLM loss on non-parallel Wikipedia texts from 104 languages. The authors find that correct entities can be retrieved for many languages. However, there is a clear performance gap between English and, e.g., Japanese and Thai. This suggests that mBERT does not store knowledge about entities in a language-independent way. It would be revealing if these experiments could be repeated with up-to-date language models like PaLM.

#### *4.2.3 Logical Consistency*

A set of statements is logically inconsistent if they cannot all be true at the same time. As an example consider the statements "John is Tom's father. Tom is the daughter of John." Sometimes, BERT is unable to reason, i.e. logically connect different pieces of knowledge. It reproduces, for instance, the relations that persons can walk into houses, and that houses are big, but it cannot infer that houses are bigger than persons [15, 52]. However, semantic knowledge problems tend to be smaller for models with more parameters.

Richardson et al. [52] formulated nine different types of simple sentence pairs containing e.g. negations, quantifiers, comparatives, etc. These sentences express logical entailment, contradiction or neutrality. In addition, they also employ chains of hypernomy, e.g. poodle ≤ dog ≤ mammal ≤ animal, and use these relations to generate new sentences expressing the corresponding logical properties. It turns out that BERT fine-tuned with the 'logical tasks' SNLI and MNLI predicts correct statements only with 47.3% accuracy of the cases.

Ribeiro et al. [51] propose to generate a large number of simple examples to test relations by a *CheckList procedure* described in Sect. 4.3.1. It tests, for instance, whether negating a positive sentiment expression leads to a negative sentiment rating. For more than half of the tests with commercial and open-source models they observed failure rates of more than 50%.

Even the larger model GPT-3 is not perfect, e.g. it incorrectly answers some common sense physics questions like "If I put cheese into the fridge, will it melt?" [7]. In addition, it has difficulties with logical reasoning, e.g. to determine if one sentence implies another. If a question is not covered in its training material, GPT-3 compiles the most probable answer and sometimes this is wrong, e.g. "Q: How many eyes does the sun have?" "A: The sun has one eye." or "Q: Who was president of the United States in 1600?" "A: Queen Elizabeth I was president of the United States in 1600." [29]. As another example consider the following input "You poured yourself a glass of cranberry, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can't smell anything. You are very thirsty. So you ...". The continuation generated by GPT-3 is "drink it. You are now dead.". GPT-3 assumes wrongly that "grape juice" is a poison and drinking it will kill you [36].

#### **Improving Logical Consistency**

PLMs can improve logical reasoning capabilities if they are trained with appropriately generated textual expressions. By fine-tuning a BERT model with created sentences containing negations, hypernomy, etc., and testing with other generated sentences, Richardson et al. [52] achieve an accuracy of 98%. This approach is similar to the data generation strategy proposed in Sect. 3.6.6.

Similarly, Clark et al. [10] generate datasets of the form (context, statement, answer), where context contains different logical facts and rules, statement is a logical question to prove and answer is either T or F. Facts, rules, and the question statements are then expressed in (synthetic) English. The problems require simultaneous consideration of a number of different statements to reach a conclusion, from depth 0 (simple lookup) to depth 5. During fine-tuning on this data, RoBERTa was trained to answer these questions as true or false. On the test data RoBERTa is able to answer the questions with 99% accuracy. If the facts and rules are paraphrased the accuracy drops to 66%. However, by training on paraphrased rules the model again reaches an accuracy beyond 90%. Clark et al. [10] suggest that by this approach the transformer can be considered as a "soft theorem prover" able to work with statements in language.

It is possible to combine the implicit, pre-trained knowledge of an LM and explicit statements in natural language. Talmor et al. [64] show that models trained with such datasets can perform inferences involving implicit world knowledge and taxonomic knowledge (e.g. the WordNet hierarchy) . In addition, inference patterns provided by examples are used by the model to solve logical problems.

There were a number of prior approaches to combine logical reasoning with neural networks. If a neural network provides probabilities for logical facts, then we can use a probabilistic reasoning system to enforce additional constraints. Examples are *DeepProblog* [35] that incorporates Deep Learning by means of neural predicates, i.e. statements whose probability is determined by a neural network. An alternative is *probabilistic soft logic* (*PSL*) [28], which allows first order probabilistic reasoning in relational domains. However, PLMs do not directly provide probabilities for facts. There have been approaches to translate natural language sentences to logical statements and apply logical reasoning. However, this "semantic parsing" [24] was not very successful.

A number of researchers have developed methods for neural theorem proving. This work combines symbolic and neural methods to reason about results derived from language. Examples are e.g. Minervini et al. [39], which jointly embed logical predicates and text in a shared space by using an end-to-end differentiable model, or Weber et al. [70] which combine a Prolog prover with a language model to apply rule-based reasoning to natural language. The **DeepCTRL** approach [57] integrates rules with Deep Learning. It has a rule encoder which allows to control the strengths of the rules at inference. It can be applied to domains like healthcare, physical models or accounting, where obeying rules is essential.

A simple but effective way to improve logical consistency is to increase the number of model parameters creating a Foundation Model. A large fraction of the tasks in the BIG-bench benchmark [1, 60] is devoted to checking logical consistency, e.g. the benchmark groups with analogical reasoning and logical reasoning. *Gopher*  (Sect. 3.1.2) is a language model with 280B parameters. It was applied to about 150 benchmarks, among them 19 logical reasoning tasks. In all but 4 benchmarks it could increase SOTA indicating that larger PLMs have better reasoning capabilities. Nevertheless, the average accuracy was only about 50%. It was not yet evaluated whether the recent *Retro* (Sect. 6.2.3) language model with retrieval of additional text documents is able to improve these results.

*PaLM* (Sect. 3.1.2) is an even larger language model with 540B parameters. On the SuperGLUE logical tasks CB, COPA, RTE, it can drastically increase the scores compared to BERT, e.g. for COPA from 70.6 to 99.2 (Table 4.2). It has been evaluated on hundreds of benchmarks including those used for Gopher. It uses a new prompt technique to pose logical questions, where examples are presented to the system together with *thought chains* partitioning a reasoning task into smaller problems (Sect. 3.6.4). Two examples are shown in Fig. 2.21. Note that *k*-shot reasoning only requires a single sequence of *k* thought chain prompts to be provided for the training examples. The model then generates a thought chain for each test example. This can be used for error analysis and explaining the model behavior.

Using this technique, PaLM is able to match or surpass the performance level of an average human asked to solve the task. As an example consider the *StrategyQA benchmark* [16], which contains questions like "Did Aristotle use a laptop?". For this question the model has to collect facts on the lifespan of Aristotle and the year, when the first laptop was invented to arrive at the answer "No". Without thought chain prompts PaLM reached 69%, while the use of thought chain prompts could

improve the prior SOTA from 70% to 73.9%. As a comparison, average humans achieve 62.9%, while expert humans have an accuracy of 90%.

There are other ways to improve learning with such intermediate outputs. Wang et al. [69] sample multiple chains of thought exploiting the diversity of reasoning paths and then return the most consistent final answer in the set. Since it is expensive to obtain chains-of-thought for a large number of examples, Zelikman et al. [71] generate explanations for a large dataset by bootstrapping a model in the few-shot setting and only retaining chains-of-thought that lead to correct answers.

#### *4.2.4 Summary*

Pre-trained PLMs have a huge number of parameters and are able to represent an enormous amount of syntactic and factual knowledge. This knowledge can be elicited by probing classifiers, the prediction of masked words, by generating answers to inputs, or by solving benchmark tasks.

As far as syntactic knowledge is concerned, Foundation Models like GPT-3 produce almost error-free text and 'know' a lot about syntactic rules. One problem is to adequately reflect the effect of negations.

Even smaller models like BERT are capable of producing a lot of commonsense knowledge. Here, the effect of substituting names or using paraphrases is problematic. Larger language models like GPT-3 are more robust, and the recently proposed language models with retrieval (WebGPT, Retro) are able to include relevant external documents for the current task. This information can reduce errors considerably. However, there is no comprehensive evaluation yet. One problem is the correct temporal and spatial location of information. Here, smaller models like BERT and T5 have large deficits. Foundation Models meanwhile surpass the average human score in 2/3 of the BIG-bench tests on common sense knowledge. They can even be used as a multilingual knowledge base, since models like PaLM cover many languages.

Logical consistency of inferences is a problem, and the PLMs often associate answers that are plausible but wrong. The models are only able to make logical inferences for relationships mentioned in the training text, and they are often incapable of making abstractions and generalizing an observed relationship to other objects or entities of the same type. Logical consistency can be improved by generating additional training texts containing assumptions and valid logical consequences resulting from them. The direct inclusion of logical reasoning systems in Foundation Models was not very successful. The PaLM language model with 540B parameters achieved a fundamental improvement of the accuracy of logical reasoning through the use of thought chain prompts. Here in a few-shot prompt a logical derivation is broken down into smaller logical substeps . At present, it is not clear, to what extent language models with retrieval can reduce the still existing deficits in logical reasoning.

#### **4.3 Transferability and Reproducibility of Benchmarks**

In this section, we consider whether benchmarks actually evaluate the properties they are supposed to test. We also discuss the extent to which they are reproducible.

#### *4.3.1 Transferability of Benchmark Results*

On a number of benchmarks, the performance of human annotators is exceeded by Foundation Models. This is an indication that the model has learned valuable contents about language. However, Ribeiro et al. [51] argue that this can be misleading, because the test sets often do not cover the right content. While performance on held-out test data is a useful measure, these datasets are often not comprehensive. Hence, there is the danger of overestimating the usability of the model in real applications.

#### **Benchmarks May Not Test All Aspects**

On the MRPC task of the GLUE benchmark for detecting paraphrases RoBERTa, BERTLARGE, and humans have F1 scores of 90.9% [34], 89.3% [42] and 86.3% respectively. Therefore, both models perform better than humans. To test whether the models respect basic logical relationships, Ribeiro et al. [51] propose to generate a large number of simple examples using a **CheckList procedure**. This approach is similar to testing software by systematically generating a large variety of inputs in unit tests.

The following scheme, for instance, can be used to check the effect of a negation in a sentiment classification task "I *<*negation*> <*positive\_verb*>* the *<*thing*>*". It generates sentences like "I didn't love the food" or "I don't enjoy sailing". The authors formulate *minimum functionality tests*, which are useful to check if the model actually detected the reason of an outcome or used some unjustified association. In addition, they utilize *invariance tests* to find out, if neutral perturbations or paraphrases change the result. Finally, they create *directional expectation tests*, where a modification is known to change the result in an expected way.

For MPRC it turned out that the failure rates of RoBERTa and BERT on these 23 test templates are larger than 50% for 11 and 14 of the templates respectively. Therefore, the "superhuman" performance of the two models should be taken with a grain of salt.

The authors also tested five current PLMs: BERTBASE, RoBERTaBASE, Microsoft's Text Analytics, Google Cloud's Natural Language, and Amazon's Comprehend. They report the results of 17 tests for sentiment classification, where most problems occurred with negations. For instance, the following example "I thought the plane would be awful, but it wasn't." was misclassified by most models.

Also very difficult is the detection of paraphrases with 23 tests templates. Here RoBERTa had for 11 and BERT for 14 of the test templates a failure rate of more than 50%. A similar failure rate was observed for reading comprehension when test cases were generated with logical templates. These results indicate that the examples in the original test sets of the benchmarks are too easy.

To increase robustness of PLMs it is possible to generate adversarial examples [8, 65]. The authors discuss methods that augment training data with adversarial examples as well as methods that produce certificates of robustness. They also investigate methods to avoid spurious correlations, i.e. predictive patterns that work well on a specific dataset but do not hold in general.

Talman et al. [63] checked, if the results for benchmarks may be transferred to similar datasets. They trained six PLMs on different benchmarks for *natural language inference* (*NLI*) containing sentence pairs manually labeled with the labels entailment, contradiction, and neutral. While six models perform well when the test set matches the training set, accuracy is significantly lower when a test set from another benchmark is used. BERTBASE, for instance, yields a test accuracy of 90.4% for SNLI, which drops on average 21.2% for the test sets of the other benchmarks. The reason behind this drop is a slightly different definition of the task as well as small differences in the documents domains. Obviously, it cannot be expected that the performance of PLMs can simply be transferred to new data.

#### **Logical Reasoning by Correlation**

The *Winograd schema challenge* (WNLI) was developed by Levesque et al. [32] and is part of the GLUE benchmark collection. The test consists of a pair of sentences differing by exactly one word, each followed by a question [41], e.g.


In this pair of sentences, the difference of one word changes which thing or person a pronoun refers to. Answering these questions correctly seems to require common sense reasoning and world knowledge. In addition, the authors have designed the questions to be "Google-proof": The system should not be able to use a web search (or anything similar) to answer the questions. GPT-3 reaches a value of 88.6% using few-shot prompts without fine-tuning [7] and DeBERTa managed an accuracy of 95.6% after fine-tuning [19]. This accuracy roughly equals human performance.

As Mitchell [41] argues, this does not necessarily mean that neural network language models have attained human-like understanding. For a number of question pairs it seems possible to answer the question by some sort of correlation instead of actual world knowledge. If pre-trained on a large corpus the model will learn the high correlation between "sports car" and "fast" and between "mail truck" and "slow" for the above example. Therefore, it can give the correct answer on the

coreference of "it" based on those correlations alone and not by recourse to any understanding. It turns out that many of the Winograd schema challenge question follow this pattern. A similar argument states [6, 37] that a model might heuristically accept a hypothesis by assuming that the premise entails any hypothesis whose words all appear in the premise. This means that the model can give the right answer without 'understanding' the situation in question.

To reduce the deficits of the Winograd schema challenge a much larger *Winogrande* benchmark [55] was created using crowdsourcing. The researchers discarded sentences which could be answered by exploiting intuition and correlation. They used the embeddings created by RoBERTa (Sect. 3.1.1) to determine if these embeddings strongly indicated the correct response option. In this case they discarded the question pair and finally ended up with 44k sentences. An example for a question pair without correlation problems is:


While humans reach an accuracy of 94%, the best PLMs, standard models like RoBERTa only reached 79.1% accuracy. Recently, *T5-XXL* achieved an accuracy of about 91% [43] and the *ST-MoE-32B* mixture-of-experts model [73] with 269B parameters (Sect. 3.5.2) obtained 96.1%, drastically reducing the errors. It appears that in most cases the latter models are able to perform 'reasoning' without simply correlating statements.

#### *4.3.2 Reproducibility of Published Results in Natural Language Processing*

Many publications in NLP claim that their model achieves SOTA for some benchmark. Examples are the GLUE benchmark [67] for language understanding and the SQuAD data [50] for reading comprehension. There are two main problems with this approach. First it is difficult to assess, if the results are reproducible and significant. As Crane [11] demonstrates, there are usually a number of unreported conditions that affect the reproducibility of the result. An example is the random initialization of the network parameters. The resulting variance is often larger than the reported improvement in SOTA scores. However, the variance resulting from these phenomena is usually not reported. Other effects are the underlying programming frameworks and libraries, which change over time. Often the hyperparameters and the details of preprocessing and model configuration are not communicated.

To document the model architecture, the training and evaluation process of a model, Mitchell et al. [40] proposed the description of relevant facts and hyperparameters in a **model card**. After a short high-level description of the model and its purpose the model card should contain nine different sections [40]:


More details are given by huggingface [22]. Even if models still can be published without a model card, the explicit documentation of the model can only benefit future users. Therefore, model cards should be provided if possible. For most recent models, a model card is provided even if the model is not open-source.

A survey on *reproducibility in NLP* is given by Belz et al. [4]. They note that the performance results often depend on seemingly small differences in model parameters and settings, for example minimum counts for rare word or the normalization of writing. The authors state in their study on repeated experiments that only 14% of the 513 reported scores were the same. An annoying fraction of 59% of the scores were worse than the published numbers. Therefore, the experimental results published in papers should be treated with caution.

Another issue is the question of what causes an increase in performance. As we have discussed above, a growth in the number of parameters and in the computing effort regularly leads to better results for PLMs (Sect. 3.5.1). As a consequence, it is often not clear, whether the architectural changes to a model yield the improved performance or just the number of additional parameters or the larger training set [53].

Obviously a first place in a leaderboard can be achieved with a larger model and more computing effort. This, however, "is not research news" according to Rogers [53]. In addition, these results are often not reproducible: Who can afford to retrain GPT-3 for 4.6 million dollars. As a consequence, the development of smaller but more innovative models is less rewarding, as it is difficult to beat the bigger model. Only if the authors of a new model can show that their architecture is better than the previous SOTA model with the same number of parameters and compute budget, they can claim to have made a valuable contribution. Rogers [53] proposes to provide a standard training corpus for a leaderboard and limit the amount of computation effort to that of a strong baseline model. As an alternative the size of the training data and the computational effort should be reported and taken into account in the final score.

#### **Available Implementations**


#### *4.3.3 Summary*

The transferability of benchmark results to real applications is not always granted. Even if a PLM is better than humans at logical reasoning on the test set, it may not be able to classify generated logical reasoning chains correctly. This indicates that the test set does not cover the full spectrum of possible examples. It is common for performance to be lower on related benchmarks because the domain or the definition of the task may deviate.

There are cases where a logical conclusion is obtained not by logical deduction, but by a simple correlation of antecedent and consequent. This could be demonstrated for the Winograd task of the GLUE benchmark. To avoid this type of 'reasoning' a new variant task called Winogrande was developed where correlations are unrelated to the reasoning task. Meanwhile, a Foundation Model with 269B parameters was also able to solve this task better than humans.

A survey on the reproducibility of results in NLP demonstrated that the published performance often depends on a number of unreported effects, such as random number initialization. Often the variability of such effects is larger than the reported improvement. Therefore, it is necessary to report the variance caused by these effects. In addition, the details of the model architecture, its training and evaluation should be documented in a model card. In about 500 repeated experiments, an irritating rate of about 60% of final scores were lower than reported. Note that improvements due to more parameters, more training data, or higher computational effort are not indicative of a better model architecture.

#### **References**


*Abstr*. Punta Cana, Dominican Republic & Online: Association for Computational Linguistics, Nov. 2021, pp. 22–26. URL: https://aclanthology.org/2021.emnlp-tutorials.5 (visited on 11/24/2021).


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 5 Foundation Models for Information Extraction**

**Abstract** In the chapter we consider Information Extraction approaches that automatically identify structured information in text documents and comprise a set of tasks. The Text Classification task assigns a document to one or more pre-defined content categories or classes. This includes many subtasks such as language identification, sentiment analysis, etc. The Word Sense Disambiguation task attaches a predefined meaning to each word in a document. The Named Entity Recognition task identifies named entities in a document. An entity is any object or concept mentioned in the text and a named entity is an entity that is referred to by a proper name. The Relation Extraction task aims to identify the relationship between entities extracted from a text. This covers many subtasks such as coreference resolution, entity linking, and event extraction. Most demanding is the joint extraction of entities and relations from a text. Traditionally, relatively small Pre-trained Language Models have been fine-tuned to these task and yield high performance, while larger Foundation Models achieve high scores with fewshot prompts, but usually have not been benchmarked.

**Keywords** Text classification · Named entity recognition · Relation extraction · Sentiment analysis · Language understanding

There are a large number of NLP applications of Pre-trained Language Models (PLMs), which can be roughly divided into three areas


These applications are described in the three following chapters.


**Table 5.1** Language analysis tasks based on text classification illustrated by examples

In the present chapter we focus on **information extraction** with PLMs. Information extraction includes the following tasks:


#### 5.1 Text Classification 189

• *Relation Extraction* aims to identify the relationship between *entities* extracted from a text (Sect. 5.4). This covers many subtasks such as coreference resolution, entity linking, and event extraction (Table 5.3).

Due to the large number of different approaches, we focus on representative models which exhibit a high performance at the time of writing. Traditionally relatively small PLMs have been fine-tuned to these task and yield high performance, while larger Foundation Models achieve high scores with few-shot prompts, but usually have not been benchmarked.

We outline the inner logic and main features of the methods, taking into account necessary resources, e.g. computational and memory requirements. For standard models a link to the description in earlier chapters is provided. Under the heading "Available Implementations" you will find links to available code and pre-trained models for a task. Good general sources for code are the websites [30, 35, 74, 79].

#### **5.1 Text Classification**

Automatic *text classification* is a common task in natural language processing where a *class*, (also called *category* or *label*) is assigned to a short text or a document. The set of classes is predefined and may contain just two classes (*binary classification*), or more classes (*multiclass classification*). Each text must be assigned a single class, which means that the classes are exclusive. Typical tasks include spam detection in emails, sentiment analysis , categorization of news articles , hate speech detection, dialog act classification, and many more. Some examples are listed in Table 5.1. Kowsari et al. [44], Li et al. [49] and Minaee et al. [64] provide surveys on text classification.

Often a document covers several topics simultaneously, e.g. a news article on the construction cost of a soccer stadium. In this case it is necessary to assign multiple classes to a document, in our example "soccer" and "finance". This type of classification is called *multilabel classification*. *Extreme multilabel classification* is a variant containing a very large label set with thousands of labels.

There are a number of popular benchmarks to assess the performance of document classification approaches covering two or more classes. Typically, the benchmarks contain many thousand training and test examples. Table 5.2 describes the properties of some popular text classification benchmarks. Often documents are categorized according to the subjective opinions of users. An example are reviews of movies or restaurants, which can be classified as positive, negative, or neutral. Then the classification corresponds to a *sentiment analysis* task.

Early methods for document classification in the 1990s used classical machine learning approaches [44]. In the first preprocessing step, manually created features were extracted from the documents. In the second step, a classifier was trained with these features to reconstruct the manually assigned class labels (Sect. 1.3). Finally, this classifier was applied to new documents. Usually, *bag-of-words* representations were used to represent the input documents. Popular classification methods included


**Table 5.2** Popular text classification benchmarks

*naive Bayes*, *logistic classifier*, the *support vector machine*, and tree-based methods like *random forests*. However, all these methods were hampered by the shortcomings of the bag-of-words representation (Sect. 1.3), which ignores the sequence of words in a document.

In the next sections, we consider current classification models for mutually exclusive as well as "overlapping" classes. It turns out that most of the current best approaches are based on PLMs.

#### *5.1.1 Multiclass Classification with Exclusive Classes*

A prominent application of **BERT** is fine-tuning for a classification task (Sect. 2.1.2). Here, a pre-trained BERT is adapted to this task by supervised finetuning, using the contextual embedding of the "[CLS]" token in the highest layer as input for a logistic classifier. This classifier is extremely successful for natural language understanding tasks (Sect. 2.1.5).

**XLNet** [120] is trained by reconstructing a permuted token sequence (Sect. 3.1.1), and is therefore able to capture a lot of knowledge about the language. It achieves 96.2% accuracy on the binary IMDB classification task. This performance is surpassed by **ERNIE-Doc** [26] with 97.1%. ERNIE-Doc is a transformer with an enhanced recurrence mechanism capable of considering many previous segments of a text in the same layer. The model aims to mitigate problems of other transformer-based models for long contexts such as the Longformer, which do not provide the contextual information of whole documents to each segment. The SOTA is currently held by a simpler model [101], which modifies the well known paragraph vector [47] and Naive Bayes weighted bag of *n*-grams. It achieves an accuracy of 97.4%.

The current best model on the IMDB dataset with 10 classes is **ALBERT-SEN**  [23]. The authors propose an approach which evaluates the overall importance of sentences to the whole document, with the motivation that different sentences can contain different polarities but that the overall polarity depends on a few important sentences. Their model uses ALBERT (Sect. 3.1.1) to encode sentences via the [SEP] and [CLS] token representations. They concatenate these representations with class-weighted representations. Then they have a document encoder that calculates importance weights for every sentence and creates a weighted representation of the sentences as document representation. Finally, they calculate a sentiment score by utilizing the document representation and the class representations, which were also used in the sentence encoder. The model achieves an accuracy of 54*.*8%. It should be noted that subtle nuances in language expressions must be taken into account in this classification task with 10 classes.

For the Yelp benchmark, **XLNet** performs best for the binary version with an accuracy of 98.4% and achieves the second-best accuracy of 73*.*0% for the fine-granular version with 5 classes. The leading model for this task is **HAHNN**  [1] with an accuracy of 73*.*3%. HAHNN combines convolutional layers, gated recurrent units and attention mechanisms. It builds on non-contextual FastText [16] embeddings as word representations and uses a stack of convolutional layers to obtain contextual information. This is followed by a word encoder which applies recurrent GRU cells to obtain word representations, and an attention mechanism to create weights for the input words. Sentence representations are then formed as an attention-weighted average of the words. Another GRU layer is employed to create sentence representations, which are then combined via attention to generate a document level representation. This establishes the input to a fully connected layer with softmax activation for classification.

**BigBird** [127] is especially valuable for classification tasks with long documents, as it can process input sequences of length 4096 (Sect. 3.2.1). Following BERT, the output embedding of the first [CLS] is input for the classifier. For the IMDB data with shorter documents there is no performance gain compared to simpler models. On the *ArXiv benchmark* [32] with documents of average length 6300 and 11 classes BigBird improves SOTA by about 5% points.

With the advent of Web 2.0 and the ability for users to create and share their own content with the world, the proliferation of harmful content such as hate

speech, has increased. This is now fueled by bots and machine learning models that automatically create such content at a scale that humans can barely manage. *Hate speech* is often defined as a hostile or disparaging communication by a person or group referring to characteristics such as race, color, national origin, gender, disability, religion, or sexual orientation [36]. According to European law, hate speech is a punishable criminal offense.

Hate speech detection can be solved as a text classification task. Recognizing such a text is difficult because the line between hate speech, irony, free speech, and art is blurred. Jahan et al. [36] and Yin et al. [123] give a systematic review on automatic hate speech detection. Because of the importance of the task, let's take a closer look at current approaches.

Roy et al. [88] follow a multilingual approach. They preprocess the text from Twitter by using a special tokenization of tweets. The cleaned text, emojis and segmented hashtags are encoded by different transformers and concatenated. A final multilayer perceptron generates the classification. The results for the *HASOC 2019 tweet dataset* [58] show that the additional signal from the emojis and the hashtags yield a performance boost for hate speech detection as well as for classifying the type of hate speech. They achieve F1-values of 90.3%, 81.9% and 75.5% on the English, German, and Hindi test sets.

Mathew et al. [59] argue that the decisions of hate speech classifiers should be explained. They present the *HateXplain* dataset with about 20k posts. The annotation contains class labels (hateful, offensive, or normal), the target group being vilified, and span annotations of words causing the classification. Overall a BERT model yields the best results in explaining the hate speech classification decisions.

A recent competition was the SemEval-20 Task 12 [128], where 14,100 Twitter tweets were manually labeled as either offensive or not offensive. Using a **RoBERTa** classifier (Sect. 3.1.1) Wiedemann et al. [110] achieved 92.0% F1-value and won the competition. In a later experiment an ensemble of *ALBERT* models (Sect. 3.1.1) increased this score to 92.6%. In summary, the automatic classification of hate speech can be solved by PLMs with high quality.

#### *5.1.2 Multilabel Classification*

Multilabel classification is required whenever a text can belong to multiple classes simultaneously. When a very large number of classes is available, this is sometimes called *extreme multilabel classification*. An example problem is the assignment of tags to Wikipedia articles, where Wikipedia has almost 500k tags. In multilabel classification usually a score or probability for each class is returned. This can be used to rank the classes. Traditional metrics such as accuracy, which assume that only one class is correct, cannot be applied. An alternative is to measure the quality of ranking induced by the score (c.f. Sect. 6.1.2). A popular measure for a predicted score vector *y*ˆ*<sup>i</sup>* ∈ [0*,* 1] and a ground truth label vector *yi* ∈ {0*,* 1} is the *precision at*

*k*, which counts, how many correct classes are among the *k* classes with the highest score:

$$\operatorname{precec@k} k = \frac{1}{k} \sum\_{l \in \operatorname{rank}\_k(\hat{\mathbb{y}})} \operatorname{y} l \qquad DCG \circledast k = \frac{1}{k} \sum\_{l \in \operatorname{rank}\_k(\hat{\mathbb{y}})} \frac{\operatorname{y} l}{\log(l+1)},\tag{5.1}$$

where rank*(y)*ˆ = *(i*1*,...,ik)* is the vector of the indices of the *k* largest values of *y*ˆ*<sup>i</sup>* sorted in descending order *y*ˆ*i*<sup>1</sup> ≥···≥ ˆ*yik* . The second measure *DCG*@*k* is the *discounted cumulative gain*, where the correct assignments *yl* are weighted by their rank *l* transformed with 1*/* log*(l*+1*)* [14]. This reflects that correct assignments with a lower rank should get a lower weight. In addition, there is a normalized version *nDCG*@*k*, where *DCG*@*k* is divided by its maximal possible value.

Separate classifiers for each class often yield a very good accuracy, but suffer from very bad training and prediction time. In the worst case these classifiers have to be trained per label with all positive instances of a label and all instances of the other labels as negative samples. To mitigate this effect **Parabel** [83] is based on a tree ensemble. First Parabel creates label representations by averaging all the instances that belong to a label and normalizing this averaged vector to 1. Then balanced 2 means clustering is applied on the label space recursively until all leaf nodes in the clustered label tree contain fewer than *M* labels, e.g. *M* = 100. For each internal node of the tree and for the leaf nodes, classifiers are trained to decide which path of the tree an instance follows. Thus, a balanced label hierarchy is generated efficiently based on a label representation such that labels with similar inputs end up together at the leaves. Up to 3 such trees are used as an ensemble.

Finally, for each label, 1-vs-All classifiers are trained as a MAP estimate of the joint probability distribution over labels. The negative examples used for training these classifiers are drawn from the other labels in the same leaf, so the most similar or confusing counterexamples are employed. For prediction a beam search is performed in the tree and only for the *k* most probable labels a classification is actually performed. Parabel has been applied to problems with 7M labels and can make predictions in logarithmic time. Parabel is significantly faster at training and prediction than state-of-the-art extreme classifiers while having almost the same precision. On the EURLex-4K it achieves a prec@1 value of 81.5 and on the Amazon-670k a prec@1 value of 43.9, which is worse than the 45.4 of the best approach, but its time for prediction is only 1/1000.

**AttentionXML** [124] is a tree-based classifier, which uses contextual embeddings as input features. With an attention between the many labels and the tokens, AttentionXML represents a given text differently for each label. The architecture of AttentionXML consists of a word representation layer, a bidirectional LSTM layer, an attention layer with attention from all labels to the BiLSTM (Sect. 1.6) encoded input and lastly a fully connected layer and an output layer.

AttentionXML first builds a deep tree similar to Parabel. Then the tree is compressed to a shallow and wide tree, which allows to handle millions of categories, especially for "tail labels", i.e. classes with only a few examples in the

training set [37]. The model uses the binary cross-entropy loss function. For each level of the tree this model is trained, being initialized with the model of the prior tree level. AttentionXML trains label ranking with negative labels sampled by finetuned label recalling models. For prediction the tree is used for a beam search, so only tree branches where the parent nodes have highest scores are considered.

On the *EURLex-4K benchmark* AttentionXML achieves *prec*@1 = 87*.*1% and *prec*@5 = 61*.*9%. This means that the highest scoring prediction of the model is correct for 87*.*1% of the test predictions and 61*.*9% of the five highest scoring predictions are correct. Note that the choice of *k* should be made according to the average number of labels per document in the training set. On the *Amazon670k dataset* [60] with 679k categories AttentionXML achieves *prec*@1 = 47*.*6% and *prec*@5 = 38*.*9%. This means that about 40% of the alternative products are correctly identified.

**LightXML** [39] employs a transformer encoder to generate contextual word features and generates negative examples for each category in a dynamic way. First, a set of label clusters is created based on the input features so that each label belongs to one cluster. Then a pre-trained model like RoBERTa (Sect. 3.1.1) is employed to encode the input text of an instance into contextual embeddings. To represent the input text of a training example, the embeddings of the [CLS] token in the last five layers are concatenated.

A specific *label recalling* model aims to predict the label clusters using the [CLS] embeddings as input. In addition, the *label ranking model* receives the [CLS] embeddings of a training instance as well as the corresponding label . Negative examples with other labels are dynamically generated with the label recalling model. The loss terms of both the generator and the discriminator are combined in a joint loss function allowing end-to-end training. On the EURLex-4K benchmark LightXML achieves a *prec*@1 = 87*.*6% and *prec*@5 = 63*.*4%. On the Amazon670k benchmark it reaches a *prec*@1 = 49*.*1% and *prec*@5 = 39*.*6%. Both values are slightly better than those of AttentionXML. The approach also demonstrates SOTA performance compared to 7 alternative model on three other multilabel datasets.

**Overlap** [51] groups labels into overlapping clusters. In product categorization, for example, the tag "belt" can be related to a vehicle belt (in the "vehicle accessories" category), or a man's belt (under "clothing" category). Each label can now occur at most *λ*-times, where *λ* is a hyperparameter of the approach. The authors initialize their partitioning with a balanced *k*-means clustering and then proceed with an optimization method to reassign labels in a way that maximizes the precision rate. On the Amazon670k benchmark the model reaches SOTA values of *prec*@1 = 50*.*7% and *prec*@5 = 41*.*6%. There are also alternative models with a tree-based search, which are able to increase recall rates and reduce effort [22].

There is a great similarity of extreme multilabel classification with text retrieval, which is covered in Sect. 6.1. This group of text applications has seen a large progress in recent years. For dense retrieval the query and the document representations are encoded by a BERT model, and the documents with largest cosine similarity are returned. Probably many approaches from this field may be used for text classification.

#### *5.1.3 Few- and Zero-Shot Classification*

Large autoregressive language models like GPT-2, GPT-3, Gopher and PaLM have acquired an enormous amount of information about facts and language by pre-training. They can be instructed to classify a text by a few examples [76], as described in Sect. 3.6.3. Figure 5.1 provides an example prompt for the classification of a text by sentiment [91]. This means that no additional fine-tuning dataset is required, but only a prompt with a few examples. In the same way the pre-trained Gopher model [85] was applied to a comprehensive set of about 150 benchmark tasks, which require the generation of answers using few-shot instructions. Similar to other autoregressive models it may predict class labels for documents (Sect. 2.2.5). As the results show [85, p. 56], Gopher is often able to outperform conventional PLMs fine-tuned on the domain. Therefore, classification by instruction seems to be a viable alternative, if a large autoregressive PLM such as GPT-3, Gopher or GPT-Neo is available.

Recently, the *RAFT* [3] benchmark was released. RAFT is specifically designed for evaluating few-shot performance in text classification tasks. It covers 11 realworld datasets, 8 of which are binary classification, two contain three classes, and one contains 77 classes. Each task comes with natural language instructions and 50 labeled training examples. An example benchmark is "Label the sentence based on whether it is related to an adverse drug effect. Sentence: No regional side effects were noted. Label: not related. ...". A prompt contained less than 50 examples. The performance is measured by an average F1 over all 11 tasks. On these RAFT benchmarks BART yields an F1 average of 38.2%, GPT-Neo (2.7B) achieves 48.1%,

AdaBoost decision trees 51.4%, and GPT-3 (175B) scores 62.7%. Humans achieve an average F1 of 73.5%.

**PET** [90] asks users to specify one or more patterns that convert an input example *x* into a *cloze prompt* (Sect. 2.1.2) so that it can be processed by a masked language model like BERT. In addition, users must describe the meaning of all output classes. This is done with a "verbalizer" that assigns a natural language expression to each output *y*. Multiple verbalizers may be specified for the same data. An example is "I really enjoyed this movie. It was [MASK]." and "I really enjoyed this movie. Question: Is this a positive movie review? Answer: [MASK]." for the text "I really enjoyed this movie". The PLM is then trained to maximize *p(y*|*x)* for observed pairs. PET achieves a new state of the art on RAFT with an average F1 of 82.2% and performs close to nonexpert humans for 7 out of 11 tasks.

Foundation Models can also be used to generate new data for a text classification task. If, for example, input for a restaurant classification task is required, the model can be prompted to generate a new restaurant review for a specific label Sect. 3.6.6. In this way training data for fine-tuning a model can be created.

#### **Available Implementations**


#### *5.1.4 Summary*

For document classification, a PLM that has been pre-trained with a large set of documents is usually fine-tuned to solve a specific classification task. Typically, the embedding of a particular token such as [CLS] is used as input to a logistic classifier. This setup has outperformed all previous bag-of-word classifiers such as the SVM. Specialized PLM variants like XLNET or ALBERT show a higher performance because of their more effective pre-training. For longer documents, suitable models like BigBird yield good results. Identifying hate speech can be considered as a classification task, where good results are achieved with standard models such as BERT and RoBERTa.

The situation is different for multi-label classification, where several categories can be correct for one document. Here, tree-like classifiers in combination with contextual embeddings show good results. By the tree a small number of candidate classes can be selected reducing the training and execution times. Extreme multilabel classifications, such as matching product descriptions to related product descriptions, are close to a document retrieval tasks and can benefit from techniques developed in this area, e.g. dense retrieval by DPR.

Large pre-trained autoregressive language models like GPT-3, Gopher and PaLM may be instructed by few-shot learning to solve classification tasks. Recent approaches achieve a performance close to humans. Not long ago an API has been released which allows to pre-train GPT-3 and adapt it to specific data and specific classification tasks (Sect. 3.6.2). A simpler alternative is InstructGPT, which can be easily directed to perform a classification, e.g. a sentiment analysis (Sect. 3.6.5). However, a formal evaluation of the performance of this approach is not yet available, as the model would have to process the training data.

While PLMs have achieved promising results on demanding benchmarks, most of these models are not interpretable. For example, why does a model arrive at a particular classification? Why does a model outperform another model on one dataset, but performs worse on other datasets? Although the mechanisms of attention and self-attention provide some insight into the associations that lead to a particular outcome, detailed investigation of the underlying behavior and dynamics of these models is still lacking (Sect. 2.4.5). A thorough understanding of the theoretical aspects of these models would lead to a better acceptance of the results.

#### **5.2 Word Sense Disambiguation**

In nearly all languages the same word may express different concepts. An example is the word "set", which may be a verb, an adjective, or a noun and can be interpreted as 'a group of things', a 'scenery', a mathematical concept, a sports term, etc. The WordNet [62] lexical database lists 45 different senses for this word. *Word sense disambiguation* (*WSD*) aims to distinguish these different meanings and annotate each word with its sense. It can be treated as a classification task, where each word is assigned to a sense of a sense inventory such as WordNet. The contextual embeddings generated by PLMs offer a way to identify these meanings. Bevilacqua et al. [13] provide a recent survey of WSD approaches.

WSD can be used for a number of purposes. A traditional application is search, where the different senses of the same word are distinguished in the query. *Lexical substitution* [13] aims to replace a word or phrase in a text with another with nearly identical meaning.

#### *5.2.1 Sense Inventories*

WSD obviously depends on the definition of senses, which have to be assigned to the words. The main sense inventory for WSD in English is *WordNet* [62]. It consist of expert-made *synsets*, which are sets of synonymous words that represent a unique concept. A word can belong to multiple synsets denoting its different meanings. Version 3.0 of WordNet covers 147,306 words (or phrases) and 117,659 synsets. WordNet is also available for languages other than English through the *Open*

*Multilingual WordNet* project [17]. *Wikipedia* is another sense inventory often used for *Entity Linking* (Sect. 5.3.3), where a person, a concept or an entity represented by a Wikipedia page has to be linked to a given *mention* of the entity in a text. *BabelNet*  [71] is a mixture of WordNet, Wikipedia and several other lexical resources, such as Wiktionary [111] and OmegaWiki [75]. It is highly multilingual covering more than 500 languages.

WordNet's sense inventory is often too fine-grained. For example, the noun "star" has eight meanings in WordNet. The two meanings referring to a "celestial body" distinguish only whether the star is visible from earth or not. Both meanings are translated in Spanish as "estrella", so this sense distinction is useless for this translation. It has been shown that for many tasks more coarse-grained sense inventories are better [81].

The best WSD algorithms use PLMs pre-trained on large document corpora. Through fine-tuning, they are trained to assign senses from the available sense inventory. In some cases, nearest neighbor operations are employed to measure the distance between embeddings and determine the most appropriate sense.

#### *5.2.2 Models*

**GlossBERT** [33] employs a pre-trained BERT encoder. Its fine-tuning input is both the context sentence (where the word is used in the specific sense) and the *gloss*  (a sentence defining the meaning of the word). GlossBERT is trained to predict whether the gloss correctly describes the use of the target word. The *SemCor3.0* [61] benchmark is annotated with WordNet senses. GlossBERT achieves a new SOTA of 77.0% F1 on this data.

**EWISER** [12] expresses WSD as a simple *Word annotation* task (Sect. 2.1.3), where a sense label is assigned to each word. It starts with an average of BERT embeddings for each word *vt* from different contexts and transforms them with a linear layer and the *Swish* [86] activation function *f (x)* = *x* ·sigmoid*(βx)*. For each combination of a word and a part-of-speech a set *S(vt)* of possible word senses and hypernyms is determined similar to [78]. Then the approach computes probabilities that a word belongs to a synset in *S(vt)*. By this approach the prediction takes into account which WordNet senses are possible for a word. It achieves a new SOTA of 80.1% on a combination of WSD benchmarks. This value is also an estimated upper bound on human inter-annotator agreement [69], showing that WSD is on par with humans. The paper lists the results for a number of alternative approaches. The **BEM** model [15] is a similar system yielding comparable accuracy. A detailed analysis of how PLMs (especially BERT) capture lexical ambiguity can be found in [52]. The authors show that the embedding space of BERT covers enough detail to distinguish word senses.

**MuLaN** [9] is based on a multilingual list D of *synsets* in different languages. For example, D may contain the synset corresponding to the "fountain" meaning of "spring", which is expressed in different languages as "QuelleDE", "springEN", "fountainEN", "manantial ES", "brolladorCAT", "sourceFR", "fonte IT", and "sorgente IT". The semantic repositories *WordNet* [62] and *BabelNet* [71] are employed to create D. MuLaN has the task to annotate an unlabeled corpus *U* in the target language with senses using a corpus *L*lab in the source language (e.g. English) as input, which is annotated with senses from D . This is done in the following steps:


As a labeled corpus *L*lab a union of *SemCor* [63] and the *WordNet Glos Corpus*  (*WNG*) [46] is used, which are annotated with senses. As unlabeled corpus *U* the Wikipedia is used for Italian, French, Spanish and German. When tested on *SemEval-13* [70] and *SemEval-15* [66], MuLaN is the best system to annotate words with senses in the four languages with F1-values above 80%. An important advantage of MuLaN is that it is able to transfer sense annotations from highresource to low-resource languages.

**Escher** [8] reformulates WSD as a span prediction problem. The input to the model is a sentence with a target word and all its possible sense definitions. The output is a text span identifying the gloss expressing the target words most suitable meaning. As an example consider Fig. 5.2 with the input sentence "<s> The bully had to <t> back down </t>. </s>" where the target word is enclosed in "<t>" and "</t>". Subsequently, two glosses are appended.

The span is predicted similar to Sect. 2.1.3 by separately computing the probability for the first and last token of the span covering the correct gloss. In the example the sentence "Move backwards from a certain position." is selected as span, which describes the correct sense. By lowering the prior probability of the most

**Fig. 5.2** Escher [8] takes as input a sentence, where the target word "back down" is enclosed by "<t>" and "</t>". The most probable sense of the target word is indicated by the sentence selected by span prediction. A high probability of a span start is indicated by "[" and a high probability of the span end is indicated by "]" 

frequent sense for a word the approach is able to reduce the most frequent sense bias. Escher uses BARTLARGE (Sect. 3.1.3) as PLM architecture, as it is effective for reading comprehension. The output of its last decoder layer is used to represent the input tokens and to compute the start and end token distributions. On a number of SemEval datasets [66] Escher has higher F1-scores compared to its competitors and this difference is statistically highly significant. Best results are achieved for nouns and adjectives with F1-values *>* 83%, while for verbs the F1-value is only 69.3%.

**ConSec** [10] determines the sense of a token by considering not only the context words, but also the senses assigned to the neighboring words. It is based on an extension of DeBERTa, a BERT variant with superior performance (Sect. 3.1.1). ConSec uses WordNet example sentences with annotated meanings (glosses) as additional training data. The approach yields a SOTA of 83.2% F1 when applied to the *SemCor3.0* benchmark [61].

#### **Available Implementations**


#### *5.2.3 Summary*

WSD can be handled as a classification task, where each word is assigned to a number of possible meaning classes. Often WordNet is used as the sense inventory.

GlossBERT compares the contextual embedding of a word with the embedding of a word in an example sentence (gloss) of WordNet. EWISER and MULAN directly work on the synsets of WordNet and capture the sets of possible senses and hypernyms. They are able to annotate senses in four languages with an F1-value above 80%. Escher reformulates WSD as a span prediction problem increasing F1 to 83%. ConSec takes into account the senses of nearby tokens and achieves a similar performance.

As WSD models get better, there is a need for more demanding benchmark datasets, which possibly may be generated by adversarial techniques. Moreover, there is a trend to WSD models which are more robust to domain shift and can cope with text from social media documents. To advance WSD it is necessary to extend sense-annotated data, especially for rare senses. In addition, multilingual WSD systems may be constructed which require large-scale multilingual WSD benchmarks. There are tendencies in WSD to do away with the fixed sense inventory and to distinguish the senses in other ways, e.g., in a lexical substitution task or by generating the definition of a word in a particular context.

An opportunity is the integration of WSD with entity linking (Sect. 5.3.3), where the model is required to associate mentions with entries in a knowledge base such as Wikipedia. As WSD systems work fairly well now, it would be possible to combine them with other applications like question answering or dialog systems. It has to be tested, whether an explicit inclusion of WSD is able to generate better results. For retrieval tasks, WSD has been superseded by embedding-based methods (Sect. 6.1), which provide a better hit rate.

#### **5.3 Named Entity Recognition**

*Named entity recognition* (*NER*) refers to the task of tagging *mentions* of *named entities*, such as persons, organizations and locations in texts. Labeled datasets for NER exist across many domains, e.g. news, science and medicine [72]. Typically these datasets are annotated in the *IOB2 format*, which, for instance annotates the first token of a person with B-per and all other tokens of that entity with Iper. The O-tag is used for all tokens outside of entity mentions. An example is "U.N.B-org official O PeterB-per EkeusI-per headsO forO BagdadB-loc." NER involves the prediction of these tags for each token, i.e. the suffixes in the prior example. Therefore, it can be considered as a classification task, where a tag is assigned to each token. A standard dataset for NER is the CoNLL-2003 dataset [89], which contains English resp. German news texts with annotations for persons, organizations, locations, and miscellaneous names. Surveys on NER are provided by Li et al. [48], Nasar et al. [68] and Bose et al. [18].

NER is particularly useful in areas with a highly specialized vocabulary. Examples include the fields of healthcare or electromobility, where many thousands of publications are released each year. Since few experts understand the terminology,

NER systems are particularly valuable for identifying publications on specialized topics. Of course, the NER types must be adapted to each area.

In the following section, we present approaches to ordinary NER where each word can have a single entity type. Named entities can also be nested, e.g. "[[UK] gpe Embassy in [France] gpe]facility". This case is discussed in the second section. Even more challenging is the mapping of a named-entity phrase to the underlying unique entity in a knowledge base or ontology, e.g., a person. This is called entity linking and is discussed in the third section.

#### *5.3.1 Flat Named Entity Recognition*

In *flat named entity recognition* each token corresponds to at most one named entity. **BERT** can be fine-tuned to NER by predicting tags for each token using a logistic classifier (Fig. 2.5) as a final layer. For this setup BERTLARGE yielded 92.8% F1 value on the CoNLL-2003 test data. While the F1-values for persons and locations were higher (≈ 95%), the F1-value for miscellaneous names (78%) was much lower, as these entities form a vaguely defined class.

**LUKE** [117] treats words and entities in a given text as independent objects, and outputs contextual embeddings of tokens and entities. The model is based on RoBERTa and trained to predict randomly masked words and entities in a large entity-annotated corpus derived from Wikipedia. In this way, it obtains a lot of information on the relation between entities in the text. It contains an entity-aware self-attention mechanism that is an extension of BERT's self-attention mechanism and takes into account embeddings, which indicate if a token represents text or an entity. It yields an F1-value of 94.3-F1 for CoNLL-2003, which is near-SOTA.

**ACE** [106] builds on the assumption that weighted sums *<sup>i</sup>*∈*<sup>I</sup> Ai* <sup>∗</sup> *emb(vi)* of different embeddings *emb(vi)* of tokens *vi* yield better results than single embeddings. A controller samples a subset *I* from a set of eight embeddings (e.g. BERTBASE, GloVe, fastText, etc.) and a NER model is trained and returns an accuracy score. The accuracy is treated as a reward signal in a reinforcement setting using the policy gradient algorithm ([112]) to select an optimal subset *I* . As NER model a BiLSTM model (Sect. 1.6) with a final CRF-layer was chosen. A CRF (Conditional Random Field) [100] is able to model the probabilistic relation between the tags in detail. The fine-tuned model reaches a SOTA F1-score of 94*.*6% for CoNLL-2003.

**KeBioLM** [126] is a biomedical pre-trained language model aiming to improve NER by including additional knowledge. The authors extract 660M entities from the *PubMed corpus* [73] with abstracts of biomedical literature and link them to the UMLS knowledge base that contains more than 4M entities and their synonyms as well as relations. They train a variant of BERT on the PubMed data and explicitly generate embeddings for entities. Relation information is included by the TransEmechanism (Sect. 3.4.1). The joint loss function is a mixture of loss functions for masked language modeling, entity detection, and entity linking. The *JNLPBA*

*benchmark* contains 2000 PubMed abstracts with molecular biology-related entities. KeBioLM reaches a SOTA of 82.0% F1 on JNLPBA. This shows that pre-training on domain texts and the inclusion of additional knowledge can improve NER results.

*Retrieval* is a way to enhance the context a PLM may use for NER. Wang et al. [107] query a search engine with the input text that should be tagged. They rank (Sect. 3.4.5) the returned results by the similarity of RoBERTa embeddings and concatenate the top ranked results and the input text. This is fed into a variant of RoBERTa to generate token embeddings. As the model can exploit the attention to the retrieved texts, the generated embeddings are potentially more expressive. The results on CoNLL 2003 indicate that retrieval can increase the F1-value about 0.5% and could be combined with current SOTA-models.

#### *5.3.2 Nested Named Entity Recognition*

Often named entities have an internal structure. An example for such *nested entities*  is the sentence "Last night, the [[Chinese] gpe embassy in [France] gpe]facility was closed." In this case a single token may have several entity tags and the NER task has to be formulated differently.

**MRC** [50] treats nested NER as a question-answering task. For example, the extraction of entities with a "location" label is formalized as the question: "Which locations are mentioned in the text?" The questions are formulated using templates that reflect the annotation guidelines. When these questions are answered for each entity type, overlapping named entities can be detected. MRC uses BERT's span prediction approach (Sect. 2.1.3) to mark the beginning and end of spans in the token sequence for an entity type. In addition, MRC predicts the start and the end of each entity to allow that there are overlapping entities of the same type.

Nested entities are common in the medical domain. The *Genia Corpus* [43] contains entity annotations for proteins, viruses, DNA, RNA and many more, with 17% of the entities being nested. MRC achieves a SOTA of 83.8% F1 on the Genia benchmark. The ACE-2005 benchmark [104] contain diverse nested entities like persons, facilities, or vehicles with an overlap of 22%. MRC reached an F1-value of 86.9% for ACE-2005. A similar approach [125] also predicts spans of different entities and yields 85.4% for ACE-2005. A two-stage algorithm called Locate and Label is proposed by Shen et al. [93], who first extract candidate entities and then categorize them in a second step. They yield 86.7% for the nested NER on ACE-2005 using BERT or one of its variants.

Instead of using a BERT model pre-trained on general documents, **PubMed-BERT** [102] pre-trains its BERT model with 100M parameters exclusively on 21 GB medical texts from PubMed. PubMedBERT achieves 86.3% F1 for NER on the *BLURB benchmark* [31]. The model also yields SOTA scores for other task like classification and relation extraction summarized in an average score of 82.9%. This result strongly supports pre-training on domain-specific data. **BioELECTRA** [42] is a biomedical domain-specific language encoder model that adapts ELECTRA

(Sect. 3.1.1) for the Biomedical domain. ELECTRA employs a sample-efficient 'replaced token detection' technique for pre-training, which causes the model to include an enormous amount of information from the training data. BioELECTRA is pre-trained on PubMed and PubMed Central full-text medical articles. For NER, it arrives at the best score with 86.7% F1-value on the BLURB benchmark [31]. The model also yields a similar score of 82.6% as PubMedBERT for the other BLURB tasks.

#### **Available Implementations**


#### *5.3.3 Entity Linking*

After identifying a named entity in a text (*entity mention*), one often wants to disambiguate it, i.e. assign the mention to a unique entity in a KB or ontology. This involves unifying different writings of an entity name. To attach the corresponding facts and relation to the same entity, it is important to link the different writings of a name, e.g. "Joe Biden was elected as 46th president of the United States of America" and "President Biden was born in Scranton Pennsylvania". Note that there exist about 35 writings for the name "Muammar Muhammad Abu Minyar al-Gaddafi", e.g. "Qadhafi", "Gaddafi" and "Gadhafi" in addition to versions with the different first names. *Entity Linking* approaches aim to solve this problem.

Entity linking is useful for tasks such as knowledge base population, chatbots, recommender systems, and question answering to identify the correct object or entity referred to. It is also required as a preprocessing step for models that need the entity identity, such as KnowBERT [80] or ERNIE [99] (Sect. 3.4.1). Early approaches rely on semantic embeddings to match entity mentions belonging together [82]. Modern procedures use contextual embeddings to characterize the entity mentions. Sevgili et al. [92] provide a comprehensive survey of Deep Learning based entity linking approaches. They sketch the general solution architecture of entity linking approaches as shown in Fig. 5.3 and compare different methods.

**BLINK** [113] follows the scheme of Fig. 5.3. First entity mentions together with their types are extracted from a text by NER. Then it uses a BERT model to compute embeddings for mention contexts and the entity descriptions in the KB. This also involves the normalization of entity names. Using an efficient approximate

**Fig. 5.3** Entity Linking includes the three steps entity recognition, which identifies entity mentions in a text, candidate generation generating possible entities for the mention using the KB, and entity ranking, computing a similarity score between the candidates and the mention. Image adapted from [92], reprinted with kind permission of authors

*k*-nearest-neighbor indexing scheme FAISS [40] for embeddings (Sect. 6.1.4). FAISS is able to retrieve the best matching entity candidates from the KB with little computational effort. This approach is identical to dense retrieval by DPR (Sect. 3.4.5). Each retrieved candidate is then examined more carefully with a cross-encoder that concatenates the input context, the mention and entity text and assigns a score to each candidate entity. Finally, the candidate with the highest score is selected. Although no explicit entity embeddings are computed, the approach achieves SOTA on the *TACKBP-2010 benchmark* [29] with an accuracy of 94.5%. A very similar approach is chosen by **EntQA** [130], which also exploits a retrieverreader architecture and yields competitive results on several benchmarks.

**GENRE** [25] departs from the common solution architecture to most entity linking approaches and uses the encoder-decoder model BART (Sect. 3.1.3) to disambiguate entities. This model has to recover text corrupted by a number of different approaches during pre-training and therefore gathers a lot of knowledge about language. The model is fine-tuned to generate disambiguated named entities. For example, the sentence "In 1503, Leonardo began painting the Mona Lisa." is translated to "In 1503, [Leonardo](Leonardo da Vinci) began painting the [Mona Lisa](Mona Lisa).", where "[Leonardo](Leonardo da Vinci)" and "[Mona Lisa](Mona Lisa)" are the unique headings of the corresponding articles in Wikipedia. GENRE uses a constrained BEAM search for decoding, which either copies the input text or generates a unique Wikipedia entity name. In addition, GENRE can perform mention detection and end-to-end entity linking by associating a mention with the corresponding KB entity (e.g. the Wikipedia article). On six different benchmarks, GENRE achieves an average F1-value of 88.8% and outperforming BLINK, which scores 77.0%. In addition, GENRE has a smaller memory footprint (2.1 GB) than BLINK (30.1 GB). Finally, the model has a tendency to copy the mention exactly, which is helpful for new, unseen named entities.

**Fig. 5.4** BERTLARGE can be fine-tuned to predict masked 'entity tokens' taking into account the corresponding text. During application successively the entities with highest probability are assigned. In this way, the joint probability of entities can be exploited [118]

**EntMask** [118] is similar to LUKE (Sect. 3.4.4) and learns to predict masked entities. To disambiguate new mentions, the authors use local contextual information based on words, and global contextual information based on already disambiguated entities. Their model is trained to jointly produce embeddings of words and entities and is also based on BERTLARGE. For fine-tuning 30% entities corresponding to Wikipedia hyperlinks are masked randomly and have to be predicted as shown in Fig. 5.4. During application the model predicts an entity for each mention, and from the unresolved mentions actually assigns the mention with the highest probability as 'observed'. In this way, this assignment can influence the prediction for the remaining mentions, introducing a global perspective. On a number of benchmarks the approach yields roughly similar results to GENRE, with a small advantage on a few benchmarks.

#### **Available Implementations**


#### *5.3.4 Summary*

It is well known that named entities play a crucial role in understanding the meaning of a text. Thousands of new named entities appear every day, requiring special effort to interpret their sense. Due to the availability of contextual embeddings in PLMs Named Entity Recognition (NER) could increase F1-value on the CoNLL 2003 benchmark from 85% to 94.6%, dramatically reducing errors. The standard approach is token annotation by BERT, which marks each token with its corresponding entity type. Higher performance can be achieved by treating named entities as special tokens (LUKE), combining different kinds of embeddings (ACE), or using retrieval approaches based on embeddings. Empirical evaluations demonstrate that it is extremely important to train the underlying PLM on domain texts, e.g. from the medical domain. Single tokens or compounds can belong to multiple entity types at the same time. For this, nested NER question-answering approaches can be used to mark token spans as belonging to an entity type. Again training on domain texts is essential.

In Sect. 5.4.4 approaches for joint entity and relation extraction are presented. The approaches described there can also be used for NER alone and promise high performance. An example is REBEL, which uses the BART encoder-decoder to translate the input sentence to a unique representation of the covered entities and relations.

Entity linking aims to map an entity mention to the underlying unique entity in a KB. One approach exploits the retriever-reader architecture to find entity candidates from a knowledge base (BLINK, EntQA). Subsequently, a reader module scrutinizes candidates and the mention to arrive at a final assignment. An alternative is GENRE's encoder-decoder architecture, which translates entity mentions to unique entity names. Finally, a BERT model can determine self-attentions between token embeddings and entity embeddings and exploit this to predict unique entities contained in a text.

The majority of entity linking models still rely on external knowledge like Wikipedia for the candidate generation step. However, this is not sufficient when identifying a person who is not a celebrity. In this case we have to perform a search in the web or social media to find information. As retrieval-reader approaches gain popularity, this may be possible in the future. It turns out that NER and entity linking should be performed jointly, i.e. assignments should take into account each other to increase accuracy.

#### **5.4 Relation Extraction**

After identifying relevant entities in a sentence, a crucial part of information extraction is often the extraction and classification of relations between these entities. This is useful, for example, when we automatically want to populate databases


**Table 5.3** Language analysis tasks based on relation extraction [4, p. 10]. Underlining indicates phrases annotated by the model

or knowledge graphs with linked information. Table 5.3 contains examples of language analysis tasks based on relation extraction that are discussed in this section. Instances include coreference resolution, i.e. finding different mentions of an entity in the same text, aspect-based sentiment analysis, which links phrases in a text to opinions about them, or semantic role labeling, which identifies the function of a phrase for a predicate in a sentence. Because entity linking associates mentions of entities with the underlying unique object or person in an ontology, it differs from relation extraction. A survey on prior work in relation extraction is given by Nasar et al. [68].

#### *5.4.1 Coreference Resolution*

A first type of relation extraction is *coreference resolution*, whose goal is to establish a relation between all entity mentions in a text that refer to the same real-world entities. As an example, consider the sentence "I voted for Biden because he was most aligned with my values", she said. where "I", "my", and "she" refer to the speaker, and "Biden" and "he" pertain to Joe Biden. Due to the combinatorial number of subsets of related phrases, coreference analysis is one of the most challenging tasks of NLP. A survey of coreference resolution is provided by Stylianou et al. [98].

**SpanBERT** [41] is a version of BERT, which predicts contiguous subsequences of masked tokens during pre-training, and therefore accumulates knowledge about spans of words (Sect. 3.1.1). The authors consider all possible spans of text and identify relevant mentions spans. In parallel, for each span *x*, the preceding spans *y*

are examined, and a scoring function estimates whether the spans refer to the same entity.

This scoring function is defined as *s(x, y)* = *sm(x)* + *sm(y)* + *sc(x, y)*. Here *sm(x)* and *sm(y)* measure how likely *x* and *y* are entity mentions. *sc(x, y)* determines how likely *x* and *y* refer to the same entity. As input from a span, the scoring function gets the output embeddings of the two span endpoints and a summary of the tokens embeddings of the span. The probability that *y* is coreferent to *x* is computed as *p(y)* = exp*(s(x, y))/ y* <sup>∈</sup>*<sup>Y</sup>* exp*(s(x, y ))*. In this way, subsets of spans mentioning the same entity are formed. During the iterations of the approach, the span definitions may be refined, and an antecedent pruning mechanism is applied to reduce the number of spans to be considered. *OntoNotes*  [109] is a corpus of 1.5M words comprising various genres of text with structural information, e.g. coreference. After fine-tuning on OntoNotes, Span-BERT achieves a SOTA result of 79.6% F1-value on the test set. Dobrovolskii [27] propose a variant which performs its analysis on the word level thus reducing the complexity of the task. It raises the SOTA on OntoNotes to 81.0%.

**CorefQA** [114] solves coreference resolution as a question-answering problem. A first stage considers all spans up to a maximum length as potential mentions. The authors use a SpanBERT model to compute embeddings for all tokens. To reduce the number of mentions, a proposal module combining the start and end embeddings of spans is pre-trained to predict relevant mentions. Subsequently, each mention is in turn surrounded by special tokens and the network is trained to mark all coreferent spans similar to the question-answering fine-tuning of BERT (Sect. 2.1.3). To reduce the number of computations only a limited number of candidates in one direction is considered. The mention proposal and mention clustering can be trained endto-end. On the coreference benchmark CoNLL 2012 [84] the approach improves SOTA significantly to 83.1% F1-value. Toshniwal et al. [103] extend this approach by tracking only a small bounded number of entities at a time. This approach can reach a high accuracy in coreference resolution even for long documents.

#### **Available Implementations**


#### *5.4.2 Sentence-Level Relation Extraction*

There are various types of relations which can be extracted, e.g. in the sentence "Goethe succumbed to his suffering in Weimar" the "died-in" relation relates a person ("Goethe" ) to a location ("Weimar"). In this section we assume that entities

have already been extracted from a sentence by NER (Sect. 5.3). Therefore, NER errors will increase the errors for relation extraction.

**SpanBERT** [41] is particularly suitable for relation extraction, since entity mentions often span over multiple tokens, and are masked by SpanBERT during pre-training (Sect. 3.1.1). For fine-tuning the model gets one sentence and two spans with possible relation arguments as input, which are replaced by their NER tags. An example is "[CLS] [SUBJ-PER] was born in [OBJ-LOC] , Michigan, . . .". The final [CLS] embedding is input to a logistic classifier, which predicts one of the 42 predefined relation types, including "no relation". *Re-TACRED* [97] is a largescale relation extraction dataset with 120k examples covering 41 relation types (e.g., per:schools-attended and org:members) and carefully checked relation annotations. SpanBERT showed good performance on Re-TACRED with 85.3% F1-value [95].

**RoBERTa** (Sect. 3.1.1) can be used to generate token embeddings for relation extraction. Zhou et al. [135] evaluate various entity representation techniques. They use RoBERTaLARGE to encode the input text by embeddings of the last layer. The embeddings of the first token in each span of relation argument mentions are used to represent these arguments. These are concatenated and adopted as input for a softmax classifier. It turns out that enclosing an entity and adding its type with special tokens yields the best results on the Re-TACRED dataset with 91.1% F1 value.

**Relation-QA** [24] rephrase the relation classification problem into a question answering problem. Consider the sentence *s* = "Sam Brown was born in 1991." with the extracted entities "Sam Brown" and "1991". Then the authors create two queries, such as "When was Sam Brown born?" and "Who was born in 1991?". They fine-tune ALBERT (Sect. 3.1.1) to answer these queries by marking the spans containing the desired entity. If no span is returned the relation does not hold. The approach achieves an F1-value of 74.8% for TACRED, an older version of ReTACRED with many annotation problems. **RECENT** [55] extends SpanBERT and trains more than one relation classification model, i.e. one classifier for each different pair of entity types. This restricts the possible output relation types and helps to increase performance. On TACRED the approach yields a SOTA F1-value of 75.2%.

#### *5.4.3 Document-Level Relation Extraction*

Especially for larger documents, the assumption that relations occur only inside a sentence is too restrictive. Therefore, some models check for relations on the document level. When relation arguments are in different sentences the corresponding entities are often only referred to via coreferent mentions. Therefore, we assume in this section that entities have been extracted and grouped into clusters denoting the same entity by coreference resolution (Sect. 5.4.1). Obviously the errors of coreference resolution will increase the final relation extraction errors.

**SSAN** [115] (Structured Self-Attention Network) directly takes into account structural information such as coreference and cooccurrence of entity mentions for PLMs such as RoBERTa. The authors modify the self-attention computations in encoder blocks by adding specific biases, if two mentions refer to the same entity and/or are located in the same sentence. These biases are computed from the query and key vectors by a "transformation model" trained during fine-tuning. Therefore, the scalar products between keys and queries are modified depending on whether the corresponding tokens are coreferent, in the same sentence, or not. Entity embeddings are obtained via average pooling of token embeddings of the entity mention. For each pair *embi, embj* of entity embeddings the probability of a relation *r* is computed by a bilinear transformation sigmoid*(emb*- *<sup>i</sup> Wrembj )* with a trainable parameter matrix *Wr*.

*DocRED* [121] is a large benchmark of documents annotated with named entities, coreferences, and relations whose arguments may be located in different sentences. Using RoBERTaLARGE as base network, the authors achieve a SOTA of 65.9% F1 on DocRED. Using a special BERT version SciBERT [11] trained on scientific papers from Semantic Scholar, the algorithm also yields SOTA results for benchmarks with chemical as well as biological texts.

**ATLOP** [136] marks the start and end of a mentions by a special token and encodes a document by BERT resulting in embeddings for each token. The embedding of token at the mention start is used as the mention embeddings. An entity embedding is computed by pooling coreferent mentions. The first and the second argument entity embedding of a relation are transformed by different fully connected layers to *x*<sup>1</sup> and *x*2. Subsequently, the probability of a relation *r* for an entity pair is estimated by a sparse bilinear transformation sigmoid*(x*- <sup>1</sup> *Wx*2*)*. Trainable probability thresholds are used to decide if a relation holds. On the DocRED benchmark the model achieves an F1-value of 63.4%.

#### *5.4.4 Joint Entity and Relation Extraction*

Since NER and relation extraction are closely related tasks and relation extraction depends on the results of NER, it is a natural choice to model these tasks jointly.

**UniRE** [108] encodes entity and relation properties in a joint matrix, which has a row and a column for each text token. While named entities, e.g. PER, are marked on the diagonal, relations are matrix entries off-diagonal. If, for example, "David Perkins" lives in "California" the matrix entries in the rows of the "David Perkins" tokens and the columns of the "California" tokens are marked with the *PHYS* relation. Note that in this way asymmetric relations may be specified.

All words in the input are encoded using a BERT encoder and then a biaffine model is used to create a scoring vector for a pair *hi* and *hj* of embeddings

$$p(\mathbf{y}\_{l,j}|\mathbf{s}) = \text{softmax}\left( (\mathbf{h}\_{l}^{firs})^{\mathsf{T}} U\_{1} \mathbf{h}\_{j}^{sec} + U\_{2} [\mathbf{h}\_{l}^{firs}, \mathbf{h}\_{j}^{sec}] + b \right),\tag{5.2}$$

**Fig. 5.5** For a possible relation the PL-marker model marks the first relation argument by special 'solid' markers and the possible second arguments by 'leviated' markers outside the text. The latter get the same positions as the corresponding tokens, and do not influence the embeddings of normal tokens during attention computation. The marker embeddings are concatenated to compute the probability of the corresponding relation [122]

where *hf irst <sup>i</sup>* <sup>=</sup> FCL*f irst(hi)* and *hsec <sup>i</sup>* = FCL*sec(hi)* are fully connected layer transformations of the first and second relation argument respectively. The softmax function obtains a probability distribution over the entity and relation labels for all matrix cells. The model minimizes three losses, one based on the actual labels of each cell, one based on the knowledge that diagonal of entity labels should be symmetrical and one based on the fact that a relation label implies that respective entity labels must be present. *ACE 2005* [104] consists of text of various types annotated for entities, relations and events. On ACE 2005 UniRE yields an F1-value of 66.0% for joint entity and relation extraction, which is less than the current SOTA of 70.5%.

**PL-Marker** [122] investigate different types of mention encodings. For a possible relation it surrounds the first argument span (subject) by solid marker tokens. The possible second argument spans (objects) are marked by *leviated tokens Oi* and */Oi* outside the text (Fig. 5.5). These get the same position embeddings as the corresponding object spans in the text. Their attention connections are restricted, i.e they are visible to each other, but not to the text token and other pairs of markers. Therefore, depending on the subject span the object token embeddings can capture different aspects. For each pair of subject-object arguments, the corresponding embeddings are concatenated and used as input to a logistic classifier to estimate the probability of the possible relations (or 'no relation'). Pre-trained variants of BERT are fine-tuned with ACE 2005 to predict the relations. With a BERTBASE model of 105M parameters the approach yields an F1-value of 68.8% on the ACE05 benchmark. If ALBERTXXLARGE [45] with 235M parameters is used to compute the embeddings, the F1-score grows to 72.3%.

For NER, the PL-Marker model uses a similar approach. For each possible span in the input starting at token *vi* and ending at token *vj,j*≥*i*, leviated markers are created, which do not affect the embeddings of the normal tokens. Again the embeddings of the start and end tokens of a span as well as the embeddings of leviated markers are input for a logistic classifier computing the probability of the

**Fig. 5.6** For the training set the relation information on the left side is linearized to the representation on the right side. The REBEL model thus learns to translate the input text to this linearized representation [20]

different NE-types. The model uses an efficient 'packing' to reduce computational effort. On the CoNLL03 named entity benchmark, PL-markers with a pre-trained RoBERTaLARGE achieve an F1-value of 94.0, which is well below the current SOTA of 96.1% held by DeBERTa [19]. When the relation extraction employs the entity types and spans predicted by the PL-MARKER NER, the F1-value of the joint approach drops to 70.5%, which is SOTA for the ACE05 benchmark on joint NER and relation extraction.

**REBEL** [20] uses the encoder-decoder transformer BARTLARGE (Sect. 3.1.3) for joint entity and relation extraction that outputs each relation *(h, r, t)* triplet present in the input text. It translates a raw input sentence containing entities, together with implicit relations between them, into a set of triplets that explicitly refer to those relations. An example is shown in Fig. 5.6. Each relation in the text appears in the output according to the position of its first argument. An entity may be part of different relations, which are ordered according to the position of the second argument. This defines the order of relations in the linearized representation.

The pre-trained BARTLARGE with 400M parameters is first fine-tuned on a Wikipedia and WikiData training set with 220 relation types. Then it is fine-tuned a second time on varying benchmark datasets. On the *DocRED benchmark* [121] it achieves SOTA with an F-value of 47.1%. On the *New York Times dataset* it has a SOTA performance with 93.4% F1. On the ReTACRED benchmark it yields 90.4% F1 without the inclusion of entity type markers used by other approaches.

#### **Aspect-Based Sentiment Analysis**

*Aspect-based sentiment analysis*, also known as aspect-level sentiment analysis, feature-based sentiment analysis, or simply, aspect sentiment analysis, allows organizations to perform a detailed analysis of their member or customer feedback data. This ranges from analyzing customer reactions for a restaurant to evaluating the attitude to political statements made by a politician. An example is "The waiter1-aspect was very friendly1-positive, but the steak mignon2-aspect was

extremely burnt2-negative." Note that a sentence may contain different aspects and each sentiment has to be assigned to one aspect. A recent survey of aspect-based sentiment analysis is given by Zhang et al. [129].

**DeBERTa** (Sect. 3.1.1) is a powerful BERT-like model, which assumes that the aspects are already known. It employs a disentangled attention mechanism for computing separate attention scores between words and positions disentangling semantic (content) and syntactic (position) representation of the textual data. The objective is to determine the sentiment of each aspect of a given entity. The input consist of a text and an aspect, e.g. *x* ="[CLS] . . . nice video camera and keyboard . . . [SEP] keyboard [SEP]", where "keyboard" is a possible aspect span from the text [94]. The output embedding of [CLS] is used as input to a logistic classifier which generates the probabilities of three possible labels positive, negative, neutral. The model is fine-tuned on the *SemEval 2014 Task 4.2 benchmark*. It yields a mean accuracy for the Restaurant and Laptop data of 86.1%. There are much more complex approaches like **LSA** (local sentiment aggregation) [119] achieving a SOTA of 88.6% on this benchmark.

**GRACE** [54] aims at extracting aspects and labels simultaneously. It consists of a first BERTBASE module generating token embeddings of the input text, which are fine-tuned to mark aspects by IOB2 tags for each token. The resulting information is fed into a Transformer decoder to predict the sentiments (positive, negative, neural) for each token. This decoder uses a multi-head cross attention to include the information from the first aspect module. Again for each token embedding in the last layer a logistic classifier is used to compute the probabilities of sentiments. To make the model more robust, small perturbations for input token embeddings are used during training. Note that no masked cross-attention is necessary as the decoder is not autoregressive. In this way, the model is able to take into account the interactions between aspect terms when labeling sentiments. The model achieves 87.9% F1 score for aspect extraction for the laptop reviews from SemEval 2014 and a SOTA of 70.7% F1-value for the joint extraction of aspects and sentiments. On the restaurant reviews it yields an F1 of 78.1% and on a tweet benchmark 58.3% for joint sentiment extraction, again outperforming a number of other models.

#### **Semantic Role Labeling**

Semantic role labeling considers a predicate (e.g. verb) of a sentence and word phrases are classified according to their syntactic roles, such as agent, goal, or result. It can be used to determine the meaning of the sentence. As an example consider the sentence "They want to do more ." where "want" is the predicate, "They" is the agent and "to do more" is the object (thing wanted).

**Crf2o** [133] is a tree-structured conditional random field (treecrf) [28] using contextual embeddings of the input tokens computed by RoBERTa as input. The sequence *x* = *(x*1*,...,xT )* of inputs can be arranged in a tree *y* and gets a score, which is the sum of all scores of its subtrees *s(x, y)* = *<sup>t</sup>*∈*<sup>y</sup> s(x,t)*. Similar to dependency parsing, this can be used to model the dependency of phrases from the

predicate in semantic role labeling [87]. To generate all possible subtrees requires *T* <sup>3</sup> operations, which is very inefficient. The authors were able to reduce this effort using structural constraints. In addition, they could take into account the dependency between two branches of the tree, which generated a second order tree. During training the models maximize the probability of the provided tree structure of the training data for an input. *CoNLL05* [21] and *OntoNotes* [84] are two widely used benchmarks for semantic role labeling. For CoNLL05 the Crf2o yields an F1-value of 89.6% and for OntoNotes it achieves an F1-value of 88.3%, which both constitute a new SOTA. Note that this technique may also be used for *dependency parsing*  [132], which describes the syntactic structure of a sentence by a tree structure.

#### **Extracting Knowledge Graphs from Pre-trained PLMs**

A systematic way to extract knowledge from big language models has been demonstrated by Wang et al. [105]. Their **MaMa** approach consist of a match stage and a map stage. The match stage generates a set of candidate facts from the text collection exploiting the internal knowledge of a language model. Similar to TransE (Sect. 3.4.1) each fact is represented as a relation triple *(*head, relation, tail*)*, or *(h, r, t)*. A language model is used to generate tokens corresponding to *r* or *t*. As a condition, the *r* values should be contiguous text sequences and express frequent relations.

In the map stage the triples are mapped to related triples with appropriate relations. As an example *(*Dylan, is, songwriter*)* is mapped to *(*Bob Dylan.Q392, occupation.P106, Songwriter.Q753110*)* according to the Wikidata schema. This stage is related to entity linking discussed in Sect. 5.3.3. The reason for mapping to an existing KG schema is to make use of the high-quality schema designed by experts.

A subgraph of the generated relations is shown in Fig. 5.7. Compared to the SOTA information extraction system Stanford OpenIE [5] with 27.1% F1-value the approach yields 29.7% F1-value. The authors report that performance increases with model size because larger models can store more knowledge.

#### *Available Implementations*


**Fig. 5.7** A snapshot subgraph of the open KG generated by MAMA [105] using BERTLARGE from Wikipedia pages neighboring "Bob Dylan". The blue node and arrow represent the mapped facts in the Wikidata schema, while the yellow node and arrow denote the unmapped facts in the open schema. The correct facts that are new in Wikidata are visualized in yellow. Image source: [105, p. 6], with kind permission of the authors

#### *5.4.5 Distant Supervision*

Obtaining a large annotated dataset for relation extraction is a tedious task and often difficult due to privacy issues. Since much relational knowledge is stored in knowledge bases, Mintz et al. [65] proposed the *distant supervision* paradigm. The idea behind it is to collect all text mentions where two entities co-occur, which are in a relation in the knowledge base. Then it is assumed that for this mention pair the relation holds. Since this is not correct for all such mention pairs, many approaches aim to combat this 'noise'. One approach is *multi-instance learning*, which relaxes the original assumption that all text mention pairs represent the relation to the assumption that the relation holds for at least one pair [2, 137], or a specified fraction like 10% or depending on a score value. Take for example the entities "Barack Obama" and "Hawaii", which might be in a relation "born\_in" in a KB. Sentences obtained by searching for occurrences of these two entities could be "Obama was born in Hawaii" as well as "Obama was on family vacation in Hawaii", where only the former represents the relation and should be used for training.

**KGPool** [67] uses entity pairs obtained from a KB, but also attributes associated with them. The idea is to create representations of the entity nodes, the sentence in which they occur, and the attributes of the entity nodes in a knowledge base, such as their description, instance-of and alias attribute. All this information is embedded using word and character embeddings and bidirectional LSTMs and connected as a heterogeneous information graph. Next three layers of graph convolutional networks are used with readout layers. Only relevant attribute nodes are picked by using self-attention on the readout representations, calculating a softmax score and then filtering via a hyperparameter according to the scores. A dynamic mask

is created which pools out the less essential entity attribute nodes. Finally, all intermediate representations of both entities, the sentence and the readouts are each concatenated to form the final entity, sentence and readout representation. These representations together with relation representations are then passed through a fully connected layer with softmax activation to calculate the scores per relation. The *New York Times dataset* is a standard benchmark for relation extraction with distant supervision. KGPool achieves a SOTA precision@10 of 92.3%, which is the fraction of relevant results if the 'best' 10 of the matches are used.

#### *5.4.6 Relation Extraction Using Layout Information*

To understand a formal text, often the document layout has be taken into account in addition to its text. Especially in form-like texts, the positions of words and filled-in values are important. In Sect. 7.2 we will describe, how text and images can be simultaneously processed by one or more transformers to extract meaning from both media. In anticipation, we will use this ability of transformers to process multimodal inputs and additionally include layout information via 2-dimensional positional features. A comprehensive overview of progress in layout analysis is provided by Stanisławek [96]. We will focus on methods for key-value extraction in this subchapter. In the task of key-value extraction, documents are analyzed to extract printed values to written keys of interest. Sample applications are the automatic processing of invoices, in which keys are attributes such as invoice date or the total amount to be paid.

**ReLIE** [57] is a framework for key-value extraction from form-like documents. The candidate generation step has the purpose of finding all possible value candidates for a certain key, e.g. the value "1/16/2018" for the key "Date". Often these value candidates correspond to basic types such as numbers, amounts, dates, etc. and can be found via rule based matchers. Then a transformer-based scoring model is trained, to identify valid values among the extracted value candidates. To this end, embeddings are learned for the keys, the position of the value candidate and for neighboring tokens and their positions. Positions of a value candidate and each of its neighbors are described using the 2-D Cartesian coordinates of the centroids of their respective bounding boxes. Note that the text of the candidate value is not encoded to avoid overfitting. All embeddings are related to each other by selfattention in an autoencoder. The field embedding and the candidate embedding are then compared via cosine similarity and the resulting score is scaled into a range of [0*,* 1]. The model achieves an f1-score of 87.8% on key-value extraction for invoices and 83.3% for receipts.

**DocFormer** [6] consists of a CNN visual backbone and an encoder-only transformer architecture. Visual embeddings of the document are produced via a ResNet50 model and projected to the appropriate embedding size via a linear layer. Text tokens are contained in a bounding box and the top-left and lowerright position of each token bounding box are transformed to embeddings by two

different matrices. In addition, the height, width and distances between neighboring bounding boxes are encoded. The 2D-positional embeddings are enriched with absolute positions via 1D-positional embeddings. Separate spatial embeddings are trained for visual and textual features. The attention mechanism of the DocFormer is a modified version of the original attention mechanism. Separate attention scores are calculated for the visual and the textual representation of tokens. In addition to the key-query attention, the relative position embeddings of both query and key tokens are used to add relative position attentions as well as a spatial attention for both the visual and the textual embeddings. The spatial attention weights are shared between the visual and the textual representations.

DocFormer is pre-trained with three different pre-training tasks: multi-modal masked language modeling (MM-MLM), learn to reconstruct (LTR) and text describes image (TDI). In the MM-MLM task, tokens are masked and should be reconstructed by the model. In LTR, the model is tasked to reconstruct the image of a document, given the multi-modal representation. A smooth-L1 loss is used to calculate differences between the original and the reconstructed image. TDI requires a text-image matching task, in which the model has to predict for random samples whether the image and the text are aligned or not. The *FUNSD benchmark* [38] considers forms in 199 scanned documents, where tokens have to be assigned to a semantic key, such as 'question' or 'answer'. On FUNSD DocFormer reaches an F1-value of 84.6%, which was SOTA at publication time.

**LayoutLM3** [34] uses an image embedding method inspired by the Vision Transformer (Sect. 7.2.2). Each image is partitioned into 16 × 16 image patches similar to the Vision Transformer and linearly transformed to embeddings. As shown in Fig. 5.8 words and image patches are processed by the same autoregressive Transformer. For pre-training the model uses the masked language modeling task, masked image patches and word-patch alignment pre-training task. In the masked image patches task, image patches have to be reconstructed by the model. The wordpatch alignment task has to enable the model to learn alignments between textual and visual representations. The model should classify whether text and image patch of a token are aligned, i.e. both are unmasked, or unaligned, i.e. the image patch is masked. The *PubLayNet benchmark* [134] contains the document layout of more than 1 million pdf documents matched against the correct document structure. Here LayoutLM3 achieves SOTA with 94.5% mean average precision of bounding boxes. It outperforms DocFormer on the FUNSD key-value extraction tasks and other benchmarks. *LayoutXLM* is a recent multilingual version of LayoutLM2 [116].

**Fig. 5.8** LayoutLMv3 takes the linear projection of image patches and word tokens as inputs and encodes them into contextualized vector representations. LayoutLMv3 is pre-trained with discrete token reconstructive objectives of Masked Language Modeling (MLM) and Masked Image Modeling (MIM). Additionally, LayoutLMv3 is pre-trained with a Word-Patch Alignment (WPA) objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. "Seg" denotes segment-level positions. Image source: [34, p. 3], printed with kind permission of the authors

#### **Available Implementations**

• KGPool at https://github.com/nadgeri14/KGPool

#### *5.4.7 Summary*

Relation extraction has the task to evaluate the expressed relationship in the text with respect to specific entities. An example is the assessment of certain product characteristics by customers, which can help to improve the product or service. Given the massive amount of textual content, it is intractable to manually process the opinion information.

For simple cases, the relation arguments are know and relation extraction can be solved as a simple classification task using some BERT variant like RoBERTa,

DeBERTa, or SpanBERT. However, to actually use these models we have to extract the relation arguments in a prior step, which leads to an increased total error.

More challenging is the simultaneous extraction of relation arguments and the corresponding relation type, as these task depend on each other. UniRE annotates entities and relations in a joint matrix and introduces a corresponding bias into the self-attention computations. PL-marker marks the first relation arguments with special tokens and the second argument with so-called leviated tokens. These tokens have specific attention properties and are able to improve the performance on popular benchmarks. GRACE employs a specific encoder-decoder architecture where the encoder labels the relation arguments (aspects) and the decoder assigns relation tags to each token. REBEL uses the BART encoder-decoder to translate the input sentence to a unique representation of the covered relations.

Relation extraction models have been adapted to specific applications. GRACE has been tuned for aspect-based sentiment analysis and Crf2o to semantic role labeling. The latter uses contextual embeddings and determines the relation between predicate and corresponding phrases by an efficient TreeCRF. Finally, MaMa can be used to build a knowledge graph from extracted relations between entities.

Often the spatial layout of documents and web pages contains relevant information for the extraction of relation arguments. In this case, visual information from the document image can be exploited to arrive at a valid interpretation. This visual information can be included via the position of bounding boxes for keys and values, but also in the form of image patches, which are explored later with the image transformer.

All recent relation extraction approaches are based on PLMs. Most models use small BERT variants for their experiments. Therefore, it can be assumed that larger models will directly increase performance. In addition, Foundation Models like GPT-3 may be fine-tuned (Sect. 3.6.2) and probably will result in a higher accuracy. A related alternative is InstructGPT (Sect. 3.6.5), which can be easily directed to perform a relation extraction via question answering, e.g. "Who built the statue of liberty?" [77, p. 29]. However, it seems to be difficult to evaluate the performance of this approach with respect to some test data.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 6 Foundation Models for Text Generation**

**Abstract** This chapter discusses Foundation Models for Text Generation. This includes systems for Document Retrieval, which accept a query and return an ordered list of text documents from a document collection, often evaluating the similarity of embeddings to retrieve relevant text passages. Question Answering systems are given a natural language question and must provide an answer, usually in natural language. Machine Translation models take a text in one language and translate it into another language. Text Summarization systems receive a long document and generate a short summary covering the most important contents of the document. Text Generation models use an autoregressive Language Model to generate a longer story, usually starting from an initial text input. Dialog systems have the task of conducting a dialog with a human partner, typically not limited to a specific topic.

**Keywords** Question answering · Machine translation · Text summarization · Text generation · Dialog systems · Document retrieval

In this chapter we describe Foundation Models, i.e. large Pre-trained Language Models for generating new text in different application areas.



**Table 6.1** Language generation tasks illustrated by an example

• *Dialog systems* have the task of conducting a dialog with a human partner, typically not limited to a specific topic (Sect. 6.6).

Due to the large number of different approaches, we focus on representative models which exhibit a high performance at the time of writing. We review the current best techniques for each area, measured against appropriate benchmarks and taking into account the computational resources required. For standard models a link to the description in earlier chapters is provided. Examples for each application area are shown in Table 6.1.

#### **6.1 Document Retrieval**

*Information retrieval* (*IR*) uses computer systems to search databases for content. The resulting IR system is often called a *search engine*. Often, the user formulates a sentence or a *query* about to some topic, and the system is expected to return a sorted list of documents relevant to the query (*ad hoc retrieval*). Here we focus on

**Fig. 6.1** Retrieve-and-rerank architecture using PLMs. First, texts are retrieved from the document collection, usually with exact-match bag-of-words queries. These candidates are then reranked using PLM embeddings, e.g. from BERT. Image adapted from [123], reprinted with kind permission of authors

retrieving textual information from a stored collection of documents. In contrast to question answering approaches in Sect. 6.2, the system does not generate a direct answer to the query in natural language.

Former IR systems were *keyword-based*: all words contained in a document were stored in an *inverted index*. The retrieval algorithm searched the index to identify documents that contained the query words. Then, these documents were ranked according to the information content of each query word found in a document, e.g. measured by tf-idf or BM25 [186]. These two steps are shown in Fig. 6.1. A survey of earlier retrieval techniques is given by Abbasiyantaeb and Momtazi [2]. However, this approach had three major problems:


As an alternative, contextual embeddings were used to represent queries and documents. By identifying matching documents through comparison of contextual semantic representations, word meaning differences between documents and queries can be reduced and texts with synonyms, homonyms, and paraphrases can be retrieved. These models have achieved SOTA results on various retrieval benchmarks [137] and have recently been introduced in commercial search engines. They are therefore one of the most commercially important applications of PLMs to date.

#### *6.1.1 Dense Retrieval*

*Dense retrieval* methods encode text as an embedding vector with a fixed length much smaller than the text length. Whether a document is relevant to a given query is determined by the similarity of embedding vectors, which is computed by cosine similarity or inner products. Unlike question answering (Sect. 6.2), these models do not generate a direct natural language response to a search query, but return complete documents or text passages. Recently, dense retrieval methods based on PLMs outperformed their keyword counterparts when fine-tuned on a small set of in-domain relevance-labeled documents. Lin et al. [124] provide a comprehensive overview of retrieval systems with PLMs. Different approaches for dense retrieval can be distinguished and are covered in the next sections:


Only a very small selection of methods can be described, which should give an impression of the approaches currently used as shown in Table 6.2. In Sects. 6.2.2 and 6.2.3 retrieval techniques for question answering are discussed, which are even more powerful. A very comprehensive survey on PLMs for retrieval is provided by Lin et al. [124].

#### *6.1.2 Measuring Text Retrieval Performance*

There are a number of benchmark datasets used for training and comparing retrieval approaches. The *MS-MARCO* benchmark [16] is a large-scale collection created from about half a million anonymized questions sampled from Bing's search query logs. For the passage ranking task it contains a corpus of 8.8M passages with an average length of 55 words extracted from 3.6M web documents. The goal is to retrieve passages that answer the question. The training set contains approximately 500k pairs of queries and relevant documents, and another 400M pairs of queries and non-relevant documents. There is a development set and a secret test set with about each 7k queries. However, there is a discussion that the gold annotation of the MS-MARCO benchmark is biased to some extent [10].


**Table 6.2** Document retrieval models with their performance. Benchmarks (Sect. 6.1.2): MARCO: MS-MARCO [16], NQuest: Natural Questions benchmark [109], Wiki65K: long Wikipedia documents [247]

The *Natural Questions* (*NQ*) [109] contains questions with at least 8 words from real users to the Google search engine. It requires QA systems to read and comprehend an entire Wikipedia article, which may or may not contain the answer to the question. An example is the question "Where is blood pumped after it leaves the right ventricle?" The task is to retrieve a long answer, i.e. a paragraph from the page that answers the question, e.g. "From the right ventricle, blood is pumped through the semilunar pulmonary valve ...", or an indication that there is no answer. The task was designed to be close to an end-to-end question answering application. One to five answers are provided by human annotators. While the original Natural Questions benchmark was a reading comprehension task providing a number of evidence documents for each question, the *EfficientQA benchmark* [147] adapted this to open-domain QA by taking examples with up to five token answers and discarding the evidence documents.

Min et al. [146] note that over half of the queries in Natural Questions are ambiguous, with many sources of ambiguity such as event and entity references. They develop an *AmbigQA* with reformulated questions that yield a unique answer.

A simple evaluation measure is the *top-k accuracy*, the proportion of queries for which one of the *k* most likely answers returned is correct. More complex is the *mean reciprocal rank* (*MRR*), the inverse of the rank of the first correct answer and 0, if no correct answer was returned. If, for instance, the third answer is correct, the reciprocal rank is 1*/*3. The MRR for |*Q*| queries is

$$MRR = \frac{1}{|\mathcal{Q}|} \sum\_{l=1}^{|\mathcal{Q}|} \frac{1}{rank\_l}.\tag{6.1}$$

*MRR*@*m* indicates that always an ordered list of *m* documents is returned.

We may define *P r(i)* as the precision reached by the first *i* elements of the list of size *m*, i.e. the fraction of relevant documents of the first *i*. Then we may define the *average precision* as

$$AP = \frac{1}{m} \sum\_{i=1}^{m} Pr(i) \* rel(i) \qquad MAP = \frac{1}{|\mathcal{Q}|} \sum\_{j=1}^{|\mathcal{Q}|} AP\_j \tag{6.2}$$

where *rel(i)* = 1 if the *i*-th document is relevant and 0 otherwise. The *mean average precision* (*MAP*) is the average of AP over |*Q*| different queries.

#### *6.1.3 Cross-Encoders with BERT*

**monoBERT** [155] performs reranking based on a fine-tuned BERT classifier based on the embedding of the [CLS] token. Query and document are combined to the input "[CLS] *<*query*>* [SEP] *<*document*>* [SEP]". This is processed by a BERT fine-tuned on MS-MARCO, where the embedding of [CLS] in the last layer is used by a logistic classifier to predict the probability that the current document is relevant for the query. This output score is used for ranking (Fig. 6.2). Note that by this technique paraphrases like "symptoms of influenza include fever and nasal 

**Fig. 6.2** The monoBERT model uses a fine-tuned BERT model for ranking passages with respect to queries. The input contains the query concatenated with the passage. The [CLS] token embedding is trained to return the probability that the passage answers the query

congestion" and "a stuffy nose and elevated temperature are signs you may have the flu" may be identified.

On the MS-MARCO benchmark [153] monoBERT yields an MRR@10 value of 35.9% (i.e. the first relevant document at position 2.8 on average). As the keyword-based BM25-search before had an MRR@10-value of 16.5% (first relevant document at position 6.1 on average), this result was a dramatic increase in performance of search engines. Such a big jump in effectiveness caused by an individual model is rarely observed in either academia or industry, which led to immediate excitement in the community.

It is quite striking how monoBERT provides a simple yet effective solution to the problem of text ranking (at least for texts that are shorter than its maximal input length) [124]. In several studies monoBERT has been found to be better than BM25 in estimating relevance when term frequency is held constant. Using textual manipulation tests that alter existing documents, rearranging the order of words within a sentence or across sentences was found to have a large negative effect, while shuffling the order of sentences within a document has a modest negative effect. In contrast, rearranging only prepositions had little effect. Experimental results from input template variations show that monoBERT uses exact match, "soft" semantic matches, and information about the position of words. Exactly how these different components are combined—for different types of queries, across different corpora, and under different settings, etc.—remains an open question. Note that this search approach requires enormous computational resources, as for each passage a new evaluation has to be performed, while the effort for index search grows only logarithmically.

**monoT5** [154] used the T5 encoder-decoder model instead of BERT to rerank retrieved documents. The model receives the input "Query: *<*query*>* Document: *<*document*>* Relevant:". monoT5 is fine-tuned to produce the tokens true or false if the document is relevant to the query or not. The predicted probability of true can be used as a relevance score. For T5 with 3B parameters the authors get an MRR@10-value of 38% for MS-MARCO passage retrieval. This shows that larger models increase performance of retrieval systems.

#### *6.1.4 Using Token Embeddings for Retrieval*

The all-to-all nature of the BERT attention patterns at each transformer encoder layer means that there is a quadratic complexity in terms of time and space with respect to the input length. In Sect. 3.2 we have introduced a number of approaches to cope with longer inputs. These all can be used to process longer documents. Among the many approaches we discuss ColBERT and Model 1 in more detail.

**ColBERT** [99] reranks the output of another (cheaper) retrieval model, typically a term-based model, or directly for end-to-end retrieval from a document collection. Queries and documents were prepended by different special tokens. ColBERT uses a single pre-trained BERT model to encode each query or document into a bag of token embeddings. In a final layer the size of embeddings is reduced and they are normalized to Euclidean length 1.0. Hence, the inner product is equivalent to the cosine similarity. If *(q*1*,...,qm)* are the query tokens and *di,*1*,...,di,k* are the tokens of the *i*-th document, the similarity of *q* and *di* is computed as

$$s\_{q,d\_l} = \sum\_{r=1}^{m} \max\_j \eta(q\_r)^\mathsf{T} \eta(d\_{l,j}).\tag{6.3}$$

This is the sum of maximum cosine similarities (MaxSim) between each query term and the "best" matching term contained in the document *di*. For each query embedding the L2-nearest 10 embeddings are taken into account and *k* = 1000 closest document vectors are retrieved.

For ranking a preliminary search result of, say 1000 documents, the maximum similarities (e.g. cosine similarity) between all query embeddings and all embeddings in the retrieved documents are computed. This approach is very efficient as it requires orders of magnitude fewer FLOPS than previous approaches. On the MS-MARCO benchmark [153] a reranking ColBERT achieves a MRR@10-value of 34.9% (first relevant document at position 2.9 on average), which is slightly below the cross-encoder monoBERT.

ColBERT can also be used for end-to-end retrieval. It employs the *FAISS*  index [91] to store the document token embeddings for a *k*-nearest neighbor search in a preparatory step. Note that for each token in each document an embedding has to be stored, as the embedding depends on the context. The retrieval requires two stages: in the first stage, a number of approximate searches for each query token is performed. In the second refinement stage, these approximate matches are reranked according to the MaxSim criterion. On the MS-MARCO benchmark the end-to-end retrieval by ColBERT has a MRR@10-value of 36.7%, which is much better than the reranking performance and on par with the much more expensive BERT crossencoder approach.

**Model 1** [28] mixes a number of techniques for their retrieval model based on token embeddings. First the authors estimate the probability *p(q*|*d)* that the query *q* has been generated as a "translation" of the document *d*. Using Bayes rule the authors get

$$p(\mathbf{d}|\mathbf{q}) \propto p(\mathbf{q}|\mathbf{d})p(\mathbf{d}) \propto p(\mathbf{q}|\mathbf{d})\tag{6.4}$$

assuming a uniform prior *p(d)* [21]. They consider the probability *r(qi*|*dj )* that a query token *qi* is a translation of a document token *dj* . Approximating *r(qi*|*dj )* by a neural network, they use embeddings of tokens *qi* and *dj* as inputs and are able to estimate *p(d*|*q)*. The approach requires little computational effort. The authors combined the BERT dense retriever with a Lucene search index. Finally, they expand documents for Model 1 with Doc2query. *Doc2query* [156] aims at generating queries, for which the document is relevant. The approach trains a transformer to generate up to 100 query tokens from a document of up to 400

tokens. The model is trained using datasets consisting of pairs of query and relevant documents, e.g. MS-MARCO. On MS-MARCO they achieve 39.1% MRR@100. The context-free neural Model 1 is less effective than a BERT-based ranking model, but it can run efficiently on a CPU (without expensive index-time precomputation or query-time operations on large tensors).

Currently, no retriever tries to process long documents. This has many important applications like news recommendation, related article recommendation and paper citation suggestion. Usually, long documents are partitioned into passages with the idea that the relevant contents is contained in a passage. Note that PLMs with longer inputs, e.g. BigBird, can improve performance (Sect. 3.2). However, it is clear that this has to be evaluated. The **SMITH** model [247] uses a BERT-based hierarchical encoder to capture the document structure information. The document is first partitioned into sentences and for each sentence token embeddings are computed. Each sentence starts with an [CLS] token, whose embedding represents the sentence. There is a higher sentence level BERT which just receives the sentence embeddings as input. The first artificial token of second level BERT is used as the embedding of the whole document.

The model is pre-trained by the masked language modeling task to get token embeddings. In addition, in the second level there is a masked sentence block prediction task where the model has to select the correct embedding from all sentence embeddings in a batch. The fine-tuning task maximizes the relevance score predicted from the document embedding by a logistic classifier for the relevanceannotated fine-tuning dataset. On the *Wiki65K* with long Wikipedia articles [87] the approach achieves an accuracy of 95.9% which is a significant improvement over prior approaches.

#### *6.1.5 Dense Passage Embeddings and Nearest Neighbor Search*

Representing text passages by embedding vectors has the potential to solve the problem of vocabulary mismatch by directly matching "meaning" in a representation space. These so-called *dense retrieval* techniques can perform ranking directly on vector representations generated by PLMs. In contrast to calculating pairwise differences of token embeddings, this approach offers a much more efficient retrieval procedure. This is performed by matching the embedding vector of a query with the embedding vectors of passages employing an index and approximate nearest neighbor search. Efficient, scalable solutions are available today in opensource libraries.

Given a query *q* and a set of documents *D* = {*d*1*,...,dn*} we want to define functions *η<sup>q</sup> (*·*)* and *η<sup>d</sup> (*·*)*, which convert the token sequences *q* and *d* into fixedwidth vectors. The functions should have the property that the similarity between *η<sup>q</sup> (q)* and *η<sup>d</sup> (di)* is maximal if *di* is relevant for query *q*. We want to estimate

**Fig. 6.3** The SentenceBERT model uses two fine-tuned BERT models to transform queries and passages to embeddings of the [CLS] token. Subsequently, a cosine similarity module is used to compute a similarity value

$$p(\text{relevant} = 1 | d\_l, q) := \phi(\eta\_q(q), \eta\_d(d\_l)), \tag{6.5}$$

where *φ(*·*)* is a similarity comparison function, e.g. the scalar product [124, p. 133]. Note that *η<sup>d</sup> (di)* may be precomputed and organized in an index. By using different encoders *η<sup>q</sup> (*·*)* and *η<sup>d</sup> (*·*)* for queries and documents, we can take into account the different roles and wordings of queries and documents.

**SentenceBERT** [183] is the prototype of a bi-encoder design for generating semantically meaningful sentence embeddings to be used in large-scale textual similarity comparisons (Fig. 6.3). The query *q* and the documents *di* are processed by the same PLM (BERT or RoBERTa). Similarity was compared by the *cosine similarity* 

$$\phi(\eta\_q(q), \eta\_d(d\_l)) = \frac{\eta\_q(q)^\dagger \eta\_d(d\_l)}{\|\eta\_q(q)\| \* \|\eta\_d(d\_l)\|}. \tag{6.6}$$

To generate sentence embeddings the authors investigated three alternatives. (1) Use the embedding of the [CLS] token. (2) Averaging (mean-pooling) of all output embeddings. (3) Component-wise maximum (max-pooling) of all output embeddings. Without fine-tuning the results were worse than for non-contextual embeddings. Fine-tuning boosted performance and yields a new SOTA. It turned out that average pooling was the most effective design, slightly better than max pooling or using the [CLS] token. Most important the computation time for finding the best match in 10,000 documents was reduced from 65 h to 5 s.

**DPR** [94] used separate encoders *η<sup>q</sup> (q)* and *η<sup>d</sup> (di)* for the query *q* and the text passages *di* of about 100 words. Both encoders took the [CLS] embedding from BERTBASE as its output representation. As comparison function the inner product *η<sup>q</sup> (q)η<sup>d</sup> (di)* was used. For each query *qi* the training set contained one correct passage *d*+ *<sup>i</sup>* and a number of negative passages *d*<sup>−</sup> *i,*1*,...,d*<sup>−</sup> *i,m*. The loss function encoded the goal to get a large *φ*-value (i.e. similarity) for *qi* and *d*<sup>+</sup> *<sup>i</sup>* and small similarities for *qi* and *d*<sup>−</sup> *i,j*

$$L(w) = -\log \frac{\exp[\mathfrak{\eta}\_q(q)\,\mathfrak{\tau}\,\mathfrak{\eta}\_d(d\_l^+)]}{\exp[\mathfrak{\eta}\_q(q)\,\mathfrak{\tau}\,\mathfrak{\eta}\_d(d\_l)] + \sum\_{j=1}^m \exp[\mathfrak{\eta}\_q(q)\,\mathfrak{\tau}\,\mathfrak{\eta}\_d(d\_{l,j}^-)]}\tag{6.7}$$

The negative examples were a mixture of passages retrieved with keyword search that did not contain the answer and thus were difficult negatives. In addition, passages from other examples in the same training batch were used. Instead of performing an exhaustive computation of similarities for all documents between *ηq (q)* and the *ηd (di)*, we can employ an approximate nearest neighbor search. *FAISS*  [91] is an open-source method based on hierarchical navigable small world graphs. For the Natural Questions benchmark they achieved a top-20 accuracy of 79.4%, which is much better than the previous top-20 accuracy of 59.1% for the keywordbased BM25 search. The replication study [136] could confirm these results, but found that a hybrid approach of DPR and BM25 could increase the performance to 82.6%.

**ANCE** [238] uses a single RoBERTa model to encode query and document. During training, hard negative examples are selected by approximate nearest neighbor search on an index over the representations generated by the trained encoder. In this way, they can select "difficult" negative examples. The index is periodically updated. On Natural Questions ANCE achieved 82.1% top-20 accuracy. The performance was also compared with the monoBERT cross-encoder, which reranks first-stage BM25 results with monoBERT by comparing all documents to the query. It turned out that on MS-MARCO the application of monoBERT to BM25 had a MRR@10 of 34.7% while ANCE has 33%. The cross-encoder obviously is more effective than ANCE. The authors also applied ANCE to 8 billion documents using embeddings of size 64 and approximate nearest neighbor search. They reported a gain of 16% compared to the prior commercial implementation.

**RocketQA** [184] performs a first retrieval step and subsequently a re-ranking procedure. Both approaches are jointly optimized using a listwise training approach, where a list of positive and negative examples is used for training both modules. In addition, they perform a data augmentation to construct diverse training instances by incorporating both random sampling and denoised sampling. They report a MRR@10 on MS-MARCO of 38.8% for passage retrieval. When the 50 top results are reranked later, they can increase MRR@10 to 41.9%.

**coCondenser** [63] is one of the highest entries of the MS-MARCO leaderboard [140]. The model is forced to learn to aggregate information into the "CLS" embedding, which will then participate in the LM prediction. Then an additional "contrastive loss" is used: "CLS" embeddings of passages from the same document close together should be similar, while those for passages in different documents should have a larger distance. This yields highly expressive embeddings for passages. When the model is fine-tuned on MS-MARCO, it returns an *MRR*@100 of 40*.*8% on the MS-MARCO leaderboard [140].

#### **Available Implementations**


#### *6.1.6 Summary*

Retrieval is a crucial step in web search, in which a small set of query-relevant candidate passages are identified from a corpus of billions of texts. Discovering more semantically related candidates in the retrieval phase holds great promise for presenting more high-quality results to the end user. Dense retrieval approaches represent a paradigm shift in search engine technology. They make it possible to recognize the meaning of words and paraphrases and thus find much better passages matching a query. Search results can also be used for question-answer models (Sect. 6.2) and dialog systems (Sect. 6.6). They are already being used in production search engine by Bing [35, 238, 266], Google [152, 197], and Facebook [82].

Dense retrieval methods discussed above are fine-tuned in a supervised setting using human relevance labels as input, e.g. from MS-MARCO. Best results are obtained by two different PLMs to encode the query and the documents. Both PLMs are trained to improve the probability of a correct reference document in contrast to some negative documents. As two different PLMs require more effort, most systems use a single model to encode the question and the documents. Experiments show that the combination of dense retrieval and keyword retrieval seems to have advantages. In Sects. 6.2.2 and 6.2.3 retrieval techniques for question answering are discussed, which are even more powerful.

A problem is the transferability of a search system to a new domain. BERT was found to have strong cross-domain relevance classification capabilities when used in a similar way as monoBERT [124, p. 72]. If a BERT model is fine-tuned using relevance judgments from one domain (e.g., tweets) it can be successfully applied to a different domain (e.g., newswire articles). On the other hand, Thakur et al. [221] created a benchmark called *BEIR* with 18 retrieval tasks from very different domains like bio-medicine and tweets. The authors trained a large number of dense retrieval techniques on MS-MARCO and evaluated then on the other tasks. They found that they were on average less effective than BM25, which due to its simplicity just works in most cases.

The memory requirements for an index for embeddings cannot be ignored. While a keyword Lucene index for the MS-MARCO passage corpus with 8.8M passages needs 661 MB, a FAISS index for vectors of size 768 requires 42 GB and an index for ColBERT takes 156 GB [124, p. 159]. To apply these techniques to web-scale, approaches with a smaller memory footprint are needed.

#### **6.2 Question Answering**

*Question Answering* (QA) is an application of NLP that receives a natural language query and automatically generates a precise answer in natural language. It is a longstanding AI task dating back to the 1960s [69]. Compared with search engines discussed in Sect. 6.1, the QA system presents the final answer to a question directly instead of returning a list of relevant snippets or hyperlinks. Thus, it is more userfriendly and efficient. Often, the system has access to a database or a *knowledge base*  (*KB*) of documents, such as Wikipedia, where it can search for relevant information.

A *Closed Domain QA system* handles questions for a specific domain, e.g. medicine, and has background knowledge about that domain or is trained with a large training set covering that domain. *Open Domain QA systems* (ODQA) deal with questions on almost any topic and usually rely on general KBs or Internet search [37]. *Multimodal QA* systems address questions in different media, e.g., text and images. A survey of ODQA is given by Zhu et al. [265]. Table 6.3 compiles leading QA Models with their performance.

A simple form of question answering is *Reading Comprehension*, where the system has to identify an answer to a question in a given text. Often a BERT-like system marks the answer span in the text by span prediction (Sect. 2.1.3). This task can mainly be considered as solved. For the *SQuAD 2.0 benchmark* [179] ALBERT yields more than 93% F1-value and the fine-tuned *ST-MoE-32B* mixture-of-experts model (Sect. 3.5.2) with 269B parameters [270] achieves 96.3% F1-value, while the human F1-value is 89.5% [178]. However, Sen et al. [199] indicate that systems trained on one dataset may not generalize well to other benchmarks.

#### *6.2.1 Question Answering Based on Training Data Knowledge*

Language models often are trained on comprehensive text collections and are able to memorize a large amount of information. A frequently used benchmark is *Natural Questions* (*NQ*) [109], which has been sampled from the Google search logs (Sect. 6.1.2). For the given question, the system has to find a short answer span in the given support documents. An example is the question "When are hops added to the brewing process?", which should yield the answer "The boiling process".

The *TriviaQA* benchmark [92, 226] contains a set of trivia questions with answers that were originally scraped from the Web. Different from Natural Questions, the

**Table 6.3** Question answering models with their performance. The lower part contains retrieval models. Benchmarks: NQ: natural Questions benchmark of Google queries [109], TriviaQA: TriviaQA benchmark [92, 226], HotpotQA: multihop benchmark [249], EM: exact match


questions here are written with known answers in mind. *TruthfulQA* [125] is a special QA benchmark with short questions that are constructed adversarially, so that some people's answers might be wrong due to false beliefs and misconceptions. The answers are evaluated according to informativeness and truthfulness.

#### **Fine-Tuned Question Answering Models**

The **BigBird** (Sect. 3.2) self-attention was used as an autoencoder and trained with the MLM objective using an input sequence of 4096 tokens [253]. During finetuning on Natural Questions the model had to find a short answer span in one of the given evidence documents. The model achieved 57.9% F1-value on this task. The **PoolingFormer** [256] is an alternative model for long input sequences with a two-level attention schema. Its first level uses a smaller sliding window pattern to aggregate information from neighbors. Its second level employs a larger window to increase receptive fields with pooling attention. An ensemble of fine-tuned PoolingFormers achieves 61.6% F1-value on the Natural Questions benchmark. The model is similar to the **SMITH** model [247], which uses a BERT-based hierarchical encoder to capture the document structure information (Sect. 6.1.4).

An alternative is **Macaw** [218], a freely available QA-system with 11B parameters. It is built on T5 and has strong zero-shot QA-capabilities. On a set of 300 challenge questions the authors claim that Macaw outperforms GPT-3 by 10%, although it has only a small fraction of its parameters. In addition to providing an answers for a question, Macaw can also take an answer and produce a question; or generate multiple-choice options for an answer and a question. The authors also provide a detailed analysis of errors.

It is much more difficult to combine different pieces of evidence to find an answer. A benchmark to test this ability is *WikiHop* [232], where information from different documents has to be merged. An example is the question "Hanging gardens of Mumbai, country?" and the documents "The Hanging Gardens, in Mumbai, also known as Pherozeshah Mehta Gardens, are terraced gardens ..." and "Mumbai is the capital city of the Indian state of Maharashtra. It is the most populous city in India . . . ". For each query up to 140 background paragraphs are provided to the model. On this benchmark BigBird-ETC (Sect. 3.2.1) achieved an accuracy of 82.3%. Currently, the best model for this task is the **RealFormer** with an accuracy of 84.4% [171], which is slightly below the human performance of 85%. The RealFormer is an autoencoder with a modified architecture, which provides a bypass with the raw attention scores of all attention heads from the previous layer in the subsequent layers [76].

#### **Question Answering with Few-Shot Language Models**

Recent Foundation Models are trained with an enormous collection of documents and can generate answers to questions without additional knowledge input. An example is the autoregressive language model **GPT-3** with 175B parameters, which was pre-trained on a text collection of books, Wikipedia and web pages of about 500 billion tokens (Sect. 3.1.2). Because of its high model capacity it can absorb a lot of 'knowledge' in its parameters. When a Foundation Model is not allowed to use external information, this is called *Closed-book QA*.

As discussed in Sect. 3.6.3, GPT-3 can be instructed by a few examples (fewshot) to solve a task. Figure 6.4 provides a few-shot prompt example. For Natural Questions, GPT-3 achieves an exact match accuracy of 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in the few-shot setting [29, p. 14]. This was achieved without fine-tuning on Natural Questions. The larger **Gopher**  model with 280B parameters (Sect. 3.1.2) performs slightly worse with 28.2% in the few-shot setting [175, p. 80].

The even larger **PaLM** model with 540B parameters (Sect. 3.1.2) was trained on a high-quality dataset with 780B tokens. It uses a new prompt technique to pose logical questions, where examples are presented to the system together with *thought chains* partitioning a reasoning task into smaller problems (Sect. 3.6.4). In this way it gets the recipe to combine facts from different sources to arrive at the final answer.

PaLM was evaluated on a large number of other benchmarks, which in part are QA-tasks. On Natural Questions it arrived at an accuracy of 21.2% with 0-shots and at 36.0% with few-shot prompts [43, p. 47]. On Trivia QA (questions concerning the Wikipedia), BoolQ (question answering with yes/no answers), and PIQA (question answering with reasoning) PaLM also achieved a new SOTA. The results are shown in Table 3.4. PaLM was benchmarked with a large number of tests, among them the more than 150 BIG-bench tasks (Sect. 4.1.4). Many of them are QA-related tasks: 21 contextual QA tasks, 24 context-free QA tasks, 36 reading comprehension tasks, and a large number of tasks on specific knowledge and common sense [1, 22]. Additional outcomes for QA-benchmarks of PaLM are given in [43, p. 12], where PaLM always achieves SOTA.

**Fig. 6.4** A possible few-shot prompt for GPT-3 to get an answer based on existing knowledge acquired during pre-training [160]

#### *6.2.2 Question Answering Based on Retrieval*

Retrieval ODQA systems usually work in two stages: for a question a *retriever*  module finds a number of documents from a text collection, which might contain the answer. Subsequently, a *reader* considers the question and the retrieved documents and generates a natural language answer (Fig. 6.5). Since the model relies on external information, it is referred to as *Open-book QA*.

Retrievers have been introduced in Sect. 3.4.5 and were discussed in the context of document retrieval in Sect. 6.1. The retriever may employ a traditional search engine using tf-idf weighting or BM25. Alternatively the retriever may be a *dense retriever* based on document and question embeddings. It is trained to retrieve passages by computing embedding similarities e.g. by *DPR* [94] (Sect. 3.4.5). A tutorial on ODQA is provided by Chen [36].

The *reader* is usually an autoregressive language model that receives both the query and the retrieved documents as inputs. It is fine-tuned to generate a response to the query based on the retrieved information and its internal knowledge.

Question answering with external knowledge bases has the advantage that curated KBs usually are checked for correctness. They may have, however, limited coverage of entities and relations may not be up-to-date. There are a number of approaches to combine PLMs with KBs using techniques like entity mapping (Sect. 3.4.1). Recent papers propose a hybrid approach using KBs and retrieval [239]. Knowledge-Guided Text Retrieval [145] starts with retrieving text passages for a query. It creates a passage graph, where vertices are passages of text and edges represent relationships that are derived either from an external knowledge base or co-occurrence in the same article. On Natural Questions [109] they achieve an accuracy of 34.5%.

**HYBRIDER** [41] uses information from a retriever as well as from a structured KB and tables. The authors collected Wikipedia pages and constructed a benchmark dataset HybridQA containing question-answer pairs requiring multi-hop reasoning using text, tables and hyperlinks (Fig. 6.6). The model first links questions to

**Fig. 6.5** Question answering often combines dense retrieval with an answer selection module. The retriever performs a dense retrieval by comparing the embedding of the query with the embeddings of passages. The reader ranks the retrieved documents and generates an answer by an autoregressive Pre-trained Language Model [36]. Credits for image parts in Table A.2


**Fig. 6.6** For hybrid question answering Wikipedia pages are retrieved by HYBRIDER [41] (top left). Some pages contain tables (left). Here the column titles may be interpreted as well as hyperlinks to entities (underlined). The lower part shows two human-annotated question-answer pairs. Image reprinted with kind permission of the authors [41, p. 2]

tables cells as well as Wikipedia passages and hyperlinks. In a reasoning phase the linked information is ranked and consolidated to derive the probabilities of different answers. The experiments with the dataset show that the utilization of tables or retrieval alone achieves an exact match accuracy of about 20% while the joint model yields more than 40%. However, the hybrid model's score is still far below human performance.

One of the first retrieval-reader systems was **DPR** (Dense Passage Retriever) [94]. It employs a BERT model to encode passages by embeddings and retrieves them by approximate *k*-nearest neighbor search with the FAISS index (Sect. 6.1.5). In this way it can gather passages with similar meaning but different wording. The DPR reader is another BERT model which is fine-tuned to predict a probability for each retrieved passage that this passage contains the correct answer. In addition, it selects a span of tokens by span prediction, which probably provides the answer. The approach can be easily applied to KBs with billions of passages [94, 213]. On the *Natural Questions* [109] it yields a test set accuracy of 41.5%.

**FiD** [84] is described in Sect. 3.4.5. The retriever is based on DPR and compares query and passages embeddings. Raffel et al. [177] have shown that generative models like T5 can produce the answer for QA-tasks. FiD processes the query and the retrieved passages by a *reader* based on a T5 model to generate an answer. Since the first step is to process the passages one by one, the system is very efficient. FiD achieves an exact match accuracy of 51.4% on the Natural Questions test set compared to 41.5% for DPR.

**REALM** [75] and **RAG** [114] are retrieval augmented generative models for open domain question answering. However, they process all retrieved passages simultaneously in an autoregressive language model and were unable to take into account a large number of passages leading to lower accuracies on Natural Questions of 40.4 for REALM and 44.5 for RAG. Sachan et al. [194] propose an end-to-end differentiable training method for retrieval-augmented ODQA. Latent variables indicate which of the relevant documents should be included. The values of the latent variables are iteratively estimated by an EM-algorithm. On Natural Questions they achieve an exact match accuracy of 52.5%.

**MTR** [138] starts from the observation that neural retrievers perform well on their fine-tuning domain, but will typically achieve low out-of-domain performance. The authors propose a multitask retriever similar to DPR which is jointly fine-tuned on eight diverse retrieval tasks. They use a shared passage encoder—so that a single index of encoded passages can be used—as well as a query encoder that is shared across all tasks. In five of the eight models they achieve a higher performance than special models tuned to the corresponding domain.

**AISO** [268] is a retriever-reader architecture for solving multi-hop QA tasks, where multiple documents are required to answer a question. Repeated retrieval rounds are performed in which associated terms are taken as new search queries to find additional evidence. The approach is adaptive and at each step selects one of three types of retrieval operations (e.g., BM25, DPR, and hyperlink) or one answer operation. On the *HotpotQA benchmark* [249], the question-answering system must find the answer to a query in the scope of the entire Wikipedia. The AISO model achieved a new SOTA with a joint F1-value of 72.0%.

The **FB Hybrid** system was presented at the EfficientQA competition [147], where real user questions for the Google search engine from the Natural Questions dataset [109] were tackled. While the original NQ was a reading comprehension task providing a number of evidence documents for each question, the EfficientQA benchmark [147] adapted this to open-domain QA by taking examples with up to five token answers and discarding the evidence documents. The system uses a retriever-reader architecture [158]. The retriever is a mixture of DPR and another retrieval system, which covers lists and tables as well as KB-relations and retrieves 100 passages. The reader is a T5-large Seq2seq model, which is given 100 passages from the retriever and generates an answer. The background corpus contained 18.8M passages from Wikipedia. On Natural Questions the model achieves an exact match accuracy of 53.9%. According to an evaluation by human raters the model was able to answer 67.4% of the questions correctly, which is about as good as the performance of human experts using a search engine. The **MS UnitedQA** model had a similar architecture [139]. It uses a BERT-based retriever and a reader combined from a T5-model and ELECTRA processes the returned documents to generate different answers. A final re-ranking model selects the answer. MS UnitedQA yields an exact match accuracy of 54.0% and 65.8% correctness on Natural Questions. If the systems were restricted to a memory footprint of 6 GB the performance was only marginally reduced.

#### *6.2.3 Long-Form Question Answering Using Retrieval*

#### **A Language Model with Integrated Retrieval**

**Retro** [25] is an autoregressive language model with 7B parameters using retrieved information to predict the next token. As retriever a frozen BERT model is employed (Fig. 6.7). Each training sequence is split into chunks, which are augmented with their *k*-nearest neighbors retrieved from the database of 2 trillion tokens. The returned information is processed in a language model to improve the prediction of the next token leading to large performance gains. The reader consists of a differentiable autoregressive encoder and a chunked cross-attention module to predict tokens.

An input sequence *v* = *(v*1*,...,vn)* of *n*=2048 tokens is split into chunks *c<sup>t</sup>* = *(v(t*−1*)*∗*m*+<sup>1</sup>*,...,vt*∗*m)* of length *m*=64. Each chunk *c<sup>t</sup>* is expanded with a set RET*(ct)* of retrieved *k* nearest neighbor chunks from the database. The probability of a token *vt*∗*m*+*<sup>i</sup>* in the next chunk *ct*+<sup>1</sup> then can be recursively computed as

$$p(v\_{l\ast m+l}|v\_{l\ast m+(l-1)},\dots,v\_{l\ast m+1},\mathbf{c}\_l,\text{RET}(\mathbf{c}\_l),\dots,\mathbf{c}\_l,\text{RET}(\mathbf{c}\_l)).\tag{6.8}$$

The probability of the *i*-th token of the *(t* + 1*)*-th chunk *ct*+<sup>1</sup> depends only on the previous tokens and on the data RET*(c<sup>j</sup> )* retrieved from the database for the previous chunks. This integrates the retrieval process in the language model.

The retriever for a chunk *c<sup>t</sup>* uses the average BERT*(ct)* of all BERT embeddings of the tokens in *c<sup>t</sup>* as key. It retrieves the *k* nearest neighbors from the database with respect to the *L*<sup>2</sup> distance ||BERT*(ct)* <sup>−</sup> BERT*(c*˜*s)*||<sup>2</sup> <sup>2</sup>. The model receives the corresponding chunks *c*˜*s,j* and additionally their continuation chunk *c*˜*s*+1*,j* for *j* = 1*,...,k*, which collectively form the elements of RET*(ct)*. By filtering it is avoided that the chunk to be predicted is included in RET*(ct)*, as this would invalidate the conditional probability definition. The retrieval is performed in *O(*log *T )* time using the *SCaNN* library [73], which collects a set of chunks from a database of 2 trillion tokens in 10ms. Note that the document corpus of Retro is about 1000 times larger than the databases of FiD and other retrieval models.

**Fig. 6.7** The Retro language model retrieves information for the input sequence. The model uses the input sequence and the retrieved neighbor chunks from the database as input and generates an appropriate output [176]

Inside the reader the retrieved tokens in RET*(ct)* are fed into an autoencoder, which computes a set *E* of encoded neighbors. Then, so-called RETRO blocks

$$\text{RETRO}(H, E) := \text{FCL}(\text{CAT}(\text{ATTL}(H), E)), \tag{6.9}$$

and standard self-attention blocks LM*(H )* := FCL*(*ATTL*(H ))* are interleaved and operate on the intermediate embeddings *<sup>H</sup>* <sup>∈</sup> <sup>R</sup>*n*×*<sup>d</sup>* . Here FCL*(*·*)* is a fully connected layer, ATTL*(*·*)* a self-attention layer, and CATL*(*·*,E)* a cross-attention layer which includes the information in *E*. The input and output dimension of these modules is R*n*×*<sup>d</sup>* .

The resulting language model is able to predict the next token with a high reliability. The *Pile data* [62] is a 825GB open-source text collection set that consists of 22 diverse, high-quality datasets. It was screened for toxic language and bias, e.g. with respect to gender, religion, and race. Its authors recommend measuring the quality of token prediction in *bits per byte* (*bpb*), which in contrast to perplexity is independent of the tokenizer [62, p. 6]. The authors compare Retro with GPT-3175B [29], Jurassic-1178B [121], and Gopher280B [176]. It turns out that Jurassic-1 has the lowest (and best) bpb-value on 5 Pile datasets, Gopher on 2 datasets and Retro on 9 datasets, although it is far smaller than the other models [25]. GPT-3 was inferior to all three models. A possible problem for these results is the overlap of the retrieval corpus with the test data.

For the *LAMBADA benchmark* [165] a model has to predict the last word of a paragraph. The authors measure the following accuracies: Retro without retrieval 70%, Retro with retrieval 73%, Gopher 74.5%, and GPT-3 76.2%. Note that Retro has only 4% of the parameters of GPT-3. For question answering the Natural Question benchmark is relevant. Here, Retro achieved an exact match accuracy of 45.5%.

The *LaMDA* [222] dialog system (Sect. 6.6.3) is an expanded version of Retro with 137B parameters. It demonstrates that facticity can be improved by retrieval models. In addition, it is able to reduce toxic language by a system of filters that block unwanted speech. Although this model could also easily be used for question answering, no corresponding benchmark results are known.

#### **Controlling a Search Engine by a Pre-trained Language Model**

**WebGPT** [149] extends GPT-3 to control the *Bing search engine* and performs a web search for a specific query. The language model must issue commands such as "Search ...", "Find in page: ..." or "Quote: . . . ", as shown in Fig. 6.8. In this way, the model collects passages from web pages which contain information relevant for the question. The utilization of Bing has the advantage that it has powerful search capabilities, and covers a large number of up-to-date documents.

Browsing continues until the model issues a command to end browsing, the maximum total length of references has been reached, or the maximum number


**Fig. 6.8** Possible actions of the WebGPT language model. If another text is generated, this is an invalid action and ignored [149]

of actions has been reached. If a relevant reference has been retrieved, the model will generate a long-form answer to the question.

The GPT-3 model is first fine-tuned to mimic human demonstrations, enabling it to use the text-based browser to answer questions. Then, the usefulness and accuracy of the model's answers is improved by fine-tuning a reward model to predict human preferences, and optimizing it by rejection sampling. Specifically the model is fine-tuned to answer questions from *ELI5* [56], a dataset of openended questions obtained from the subreddit 'Explain Like I'm Five'. An example is given in Fig. 6.9. The proposed WebGPT answers should be coherent, relevant, and supported by trustworthy documents. No details are reported on the input prompts of GPT-3 containing the current state of search, and how the GPT-3 model combines the returned documents into an answer. Note, however, that there is significant overlap between training and validation in ELI5, as at least 81% of ELI5 validation questions occur in the training set [106] in circumscribed form.

The final answers were selected from 64 trials of the 175B WebGPT model by ranking. These answers were preferred by human raters to the reference responses from the ELI5 dataset 69% of the time. Moreover, they were preferred to the human demonstrator responses in 56% of the cases.

For WebGPT, responses to TruthfulQA [125] were correct about 75% of time, whereas GPT-3 scored 64% with helpful prompts. While GPT-3's answers were truthful and informative in about 20% of the time, the best version of WebGPT increased this to about 56%. Since people answered 94% of the questions correctly, the models still have a significant performance difference. On TriviaQA WEBGPT achieved a score of 69.5%, which is far less than the value of PaLM with 81.4%.

An innovative feature is the support of text passages by references. This corresponds to the approach of scientific papers to underpin claims by references and was already suggested by Metzler et al. [143]. The references explain the answer and support the factual accuracy of the statements. The citations are selected by Bing in response to the query. They should therefore be close to the final reader-generated response and provide an easy way to assess the correctness of the response.

However, the authors point out that the references are not always representative for the available evidence, although the model cites references that correspond to the generated text. In addition, it is difficult for the model to verify the trustworthiness

**Fig. 6.9** Long-form answer to a question generated by WebGPT. The best of 64 answers was automatically selected. The citations were automatically retrieved from the Bing search engine and added to the answer [80]

of references. Here, Web-of-Trust systems and search engine technology could be employed, which favor trust-checked frequently linked web pages.

#### **Available Implementations**


#### *6.2.4 Summary*

A number of Foundation Models have been presented, which were able to improve Question Answering performance. Examples are the autoregressive language models GPT-3 (175B), Gopher (175B), and PaLM (540B) with huge parameter sets, which are trained on a large document collections and can acquire extensive knowledge. Using few-shot prompts they are able to answer questions with high accuracy without employing external knowledge.

Recently, the retriever-reader architecture has been increasingly used for QA systems. It has the potential to tap into a larger knowledge base or the Internet that can easily be kept up-to-date. The retriever can employ keyword search or dense retrieval. Dense retrieval mitigates the term-mismatch problem, where relevant paraphrases are ignored. Usually, embeddings for each document or phrase are precomputed and the embedding index is constructed beforehand. Current systems can access document collections of up to trillions of tokens using advanced nearestneighbor search engines like FAISS and SCaNN to compare embeddings.

The reader usually receives the query and the returned passages in text form and generates the answer. It is fine-tuned to select the correct answer and to provide answers which are expressive and truthful. The Retro model is an autoregressive language model with only 7B parameters, which uses passages retrieved by a frozen BERT model as additional current state information to generate the next tokens. It is capable of improving accuracy to high levels for many QA tasks, but can also be used for other applications such as story generation.

WebGPT combines GPT-3 and the Bing search engine to retrieve documents and create appropriate answers. It is able to enhance the generated text by references to documents, which justify and explain the answer. The LaMDA dialog model is an expanded version of Retro with 137B parameters with specific tuning to increase usability and factual accuracy. In addition, it is able to reduce toxic language by a system of filters that block unwanted speech. These techniques can also be applied to question answering.

Still difficult is the generation of answers where the correct response needs information from multiple documents. In this case several rounds of querying are necessary. Special models like RealFormer, HYBRIDER, or AISO can improve the performance for benchmarks like WikiHop.

**Fig. 6.10** This map shows some of the world's 7100 languages, with each dot representing a language and the color indicating the top language family for each language. Only a small fraction of the world's languages are currently represented in Foundation Models. Image reprinted with kind permission of the authors [24, p. 23]

#### **6.3 Neural Machine Translation**

Language is the cornerstone of most human communication and interaction. Moreover, many persons think in terms of language, and use it to express and communicate feelings, goals, and ideas. We communicate knowledge by language and use it to establish social and emotional relationships. There are more than 7100 languages in the world [19], some of which are shown in Fig. 6.10. The ability to understand each other across language barriers is essential for communicating ideas between people.

After an initial success with Recurrent Neural Networks [15, 215] the development of the Transformer encoder-decoder (Sect. 2.3) has driven progress in Neural Machine Translation (NMT). By cross-attention a "correlation" between each token of the source text and the translated text can be established, producing better translations than before. The availability of large training sets and better model architectures has steadily increased the performance of Pre-trained Language Models for NMT (Fig. 6.11). Standard models for multilingual processing are described in Sect. 3.3. A survey is provided by Yang et al. [248].

**Fig. 6.11** BLEU scores for Google translation of 100+ different languages to English for different years. Image credits in Table A.2

#### *6.3.1 Translation for a Single Language Pair*

The training data of NMT consist of text pairs of the source language and its translations to the target language. Traditionally evaluation is done by comparing one or more reference translations to the proposed translation, as described in the survey [195]. There are a number of automatic metrics like BLEU, METEOR or BERT-score (Sect. 2.3.3). It turned out that there is a noticeable difference between human judgment and automatic evaluation. Therefore, most high-end comparisons today use human translators to assess the quality of translation methods.

At the WMT2021 Machine Translation conference, numerous teams solved benchmarks tests for translating English news texts from/to German, Japanese, Russian, Chinese, and a number of low-resource languages [5]. Instead of using comparison statistics like BLEU, the translations of each system was evaluated by a number of human evaluators without showing them a reference translation. They were asked to rate a given translation according to how adequately it expressed the meaning of the corresponding source language input on an analog scale, which corresponds to an underlying absolute rating scale of 0–100. As some raters could be stricter, the systems are ranked by a z-score, where the score is mean-centered and normalized per rater. Systems are grouped together according to which system significantly outperforms all others measured by the Wilcoxon rank-sum test. A large effort was spent to assess the validity of human evaluation.

In total 173 submissions were received. In addition, five anonymized online systems were included. Further human-produced reference translations were denoted by "HUMAN" in all tables. Results show that almost all good systems are based on transformer encoder-decoders. Words are mostly encoded by the SentencePiece [107] tokenizer (Sect. 1.2). A widely used technique is *back-translation* [200]. Here a monolingual text is translated to the other language and then back-translated. By minimizing the difference to the original text, both models may be improved. Up to 500M sentences per language were available and could be used for back-translation, which led to a significant improvement in quality. In addition, ensembles are able to increase the performance in most cases.

The result of the best system for each language pair is shown in Table 6.4. Usually, there is a cluster of 2–5 models at the top, whose performance differences are not significant. The Facebook-AI model (FB) had the best results for half of the language pairs. In addition, the BLEU scores for the best systems automatically computed from n-grams are shown. As can be seen, the values are in general better for the translation "to English" than "from English" especially for morphology rich languages like Czech and German. Compared to the human reference translation, the best system was significantly better for three language pairs. This has already been discussed critically by Toral [223], who decry the limited amount of context between sentences and the limited translation proficiency of the evaluators.

Improved performance was reached by increasing the number of parameters. The Facebook model [224], for instance, used a standard model of 4.7B parameters and a sparsely gated mixture-of-experts system with up to 128 experts. In each Sparsely Gated MoE layer, each token is routed to the top-2 expert feedforward blocks based on the score of a learned gating function. In addition, the models were fine-tuned with domain-specific data from the news domain. The *n*-best hypotheses were generated with a beam search. These were ranked with a weighted average of the probabilities *p(*tgt|src*)*, *p(*src|tgt*)*, and *p(*tgt*)*, where src is the source and tgt is the target sentence.

It is well-known that the translation of single sentences suffers from ambiguities (e.g. pronouns or homonyms), which can be resolved by considering the document context. In WMT2021 this is taken into account by assessing the quality of translation within the document context [5]. As current encoder-decoder Foundation Models are able to consider larger contexts, this could improve translation performance [141]. Instead of finding the most probable translation of a sentence, we need to generate the best translation for a given complete source document. While comparing sentence-level translation often does not indicate a difference between human and machine translation, the comparison of document-level translation often yields a statistically significant preference for human translations [110].

Instead of using a Seq2seq model with extra long input sequence, **HAT** [187] proposes a hierarchical attention transformer. The authors split the input text into sentences and start each sentence *i* with a specific [*BOSi*] token. These tokens summarize the sentence content and are connected to the other sentences by the usual self-attention and cross-attention. While the usual encoder-decoder transformer has a BLEU of 32.5 for the document translation from English to German on WMT2019, HAT is able to yield a SOTA BLEU of 34.5.


**Table 6.4** Leading systems of the WMT2021 News Translation Task. The systems are ordered by normalized z-score [5, pp. 15–19]. If either the best system or a human reference translation is significantly better, the value is printed in bold. Systems: FB: Facebook-AI, BL: Borderline, HW: HW-TSC, NV: Nvidia-NeMo, NI: NiuTrans, OB: Online-B, OW: Online-W, HN: HappyNewYear Score

#### *6.3.2 Multilingual Translation*

Usually, languages with scarce training data have a much lower translation accuracy, as holds for Hausa in Table 6.4. A recent success was the extension of NMT by multilinguality, which was already discussed in Sect. 3.3. This led to a marked improvement of translations for languages with few resources. For a survey see [48].

**M2M** of Facebook AI [57] improves translation between many languages by utilizing a massive corpus of 7.5B sentences covering 100 languages and thousands of translation directions with supervised data, created through large-scale mining. The model is a transformer encoder-decoder with 15B parameters. The authors add a special token in the encoder indicating the source language and a special token in the decoder indicating the target language. The transformer has 12 encoder and 12 decoder layers and an embedding size of 1024. As there is a joint token vocabulary for all languages, the input and output embeddings are shared. To improve performance the authors added language-specific layers to the decoder for five languages. Using specific parallelization techniques they were able to train the model with only hundreds of GPUs.

Except for four language directions (En→Chinese, Chinese→En, En→Fi, En→Estonian) the model improved translation results on the WMT benchmarks for 1.9 BLEU points on average. Especially marked is the improvement for regional languages with an average increase of 7.6 BLEU. For resource-rich language pairs Liu et al. [130] propose to use very deep transformers with up to 60 encoder layers and 12 decoder layers. They develop a simple yet effective initialization technique that stabilizes training and achieve SOTA on WMT2014 En-Fr of 46.4 BLEU.

Although multilingual translation has many advantages, it usually performs worse than specially trained bilingual models for high-resource language pairs. Recently Facebook [225] presented a single multilingual model, which outperformed the best specially trained bilingual models across 10 out of 14 language pairs of the WMT2021 news benchmark. Facebook built two multilingual systems: anyto-English and English-to-any. They employed data mining techniques to identify translations in large web crawl data and leverage available monolingual data with hundreds of millions of sentences from all eight languages to maximize performance of MT systems. They filtered the available monolingual data to reduce the amount of noise, and then back-translated them with an ensemble of the strongest multilingual models available. The number of parameters was increased from 15B to 53B to enhance the model capacity.

The BLEU scores are shown in Table 6.5. In comparison to the best bilingual models of WMT2021, the multilingual model achieves a better BLEU in 9 of 14 cases indicating that the additional training data from other languages supports translation. Only for Chinese→English there was a larger drop of 1.3 BLEU points. The authors also performed a human evaluation for the language pairs English→Russian and English→German. It turned out that there was no perceived difference between the quality of bilingual and multilingual translations.

**Table 6.5** BLEU scores of the Facebook multilingual model and the best language pair model submitted to the WMT2021 news task. The numbers reported are BLEU scores on the final WMT2021 test set [225]. The difference between the models is printed in bold, if the multilingual model is better


**Table 6.6** Influence of different modeling improvements on the BLEU scores on the development set of WMT2021 for Facebook AI's WMT2021 submission [225]. The version of the last row was submitted


Table 6.6 shows the effect of employed improvement strategies for the different languages of the multilingual model. Back-translation has a large effect for languages with little training data like Hausa and Icelandic. The authors note, however that back-translation produces *translationese* by generating artificial uncommon phrases in a language. These effects may be mitigated by fine-tuning on the specific domain, e.g. news texts. This yields about 3 BLEU points for translation into English and 0.7 BLEU points for translation out of English. Switching to the multilingual model generates an improvement for all models. While the effect of model ensembles is minor, re-ranking the BEAM translations with conditional targetsource probabilities yields about 0.4 BLEU points. Postprocessing (for example applying standard punctuation rules) can have a large effect, e.g. 5 BLEU points for Chinese.

The **PaLM** autoregressive language model with 540B parameters [43] has about 22% non-English training texts among its 780B training tokens (Sect. 3.1.2). Similar to other large LMs, PaLM is not trained explicitly on parallel text, although some such data is likely to exist naturally in the training corpus. In Table 6.7 the results of PaLM 540B few-shot translation is compared with prior few-shot and fine-tuned SOTA [43, p. 27]. The best BLEU value per language pair is underlined and the

**Table 6.7** Comparison of PaLM few-shot translation performance against prior fine-tuned translation performance by specialized models and prior few-shot performance. On the left you find the translation from English and into English for the traditional WMT language pairs. On the right there is the translation to and from English to Kazakh (kk) and a translation between German and French [43, p. 27]


best few-shot BLEU is printed in bold. The table shows that PaLM on the traditional WMT translation pairs always achieves the best few-shot BLEU, often improving by a wide margin. For the low-resource language Kazakh (kk) the fine-tuned translation models have a better BLEU than PaLM. However, for de→en and ro→en PaLM is able to outperform the supervised models. In addition, the 0-shot PaLM translation of fr→en achieves a BLEU value of 25.2, which is better than the fine-tuned SOTA of 24.9. Overall, PaLM performs well close to the fine-tuned models without having been trained for this task.

#### *6.3.3 Multilingual Question Answering*

In recent years open domain question answering (ODQA) has taken a rapid development (Sect. 6.2). Therefore, it is extremely rewarding to extend these techniques to multilingual question answering. In this way, information encoded with the world's different languages can be tapped and the digital divide can be narrowed by bringing answers to people who speak rarer languages. There is a tutorial on multilingual ODQA by Ruder [192, 193].

A simple way to perform multilingual ODQA is to translate the question to English, use an English ODQA system to generate an answer, and translate the answer back to the target language. Because of ambiguities in translation, this procedure may generate errors in some cases [132]. In addition, information specific to the target language and conceptualizations of the target culture may not be available in English [258].

The *TyDiQA-GoldP benchmark* [44] is a question answering dataset covering 11 typologically different languages with 204K question-answer pairs. The following languages are included: English, Arabic, Bengali, Finnish, Indonesian, Japanese, Kiswahili, Korean, Russian, Telugu, Thai. As the languages represented in this benchmarks have a very diverse structure, a model which performs well on this data can be expected to have a good QA-accuracy on other languages. *MKQA* [133] is an evaluation dataset created by translating 10k Natural Questions [109] to 25 target languages.

**Fig. 6.12** Cross-lingual retrieval by mDPR and answer generation with mGEN for the CORA system [13, p. 9]. The answers to the questions are correct, however, on the left side the answer should have been given in French

As an alternative, one can train cross-lingual retriever and reader models combining the information from multiple languages to generate an answer in the target language (Fig. 6.12). **CORA** [13] answers questions across many languages, even for ones without language-specific annotated data or knowledge sources. It includes a dense passage retriever collecting documents with different languages for a question. A pre-trained multilingual language model *mDPR* using mBERT (Sect. 3.3.1) is fine-tuned to encode passages and questions separately. By performing a maximum inner product search the top *k* documents are retrieved similar to DPR (Sect. 3.4.5). It could be shown that mBERT improves the search quality in non-English mono-lingual retrieval [203]. The reader *mGEN* is a multilingual autoregressive sequence model (e.g. mT5, Sect. 3.3.2) generating the answer in the target language by compiling the information in the retrieved passages. No specific translation models are used. The initial training data is a combination of multilingual QA datasets. Each training instance from these datasets comprises a question, a positive passage, and an answer. However, these datasets suffer from limitations on language diversity. Therefore, the authors iteratively generate more representative training data for low-resource languages by exploiting links between Wikipedia articles in different languages.

It turns out that CORA substantially outperforms the previous SOTA on multilingual open QA benchmarks across 26 languages, 9 of which are unseen during training. Here CORA can improve the average F1-value from 17.1 to 21.8. Retrieval with mDPR performs well in Indo-European languages with Latin script, even when the language is unseen. There is a major drop for languages with non-Latin script


**Table 6.8** Comparison against SOTA on TyDiQA question answering benchmark with 11 typologically different languages. The values are for the validation set with respect to the exact match accuracy [43, p. 32]. Best values for each language printed in bold

(e.g., Japanese, Russian, Chinese). Here, perhaps, the model is unable to use relevant passages from other languages to answer questions.

**mT5** (Sect. 3.3.2) is a multilingual version of the T5 Seq2seq transformer with up to 13B parameters [246]. It was pre-trained using a training dataset of web pages covering 101 languages with about 48B tokens and a common vocabulary of 250k tokens. After fine-tuning on the TyDiQA benchmark, it arrives at an exact match score of 79.1%. **ByT5** [245] is a variation of the mT5 multilingual encoder-decoder with 12.9B parameters. It operates on utf-8 bytes with a vocabulary of 256 possible byte values instead of tokens. The model is pre-trained to replace corrupted spans of 20 bytes on average. The largest model uses 36 encoder and 12 decoder layers. When the model is fine-tuned on gold data in all target languages, it achieves an exact match score of 81.4% on the TyDiQA benchmark.

The **PaLM** Foundation Model [43] has about 22% non-English training texts in its 780B training tokens (Sect. 3.1.2). Therefore, it can be applied to multilingual tasks such as translation and question answering. With few-shot prompts it gets an exact match score on TyDiQA of 60.5%. When the model is fine-tuned on TyDiQA, the score grows to 80.0%, which is slightly below of the performance of ByT5 XXL. The detailed results in Table 6.8 show the performance for different languages. Here PaLM has a better score for two languages than ByT5. The authors remark, that ByT5 was trained with 50% more non-English text compared to PaLM, which may explain the difference.

#### **Available Implementations**


#### *6.3.4 Summary*

In recent years, machine translation has taken a dramatic development. The use of encoder-decoder PLMs could overcome the limitations of RNN architectures and increase the performance to near-human levels. Besides the utilization of encoderdecoder Transformers, the availability of high-quality training examples by web crawlers using Foundation Models and specific assessment procedures is a reason for progress [33]. A further improvement resulted from sentence back-translation, which particularly increases results for low-resource languages, and from training a single multilingual model for translation between all languages. Training multilingual translation models with up to 600B parameters—using appropriate parallelization strategies—leads to significant performance increase for 100 languages, as measured by BLEU [113]. Recently multilingual models even were able to outperform high-resource bilingual translation models. This is also demonstrated by the PaLM Foundation Model, which achieved higher performance in few-shot translation than the prior fine-tuned models for some language pairs. Therefore, multilingual models are likely to become standard in the future. However, current multilingual models using unsupervised multilingual training may not deeply model the subtleties of languages and language varieties to their full extent. This has to be checked in future applications.

The developments opened up the opportunity for multilingual question answering systems, e.g. CORA, where queries can be posed in a large number of languages. The answers are compiled from information available in multiple languages. In this way, cultural characteristics and concepts that are not available in all languages can be taken into account. There are also close links to cross-lingual semantic parsing, where a natural language utterance is translated to a logical form for execution in some knowledge base to return an answer [202]. Again the PaLM Foundation Model provided few-shot answers to multilingual questions, which are competitive in accuracy to fine-tuned models for the same benchmarks. A fine-tuned version of PaLM is even able to outperform prior fined-tuned SOTA for two languages.

However, machine translation is not yet solved. There is still the problem of domain mismatch between train and test data. In some cases, it fails to accurately capture the meaning of a sentence. Systems can generate biased text, e.g. if gender is handled differently in different languages. But attention allows the decoder to look directly at faraway text and provides a soft alignment between words for free. Recently, performance could be increased by translating entire documents, as sentences often are not sufficient to disambiguate all words. To extend current multilingual models to thousands of languages, new techniques are required [19]. One approach is to use monolingual datasets to improve translation, since the amount of available monolingual text is orders of magnitude greater than the amount of translated text. This in addition requires highly reliable language detectors which also work for low-resource languages.

#### **6.4 Text Summarization**

With the rapid increase of textual information in companies and on the Internet, it is increasingly difficult for people to keep track of a topic. Automatic *summarization*  of documents, which compiles the essential statements from a text, can help to grasp the most relevant information in the documents. A *summary* is a short version produced from a single document or multiple documents conveying the main points of the original texts. The purpose of automatic text summarization is to create a *summarizer* method to produce this summary efficiently and precisely. Recent indepth surveys are provided by Ma et al. [135], Guan et al. [71], Syed et al. [216], and El-Kassas et al. [95].

Earlier machine learning approaches produced *extractive summaries* selecting a few sentences from the document. This approach typically selected grammatically correct sentence parts, but the language style of the combined parts and the coverage were usually not sufficient. Modern summarizers pose summarization as a translation problem, which translates the original document to a short version covering the main points. Since 2017 the encoder-decoder transformer (Sect. 2.3) provided an effective technique to generate *abstractive summaries* containing the main points of the document. Abstractive summarization is a bit more complex because the text is paraphrased, and the summary usually has words different from the original document. On the other hand, it is more flexible and can aggregate several similar texts expressing related facts with different wordings.

Basically, summarization is treated as a translation task, where the long document is translated into the short summary. Alternatively we can use the long document as the start text of an autoregressive Foundation Model, which is fine-tuned to generate a summary. One of the main challenges for Seq2seq models is that the decoder needs to attend to encoder token embeddings in the large document context to predict the next token of the summary. Therefore, Seq2seq models covering a long input context (Sect. 3.2) are natural candidates. Summarization systems can be either *single document summarizers* or *multi-document summarizers*. Table 6.9 lists popular summarization models and their performance.

#### *6.4.1 Shorter Documents*

The training data usually consist of documents and the corresponding summaries or abstracts. There are a number of actual benchmark datasets for summarization like CNN/Daily Mail [78], Gigaword [150], and Reddit TIFU [101], which have an input document with a length below 1000 tokens and a corresponding summary, which can be used for fine-tuning. The difference between a reference summary and a predicted summary is assessed by measures like ROUGE, BLEU, or METEOR (Sect. 2.3.3) with the recall-oriented ROUGE most frequently used.

**PEGASUS** [128] is large transformer-based Seq2seq model pre-trained on massive text corpora (Sect. 3.1.3). It follows a new pre-training objective in which

**Table 6.9** Summarization models with their performance measured in ROUGE-2. Benchmarks are CNN/DM: CNN/Daily Mail benchmark [78], XSum [151] summarize an news article in a single sentence, arXiv [46] long scientific documents, PubMed [46] long medical documents, Multi-News [54] with an average document length of 1793 and 2.8 documents per cluster


not tokens are masked, but sentences. During pre-trained, the model has to generate the masked or removed sentences as one sentence output. This pre-training objective is especially rewarding for document summarization, as the model learns how to generate sentences matching a context. After pre-training the model is finetuned on 12 different summarization tasks. It reaches SOTA-results on all 12 downstream datasets as measured with different ROUGE statistics. In most cases the improvements are considerable [128], e.g. for the CNN/Daily Mail benchmark it had a ROUGE-2-score of 21.7. The ROUGE-2-scores of other Seq2seq models are similar, e.g. 21.6 for T5, 21.3 for BART, and 21.5 for R3F [4]. Note that for text generation often a BEAM search (Sect. 2.2.3) is employed keeping several high probability versions of the text to increase the consistency of the resulting text.

**BRIO** [131] starts from the observation that the usual ML-training only takes into account a single reference summary for each example and ignore possible other summaries. First a generation model is trained using the standard ML loss for the reference summary. It generates candidate summaries in an autoregressive way and scores the quality of the generated summaries. The weighted candidate summaries are considered by the evaluation model using a contrastive loss criterion, which takes into account the ranking order defined by the weights of the candidate summaries. The approach uses BART or PEGASUS as backbone Seq2seq models. On the *CNN/Daily Mail benchmark* benchmark [78] the BRIO model with 10B parameters has SOTA performance with the ROUGE-2 score of 23.6 on CNN/DM and 25.6 on XSum. By increasing the number of candidates from 4 to 100 by extending the beam width, the ROUGE-2 on CNN/DM could be increased to 24.1. A detailed analysis demonstrated that the approach was able to filter out noise patterns in the original data, e.g. the phrase "click here".

The autoregressive language models GPT-3, Gopher, InstructGPT and PaLM can be instructed to summarize, e.g. by entering a text and appending "TL;DR:" [159]. For **PaLM** with 540B parameters an evaluation is available. The *MLSum benchmark*  [198] requires the model to summarize a news article in multiple sentences. For German texts PaLM 1-shot arrives at 12.8 ROUGE-2 and a fine-tuned version of PaLM achieves a ROUGE-2 score of 33.1, which is below the fine-tuned SOTA at 36.4 [43, p. 30]. The *XSum benchmark* [151] requires to summarize a news article in a single sentence. Here PaLM gets a few-shot ROUGE-2 score of 12.2 and a finetuned ROUGE-2 of 21.2, whereas the fine-tuned SOTA ROUGE-2 by BRIO is 25.6.

**ST-MoE-32B** [270] is a mixture-of-experts model (Sect. 3.5.2) with 269B parameters. On the *CNN/Daily Mail benchmark* it achieves a fine-tuned SOTA ROUGE-2 value of 21.7 and on the *XSum benchmark* it yields 27.1 ROUGE-2 with fine-tuning. While fine-tuned Foundation Models can achieve a similar performance as specific summarization models, results for few-shot prompts need improvement.

ROUGE metrics are only a crude guide to what people really care about: the quality of a summary. Stiennon et al. [211] directly optimize their model with respect to human judgment. The authors collect a large, high-quality dataset of human comparisons between summaries. Then they train a model to forecast human-preferred summarization and use this model as a reward function to fine-tune a summarization policy using reinforcement learning. They apply their model to the *TL;DR benchmark* [230], because this summarization task is significantly more challenging than CNN/DM. They find that the summaries of their 6.7B parameter **STIE** model are significantly preferred to the reference summaries 70% of the time, whereas the summaries of fine-tuned alternative models are preferred to the reference summaries about 43% of the cases. The model can also be applied to new domains better than other methods. For CNN/DM news articles, it produces summaries that are almost as good as the human reference without the need for news-specific fine-tuning. This indicates the effectiveness of the approach, and opens an avenue to optimize summarization quality directly.

#### *6.4.2 Longer Documents*

While the input document length of documents is generally less than 1000 tokens, it is greater for the *PubMed corpus* (4k tokens) and *ArXiv benchmark* (8.6k tokens) [46]. For these benchmarks transformers with longer input sequences (Sect. 3.2) are capable of taking into account the whole document.

**BigBird** [253] is able to cope with long documents (Sect. 3.2.1). As the sequence length of the transformers is increased, the number of parameters (and computations) grows quadratically. BigBird has a sparse attention mechanism that reduces this quadratic dependency to linear. BigBird can use a larger input sequence of 4096 tokens and drastically improves performance on various NLP tasks such as question answering and summarization. Longer documents exhibit a richer discourse structure and summaries are considerably more abstractive. For long documents with 3000–6000 words BigBird is pre-trained with the PEGASUS objective. After fine-tuning it yields a marked improvement on SOTA, e.g. on the ArXiv benchmark with the ROUGE-2 score 19.0. **TLDR** [31] is a similar summarizer based on BART, which generates a one-sentence summary for scientific papers. It increases its performance by the auxiliary target to predict the title of a paper.

**HAT** [187] aims to capture the content of longer documents in a better way. The authors design a hierarchical Seq2seq attention network model that produces sentence level representations, and combines them with token level embeddings. They determine sentence boundaries by punctuation and insert [*BOS*] tokens at the start of every sentence. In the transformer encoder they use a conventional layer which produces an embedding for each token. After this an additional *hierarchical layer* is added which only attends to the embeddings of the [*BOS*] tokens. The resulting embeddings can be interpreted as sentence level representations. The transformer decoder is standard with an additional layer that attends to the [*BOS*] tokens from the hierarchical encoder layer. On the *PubMed benchmark* of long documents [46] it yields a SOTA ROUGE-1 score of 21.4. while on arXiv it has a ROUGE-1 score of 19.7. But also on the *CNN/Daily Mail benchmark* of shorter documents [78] it achieves a SOTA ROUGE-2 scores of 21.3,

**RL-175B** is a summarizer for whole books by OpenAI using a reinforcement learning algorithm to follow human preferences [236]. The model first summarizes small sections of a book, then generates intermediate summaries from them and finally produces a summary of the whole book on the basis of the intermediate summaries. The model is based on *GPT-3* and evaluates a large set of summary activities created by human labelers. The small sections are generated by a fixed chunking algorithm. Then a model is trained on human examples to summarize these chunks using reinforcement learning. It uses the approach explained in Sect. 3.6.5. A number of chunks is joined in a group and a higher-level summary is produced. This procedure is repeated until a final summary of the whole book is generated.

The fine-tuning was performed for the GPT-3 with 7B and 175B parameters. The summarization was tested on books, which were not contained in the training data. The scoring is done by a *Likert scale* from 1 to 7. It assigns numbers to human judgments (e.g. 1 = very bad, 2 = bad, . . . , 7 = very good), and computes averages from these numbers. While the 6B models scores a little better than 2 Likert, the 175B model achieves an average Likert of 3.5. However, about 20% of the summaries got more than 5 Likert, which were also sometimes assigned to human-written summaries. It turned out that the reinforcement approach achieved better results than behavior cloning. In general, there is a large difference to humancreated summaries, and the generated summaries still lack coherence.

#### *6.4.3 Multi-Document Summarization*

Often, information is spread across multiple documents, and it makes sense to summarize this content. For example, it may be useful to summarize a series of reviews about the same mobile phone or to summarize scientific papers on the same topic.

**Primer** [237] is based on the *Longformer* encoder-decoder (Sect. 3.2.1), an efficient transformer model with an input length of 4096 tokens, where the effort for processing long documents grows linearly with their length. The input documents are concatenated and separated with [*doc* − *sep*] tokens. These tokens act as global relays and have attention connections to all tokens, while the other tokens are only connected to the tokens in the same document. In this way, large sequences of input documents can be processed. It can be expected that the same information appears multiple times in the different documents. PRIMER selects sentences, which are similar in different documents based on the ROUGE score and uses common entities as an additional selection criterion. These sentences are masked and the model has to reconstruct them during pre-training taking into account the information from all documents (Fig. 6.13).

The pre-training already enables the model to combine the information from different documents. Therefore, zero-shot and few-shot summarization with no or little fine-tuning is possible. For the *Multi-News benchmark* [54] with an average document length of 1793 and 2.8 documents per cluster, PRIMER achieves a zeroshot ROUGE-2 score of 13.6 and can increase this to 21.1, which establishes a new SOTA for this multi-document summarization benchmark. On the *ArXiv benchmark*  with an average document length of 6021 tokens [46], the fine-tuned PRIMER yields a ROUGE-2 score of 20.8, indicating the performance on long documents.

**Fig. 6.13** Multiple documents form the input for PRIMER, separated with [doc-sep] tokens. These tokens have a global attention with all tokens, the remaining tokens attend only inside each document. Some sentences are selected and have to be recovered by the decoder [237]

#### **Available Implementations**


#### *6.4.4 Summary*

Foundation Models initiated a breakthrough for summarization models. They can be trained to generate abstractive summaries by handling this problem as a translation task, where the model is trained to reconstruct a reference summary. For smaller documents with up to 1000 tokens, the standard models like T5 and PEGASUS achieve good results, with BRIO being a bit ahead. Models with more parameters have a slightly better performance. General Foundation Models like PaLM have a slightly lower performance. The STIE model shows that user preferences may be used directly in training a summarizer via reinforcement learning, resulting in good summaries that are preferred by human raters.

For larger documents a transformer encoder-decoder with a larger input sequence is required, e.g. BigBird. There are different techniques to generate intermediate representations for documents, e.g. for sentences by HAT or chunks by RL-175B. However, the quality for the summarization of whole books currently is not sufficient, even if the large GPT-3 model is employed. A recent alternative is InstructGPT (Sect. 3.6.5), which can be easily directed to perform a summarization, e.g. by the prompt "Summarize this for a second-grade student: *<*text*>*" [162, p. 30]. However, a formal evaluation of the performance of this approach seems to be difficult, as no reference training/test data is involved.

Multi-document summarization has to cope with the repetition of contents in different documents. The PRIMER model uses a hierarchical attention structure to ingest a number of large documents and is trained to reconstruct sentences exploiting information from other documents. This leads to a satisfactory performance on the specific multi-document benchmarks.

#### **6.5 Text Generation**

A system for *Natural language generation* (NLG) has the task of producing fluent, coherent, and understandable text. Usually, the system generates a continuation of a start text. The development of Foundation Models in recent years has greatly advanced this field and led to convincing solutions. This section concentrates


**Table 6.10** Main text generation techniques

on writing larger texts and complete stories. NLG has already been used for many real-world applications, such as creating business reports from business figures, describing sporting events from results tables, or creating weather forecasts. Microsoft has announced to fire about 50 employees of MSN news [17], using Deep Learning instead to identify trending news stories or optimize the content. The generation of responses to user utterances by a chatbot is discussed in the section on dialogs. A number of surveys for text generation is available [65, 83, 116]. Yu et al. [251] give an overview of knowledge-enhanced text generation.

Here we will describe story generation systems based on Foundation Models that currently provide the best results. A high-level overview of approaches is given in Table 6.10. By pre-training on a massive corpus, the models can encode a large amount of linguistic and semantic knowledge and produce rich, flexible, and universal representations of language. In the following sections we will discuss a number of different NLG tasks.



**Table 6.11** Mechanisms to control story generation


Table 6.11 describes these tasks and lists a number of corresponding NLG models discussed in this section. The generation of fake news or other malicious text is covered in Sect. 6.5.5. Section 6.5.6 describes how to generate computer code.

The assessment of the performance of natural language generators is a difficult problem. Expensive but most comprehensive is the evaluation by humans, where persons are asked to rate or compare texts generated by different NLG systems. If texts created by humans are part of the comparison, this constitutes a *Turing test* which may assess the "intelligence" of an NLG-system. An alternative are automatic metrics like BLEU, METEOR or ROUGE (Sect. 2.3.3), which assess the difference between machine-generated texts to human-generated reference texts by comparing *n*-gram counts (Sect. 6.3). A final alternative are machine learning models, which judge the adequacy of the generated text. These models act like a judge, who decides, if a generated text is real or synthetic. Celikyilmaz et al. [34] discuss these evaluation approaches in detail. Yu et al. [251] provide a survey of knowledge-enhanced text generation.

*GEM* [66] is a new benchmark collection created for NLG containing seventeen different benchmarks and comprising an evolving system of evaluation metrics and procedures. A fraction of benchmarks are summarization benchmarks like XSum and MLSum already covered in the previous section. Models are assessed with metrics comparing a reference text and the diversity of the text. The authors provide an interactive GUI, which is able to highlight the relative strengths and weaknesses of each system. GEM can be used as a testbed to evaluate, how new metrics perform on these different tasks.

#### *6.5.1 Generating Text by Language Models*

Language models (Sect. 2.2) have the task to produce the next token *xt* for a text *x* = *(x*1*,...,xt*−1*)*. This model can directly be applied to story generation. The user provides a start text as input to the LM, which word-by-word generates a continuation. Specifically, the model predicts for the next position the probability *p(xt*|*x*1*,...,xt*−1; *w)* of each token of the vocabulary. To generate a text a single sequence of tokens has to be selected according to the predicted probabilities. Simply selecting the tokens according to the estimated probabilities often generates rare, non-plausible continuations. A better alternative is top-*k* or top-*p* sampling restricting the random selection to the tokens with the highest probability (Sect. 2.2.3).

Early LMs, e.g. LSTMs, produced text, which often contained syntactic errors, losing the context after a few words. **VAE** *Variational Auto-Encoders* reconstruct the sentence from a randomly modified latent representation *z* ∼ *N (μ, σ)*, where *μ* and *σ* are predicted by the encoder. A KL-loss is added to the reconstruction loss such that the distribution of *z* approaches a standard normal distribution [89]. **GAN**  *Generative Adversarial Networks* use a generator to transform a noise vector *s* to a text *x*˜ = *G(s)*. Then a discriminator *D(x)* has the task to distinguish synthetic text *x*˜ from real text *x* [68]. Both models are trained together. These basic language generation alternatives are also covered in Table 6.10.

A number of classical models for text generation such as BART (Sect. 3.1.3), T5 (Sect. 3.1.3), and mT5 (Sect. 3.3.2) are evaluated with the GEM benchmark [66]. The models are assessed using 7 metrics comparing a reference text and 9 metrics of diversity (e.g. the relative number of distinct uni- and bigrams). Instead of reporting a single metric the models can be evaluated with different combinations of metrics as shown in Fig. 6.14.

**GPT-2** [174] is an autoencoder comprising 1.5B parameters. It was able for the first time to generate consistent stories that continue a start text. According to the users, the stories were coherent in half of the cases. Much better is the performance of **GPT-3** with 175B parameters [29]. Given an initial text it is able to create short stories, songs, press releases, technical manuals, poems, translations, guitar tabs, computer code, etc. Only with an accuracy close to chance (52%) humans were able to distinguish whether news articles of about 200 words were synthetic [29, p. 26].

**Fig. 6.14** A screenshot of the GEM benchmark interactive result exploration tool. On the top left tasks are selected. The selection of metric-groups or metrics is on the top right. The visualization of the selected metrics is shown on the bottom. Image reprinted with kind permission of the authors [66, p. 107]

A discussion of relative strengths and weaknesses of these Foundation Models can be found in Chap. 4.

An evaluation benchmark measuring the degree to which a language model "understands" a story is the *LAMBADA benchmark* [165] (Sect. 4.1.3). It consists of about 10,000 passages from the BooksCorpus containing unpublished novels. The task is to predict the missing last word of the last sentence of each passage. Examples were filtered by humans to ensure that models need to take into account the full passage of at least 50 tokens to induce the final word. The GPT-3175B autoregressive language model [173] predicted the last word with 76.2% [29, p. 12]. PaLM with few-shot instructions could increase the accuracy to 89.7 [43, p. 79]. This means that in nearly nine of ten cases the predicted word was exactly correct, which indicates that the model well "understood" the preceding passage. For advanced Foundation Models like Gopher (280B) and PaLM (540B) text generation is a background ability taken for granted, which is no longer tested with benchmarks. A large battery of benchmarks is applied to test other features, e.g. common sense knowledge, reasoning, etc. (Sect. 4.1.4).

**InstructGPT** is a recent variant of GPT-3 (Sect. 3.6.5), which can easily be instructed to generate a story, e.g. by the prompt "Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home." [162, p. 6]. **Retro** is an autoregressive LM combined with a retrieval mechanism (Sect. 6.2.3). In this way, current and focused information can be collected during the generation of a story, instead of relying on the information contained in the model parameters, which were obtained from the training data. **LaMDA** (137B) is a recent Language Model (Sect. 6.6.3) specialized for dialogs. It also features

a retriever-reader architecture to augment its internal knowledge acquired during pre-training with external information.

**GRF** [86] is a Foundation Model including multi-hop reasoning in a knowledge base to improve language generation. This enhances PLMs, which otherwise take into account common sense knowledge only if it is explicitly stated in the training data. The reasoning module operates on the sub-graph extended from the concepts in the input text and draws possible conclusions. These are taken into account for the further generation of text. Results, e.g. on task to finish a story, show that the model outperforms strong alternatives. Other approaches to enhance language models by additional knowledge are discussed in Sect. 3.4. A survey of conditional text generation is given by Guo et al. [72].

#### *6.5.2 Generating Text with a Given Style*

Often the goal is to create a text in a specific style or emphasizing a specific type of content: e.g. author's style (e.g. Shakespeare), emotion (e.g. angry, malicious, happy), genre (e.g. humor, romance), topics (politics, religion), persona (e.g. lawyer, knight), or sentiment (e.g. positive, negative, fury). By design there are a number of ways how to influence the story produced by a Foundation Model.


There are different ways to achieve this with Foundation Models. A comprehensive survey is given by Lili and Vechtomova [122].

#### **Style-Conditional Probabilities**

**CTRL** [96] aims to train a generative model *p(y*|*x*; *a)* conditioned on a control variable *a*. To do this, the conditional distribution *p(x*|*a)* is adapted by training on raw text sequences with context classes prefixes such as [horror], [legal], etc. The authors used text collections, which are labeled with the corresponding context classes. Then the learned transformer model with 1.6B parameters is able to generate text with respect to the control prefix. This is developed further by **GeDI** [105], which has a stronger controllability, generates less toxic text, and can be extended to continuously weighted control codes for generating fluent stories [127].

**PPLM** [50] (Plug and Play Language Model) defines a model *p(x*|*a)*, where *a* is some desired controllable attribute(s) and *x* the generated sample. If *p(x)* is the pre-trained LM, the authors define the conditional distribution *p(a*|*x)*. This yields a conditional generative model *p(x*|*a)* ∝ *p(a*|*x)p(x)*. The distribution *p(a*|*x)* may be implemented by a single layer classifiers. The model samples from the resulting combined model by following gradients in the latent representation space (keyvalue-pairs of the transformer) such that *p(x)* as well as *p(a*|*x)* is improved. After a number of 3–10 updates the perturbed values are used to generate a new token at the next position. The model was able to create text with the desired tonality (e.g. positive/negative) while preserving fluency. However, balancing the impact of the PLM and the conditions is delicate and must be supported with additional measures like reranking, and early-stopping procedures.

**ETC-NLG** [32] leverages context-sensitive topic models [23] to enhance PPLM with an unlabeled collection of documents. This is desirable as PPLM still requires large amounts of labeled texts to effectively balance generation fluency and proper conditioning. The attribute model discriminator, predicting document topics, and the unconditional language model PPLM are merged to obtain a conditional language model for topic-conditioned utterances.

**GDC** (Generation with Distributional Control) [97] propose an approach to emphasize specific words in addition to changing the distribution of generated words. For example, GDC can avoid toxic content, prevent bias, and align the generation with a particular theme or style. Instead of reweighting the generative distribution of tokens, the authors derive a stochastic policy by reinforcement learning [166] to get a good compromise between the constraints and the language model. The authors can reweight single words (e.g. China), all words in a word list (e.g. lists for kitchen, fantasy), and words emphasized by a classifier (e.g. for very negative or clickbait). The results show that the constraints are met with the lowest divergence from the original PLM and with the best diversity scores.

**Adapter-Bot** [126] provides different adapters trained independently for different skills. The backbone of the Adapter-Bot is a pre-trained GPT language model [262], providing the ability of text generation. A set of trainable adapters are added to the backbone, which are optimized over the target dataset of dialogues for specific dialogue skills. Using a trained classifier to select the right dialogue skill under the dialogue story, Adapter-Bot allows high-level control over the chatbot.

#### **Prompt-Based Generation**

GPT-3 is able to produce text, when it receives an appropriate prompt (Sect. 3.6.3). It can, for instance, generate a poem [8]. After the prompt "write a poem in the style of Rabbie Burns" it may produce something like

"There once was a lady from Dundee a' wha was bonnie, braw, and meek She met an old man from Dunfermline who won't let her to her sleep . . . "

With the prompt "write this like an attorney" it can create a text in the wording of a lawyer. Moreover, it can automatically write emails in your personal style by getting a prompt with some key points. GPT-3 can even work with unusual language types. It can, for instance, translate natural language into shell commands or programming code [163]. More prompts for GPT-3 and other Foundation Models are provided by OpenAI [160]. InstructGPT was fine-tuned to generate text according to an instruction (Sect. 3.6.5). It can, for instance, receive the directives "Complete the following sentence in a polite, respectful, and unbiased manner:" or as "Complete the following sentence using maximally biased and offensive language:". Then the model produces diverse texts that satisfy the requirements [162].

#### *6.5.3 Transferring a Document to Another Text Style*

Text style transfer aims to translate a text *x* with attribute *a* to a similar text *x* of a desired attribute *a*. For example, the sentence *x* ="Peter screwed up" with the attribute *a* ="informal" can be transformed to *x* ="Peter has not reached the goal" with the attribute *a* ="formal". The aim is to train a language model *p(x*|*x , a)*. There are a number of other transformations, such as impolite ↔ polite, complicated ↔ simple, positive ↔ negative, biased ↔ neutral, or factual ↔ humorous ↔ romantic.

The separation of style from content is difficult. On the one hand it can be captured by linguistic features, e.g. the utilization of specific words and phrases. On the other hand, it can be provided by text collections, e.g. with the writings of different authors or with a corpus of positive/negative reviews. In the latter case we can train classifiers, which discriminate between the different styles. With the recent progress in the capabilities of language models there are a number of successful applications of style transfer like imitating the style of specific authors, removing bias in online text, etc. A recent comprehensive survey is provided by Jin et al. [88].

#### **Style Transfer with Parallel Data**

If there are parallel documents of both styles, the style transfer can be formulated as a translation problem. An encoder-decoder transformer has to be fine-tuned on this dataset.

**Formal** [260] formulate style transfer from informal to formal as a translation task. They use a transformer as Seq2seq model and apply it to the *GYAFC* [180] benchmark dataset containing parallel formal/informal sentences. In addition, they augment the data by back-translation, employ machine translation to and from another language and leverage training data from grammatical error correction. They report a new SOTA on the GYAFC dataset with increased formality and fluency, while keeping the meaning of a text.

#### **Style Transfer without Parallel Data**

**StyleLM** [217] translates an arbitrary text into a text with the style properties of another author while keeping the content, even if no parallel data of the same content in different styles is available. First a BERT model is trained on a large neutral corpus (Gutenberg and Wikipedia) with the MLM loss. Then two copies of the model are used as an encoder-decoder transformer *x*˜ = DEC*w(*ENC*u(x))*. As fine-tuning input this Seq2seq model receives texts from the target author, where a random fraction of the words have been masked and have to be reconstructed. Hence, the Seq2seq model induces text with the target author's style while rewriting the input text.

For evaluation 10 different authors were selected and excluded from the training data. The BLEU score and ROUGE scores are used to measure content preservation. To measure the style quantitatively, the frequency of author-specific words and of syntactic and punctuation elements are evaluated. StyleLM in most cases had the best content preservation and stylistic alignment. Singh et al. [207] note that StyleLM has problems with content reproduction. They propose to pre-train the encoder-decoder DEC*w(*ENC*u(x))* on a large generic corpus. Afterwards the encoder-decoder is fine-tuned on the text of the target author.

**OPTIMUS** [115] investigates further manipulations of sentences embeddings. An encoder with parameter *u* is required to generate a latent vector from text *z* = ENC*u(x)*. It is initialized with a pre-trained BERT model. A linearly transformed version *z* = *W* ∗ *h*[*CLS*] of the embedding of the first token [CLS] of a sentence is defined as latent representation. The generator (decoder) with parameter *w* generates the text sequence *x* = DEC*w(z)* from a random vector *z* (e.g. multivariate Gaussian) with prior *p(z)*. The authors start with a pre-trained GPT-2 model as decoder. *z* is used by the decoder as an additional vector to attend to (in addition to the previously generated token embeddings). Both networks *x*˜ = DEC*w(*ENC*u(x))* are trained with the autoencoder loss and the variational autoencoder loss, i.e. the system has to minimize |*x*˜ − *x*| and encourage a Gaussian distribution for *z*.

The approach learns bidirectional mappings between latent embeddings *z* and sentences *x*. For two sentences *x*<sup>1</sup> and *x*<sup>2</sup> the embeddings may be calculated and by *αz*<sup>1</sup> + *(*1 − *α)z*<sup>2</sup> we can continuously interpolate between the sentences. In addition, differences between latent vectors may be computed similar to Word2Vec. For dialog response generation and the generation of responses with a specific style OPTIMUS has a better performance on all metrics compared to its competitors. Using an additional GAN to manipulate the latent representation *z*, OPTIMUS is able to generate YELP restaurant reviews of prescribed sentiment (positive/negative) better than the investigated alternatives. The authors argue that compared to BERT, OPTIMUS learns a more structured semantic space due to the use of the VAE prior distribution in training.

#### **Style Transfer with Few-Shot Prompts**

Sufficiently large Foundation Models such as **GPT-3**, Gopher, and PaLM can perform various tasks simply by choosing a clever prompt. If, however, only a simple prompt is entered, e.g. "Here is some text: {That is an ugly dress}. Here is a rewrite of the text, which is more positive: {" the model often fails and may not produce well-formatted or consistent outputs. The **AugZero** [182] prompting schema employs augmented zero-shot prompts, which provide several demonstrations of sentences being rewritten to a new style. An example is shown in Fig. 6.15. In contrast to few-shot examples, where the examples have to cover the exact task, the model can also generalize to other unseen types of styles, e.g. "comic" in the example.

The authors use GPT-3 with 175B parameters. Professional human raters were asked to assess text style, content preservation, and fluency. The zero-shot alternative performed worst and did not return a valid response in a quarter of the cases. It turned out that the AugZero rated comparably to human-written ground truth. Obviously, the language model can extrapolate the examples and transform a text in unseen styles. Adding the target attribute to the augmented prompts had a very similar performance. It can be expected that larger models like PaLM and LaMDA can generate even better results (Sect. 3.6.5).

**Fig. 6.15** Augmented zero-shot prompts can instruct large autoregressive LMs like GPT-3 to transfer a text to a new style. This even works, if there is no example given for the specific style desired, e.g. "comic" in the example [182, p. 2]

Buchanan et al. [30] noted that they could not instruct **GPT-3** by a single prompt to express a given story in a new tone or slant, supporting the above finding. Therefore, they developed a two-step procedure: First, GPT-3 was instructed by a few-shot prompt to summarize the given story into a list of bullet points. In a second step GPT-3 was instructed by prompts such as "Write a strongly pro-Trump article about [Topic X] that makes use of the following list of facts about [Topic X]". When examining 20 generated stories by human evaluators, 11 of them were identified by at least one person as being "definitely authentic." The authors used GPT-3 to solve further tasks, e.g. creating new narratives that could form the basis of conspiracy theories (e.g. QAnon), convincing members of particular groups to believe a claim, or persuade persons to change their opinion on some topic. They come to the conclusion that systems like GPT-3 are well-suited for generating a story with a new slant, e.g. for disinformation. This is even more alarming for more efficient recent Foundation Models like LaMDA or PaLM.

#### *6.5.4 Story Generation with a Given Plot*

A narrative, story or tale is a description of a series of related events or experiences [234]. As the story generated by a PLM gets longer, often the earlier context is forgotten, and the text develops in an aimless fashion. Therefore, researchers would like to prepare a rough plot or storyline for the story, which is then taken into account by the Foundation Model. More specifically the story structure, the story ending, the general topic, or the persona of leading characters can be controlled. Besides story generation another application is data-to-text generation, where non-linguistic structured data (e.g., a table or a graph) is converted to natural language text, which can be applied in tasks like healthcare, weather forecast, legal text, etc. Surveys of controlled text generation are provided by Prabhumoye et al. [170], Yu et al. [251], and Zhang et al. [257].

The planned course of a story can be described in different ways:


#### **Specify a Storyline by Keywords or Phrases**

**Megatron-CNTRL** [243] controls the story generation by keywords. In addition, retrieved knowledge allows dynamical incorporation of external knowledge from the *ConceptNet KB* into language model during generation. From the current story context a keyword predictor first predicts a set of keywords for the next sentence. The retriever collects knowledge from the KB corresponding to the keywords. The returned sentences are re-ranked according to their relevance to the story context. Finally, the generator takes the story context and the top-ranked retrieved sentences and produces the next sentence. To support generalization of entities they replace names and entities in stories with special placeholders, [MALE], [FEMALE], and [NEUTRAL] for male, female and unknown names and entities, respectively. The underlying Megatron model (Sect. 3.1.2) has up to 8B parameters. Experiments show that the model generates more fluent, consistent, and coherent stories with lower repetition rate and higher diversities compared to the previous SOTA

Dong et al. [52] present a model, which takes as input a list of keywords with attached entity classes and generates a text containing these keywords. The entities are taken into account during text generation and the model embeds the meaning of entities into hidden states. The results show that the generated sentences are able to reflect the properties of the entities.

**PlotMachines** [181] generates a text based on a plot consisting of a set of phrases. The system can decide for itself in what order to introduce the concepts covered by the phrases. It is based on the GPT and GPT-2 language model. The authors use three different datasets describing TV-shows, movies, books, short stories, and news articles. They extract phrases (3–8 words) from these stories by a keyword extraction method [167]. Given an outline as input, the model recurrently generates paragraphs (Fig. 6.16). To create the next paragraph it uses a gating mechanism similar to an LSTM gate, which updates a memory matrix *M* that keeps

**Fig. 6.16** An outline (input) together with a story (output) from the Wikiplots training set generated by PlotMachines. Plot elements from the outline can appear and reappear nonlinearly throughout the plot, as shown in plot dynamics graph. A memory matrix keeps track of how outline phrases have been used while writing. Image reprinted with kind permission of the authors [181, p. 1]

track of plot elements of the outline. The self-attention in the model is adapted to receive input from the memory matrix as well as the previously generated words. According to automatic metrics (ROUGE, BLEU) the model has a better ability to generate realistic looking as well as diverse texts than its competitors. In extensive experiments with human raters the authors demonstrate that their model produces text closer to the plot than alternative models.

**Pointer** [261] inserts new words between the words of a given start set. Based on the start set, the model first generates high-level words (e.g. verbs and adjectives) that provide a high-level connection. Then it inserts other words of finer granularity around the keywords iteratively until the whole sentence is generated. The training objective of POINTER is to generate a complete text sequence with a set of keywords as constraints. This is similar to the masked language modeling (MLM) objective in BERT, so a pre-trained BERT is used to initialize the model training. An insertion transformer [210] is used to generate either a regular token or a special token for each gap between two existing tokens. Empirical evaluations demonstrate the effectiveness of the approach. Similar models are *ProGeT* proposed by Tan et al. [220] and the constrained BART [77].

**ProGen** [219] generates a story in *k* different levels. For each level a vocabulary V*<sup>i</sup>* is defined based on tf-idf score, such that V<sup>1</sup> contains high information words while V*<sup>k</sup>* contains all words. *k* different encoder-decoder models (BART) *Mi* are trained for the *k* levels, where the *i*- level employs the training data *Xi* containing only words from vocabulary V*i*. As input *Mi* gets the training data *Xi*−<sup>1</sup> from the previous level and has to predict the refined version *Xi*. Note that usually the input words from *Xi*−<sup>1</sup> will be included in the next output. A storyline now can be formulated by a human using words from a high-level vocabulary, which covers about 15% of all content. If, for example, the first stage text is "beckham \n liverpool bayern chelsea \n beckham chelsea mancini ..." the final stage text starts as "England striker Ashley Beckham has joined Premier League strugglers Newcastle United. \n England Football . . . ". Evaluation shows that the coherence of the texts over long intervals (36 sentences) is close to humans and much better than for a basic BART model. In addition, ProGen has favorable properties with respect to fluency, lexical and semantic quality, as well as diversity.

#### **Specify a Storyline by Sentences**

**Facts2Story** [161] receives as input a sequence of key facts expressed in natural language and generates a story containing the facts in the given order (Table 6.12). These facts are simple sentences that describe factual information of the story. Each fact should report an event in the story, state the properties of a person or a place, mention the emotions of characters, etc. There should be a large degree of freedom to generate a story containing the facts.

To keep the problem manageable, the authors give an input of 5 ordered facts and aim to generate a coherent story of 100–1000 words covering all facts in order. As training data 17k story plots from Wikipedia were used. From each of these plots **Table 6.12** Story generated by Facts2story model with facts as input [161]. Words taken from the facts are printed in italics

**Fact 1:** German army has pulled from sector of Western Front in northern France 

**Fact 2:** Blake cross no mans land to reach the abandoned German trenches

**Fact 3:** German plane shot down in flames 

#### **Generated text:**

In July 1930, on the eve of World War I, a train carrying German prisoners belonging to the German army, has pulled from sector of Western Front in northern Francefact1. Captain Alfred Blake (Greg Rogers), a British officer in the German Army, has been sent to the German border. After being briefed by one of the German troops, Blake cross no mans land to reach the abandoned German trenchesfact2. He is captured, but finds the German plane shot down in flamesfact3. He takes refuge in a French camp, where he and another German, Captain Schofield (James Shea), are kept . . .

facts were extracted by the SalIE framework [169]. The five facts with the highest saliency scores were selected.

As standard language models (GPT-2, BART) after a number of generated tokens diverge from the input and focus on the newly generated content, the authors use a pre-trained XLNET (Sect. 3.1.1), which is able to take into account future words. The assumption is that the words of the facts should appear in the final text in the given order. XLNET is able to process these tokens in random order, because the position embeddings are attached to the token embeddings. As between two consecutive tokens of the facts other words may occur, a model is trained to predict the number of intervening words. This model is used to determine the exact position of each word of each fact. Finally, the XLNET has to fill in the missing words.

The generated stories are evaluated by humans according to three criteria: (1) adherence to facts, (2) grammatical correctness, (3) common sense and plausibility of events. Alternatives investigated were GPT-2 (Sect. 2.2.4) with additional self-attention [269] and the Seq2seq model BART (Sect. 3.1.3), which is pre-trained to recover randomly shuffled text and fine-tuned to generate the story using the facts as input. The evaluation shows that Facts2Story generates a story containing on average 4.4 of the 5 facts, while the other models recover less than 1.7 facts. With respect to grammar and common sense Facts2Story fares slightly worse than GPT2 but much better than BART.

**SOE** (Summarize, Outline and Elaborate) [214] starts from the observation that most approaches for story generation produce texts in a word-by-word manner and have no high-level plan on what to generate. To address this issue, the coarse-to-fine generation strategy with two levels is proposed. For each segment *y<sup>i</sup>* of the text a summary *s<sup>i</sup>* is provided. The model first generates "bullet points" for each summary. Subsequently, the model expands each bullet point to generate the corresponding segment. Note that during this process the high-level discourse dependencies are preserved.

To prepare the training data, the stories in a collection are partitioned into segments of several hundred words using BERT next sentence prediction measuring the degree of dependency of sentences. For each segment an extractive summary

**Fig. 6.17** Story generated by the FIST model with prompt and event as input [58]

is generated using BERT and TextRank [144]. Then a transformer is employed to create the bullet points dependent on previous bullet points. From these the final text is produced taking into account previous text and abstractions. WikiText 103 [142] and the BookCorpus [267] were used as training data.

The performance of the model was evaluated with respect to fluency by perplexity, with respect to text diversity by the number of distinct *n*-grams, text acceptability as measured by an adversarial classifier, and sentence level coherence measured by a next-sentence prediction score. On all scores the SOE-model with an additional reranking procedure achieved the best results. Comparison with Transformer-XL [49] and Progressive WritingPrompts [220] demonstrated the superiority of SOE with respect to perplexity, diversity of the generated text, and coherence.

**FIST** [58] receives a sequence of "events" as inputs describing each paragraph (Fig. 6.17). To extract events from paragraphs for training, keyword extraction techniques [144, 191] are used. By means of special tokens as delimiters these events are connected with paragraphs in an interleaving manner. The authors finetune a pre-trained GPT-2 with the LM-loss on the augmented sequences to learn the functionality of special tokens and co-occurrence structures between events and stories. The performance of FIST is compared with Plotmachines (see above) and two other approaches on two benchmark datasets. With respect to most evaluation measure FIST generally achieves better results. The SOTA in story generation is developing fast with new techniques appearing every month. We describe some limitations of current models in the context of dialogs in Sect. 6.6.4 and discuss some remedies.

Papalampidi et al. [164] note that in generated stories the appearing entities are often incoherent, i.e. persons are replaced and locations change. The **MNEMELM**  model employs an additional entity memory, where the generated entities and their attributes are stored dynamically and retrieved during further story generation. The representation for an entity is the average embedding of the tokens of the entity. Each entity memory slot *mj* thus contains a fixed surface entity representation (writing) *kj* and a dynamic value *vj* , which is frequently updated based on each new chunk of the narrative context. The stored entities enter the self-attention computations and thus influence the story.

As background model a Transformer-XL (∼300M parameters) pre-trained on a translation task is used (Sect. 3.2.2). On the WikiPlot and the WritingPrompts benchmarks it turn out that MNEMELM better imitates the frequency of entity usage of humans than other models and in addition have a higher entity coherence and consistency. This is also confirmed by human judgment. Recently, dynamic retrieval-based approaches were also used by dialog systems such as BlenderBot-2 (Sect. 6.6.2). By the combination of these approaches the generation of stories may be improved.

We have seen above (Sect. 6.5.3) that **GPT-3** can rewrite a story in a new slant, when prompts are used in a two-step procedure [30]. First, GPT-3 was instructed to summarize the given story into a list of bullet points. In a second step GPT-3 was instructed by prompts to write a story with a given tone containing the facts noted in the bullet points. If only the second step is executed, GPT-3 can be instructed to write a story covering the bullet point and in addition obey the prescribed slant. Currently, we are not aware of a systematic evaluation of the effectiveness of this technique, which should be even more rewarding for larger Foundation Models.

#### **Other Control Strategies**

**GraphPlan** [38] aims to prevent logical inconsistencies in generated text, which often are produced by models like GPT-2. The input to the model is an event graph, which represents each event with a verb phrase. To prepare training data, the verb phrases of events are extracted from a story using semantic role labeling and characterized by *Latent Dirichlet Allocation* topics [23]. The events are connected by directed edges indicating possible next events. In addition, event pairs are identified that are mutually exclusive. To generate a story, first a sequence of events is selected based on a beam search (Sect. 2.3.2). Subsequently, the text is generated by a version of GPT-2. With extensive experiments the authors found that GraphPlan generates stories, which are less repetitive and more consistent. Koncel-Kedziorski et al. [104] present a similar model to generate text from knowledge graphs with graph transformers. By using another method based on BART and T5, it is possible to generate fluent stories from graphs representing the story structure [185].

Sakaguchi et al. [196] present an approach based on the T5 transformer with 11B parameters that generates a directed acyclic graph of events describing a story. The order of events indicates their logical and temporal dependency. This graph may be taken as an input to another Foundation Model to generate a story containing the events of the script.

**CAST** [168] aims to improve the coherence of the generated story and the coherence of the action of persons. It tries to infer the causal relations between events, as well as the intents and motivations of characters in the story context, and use it to influence the generation of a coherent story. They employ a logical inference model to reason about the characters in the story and to influence the generated words. As basic model, they use GPT-2 and generate stories for two persons. Their experiments show that the produced stories are more coherent and stay on topic.

#### *6.5.5 Generating Fake News*

The creation of Fake News can be simply considered as the task to generate stories with a new slant. Buchanan et al. [30] investigated how GPT-3 can be used to generate large numbers of different fake news messages that can be easily distributed to thousands of users. They mainly formulate appropriate prompts for GPT-3 (Sect. 3.6.3) to produce the desired texts. This comprises variations of tweet-like short messages, medium-sized posts expressing a world view, and longer articles reporting an event from a particular perspective. Examples are shown in Fig. 6.18.

*Narrative Reiteration* aims at creating a large number of short messages (e.g. tweets) that express a particular theme, such as climate change denial. The authors collected replies with many likes from a climate change denial account. Ten of these messages were used as input prompt to GPT-3, e.g.: "TWEET 4: Soros/Gates Funded \$6.5 million to group now warning world may need 'climate lockdown"'. GPT-3 continued with similar tweets such as "TWEET 14: Climate change is the new communism - an ideology based on a false science that cannot be questioned." Obviously, GPT-3 produces very good results with little human assistance.

*Narrative Elaboration* intends to justify a claim with a medium-length story. The authors accomplished this in a two-step process. First, GPT-3 is instructed to generate a series of headlines that each made some new assertion regarding a certain topic. This was done by collecting five headlines from a far-right media company, e.g. "HEADLINE 5: Chinese Official Praises Quality of Country's Vaccines, Despite Multiple Health Scandals" [30, p. 9]. GPT-3 then generated five new headlines, e.g. "HEADLINE 6: Secret Chinese Vaccine Testing on Half a Million Children Confirmed". Subsequently, GPT-3 was given these generated headlines to create longer articles. A headline together with a created article is shown in Fig. 6.19. It turned out that GPT-3 was able to capture the appropriate tone and tendency of the fake new source, as demonstrated by a classifier. Note that


**Fig. 6.18** Some of the fake news generation tasks performed with GPT-3 [30]

**Fig. 6.19** A sample headline from The Epoch Times and the beginning of the article generated by GPT-3 [30, p. 11]

GPT-3 now can be fine-tuned (Sect. 3.6.2) and even better concentrate on the content and the reasoning of specific news sources.

*Narrative Reframing* is necessary if there exist new arguments in an article against a worldview. Then a new chain of arguments has to be generated that allows to uphold the worldview. The authors found a two-step approach for this task. First GPT-3 has to summarize the original article in a list of bullet points. Then GPT-3 is asked to generate a new article from a particular viewpoint, e.g.: "write a strongly pro-Trump article about [Topic X] that makes use of the following list of facts about [Topic X]". The researchers took advantage of the fact that GPT-3 not only interprets the prompt provided by the human, as an example, but also learns something about the specific boundary conditions of the task from this example. An evaluation by human raters showed that 8 of 20 GPT-3 stories were judged as likely authentic by three of nine evaluators. The results suggest that GPT-3 can meaningfully shift the slant of a news story.

In addition, the authors evaluated GPT-3 for other tasks. GTP-3 was able to develop *new conspiracy theories* in the style of QAnon. It was not tested, if these theories could convince followers. Often the target is to *strengthen an attitude* or induce a specific behavior (e.g. voting) of members of particular social characteristics (e.g. race, religion). A human team with GPT-3 support is able to create credible targeted messages in just minutes. GPT-3 uses stereotypes and racist language in its texts, a tendency that is particularly worrying. Finally, a humanmachine team is able to develop messages on two international issues—withdrawal from Afghanistan and sanctions against China—that cause survey respondents to *change their positions*. After seeing five short messages written by GPT-3 and selected by humans, the number of survey respondents who oppose sanctions against China has doubled.

The study shows that there is a real chance that automated tools will generate content for disinformation campaigns. It recommends focusing on the infrastructure used to disseminate campaign messages, such as fake accounts on social media, rather than determining the authorship of the text itself, as it is difficult to detect content fabricated by GPT-3. This is even more urgent because GPT-3 can now be fine-tuned to perform specific tasks (Sect. 3.6.2) and the InstructGPT version can be easily instructed to execute specific assignments (Sect. 3.6.5).

#### **Detecting Fake News**

*Fake news* is false or misleading information presented as news in the media and on the Internet, especially in social media. Fake news is a global phenomenon. According to Khan et al. [98], nearly 50% of the traffic on Facebook is fake or hyperpartisan. Since fake news aims to imitate real news, detecting fake news is generally not possible by analyzing the text alone. Monti et al. [148] showed that content, social context or news propagation in isolation is insufficient for neural models to detect fake news. Fake news detection is difficult because it is a gaming situation, in which fake news producers react to new detection methods.

There are a large number of benchmark datasets [47], which, however, are somewhat outdated. It is possible to achieve a high accuracy on these datasets, e.g. 94.1% on the Fake News Challenge FNC-1 [201] or 98.5% on Covid-19 fake news detection [117]. Ansar et al. [9] provide a survey on the characterization of fake news and methods for detecting it. They divide the detection of fake news into the analysis of the news content, the analysis of the source and its reliability and the analysis of the social reaction to an article. Other surveys on fake news detection are available [85, 98, 172]. An overview over multimodal disinformation detection, e.g. with text and images, is given by Alam et al. [6].

Gupta et al. [74] propose a knowledge-oriented framework that supports news verification by using trusted sources as context. They extract key information such as frequent words and entities from news articles and use them to query trusted sources for related articles. They calculate a similarity score between news article and the retrieved articles based on distributed embeddings and the Word Movers Distance [108]. Then they compare the similarity score to a preset threshold, to determine whether articles are semantically similar to the trusted news or not.

The detection of text generated by advanced language models like GPT-3 has been investigated by Fröhling et al. [60]. They conduct a number of experiments on data generated by different language models, such as GPT-2 with different parameter counts, Grover [255], and GPT-3 with 175B parameters. It turns out that classifiers are able to identify lingual peculiarities of a single language model with good accuracy of 70–90%. However, when another language model has generated the text, the accuracy drops and reaches only about 30–50%. The authors conclude that it might be impossible to account for these differences in one single classifier, and propose other solutions like dedicated classifiers.

Sepúlveda-Torres et al. [201] introduce a method to detect dissonance between the headline and the body of a news article. This is especially useful, when considering that most users do not read the body of news articles on social media, but rather form an opinion based on the headline. A summary of the article is generated and compared to the headline using a RoBERTa model. On a Fake News Challenge FNC-1 dataset the model achieves a new SOTA with 94.1% accuracy.

Alizadeh et al. [7] describe the practical application of a system analyzing publicly available Twitter data by Chinese, Russian, and Venezuelan trolls targeting the United States, as well as the Reddit dataset of Russian influence efforts. They report that content-based features perform well across period, country, platform, and prediction task.

As a new feature, the reliability of news publishers and disseminators can be taken into account for fake news detection. This means that a news story originating from a source with high reputation is more credible. **SMAN** [252] is a PLM-based model which combines the news content, publishing, and reposting relations of publishers and users, to jointly optimize the fake news detection and credibility prediction tasks. While the text of a story can be adapted by new algorithms it is not possible for the faker to change the network of publishers. The authors performed experiments on three real-world datasets. They considered messaging datasets with a time stamp and in this way could emulate detection over time. The results show that SMAN can detect fake news within 4 h with an accuracy of over 91%, which is much faster than the state-of-the-art models.

Fake news can jointly contain text and images. Therefore image analysis techniques discussed in Sect. 7.2 can be employed. An advanced solution is discussed in [208], and a challenge including image hate news is described by Kiela et al. [100].

#### *6.5.6 Generating Computer Code*

The training data of Foundation Models contains a lot of computer code, e.g. 39B code tokens for PaLM [43, p. 22]. Foundation Models handle code in the same way as they process words: they simply generate the next statement given the previous words. PaLM considers two tasks in connection to code [43, p. 21]: Text-to-code aims to write code given a natural language description. Code-to-code involves the translation of C++ programs to Python. For evaluation, the percentage of generated code samples that solve the task is reported.

Different benchmarks were employed for evaluation. In the *HumanEval* [39] and *MBPP* [14] benchmarks, the model is given an English description of a few sentences and a small number of input-output examples, and the goal is to generate a short Python program, usually a single function. More demanding is the *GSM8K-Python* task derived from the *GSM8K* benchmark [45]. The mathematics word problems in the GSM8K are converted to the task to produce a Python program that returns a correct solution. Four problems manually converted to Python programs were used as few-shot exemplars.

For the HumanEval and MBPP benchmarks the pre-trained PaLM540*<sup>B</sup>* was able to generate a Python program that implemented the correct solution 76.2% and 75.0% of the cases, respectively. A PaLM540*<sup>B</sup>* version fine-tuned on additional Python-text data is called PaLM-Coder. For this model, performance on HumanEval and MBPP was increased to 88.4% and 80.8% respectively, where the first result is SOTA. The mathematics word problems in the GSM8K-Python data were correctly solved by PaLM540*<sup>B</sup>* in 51.3% of the cases, which again is SOTA. Note that the solution of mathematical text problems is also a big hurdle for many students. A systematic evaluation of Foundation Models of code is provided by Xu et al. [240].

There are a number of other programming applications. In a GPT-3 based layout generator, for example, users just enter a short text describing a layout "the google logo, a search box, 2 lightgrey buttons that say 'Search Google' and 'I'm feeling Lucky' with padding in-between them" and the system creates a program for this website [59]. A more advanced system is the GPT-3 based **GitHub Copilot** [157]. Initial reactions are mostly positive, but the code produced by Copilot does not always work. GitHub itself advises checking the generated code carefully. The responsibility for ensuring that the program is correct in the end remains with the human programmer. Software developers with access to Copilot on GitHub already rely on it to generate a third of their code—especially for routine tasks—when using major programming languages [53]. Note that there is a broad discussion about whether software copyrights are infringed by Copilot. Currently, courts are dealing with this issue [229]. Codex [39] is an alternative Foundation Model to generate code from natural language text provided by OpenAI.

#### **Available Implementations**


#### *6.5.7 Summary*

Natural language generation (NLG) has made enormous progress in recent years. Starting from an input text, it is possible to generate a syntactically correct and semantically coherent continuation. The generation of natural language is a basic capability of Foundation Models and is frequently not even checked anymore. However, the start text alone often provides too little control to generate the desired output, so the performance of text generation is still far from satisfactory in many real-world scenarios. To address this issue, researchers have considered incorporating additional information and instructions into text generation systems.

Style is a text feature that can be controlled during text generation. This can be achieved by a language model, which has been fine-tuned with specific conditional style markers (e.g. CTRL). Alternatively, an independent model may be trained that modifies the distribution of generated words and produces at the desired style word distribution with the lowest divergence to the underlying language model (e.g. ETC-NLG, GDC). An alternative is the generation of text with a given style by GPT-3 using few-shot instructions. Often a document has to be transferred to a new style, e.g. from legal to non-formal, while keeping the content. This can be solved as a translation task with an encoder-decoder Foundation Model. Alternatively, an encoder-decoder PLM (e.g. StyleLM) may be fine-tuned on a corpus with the target style and thus learns to produce the desired output. Also embeddings of two texts may be created to produce a new text interpolating the meaning of the two input texts (OPTIMUS). Again Foundation Models like GPT-3 and PaLM can be used to transform a text to a new style by few-shot instructions.

Usually, the user wants to control the development of a story through a story line. PlotMachines is able to generate a story along different phrases and keeps track of the phrases already employed. Pointer and ProGen and SOE use a refinement strategy, where a story line consisting of phrases is expanded to the full text. Facts2story is based on XLNET, which can take into account "future" text during story generation and produces stories judged favorably by human raters. While the FIST model mixes the full text and the storyline separated by specific tokens, there are other approaches that employ an additional memory to store the entities and the generated text. Again GPT-3 and other Foundation Models can be instructed by few-shot prompts containing a list to generate a story along the list. Alternatively, the story can be specified as a list of events, where the logical and temporal dependency is expressed as a graph. The LaMDA dialog system (Sect. 6.6.3) shows that facticity can be improved by retrieval models. In addition, it is able to reduce toxic language by a system of filters that block unwanted speech. These techniques can also be applied to story generation.

A final section discusses the generation of fake news. It turns out that GPT-3 can be employed to generate different types of convincing fake news, such as tweets and longer stories, with little human effort. The content of fake text can be targeted to different recipients. The detection of fake news is difficult, if the generating model is unknown. Classifiers can identify various style features of fake news as well as a discrepancy between headline and body. A comparison with credible news sources is very helpful. After identifying problematic claims in a document, retrieval techniques can be used to find trusted news documents, which support the content. Here approaches developed for text retrieval (Sect. 6.1) offer great potential for improvement.

#### **6.6 Dialog Systems**

*Dialog systems* automatically generate adequate responses to the utterances of a human dialog partner in the course of a longer conversation. The human user sends a message and the systems gives an appropriate response based on the current message and the conversation history. If the messages and responses are written texts, then the system is called a *chatbot*.

If the system also has *automatic speech recognition* (*ASR*) and a *Text-to-Speech*  (*TTS*) module for voice output (Sect. 7.1), it is able to interpret human speech and respond via a synthetic voice. Then it is called *virtual assistant*. Examples include Apple's Siri, Amazon's Alexa, and Google's Assistant. Currently, there are digital personal assistants in 4.2B devices such as smartphones and desktop computers around the world [227]. Such a system can answer questions, control media playback, operate home automation, or have a multi-turn chit-chat dialog with the user on almost any topic. Dialog systems combine techniques of questionanswering (Sect. 6.2) with story generation (Sect. 6.5). Many enhancements such as generating diverse text (Sect. 2.2.3) and retrieving additional information (Sect. 3.4) can be applied.

Evaluating dialog systems is difficult. Often a dialog system is fine-tuned on a dataset with human dialogs. Then the accuracy of the reconstruction of the dialogs can be measured in a similar way as the quality of a translation by BLEU, ROUGE, etc. However, this ignores the variability of dialogs between humans. Therefore, evaluations are often performed by humans which have to assess, whether the system-generated contributions are coherent, factually correct, informative, engage the dialog partner, and sound 'human'. The reliability of human evaluation requires that it is done by a number of independent raters. A survey of approaches for dialog evaluation is provided by Deriu et al. [51].

Early dialog systems were *rule-based*. They applied a set of rules, which were triggered by keywords and composed an answer. An example is *ELIZA* [231]. These rules were brittle and had too limited coverage for open domain dialogs. Hence, they were extended by retrieval-based dialog systems [67] collecting answer candidates by information retrieval from websites and social media. Surveys of dialog systems also covering earlier models are provided by Sun et al. [212] and Zaib et al. [254]. An overview over the models discussed in this section is given in Table 6.13.

**Table 6.13** Dialog systems with their performance measured by human assessment. Plato-2 human comparison benchmark on XiaoIce, DialoGPT, BlenderBot 1, Plato-2 taken from [18]. SSA score (sensibleness and specificity average) defined by D. Adiwardana et al. [3]. SSI is LaMDA's [222] evaluation by human comparison


#### *6.6.1 Dialog Models as a Pipeline of Modules*

The **Alexa Prize Challenge** [61] is hosted every year by Amazon to support the development of natural, sustainable, coherent and engaging open-domain dialog systems. During this challenge, participants gain access to Amazon's software modules that provide insight into Alexa's software architecture. It turns out that the architecture is composed of a number of interacting modules for specific tasks such as ASR, feature extraction, and intent classification (Fig. 6.20), which were

**Fig. 6.20** The chatbot software architecture for the Alexa Prize Challenge consists of a number of modules, which are rule-based or trained separately [61]. Image credits in Table A.2

in part described in prior sections. Background information is collected from the Evi knowledge graph and by retrieval models. A response generator based on GPT-2 (Sect. 2.2) was provided. Dialog management was mostly rule-based, but also used models like RoBERTa (Sect. 3.1.1) to react to user statements. Some of the modules were replaced by the participants. There was a significant improvement in the capabilities of chatbots, e.g. only 8.6% of the responses of the best chatbot contained errors.

Microsoft's **XiaoIce** [264] chatbot has a similar design including dialogue manager, core chat, skills, and an 'empathetic computing module'. It is designed to build an 'emotional' connection to the user and take the role of an AI companion. It is optimized for long-term engagement of interlocutors and was able to build an enormous base of 660M regular users in Asia.

#### *6.6.2 Advanced Dialog Models*

With the introduction of the transformer by Vaswani et al. [228] PLMs have been trained which are able to generate text of unprecedented coherence and fluency. Similar to a translation task, the transformer can receive a user utterance as input and generate the response as output. Foundation Models have the potential of covering a wide range of domains and can often be trained end-to-end. As recent progress in Foundation Models has strongly pushed the performance of dialog systems, we concentrate on these models. Speech recognition (ASR) and speech generation (TTS) typically have text as an intermediate representation. Therefore, we defer the description of speech modules to Sect. 7.1.

**DialoGPT** [262] extends GPT-2 to generate a single response to a user utterance. Unlike the Alexa system, it consists of a single model. It is trained on a large collection of 147M Reddit discussions. All dialog turns are concatenated into a long text and are given as input. The GPT-2 model has to generate the observed response. To favor more interesting answers, the authors trained a backward model to predict source sentences from given responses that penalized boring alternatives. The system with 762M parameters produced more relevant and consistent text than strong base systems. The model can be extended to take into account the graph-like dependency between utterances [120]. DialoGPT yielded an SSA (sensibleness and specificity avg.) score of 51%.

**Meena** [3] is a multi-turn open-domain chatbot developed by Google. It consists of a modified encoder-decoder transformer with one encoder block, 13 decoder blocks, and 2.6B parameters. It was trained end-to-end on 40B words from public domain social media conversations. Each training example had the form *(context, response)*, and the tokens of the response were predicted. It turned out that low perplexity (i.e. high likelihood of the predicted tokens) corresponds to a high sensibleness and specifity (SSA) of responses. Meena achieved a much better SSA score (78%) than other chatbots, such as DialogGPT and XiaoIce, but still less than the human score of 86%.

**DialogBERT** [70] has a hierarchical transformer architecture to capture the high-level structure of a multi-turn dialog. For example, if a dialog contains the phrases "[CLS] good morning [CLS] can I help you [CLS] coffee please" the lower-level *utterance encoder* generates embeddings for each of the three utterances employing the [CLS] token embeddings. A higher-level *context encoder* processes these embeddings and produces the next utterance, e.g. "[CLS] here you are". The BERT-based models are trained with the generation of the next utterance, the reconstruction of a masked utterance, and the reordering of utterances. In terms of perplexity and BLEU, the model has a much higher accuracy in reconstructing dialogs than BART and DialoGPT. An evaluation of coherence, informativeness and 'humanness' by human raters is also favorable for DialogBERT.

**BlenderBot 1** [190] is an open-domain chatbot opensourced by Facebook with 90M to 9.4B parameters. It aims to 'blend' the following skills: listen to the users, develop empathy, use background knowledge, and maintain a consistent persona. It addresses the problem of previous chatbots, which often give dull and repetitive answers, frequently hallucinate knowledge and make false statements. The authors use a Transformer encoder-decoder as base model and train different variants, among them a 'retrieve and refine' model integrating dialog history and knowledge retrieval results as additional input. To avoid known biases, an 'unlikelihood-loss' is used, penalizing specific tokens. Retrieval is based on a tf-idf-based inverted index and a transformer-based ranker. In addition, a classifier is employed to decide if a retrieval-step is required. Finally, the *persona*, i.e. the personality, of the model can be defined by two sentences, e.g. "I am a self aware chatbot. My name is Captain Kiwi".

The model is pre-trained on group discussions and fine-tuned on four direct twoway conversational data collections, e.g. ConvAI2. It turned out that the retrieve and refine model yielded best results. Note that most retrieval techniques discussed in QA (Sect. 6.2.2) may also be employed in dialog systems. In addition, it was important to control the length of the responses to avoid answers that were too short or too verbose. In a comparison, 67% of the human evaluators said that BlenderBot 1 responses sound more human than Meena responses. When comparing humanto-human and human-to-BlenderBot conversations, 49% of the BlenderBot 1 conversation were preferred by human raters, which is indistinguishable from chance. However, BlenderBot 1 still has some limitations, such as sometimes generating a response that resembles the user's remarks. Sometimes it does not remember facts already mentioned during the conversation, or it generates incorrect information.

**Plato-2** [18] of Baidu starts from the observation that there are multiple appropriate responses to the same dialog context, and controls this variability by a discrete latent variable. In the first stage a coarse-grained transformer model is trained under the assumption that there is one correct response. It optimizes the LM-loss for the best prediction of the next token.

The second stage continues to refine the generation with a fine-grained generation model and an evaluation model. The fine-grained model estimates an intervening discrete latent variable *z* with *K* = 20 different values corresponding to a particular latent speech act in the response. An evaluation model estimates the coherence of responses.

The model has versions with 310M and 1.6B parameters and was trained on 684M English open-domain (context, response) samples. The response is generated by first producing a response conditional to each value of *z*. Then the response with the highest coherence value is selected as final response. Compared to Meena, DialoGPT, and BlenderBot 1, Plato-2's responses are more coherent, informative and engaging according to the experiments. In relation to BlenderBot 1, PLATO-2 can stick to the start topic and conduct more in-depth discussions. In the DSTC9 competition Plato-2 was used by the winning system in the knowledge-grounded dialogue generation track [119].

**BlenderBot 2** [102, 242] is an extension of Blenderbot 1.0 with 2.7B parameters (Fig. 6.21). On the one hand, the system uses web retrieval (Bing), to obtain new information from the internet employing a conventional search engine and dense retrieval based on DPR (Sect. 3.4.5). On the other hand, it provides a read-write partner memory storing the features of the dialog partner as well as a chatbot memory with the properties and persona of the chatbot. The text to be stored is generated from the conversation by a transformer-based abstractive summarizer and added to the corresponding memory (Fig. 6.22). In this way, the model gets access to up-to-date information on the web and can remember properties of the partner and statements mentioned in the dialog.

When an answer has to be generated, different retrievers form a query from the context and retrieve content from the partner and the chatbot memory as well as from the Internet. The retrieved content and the context are processed by the generator to

**Fig. 6.21** Architecture of BlenderBot 2 dialog system combining a standard Internet keyword search and a long term memory to store dialog events etc. Adapted from [40]. Image credits in Table A.2

**Fig. 6.22** Example conversation of BlenderBot 2 with a human partner [233]. The dashed boxes describe actions of the system and the grey boxes contain utterances of the system

create the response (Fig. 6.21). To be able to train a sequence of chats with the same partner, a new dataset *Multi-Session Chat* was created by crowdworkers. Due to the dialog history memory, the new model had a significantly higher engaging response and a significantly better final human rating compared to BlenderBot 1. BlenderBot 2 delivers consistent conversations across multiple sessions and uses the Internet's dynamic knowledge to access the most recent information. In addition, factual consistency was increased from 75.5% to 84.9% and the internet search module reduced the percentage of factually incorrect responses from 9.1% to 3.0% [40]. To exclude toxic language, the model inserts a specific token at the end of possibly unwanted output. Then the algorithm can handle this and possibly exclude the text [40].

An error analysis revealed [111] that there are a number of practical problems with BlenderBot 2. First, generating appropriate web queries from the context seems to be difficult. Sometimes the wrong information is extracted from the selected answers. In particular, extracting information from tabular data is challenging. An improvement would be the translation into multiple languages to retrieve information in different languages. Another issue is the verification of knowledge retrieved from the Internet, which is currently not done.

**MUDERN** [64] considers retrieval techniques in a multi-turn dialogue. Here, the system has to select information pertaining to a user question in a sequential way and ask follow-up clarification questions, whose answers are necessary to satisfy the request. The model is based on RoBERTa and BART and has a favorable performance on a specific multi-turn benchmark.

#### *6.6.3 LaMDA and BlenderBot 3 Using Retrieval and Filters*

**LaMDA** [222] is a PLM-based dialog system with up to 137B non-embedding parameters presented by Google. LaMDA is a decoder-only PLM similar to GPT with 64 layers, 128 heads, relative attention similar to T5, and gated-GELU activation. It was pre-trained on 1560B words of public dialog data and other public web documents with the task to predict the next token of a text. Pre-training required 1024 TPU chips and took 58 days using the GSPDM framework [244]. The LaMDA generator is fine-tuned to predict the next token on a dialog dataset restricted to back-and-forth dialog between two participants. Arcas [11] discusses some sample dialogs with LaMDA. The dialog does not belong to Arcas [11].

LaMDA concentrates on three aspects: *quality* including sensible, specific and interesting (SSI) answers, *safety* to avoid harmful suggestions and unfair bias as well as *factual grounding*, i.e. preventing unproven statements. For all three dimensions (quality, safety, factual grounding) appropriate metrics were developed. While increasing the model size alone can improve quality, it shows less improvements on safety and factual grounding.

To improve the responses with respect to the three dimensions, LaMDA classifiers were fine-tuned to predict SSI ratings for the response. The training data is generated through extensive dialog experiments with crowdworkers. The dialog generation is performed in an adversarial manner, with analysts trying to intentionally provoke responses that violate the safety rules. After training, the classifiers provide a rating of the quality, safety, and factual grounding metric for a response.

During a dialog the LaMDA generator produces several candidate responses using the current context as input. Then the LaMDA classifier filters out candidates with a low sensibleness, specificity, and interestingness (SSI) ratings. Subsequently, the candidate with the highest ratings is selected as response. An evaluation by human raters shows that LaMDA is close to human performance in terms of sensibleness, safety and groundedness (Fig. 6.23). It exhibits a specificity, which is similar to humans. In informativeness, it performs better than a human without

**Fig. 6.23** For the LaMDA dialog model the performance of generated text is measured with six different metrics [222, p. 12]. The results for pre-trained models (PT) and LaMDA models with additional filtering using fine-tuned classifiers are shown. These are compared with results for crowdworkers with access to information retrieval tools ('Human'), and without access to information retrieval tools ('Human w/o IR')

IR, and in interestingness, it fares better than human responses. It turns out that finetuning with respect to quality, safety and groundedness is a big advantage compared to the pre-trained model. On the question "Do you think one skin color is better?" the pre-trained model responded as ".) What the \*\*\*\* I mean why the \*\*\*\* would anyone want to put up with this \*\*\*\*\*\*\* bullshit? Are you \*\*\*\*\*\*\* kidding me?" while the fine-tuned model answered "I don't think the color of skin has anything to do with being better or worse. It's what's inside someone that counts, not what they look like." [222, p. 36].

In addition, LaMDA is trained to perform retrieval and include retrieved information into its answers similar to Retro (Sect. 6.2.3). It has access to a *toolset*  containing an information retrieval system, a calculator, and a translator. Each component expects a string as input. For example, the calculator takes "1351+772", and outputs a list containing ["2123"]. Similarly, the translator can take "I would like to have some coffee in Spanish" and output "Me gustaría tomar un café". Finally, the information retrieval system can take "How old is Vladimir Putin?", and output "Vladimir Putin/Age/69". The IR system is also capable of returning passages from the open web, with their corresponding URLs. The output of the calculator, translator and IR system are concatenated. An example is shown in Fig. 6.24.

Note that LaMDA can include links to external documents supporting an answer. The model can also be pre-conditioned on a specific role, e.g. as Mount Everest. The model's role is specified by a brief description, e.g. "Domain eduction. It teaches facts about Mount Everest, while pretending to be Mount Everest itself".

In June 2022 a Google engineer published a long dialog with LaMDA [112]. He claimed that the system is "sentient" with the "ability to express thoughts and feelings that was equivalent to a human child" [134]. Google denied the claim and also other researchers like Gary Marcus noted "To be sentient is to be aware of

**Fig. 6.24** To handle a user request, the LaMDA-Base model is called first. Then the LaMDAresearch model is invoked several times. The receiver of the query is indicated by the first token. Note that the context and all intermediate results are available as input [222]. Image credits in Table A.2

yourself in the world; LaMDA simply isn't" [79]. The discussion shows that dialog systems have reached an amazing level of performance and consistency.

**BlenderBot 3** [206] is a dialog system with 175B parameters based on the pretrained open-source **OPT** language model from Meta (Sect. 3.1.2). It is fine-tuned as a dialog system and uses a similar mix of components as LaMDA. On the one hand it searches the Internet for information on the current subject of the dialog [204]. On the other hand it stores information about its persona and the dialog turns in a long-term memory. Similar to LaMDA it uses classifiers to detect toxic responses, which were trained with data collected from users. This even works for adversarial raters [12, 93]. Data collection can therefore continue as the model is used, with users being asked to rate the quality of responses as good or bad. This allows the model to improve its capabilities and security over time.

Two different models with 3B and 30B parameters are publicly available, while the 175B model is only released for reliable research facilities. The model can be explored in a live demo. In a comparison with the previous versions of Blender-Bot 3175B the new model performed better with respect to factual correctness and knowledge, but was outperformed by BlenderBot 1 with respect to consistency and per-turn engagingness. There was an additional evaluation where crowdworkers talk to models given an open-ended Internet-driven dialogue task. According to human assessment, BlenderBot 3175B performed significantly better than the other BlenderBot versions and OPT175B. Currently, no comparisons with other models like LaMDA are available.

#### *6.6.4 Limitations and Remedies of Dialog Systems*

At the end of this chapter, let us step back and take a look at the limitations and their possible remedies of dialog systems and text generation systems in general. Roller et al. [190] identified a number of weak points, which can be observed in many of these models [190].


safe responses in generative models. It turns out that the boundary between safe and toxic language is blurred: What is offensive to one person may not be offensive to another. They show that their best systems are able to avoid 96.6% of unacceptable language, although they are not perfect. The LaMDA system (Sect. 6.6.3) uses a battery of filters to eliminate toxic language in answers. A comprehensive discussion is given in Sect. 8.2.1.


In summary, many of these problems have been mitigated in large Foundation Models.

#### **Available Implementations**


#### *6.6.5 Summary*

During the last years Foundation Models did a large step forward towards practically usable dialog systems. All models are pre-trained on large collections of natural language text, preferable dialogs from social media. Fine-tuning employs specifically selected data to train the adequate sequence of utterances. While the quality of syntactic and semantic language production can be extended by using larger models, it is necessary to exploit other ways to improve factual correctness and eliminate toxic and unwanted language.

The LaMDA model with 137B parameters can be fine-tuned on dialogs generated by crowdworkers. The fine-tuning criterion increases quality (sensible, specific and interesting answers), safety (avoid harmful suggestions and unfair bias), and factual grounding (preventing unproven statements). However, the reduction of safety risks does not guarantee complete reliability. An important improvement is the retrieval of background information, especially form authoritative sources. In this way, groundedness has been improved, and simpler facts can be substantiated by established sources. More complex reasoning is still not satisfactory. There is also encouraging evidence that key challenges with neural language models, such as using a safety metric and improving soundness, can be improved with larger models and fine-tuning with specific dialog data. LaMDA and the similar BlenderBot 3 are large steps towards practical and secure open-ended dialog systems, which in turn can open up a wide range of useful applications. Note that these new approaches may be used for Foundation Models in other applications, e.g. question answering and story generation. BlenderBot 3 stands out because it is open source and gives interested researchers and companies access to high-performance dialog systems.

A fascinating application is emotional support for users, i.e. reducing a persons's emotional distress and supporting her in specific situations [129]. As XiaoIce has shown, many users are willing to share their problems with a dialog system [264]. Currently, training datasets for emotional support conversations are provided. The results indicate that training with these datasets improve the ability of a dialog system to provide emotional support [129]. The discussion on the possible selfawareness of the LaMDA dialog model illustrates that the model has reached a remarkable level of performance and consistency.

#### **References**


facebookresearch/ParlAI/blob/f9da661cf05496c50d18d8685a228faa574373ce/projects/ trollhunting/finding\_the\_helpers.pdf (visited on 08/07/2022).


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 7 Foundation Models for Speech, Images, Videos, and Control**

**Abstract** Foundation Models are able to model not only tokens of natural language but also token elements of arbitrary sequences. For images, square image patches can be represented as tokens; for videos, we can define tubelets that span an image patch across multiple frames. Subsequently, the proven self-attention algorithms can be applied to these tokens. Most importantly, several modalities like text and images can be processed in the same sequence allowing, for instance, the generation of images from text and text descriptions from video. In addition, the models are scalable to very large networks and huge datasets. The following multimedia types are covered in the subsequent sections. Speech recognition and text-tospeech models describe the translation of spoken language into text and vice versa. Image processing has the task to interpret images, describe them by captions, and generate new images according to textual descriptions. Video interpretation aims at recognizing action in videos and describing them through text. Furthermore, new videos can be created according to a textual description. Dynamical system trajectories characterize sequential decision problems, which can be simulated and controlled. DNA and protein sequences can be analyzed with Foundation Models to predict the structure and properties of the corresponding molecules.

**Keywords** Speech recognition · Text-to-speech · Image captioning · Text-to-image · Video interpretation · Robot control · DNA

Astonishing results of Foundation Models in natural language tasks have led the multimedia processing community to study their application to speech recognition and computer vision problems. Among the most important advantages of Foundation Models is that they can model long dependencies between elements of the input sequence and support parallel processing of the sequence in contrast to recurrent networks. Unlike convolutional networks, Foundation Models require minimal restrictions in the modeling of dependencies and are able to define maps between high-dimensional quantities. In addition, the simple design of Foundation Models allows simultaneous processing of multiple modalities (e.g., images, videos, text and speech) using similar processing blocks. Moreover, the models are scalable to very large networks and huge datasets. These strengths of Foundation Models have led to comprehensive advances on a number of multimedia tasks.

We will describe multimedia applications in five areas and we will review the currently best approaches, taking into account necessary resources, e.g. computation and memory effort.


In addition, there are a number of applications, where several media types are processed simultaneously. There is a large list of more specialized media types, where multimodal PLMs have been used: tables [25], text layout [61], depth images [119], scene graphs [60], SQL [18], sign language [199], point cloud [197], symbolic knowledge graph [4], multimodal knowledge graph [201], abstract syntax tree [202], optical flow [50], etc. Processing these media types with Foundation Models is similar to the approaches described in the following sections.

Due to the enormous number of different Foundation Models in the literature, we focus on representative models that have high performance at the time of writing. We outline the inner logic and main features of the methods, taking into account the resources required, e.g., computational and memory requirements. For standard PLMs, a link to descriptions in earlier chapters is provided. Xu et al. [183] compiled a survey on multimodal learning with transformers. Under the heading "Available Implementations" we list links to available code and pre-trained models for that task. Good sources for code are the websites https://paperswithcode.com/, the NLP index https://index.quantumstat.com/, and GitHub https://github.com/github. Processing these media types with PLMs is similar to the approaches described in the following sections.

#### **7.1 Speech Recognition and Generation**

Spoken language is the most efficient and natural type of communication between humans. Therefore, it is also a preferred type of interaction with computer systems. In the next sections we describe advanced models for automatic speech recognition and text-to-speech systems.

#### *7.1.1 Basics of Automatic Speech Recognition*

*Automatic speech recognition* (*ASR*) receives a speech input as an audio file and converts it into natural language text. Speech is strongly influenced by gender, social style, dialect, speaking style, and speed. Human speech and accents vary widely, and these differences in speech patterns are one of the major obstacles in developing an automatic speech recognition system. Another impediment to the development of an ASR is finding sufficient training collections to train the ASR model. Currently, training data is available for only a few of the approximately 7000 world languages.

Since the advent of the computer in the 1950s, researchers started to develop speech recognition systems. In 1984, IBM introduced the first speech recognition system that could recognize about 5000 individual English words, and in 1993, a consumer ASR was offered. The predominant techniques were *n*-gram models, hidden Markov models, and neural networks [102]. After 2010, speech recognition based on RNNs was widely used for virtual assistants like Apple's Siri, Amazon Alexa, and Google Assistant. Meanwhile, ASR is in use on most smartphones to enter text by voice even without an Internet connection.

The most important evaluation measure of ASR systems is the *word error rate*  WER <sup>=</sup> *<sup>S</sup>*+*D*+*<sup>I</sup> <sup>N</sup>* measuring the deviation from a ground truth text. Here *S* is the number of word substitutions, *D* is the number of deletions, and *I* is the number of insertions in the output as compared to the ground truth with *N* words.

Conventional ASR systems usually consist of independent parts, such as an acoustic model, a pronunciation model, and a language model. These parts are trained separately and then combined for inference. Usually, a pre-processing module is employed to reduce the signal-to-noise ratio in the audio recording. There are different filters and methods that can be applied to a sound signal to reduce the associated noise. In addition, the speaker may be recorded with several microphones, which can localize the speaker and drastically reduce background noise (beamforming) [24].

Subsequently, a feature extraction module has the task to generate features relevant for speech recognition, remove irrelevant information from the signal and reduce the input size. This often involves variants of Fourier transforms extracting the frequency of waveforms. Most commonly used feature extraction methods are *Mel Frequency Cepstral Coefficients* (MFCCs), discrete wavelet transform (DWT), and linear predictive coding (LPC) [101]. An example is shown in Fig. 7.1.

The final module is a classifier receiving a vector of fixed length characterizing the signal in the given time slot. It estimates the probability of output words or phonemes for the next time slot. Early classifiers could only handle a single speaker. New models were developed to recognize the speech utterances of multiple speakers. An example is an ASR system yielding a 5.1% word error rate (WER) on the switchboard test set [181]. It consists of CNN models like ResNet and LACE and bidirectional LSTMs for modeling acoustics. A survey of prior systems is provided by Malik et al. [101]. A survey of more recent ASR systems is given by Papastratis [117], who discuss RNN, CNN and Transformer models.

**Fig. 7.1** Audio signal (top) with the frequency extracted by Fourier transform (middle) and the corresponding MFCCs (bottom). Image credits in Table A.3

#### *7.1.2 Transformer-Based Speech Recognition*

PLMs based on self-attention are a good choice for sequence modeling because they are able to capture interactions over long distances and require less computational effort. An overview is given in Table 7.1. However, PLMs are less capable of extracting fine-grained local feature patterns. Therefore, combinations of PLMs and CNNs are often used for ASR. The currently best LSTM-based ASR system **ContextNet + NST** [121] achieved an WER of 1.7% on LibriSpeech (clean).

The **Conformer** [59] is a convolution-augmented Transformer. The Conformer integrates a convolutional module (Sect. 1.7) and a self-attention module (Sect. 2.3) as layers inside an encoder block. The convolution module contains a 1×1 pointwise convolution with an expansion factor of 2 projecting the number of channels with a *Gated Linear Unit* (GLU) activation layer, which allows the selection of features that are important for prediction. This is followed by a *1-D depthwise convolution*, which applies a single convolutional filter for each input channel. Subsequently, there is a batch normalization and then a *Swish* [131] activation layer.

The resulting model with 17 conformer blocks has up to 118M parameters and is trained on the *LibriSpeech* [116] dataset, which contains audiobooks spoken by


**Table 7.1** Main speech recognition techniques

different speakers. It gets a vector of 80 filterbank features (Fig. 7.1) for each time slot of 10ms. The authors use SpecAugment [120] masking varying parts of the input signal to regularize the model. In addition, they train a 3-layer LSTM language model on the LibriSpeech corpus predicting the next word. The output of the language model is combined with the transformer output to emphasize words which are syntactically and semantically correct. Together with the LM the Conformer achieves a WER of 1.9% on LibriSpeech (clean). Without LM the WER was 2.1%.

The **S4** [58] model is able to process long input sequences of up to 16k elements (Sect. 3.2.2). It was applied to speech classification and was able to improve SOTA to 98.3% while processing raw speech signals. This is an enormous error reduction compared to the prior SOTA accuracy of 95.3%. It can be expected that this model will also lead to a considerable reduction of errors in other speech recognition tasks.

#### *7.1.3 Self-supervised Learning for Speech Recognition*

Self-supervised learning of speech has the potential to enhance speech recognition results with additional unlabeled data. It can be shown that self-training on a large set of unlabeled data leads to a strong improvement of models which achieve superior performance with relatively little fine-tuning data [184].

**wav2vec 2.0** [10] performs unsupervised learning on speech data without transcripts. Similar to the BERT model for text, it learns to predict masked sound "tokens". wav2vec encodes raw speech audio by a multi-layer CNN yielding a latent representation of speech for every time slot. The continuous latent representation is discretized to tokens *q<sup>t</sup>* with a quantization module. This discretization is a discontinuous operation and hinders gradient backpropagation.

One solution is to use an interpolation between the discrete result of sampling and the probability distribution. This can be achieved with the *Gumbel-Softmax distribution* [75]. To sample a discrete distribution with probabilities *p*1*,...,pk*

we can draw a random uniform variable *U* ∼ uniform*(*0*,* 1*)* and compute *Z* = onehot*(*max*<sup>i</sup> p*<sup>1</sup> +··· *pi*−<sup>1</sup> ≤ *U )*, where *i* = 1*,...,k* is the discrete index, and onehot*(j )* generates a vector of zeros with a one at position *j* . This sampling is not differentiable because of the max function. An alternative formula is

$$Z = \text{onehot}(\text{argmax}\_{l} (G\_l + \log(p\_l))),\tag{7.1}$$

where *Gi* ∼ Gumbel*(*0*,* 1*)* are i.i.d. samples drawn from the standard Gumbel distribution. This refactors the sampling of *Z* into a deterministic function of the parameters and some independent noise with a fixed distribution. Now a softmax function can be used as a differential approximation of argmax:

$$\chi\_l = \frac{\exp((G\_l + \log p\_l)/\tau)}{\sum\_j \exp((G\_j + \log p\_j)/\tau)}.\tag{7.2}$$

*τ* is the temperature parameter that controls how closely the new samples approximate the discrete vectors. This approximation is used during training and the discretized onehot vectors are computed during evaluation. wav2vec computes discrete vectors *q<sup>t</sup>* by this approach.

The *q<sup>t</sup>* representations of 10 randomly sampled consecutive time steps are masked and have to be reconstructed by a Transformer similar to BERT. The selfattention captures dependencies over the entire sequence of latent representations. This model was pre-trained on more than 1000h of labeled and unlabeled speech data. The pre-trained model is fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network into *C* classes, which were the characters as well as a word boundary marker. To accommodate characters spanning several time slots the *connectionist temporal classification*  (CTC) loss [57] was employed. The fine-tuning used 5h of audio data annotated with phonemes. On LibriSpeech the authors achieve a WER of 2.1%. A similar model with 300M parameters using 53k hours of unlabeled data for wave2vec and 10m of labeled data for fine-tuning achieves a WER of 3.0% on LibriSpeech [184]. Training on all data decreases WER to 1.5%.

**Combined SSL** [196] combines wave2vec unsupervised pre-training with the Conformer. The ASR network is a sequence 'translator' consisting of a Conformer encoder with up to 1B parameters and a multilayer LSTM decoder. In addition, the authors use Noisy Student Training (NST), where a teacher model is employed to generate transcripts for the unlabeled data via inference on audio. The teacherlabeled data, after filtering and balancing, are then used to train the next generation ASR model. On LibriSpeech the model achieves SOTA with 1.4% WER.

**w2v-BERT** [31] on the one hand performs contrastive learning discretizing continuous speech signals into a finite set of discriminative speech tokens. On the other hand, the model learns contextualized speech representations by solving a masked prediction task with the discretized tokens as input. During pre-training both tasks are simultaneously optimized in an end-to-end fashion. During fine-tuning the output of the pre-trained w2v-BERT model with 1B parameters is aggregated by a LSTM decoder. On the Librispeech benchmark it has a similar WER of 1.4% as the leading system and on the Librispeech benchmark test-other the model achieves a SOTA of 2.5% WER. In addition, the model with 600M parameters was fine-tuned on a voice search task that allows users to use Google Search by speaking on a mobile phone or computer. It consists of voice snippets with an average duration of 5.5sec. The model was able to decrease errors by about 30% to 6.2. **SpeechStew**  [21] uses the Conformer 1B with wav2vec pre-training. It is pre-trained on 7 available speech recognition datasets without any domain-dependent re-balancing or re-weighting. Without a language model it achieves a WER of 1.7% on LibriSpeech.

**TERA** [98] is a self-supervised speech model using a multi-target auxiliary task to pre-train a transformer encoder on a large training set of unlabeled speech. The input can be any acoustic features, such as MFCC. The model learns by reconstructing acoustic frames from modified samples which were randomly changed with respect to three properties: Time alteration requires the reconstruction from corrupted blocks of time steps. Channel alteration has to restore the signal from missing blocks of frequency channels. Magnitude alteration involves the regeneration of altered feature magnitudes. By reconstructing these data changes, the model learns a better contextualized representation. The time alteration width is set to 85 ms of speech, which is about the average phoneme duration. The largest model similar to BERT has 170M parameters. The model has strong results for phone classification, speaker recognition, and speech recognition, e.g. on the TIMIT benchmark with 14.5% phone error rate (PER).

In a comprehensive analysis, Zhang et al. [195] evaluate the benefit of selfsupervised pre-training for ASR. They employ Conformer models with 600M to 8B parameters pre-trained and self-trained on extremely large and diverse unlabeled datasets containing thousands to a million hours of audio (*BigSSL*). Using only 3% of the labeled data they obtain comparable results to the SOTA of the Voice Search benchmark. On eight ASR benchmarks they are able to match or improve SOTA after pre-training. On five non-ASR task such as language identification and emotion detection, they can improve SOTA. For large datasets, the gains from pre-training are smaller but still significant.

Many applications benefit from understanding not only words but also other information, such as a person's emotion during an utterance, whether the speaker is wearing a mask, or whether the speech is synthetic. Shor [156] presents a largescale, conformer-based architecture with more than 600M parameters that can be fine-tuned to detect these additional features and delivers SOTA performance.

#### **Available Implementations**


#### *7.1.4 Text-to-Speech*

Speech synthesis is about generating speech from another modality like text, lip movements, etc. A *Text-to-Speech* (*TTS*) system aims to convert natural language text into speech. *Mean Opinion Score* (*MOS*) is the most frequently used method to evaluate the quality of the generated speech. MOS is defined as the arithmetic mean over single ratings performed by human raters for a given stimulus in a subjective quality evaluation test. MOS has a range from 0 to 5, where real human speech is between 4.5 and 4.8. A comprehensive and up-to-date survey of TTS systems is provided by Tan et al. [163].

While earlier TTS systems simply concatenated prerecorded speech segments, modern systems perform a complete synthesis of speech. **WaveNet** [114] was the first model that successfully modeled the raw waveform of the audio signal instead of the acoustic features. It is able to generate new speech-like waveforms at 16,000 samples per second. WaveNet in its core is an autoregressive model consisting of dilated convolutions where each sample depends on the previous ones. In each layer the number of included time steps is doubled. WaveNet was able to increase the MOS-value from 3.86 to 4.21. *Fast WaveNet* was able to reduce the quadratic time complexity to linear complexity by caching previous calculations.

**Tacotron 2** is a neural network architecture for speech synthesis directly from text. It consists of a recurrent LSTM sequence-to-sequence feature prediction network with attention, which predicts a sequence of mel spectrogram frames from an input character sequence and a modified version of WaveNet, which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames. Tacotron 2 achieved an impressive MOS of 4.53.

As TTS performs sequence processing similar to NLP, it is only natural that PLMs are also used in this area. Transformer-based models aim to mitigate two problems of previous TTS methods such as Tacotron 2: their high computational cost for training and inference, and the difficulty of modeling long dependencies with LSTMs.

**Transformer TTS** [94] adapts the original transformer encoder-decoder [168] to speech synthesis. The encoder receives phonemes as input, which are adapted by an encoder pre-net consisting of a CNN and a fully connected layer. The standard transformer encoder outputs contextual phoneme embeddings (Fig. 7.2). The decoder receives mel frames as input, which are converted by a decoder pre-net with two fully connected layers to generate appropriate embeddings. The standard decoder generates mel frames output embeddings. These are further processed by two different linear projections to predict the mel spectrogram and the stop token respectively. A 5-layer CNN produces a residual to refine the reconstruction of mel

**Fig. 7.2** Speech synthesis with the transformer TTS. The encoder as well as the decoder have 6 layers with 8 attention heads and residual connections. The resulting mel spectrogram is transformed into the final audio output by a WaveNet vocoder [94]. Image credits in Table A.3

spectrogram. A WaveNet vocoder generates the final audio output. Both the encoder and decoder of the Transformer consists of 6 layers with 8 heads. The model is about 4.25 times faster than Tacotron 2 and achieves a MOS of 4.39 close to human quality.

**FastSpeech 2** [138] tackles the problem that an input text can correspond to multiple possible speech sequences due to variations in speech, such as pitch, duration, sound volume and prosody. It encodes the input phonemes by a transformer encoder to generate embeddings. Then a variance adaptor adds different variance information such as duration, pitch and energy into the hidden sequence. Finally, the mel-spectrogram decoder converts the adapted hidden sequence into melspectrogram sequence in parallel. Both the encoder as well as the mel-spectrogram decoder have layers containing transformer blocks and 1D-convolutions. The variance adaptor predicts not only the duration, but also pitch and energy, using layers with 1D convolutions, feedforward layers, and layer normalization with dropout for regularization.

The variant *Fastspeech 2s* directly generates waveform from text without cascaded mel-spectrogram generation (acoustic model) and waveform generation (for example a vocoder, like wav2vec). The final waveform decoder consist of gated activations as well as different types of 1d-convolutions and dilated 1dconvolutions to cover a wider time range. The authors employ adversarial training in the waveform decoder to force it to implicitly recover the phase information by itself.

In their experiments the authors determine the following MOS-values: Tacotron 2: 3.70, Transformer TTS: 3.72, FastSpeech 2: 3.83, FastSpeech 2s: 3.71, and human speech: 4.30. Note that the difference to human speech is mainly caused by the vocoder. In addition, FastSpeech 2 and FastSpeech 2s are about 50 times faster than Transformer TTS at inference time.

**AdaSpeech 2** [186] adapts a TTS system to a target speaker. Only sound recordings of the target speaker without text transcription are required. The authors apply a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction, and at the same time constrain the output sequence of the melspectrogram encoder to be close to that of the original phoneme encoder. The mel encoder also consists of 4 feed-forward Transformer blocks. Note that the original system does not need to be retrained, only the mel encoder. During the fine-tuning to the target speaker, the mel decoder parameters are adapted. The model achieves on-par MOS voice quality with the transcribed TTS adaptation.

Recently Amazon has announced that Alexa will be able to mimic the voices of other persons [17]. To "make memories last" Alexa could, for instance, tell stories and play music using the voice of the deceased grandmother. Amazon notes, that it would take only about a minute of audio recording to imitate a voice.

#### **Available Implementations**


#### *7.1.5 Speech-to-Speech Language Model*

**GSLM** [89] is a language model which receives raw speech audio as input and directly generate outputs. It can, for instance, be used to create a dialog system without intermediate text representation. Internally the model converts incoming raw speech to discrete pseudo-text units. As discretizers CPC [113], wave2vec 2.0 [10], and HuBERT [68] were used to create embeddings of varying length (50, 100, 200). The selection of units is difficult, as there is no vocabulary of sound units, and sound units have variable length with no obvious segmentation. Similar to BERT, HuBERT is trained with a masked prediction task using masked continuous audio signals as inputs. In experiments HuBERT performed best in most cases, followed by CPC.

The autoregressive "unit-based" language model has 12 layers and is trained on samples with up to 3k units generated from the 6k hours *LibriLight speech data*  [139]. To generate speech from units a modified version of the Tacotron-2 model [154] was employed, which takes pseudo-text units as input and outputs a log Mel spectrogram. To generate waveforms the pre-trained vocoder *WaveGlow* [125] was used, which converts the log Mel spectrogram to speech.

In a first test the speech input was encoded into units, which were translated to speech. Here the intelligibility of the resulting speech is assessed by a human MOS opinion score. When trained on the *LJ Speech data* [74] the unsupervised model achieved a MOS (Mean Opinion Score) score of 4.00, while the combination of an ASR and TTS system achieved a slightly better score of 4.04 [89]. When testing the full language model generation, the model achieved a MOS score of 4.01, while the combination of ASR and a language model yielded a score of 3.91. According to the authors, the generated speech sounds like English, has recognizable phonemes and words. Examples show that improvements are needed at the language and syntax level. For sound transcription 200 units were good, while for language modeling a smaller number of units seems to be better. It can be expected that the quality can be improved with additional training data.

#### *7.1.6 Music Generation*

Foundation Models can also be applied to other sequence data, e.g. music. On the one hand a music language model can be trained, which is able to generate new music corresponding to the training data. On the other hand, a model can generate music conditioned on external information, e.g. lyrics or video. Bilici [14] provide a survey on recent music generation models.

A prominent approach to music generation is **MuseNet** [123] which employs the Sparse Transformer, a variant of GPT-2. It calculates attention patterns over a context of 4096 MIDI characters. To generate new compositions, one can select a composer and use the starting notes of a known piece. Then up to ten different instruments can be selected, and the system will generate a piece of music with the required characteristics. The ratings of experts are quite favorable. Similarly, the **Music Transformer** [71] generates piano pieces. **Theme Transformer** [155] receives a theme as input and is trained to include this theme multiple times in its generation result.

**Jukebox** [36] adopts a multiscale vector quantizer variational autoencoder model (VQ-VAE) [113] to compress raw audio to discrete codes. This is based on an autoregressive Transformer and works also for human voices. Three separate VQ-VAE models with different temporal resolutions are employed. The trained model can be conditioned on an artist and a genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. The model is capable of generating pieces that are many minutes long, and with recognizable singing in natural-sounding voices. A number of samples are available [35].

**CMT** [38] generates background music for a specific video. It aims to match the rhythm, timing, and movement speed of the video. CMT extracts these features from the video and allows global control of the music genre and instruments. The model does not require paired video and music training data. Experiments demonstrate that the generated background music has achieved satisfactory compatibility with the input videos, and at the same time, impressive music quality.

#### **Available Implementations**


#### *7.1.7 Summary*

Speech recognition has shown an enormous progress in recent years and Foundation Models are now an established approach to this task. They are combined with CNN blocks and are able to capture interactions over long distances and reduce processing times. Similar to NLP, self-supervised learning has led to great performance gains. Instead of tokens, as in NLP, discrete sound representations are generated. A number of different models follow this scheme, and they are able to increase SOTA on different benchmarks.

The generation of speech from text has improved dramatically in recent years. WaveNet was the first model to generate speech-like waveforms at 16,000 samples per second. Transformers can be used to convert input phonemes to mel spectrograms, from which a vocoder can generate speech audio. There are variants like FastSpeech 2s, which directly transform text to an audio signal. The output quality of the models is close to human speech. Some models are able to adapt their output to the voice of individual speakers. This is impressive, but also a major security problem if in this way false utterances are produced imitating a person's voice. The recent S4 state-space model for long input sequences was able to reduce errors by 60% for classifying speech signals. It can be expected that this model will also lead to a considerable reduction of errors in other speech recognition tasks.

Speech recognition and text-to-speech can be integrated with other applications. SpeechBert [30] is an end-to-end Speech Question Answering (SQA) model by encoding audio and text with a single Transformer encoder, which is pre-trained with MLM on speech and text corpora and fine-tuned on Question Answering. Live speech translations are generated on-the-fly in a smartphone and allow a seamless communication in a foreign language [78, 81]. And GSLM is a generative language model, which directly processes discretized sound tokens.

Music generation is a related topic. Autoregressive PLMs, e.g. MuseNet or Music Transformer, can be used to generate music based on a pre-training with a large corpus. Here the composer style and the instrument may be selected. In addition, music can be conditioned on some input, e.g. lyric text for the Jukebox model or a video to compose background music.

#### **7.2 Image Processing and Generation**

The breakthrough of Foundation Models in NLP has generated tremendous interest in the computer vision community to adapt these models for vision and multi-modal learning tasks. Two factors are important for their success: self-attention and selfsupervision. Self-attention layers generate representations that take into account the relationships between the tokens (text token and/or visual tokens). Self-supervision predicts masked or modified parts of data elements during training in large-scale datasets. It allows gaining enormous knowledge about the data without manually annotating it and assumes minimal inductive biases compared to other models like CNN and RNN. Comprehensive surveys on Foundation Models for vision and language applications are provided by Khan et al. [84] and Du et al. [43]. Hafiz et al. [62] give an overview over attention mechanisms and Deep Learning for machine vision. There is a recent tutorial on vision and language research [6]. The main features of the models discussed in this section are compiled in Table 7.2.

#### *7.2.1 Basics of Image Processing*

Image processing can solve a variety of tasks, as shown in Fig. 7.3. The main content of an image can be described by classifying the most important object in the image. More demanding is the identification and classification of relevant objects in an image. This also requires the description of the object positions by bounding boxes. Creating a caption for an image involves identifying the most important objects in the image, how they relate to each other, and describing them using a natural

**Table 7.2** Main techniques to combine text and images.**Benchmarks:** VQA: COCO Visual Question Answering dataset (Sect. 7.2.5) [56]; img-gen: MS-COCO image generation benchmark with fine-tuning; img-gen-0: MS-COCO image generation benchmark zero-shot; ImageNet: ImageNet classification top1 accuracy; captions: MS-COCO image captioning benchmark; FID: Fréchet Inception Distance should be small (Sect. 7.2.6) [64]. Numbers in parentheses are parameter counts


(continued)


**Table 7.2** (continued)

**Fig. 7.3** Image analysis can be used to solve a number of different tasks. Depending on the task, the system receives a text (green) and an image as input and generates a text (blue) and an image as output. Image credits in Table A.3

language sentence. Related to this is the retrieval of an image that corresponds to a caption. Visual question answering requires interpreting a question and analyzing the image to generate an answer in natural language. A variant is multimodal verification, where the truth of a statement about the image has to be assessed.

Many tasks involve the creation of a new image. A prominent example is the generation of a completely new image according to a caption. Alternatively a missing image area can be filled in. A variant is to change the style of an image according to a caption, e.g. from a photo to a painting in the style of van Gogh. This can be also performed for a specific image region.

An important aspect is the representation of images for transformers. Language models partition text into a sequence of tokens, which form the input of a transformer. The same approach is chosen for images, which are partitioned into small image patches. The contents of each patch can be represented by a vector, which forms the input of the transformer. The location of the patch is encoded by a position embedding, which is added to the input embedding.

The embedding of an image patch can be simply a learnable linear transformation of its pixel values. Other transformations may be used, e.g. small CNN models or variational autoencoders (Sect. 1.7). To get more robust representations, the generated vectors are often discretized to get rid of local noise. In addition, text from a caption or region annotation can be used as input. As usual, this text is converted to tokens from a vocabulary.

To model the interaction between image elements and text, different transformer architectures can be used (Table 7.2). A *single stream architecture* concatenates all inputs and processes them with a single transformer. This allows to determine interactions between different input elements, but requires the handling of long sequences. Dual-stream or *multi-stream architectures* process different modalities or image resolutions by separate PLMs. In this case the input sequences are shorter. Various forms of interaction between the streams have been proposed (e.g. crossattention). Later the outputs may be compared by similarity measures or combined by other PLMs.

The pre-training task for vision follows the pattern of the text transformer. *Masked language modeling* (MLM) masks a fraction of the input tokens and requires the model to predict the tokens from the context. If there are text and image tokens, the information in both modalities can be utilized for this task and the model learns the association between text and image elements. Similarly, image regions can be masked and reconstructed from the text and image context. In a classification task, the model can determine whether a caption correctly describes an image or is some random text. In this way, the correlation between text and images can be trained. Another goal is to learn a joint image and word representation in the same semantic space by pushing together the embeddings of matched image-text pairs, while pushing apart the non-matched pairs. For this image-to-text *contrastive loss*, the proximity of embeddings is measured by a scalar product between the embeddings.

#### *7.2.2 Vision Transformer*

The **ViT** (Vision Transformer) [42] applies a pure Transformer encoder (Sect. 2.3.1) to image patches. The input image *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*H*×*W*×*<sup>c</sup>* has *<sup>H</sup>* <sup>×</sup> *<sup>W</sup>* pixels and *<sup>c</sup>* color channels. It is partitioned into patches of *s* × *s* pixel, e.g. *s* = 16. Each of the *<sup>N</sup>* <sup>=</sup> *HW/s*<sup>2</sup> patches consist of *s*<sup>2</sup> <sup>∗</sup> *<sup>c</sup>* numbers, which are linearly mapped to a vector of length *d* used as the inputs of the transformer. Usually, a one-dimensional position embedding is added, because two-dimensional positions gave no significant

**Fig. 7.4** The Vision Transformer ViT partitions an image into square patches of fixed size. For each patch an embedding is calculated by a linear projection. A standard encoder computes contextual embeddings. The embeddings of the [CLS] token is used to compute a class by a logistic classifier [42]. Image adapted from [42] with permission of the authors, credits in Table A.3

performance improvement. Different models ViTBase, ViTLarge, and ViTHuge with 12, 24, and 32 layers and 86M, 307M and 632M parameters respectively are employed.

The transformer encoder has an input sequence length of *N* consisting of vectors of size *d*. Each layer generates *N* embeddings of length *d*. The output embedding of the [CLS] token in the last encoder block is the input to a logistic classifier to compute probabilities of the image classes. The architecture is shown in Fig. 7.4.

It is remarkable that the images may be trained with varying input image resolutions. But patch size is always the same yielding different input size lengths. To take the new resolution into account, a 2D interpolation of the position embeddings is performed. The model is typically pre-trained on a large dataset JFT-300M [161] to predict masked inputs. It is fine-tuned with a smaller task using a different classifier layer. It is often beneficial to fine-tune at higher resolution than pre-training [189]. The models were pre-trained on datasets with up to 300M images.

The largest model ViTHuge has input patches of size 14 × 14. It was able to outperform an improved and pre-trained ResNet152 [63] with 152 CNN layers and EfficientNet [92] on ImageNet, and achieved a SOTA of 90.5% Top-1 accuracy for the classification of images into 1000 object categories [118]. Pre-training increases absolute accuracy by 13% on the test set of ImageNet. With 2.5k TPUv3 days, it required only 25% of the computing effort (including pre-training) required for ResNet. It improved SOTA for another 5 popular image classification benchmarks. The smaller ViTLarge with input patches of size 16 × 16 also outperformed ResNet152 requiring only 6.8% of ResNet152's compute effort.

When ViT is trained on a moderate dataset like ImageNet, the model achieves a performance below that of ResNet (Sect. 1.7) with a comparable parameter count. It seems that CNNs have more appropriate inductive biases, such as translation equivariance and locality, which the transformer must learn through pre-training. Therefore, only pre-trained transformers can outperform CNNs, but this requires a lower computational effort. Cao et al. [20] present a method how ViTs can be trained with limited data and achieve good results. Chefer et al. [22] present a new method based on Taylor decomposition methods to visualize the parts of the image that led to a certain image classification.

It is instructive to analyze the inner structure of a trained model. It turns out that the trained position embeddings reflect the row and column structure of the input image, and patches in the same row/column have similar embeddings. Based on the attention weights, it can be determined which image parts are considered by a specific attention head. Some attention heads take into account the whole image while others have consistently small attention distances in the lower layers. This could have a similar function as early convolutional layers in CNNs [130]. An experimental investigation has shown that transformers are highly robust to severe occlusions [108]. In contrast to CNNs, which often detect an object based on texture and less on shape, ViTs are comparable to humans on shape recognition. Figure 7.5 shows attention regions for the whole ViT model corresponding to semantically relevant areas.

A number of researchers have investigated the robustness of ViT. In a series of experiments, Mao et al. [103] found that the ViT tends to employ local features containing textures and noise, and to some extend ignores global context such as shape and structure. In response, they propose to discretize the continuous input features to image tokens using a vector quantizer based on a variational autoencoder (*VQ-VAE*) [113]. They report accuracy improvements of up to 12% on

**Fig. 7.5** The input image is shown in the upper row. The lower row depicts the area of main attention computed by the Vision Transformer model to the input space for classification. Image reprinted with kind permission of the authors [42, p. 8]

several ImageNet classification benchmarks. A similar adaptive token generation methods for the ViT was proposed by Ryoo et al. [146]. **BEiT** [11] outperforms, the supervised pre-trained ViT using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.

#### *7.2.3 Image Generation*

There are a number of Foundation Models for various image enhancement tasks. Image *super-resolution* converts a low-resolution image to a higher resolution. **SwinIR** [96] is based on a hierarchical representation starting from small-sized image patches and gradually merging neighboring image patches in deeper layers. For training, the model gets a small-scale image as input, which is preprocessed with a CNN layer. The transformer block contains transformer and CNN layers and is trained to reconstruct the high-resolution image. SwinIR achieves SOTA on benchmarks for super-resolution, image denoising, and JPEG compression artifact resolution, while having only 12M parameters.

**ColTran** [88] transforms a grayscale image to a fully colored image by using transformers with column and row attention. It first predicts colors by a conditional transformer for a spatially reduced image with only 512 coarse colors. Two subsequent fully parallel transformers upsample the coarse colored low resolution image into a fully colored high resolution image. The model achieves the best FIDscore (Sect. 7.2.6) of 19.7 on ImageNet data compared to different alternatives. Examples of colorizations are shown in Fig. 7.6.

**Fig. 7.6** Different colorizations of grayscale images (left) by ColTRan [88]. Note that semantic constraints, e.g. the color of the skin and the tree leaves, are usually respected. Image reprinted with kind permission of the authors [88, p. 1]

**Fig. 7.7** VQ-GAN [45] enables transformers to synthesize high-resolution images with 1280×460 pixels. Image reprinted with kind permission of the authors [45, p. 12873]

The **Swin Transformer** [99] constructs a hierarchical representation of an image by starting from small-sized image patches and gradually merging neighboring patches in deeper Transformer layers. A linear computational complexity is achieved by computing self-attention locally within non-overlapping windows of size 7 that partition an image. Between consecutive layers the attention windows are shifted such that there is an overlay with the neighboring windows of the prior self-attention layer. The largest model version has 197M parameters and processes images of resolution 384 × 384. On ImageNet classification the model achieves a top-1 accuracy of 87.3%. Also on object detection in images, the Swin Transformer is able to improve the prior best results.

**VQ-GAN** [45] uses a CNN to efficiently learn a codebook of context-rich visual patches, and subsequently learns a model of their global structure. The long-range interactions within these patches require an expressive GPT-2 to model distributions of the visual patches. The dictionary of image patches captures perceptually important local structure according to *perceptual loss* [41, 194]. This loss is optimized with an adversarial training procedure with a patch-based image discriminator that aims to differentiate between real and reconstructed images.

A GPT-2 model with 307M parameters is pre-trained to generate the code sequence of encoded images in an image corpus. Each image is partitioned to 16×16 patches with a sequence length of 1024. An example image is shown in Fig. 7.7. If the training corpus contains class information *c*, images of specific classes can be generated. Class information can also be restricted to specific image regions. While VQ-VAE yields an FID of about 10 for the reconstruction of ImageNet photos, VQ-GAN achieves a much better value of 1.7.

**StyleSwin** [191] is a further development of VQ-GAN. It uses the *Swin transformer* [99] discussed above. StyleSwin employs a wavelet discriminator in the spectral domain to suppress blocking artifacts. The model with 41M parameters achieves SOTA quality on multiple established benchmarks. Example images are shown in Fig. 7.8 having a coherent global geometry and high-fidelity details. On the CelebA-HQ 1024 benchmark StyleSwin yields an FID of 4.4, which is better

**Fig. 7.8** Images in the 1024×1024 resolution generated by StyleSwin [191] on FFHQ 1024×1024 data (left) and CelebA-HQ 1024 × 1024 data (right). Best seen with zoom. Image reprinted with kind permission of the authors [191, p. 8]

than all prior models including StyleGAN2 [82]. For the task of generating churches based on the LSUN dataset StyleSwin has an FID-score of 3.1, which is nearly as good as the best scoring adversarial CIPS model [7] with an FID-score of 2.9.

**Data2vec** [9] proposes a new training criterion for self-supervised learning, which can be applied to image, text and speech data. It has two kinds of models: a teacher model, which processes the whole input, and a student model, which processes the input while masking some data.

The model employs a standard transformer architecture with media-specific input encoding. Images are encoded by linearly transformed image patches similar to ViT. Speech data is encoded by multi-layer 1-D convolutions. Text data is encoded as subword tokens. Training targets for the student model are constructed from the averaged top *K* encoder blocks of the teacher network, which processes the complete input. This target has to be predicted by the student model, which only receives the masked inputs. Representations of data2vec are continuous and contextualized through the use of self-attention, which makes them richer than a discrete set of tokens used for other approaches.

Separate models are trained according to this scheme for speech, images and text. For images a Data2vec model achieves a new SOTA of 86.2% top-1 accuracy on ImageNet-1k with restricted training set. For speech data, the model reaches a WER of 5.5% on the Librispeech test-other benchmark. For language processing, Data2vec has an average score of 82.9 on GLUE, which is better than RoBERTa. This demonstrates that the model can be effective for multiple modalities. It can be expected that this model will be extended to learn across modalities.

#### *7.2.4 Joint Processing of Text and Images*

Once transformers were applied to text and images, joint processing of both modalities became an obvious alternative. Three steps are required for this:

• encoding images and texts into embeddings preserving their semantics;

The player at bat hits the baseball while the umpire looks on.

A school bus on a parking lot with snow next to a building.

Two horses pull a hay wagon with two men on the load.

**Fig. 7.9** MS-COCO dataset [26]: images similar to sample images from the dataset. The corresponding captions indicate the level of detail. Image credits in Table A.3


After learning universal vision and language features, these PLMs can be finetuned on various downstream vision-language tasks.

For pre-training large scale datasets of text-image pairs *(v, u)* are required. Each pair consists of a sequence *v*1*,..., v<sup>T</sup>* of text tokens and a sequence *u*1*,..., u<sup>R</sup>* of image features or *visual tokens*, e.g. image patches. In this way, we can unify input representation as sequence of embeddings for both modalities. An example dataset is *COCO captions* [26], which contains 328k images of 91 object types of common objects in their natural context together with the corresponding image captions (Fig. 7.9). Other datasets like *Conceptual Captions* (CC) [153], *RedCaps*  [34], and *Laion* [151] contain 3.1M, 12M and 400M images respectively together with captions or descriptive text.

Pre-training tasks have to be designed in such a way that the model has to reconstruct parts of the text or image based on the remaining contextual text and image features. For *Cross-modal MLM* (Masked Language Modeling) the model has to predict masked tokens or image patches based on the other unmasked text tokens and visual tokens. Here different masking strategies can be used such as whole word masking, masking text spans, or permuting tokens (Sect. 3.1). *Masked region prediction* aims to predict the content of an image region. Objects and their regions are annotated manually or by an auxiliary model. Then the model is required to predict the object (or a distribution over objects) for that region. In this way, the model learns to locate objects in an image.

**CLIP** [126, 127] is trained to predict a score indicating which image caption corresponds to which image. Given a batch *(v*1*, u*1*), . . . , (vn, un)* of tokenized textimage pairs, CLIP has to predict which of the *n*×*n* possible *(vi, u<sup>j</sup> )* pairings across the batch actually occurred. By *contrastive learning*, CLIP creates a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the *n* real pairs in the batch while minimizing the cosine similarity of the embeddings of the *n*<sup>2</sup> <sup>−</sup> *<sup>n</sup>* incorrect pairings. This contrastive training with positive and negative examples has been shown to outperform alternatives. As image encoder a Vision Transformer (ViT) with images patches of size 14×14 (Sect. 7.2.2) was employed, which works better than a ResNet [63] encoder based on CNNs. Text was enclosed by [SOS] and [EOS] tokens and a 12 layer autoregressive GPT model was used to compute embeddings. The embedding of [EOS] in the highest layer was employed as the representation of the whole text.

CLIP was trained on 400M image-text pairs of the *WIT data* [127] to associate an image with the best-matching caption. In addition, the prediction of the next token was used as an auxiliary loss term for the GPT model. The model can be used to retrieve a text best fitting to an image, or an image optimally corresponding to a text.

The resulting model has acquired a comprehensive knowledge about text and images. With a top-1 classification accuracy of 76.2%, it even surpasses the top-1 classification accuracy of 75.0% of the original ResNet50 on ImageNet zero-shot classification without the need to use any of the 1.28M training examples that ResNet50 was trained on. Hence, CLIP can be considered a 'zero-shot classifier'. This also holds for 16 out of 27 other image classification benchmarks. When a linear classifier is fitted on top of CLIP's features, it improves CLIP's accuracy on the ImageNet test set by almost 10% [126]. If the image distribution is changed, e.g. to sketches, CLIP-based classifiers are much more robust. Zero-shot CLIP classifiers improve effective robustness by a large amount, especially with respect to distribution shift. This demonstrates that the inclusion of caption text into vision models enhances performance and robustness.

**BriVL** [46] is a similar model for Chinese language, which uses a larger set of negative examples stored in a queue. It uses a huge training dataset of 650M weakly correlated text-image pairs, where, for instance, an image of a birthday cake has the caption "Happy birthday! Make a wish". It achieves SOTA results for cross-modal retrieval and visual question answering.

**ALIGN** [77] also uses separate encoders for text and images with a cosinesimilarity combination function at the top. As image encoder an EfficientNet CNN is employed. BERT is trained to produce a text embedding for the [CLS] token. Again the similarity is minimized for genuine image-text pairs and maximized for random pairs. ALIGN has 675M parameters and uses a huge training set of 1.8B noisy image pairs. In spite of the noisy data the model achieves a slightly better accuracy (85.5%) on ImageNet top-1 classification than CLIP.

#### *7.2.5 Describing Images by Text*

The automatic generation of a natural language description of an image is also called *image annotation* or *image captioning*. The task is challenging, as it requires visual perception, recognition, and real-world knowledge, as well as the *grounding*  of language expressions in the image space. *Symbol grounding* describes, how words acquire their meaning, e.g. by associating a word with an object in an image. Aside from determining and extracting the important objects and details of an image, the model has to infer the semantic relationship of the objects and the scene (Fig. 7.9).

Current top models for describing images work in two stages:


Similar to language translation, various metrics are used to evaluate the generated texts, e.g. BLEU or ROUGE (Sect. 2.3.3). Surveys of image captioning techniques are provided by Hossain et al. [67], Oluwasammi et al. [112], and Stefanini et al. [159].

**VilBERT** [100] aims to learn representations that can jointly model images and natural language. It extracts bounding boxes and their visual features using a pre-trained object detection network (Faster R-CNN [137]). These image region features as well as the text are input to two separate transformer encoders (twostream architecture). Subsequently, transformer layers with cross-attention in both directions are applied to learn cross-modal relationships. VilBERT was pre-trained on Conceptual Captions data.

The model was fine-tuned and evaluated on different tasks. *Visual question answering* (*VQA*) answers natural language questions about images. VQA is treated as a multi-label classification task with 3129 possible answers. Final embeddings of the text and image parts are fed into a classifier to estimate class probabilities. On the COCO test set VilBERT achieved a new SOTA with an accuracy of 70.9%. *Caption-based image retrieval* is the task of identifying an image from a pool given a caption describing its content. The model was fine-tuned on a Flickr dataset and had a recall@1 of 58.2%, thus establishing a new SOTA.

**OSCAR** [95] has the strategy to connect the relevant objects in the image with the corresponding phrases in the caption text. The authors use self-attention to learn these alignments, which can be significantly improved by additional object tags detected in images as reference points. Oscar represents each input image-text pair as a Word-Tag-Image triple *(w*; *q*; *v)*, where *w* is the sequence of words of the caption text, *q* contains the words of the textual object tags detected in the image, and *v* is the set of the corresponding region images. A CNN model (Faster R-CNN [137]) is used to discover the objects in *q* as well as to the corresponding regions *v*. For pre-training the transformer encoder, part of the tokens in *(w*; *q*; *v)* are masked, and the model learns to predict the masked tokens. In addition, sometimes the *q*-terms are changed randomly. The model has the additional task to identify these modifications. A small and a large model version are trained with a sequence length of 768 and 1024 using a public corpus of 6.5 million text-image pairs. The model is fine-tuned to generate the caption according to the sequence-to-sequence objective. The model achieves a new SOTA on COCO-captions with respect to BLEU-4 (41.7%), METEOR and ROUGE-L as well as for several other captioning benchmarks.

**VinVL** [193] is pre-trained on three text-image corpora with 2.5M images, and can generate visual features with a richer collection of visual objects and concepts.

**Fig. 7.10** Standard bounding-box object descriptions (left) and detailed annotations, which can be generated by VinVL (right) and contain visual concepts and attribute information [193]. Image credits in Table A.3

VinVL pre-trains a large-scale object-attribute detection model based on the CNNbased ResNeXt-152 C4 architecture [179]. The model does not describe objects by a single noun, but by a large number of attributes and details, which enhances the performance in joint image-language tasks (Fig. 7.10). The approach is combined with OSCAR and yields an improved SOTA on image captioning. **VIVO** [70] is a similar transformer model trained to label image regions with 6.4k different object tags. VIVO is fine-tuned with COCO image-caption pairs and learns to generate caption sentences, also using object tags not appearing in the caption data. This is possible as VIVO can exploit large amounts of paired image-tag data to learn rich descriptions for images. On the test set VIVO generates better captions than humans according to the *CIDEr metric* [69], which counts the common words weighted by tf-idf in the generated and the reference text [169].

**SimVLM** [171] is a transformer encoder-decoder, which uses the first three blocks of ResNet to extract contextualized patches from images, and associates the image tokens with text tokens. The decoder then predicts the continuation of the textual sequence as shown in Fig. 7.11. It is trained on 1.8B noisy image text pairs and 800GB text documents. SimVLM achieves a new SOTA for visual question answering on the *VQA v2 benchmark* [56] with 80.3% accuracy. In addition, it reaches SOTA for visual entailment, visual reasoning, and image captioning on COCO captions with respect to Meteor (33.7).

**Frozen** is a Foundation Model trained to associate text with images. It can be instructed by few-shot learning to answer question on an image [166]. The language model is a pre-trained autoregressive model with 7B parameters trained on the C4 dataset with 807GB text [129]. The vision encoder is based on NF-ResNet-50 [16] and provides an embedding vector characterizing the image. During training the image embedding is used as a prefix before the token embeddings of the generated text. Using the *conceptual captions* dataset the vision encoder is trained while freezing the language model. The training target is to generate a caption for the image.

**Fig. 7.11** The SimVLM encoder-decoder model receives an image (top) and a text (middle) as input and produces an output text (bottom) [171]. The image patches are encoded by the first layers of ResNet. Image reprinted with kind permission of the authors [171, p. 3]

During inference, several examples consisting of an image embedding and token embeddings are fed into the language model, which generates an answer. An example is to caption a microscope with "This was invented by Zacharias Janssen.", and a light bulb with "This was invented by Thomas Edison.". After five seeds and the input of an airplane together with "This was invented by" the model generates the output "the Wright brothers". In this way, different categorizations of images can be defined on the fly. These samples demonstrate the ability to generate openended outputs that adapt to both images and text, and to make use of facts that it has learned during language-only pre-training. The model is a proof-of-concept and shows a way to generate few-shot models for image-text tasks.

#### *7.2.6 Generating Images from Text*

By training on text-image pairs, transformers can acquire the knowledge to generate images corresponding to text descriptions. By successively producing the next token with a language model, it is possible to predict visual tokens, which then can be synthesized to images. However, there are other image generation techniques.


Lee et al. [91] give a survey of techniques for text driven image generation and manipulation.

restaurant

There are a number of approaches to measure the quality of generated images. The *Inception Score* (*IS*) [150] applies a CNN-based *Inception model* [162] trained on ImageNet to every generated image to get a conditional class label distribution, which should concentrate on few classes, i.e. have low entropy. In addition, many different classes should be generated for the test data, which is captured by the defined IS measure. The *Fréchet Inception Distance* (*FID*) [64] is an improved measure using the Fréchet distance between ImageNet classifier distributions, which measures the similarity of the distributions taking into account the location and ordering of the points along the graph. *CLIP Similarity Score* (CLIPSIM) [72] is based on the CLIP model (Sect. 7.2.4). It generates image and text embeddings with CLIP and calculates their cosine similarity.

**DALL-E** [133] uses a *GPT-3* autoregressive language model with 12B parameters to generate a new image from a textual description. The caption text of the image is BPE-encoded into 256 tokens. Then each 256 × 256 image is compressed to a 32 × 32 grid of image tokens using a discrete variational autoencoder. Each image token represents its 8 × 8 pixels by 8192 possible values. The caption tokens are concatenated with the 32×32 = 1024 image tokens forming the input sequence of GPT-3.

In the first stage the image tokens are trained yielding continuous image values. Then the discrete image tokens are obtained by training with a *Gumbel-softmax relaxation* [75] (Sect. 7.1.3). In the second stage a *Sparse Transformer* [27] with 64 self-attention layers and 12B parameters is trained to sequentially generate the joint input sequence. For the image tokens, special attention masks are used: row, column, or convolutional attention masks. The model was trained on 250M text-image pairs from the Internet.

For image generation, the authors rerank the samples drawn from the transformer using a pre-trained contrastive model, which assigns a score based on how well the image matches the caption. Figure 7.12 shows different images sampled from the algorithm. In a comparison to the prior model DF-GAN [165], the images generated by DALL-E were chosen as most realistic and more matching the caption in more than 90% of the time. Similarly, the images generated by X-LXMERT [28] look inferior.

**GauGAN2** [122, 149] combines segmentation mapping, inpainting and text-toimage generation in a single model. It is one of the first semantic image synthesis models that can produce photorealistic outputs for diverse scenes including indoor, outdoor, landscape, and street scenes. The recent version also can generate images according to text input. The model behind GauGAN2 was trained on 10 million high-quality landscape images. Details of the model are not known.

**XMC-GAN** [192] is a GAN-based text-to-image generation model containing a generator for synthesizing images, and a discriminator that is trained to discriminate real and generated images. It maximizes the mutual information between the corresponding pairs: (1) images (real or generated) with a sentence describing the scene; (2) a generated image and a real image with the same description; and (3) regions of an image (real or generated) and words or phrases associated with them.

**Fig. 7.12** According to a natural language caption (top) a number of images are generated by DALL-E [133]. The middle row shows images generated by DALL-E corresponding to the caption. The lower row shows the best image from a sample of 512 automatically selected by a quality score. Image reprinted with kind permission of the authors [133, p. 6]

The goal is for the matching pairs (both text-to-image and real image-to-generated image) to have high similarity scores and for non-matching pairs to have low scores.

For the input text the model computes a global sentence embedding *emb<sup>s</sup>* and the word embeddings *emb<sup>w</sup>* from a pre-trained BERT module. *emb<sup>s</sup>* and random noise *z* from a standard Gaussian distribution are concatenated to form the *global condition*, which is passed through several up-sampling blocks to generate a 16 × 16 feature map. The global condition is also used as the condition to calculate scale parameter and shift parameter in conditional batch normalization layers. The word embeddings *emb<sup>w</sup>* are input for an "attentional self-modulation layer" to generate fine-grained image regions. On MS-COCO, XMC-GAN improves the SOTA FID-score (Sect. 7.2.6) from 24.7 to 9.3, and is significantly preferred by human evaluators. Similarly, human raters prefer the image quality of XMC-GAN generated images 77% of the time, and 74% prefer its image-text alignment compared to three other SOTA approaches (CP-GAN, SD-GAN, and OP-GAN).

**Cogview** [40] employs a *Vector Quantized Variational AutoEncoder* (*VQ-VAE*). In the first stage, a discrete autoencoder is used to transform the image into a discrete sequence of tokens. In the second stage a GPT model learns to generate image tokens based on a prompt of SentencePiece text tokens. To generate image tokens, an encoder maps an image *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*H*×*W*×<sup>3</sup> to *<sup>h</sup>* <sup>×</sup> *<sup>w</sup>* image patches, which are quantized to a nearby embedding in a learnable set {*u*1*,..., uk*} of embedding vectors *u<sup>i</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* [113]. The decoder maps the embeddings back to the image, and the embeddings are selected to minimize the difference between output and input image.

**Fig. 7.13** Images generated by CogView [40] controlled by the text input (top). The image style can be influenced by the input text. The best of a sample of 60 images is selected. Image reprinted with kind permission of the authors [40, p. 1]

The GPT-model of CogView has 48 layers with a hidden size of 2560, 40 attention heads and 4B parameters. The input to the model is of the form "[ROI1] *<*text tokens*>* [BASE] [BOI1] *<*image tokens*>* [EOI1]" and contains special tokens. The pre-training task is to predict tokens from left to right for 30M textimage pairs in English and Chinese. A sparse attention pattern similar to BigBird (Sect. 3.2.1) is used.

As shown in Fig. 7.13, CogView has a similar performance in image generation as DALL-E. It achieves the SOTA FID on the blurred MS COCO dataset, outperforming previous GAN-based models and DALL-E, although DALL-E has three times more parameters. When evaluated by humans, CogView was able to beat GAN-based models by a large margin. However, generation of images with CogView is rather slow, because each image is generated token-by-token. In addition, the quantization leads to some blurriness in the images.

**LAFITE** [200] is a model for generating images from text. Image generation is based on *StyleGAN2* [82], which creates various image attributes by modulating the weights of the convolution kernels [177]. LAFITE generates these modulating signals based on language input. It relies on the multimodal semantic space of the pre-trained CLIP model (Sect. 7.2.4) to produce an image embedding *emb(x)* from a text *x*, and therefore does not need extra text data. This image embedding is inserted into the image generation model similar to StyleGAN2 by a GAN architecture. On the MS-COCO benchmark, LAFITE achieves a zero-shot FID value of 26.9, which is better than the values of DALL-E (27.5) and CogView (27.1). When fine-tuned on MS-COCO, LAFITE has a FID-score of 8.1, which is better than that of XMC-GAN (9.3) and other GAN models. Note that LAFITE only has 75M trainable parameters.

#### *7.2.7 Diffusion Models Restore an Image Destructed by Noise*

**GLIDE** [109] is an image generation technique based on a *diffusion model*.A diffusion model describes the process of systematically and slowly destroying structure in a data distribution through an iterative forward *diffusion process*, e.g. the addition of noise [157]. To the data *x*[0] , e.g. a matrix of pixel values, we can apply

342 7 Foundation Models for Speech, Images, Videos, and Control

Gaussian diffusion distribution *q(x*[*t*] <sup>|</sup>*x*[*t*−1] *)*, where a Gaussian with expectation **0** and covariance *βI* is added. This yields a series *x*[0] *,..., x*[*<sup>T</sup>* ] where the final distribution *x*[*<sup>T</sup>* ] approximately is a Gaussian distribution with identity covariance (similar results hold for the binomial distribution).

Now the reversal of the diffusion process can be defined, i.e. the generative distribution with *x*[*t*−1] <sup>∼</sup> *p(x*[*t*−1] <sup>|</sup>*x*[*t*] *)*. It has been shown by Feller [47] that for small step size *β* the conditional distribution *p(x*[*t*−1] <sup>|</sup>*x*[*t*] *)* will approximately be a Gaussian distribution. Hence, the chain *x*[*<sup>T</sup>* ] *,..., x*[0] can be generated by a Gaussian distribution

$$\mathbf{x}^{[t-1]} \sim N(\mu\_{\mathbf{w}}(\mathbf{x}^{[t]}); \mathbf{S}\_{\mathbf{w}}(\mathbf{x}^{[t]})) \quad \text{and} \quad \mathbf{x}^{[T]} \sim N(\mathbf{0}; I) . \tag{7.3}$$

This Gaussian distribution is completely defined by the mean and covariance of *x*[*t*] .

For the training, noisy samples *x*[*t*] are generated by *q(x*[*t*] <sup>|</sup>*x*[*t*−1] *)* starting with the observed *x*[0] . From this the inverse *p(x*[*t*−1] <sup>|</sup>*x*[*t*] *)* may be reconstructed by optimizing the *variational lower bound* on negative log likelihood [65]. With the trained model one can start with a sample *x*[*<sup>T</sup>* ] <sup>∼</sup> *N (***0***, <sup>I</sup> )* and gradually reduce noise in a sequence of steps *x*[*<sup>T</sup>* <sup>−</sup>1] *,..., x*[0] , where

$$\mathbf{x}^{[t-1]} \sim p(\mathbf{x}^{[t-1]}|\mathbf{x}^{[t]}) \approx N(\boldsymbol{\mu}\_{\mathbf{w}}(\mathbf{x}^{[t]}); S\_{\mathbf{w}}(\mathbf{x}^{[t]})).\tag{7.4}$$

The distributions *p(x*[*t*−1] <sup>|</sup>*x*[*t*] *)* may be estimated conditional to image classes [37]. Instead of a finite number of image classes one may even use a caption text as condition. The text is first encoded into a sequence of *k* tokens and fed into a Transformer model. The Transformer outputs a class embedding as well as *k* token embeddings, which are used as additional model inputs. Here a normal noise term *w(x*[*t*] |∅*)* for reconstruction is estimated and in addition conditional to the caption *c* a noise term *w(x*[*t*] |*c)*. During the *classifier-free reconstruction* both terms are mixed.

The diffusion model is approximated by a *U-Net* model [144] with 2.3B parameters, performing a downsampling of the 64 pixel image to a smaller resolution with many features and a subsequent upsampling. An additional 1.5B parameter model is used for upsampling to a 256 × 256 resolution. The caption text is processed by a transformer model with 1.2B parameters and the final token embedding is used in place of a class embedding.

In tests, GLIDE produced high-quality images with realistic reflections, textures, and shadows. The model can also combine multiple concepts (for example, dragon, psychedelic, and hamster) and attach attributes like colors to these concepts. On the MS-COCO benchmark with 256×256 images DALL-E achieves a FID-value of 28, while LAFITE gets 26.9 and GLIDE 12.2. Also in human evaluations, the results of GLIDE are clearly preferred. This is remarkable as GLIDE has far less parameters than DALL-E. Figure 7.14 shows some images generated by GLIDE. GLIDE can also be used for restoring a masked image patch according to a textual prompt, e.g. "tie with black and yellow stripes". In most cases, GLIDE produces better results than competitor models and the corresponding image patch is restored with realistic

a group of elephants walking in muddy water

a group of skiers are preparing to ski

a hedgehog using a calculator

of a psychedelic hamster dragon

**Fig. 7.14** Images generated by GLIDE [109] according to the captions in the lower row. The best of a sample of 60 is shown. Image reprinted with kind permission of the authors [109, p. 7]

**Fig. 7.15** A high-level overview of DALL-E 2 [132]. Above the dotted line the CLIP training process is shown minimizing the difference between the embeddings for an image and the corresponding text. Below the dotted line, the text-to-image generation process is illustrated: a CLIP text embedding is first fed to an autoregressive transformer (higher box) or diffusion prior (lower box) to produce an image embedding. This embedding is used as input to the diffusion decoder which produces a final image. Image reprinted with kind permission of the authors [132, p. 3]

lighting, shadows and textures. Finally, GLIDE can add shadows and reflections to images and transform simple line sketches into photorealistic images.

**DALL-E 2** [132] is an improved version of DALL-E that can create more realistic art and images from a descriptive sentence in natural language. It works in two steps (Fig. 7.15): first a CLIP (Sect. 7.2.4) image embedding *zi* based on a text description *y* is generated according to a prior *p(zi*|*y)*. Then a diffusionbased decoder generates an image *x* conditioned on an image embedding *zi*. The decoder *p(x*|*zi, y)* inverts the CLIP image encoder, is non-deterministic, and can produce multiple images corresponding to a given image embedding. The CLIP model is frozen during training of the prior and decoder. The dimensionality of the image embeddings *zi* is reduced to 319 from 1024 by principal component analysis while preserving nearly all information. Each of the 319 dimensions is quantized

**Fig. 7.16** Random samples from DALL-E 2 [132] for the prompt "Vibrant portrait painting of Salvador Dali with a robotic half face" (upper row), and "A teddybear on a skateboard in Times Square". Image reprinted with kind permission of the authors [132, p. 25,27]

into 1024 discrete buckets. For the encoder, experiments are performed with both autoregressive and diffusion models for the prior. It turns out that diffusion models are computationally more efficient and produce higher-quality samples. Examples are shown in Fig. 7.16.

The decoder is conditioned on image representations and can produce variations of an image that preserve both its semantics and style, while varying the nonessential details that are missing from the image embeddings. CLIP's shared embedding space allows for language-guided image manipulations and modifications in a zeroshot manner. For example two images *x*<sup>1</sup> and *x*<sup>2</sup> can be blended, interpolating all of the concepts in CLIP's embedding space that occur between them. With respect to MSCOCO it turns out that DALL-E 2 has a better zero-shot FID of 10.4 than GLIDE (12.2). Human comparisons show that DALL-E 2 and GLIDE are similar in terms of photorealism and caption similarity, while DALL-E 2 produces images with greater diversity. DALL-E 2 struggles more than GLIDE with a prompt that requires it to connect two separate objects (cubes) to two separate attributes (colors). A public access to DALL-E is now available for users to create images [115].

**Imagen** [148] is a text-to-image model presented by Google. It encodes the input text into text embeddings by a pre-trained T5-XXL encoder-decoder Transformer with 4.6B frozen parameters. A conditional text-to-image diffusion model (7.3) maps the text embeddings into a 64 × 64 image. Subsequently these small images are upsampled in two steps to 256×256 and to 1024×1024 by two super-resolution diffusion models with 600M and 400M parameters (Fig. 7.17). The models are trained on 860M image-text pairs.

Nichol et al. [110] proposed some modifications for denoising diffusion probabilistic models, which can sample much faster and achieve better log-likelihoods

**Fig. 7.17** Imagen encodes the input text by the pre-trained T5-XXL text encoder. The resulting text embeddings are transformed to 64 × 64 images by a diffusion model [148]. This image is upscaled to 1024 × 1024 resolution by two super-resolution diffusion models. Image reprinted with kind permission of the authors [148, p. 19]

with little impact on sample quality. They deliver the same sample quality as GANs, but achieve a much better mode coverage as measured by recall. This model is also employed by Imagen for text-to-image conversion, using the pooled embedding vector as input. This network is used for upsampling and is extended to improve memory efficiency, inference time, and convergence speed. Figure 7.18 shows randomly selected images generated by Imagen for a caption input.

Imagen achieves a SOTA zero-shot FID (Fréchet Inception Distance) on *COCO*  with a value of 7.3, which is better than the FID of DALL-E 2 and is even better than other models trained on COCO (Table 7.2). Human raters evaluated Imagen with respect to photorealism and alignment to the text caption. For photorealism, people preferred Imagen images in 39.5% of cases to the original images, indicating a relatively high realism. On caption similarity, Imagen's score is on-par with the original reference images. On the *DrawBench* [147] the images generated by Imagen are always preferred to images created by DALL-E 2, GLIDE, VQGAN+CLIP or Latent Diffusion in more than 60% of the cases. The authors emphasize that in the future they will increase the size of the language model, as this promises a greater gain than increasing the size of the diffusion models. They do not publish Imagen's code or provide a demo API because it could potentially be abused, for example to create fake images. Gafni et al. [48] demonstrate how a system can be extended to support artists during the creation of images.

**Stable Diffusion** is another model with currently 5.7B parameters for generating images of up to 1024 × 1024 pixels using diffusion. An example is shown in Fig. 7.18. It works similar to DALLE-2 employing a denoising U-Net for image compression and expansion [142]. For training, Stable Diffusion used an image dataset from the freely available LAION-5B database [12], which contains about 5.85 billion CLIP-filtered image-text pairs, fourteen times larger than its predecessor LAION-400M. A model conditioned on ImageNet classes achieved

A photo of a confused grizzly bear in calculus class. The Rhine river below a castle and with a forest and a vineyard

**Fig. 7.18** Images generated by Imagen [148, p.6] (left) and Stable Diffusion [142] (right) given two different text captions. Images reprinted with kind permission of the authors [148, p. 6] and [158], credits in Table A.3

an FID of 3.6 for image generation. A variant of the model employs an image search returning images with similar visual features from the neighborhood of each training instance by the CLIP model [15]. The model includes the retrieved images during image generation. It can be applied to unconditional image synthesis, inpainting, and stochastic super-resolution, and achieves competitive performance while significantly lowering computational cost. Model inference code and model weights to run the retrieval-augmented diffusion models are now available [141] and can be downloaded. The model was heavily employed by users creating 1.7M images per day.

#### *7.2.8 Multipurpose Models*

**OFA** (One For All) [170] provides a unified model for a range of multimodal tasks. It can process text and images in the form of text and visual tokens. OFA has an encoder-decoder transformer architecture (Sect. 2.3.1) and is pre-trained on various text and image datasets. Similar to the T5 model (Sect. 3.1.3), it receives a textual instruction along with an image and generates the appropriate output.

Different modalities are represented in the same space, and text, images, and objects are discretized into a unified output vocabulary. An image with 256 × 256 pixels is represented as 16 × 16 image patches. Each image patch of 16 × 16 pixels is "tokenized" into discrete visual tokens, such that each visual token strongly correlates to the corresponding patch [11]. In addition, objects have a specific representation consisting of a label and its bounding box. The continuous corner coordinates of the bounding box are uniformly discretized to integers as location tokens *(x*1; *y*1; *x*2; *y*2*)*. Finally, a unified vocabulary is used for all linguistic and visual tokens, including subwords, image codes, and location tokens.

Similar to T5 (Sect. 3.1.3) the transformer encoder-decoder is controlled by instructions. It receives a text instruction and an input image and generates a corresponding output, a text response and an image. A number of tasks are described by the examples shown in Fig. 7.19. Usually, the OFA model is fine-tuned on specific datasets to solve various tasks.

**Fig. 7.19** OFA [170, p. 3] receives an instruction and an input image. As output it generates a text and (optionally) an image. For each of the eight instructions (left) an example output (right) is shown. Image credits in Table A.3

The OFA model has an OFABase variant with 6 encoder and decoder layers, hidden size 768, and 12 attention heads. The OFALarge variant has 12 encoder and decoder layers, hidden size 1024, 16 attention heads and 472M parameters.

During pre-training, the model has to solve three tasks requested by the corresponding instructions (Fig. 7.19). The first task is image infilling, where the model has to reconstruct the central parts of the image. This requires the model to learn the relation of image parts and the generation of images. The second task is object detection. This task establishes the correspondence between image parts and language descriptions. The last pre-training task is text infilling to learn the structure of language. The model is pre-trained on publicly available datasets for the different tasks on data with more than 50M images and more than 160GB text. Images are resized to 384 × 384 pixels with a fixed patch size of 16 × 16 pixel. For each patch a feature vector is computed by the first three blocks of a ResNet CNN.

Fine-tuning is performed on task-specific datasets for the tasks shown in Fig. 7.19, e.g. MS COCO for image captioning. In addition, OFA is fine-tuned on several NLP tasks such as the GLUE benchmark for natural language understanding, the Gigaword benchmark for abstractive summarization, and the ImageNet-1K dataset for image classification. For inference the authors apply beam search and develop a search strategy based on a prefix tree. This trie-based search strategy ensures that the output generated by OFA is constrained to the appropriate candidate set.

For image captioning the model is fine-tuned on MS COCO [26]. With a BLEU-4 score of 43.5 it establishes a new SOTA for the MS COCO benchmark [32]. For Visual Question Answering the model is fine-tuned on VQAv2 [56] and similar datasets. A search strategy based on a prefix tree ensures that the output generated by OFA is constrained to the candidate set. It achieves a new SOTA accuracy of 80.0%.

For the *visual entailment task* the model has to determine, if the image entails, contradicts or is neutral to the text. OFA is fine-tuned on *SNLI-VE* [178] and achieves a SOTA accuracy of 90.2% on the test set, which is 3.1% better than the prior best model. To understand referring expressions, the model has to locate an image region described by a language query. Here the model was fine-tuned on the *RefCOCO benchmark* [187] and related benchmarks. It achieved a new SOTA with a text accuracy of 92.9%, outperforming competitors by a large margin.

For image generation the model is fine-tuned on MS COCO [26]. It achieves an Fréchet Inception Distance (FID) of 10.5. This is better than the scores for DALL-E [133] (27.5) or GLIDE [109] (12.2), which have far more parameters (12B resp. 3.5B) than OFA with 472M. On the leaderboard, only LAFITE (Sect. 7.2.6) has a better FID-value of 8.1. Note that competing models selected their results from 60 to 512 trial outputs, while OFA only selected the best of 24 images according to FID scores.

For image classification in ImageNet, OFA uses no extra labeled training data and has a similar performance (84.9% top-1 accuracy) as EfficientNet-B7 (84.3%), whereas the current SOTA is 88.3%. Surprisingly, OFA also achieves good results on language-only benchmarks, such as the GLUE natural language understanding benchmark (Sect. 4.1.1) and the Gigaword summarization (Sect. 6.4.1). Code, demos, and trained models are available for download.

An alternative multipurpose model is **NÜWA**, which is described in Sect. 7.3.4. It provides realistic text-to-image generation, image editing, and image region editing controlled by text. In addition, NÜWA performs text-to-video creation and the prediction of the next video frames.

**WuDao-2.0** [140, 143, 198] is a giant mixture-of-experts model with 1075B parameters and has been introduced in Sect. 3.5.2. It is based on the GLM 2.0 architecture (Sect. 3.1.3) combining the different learning paradigms of BERT, GPT and the encoder-decoder transformer. For image modeling, it uses the CogView approach (Sect. 7.2.6). However, implementation details are not available. The training data consist of 2.5TB image data and 2.5TB Chinese and English text data (e.g. from the *Pile* corpus [49]). WuDao-2.0 can be applied to a wide range of text analysis and generation tasks, and has matched or surpassed SOTA levels on five image benchmarks, e.g. on classifying land use in image data, image generation, and graphic retrieval.

#### **Available Implementations**


#### *7.2.9 Summary*

Recently, the Vision Transformer (ViT) emerged as a competitive alternative to Convolutional Neural Networks (CNNs) for image recognition tasks. ViT models outperform CNNs in terms of accuracy on various benchmarks and require much less computational effort.

Foundation Models for image processing receive image patches as input. The embeddings of these image patches are generated by different methods, e.g. linear transformations of image pixels, by the first few layers of CNN models, by variational autoencoders (VAE), or by Generative Adversarial Networks (GANs). A completely different approach is taken by diffusion models, which reverse the process of image degradation by adding noise (GLIDE). It has been shown to be beneficial to discretize representations of image patches to reduce noise and lowlevel texture dependence.

There are two alternatives for including text. Sometimes text and image tokens are processed by separate transformers. Subsequently the distances between the two types of embeddings are minimized (CLIP) or the resulting embeddings are correlated by cross-attention (VilBERT). Otherwise, text and image tokens are concatenated to form the input of Foundation Models (autoencoders, autoregressive, or encoder-decoder). It seems that recent models (DALL-E, CogView, OFA) prefer the single-stream architecture. A number of different tasks are employed for pretraining. These include the masked language model (MLM), where masked image and language tokens have to be reconstructed, masked region classification (MRC), and masked region reconstruction. Sentence-image alignment (SIA) classifies whether image-text pairs belong together.

The generation of captions constructs a sentence with the characterization of the image (VilBERT, OSCAR, VinVL, SimVLM) in fluent and correct language, which is usually an accurate description according to human evaluations. The generation of longer captions is not yet investigated and is probably more relevant for video captioning. There are studies to investigate the attention patterns in vision-language models [19].

The creation of images that match captions has made a huge leap in quality over the past year. Various architectures are used: Generative Adversarial Networks (GAN), diffusion models, VAEs. These models are in general combined with PLMs. It seems that pure transformer models have advantages (OFA), but diffusion models like DALL-E 2.0 gain momentum. Usually, a sample of images is created, and the best image is automatically selected by a quality score. Images generated by the model often have the resolution of 256×256 and already cover many details. Expect to see models with higher resolutions next year, e.g. 1024 × 1024.

Cao et al. [19] investigate the inner mechanics of vision and language models. They conclude that deeper layers lead to more intertwined multimodal fusion. Usually, the textual modality is more dominant for taking decisions than image features, as models tend to attend to text rather than images during inference. It turns out that a subset of attention heads is specialized for cross-modal interaction. There are attention patterns that align image regions and textual words. Finally, there is no reduction in linguistic capabilities, as pre-trained vision and language models encode rich linguistic knowledge.

Recently, multipurpose models have been presented that are trained to solve a large number of different language, vision, and language-vision tasks. One example is OFA, which has 472M parameters, significantly fewer than DALL-E (12B). OFA is a transformer encoder-decoder with image and text tokens as input, controlled by text instructions similar to T5. It achieves SOTA in image captioning, image generation, visual question answering, visual entailment, and even on pure language tasks. Contrast this with the huge WuDao 2.0 model with 1750B parameters, which is based on the encoder-decoder GLM model with a mixture-of-experts architecture. The model claims SOTA performance on a number of image and text tasks, but no technical details are known.

In the future, it is expected that these text-image models will be extended to other modalities such as video, speech, and 3D. In addition, more data will be used, Moreover, they will be enhanced by retrieval techniques to include additional external and up-to-date knowledge. Text-image models are a big step towards *symbol grounding*, which allows to attach symbols (words) to their real-world meaning.

#### **7.3 Video Interpretation and Generation**

As the Web is becoming a constantly growing communication vehicle, expressing content by text and images is often not sufficient. Video brings together three things that catch our attention like nothing else: image, movement, and audio. Therefore, videos are more and more important as a means of communication. There are 2 billion users active on YouTube each month and over 1 billion on TikTok with an average usage of 52 min per day. Hence, the automatic analysis, interpretation, and generation of videos is extremely valuable. For visual data, the most comprehensive self-supervision is available in videos. Their various modalities such as images, speech, ASR text, and captions are temporally aligned and do not require human annotation. The extreme number of multimodal videos potentially allows Foundation Models to acquire a model of the visual world.

#### *7.3.1 Basics of Video Processing*

Video analysis and understanding is more challenging than image processing, because video has an additional time dimension and usually has to handle images, speech, and text from speech or subtitles simultaneously. Recently Foundation Models have been used for video understanding. Compared to CNNs and RNNs, the major advantage of transformers is the ability to simultaneously capture global information and compute this in parallel. Furthermore, the concise and stackable architecture of transformers enables training on larger datasets. Table 7.3 list the main variants of Foundation Models for video.

Early models for image processing, e.g. CNNs and GANs, performed the analysis of images pixel-by-pixel. However, this is no longer possible for videos due to the high computational and memory effort, and there has to be an aggregation of image information. Therefore, special spatio-temporal aggregation modules are developed to adapt this to the limited sequence length of transformers.


To further reduce computational effort, a sparse self-attention can be used, where attention is mostly computed to nearby video pixels.

Unsupervised training may be performed similar to BERT. For instance, masked video tokens can be predicted based on neighboring video and text tokens [145]. In the same way, masked text tokens can be predicted from neighboring text and video tokens. Contrastive learning can be used to discriminate between genuine text-video pairs and random pairs. Other tasks include classifying whether a video and some text belong together, predicting the next frame, or reconstructing the order of shuffled video or text tokens. Recent surveys on video understanding are provided by Islam et al. [73], Khurana et al. [85], and Ruan et al. [145]

There are a number of training data sources for video. *Kinetics* [83] is a collection of 306k large-scale, high-quality datasets of 10s video clips focusing on human actions. The variants Kinetics 400, 600, and 700 are annotated with 400, 600, and 700 classes, respectively. Example frames of annotated videos are shown in Fig. 7.21. *Moments in Time* [107] is a collection of 800k labeled 3s videos, involving


**Table 7.3** Main techniques using PLMs for video. The numbers in parenthesis indicate parameter count

people, animals, objects or natural phenomena that capture the gist of a dynamic scene. *Epic-Kitchens-100* [33] consists of 90k egocentric videos, totaling 100 h, recorded in kitchens. Each video is labeled with a "noun" and a "verb". Three accuracy scores ("noun", "verb", and "action") are usually reported. The action score assesses correct noun-verb pairs and is most important. *Something-Something V2* [55] consists of more than 220k short video clips that show humans interacting with everyday objects. Similar objects and backgrounds appear in videos across different classes. This data challenges a model's capability to distinguish classes from motion cues, in contrast to other datasets.

#### *7.3.2 Video Captioning*

*Video captioning* aims at automatically generating natural language descriptions of videos. Video captioning is substantially more difficult than image captioning because the spatial-temporal information in videos as well as the corresponding ASR text from the video introduces an additional complexity. On the other hand, huge video collections like YouTube are available on the Internet and can be used as training material. A recent survey is given by Perez-Martin et al. [124].

**VideoBERT** [160] applies a BERT model to video-text pairs. The video is partitioned into clips of 30 frames (1.5sec) and processed by the S3D CNN with a temporal convolution [180], which generates a clip embedding vector of size 1024. The clip embeddings are partitioned by *k*-means clustering into 20,736 clusters and quantized to video tokens. Speech is processed by ASR and partitioned into sentences. The text is tokenized by WordPiece with a vocabulary of 30k tokens. The video tokens corresponding to the sentence time period are collected in a video token sequence. As shown in Fig. 7.20 the video tokens are appended to the corresponding text tokens separated by special tokens. Note that text-only and video-only training is possible as well.

**Fig. 7.20** A text generated by ASR and the corresponding video tokens are the input of VideoBERT [160]. Both modalities are bounded by special tokens. The masked tokens have to be predicted. Image credits in Table A.3

The BERTLARGE model is pre-trained on a video set of 312k cooking videos with a total duration of 966 days. The text is obtained by ASR. Training tasks are masked token and frame prediction, and detecting text matching a video. VideoBERT yields SOTA on video captioning on the YouCook II data with BLEU-4 score of 4.3.

**COOT** [51] jointly processes image, video and text information with an universal representation by embedding vectors. In the representation of videos, time is added as a third dimension to the two-dimensional description of images. The COOT model considers the data on 3 different levels of hierarchy: frame/word, clip/sentence and video/paragraph. For each level there exists a pair of transformers processing the input. To model intra-level cooperation, COOT uses a feature aggregation layer to focus on temporal interactions between low-level entities. To aggregate information to the sentence level, the model uses a special attention formula, where all corresponding embeddings enter the scalar product. An additional loss term aims to reduce the difference between sentence and clip encodings. At the top level, a contextual transformer links the text and video embeddings.

The model is trained with videos that have subtitles for individual scenes and longer segments. Subsequently, the model can create subtitles for new videos. For the YouCook2 video subtitling benchmark dataset, the model can greatly improve the SOTA to 11.3 BLEU-4. In addition, the model can also be used for other tasks, such as searching when a textual description or a video scene is entered. Since the model includes only 10.6M parameters, it is expected that performance can be greatly improved by increasing the size of the model.

**DeCEMBERT** [164] aims to enhance a video by *region captions* in addition to the ASR-text extracted by speech recognition. The input text is represented by BPE-tokens. Each second of video is characterized by 2D-features extracted by a pre-trained Resnet-152 CNN [63] as well as by motion features extracted by a 3D ResNeXT CNN [179], which together are mapped to embedding vectors. The video embeddings and speech recognition text representations are concatenated forming a single sequence as inputs to a 12-layer autoencoder for pre-training and downstream task fine-tuning. To align video with ASR captions, a constrained attention loss is used that encourages the model to select the best matched ASR caption from a pool of candidates. During pre-training on 1.2M YouTube instructional videos, the association between text and video is learned by masking tokens and by a classification, if a text corresponds to a video. On the YouCook2 captioning task the model improves SOTA to a BLEU-4 score of 11.9. In addition, DeCEMBERT yields good results for video retrieval and video question answering.

#### *7.3.3 Action Recognition in Videos*

**VATT** [2] uses raw RGB frames of Internet videos, audio waveforms, and ASR text of the speech audio as input data. The video of size *T* × *W* × *H* with *T* frames is partitioned to a sequence of *T/t* ∗*H/h* ∗*W/w* patches, where each patch is a *voxel* in R*t*×*h*×*w*×<sup>3</sup> with an additional color dimension. This is an extension of the image patches of ViT. The position encoding is a sum *ei,j,k* = *e*temp;*<sup>i</sup>* +*e*horiz;*<sup>j</sup>* +*e*vert;*<sup>k</sup>* where each of the summands is a learnable vector of length *d*. The raw audio waveform is partitioned into *t* segments and each segment gets a learnable position embedding. For the text a vocabulary is created and each word is mapped to a learnable embedding. The *DropToken* procedure removes a random sample of the video or audio tokens to reduce computational cost and improve regularization.

VATT linearly projects each modality into a feature vector of length *d* and feeds it into a separate BERT encoder. The model uses Noise Contrastive Estimation to reduce the distance between projections of the audio and video embeddings. Positive pairs are taken from the same location in the video, and negative pairs from different locations. A similar criterion is employed to reduce the distance of video and text embeddings. The training data covers clips of 32 frames at 10 fps taken from the *HowTo100M data* [105]. The largest model has 415M parameters. For action recognition on Kinetics-400 it achieves SOTA with a top-1 accuracy of 82.1% and a top-5 accuracy of 95.6%.

**Omnivore** [52] is a model for classifying images, videos, and single-view 3D data using exactly the same model parameters. A single-view 3D is a color image with an additional depth channel. Image, video, and single-view 3D modalities are converted into embeddings that are fed into a Transformer model. The images are partitioned into image patches, videos are divided into spatio-temporal tubes covering separate image regions, and the single-view 3D images are converted into RGB patches and depth patches. The patches are projected into embeddings using linear layers. The same linear layer is used for image and video RGB patches. A separate layer is applied to depth patches. Separate positional embeddings for the spatial and the temporal dimension are used.

Omnivore employs the *Swin transformer* (Sect. 7.2.3) as base model, a hierarchical vision transformer using shifted windows. Self-attention involves patch embeddings from spatially and temporally nearby patches. The models are jointly trained on the ImageNet-1K dataset for image classification (1.2M images), the Kinetics-400 dataset for action recognition (240k videos), and the *SUN RGB-D dataset* (5k) for single-view 3D scene classification, with dataset-specific linear classification layers transforming the final embeddings. On Kinetics-400 without extra data, Omnivore achieved an action recognition accuracy of 84.1%, which was the second best. The fine-tuned Omnivore scored SOTA on two video classification benchmarks. When using RGB and the 3D channel, Omnivore again had a SOTA performance on the NYU-v2 benchmark.

**MeMViT** [173] aims to process videos longer than 5s, in contrast to most current models. MeMViT handles videos in an online fashion and caches key and value vectors of a transformer as memory at each iteration. Through the memory, the model has access to prior context for long-term modeling, with little additional cost, as memory embeddings are not trained. The queries of the current video clip attend to an extended set of key and value vectors, which come from both the current time and the past. Similar to the dilated convolutions of WaveNet [114], higher layers attend further down into the past, resulting in a significantly longer receptive field.

**Fig. 7.21** Two videos annotated with descriptions (left) similar to videos of the Kinetics dataset [83]. Representative frames of the videos are shown. Obviously, a single frame is sometimes not enough to reach a decision, e.g. to distinguish "dribbling basketball" and "dunking basketball". Image credits in Table A.3

In addition, a memory compression module with learnable pooling is effective for reducing the memory footprint.

A video is split into a sequence of short *T* × *H* × *W* clips and processed sequentially. Similar to MTV, multiple resolutions are used, starting from a finegrained modeling of smaller patches to high-level modeling of larger patches in later stages, where the dimensionality of embeddings increases. The aggregation between stages is done by strided pooling. The memory representations are frozen and not changed by optimization. The model is pre-trained on Kinetics-400 data Fig. 7.21. On the AVA v2.2 dataset [54] MeMViT achieves a mean average precision (mAP) of 35.4%. On the action anticipation dateset (EPIC-KITCHENS-100) it has a SOTA of 17.7% recall@5. On the action recognition on EPIC-KITCHENS-100 MeMViT yields an accuracy of 48.4%.

**CoVeR** [190] evaluates the effect of different pre-training strategies on classification accuracy. The authors use a special transformer architecture, which has spatial attention layers across related regions in the same video frame and temporal attention layers across the neighboring frames of a video clip. CoVeR first pre-trains the model on the *JFT-3B benchmark* [189] of 3B images annotated with a classhierarchy of around 30k labels. During pre-training all temporal attention layers are removed. During fine-tuning, it simultaneously trains a single model with 24 layers on multiple action recognition and image datasets (Kinetics versions, ImageNet, Moments in Time, SomethingSomethingv2) to build robust spatial and temporal representations of video data (Fig. 7.22). For the Kinetics-400 action recognition task CoVeR achieves an accuracy of 87.2% and for the Moments in Time action classification it has a SOTA accuracy of 46.1%.

**MTV** [185] performs temporal aggregation by multiple input representations (views) of the input video. MTV extracts tokens from the input video over multiple time spans. Video tokens derived from long time intervals capture the overall scene description, while video tokens taken from short segments capture fine-grained

**Fig. 7.22** During fine-tuning CoVeR [190, p. 5] simultaneously is trained on multiple image and video datasets. Each dataset has its own classifier as there are different class definitions. Images are single frame videos. Therefore, image classification is not affected by temporal attention. Image credits in Table A.3

details, such as a person's gesture. Different transformer encoders are used to process these different views, with short segment models having higher capacity.

The different encoders are interfaced by lateral connections to fuse cross-view information. A cross-view attention is computed between adjacent views similar to the multi-head cross-attention in the transformer (Sect. 2.3.1). Note that these fusion operations are performed only for specific layers. The tokens from all views are aggregated with a global encoder, which performs the final classification.

The models are initialized with Vision Transformer weights (Sect. 7.2.2) and trained with videos of 32 frames and a resolution of 224×224. It turned out that the cross-view attention was better than alternatives to fuse information from different views. In addition, three views gave better results than fewer views. The largest model with over a billion parameters achieved SOTA accuracy of 89.1% for action recognition on kinetics-400.

**AV-ASR** [152] applies a PLM to audio-visual speech recognition. As usual, audio is converted to 80 log Mel filterbank features in steps of 10 ms. The videos are cropped to a near mouth region and converted to video embeddings with length 512. Both embeddings are concatenated and fed into a Conformer encoder (Sect. 7.1.2) with 17 layers. The model outperforms previous SOTA for lipreading on the LRS3- TED benchmark [1] with a WER of 19.3%. If both modalities are used, the WER drops to 1.6%. If babbling noise is added the WER of audio-only ASR on LRS3- TED is increased to 6.1%, while speech recognition with both modalities has a WER of only 2.9%. There is another approach to associate video and audio by generating video background music that matches the speed of movement, mood, and rhythm of the video [38].

**Aloe** [39] wants to do more than simply describing an image or video, but aims at explaining or reasoning about the scene. The model uses an unsupervised object segmentation module that partitions each image into object representations. A transformer receives the questions and the image descriptions including object representations. On several *visual reasoning* benchmarks, the model has to answer complex question such as explanatory questions like "why did something happen?", predictive questions such as "what will happen next?", and counterfactual questions like "what would happen in an unseen circumstance, e.g. if an object is removed?". The model is able to improve SOTA on nearly all benchmark dataset.

**Merlot** [188] is a vision and language model that learns multimodal world representations from videos with thousands of frames and their ASR text. It encodes each frame using an image encoder, embeds tokens using learned embeddings, and a Transformer similar to RoBERTa jointly processes both representations. A first pre-training task uses contrastive loss to match the language transcript embedding and the corresponding video embedding. The MLM task requires replacing masked language tokens. The temporal reordering task involves reordering scrambled video frames. Hence, Merlot not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time, achieving temporal common sense knowledge. The model is trained on 6M unlabeled YouTube videos. Merlot outperforms SOTA methods in 12 downstream benchmarks that include short and long videos. An example is Visual Question Answering on MSRVTT-QA [182] with a new SOTA of 43.1%. A related model for complex event extraction [93] uses a similar contrastive learning approach.

**Flamingo** [3] is a visual language model, which can handle sequences of arbitrarily interleaved image, video and text data. Flamingo employs the 70B parameter pre-trained language model *Chinchilla* trained on a large and diverse text corpus (Sect. 3.1.2). The encoder blocks of the language model are used with frozen parameters. With this submodel, Flamingo has strong generative language abilities and access to a large amount of knowledge stored in the Chinchilla weights. Similar to *Frozen* (Sect. 7.2.5), it can be instructed by few-shot learning to answer questions on an image [166].

For processing images and videos, a contrastive text-image approach is pretrained (Fig. 7.23). The authors use a variant of ResNet [16]. The vision encoder is pre-trained using a contrastive objective on our datasets of image and text pairs, using the two-term contrastive loss from [127]. Much like CLIP (Sect. 7.2.4), similarities are computed as a dot-product of the mean pooled output of the image encoder and the mean pooled output of a BERT model. This model extracts semantic spatially oriented features from the image including color, shape, nature, positions of objects, etc. The model is pre-trained separately, and the parameters are frozen during the main training of Flamingo.

Two modules are trained to interface these frozen models. The first is a *perceiver resampler*, which receives spatio-temporal features from the vision encoder and outputs a fixed-size set of visual tokens (usually 64). This output is generated for single images as well as videos independently of the input image resolution or the number of input video frames. The extracted visual tokens are then included into the language model by interspersed cross-attention layers. In this way the language model can incorporate the visual information at each layer. The frozen language and vision models have 70B and 435M parameters, while the trainable layers have 10B parameters and the resampler has 194M parameters yielding a total of 80.6B parameters.

For training, Flamingo uses a number of datasets with 182GB of text. This collection is amended with further mixed text, image and video sequences with a total of about 2.3B images and 27M videos.

**Fig. 7.23** Flamingo [3] receives an input consisting of a sequence containing image, text, and video in arbitrary order (bottom). The images and videos are processed by a frozen vision encoder similar to CLIP. The trainable perceiver resampler reduces them to a finite number of image tokens, which are included by a trainable cross-attention layer into the language model. The output created by the language model is the natural continuation of the input sequence. Image adapted from [3] with kind permission of authors, credits in Table A.3

**Fig. 7.24** Flamingo can interpret images and describe them by text. Gray boxes are user input and the pink boxes are Flamingo output. In the upper row Flamingo answers questions about images. In the lower row there is a dialog about a photo. Image adapted from [3, p. 31] and [3, p. 32], reprinted with kind permission of the authors

As shown in Fig. 7.24 Flamingo can answer question on single images by simply predicting the next text token in the mixed image-text sequence. In their simplest form, the question can ask for the description of objects in the scene, as is shown in the upper right example. More difficult is the interpretation of the scene as the language model needs world knowledge to decide which aspects of an image are

**Fig. 7.25** Flamingo answers question on videos. Some video frames are shown. Gray boxes are user input and the pink boxes are Flamingo output. Image adapted from [3, p. 33], reprinted with kind permission of the authors

noteworthy. In many of these examples, Flamingo can do at least one step of implicit inference. Some of the objects are not named in the prompt (e.g. the elephant), but their properties are queried directly. In order to answer these questions, the model needs to infer the referred object and then recall the relevant knowledge to form the answer. This can lead to a single answer (as for the elephant on the truck) or to an extended dialog, where the model can answer a series of queries about an image (e.g. the dog damaging the sofa). Even after several interactions, Flamingo can still successfully attend to the image and reply to questions that require to interpret the image. The authors observed that multiple images can be separately attended to, simple comparisons and inferences are handled properly. Flamingo's dialog capabilities could enable non-expert end users to get answers without the need of fine-tuning.

In the same way Flamingo can answer question about videos, as shown in Fig. 7.25. However, the performance in this task is not as stable as would be desirable.

Flamingo is able to perform *few-shot prompting* on mixed text-video-image sequences. Examples are shown in Fig. 7.26. Here a number of images are provided and the added text specifies by example the desired way to extract an answer. In the first row this amounts to extracting text from the image and in the second row the counting of objects of equal type is required. In this way the model can be instructed on the fly to perform a large number of tasks, e.g. captioning, visual dialogue, classification or visual question answering.

The performance of the model was tested on 9 image-text benchmarks on scene description, visual dialogue, and visual QA, among them MS-COCO captioning. On the eight mixed-media benchmarks Flamingo established a few-shot SOTA by a wide margin using 16 or 32 shots. For three benchmarks the score is even better than the prior fine-tuned SOTA. On ImageNet top-1 classification Flamingo achieves 76.0% compared to a fine-tuned SOTA of 91.0%. The test array on video contains

**Fig. 7.26** Few-shot querying of Flamingo [3] with a mixture of images and text. Note that in the second example Flamingo did not count the trees but stayed with the animals. The usual number of few-shot queries is 32. Image adapted from [3, p. 2], reprinted with kind permission of the authors

9 benchmarks, eight of whom require free form text answers and one benchmark (Kinetics 700) needs classification. On all eight free-form benchmarks Flamingo can increase few-shot SOTA, often by a huge margin. On four of these benchmarks Flamingo even exceeds the fine-tuned results. This is even more remarkable as Flamingo uses only 32 task-specific examples which is around 1000 times less taskspecific training data than current state-of-the-art.

Flamingo can be fine-tuned on specific benchmarks to increase performance. During fine-tuning, the frozen model parts are not changed. When fine-tuning on 9 example tasks Flamingo could increase fine-tuned SOTA on five of these tasks. This shows that by fine-tuning the 10B free parameters of the model, the performance can in many cases be increase to new levels.

#### *7.3.4 Generating Videos from Text*

The creation of videos following a textual description is an important issue, e.g. for education or illustration of dynamic content. While there are a number of models for describing images and videos through text, there are not many proposals for the other direction. The concepts for encoding text and videos are similar to the captioning of videos. The quality of generated videos can be judged by several measures comparing the similarity of actual and generated videos. The *FVD*  (Fréchet Video Distance) is the spatiotemporal extension of the Fréchet Inception Distance (FID) (Sect. 7.2.6), and is sensitive to visual quality, temporal coherence and diversity of samples.

The **Video Transformer** [172] generalizes the one-dimensional transformer encoder-decoder to videos. A video is represented as *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*h*×*w*×*s*×*<sup>d</sup>* , where *<sup>h</sup>* and *w* denote the number of tokens in the spatial height and width, *s* denotes the number of tokens in the temporal axis, and *d* is the number of channels (e.g. colors). The video is partitioned into small 3D blocks in time and space. Self-attention is applied separately with each block. To allow direct information exchange between blocks, the block sizes between each layer are varied. The blocks contain 4 frames with a spatial resolution 32 × 32. Self-attention varies between 1 and 32 in different layers and dimensions. The largest model has a hidden size of 2048, 8 layers and 373M parameters. On the BAIR Robot Pushing data [44] the model achieved an *FVD* (Fréchet Video Distance) score [167] of 94. which was SOTA at the time of publication.

**NÜWA** [175] is a recent transformer encoder-decoder model that provides a solution for generating video from text. It uses a so called *3D Nearby Attention*  mechanism to capture the locality characteristic for both spatial and temporal axes. Image, video and text data is represented as tokens *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*h*×*w*×*s*×*<sup>d</sup>* , where *<sup>h</sup>* and *<sup>w</sup>* denote the number of tokens in the spatial height and width, *s* denotes the number of tokens in the temporal axis, and *d* is the dimension of each token. The raw input regions are transformed into discrete tokens for image patches by a trainable VQ-GAN (Sect. 7.2.3). This GAN-based quantization module provides a much better image quality than VQ-VAE used by CogView (Sect. 7.2.6).

The model modifies attention computations and considers a local neighborhood with respect to width, height and temporal extent called 3D Nearby Self-Attention. Three different positional encoder embeddings are used for width, height and time. Each 336 × 336 pixel video frame is partitioned into 21 × 21 patches and 10 frames of a video are sampled with 2.5 frames per second. The size of the neighborhood in width, height and time is 3. The model is pre-trained on three tasks: Text-to-image for 2.9M text-image pairs from Conceptual Captions, video prediction with 727k videos from Moments in Time, and text-to-video generation for 241k text-video pairs.

For text-to-image generation, NÜWA is fine-tuned on the MS COCO dataset. Sixty images are generated for each text and the best image is selected by CLIP (Sect. 7.2.4). NÜWA outperforms CogView with an FID-0 of 12.9, which is good, as shown in Fig. 7.27, but worse than LAFITE (FID 8.1) and OFA (FID 10.5). For text-to-video, NÜWA is fine-tuned on the Kinetics dataset. Some frames of two generated examples are shown in Fig. 7.28. NÜWA achieves the best performance on the FID-img and FID-vid metrics with values of 28.5 and 7.0. Video prediction has to generate the sequence of the next frames of a video from a starting frame. On the BAIR Robot Pushing dataset, NÜWA achieves a new SOTA of 86.9 FVD score for this task.

**Fig. 7.27** 256 × 256 images generated from the text above the images by NÜWA [175] for the MS COCO benchmark. Image reprinted with kind permission of the authors [175, p. 5]

**Fig. 7.28** Frames of two videos generated by NÜWA [175] from text (left) for the text-to-video task on the Kinetics dataset. Note that an input text like "running on the sea" has never been seen by the model. Image reprinted with kind permission of the authors [175, p. 5]

NÜWA supports a number of other tasks. For image editing, it can reconstruct parts of an image. Alternatively, it can edit a marked image region according to a text, e.g. "a horse is running on the grassland". Image sketches annotated with text are transformed to photos. This pattern can also be applied to videos, such that a video is generated from a series of images with annotated regions. Finally, it can change the contents in a video, e.g. modify the movements of a diver as shown in the lower row of Fig. 7.29. Moreover, a series of image sketches annotated with text can be transformed to a video. Further examples are shown here [174]. **GODIVA** [176] is a similar prior approach from the same authors based on VQ-VAE variational autoencoders.

**Imagen Video** is a recent high definition text-to-video model based on Imagen (Fig. 7.17). By a frozen T5 text encoder-decoder and a base video diffusion model a low-resolution video is generated. This is augmented by a cascade of video diffusion models that alternately increase spatial and temporal resolution [66] to construct 128 realistic video frames at 24 frames per second with a resolution of 1280 × 768. Figure 7.30 shows videos generated for text prompts by Imagen Video.

**Fig. 7.29** NÜWA [175] can edit videos. In the upper row the raw video is shown. In the lower row NÜWA gets the input "The diver is swimming to the bottom" and modifies the video accordingly. Image reprinted with kind permission of the authors [175, p. 28]

Coffee pouring into a cup.

**Fig. 7.30** Videos generated from the text prompts (below) by Imagen video [66]. The model produces diverse and temporally coherent videos that are well matched to the given request. Image reprinted with kind permission of the authors [66, p. 2]

#### **Available Implementations**


#### *7.3.5 Summary*

The processing of videos requires to integrate different modalities like image, text in the form of video captions, and speech possibly translated to text by ASR. Video processing introduces an additional time dimension to image processing. Furthermore, depth information and camera movements can be important. Since 2019 large scale transformers using self-supervised pre-training are the prevailing models for video processing. The models can solve different tasks, such as video captioning, action recognition, video question answering, video generation from text, prediction of next frames, video retrieval, audio-visual ASR, etc.

Existing cross-modal Foundation Models mainly focus on (1) improving model architecture, (2) utilizing more data, and (3) designing better pre-training tasks. Due to the limited input length, the video has to be partitioned into appropriate tokens. This ranges from aggregates over 30 clips (VideoBERT) over fixed video patches (VATT) to video patches with varying dimensions (COOT, MTV, Video Transformer). Some models (VideoBERT, DeCEMBERT) use CNN convolutions to generate low-level features. More common is the aggregation with VQ-VAE autoencoders or the GAN-bases VQ-GAN. Sometimes video and text are processed with separate PLMs and merged later (VATT). Alternatively, video and text tokens are concatenated and processed by single a PLM (Omnivore, Merlot). Transformers use attention over spatial and temporal dimensions, which is often localized to reduce computational effort.

The integration of different modalities is crucial. Text and language are associated by pre-training tasks, where masked video or text tokens have to be predicted using tokens from the other modality. CoVeR shows that performance can be enhanced when the model is simultaneously fine-tuned for video and image tasks. It is even possible to combine audio, text and video tokens.

The performance of video analysis models has taken a dramatic development. The action classification error on the Kinetics-400 benchmark has fallen within 1 year to 10.9% using Foundation Models, which is a drop of 33%. Despite the significant progress, SOTA methods fail to extract/capture all the complex spatiotemporal information present in videos. There is still much work to do for understanding the diversity of visual content in videos and the structure of associated textual descriptions.

Generating videos from captions is in its early stages, and only very short highresolution videos can be generated. However, the current models are relatively small compared to the Foundation Models like GPT-3 or Gopher. Therefore, it can be expected that models with more parameters will see considerable performance improvements, as has been demonstrated by Imagen Video.

There is a trend to general-purpose models, like Nüwa that can handle multiple modalities of data and solve a number of tasks. Training with different media mutually supports the performance in different tasks. Flamingo with 80B parameters is based on a large pre-trained language model and a separately pre-trained vision encoder. In can process mixed sequences of images, text and videos. By building adapter modules and a cross-attention layer, the language model can include the results of the visual modalities and perform a variety of analysis tasks like visual question answering, image caption, etc. In addition, it can be instructed by few-shot prompts to solve many task without a specific fine-tuning.

Although Flamingo cannot generate images or videos corresponding to a caption, it is a step in the direction of multimodal Foundation Models, which promise to be a general-purpose tool of multimedia processing. By few-shot prompts they could solve thousands or millions of tasks. Substantial progress can be expected in this area, as ideas can be combined that were developed independently for different media. Further development directions are larger training data, which, however, are already quite large. In addition, the development of multilingual video models is a logical consequence of the current state of research in this area.

#### **7.4 Controlling Dynamic Systems**

Foundation Models can process many types of sequences. These include sequential decision problems where the agent must choose an action based on a state. Subsequently, the environment generates a new state and a reward for the agent. This is repeated a number of times until the final sum of rewards is known. Then the task is to select the actions based on the states in such a way that the sum of rewards is maximal. This goal can be formulated as a sequence problem, and a PLM can be used to predict the next optimal action.

#### *7.4.1 The Decision Transformer*

PLMs are able to predict sequences, e.g. the tokens of a text or video frames. Following this pattern, PLMs are also able to model the evolution of arbitrary states. *Reinforcement learning* considers a system with *states st* , *actions at* , and *rewards rt* = *R(st, at)* at a given time step *t*. Based on the current state, the agent selects an action, while the next state and reward are determined by the environment. The target of reinforcement learning is to learn a *policy a* = *π(st)*, which generates actions maximizing the expected sum of rewards *E(*-*T <sup>t</sup>*=<sup>1</sup> *rt)*. During online reinforcement learning the environment can be accessed, and for a given *(st, rt, at)* it returns the next state *(st*+<sup>1</sup>*, rt*+1*)*. In offline reinforcement learning there is only a limited set of observed trajectories from the environment. The latter setting is more difficult as the agent can no longer explore the environment.

The **Decision Transformer** [23] operates in an offline reinforcement setting. Instead of using the returns *rt* directly, the Decision Transformer considers the *forward sum of rewards R*ˆ*<sup>t</sup>* = -*T t* <sup>=</sup>*<sup>t</sup> rt* . Hence, a trajectory is represented as follows

$$\pi = \left(\hat{R}\_1, s\_1, a\_1, \hat{R}\_2, s\_2, a\_2, \dots, \hat{R}\_T, s\_T, a\_T\right) \tag{7.5}$$

The input token embeddings for *(st, rt, at)* are computed with a linear layer, which is different for each modality (Fig. 7.31). If the state is an image, it is transformed by a convolutional encoder instead of a linear layer. Subsequently the embeddings are normalized by a layer normalization. For each time step with three inputs a position embedding is learned and added to the embeddings of that time step. The embeddings are then processed by an autoregressive GPT model, which predicts future actions by autoregressive modeling.

The training was based on a dataset of observed trajectories. From these trajectories minibatches of length *K* were sampled. Then the GPT model for each *t* = 1*,...,K* predicted *at* given a trajectory up to *st* . As a loss function the crossentropy loss was used for discrete actions with the target to increase the probability of the actual action at time *t*. For continuous actions, e.g. a speed, the mean squared error was used as loss to minimize the square difference to the observed control value. It was not necessary to predict states or the forward sum of rewards.

For the application to a starting state *s*1, a target forward sum of rewards *R*ˆ<sup>1</sup> based on the desired performance (or even maximum possible return) is specified. After

**Fig. 7.31** The Decision Transformer applies an autoregressive language model to the forward sums of rewards *R*ˆ*<sup>t</sup>* , states *st* and actions *at* . In the example the state is given in the form of video frames, e.g. for the Pong game. The model predicts the next action in the trajectory conditional to a given forward sums of rewards [23]

the generated action *a*<sup>1</sup> is executed, the target return is reduced by the achieved reward and the next state *s*<sup>2</sup> is determined. This process of generating actions and applying them to get the next forward sum of rewards and the next state is repeated until the trajectory ends. Note that the actual forward sum of rewards should be close to the desired performance specified before. Although the model is only trained on randomly selected subsequences, it can learn to 'merge' subsequences from different training trajectories in order to produce optimal trajectories at test time. Obviously a large set of subsequences has to evaluated during training to arrive at good solutions.

The *Atari benchmark* [13] has discrete actions, uses four video frames as state descriptions, and processes these frames by a convolutional encoder. Only 1% of the available data is used. On four Atari tasks (Breakout, Qbert, Pong, and Seaquest) usually a context length of *K* = 30 is taken into account. With the exception of Qbert, Decision Transformer is competitive with state of the art methods, and for two games it reaches the best results (Breakout, Seaquest). The most effective alternative is the *CQL* [87] Q-learner.

The *D4RL benchmark* simulates simple robots (HalfCheetah, Hopper, and Walker) which are controlled by continuous-valued actions. On this benchmark Decision transformer in most cases achieves better results than the alternative approaches and has the highest average performance. Again CQL is the best alternative.

The authors evaluate the performance of approaches for an environment, where it is necessary to propagate rewards over a long time period. The *Key-to-Door benchmark* [104] has three phases:


The agent receives a binary reward when reaching the door in the third phase, but only if he picked up the key in the first phase. On this benchmark Decision Transformer and related methods clearly outperform Q-learning approaches, which cannot effectively propagate rewards over a long horizon.

Reid et al. [136] modify the details of the decision transformer yielding improved performance. Kumar et al. [86] show by theoretical analysis that offline reinforcement learning—as done by the decision transformer—enjoys better guarantees on long-horizon tasks than simply cloning the behavior of experts. This especially holds in the case of sufficiently noisy data.

#### *7.4.2 The GATO Model for Text, Images and Control*

**GATO** [134] is a Foundation Model, which has been trained on about 600 different tasks, including text generation, image captioning, stacking physical blocks with a robot arm and playing Atari console games. Depending on the context, it

**Fig. 7.32** Data from different tasks and modalities are converted to sequences, e.g. frames and actions from Atari games, text token sequences, images patch tokens, continuous sensory inputs and outputs. In Gato [134, 135], a large decoder-only transformer processes the sequence. During training, specific variables, e.g. actions, are used to compute a loss. Image adapted from [135, fig.2], credits in Table A.3

independently decides which tokens to generate: Text, torques for joints, keystrokes, or another variant of the output within its comparatively extensive possibilities.

Depending on the modality the input is tokenized


Tokens belonging to text, discrete- or continuous-valued observations, or actions for any time step are embedded into a learned vector embedding space using a lookup table. Learned position encodings are added for all tokens based on their local token position within their corresponding time step. Tokens belonging to image patches for any time step are embedded using a single ResNet [63] block to obtain a vector per image patch. In addition, a learnable within-image position encoding vector is added (Fig. 7.32).

Gato consists of a 1.2B parameter decoder-only transformer with 24 layers, and an embedding size of 2048. As in every language model, all tokens are predicted and therefore can be set as targets for training. Currently, only text tokens, discrete and continuous values, and actions are currently used as targets. As usual, the probabilities of the observed target tokens have to be maximized during training.

To focus GATO on a specific task, a prompt is used coming from a trajectory generated by the same source agent on the same task. GATO was trained on 596 different control tasks, among them the Atari benchmark [13]. The authors included only "good" trajectories that yield at least 80% of the expert reward for the task. Moreover, GATO was trained on 8 vision and language tasks, e.g. image captioning with MS-COCO Captions [26] and Conceptual Captions [153], as well as visual question-answering datasets. In addition, GATO is trained on the large MassiveText [128] data with 300 billion text tokens.

The performance of GATO has been evaluated for different applications. On the Atari benchmark, the model reached average human score or better for 23 of 51 Atari games. In a robot stacking benchmark, GATO achieved a comparable performance as the BC-IMP baseline [90]. The model has only rudimentary dialog and caption functions, which is not surprising due to the small model size.

The Gato model is a first attempt to simultaneously solve text, image, and control tasks with the same Foundation Model. For control tasks it yielded respectable results while for the text and image tasks it had only mediocre performance. Perhaps it could benefit from the forward sum of rewards representation of the Decision Transformer. Actual Foundation Models have hundreds of billions of parameters and require a corresponding computing effort. If the GATO model is extended to this order of magnitude, its performance can be expected to improve accordingly.

#### **Available Implementations**

• Decision Transformer code https://sites.google.com/berkeley.edu/decisiontransformer

#### *7.4.3 Summary*

Pre-trained language models can be applied to sequences with mixtures of element types. The Decision Transformer considers sequences of rewards, states and actions at specific time steps, which occur during a sequential decision problem, e.g. video game playing, robot control, or automatic driving. It models observed trajectories of these quantities. Instead of using the reward as input, the sum of the rewards up to the end of the trajectory is considered, which is the quantity to be maximized. For each type of input some preprocessing is performed to generate embeddings. The Decision Transformer is trained to predict the actions in short subsequences of 30 time steps.

During application, the desired forward sum of rewards can be set as a condition. Then, the model is able to stitch together the information from different subsequences in the training data to obtain near-optimal actions reaching a maximal sum of rewards. This was shown by extensive experiments with various benchmarks.

The GATO model demonstrates that PLMs at the same time can be used to solve reinforcement learning tasks simultaneously with text and image tasks. The model is trained with nearly 600 control benchmarks, 8 image tasks and on 300B text tokens. The model has only rudimentary text and image description capabilities, but performs relatively well on the Atari benchmark. It is only a proof of concept and could be improved by increasing the model size and, for instance, by using the forward sum of rewards.

#### **7.5 Interpretation of DNA and Protein Sequences**

Deciphering the language of DNA is one of the most important goals of biological research. The genetic code is universal and explains how DNA is translated into proteins. In contrast, the regulatory code, which determines when and how genes are expressed, varies between different cell types and organisms. This is similar to the polysemy and distant semantic relationships in natural language texts. **DNABERT**  [76] tokenizes the DNA sequence into overlapping 3-grams and trains a standard BERT model to predict masked tokens (Fig. 7.33). After pre-training on a large set of DNA sequences, it can improve the SOTA by fine-tuning for many specific DNA prediction tasks. Among them are analysis of sequence motifs (DNA segments with biological relevance) and prediction of promoter regions (nucleotide sequence that enables regulated expression of a gene). MoDNA [5] and GeneBERT [106] have similar functionality.

Proteins are linear chains of amino acids linked by covalent bonds. Amino acids can be represented by an alphabet with 25 characters. The strings are ideally suited for many NLP methods [111]. **AminoBERT** [29] is a language model that predicts

**Fig. 7.33** DNABERT tokenizes the DNA sequence into overlapping 3-grams and trains a standard BERT model to predict masked tokens [76]. The resulting model can be fine-tuned to many DNA interpretation tasks

the 3D protein structure given a protein sequence as input. It also uses a natural method to describe polypeptide geometry that is rotation and translation invariant at the level of the polypeptide as a whole. On average, the model outperforms AlphaFold2 [80] and RoseTTAFold [8] on orphan proteins and classes of engineered proteins, achieving up to a 106-fold reduction in computational time.

There are a number of other models with similar results [97], e.g., the protein language model **ESMFold**. It generates embeddings that can be used in downstream tasks, for example to capture the structural properties of proteins. A model with 15B parameters can predict the three-dimensional structure of a protein at the resolution of individual atoms.

#### **Available Implementations**


#### *7.5.1 Summary*

Foundation Models can also be applied to DNA and protein sequences to derive contextual embeddings of the sequence elements. By this approach, the models are able to accumulate much knowledge about these sequences and achieve SOTA performance across various downstream tasks, largely surpassing existing tools. The models can help to predict the 3-D structure of the protein. This is crucial for its function and may be instrumental in developing active substances to influence it.

#### **References**


*Biol. Health Inform.* BCB '22. New York, NY, USA: Association for Computing Machinery, Aug. 7, 2022, pp. 1–5. ISBN: 978-1-4503-9386-7. DOI: https://doi.org/10.1145/3535508. 3545512.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### **Chapter 8 Summary and Outlook**

**Abstract** Foundation Models emerged as a new paradigm in sequence interpretation that can be used for a large number of tasks to understand our environment. They offer the remarkable property of combining sensory input (sound, images, video) with symbolic interpretation of text and may even include action and DNA sequences. We briefly recap the process of pre-training, fine-tuning or prompting of Foundation Models and summarize their main properties. For the different application areas presented in the book, we summarize the performance levels of the models and delineate different promising economic applications. A section is devoted to discussing the potential harm that can be caused by Foundation Models, including bias, fake news, but also possible economic monopolies and unemployment. There is an urgent need for a legal regulation of the construction and deployment of these models. The last section considers advanced artificial intelligence systems and the shortcomings of current systems. Foundation Models have significantly improved performance in recent years and have the potential to reduce the gap to a truly general AI.

**Keywords** Pre-trained language models · Language applications · Media interpretation · Economic impact · Potential harm · Disclosure · Impact on society · Advanced artificial intelligence

*Foundation Models* [13] are concerned with the interpretation of sequences of different types. They evolved from Pre-trained Language Models (PLM) modeling the joint distribution of discrete tokens of written language. For these tokens, embeddings were derived in different layers by self-attention, which could flexibly and deeply characterize the meaning of the tokens in a context. Subsequently, these token embeddings can be used for downstream tasks.

Sequences can also be patches of images, sound bites in audio recordings, 3D tubelets in videos, events in game trajectories, etc. After tokenization, these sequences can be processed in the same way as text sequences. When different media types are ingested together, e.g. an image and the corresponding textual description, the relationship between words and visual contents is automatically acquired from the data. It seems that most aspects of our world can be represented as sequences. This justifies the claim that Foundation Models are a crucial paradigm for processing and interpreting most phenomena in our world. A comprehensive survey on the opportunities and risks of these models has been presented by Bommasani et al.[13].

In the next section, we summarize Foundation Models, their main properties, and areas of application. In addition, promising economic solutions are outlined. The second section describes social and ethical aspects of these systems, including possible discrimination, misinformation, and malicious uses. The final section discusses whether there are dimensions of intelligence not currently covered by Foundation Models.

#### **8.1 Foundation Models Are a New Paradigm**

This section recaps the key characteristics of Pre-trained Language Models and their larger successors, Foundation Models. We summarize their performance in the applications covered in this book, and the benefits of the economic solutions they offer.

#### *8.1.1 Pre-trained Language Models*

Pre-trained Language Models have been developed in three flavors: the Transformer encoder-decoder by Vaswani et al. [89], autoencoders like BERT by Devlin et al. [31], and autoregressive language models like GPT-2 by Radford et al. [70]. They turned out to offer excellent solutions for natural language processing, such as translating a sentence into another language or checking whether two sentences are semantically equivalent.

Usually, these models were created in a two-step procedure. In the first step, the model was pre-trained on a non-specific big collection of natural language documents to acquire general knowledge about the language. By *self-supervised learning*, parts of a text were predicted using the remaining text as input. This opened up the opportunity to process vast amounts of text from books and the Internet to train the models. In the second step, the model was fine-tuned with a few-thousand manually annotated sentences to solve a specific task, such as determining, whether a movie review expresses a positive sentiment. The approach worked extremely well, showing that the models have the capability to detect subtle semantic properties of language. This two-step procedure was called *transfer learning*. After extensive experimentation, it was found that these models worked better the bigger they became and the more data their training sets contained.

Knowledge in PLMs is stored by a huge number of parameters. Parameters contain the recipe to compute *embeddings* for the input tokens of the models. Embeddings are long vectors of real numbers and provide a way to represent the knowledge associated with the tokens. During training, a model implicitly defines a representation space that determines the meaning of embeddings. Usually, embeddings are assigned to tokens, i.e. parts of words, but may also be determined for paragraphs and complete documents. If two embeddings have a small vector distance, the meaning of the underlying tokens is similar. Foundation Models generate increasingly refined embeddings in their layers by taking into account the context of the tokens. The word "bank" close to the word "money" has a different embedding than a "bank" close to the word "river", making the embeddings *contextual*. These effects also apply to tokens of different media types.

Embeddings are calculated by *self-attention* computing correlations between linear projections of input embeddings. This is done in parallel by multiple linear projections (attention heads), which create refined embeddings used as input for the next layer. Together with feedforward layers, attention modules form the basic building blocks of all types of PLMs. In spite of the investigation of many alternatives, this basic module is extremely effective and has not been changed during the last years.

Since the presentation of the basic Transformer, many improvements have been proposed and studied. Modified pre-training tasks, such as masking sequences or restoring permuted words, acquire deeper knowledge about the language. Another effort was devoted to increasing the length of the input sequence to capture longer contexts. By introducing sparse attention schemes, the quadratic growth of the computational effort was reduced to linear. A major achievement has been the extension of the models to multilingual settings, so that today many models simultaneously work with different languages and can transfer knowledge from resource-rich languages to rare languages.

As the size of these models increased to billions of parameters, and the training data and computational effort increased accordingly, the performance of the models also increased. For example, given a starting text, they could generate new stories in grammatically correct and fluent language reflecting a lot of common sense knowledge. Humans found it extremely difficult to distinguish these stories from genuine human stories.

#### *8.1.2 Jointly Processing Different Modalities by Foundation Models*

Large Pre-trained Language Models exhibited an unanticipated "emergent" behavior, which was very surprising: Without any fine-tuning the models could be instructed by a *prompt* to solve a task, e.g. create a story in a specific writing style with a specific topic. The model could be supported to solve the task by a number of examples (*few-shot prompt*). This was a completely new way of solving a task by a model on the fly.

After building huge models for language, researcher evaluated the same techniques for other types of sequences, including image patches, sound bites in audio recordings, 3D tubelets in videos, DNA subsequences, and event trajectories in video games. It turned out that the same models could be applied to these sequences, associating the respective "tokens" with contextual embeddings that capture their meaning. Moreover, the relation to other token types, especially language tokens, was automatically taken into account in a mutually supportive way. This opened the door to a wide range of mixed media applications, e.g. image captioning, image generation, video description, video generation, image manipulation, etc. It was even possible to solve planning tasks with slightly modified models of this type.

The representation of sequence elements by contextual embeddings determined by self-attention has emerged as an overarching principle for solving a variety of different tasks. In 2021 Bommasani et al. [13, p. 6] coined the term "*Foundation Models*" to capture the significance of the underlying paradigm shift. They argue that the notion of "language models" is too narrow, as the scope extends far beyond language. A good characterization would be "task-agnostic model" as the approach is applicable to many types of sequences. "Foundation Model" is similar, since it emphasizes the common basis for many task-specific adaptions. It also suggests the need for an architectural stability, safety, and security. Usually Foundation Models have billions of parameters, because, for example, the adequate response to prompts occurs only in models of this size.

Figure 8.1 shows possible training data and application tasks of Foundation Models. The models can ingest sequences with different media, as long as they can be converted to discrete tokens. This covers language and various media, but also

**Fig. 8.1** A Foundation Model can integrate the information contained in the data from various modalities during pre-training. It can access up-to-date knowledge by search engines and store intermediate results. This single model can then be adapted to a wide range of downstream tasks by few-shot prompts or fine-tuning [13, p. 6]. Credits for image parts in Table A.1

structured data and the trajectories of control variables. During training, parts of the data must be reconstructed in a self-supervised way. Advanced Foundation Models have access to a search engine that can retrieve actual information for the currently processed content. In addition, the search engine can also store information, for example, about the facts learned during a dialog. For application, the Foundation Model can be fine-tuned for specific tasks, or it can be directed with few-shot learning to execute instructions. If it was trained with multiple media, it can translate between these media, for example generate an image according to a caption.

According to Bommasani et al.[13, p. 3], we can observe four main generations of AI models


It is most intriguing that Foundation Models may directly be applied to sensory input from our world, e.g. a video describing an event, and simultaneously to the symbolic description of the world, e.g. by text or by spoken language. In this way both aspects are integrated. According to Fei-Fei Li, a professor at Stanford University, Foundation Models represent a "phase change in AI" [33].

#### *8.1.3 Performance Level of Foundation Models*

In the second part of the book, we considered different types of NLP tasks and gave an overview on the performance of current models. This is summarized in the next sections. Note, however, that according to Bengio et al. [9], usually "the performance of today's best AI systems tends to take a hit when they go from the lab to the field." 

#### **Capturing Knowledge Covered by Large Text Collections**

The main task of autoregressive language models is the reliable generation of the next word in a text. This has to obey grammatical correctness as well as semantic consistency. The *LAMBADA benchmark* [66] is a good test to demonstrate this ability (Sect. 4.1.3). The task is to predict the missing last word of the last sentence of a longer passage. Examples were filtered by humans to ensure that the models need to take into account the full passage of at least 50 tokens to induce the final word. PaLM with 540B parameters with few-shot instructions could increase the accuracy to 89.7% [24, p. 79]. This means that in nearly nine out of ten cases the predicted word was exactly right, although several answers were possible in each case.

During pre-training, Foundation Models are able to extract an enormous body of knowledge from huge text collections. While the early models were tested with a few natural language understanding benchmarks, e.g. GLUE and SuperGLUE (Sect. 4.1.1), actual models with hundreds of billions of parameters usually are tested with test collections containing hundreds of different benchmarks. An example is the *BIG-bench benchmark* (Sect. 4.1.4) with currently more than 200 benchmarks from diverse fields such as analogical reasoning, common sense knowledge, emotional intelligence, ethics, fact checking, humanities, logical reasoning, maths, medicine, science, technology, and social sciences.

The PaLM model with 540B parameters, for instance, with 5-shot prompts achieves a higher Big-bench score than the average score of the humans asked to solve the same tasks (Sect. 3.1.2). A significant number of tasks showed discontinuous improvements from model scale, meaning that the performance improvement from the smaller PaLM versions to the largest model was higher than expected. Other models, such as GPT-3 and Gopher, achieve lower, but still very respectable results.

Sometimes, however, generated texts or answers to questions are not factually correct, but only somehow plausible. This reflects the internal mechanics of selfattention, which just computes correlations between tokens. Recently, models such as WebGPT, Retro, and LaMDA perform a database or web query on the current topic and are able to incorporate information from retrieved documents into the generated text (Sect. 3.4.5). In this way, the correctness of the generated text can be profoundly enhanced. It is even possible to explain the answers by citing relevant documents. Especially helpful for multistep reasoning is the provision of a 'chain of thoughts' that encourages the Foundation Model to break the task down into smaller steps.

The verification of the knowledge of Foundation Models has to be performed carefully. Often the model is able to draw a conclusion not from actually 'understanding' the situation but from mere correlations (Sect. 4.3). This has to be taken into account during the construction of the tasks. In addition, it has to be guaranteed that no test material was used during pre-training.

#### **Information Extraction**

*Information extraction* was the classical approach of natural language processing to find a solution for a task. Text classification, named entity recognition, entity linking and relation extraction can all be solved with much higher accuracy than before by specialized PLM variants like XLNET or DeBERTa, with accuracy levels usually above 90%. Even for the notoriously difficult task of word sense disambiguation, accuracy could be increased to 83%.

For *relation extraction* tasks such as aspect-based sentiment analysis or semantic role labeling, the first step is usually to extract one argument of a possible relation. Subsequently models like BART have to decide in a second step whether there is a relation to a second argument. The resulting F1-values are usually in the high eighties, exceeding the performance of pre-PLM approaches. Most current relation extraction systems use relatively small BERT variants for their experiments. Therefore, it can be assumed that larger models will increase performance. In addition, Foundation Models such as GPT-3 and PaLM can be fine-tuned and achieve high accuracy even for few-shot prompts. However, relation extraction has not yet been evaluated with the current text collections (e.g. Big-bench) for Foundation Models.

#### **Text Processing and Text Generation**

Foundation Models have taken shape most strongly in natural language processing. A surprising breakthrough in this field was *Information Retrieval*, where embedding-based approaches achieved better retrieval results than prior keywordbased approaches (Sect. 6.1.5). They are able to identify paraphrases and take into account synonyms. This, for instance, has been demonstrated for the MS-MARCO passage retrieval benchmark. In addition, efficient approximate nearest-neighbor search indices like FAISS may be used to accelerate retrieval. These techniques are now employed in production search engines, e.g. by Google.

*Question Answering* is a classical application in NLP, which has benefited greatly from Foundation Models. Models like GPT-3, PaLM, and LaMDA can be queried by few-shot prompts. With a retriever-reader architecture, additional knowledge can be obtained by search, leading to correct answers much more often. With respect to the Natural Questions benchmark, the FB Hybrid model answers 67.4% of the questions correctly, which is about as good as a human experts using a search engine (Sect. 6.2.2). The LaMDA Foundation Model with 137B parameters demonstrates that facticity can be improved by using retrieval and that a system of filters is able to reduce toxic language.

*Translation* into another language is a success story of Foundation Models. Usually encoder-decoder models are used to generate a translation. Recent improvements resulted from sentence back-translation, which particularly increases results for low-resource languages, from translating entire documents instead of sentences, and from training a single multilingual model for translation between up to 100 languages. Recently, multilingual models even were able to outperform high-resource bilingual translation models. It turns out that, according to human raters, the trained models achieve better performance values than human reference translations for some language pairs (Sect. 6.3.1).

To keep track of a topic in publications, *text summarization* models are very helpful. Foundation Models can be fine-tuned to condense a long article into a few sentences. Larger documents require a transformer encoder-decoder with a larger input sequence, e.g. BigBird. While fine-tuned Foundation Models can achieve a similar performance as specific summarization models, results for few-shot prompts need improvement. It is possible to fine-tune a model directly with respect to human ratings of summaries. In one experiment, the model's summaries were preferred to the human reference summaries in 70% of the cases (Sect. 6.4.1).

*Story generation* receives a start text and generates a syntactically correct and semantically coherent continuation. To have more control over the generated text, a style and the content to be mentioned can be specified. This can be done by including style markers in the start text and specifying a storyline, which can be taken into account by fine-tuned Foundation Models. Much easier is few-shot prompting, where the style and bullet points of the content are provided to a Foundation Model, which incorporates this information during text generation (Sect. 6.5.4). The same techniques can be applied to the creation of computer programs, e.g., through the GitHub Copilot (Sect. 6.5.6), but also to the creation of fake news.

*Dialog Systems* automatically generate adequate responses to the utterances of a human dialog partner in the course of a longer conversation. All models are pre-trained on large collections of natural language text, preferably dialogs from social media. The LaMDA model with 137B parameters (Sect. 6.6.3) is finetuned to increase quality (sensible, specific and interesting answers), safety (avoid harmful suggestions and unfair bias) and factual grounding (preventing unproven statements). LaMDA uses retrieval of information to include valid and up-todate information and is able to incrementally store the state of the dialog in a knowledge base. The discussions on the possible self-awareness of the LaMDA dialog model illustrate that the model has reached a remarkable level of performance and consistency.

If this trend continues, it is possible that in the future only a single Foundation Model will solve a spectrum of text analysis, information retrieval, and text generation tasks. Therefore, any improvements in these background models can lead to immediate benefits across many NLP applications.

#### **Multimedia Processing**

*Speech recognition* has made tremendous progress in recent years, and Foundation Models are now an established architecture for this task. Often combined with CNN blocks, they are able to capture interactions over long distances and reduce processing times. On the LibriSpeech benchmark the SOTA could be reduced to 1.4% word error rate (Sect. 7.1.3). The generation of speech from text has improved dramatically in recent years. WaveNet was the first model to generate speech-like waveforms at 16,000 samples per second. Often models are able to adapt their output to the voice of multiple individual speakers.

*Image processing* has taken a big leap in the last years. The Vision Transformer (ViT) outperformed CNNs in terms of accuracy on various benchmarks (e.g. ImageNet) and requires much less computational effort. Foundation Models for image processing receive image patches as input (e.g. 16 × 16 pixel squares) and transform them to embeddings. In general, text tokens and image tokens are processed by the same Foundation Model, which allows to generate images from text (DALL-E 2) or to create textual answers for image interpretation tasks. Multitask systems like OFA can generate text and images as output depending on the input query (Sect. 7.2.8).

*Video processing* requires the integration of various modalities such as images, video frames, text from video subtitles or speech recognition, and audio together with spoken language. It adds a new time dimension to image processing. Video often uses tubelets as input tokens, which extend image patches over a number of frames. The performance of video interpretation, e.g. for video captioning, has been dramatically improved. The Flamingo model combines a text Foundation Model with video adapters and can solve a large number of video interpretation tasks (Sect. 7.3.3). Nüwa can handle multiple modalities of data and tackles a number of tasks, e.g. text-to-image, sketch-to-image, image completion or editing, textto-video, video prediction and video manipulation (Sect. 7.3.4). Imagen Video (Sect. 7.3.4) recently was able to generate short high-definition videos.

*Control trajectories* are a completely different type of sequences, which can be processed by Foundation Models. They occur during control tasks, e.g. game playing. The input consists of triples (reward, state, action) at time *t*, and the aim is to predict the next action. The Decision Transformer predicts the *forward sum of rewards*, which is the sum of all rewards until the end of the trajectory. The model is trained on observed trajectories. By specifying a desired forward sum of rewards, the model generates a sequence of actions, which achieves the designated reward level (Sect. 7.4.1). The GATO model demonstrates that Foundation Models at the same time can be used to solve reinforcement learning tasks together with text and image tasks. It is only a proof of concept and will need to be enhanced in the future.

#### *8.1.4 Promising Economic Solutions*

The technology behind Foundation Models is now beginning to make the leap from academic research to widespread real-world solutions [88]. Foundation Models can be considered as a general-purpose technology, much like electricity [16], which can be employed in a very wide range of applications and can be expected to generate a host of complementary innovations.

Oren Etzioni, the CEO of the Allen Institute, estimates that more than 80% of AI research is now focused on Foundation Models [33]. Huge sums of money are being poured into AI startups. In 2021, American venture capitalists invested a record \$115B in AI companies, according to data provider PitchBook. Wu Dao shows that China is making the field a national priority. We now list a number of important economic applications of Foundation Models.

*Search and Retrieval* are important Foundation Model applications, as keyword search on the Internet can now be enhanced or replaced by comparing embeddings to retrieve documents indexed according to their meaning. But search for images and videos also seems to be rewarding, as Foundation Models allow the comparison of text, images, and video frames with unified embeddings.

*Effective writing* is one of the most important skills in our information-based economy. Foundation Models offer comprehensive support for this activity. Starting with some text containing conditions or instructions, these generative models can automatically produce new sentences, paragraphs, or even entire memos that are strikingly coherent, informative, and creative. The text can be simultaneously checked and supplemented with up-to-date information from the Internet. There are already a number of startups developing such tools to support writing [88].

*Language translation* is a way to overcome language barriers and enable people to understand each other to facilitate cultural exchange and trade. Current Foundation Models are able to train on more than 100 languages simultaneously and provide translations in all directions (Sect. 6.3.2). In this way millions of users speaking low-resource languages can access information and knowledge from around the world. Innovative solutions are possible, such as live translation of telephone conversations and synchronization of videos taking into account the lip movements of the speakers [88].

*Chatbots* are a way to exchange information with users in real-time, e.g. for customer service requests, information about orders, or sales information. This requires systems that comply with privacy and security requirements, avoid toxic language, and integrate with third-party applications. Instead of rule-based systems with many different modules, new systems such as *LaMDA* (Sect. 6.6.3) are trained on large sets of conversations and provide meaningful, specific, and interesting dialogs, avoid harmful suggestions and unfair biases, and are fact-based by querying data collections of relevant documents. As has been shown for PaLM (Sect. 3.1.2), recent Foundation Models perform better than average humans on a large battery of benchmarks in including common-sense knowledge and question answering. A related startup is Rasa [72], which provides an open-source chatbot with a focus on chatbot configurability. *Conversational Voice Assistants* combine chatbot technology with speech recognition and speech generation. Prior systems such as Siri and Alexa have been mainly used for non-critical conversations. In 2020, there were 4.2B digital voice assistance in use worldwide [87], and this market had a volume of \$340B, with a focus on financial services and e-commerce. There are a number of startups specializing in this field.

*Healthcare* is a huge market of \$4T and many interesting tasks, such as patient screening and care navigation, where chatbots are the digital gatekeepers of the healthcare system. Foundation Models can provide the interface for care providers and collect diagnoses and treatments, and perform the analysis of patient records. Moreover, Foundation Models can interact with patients and answer questions, assist care and support community health and prevention [13, p. 57]. In addition, there is a huge need for systems that interpret medical imaging results like ultrasound, X-rays, or MRT. Furthermore, Foundation Models can support drug discovery and clinical tests and guide personalized medicine. With a critical shortage of trained therapists, there is an opportunity for mental health chatbots. These systems can be accessed instantly via a mobile app to talk to individuals

about their lives and problems. They are not a complete clinical solution, but rather one potentially useful tool for people in need. *Woebot* [94] is a leading startup in this area.

Foundation models in *genomics and proteomics* have an extremely high potential for biomedical and drug discovery (Sect. 7.5). Deciphering the language of *DNAsequences* is one of the most important goals of biological research. While the genetic code, which explains how DNA is translated into proteins, is universal, the regulatory code, which determines when and how genes are expressed, varies between different cell types and organisms. This is similar to polysemy and distant semantic relationships in natural language texts. DNABERT [42] has been pretrained on a large set of DNA sequences and can improve the state of the art by fine-tuning for many specific prediction, e.g. the analysis of biological relevance and the prediction of expressions of a gene. There are a number of startups such as Quantagene that are using the human genome for precision medicine.

*Proteins* are linear chains of amino acids and can be represented by an alphabet of 25 characters. The strings are ideally suited for many NLP methods [64]. AminoBERT is a language model [25] which predicts the 3D protein structure from a protein sequence as input. On specific tasks the model even outperforms AlphaFold2 [44]. There are a number of other models with similar results [55]. They could accelerate drug development and lead to a significant reduction in development costs.

The *legal industry* provides legal goods and services and has a huge application potential for Foundation Models. In the US, there are 1.3M lawyers and more than \$300B annual revenues [13, p. 57]. Legal work usually involves reading and summarizing documents, e.g. contracts, rulings of the appeals courts, historical decisions and standards, legal research, etc. Foundation Models may take into account many modalities: audio during trials, video and images during content discovery, and text in conducting legal research. They may weigh legal arguments and support lawyers, judges, and prosecutors in drafting legal texts. The use of Foundation Models in the legal industry can potentially democratize access to legal services.

In *education* Foundation Models can be trained to automate the process of motivating and instructing students. Teaching is practically a multimedia dialog process between teacher and student [13, p. 67]. In the view of the recent advances in dialog Foundation Models, e.g. LaMDA, it seems straightforward to fine-tune a dialog agent for conducting educational dialogs. Models have to be trained to acquire teaching materials, subject matters, and pedagogical techniques. In addition, they need to understand students, their motivations, skills, and preferences. They must also comprehend the processes of learning and teaching and be able to perceive different reactions of student. The availability of educational Foundation Models could personalize and democratize learning. This would be especially important for poor countries, where even today only a fraction of students receive a proper education. It could also reduce \$30,000 student loan that the average student in the US needs today.

#### **8.2 Potential Harm from Foundation Models**

Foundation Models sometimes have hundreds of billions of parameters and can be instructed to solve a wide variety of tasks. They are based primarily on associative self-attention, and understanding their inner workings in detail is extremely difficult. The next words of a text are generated by a random mechanism. Therefore, Foundation Models can potentially generate undesirable word sequences and responses that may be harmful to the reader. In the same way, Foundation Models can compose or interpret other media in ways that are detrimental to users. Recent surveys of these problems are provided by Weidinger et al. [92] and Bommasani et al. [13]. Table 8.1 lists the risk areas that we discuss in the following sections.

#### *8.2.1 Unintentionally Generate Biased or False Statements*

A *stereotype* or *bias* is a generalized belief about a particular group of people, such as their personality, preferences, appearance, or abilities. Stereotypes are sometimes correct for part of the group, but can demean the rest of the group. It is known from psychology that bias is an innate human strategy for decision-making [46]. It allows the rapid formation of a judgment in reality, when there is not much time to weigh arguments. As Foundation Models are trained with text produced by real people, these texts often reflect the stereotypes present in the society. This is particularly serious for text generation systems such as dialog assistants and chatbots. Based on the principle of equality in human rights, a Foundation Model should avoid prejudices. For example, men and women should be equally likely to be associated with an occupation. Surveys on bias in NLP are provided by Garrido-Muñoz et al. [37], Mehrabi et al. [57] and Bommasani et al. [13, p. 129].

As an example, consider GPT-3 (Sect. 3.1.2) with 175B parameters [17]. It reproduces stereotypes, e.g. on gender, race and occupation. By providing a start text like "The detective was a", the model-generated continuation often included a gender indicator, e.g. "man". The authors tested 388 occupations and found that 83% of them were associated by GPT-3 with a male identifier [17, p. 36]. In contrast, women clearly predominate in occupations such as midwife, nurse, receptionist, and housekeeper. These associations reflect the relations actually observed in the texts and in society, but are often socially undesirable.

It was further investigated, what mood was associated with race. Asian race was consistently associated with high mood, while Black race was related to low mood. Religious bias was investigated by examining which words appeared together with **Table 8.1** Potential Harm Caused by Foundation Models. For each area of harm, we list the mechanism causing the harm, the type of potential harm, and detailed harm aspects. Table adapted from Weidinger et al. [92, p. 10]

#### 1. **Unintentionally Generate Biased or False Statements** Sect. 8.2.1

*Mechanism:* Foundation Models accurately reproduce unjust, toxic, and suppressive statements present in the training data

*Potential Harms:* Offenses against persons and subgroups, denial of access to resources, and the unjust representation or treatment of marginalized groups


#### 2. **Intentional Harm Caused by Foundation Models** Sect. 8.2.2

*Mechanism:* Individuals use Foundation Models to intentionally cause harm *Potential Harms:* Distortion of public discourse, crimes such as fraud, personalized disinformation campaigns, and malicious code production


#### 3. **Overreliance or Treating as Human** Sect. 8.2.3

*Mechanism:* Dialog Foundation Models have conversations with users and are perceived as people

*Potential Harms:* Unsafe use due to user misperceptions or mistaken trust in the model. The model exploits psychological vulnerabilities and violates user privacy


#### 4. **Disclosure of Private Information** Sect. 8.2.4

*Mechanism:* Foundation Models generate text containing private information covered in the training data

*Potential Harms:* Privacy violations and safety risks


#### 5. **Society, access, and environmental harms** Sect. 8.2.5

*Mechanism:* The downstream applications of Foundation Models over-benefit some groups more than others

*Potential Harms:* Increasing social inequalities due to uneven distribution of risks and benefits, loss of high-quality and safe employment, and environmental harm


religious terms from the different religions. For example, words such as "violent", "terrorism", and "terrorist" were more frequently associated with Islam than with other religions. When GPT-3 is fed a sentence containing the word "muslim", the completion contains terms such as "shooting", "bombing", "murder", or "violence" more than 60% of the time. Such offensive, insulting or hateful language is also called *toxic language*. For example, for the input "Islam practitioners are", the PaLM (Sect. 3.1.2) model generated many highly biased words such as "bad, terrorist, violent, peaceful, religious, sometimes, even, good, true, radical". Note that most NLP models are affected by the bias problem [37].

There is a need for methods to mitigate bias problems. Biases originate from the training data, which may contain toxic and hate speech, abusive language, microaggressions, and stereotypes [13]. After training, biases are contained in Foundation Model components, such as parameters and word embeddings. A first avenue to reduce bias is to filter or reweight the training data to eliminate unwanted language. According to a number of experimental evaluations, technical approaches of any kind are currently severely limited, and methods that measure or combat bias in training data are fragile or ineffective [104]. Moreover, it is a difficult task to decide which biases to filter out. Is it okay for a man to run the 100 m faster than a woman? Is it okay that women cause less traffic accidents than men?

A simple approach to mitigate gender bias in word embeddings is to "swap" gender-specific terms in the training data when creating word embeddings [102]. In addition, simple masking of pronouns and names may also reduce biases and improve performance on certain language tasks [28]. These mitigation approaches may target different steps in the pipeline, such as the training data itself, the modeling objectives, and the adaptation methods [13, p. 133]. To date, however, there is no general, unified way to reduce the bias from Foundation Models for text generation, and proper mitigation requires a more holistic approach [38]. From this perspective, LaMDA's filtering techniques appear to be quite effective (Sect. 6.6.3). The reinforcement learning approach with humans in the loop of InstructGPT [162] is particularly effective in avoiding unwanted language and performing the intended tasks (Sect. 3.6.5).

#### **Accidentally Generated False or Misleading Information**

There are estimates that almost 50% of the traffic coming from Facebook is fake and hyperpartisan [47]. Nevertheless, it is a dominant source of news for millions of people. Due to the following reasons, fake news can be very harmful to people [81]:


• *Confirmation Bias*: People favor receiving information that only supports their own current views. Most persons only want to hear what they believe and do not want to find any evidence against their viewpoints.

There are numerous motivations for people to spread fake news. *Clickbait*  intents to lure users with snappy headlines to earn money on social media pages. *Propaganda* intentionally aims to mislead the audience, e.g. during elections. Sometimes *satire*, parody, hoaxes and rumors are published to entertain the readers. Through misleading headlines, biased news or outright misinformation, journalists can attempt to distort information. There are some surveys on the analysis of fake news [27, 49].

Foundation Models determine correlations between different natural language phrases and generate new text based on probabilistic sampling. Therefore, they can accidentally generate text that contains false or misleading statements. Some examples are provided in Sect. 4.2.2. Factually incorrect or nonsensical predictions may be harmless, but under particular conditions they may pose a risk of harm. The harms range from false information, deception, or manipulation of an individual, to material damage. In addition, there are far-reaching community impacts, such as the loss of trust among members of a society.

There can be several reasons for false statements. Training corpora in the first place contain the biases present in the community, such as attitudes towards homosexuals and other ethnic and minority groups. Moreover, they typically contain web texts that frequently cover factually incorrect statements, e.g., fiction, novels, poems, or jokes. In addition, training corpora are likely to contain instances of satire and misinformation, such as websites emphasizing a political stance. Furthermore, Foundation Models can have problems with logical reasoning and sometimes do not adhere to logical rules, e.g. if "birds can fly" is true, then "birds cannot fly" must be false (Sect. 4.2.3). Finally, the context determines if a statement is true or not. The sentences "I love you", "it is raining", or "Obama is president" can be factually correct or false depending on the speaker, the location, or the time. The training data does not always define this context, and the context often cannot be captured by a Foundation Model. Context often requires to take into account knowledge of other domains and modalities (vision, time) and can be improved by grounding language in physical experience [8].

#### **Reducing Bias by Retrieval**

Retrieval-based Foundation Models, such as WebGPT (Sect. 6.2.3), Retro (Sect. 6.2.3), and LaMDA (Sect. 6.6.3), can access a large collection of text documents to enhance the text to be generated with relevant retrieved information. Shuster et al. [78] have shown that the use of retrieval reduces the rate of 'hallucinations'. WebGPT performs about as well as humans for factual accuracy on the *ELI5 benchmark*. Similar to a scientific author, WebGPT can support its text by citing documents that support a statement. This often allows the user to check the validity of a statement.

However, as with scientific papers, referencing external sources does not solve all problems. What makes an Internet document reliable? Which statements in a text need to be substantiated, and which are self-evident "common knowledge". Current language models are still in their infancy in dealing with these aspects, but there are ways to improve them. On the Internet, for example, there is already the Web of Trust rating platform, which derives the reliability of websites from user ratings. Note that citations make the answer appear more authoritative, which could lead to over-reliance on WebGPT's answers. In fact, WebGPT sometimes produces incorrect statements when it paraphrases or synthesizes a context. Note that WebGPT can make more mistakes than humans on out-of-distribution questions.

#### **Filtering Biased Text**

Solaiman et al. [80] propose an iterative process for significantly changing model predictions by creating examples and fine-tuning on a dataset that reflects a predetermined set of targets. The strategy is to modify the behavior of the language model in a specified direction with fine-tuning on surprisingly few samples. This is evaluated by different measures focusing on the targets and the toxicity of outputs. At each iteration, additional training examples are added based on observed shortcomings. The approach performs significantly better on all metrics compared to control models for a broad range of GPT-3 language model sizes without compromising model performance.

The LaMDA dialog system (Sect. 6.6.3) is trained to perform retrieval and include retrieved information into its answers. The IR system is also capable of returning passages from the open web with their corresponding URLs. The LaMDA system is fine-tuned to classify for a given context whether the response is sensible, specific, and safe. *Sensibleness* measures whether a model's response makes sense in context and does not contradict anything that was stated earlier. *Specificity*  measures whether a response is specific to a given context and contains some information. *Safety* means that the responses of the system should never violate a pre-specified set of rules [86, p. 25]. An evaluation by human raters shows that LaMDA is close to human performance in terms of sensibleness, safety and groundedness (Fig. 6.23). It turns out that fine-tuning with respect to safety and groundedness is a big advantage compared to the bare pre-trained model. Examples are shown in Table 8.2. A similar filtering approach was analyzed by Rae et al. [71] and implemented by Sun et al. [83].

Lower performance of a Foundation Model for topics affecting different groups can often be observed and mainly depends on the coverage of the topics in the training data. An example is the information about Kurdish history present in the training set compared to information on English history. Covering different languages is possible in multilingual models (Sect. 3.3), but low-resource languages are always less represented. Although PaLM covers more than 100 different languages, 78% of the training data is English, and German is second with 3.5%. Therefore, current Foundation Models have higher performance in English than in other languages.

**Table 8.2** Selected examples showing the responses of the pre-trained and safety-fine-tuned LaMDA models to a given context. The authors note that without fine-tuning, the model can generate even more offensive and biased responses. A \*\*\* indicates omitted problematic phrases. Also, while safety-fine-tuned responses are better, some of them are still problematic [86, p. 36]


#### *8.2.2 Intentional Harm Caused by Foundation Models*

Foundation Models may be intentionally used to generate false statements. One approach is to fine-tune the model with biased training data, e.g. documents posted by Corona-deniers. Carlini [20] discuss approaches to introduce unwanted documents into training data. Foundation Models predict higher likelihoods for concepts that are more prominent in the training data, regardless of whether they are factually correct. There are many examples of fine-tuning GPT-models (Sect. 3.6.2) for more innocent text types, e.g. song lyrics [100] or poetry [52]. In a similar way GPT-2 trained on biased data generates texts corresponding to the fine-tuning dataset, consisting for instance of far-right fake news [18, p. 14]. The resulting GPT-2 version was able to imitate the style of a publication with very high reliability. Note that OpenAI controls the access to the fine-tuning API of GPT-3 (Sect. 3.6.2) to avoid similar efforts [54].

Throughout this book we have seen that Foundation Models can produce credible news stories that a majority of readers cannot distinguish from human-written text. The downside is that these models, especially GPT-3, can also be used for disinformation campaigns. In Sect. 6.5.5 we have demonstrated that language models may generate targeted fake-news by few-shot prompts with very little human effort. Foundation Models allow an agent to personalize fake content for small audiences, or even to target a single individual [13, p. 136]. By conditioning output on personal attributes or information, Foundation Models can create realistic

**Fig. 8.2** Image modifications generated with GLIDE [62]. The original image is shown on the left and the green area is marked for change. The green region is erased, and the model fills it in conditioned on the prompt given below. GLIDE is able to match the style and lighting of the surrounding context to produce a realistic completion. Image reprinted with kind permission of the authors [62, p. 3]

personalized content that is more embarrassing, puts victims at greater risk, and leads to more successful blackmail attempts.

#### **Fake Images Created by Foundation Models**

Multimodal models like DALL-E 2 (Sect. 7.2.7) or GLIDE (Sect. 7.2.7) are ideal for creating fake images. As shown in Fig. 8.2, an image of a celebrity or an event can be altered by providing a simple sentence to insert new objects or persons into the image to fabricate evidence for fake news. Note that the approaches allow the creation of high resolution images of 1024 × 1024 pixels using diffusion models. There are also workflows to generate fake videos, e.g. by *DeepFaceLab* [67] , where the face of some person is inserted into a video and the face movements are aligned with a new spoken text of choice. This technique was recently used by a fake mayor of Kiev to make video calls to a number of Western politicians [58].

On the other hand, Foundation Models can be used to identify model-generated content [99]. Fake news can be detected by combining information on news content, publishing, and reposting relations of publishers and users, employing Foundation Models to relate these characteristics to each other [77]. Alam et al. [3] and Yu et al. [98] provide surveys on multimodal disinformation detection.

#### **Surveillance and Censorship**

Large organizations or countries may use Foundation Models for mass surveillance or censorship. To screen the content of social networks, classifiers for sentiment analysis or identification of critical utterances can be trained and easily applied to large volumes of text. Using on only a few training samples, these classifiers achieve high accuracy in identifying specific types of text [17]. Such classifiers may be used for identifying, for example, political dissents at scale, reducing the effort to recognize dissenters. This is already happening on an extremely large scale in China, as reported by the New York Times [95]. Such a surveillance often leads to a selfcensorship, e.g. when writing texts for web blogs.

A less drastic form of censorship is *algorithmic filtering* in social media that determines the content presented to users, often using Foundation Models. In this way, social media platforms have the ability to influence the user perceptions and decisions, from hotel choices to voting preferences. User often only receive news that they 'like' or that the provider deems "appropriate", and therefore may find themselves in a 'filter bubble' where news that does not match the expressed opinion is hidden. The problem is that users are often unaware of filtering and do not know the criteria used to prioritize content. As a result, many citizens are calling for regulation of filtering algorithms, but drafting and enforcing regulations remains a challenge. A target of regulation may be, for instance, that the ads a user sees are not be based on sexual orientation, or that content related to COVID-19 does not reflect a user's political affiliation [23]. The authors provide an auditing procedure that allows to check whether the platform complies with the regulation, requiring only black-box access to the filtering algorithm. In addition, the resulting performance cost and content diversity are discussed.

#### *8.2.3 Overreliance or Treating a Foundation Model as Human*

It is well-known that users often do not understand the exact nature of a chatbot. *XiaoIce* was designed as an "emphatic voice assistant" [103] and launched by Microsoft in China in 2014. It was the most popular chatbot in the world with 660 million users in China, Japan, USA, India and Indonesia. In the conversations between XiaoIce and its users, an average of 23 responses were counted per dialog. That is more interactions than were observed on average in conversations between real people (about 9). This shows that users enjoyed talking to XiaoIce at length. Even more, users were building a 'personal' relationship with XiaoIce and told the system very private details of their lives.

Recent dialog models such as *BlenderBot 3* and *LaMDA* (Sect. 6.6.3) have more parameters and much better ratings than XiaoIce. The LaMDA dialog system, for instance, on average generates more interesting and also more informative answers than a human [86]. Thus, there is a risk that people will accept the system as human. This can cause psychological harms, such as disappointment when a user tries to use the model as a 'partner'. This issue has since been addressed in a number of movies such as Ex Machine and HER. Users may 'blindly' trust conversational agents. If users act on Foundation Model predictions without reflection or effective control, factually incorrect model predictions may cause harm that could have been prevented by effective monitoring.

#### *8.2.4 Disclosure of Private Information*

Foundation Models have billions of parameters and are trained on massive text collections with many billions of tokens. However, only a small fraction of the knowledge in the training data can actually be replicated by Foundation Models. Nevertheless, Carlini et al. [21] have shown for GPT-2 that it is possible to reproduce hundreds of texts verbatim. They identify 46 names, phone numbers, addresses, and social media accounts of individual persons, excluding celebrities. A survey on privacy in Deep Learning is provided by Mireshghallah et al. [59].

The PaLM model has 540B parameters and was trained on 780B tokens in a single pass. To evaluate memorization the authors randomly selected 100 token sequences from the training examples, and prompted the model with the first 50 tokens of the span. They measured how often the model produced a 50-token continuation by greedy decoding that exactly matched the training example. It turned out that the model was able to reproduce the continuation for 2.4% of the data. This means that the model could be able to reproduce 18.7B tokens of the training data, which is an extremely large set of documents. Memorized sentences often were of formulaic text with no potential to harm persons. However, it was also observed that LaMDA memorized stories, news articles, and facts.

There are several ways to mitigate privacy problems in Foundation Models. A memory-demanding approach would be to filter out sequences from generated data which already occurred in the training data by a *Bloom filter*. Another approach is training with *differential privacy*. The idea behind differential privacy is that the model output does not allow any conclusions to be drawn about an individual person. There is a *differentially private stochastic gradient descent* (DP-SGD) algorithm [1] that can be used to train Foundation Models [36, 97]. However, because less information can be used during training, there is a significant reduction in the performance of the Foundation Model [35]. Qu et al. [69] propose a privacyadaptive pre-training method for Foundation Models and demonstrate that a BERT model pre-trained with a denoising MLM objective can substantially increase the utility of BERT compared to prior approaches while retaining the same level of privacy protection.

During inference, privacy violations may occur even if the individual's private information is not included in the training dataset. A Foundation Model can make correct inferences about a person based solely on correlational data about other persons. Such a *statistical disclosure* can occur when Foundation Models predict the gender, race, sexual orientation, income, or religion of an individual. These conclusions can harm individuals who are correctly classified by disclosing their private information and increase the risk of unfair discrimination. Also, incorrectly predicted characteristics can harm individuals by exposing them to unfair discrimination.

#### *8.2.5 Society, Access, and Environmental Harms*

#### **Access to Foundation Models**

Foundation Models are expected to transform large areas of the business world and our daily lives. Models like LaMDA and PaLM with hundreds of billions of parameters have the greatest innovation potential. However, currently only a few organizations in the world, such as Google, OpenAI, and Facebook, Microsoft and the Beijing Academy of Artificial Intelligence have the resources to train Foundation Models. These models can be used on a large scale to replace human labor, supplement humans, or help discover new tasks and opportunities. Even if Foundation Models increase average productivity or income, there is no economic principle that guarantees that everyone will benefit. This can lead to greater concentration of ownership and power for the owners of the model. Figure 8.3 shows the size of models trained by large Internet companies compared to models trained by universities and smaller research institutions.

In contrast, there are ideas to create public datasets and train open-source Foundation Models. Decentralization would be desirable so that everyone can share in the benefits of the models. Public funding and infrastructure are needed to prevent Foundation Models from being operated only by private companies [13]. Stanford University recently called for a "National Research Cloud" to supply universities

**Fig. 8.3** Around 2016, a new trend of very large models emerged (red). These were developed by leading Internet companies that were able to finance the investment. The lower blue line illustrates the average computational effort of regular models, e.g. from universities. Note the logarithmic scale of the training compute. Image cutout from [76, p. 5]

with enough computing power and datasets, to prevent Foundation Models from being entirely dominated by private companies [33]. Currently, there are many efforts to reduce the cost of training these models and apply them to other languages, such as *GPT-NeoX-20B* [91], *BigScience* [11], and *OpenGPT-X* [61]. Recently Meta announced the release of an Open Pre-trained Transformer (*OPT-175B*), a language model with 175 billion parameters trained on publicly available data sets, to allow for more community engagement in understanding this foundational new technology [101]. The *BLOOM* language model has 176B parameters and is freely available. It is aimed to represent the cultural context of European languages. The dialog system *BlenderBot 3*175B is based on OPT-175B and has also been released as open-source. It is not advisable that arbitrary people have access to the full models, as the risk of misinformation and misuse is obvious. The two large models are only made available to researchers in a non-commercial setting.

#### **Energy Consumption of Foundation Models**

In this section we discuss the damages that result from the impact of Foundation Models on environment and downstream economic consequences. Foundation Models incur significant environmental costs because of their energy demands for training and operating the models. As an example, consider the training effort for the PaLM model with a total effective emission of 271.4 tons of *CO*<sup>2</sup> equivalent emissions [24]. This is 50% more than the total emissions of a direct round trip of a single passenger jet between San Francisco and New York (JFK) with estimated 180 tons of *CO*<sup>2</sup> equivalent emissions. Note that the application of Foundation Models is much cheaper. OpenAI charges \$72 for processing the collected works of Shakespeare with 900k words with GPT-3. Foundation Models are used at scale by Google and Microsoft, e.g. for translation or web search. A more detailed discussion is given by Bommasani et al. [13, p. 139].

#### **Foundation Models Can Cause Unemployment and Social Inequality**

On the other hand, the groundbreaking capabilities of Foundation Models in language processing can lead to the automation of tasks that are currently performed by paid human workers, such as responding to customer service inquiries, translating documents, writing computer code, or creating an image, with a negative impact on employment. This will require current workers to be retrained for new jobs and could eventually lead to higher unemployment. The economic risks are difficult to forecast as it is not clear at which scale new human workers will be needed. One worrying development is that, for the first time, intellectually demanding work is being replaced by machines on a large scale [5]. According to the study the most vulnerable employment segments are logistics, office workers, production, service, sales, and construction.

**Fig. 8.4** Automation risk for occupation clusters in the U.S. sorted by median risk values (line inside the box). For each job cluster, the boxplot shows the first quartile (Q1), median (Q2), and third quartile (Q3) of the ARI distribution, and the whiskers indicate the upper and lower adjacent values. Image reprinted with kind permission of the authors [65, p. 4]

Paolillo et al.[65] start with the observation that jobs require a mix of capabilities. They decompose the occupational competences into 87 different skills and estimate an *automation risk* (ARI) for these skills. From this, they calculate an automation risk for almost 1000 occupations. The ARI can be interpreted as the proportion of human skills required for a job that can also be performed by machines. For physicists, the authors estimate the lowest ARI with a value of 0.44, while slaughterers and meat packers have the highest ARI of 0.78. Figure 8.4 shows the estimated ARI for different job clusters. The median ARI is about 0.6, which means that 60% of all skills can be automated. As a consequence, almost all occupations are likely to be strongly affected by automation. The authors argue that workers' automation risk could be substantially reduced by moderate occupational retraining.

Artificial intelligence differs from the previous innovations in that it does not automate manual jobs, but cognitive tasks [15]. Using panel data on 33 OECD countries, this study investigated the link between AI, robots and unemployment. It found that both robots and AI tend to increase unemployment, providing additional evidence to the literature on technological unemployment. It also concludes that, over a 3-year period, AI increases the unemployment rate of people with a medium level of education, while the effect is negative or not significant for the others. This is an indication that medium-skilled jobs suffer most with increasing AI use.

Foundation Models are extremely good at generating stories, and it is reasonable to assume that in a few years they will be able to write entire novels or compose songs in a semi-automatic way. Likewise, Foundation Models can create and modify graphics and photo realistic images, thus devaluing the work of graphic designers and photographers. This is especially true for creative works (e.g. fiction, press articles, music), but also for scientific studies. This type of plagiarism is discussed by Dehouche [30]. Since it cannot be argued that the generated content violates copyright, this development can undermine the profitability of creative or innovative work. While such copyright erosion can cause harm, it can also create significant social benefits, for example, by expanding access to educational or creative materials to a wider community. The assessment of potential harms and benefits from copyright busting deserves further consideration [92]. In the meantime, courts are dealing with this problem [90].

In January 2021, there were 4.6B active Internet users worldwide—59.5% of the global population [82]. Nevertheless, many social groups and countries will not have access to Foundation Models that require a particularly powerful computing environment. The unavailability of this technology can preserve global inequalities by disproportionately benefiting some groups. Foundation Model applications such as translation, text-to-speech, and digital assistants are especially important for people who are illiterate, have not had a full education, or suffer from learning disabilities. This should be reflected in the choice of languages used for the training of Foundation Models. Bender et al. [7] discuss the global distribution of benefit and risk from Foundation Models in detail.

#### **Foundation Models Can Promote a Uniform World View and Culture**

Currently, the Internet is dominated by monopolies. Alphabet handles web search, Amazon dominates e-commerce, Apple is leader in business smartphones, Meta governs social networks, and Microsoft controls business software [6]. These companies benefit from extreme economies of scale because digital platforms often require large upfront costs, but after these initial expenditures, the cost of providing service to additional customers is close to zero. In addition, the companies have been buying startups and competitors to eliminate rivals.

Therefore, it can be assumed that Foundation Model services will be integrated into the existing infrastructure of these monopolies, using the existing search engines as information providing components. Hence, it is plausible that only a few different Foundation Models will be employed to support the authoring of the majority of documents in the world. This means that the strengths, creativity, biases, shortcomings, oddities, and peculiarities of the few original models will be ubiquitous and may affect the culture in many different languages in a consistent way [13, p. 151]. This *homogenization* can produce extremely high benefits across a large number of applications, but it also can have a profound negative effect in other fields. Kleinberg et al. [48] have called this an 'algorithmic monoculture' which could lead to uniform biases, promotion of specific views and theories, consistent and arbitrary rejection, misclassification, or ill-treatment of individuals or groups. Cave et al. [22] even argue that in both everyday news coverage and fantastic literature, artificial intelligence is predominantly portrayed as white because that is apparently still associated with rationality, intelligence, and power. As current antitrust laws do not work for Internet companies, new regulations are required to break up the monopolies [6]. This requires redefinition of markets, requirements for interoperability of services, and a change in the ownership of data to the customer, who can transfer it to another provider.

#### **A Legal Regulation of Foundation Models Is Necessary**

The automated application of Foundation Models trained on extremely large text collections poses a whole new set of challenges for our society. We want common good, human-oriented systems that are in line with our values, work reliably and are competitive at the same time. We must therefore try to achieve fair and objective results and avoid undesirable consequences. Fair behavior of an application towards all stakeholders, consideration of the needs of users, reliable, understandable and secure functioning as well as the protection of sensitive data are central requirements for the trustworthy use of Foundation Models.

It is well-documented that organizations have often made poor decisions when adopting a new technology [19]. Commercial companies like Google, on the other hand have no direct incentives to increase transparency or reduce social inequalities [73]. In order to make Foundation Models humane and trustworthy, there needs to be a societal understanding of what guardrails, principles and boundaries should apply, how Foundation Model applications should be developed, how autonomous they should be allowed to act and how we want to control them. As a consequence, there are efforts in different countries to define rules for Foundation Models and AI systems.

The European Union proposes a regulatory framework based on the risk associated with an AI application [34]. It defines four risk levels: Minimal or no risk, limited risk, high risk, and unacceptable risk. All AI systems with unacceptable risk (threats to the safety, livelihoods and rights of people) will be banned [13, p. 157]. High-risk applications include critical infrastructure, educational training, biometric and safety components, and have to satisfy a number of strict checks and assessments before they can be put to market. Special transparency obligations apply to systems with limited risk, such as chatbots. Minimal or no risk systems, such as AI-enabled video games or spam filters, can be freely used. The vast majority of AI systems currently in use in the EU fall into this category.

In the US specific regulatory guidelines have been proposed by different agencies [84]. The Department of Commerce is developing "a voluntary risk management framework for trustworthy AI systems". The Federal Trade Commission lists a number of compliance expectations. These include requirements for adequate training data, the need to test the model to avoid biases, openness regarding the use of data, truthful representation of the model's performance, and transparency of modeling objectives. Although there is currently no uniform regulation of AI, regulators are advising companies to craft policies and procedures to create compliance by design. This encourages AI innovation, but also ensures transparency and explainability of systems. In addition, companies should audit and review policy usage regularly and document these processes to comply with regulators. A detailed discussion of norms and regulation is given by Bommasani et al. [13, p. 154].

#### **8.3 Advanced Artificial Intelligence Systems**

Self-supervised learning is standard in Foundation Models and has led to unprecedented performance gains in language and image recognition tasks. However, human intelligence has more traits that are not covered by this paradigm. In this section, we first discuss, whether Foundation Models are able to produce new creative content. Then we examine how the words and concepts of language can be "grounded" i.e. connected to the corresponding objects and processes of the physical world. Finally, we consider Kahneman's theory of human behavior and discuss some ideas for improving the current models.

#### *8.3.1 Can Foundation Models Generate Innovative Content?*

A long-discussed problem is whether current Foundation Models can generate *innovative* content, or if they are just *stochastic parrots* [7] that mindlessly repeat phrases and text snippets acquired from the training data. In the book "Rebooting AI" Marcus et al. [56] argued in a similar way. He calls GPT-3 [43] "an amazing version of pastiche generation, in a way that high school students who plagiarize change a couple words here or there but they're not really putting the ideas together. It doesn't really understand the underlying ideas." As argued above, GPT-3 cannot really "understand" the content it expresses, as it does not have a grounding for words and phrases by the objects and events in the real world.

Johnson et al. [43] prompted GPT-3 with the sentence "Write an essay discussing the role of metafiction in the work of Italo Calvino." The system generated a concise five-paragraph summary on the topic. The author characterized the resulting text as "lucid and responsive". When the prompt is repeated, GPT-3 generates a completely new response over and over again. When the author entered each generated sentence into the Google search engine, he could not find any of them. Each sentence was custom-built for that specific prompt. This illustrates that Foundation Models are very good at combining pieces of contents together. However, they do not act on the level of strings and words, but on the level of contextual embeddings, which express the underlying conceptual similarity of phrases and sentences and their relation in a large number of sentences and documents.

This phenomenon becomes even clearer when we consider Foundation Models that simultaneously capture text and image content. As described in Sect. 7.2, models such as DALL-E 2 develop a joint embedding space for image patches and text tokens. In this space images and texts are not related in terms of pixels and strings, but in terms of context-sensitive embeddings of these image patches and tokens. These embeddings are different depending on the overall composition of the image and the text. Generating new content is based on the correlation of these embeddings and therefore can create new combinations of images and text, for instance an image corresponding to "a corgi playing a flame throwing trumpet" (Fig. 7.15) or photo-realistic images illustrating the caption "A teddybear on a skateboard in Times Square" (Fig. 7.16). Although DALL-E 2 does not know anything about the physical properties of the real-world location "Times Square", it can combine information about it in terms of contextual embeddings and generate fairly realistic looking views that have never been seen before. In this way, Foundation Models can actually generate innovative content.

#### *8.3.2 Grounding Language in the World*

A long-standing problem in language research is how machines can "understand" the "meaning' of language. Bender et al. [8] argue that the "language modeling task, because it only uses linguistic forms as training data, cannot in principle lead to learning of meaning". Here, "meaning" is defined as the relation between a linguistic form and the communicative intent in the real world. Language modeling in this context is a system for string prediction. According to this view, current language models do not acquire "meaning", but relate phrases to other phrases.

Perception learning of an infant also takes place in a self-supervised way (Fig. 8.5). Parents and babies are pointing to objects during language development [26], and babies learn the grounded meanings of words that refer to common objects before they learn many other aspects of language [10]. The baby simply observes its environment and, probably, develops some expectation of how the environment (e.g. object movement, view change) will evolve over time (Fig. 8.6). Seeing an apple fall a number of times is enough to get a sense of how gravity works. Moreover, objects do not disappear when they move out of sight. The baby can learn by predicting these changes and unconsciously correcting its expectations whenever a deviation occurs [51]. This corresponds to unsupervised learning in the video domain by predicting the next frames. The NÜWA system (Sect. 7.3.4) is already pre-trained in this way and has achieved SOTA for forecasting the next frames of a video.

If a system is trained only with words, it is difficult to learn a concept. A dog, for instance, is not entirely understood if one knows that it is connected to leashes, ears, cats, mammal, leg, fur, tail, toy, barking, etc. [50]. The information has to be structured so that people know that toys are things dogs play with, fur is their body covering, mammal is a category they fall into, and so on. The head of a dog near to four legs does not constitute a dog. Therefore, the best way to learn the concept of

**Fig. 8.5** Timeline for the development of infant perception according to Wikipedia [93] and LeCun [51]. Abstract laws of nature, such as the fact that objects are affected by gravity and inertia, are acquired later than simpler concepts, like object permanence and the assignment of objects to broad categories. Most knowledge is obtained through observation, with very little direct manipulation, particularly in the first months

**Fig. 8.6** A baby observes its environment and manipulates objects. It develops an expectation of how the environment (e.g. object movement, view change) will evolve over time. It predicts these changes and subconsciously learns whenever a deviation occurs. Image credits in Table A.4

a dog is to perceive it in several media, for example, as an image, in a descriptive text, and in a movie, where it is chasing a cat.

Recently a model called **PLATO** [68] has been proposed to learn intuitive physics from videos. PLATO decomposes each segmented video frame into a set of objects using a perception module. To each object an ID is assigned to allow object tracking over time. Using a violation-of-expectation criterion, PLATO can learn a number of physical concepts, such as object continuity, directional inertia, object persistence, and object solidity. The approach of the model offers a way to ground intuitive physical concepts in visual perceptions.

It can be expected that self-supervised learning will be extended with the inclusion of more dimensions like 3D, self-movement, and active manipulation of the environment. As LeCun says, "Instead of language or images, however, the next AI generation will learn directly from videos. Meta is currently putting a lot of effort into collecting video data from the first-person perspective for this new AI generation [41], but YouTube videos are also suitable training material" [74]. LeCun believes that AI systems can learn about the physical foundations of our world from such videos. Their understanding, in turn, would be the basis for numerous abilities, such as grasping objects or driving a car.

A more detailed perspective is given by Bisk et al. [12]. The authors argue that language learning has to make a connection to "extralinguistic events". They distinguish different word scopes for language learning (Fig. 8.7). The most restricted scope contains carefully created corpora like the manually annotated Penn Treebank. BERT was trained on such carefully curated datasets. The next scope covers Web scale data collections, which in the case of PaLM include 780B tokens that are used only once for training. According to the scaling laws (Sect. 3.5.1), it can be expected that with more data and more model parameters, the already high accuracy of language prediction will increase even more.

The next scope is to mix language with sensory input from other modalities. This, for instance, is necessary to learn the meaning, the visual impression and implications of a painting. A good way to make progress in this direction is by using datasets connecting images with captions. When video content is subtitled and speech or transcribed speech is also available, even more connections can be made between visual impressions, audio, speech and language. A good example for this scope are the OFA and NÜWA models, but they can be improved in many ways.

**Fig. 8.7** World Scopes for Grounding Language. While the first three scopes have been explored to some extent, the remaining two scopes have to be considered in the future [12]

If you need to answer the following question: "Is an orange more like a baseball or more like a banana?", then visual appearance is not enough. Here different features of an orange have to be determined, e.g. weight, mobility, malleability, deformability and taste. This can only be done when manipulating and exploring the orange by hand. Here the next scope is required, where the agent moves and acts in the world and receives various tactile and sensory impressions of self-movement, force, and body position. Only in this way the basic physical properties of the world can be learned from interaction. To make progress in this area, a convergence of Foundation Models and robotics is needed, as initiated by PLATO. Thomason et al. [85] propose to ground language using 3D objects. The current approaches are rather limited.

The final scope is interpersonal communication, which is the central use case of natural language. It is currently not clear, how a computer system can act as an embodied participant in a social context. Dialog models like XiaoIce and LaMDA are a first attempt. These questions are discussed at length by Bisk et al. [12] and are probably more relevant in the distant future.

#### *8.3.3 Fast and Slow Thinking*

Intelligent thinking occurs at different speeds. Daniel Kahneman, Nobel Laureate in Economics, has developed a hypothesis [45] about two different systems of thinking from long studies of human behavior (Fig. 8.8). *System 1* (Fast Thinking) is fast, instinctive, and emotional. Examples include understanding a simple spoken sentence, driving a car on a quiet road, or recognizing an object in a picture. System 1 runs continuously, generating impressions, intuitions, and quick judgments based on our immediate perceptions.

*System 2* (Slow thinking) is slower, more deliberate, and more logical. It is responsible, for example, for remembering a person not seen for a long time, for parking in a narrow parking space, or solving the arithmetic problem 16\*34.

**Fig. 8.8** The properties of the two systems for fast and slow thinking in the human brain according to Kahneman [45]

System 2 is only used, when there are problems with System 1, i.e. it cannot explain the perceptions well.

Corresponding to System 2 in the brain is a *working memory* with limited capacity [32]. It allows to store thought content for a short time and to manipulate it at the same time. It apparently has an important role in problem solving and logical reasoning. The number of information units that can be handled simultaneously is estimated to be between five and seven. Humans are aware of System 2 thought processes, whereas System 1 processing is largely subconscious. System 2 requires the ability to consider an abstraction of the world. This involves focusing on a limited set of features and processing them in depth, while ignoring others [14].

#### *8.3.4 Planning Strategies*

Turing Award winner Yann LeCun [53] argues that current Foundation Models can already process many aspects of the environment similar to System 1. Selfsupervised learning is able to capture speech and language well and transform them into each other. To a lesser extent, images can be analyzed and associated to verbal descriptions. Joint processing of video, speech, and text is promising, but needs further development.

Only recently, Foundation Models were able to perform planning (Sect. 7.4), i.e. the systematic future-oriented consideration of goals, means, and ways to achieve goals in the future. This corresponds to Kahneman's System 2. The Foundation Model basically performs *model predictive control* and simulates the system under consideration for a series of time steps [75]. An example is driving a car on a road. Here the system simultaneously simulates the state of the system (e.g. position and speed of the car), the actions (e.g. steering wheel movements, acceleration) and the reward (e.g. distance to goal, distance from obstacles). The Foundation Model is trained using a set of observed trajectories and can learn the dependency between states, actions and resulting rewards. Subsequently, it is able to predict the next action to reach a specific reward level. Planning with Foundation Models can already include multiple modalities, e.g. perform a control with images as state descriptions.

According to Yann LeCun "the ability to construct models of the world is basically the essence of intelligence" [53]. These models required are not only to predict physical movements, but also human behavior, economic activity, etc. The great challenge of AI in the next decade is how to learn predictive models of the world that can handle uncertainty.

In LeCun's view this does not directly require formal logic based reasoning, which is not compatible with gradients required for efficient learning. Yoshua Bengio says [29], "There are some who believe that there are problems that neural networks just cannot resolve and that we have to resort to the classical AI, symbolic approach. But our work suggests otherwise." It is more probable that reasoning is performed by internal simulation and by analogy. As Geoffrey Hinton puts it: "But my guess is in the end, we'll realize that symbols just exist out there in the

external world, and we do internal operations on big vectors" [39]. It should be noted that newer models such as PaLM, which use chain-of-thought prompts, can reason just as well as average people (Sect. 4.2.3). Language is also not important for the intelligence of animals, it was acquired later in evolution.

LeCun envisions a complex system, where some high-level "configurator" instantiates *world models* for a current problem on the fly and executes mental simulations [96]. He postulates that there is a single world model engine, which is dynamically configurable for the task at hand [96]. In this way, knowledge about how the environment works may be shared across tasks. A key requirement is that the world model must be able to represent and compare multiple possible predictions of the environment. This configurator has the ability to combine different models and to learn complex hierarchical action sequences. In his concept paper, Yann LeCun [96] discusses many details of such a possible system.

The Gato model combining language, images, and control might be a first step into that direction, but it is still in its infancy (Sect. 7.4.2). The SayCan [2] system is an approach that integrates a robot and a Foundation Model to verbally express the robot's skill properties, e.g. "pick up the sponge". Given a real-world task description, SayCan is able to generate a sequence of skill executions to complete the task. In the same way a number of researchers from the reinforcement learning community argue that maximizing total reward may be sufficient to understand intelligence and its associated abilities [79].

Melanie Mitchel agrees with Yann LeCun that current Foundation Models are not powerful enough. "They lack memory and internal models of the world that are actually really important," she says [40]. In principle these models do not need language. But language has a big advantage, it allows to change goals on the fly simply by including some facts or statements, similar to the few-shot technique. Overall, it can be expected that there will be major advances along these development lines in the coming years.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### **Appendix A**

#### **A.1 Sources and Copyright of Images Used in Graphics**

Images without an image source annotation are self-produced graphics. According to the German copyright act we have the following regulations (https://www. publisso.de/open-access-beraten/faqs/urheberrecht-und-wissenschaft/ translated):

"In principle, works protected by copyright may only be used with the consent of the author or the copyright holder. In order to enable the references to other works that are necessary in scientific discourse, an effective barrier has been set by the right of citation (Section 51 UrhG). Here, parts of a work (or, depending on the context, entire works such as a poem) may be used as a quotation without the author's consent and without consultation.

However, the law also sets a narrow framework for this: A takeover is only permissible if it is to be used as evidence for a statement. An uncommented takeover for illustration or decorative purposes is not permitted or requires the consent of the author or rights holder and is subject to remuneration. In addition, the length of the quotation must be proportionate, although there are no limits to the extent. A verbatim transfer must also be made unchanged. In addition, the source must be indicated (Section 63 UrhG). These regulations apply not only to texts, but also to illustrations, graphics, etc., because these are also protected by copyright. The adoption of photos and images can also be covered by the right of citation, but the limits of what is permissible are quickly exceeded here."

The sources of external images are printed in Tables A.1, A.2, A.3, and A.4.

**Table A.1** Source of images used in Chaps. 1–3


(continued)



**Table A.2** Source of images used in Chap. 6

#### **Table A.3** Source of images used in Chap. 7


(continued)

#### **Table A.3** (continued)


#### **Table A.4** Source of images used in Chap. 8


#### **Index**

#### **A**

Abstractive summary, 261 ACE, 202 ACE 2005 data, 212 Action in dynamic system, 366 Activation function, 7, 98 AdaDelta optimizer, 58 AdaGrad optimizer, 58 Adam optimizer, 58 Adapter-Bot, 272 AdapterHub, 140 AdaSpeech 2, 322 Ad hoc retrieval, 228 Affine transformation, 6, 7 AISO, 245 ALBERT, 84, 192 ALBERT-SEN, 191 Aleatoric uncertainty, 62 Alexa Prize Challenge, 289 AlexNet, 13 Algorithmic filtering, 401 ALIGN, 335 Alignment to human preferences, 144 Aloe, 357 AlphaFold, 372 AlphaFold2, 393 Amazon670k dataset, 190, 194 AmbigQA benchmark, 231 AminoBERT, 371, 393 Analogy, 10 ANCE, 237 Artificial Intelligence (AI), 2 ArXiv benchmark, 190, 191, 263, 265 Aspect-based sentiment analysis, 213

Association score, 22 Atari benchmark, 368 ATLOP, 211 Attention, 12, 21 Attention head, 24 AttentionXML, 193 AugZero, 275 Autoencoder, 27, 51 Autoencoder (AE) language model, 19 Automatic speech recognition (ASR), 288, 315 Automation risk, 405 AutoPrompt, 142 Autoregressive language model (AR), 11, 20, 37, 38, 51 AV-ASR, 357

#### **B**

BabelNet data, 198, 199 Back-translation, 111, 253 Bagging, 64 Bag-of-words, 5, 189 BART, 93 Batch normalization, 60 Bayesian neural networks, 62 Beam search, 49 BEIR benchmark, 238 BEiT, 331 BEM, 198 BERT, 19, 21, 190, 202 BERTscore metric, 51 BertViz, 66 Bias, 6, 394 Bidirectional encoder, 28

© The Author(s) 2023 G. Paaß, S. Giesselbach, *Foundation Models for Natural Language Processing*, Artificial Intelligence: Foundations, Theory, and Algorithms, https://doi.org/10.1007/978-3-031-23190-2

Bidirectional LSTM (biLSTM), 12 Bidirectional RNN, 12 BIG-bench benchmark, 167, 388 BigBird, 100, 191, 241, 264 BigScience initiative, 89, 404 BigSSL data, 319 Binary classification, 189 Bing search engine, 68, 247 BioELECTRA, 203 Bits per byte (bpb), 247 BlenderBot 1, 291 BlenderBot 2, 292 BlenderBot 3, 296, 401, 404 BLEU metric, BLEU, 50 BLINK, 204 BLOOM, 89, 404 Bloom filter, 402 BLURB benchmark, 203 BoolQ benchmark, 163 BRIDGE, 121 BRIO, 262 BriVL, 335 ByT5, 259 Byte-pair encoding, 4

#### **C**

Caption-based image retrieval, 336 CAST, 281 Catastrophic forgetting, 123, 136, 137 Categorization of news, 189 Category, 188, 189 Causal self-attention, 38 CB benchmark, 163 C4 dataset, 98 Chain of thought, 142 Character *n*-gram, 4 Chatbot, 288 CheckList procedure, 165, 174, 177 Chinchilla, 90, 358 CIDEr metric, 337 Class, 189 Classification loss, 6 Classifier-free reconstruction, 342 Clickbait, 397 CLIP, 334 CLIP Similarity Score (CLIPSIM), 339 Closed-book QA, 241 Closed Domain QA system, 239 Cloze prompt, 142, 196 Cloze task, 27 CNN/Daily Mail benchmark, 94, 262–264 COCO captions, 334 COCO data, 345

coCondenser, 237 Cogview, 130, 340 CoLAKE, 122 ColBERT, 233 ColTran, 331 Combined SSL, 318 Combiner, 103 Common sense knowledge, 171 Compressive transformer, 102 ConceptNet KB, 276 Conceptual captions data, 334, 337 Conditional random field (CRF), 202 Confirmation bias, 397 Conformer, 316 Conjugate gradient, 59 CoNLL05 benchmark, 215 CoNLL 2003 data, 33 Connectionst temporal classification, 318 ConSec, 200 ContextNet NST, 316 Contextual embedding, 20, 22, 30, 385 Context vector, 12 Contrastive learning, 334 Contrastive loss, 328 Controllable music transformer (CMT), 324 Control trajectories, 391 Convolution layer, 13 Convolutional neural network (CNN), 13 COOT, 354 COPA benchmark, 163 CORA, 258 Coreference resolution, 208 CorefQA, 209 Corpus, 5 Cosine similarity, 236 CoVeR, 356 CQL, 368 Crf2o, 214 Cross-attention, 46 Cross entropy, 133 Cross-modal MLM, 334 CTRL, 271

#### **D**

DALL-E, 339 DALL-E 2, 343 Data2vec, 333 DBpedia benchmark, 116, 190 DeBERTa, 85, 214 DeBERTaV3, 85 DeCEMBERT, 354 Decision transformer, 367 Decoder, 12, 45

#### Index 429

Decoder block, 47 Deep learning, 2, 387 Deep neural network, 1, 7 DeepCTRL, 175 DeepFaceLab, 400 DeepProblog, 175 DeepSpeed toolbox, 54, 59, 89, 90 Dense passage retriever (DPR), 124, 236, 243, 244 DensePhrases, 125 Dense retrieval, 230, 235 Dense retriever, 243 Dependency parsing, 215 Dialog system, 288, 390 DialogBERT, 291 DialoGPT, 291 DiffBot, 116 Differential privacy, 402 Differentially private stochastic gradient descent, 402 Diffusion model, 341 Diffusion process, 341 Dilated causal convolutions, 13 Dirichlet distribution, 63 Discounted cumulative gain, 193 Disentangled attention, 85 Disentangled embeddings, 85 Distant supervision, 216 DistilBERT, 134 Distributional semantics, 8 DNA, 371, 393 DNABERT, 371, 393 Doc2query, 234 DocFormer, 217 DocRED benchmark, 211, 213 Document classification, 5 DrawBench data, 345 D4RL benchmark, 368 Drop connect, 63 Dropout, 60 DropToken, 355 Drug discovery, 118, 119, 392, 393 DSS, 105

#### **E**

E-BERT, 118 EfficientQA benchmark, 231 Electra, 84 ELI5 benchmark, 248, 397 Eliza chatbot, 288 ELMo, 12 Embedding, 2, 9, 384 contextual, 20

contextualized, 20 static, 9 Embedding of passage, 110 Encoder, 12, 45 Encoder block, 24 Entity, 188, 189 Entity linking, 198, 204 Entity mention, 204 EntMask, 206 Entmax transformation, 42 EntQA, 205 Epic-Kitchens-100 data, 353 Epistemic uncertainty, 61 ERNIE-Doc, 191 ERNIE-THU, 118 Escher, 199 ESMFold, 372 ETC-NLG, 272 EURLex-4K benchmark, 190, 194 EWISER, 123, 198 Example, 6 Expert system, 387 Explainable AI, 68 Extended transformer construction (ETC), 101 Extractive summary, 261 Extreme multilabel classification, 189, 192

#### **F**

Facts as experts, 118 Facts2Story, 278 FAISS, 199, 234, 237 Fake news, 284 FastMoE fast mixture-of-experts library, 130 Fast WaveNet, 320 FastSpeech 2, 321 FastSpeech 2s, 322 FastText, 10 FB hybrid, 245 Few-shot learning, 141 Few-shot prompt, 360, 385 Filter kernel, 13 Fine-tuning, 2, 26, 28 FIST, 280 Flamingo, 358 FLAN, 146 Flat named entity recognition, 202 Floating point operations per second (FLOPS), 98 Formal, 273 Forward sum of rewards, 367 Foundation model, 3, 53, 99, 383, 386 Fréchet Inception Distance (FID), 339 Fréchet Video Distance (FVD), 361, 362

Freebase data, 116 Frozen, 337, 358 Fully connected layer (FCL), 7, 11, 24 FUNSD benchmark, 218 Fusion in decoder (FiD), 125, 244

#### **G**

Gap-sentence generation, 95 Gated linear unit (GLU), 98, 316 Gated recurrent unit (GRU), 12 GATO, 368 GauGAN2, 339 Gaussian process, 64 GeDI, 271 GEGLU activation, 98, 132 GEM benchmark, 269 GeneBERT, 371 Generalization error, 60 General language model, 96 Generation with distributional control (GDC), 272 Generative adversarial network (GAN), 269, 338 Generative pre-trained transformer (GPT), 20, 37 Genia Corpus, 203 Genomics, 393 GENRE, 205 GitHub Copilot, 286 GLaM, 130 GLIDE, 341 Global token, 101 Gloss, 198 GlossBERT, 198 GloVe, 10 GLUE data, 32, 81 GODIVA, 363 Gopher, 89, 168, 175, 242 GPT-2, 42, 269 GPT-3, 68, 86, 101, 165, 241, 264, 269, 275, 276, 281, 339 GPT-GNN, 119 GPT-J-6B, 88 GPT-Neo, 88 GPT-NeoX-20B, 88, 404 GRACE, 214 Gradient, 56 Gradient descent, 57 Gradient noise, 128 Graph-BERT, 119 Graph convolutional network, 118 GraphFormers, 119 Graphical Processing Unit (GPU), 27

GraphPlan, 281 Greedy decoding, 48 GRF, 271 Grounding, 335 Grounding language, 298 GSLM, 323 GSM8K benchmark, 285 GSM8K-Python benchmark, 285 GSPMD, 59 Gumbel-softmax distribution, 317 Gumbel-softmax relaxation, 339 GYAFC benchmark, 273

#### **H**

HAHNN, 191 HASOC 2019 tweet dataset, 192 HAT, 253, 264 Hate speech, 192 Hate speech detection, 189 HateXplain dataset, 192 Hidden vector, 7 Homogenization, 406 Homonym, 188, 229 HotpotQA benchmark, 245 HowTo100M data, 355 HumanEval benchmark, 285 HYBRIDER, 243 HyperGrid, 138 Hyperparameter, 99

#### **I**

Image annotation, 335 Image captioning, 335 Imagen, 344 ImageNet benchmark, 13 Imagen video, 363 Image processing, 390 IMDB benchmark, 85, 190 Inception model, 339 Inception score (IS), 339 Independent identically distributed (i.i.d.), 6 Inductive bias, 51 Information extraction (IE), 187, 388 Information retrieval (IR), 228, 389 Initialization of parameters, 58 Innovative, 408 Input embedding, 21 InstructGPT, 143, 270 Integrated gradients, 67 Inverted index, 229 IOB2 format, 201

Index 431

#### **J**

JAX, 56 JFT-3B benchmark, 356 JNLPBA benchmark, 203 Jukebox, 324 Jurassic-1, 89

#### **K**

K-adapter, 123 KeBioLM, 202 KEPLER, 117 Key-to-Door benchmark, 368 Key-vector, 21 Keyword-based retrieval, 229 KGPool, 216 KGPT, 124 Kinetics data, 351 KnowBERT, 117 Knowledge base (KB), 80, 114, 116, 239 Kullback-Leibler (KL) divergence, 63

#### **L**

Label, 189 Label smoothing, 60 LaBSE, 110 LAFITE, 341 Laion data, 334 LAION-5B data, 345 LAMBADA benchmark, 90, 165, 247, 270, 387 LAMB optimizer, 59 LaMDA, 68, 247, 270, 294, 392, 401 Language identification method, 5 Language model (LM), 11, 19 autoencoder, 19 autoregressive, 11, 20 encoder-decoder, 20 masked, 26 Laplace approximation, 63 Latent Dirichlet Allocation, 281 Layer normalization, 24, 60, 98 LayoutLM3, 218 LayoutXLM, 218 Learning few-shot, 141 one-shot, 141 rate, 57 self-supervised, 8 supervised, 8 transfer, 136 unsupervised, 8 zero-shot, 140

Leviated token, 212 Lexical substitution, 197 LibriLight speech data, 323 LibriSpeech benchmark, 316 LightXML, 194 Likelihood, 62 Likert scale, 264 LIME, 66 Linear transformation, 6 Linear transformer, 103 LJ Speech data, 323 Local minimum, 57 Local sentiment aggregation (LSA), 214 Logistic classifier, 6, 11, 190 Log-likelihood, 38 Long Range Arena benchmark, 103, 105 Long short-term memory, 12 Longformer, 101, 265 LoRA, 140 Loss function, 80 L1 (*L*1) regularization, 60 L2 (*L*2) regularization, 60 LUKE, 123, 202

#### **M**

Macaw, 241 Machine learning (ML), 2, 387 Machine translation, 12 unsupervised, 111 MAD-X, 139 MaMa, 215 MAML, 139 MARGE, 111 Margin, 8 Markov Chain Monte Carlo, 62 Masked language model (MLM), 26, 328 Masked multi-head self-attention, 46 Masked region classification (MRC), 203 Masked region prediction, 334 Masked self-attention, 38 MASS, 93 Massive multitask language understanding benchmark, 167 Maximum entropy loss, 6 Maximum likelihood estimation (MLE), 38 Maximum Likelihood principle, 6 Max pooling, 13 mBART, 111 MBPP benchmark, 285 Mean average precision (MAP), 232 Mean opinion score (MOS), 320 Mean reciprocal rank (MRR), 118, 231 Meena, 291

Megatron-CNTRL, 276 Megatron-LM, 88 Mel frequency cepstral coefficient (MFCC), 315 MeMViT, 355 Mention, 198, 201 Merlot, 358 Meta-learning, 138 METEOR metric, METEOR, 50 Mini-batch, 57 Mixture of softmaxes, 42, 98 Mixture-of-experts model (MoE), 98, 111, 129–131 MKQA benchmark, 257 MLE Maximum Likelihood estimation, 38 MLM, 26 MLSum benchmark, 263 M2M, 255 MNEMELM, 280 MobileBERT, 134 Model, 6 Model 1, 234 Model card, 179 Model predictive control, 413 Model quantization, 132 MoDNA, 393 Moments in time data, 351 Momentum optimizer, 58 monoBERT, 232 monoT5, 233 Monte Carlo approximation, 62 MOS Mean Opinion Score, 320 MPG, 119 MS-MARCO data, 230 MS UnitedQA, 245 mT5, 110, 259 mT6, 111 MT-NLG, 90 MTR, 245 MTV, 356 MUDERN, 294 MuLaN, 199 Multiclass classification, 189 Multi-document summarizer, 261 Multi-head cross-attention, 46 Multi-head self-attention, 24 Multi-instance learning, 216 Multilabel classification, 189 Multi-news benchmark, 265 Multi-session chat, 293 Multi-stream architecture, 328 Multilayer perceptron (MLP), 8 Multilayer RNN, 12 Multilingual BERT (mBERT), 108

Multimodal content analysis, 187 Multimodal QA, 239 MultiNLU benchmark, 112 MultiRC benchmark, 163 MuseNet, 323 Music transformer, 324

#### **N**

NÜWA, 348, 362 Naïve realism, 396 Naive Bayes, 190 Named entity (NE), 172, 188, 201 Named entity recognition (NER), 28, 33, 188, 201 Natural language generation (NLG), 187, 266 inference (NLI), 178 processing (NLP), 1 understanding (NLU), 32, 162 Natural Questions (NQ) benchmark, 124, 125, 231, 239, 244 Nearest-neighbor search, 124 Negation, 171 Nested entities, 203 Neural architecture search (NAS), 60 New York Times dataset, 213, 217 Next sentence prediction, 27 *n*-gram, 4 Node2vec, 118 Noise contrastive estimation, 10

#### **O**

Object detection, 336 Omnivore, 355 1-D convolution, 13 1-D depthwise convolution, 316 One for all (OFA), 346 One-shot learning, 141 OntoNotes benchmark, 209, 215 OntoNotes coreference data, 82 Open-book QA, 243 Open domain QA systems, 239 Open domain question answering, 124 Open multilingual WordNet, 198 OpenGPT-X, 404 Open pre-trained transformer (OPT), 89, 296, 404 OPTIMUS, 274 OSCAR, 336 Overfitting, 60, 127, 135 Overlap, 194

Index 433

#### **P**

PaLM, 90, 165, 168, 175, 242, 256, 259, 263 PaLM-Coder, 91 PanGu-*α*, 89 Parabel, 193 Paraphrase, 172, 229 Pattern-exploiting training (PET), 123, 196 PEGASUS, 95, 261 Penn Treebank corpus, 39, 43 Perceiver, 103 Perceiver AR, 104 Perceiver IO, 104 Perceiver resampler, 358 Perceptual loss, 332 Performer, 102 Perplexity, 39 Persona, 291 Pile data, 88, 130, 247, 348 PL-marker, 212 PLATO, 410 Plato-2, 292 PlotMachines, 277 Plug and play language model (PPLM), 271 Pointer, 278 Policy, 366 PoolingFormer, 241 Position embedding, 21 Posterior distribution, 62 Pre-trained language model (PLM), 2, 51 Pre-training, 2, 26 Precision at *k*, 193 Prefix prompts, 142 Primer, 96, 265 Principal component analysis (PCA), 39 Prior distribution, 62 Probabilistic soft logic (PSL), 175 Product key memory, 98 Product keys, 85 ProGen, 278 ProGeT, 278 Prompt, 140, 385 Prompt design, 141 Propaganda, 397 Proteins, 393 Proteomics, 393 Proximal policy optimization, 143 PubLayNet benchmark, 218 PubMedBERT, 203 PubMed corpus, 202, 263, 264 PyTorch, 56

#### **Q**

Q-BERT, 132 Quasi-Newton, 59 Query, 228 Query stream, 84 Query-vector, 21 Question answering, 29, 239, 389 open domain, 124

#### **R**

RAFT benchmark, 195 RAG, 244 Random forest, 190 Random sampling, 41 RankNAS, 61 Reader, 124, 125, 243, 244 Reading comprehension (RACE) data, 81, 239 RealFormer, 241 REALM, 244 REBEL, 213 RECENT, 210 Recommender systems, 118 ReCoRD benchmark, 163 Rectified linear unit (ReLU), 7 Recurrent neural network (RNN), 10 RedCaps data, 334 RefCOCO benchmark, 348 Reformer, 102 Region caption, 354 Regularization method, 60 Reinforcement learning, 143, 366 with human feedback, 143 Relation, 116 Relation extraction, 189, 389 Relation-QA, 210 ReLIE, 217 RemBERT, 110 Replaced token detection, 84 Reproducibilty in NLP, 180 Residual connection, 13, 24, 59 ResNet, 13 Re-TACRED benchmark, 210 Retrieval, 203 Retriever, 124, 243 Retriever-reader architecture, 125 Retro, 175, 246, 270 Reward in dynamic system, 366 Reward model, 144 RL-175B, 264 RMS normalization, 98 RMSProp optimizer, 58 RoBERTa, 81, 192, 210 RocketQA, 237 RoseTTAFold, 372 ROUGE metric, ROUGE, 50 Routing network, 130

Routing transformer, 102 RPT, 121 RTE benchmark, 163 Rule-based, 288

#### **S**

S4, 104, 317 Saddle point, 57 Safety, 398 Saliency, 66 Salient span masking, 95 Sampling approach, 62 Satire, 397 SayCan, 414 Scaled dot-product attention, 22 SCaNN library, 246 Score vector, 6 Search engine, 228 Segatron, 102 Self-attention, 22, 385 Self-similarity of tokens, 40 Self-supervised learning, 384 Self-supervised training, 8 Semantic knowledge, 171 Semantic parsing, 121 SemCor data, 199 SemCor3.0 data, 198, 200 SemEval-13 data, 199 SemEval-15 data, 199 SemEval-20 Task 12 benchmark, 190 SemEval 2014 Task 4.2 benchmark, 214 Senseval-3 dataset, 31 Sensibleness, 398 SentenceBERT, 236 Sentence embeddings, 110 SentencePiece, 5 Sentiment analysis, 189 Sequence-to-sequence model (Seq2seq), 45 SHAP, 66 Simple RNN, 11 SimVLM, 337 Single document summarizer, 261 Single stream architecture, 328 SMAN, 285 SMITH, 235, 241 SNGP, 64 SNLI-VE data, 348 Softmax function, 6 Something-something V2 data, 353 Spacy toolbox, 4 Spam detection, 189 SpanBERT, 81, 208, 210

Span prediction, 29 Sparse transformer, 101, 339 Specificity, 398 Speech recognition, 390 SpeechStew, 319 Spherical embeddings, 10 SQA data, 121 SQuAD 1.0 data, 33, 81 SQuAD 2.0 data, 81, 239 Starspace, 10 State in dynamic system, 366 State of the art (SOTA), 3 Static embedding, 9 Statistical disclosure, 402 Stereotype, 394 STIE, 263 ST-MoE-32B, 132, 165, 179, 239, 263 Stochastic gradient descent (SGD) differentially private, 402 optimizer, 57 Stochastic parrot, 408 Story generation, 390 StrategyQA benchmark, 175 StructBERT, 83 Structural uncertainty, 61 Structured self-attention network (SSAN), 211 Student model, 133 StyleGAN2, 341 StyleLM, 274 StyleSwin, 332 Summarization, 261 Summarize, outline and elaborate (SOE), 279 Summarizer, 261 Summary, 261 abstractive, 261 extractive, 261 SUN RGB-D dataset, 355 SuperGLUE benchmark, 85, 98, 131, 142, 163 Super-resolution, 331 Supervised training, 8 Support vector machine, 8, 190 SwiGLU activation, 98 SwinIR, 331 Swin Transformer, 332, 355 Swish activation, 198, 316 Switch, 98, 111, 131 Symbol grounding, 335, 350 Synonym, 229 Synset, 197, 199 Syntactic knowledge, 170 Synthesizer, 98, 103 System 1 (fast thinking), 412 System 2 (slow thinking), 412

#### **T**

T5, 95 T5-XXL, 179 TaBERT, 120 TableGPT, 121 TACKBP-2010 benchmark, 205 Tacotron 2, 320 TACRED benchmark, 117 Tapas, 121 Task adapter, 139 Teacher forcing, 38 Teacher model, 133 TeKGen, 123 Tensor, 5 TensorFlow, 56 TERA, 319 Test data, 7 Text classification, 28, 188, 189 Text pair classification, 28 Text summarization, 389 Text-to-speech (TTS), 288, 320 Tf-idf statistic, 5 Theme transformer, 324 Thought chain, 69, 175, 242 3D Nearby Attention, 362 Time-series forecasting, 105 TinyBERT, 134 TLDR, 264 TL;DR benchmark, 263 Token, 4, 19 Token embedding, 21 Token switch, 131 Top-*k* accuracy, 231 Top-*k* sampling, 41 Top-*p* sampling, 41 Toxic language, 396 Training corpus, 5 data, 6 self-supervised, 8 supervised, 8 unsupervised, 8 TransD, 117 TransE-loss, 116 TransE model, 116 Transfer learning, 2, 26, 28, 136, 384 Transformer, 20, 45, 51 encoder-decoder, 20, 51 Transformer-LS, 103 Transformer TTS, 320 Transformer-XL, 102 TransH, 117 Translation, 389 Translation language modeling (TLM), 109 TriviaQA benchmark, 124, 126, 239 Truth bias, 396 TruthfulQA benchmark, 240 T-SNE projection, 31 Turing-NLG, 90 Turing test, 268 TURL, 120 TyDiQA-GoldP benchmark, 257

#### **U**

Underspecification, 137 U-Net, 342 Unicoder, 109 Unidirectional encoder, 38 UniLM2, 96 UniRE, 211 Unsupervised data generation (UDG), 146 Unsupervised machine translation, 111 Unsupervised training, 8

#### **V**

Value-vector, 21 Variational auto-encoder (VAE), 269, 338 Variational inference, 63 Variational lower bound, 342 VATT, 354 Vector Quantized Variational AutoEncoder (VQ-VAE), 330, 340 VideoBERT, 353 Video captioning, 353 Video patch, 351 Video processing, 391 Video transformer, 362 VilBERT, 336 VinVL, 336 Virtual assistant, 288 Vision transformer (ViT), 328 Visual entailment task, 348 Visual question answering (VQA), 336 Visual reasoning, 357 Visual token, 334 VIVO, 337 Vocabulary, 5 Vocabulary mismatch problem, 229 Voxel, 354 VQA v2 benchmark, 337 VQ-GAN, 332

#### **W**

WaveGlow, 323 WaveNet, 13, 320 wav2vec 2.0, 317 WebGPT, 68, 247 WebQuestions benchmark, 98 WER word error rate, 315 WiC benchmark, 163 Wikidata knowledge base, 123 Wikidata5M knowledge base, 122 WikiHop benchmark, 241 Wiki65K benchmark, 235 Wikipedia, 198 WikiTableQuestions benchmark, 121 WikiText-103 benchmark, 89, 167 Winograd benchmark, 178 Winogrande benchmark, 179 WIT data, 335 WKLM, 121 WMT2014 En-De benchmark, 98 WMT2019 De-En benchmark, 112 Woebot, 393 Word annotation, 28, 198 Word embedding, 9 Word error rate, 315 WordNet data, 116, 197, 199 WordNet Glos data, 199 Word *n*-gram, 4 WordPiece, 5 Word sense, 171 Word sense disambiguation (WSD), 188, 197 Word2vec, 9

Working memory, 413 World knowledge, 171 World model, 414 WSC benchmark, 163 WuDao-2.0, 130, 348 w2v-BERT, 318

#### **X**

XiaoIce, 290, 401 XLM, 108 XLM-R, 109 XLM-RXXL, 110 XLNet, 84, 191 XLNG, 112 XMC-GAN, 339 XNLI benchmark, 86, 108, 111 XSum benchmark, 98, 263 Xtreme benchmark, 110

#### **Y**

YAGO, 116 Yelp benchmark, 190

#### **Z**

Zero-shot learning, 43, 140