**Methods in Molecular Biology 2453**

# Anton W. Langerak *Editor*

# Immunogenetics

Methods and Protocols

# Methods in Molecular Biology

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Edited by

# Anton W. Langerak

Department of Immunology, Erasmus MC, Rotterdam, The Netherlands

ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-2114-1 ISBN 978-1-0716-2115-8 (eBook) https://doi.org/10.1007/978-1-0716-2115-8

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature.

The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

### Preface

Adaptive immune cells (lymphocytes) are equipped with unique antigen receptors, termed immunoglobulins (IG) and T cell receptors (TR), which collectively form a highly diverse repertoire. In the lymphocytes, IG/TR diversity is actually created at the DNA level, thus giving rise to an enormous adaptive immune receptor repertoire (also known as the immunome) that can be studied in healthy and diseased subjects in the context of research questions and clinical applications. This field of (fundamental and translational) research is known as immunogenetics.

The immunogenetics domain has rapidly evolved in the last ten years or so, mainly through the introduction of high-throughput technologies. With these new technologies, unprecedented insight into the adaptive immune receptor repertoire could be obtained with much more sequencing depth and coverage of the repertoire than ever before. In this volume, many chapters are dedicated to lab protocols, bioinformatics, and immunoinformatics analysis of this high-resolution immunome analysis, exemplified by many different applications. Additionally, the newest technological variations on these protocols are discussed, including non-amplicon, single-cell, and cell-free strategies. Collectively, the chapters illustrate the impact that immunogenetics has achieved and will further expand in all fields of medicine, from infection and (auto)immunity, to vaccination, to lymphoid malignancy and tumor immunity.

As the guest editor of this volume on immunogenetics in the Methods in Molecular Biology book series, I am very pleased with the content and quality of this book. I am grateful to all authors who contributed to the success of this book volume with their valuable and informative chapters that collectively cover a broad spectrum of methodologies for applications in research and clinical diagnostics. I sincerely hope that readers will find the protocols and the method descriptions as useful as I did, for their own laboratory studies. Enjoy reading!

Rotterdam, The Netherlands Anton W. Langerak


### Contributors


SAFA AOUINTI • IMGT®, the international ImMunoGenetics information system®, Laboratoire d'ImmunoGénétique Moléculaire LIGM, Institut de Génétique Humaine (IGH), Centre National de la Recherche Scientifique (CNRS), Université de Montpellier (UM), Montpellier, France; Clinical Research and Epidemiology Unit, CHU Montpellier, Univ Montpellier, Montpellier, France

MARINE ARMAND • AP-HP, Pitié-Salpêtrière Hospital, Laboratory of Hematology, Paris, France; Sorbonne Université, Paris, France


GIOVANNI CAZZANIGA • Centro Ricerca Tettamanti, Fondazione Tettamanti, Centro Maria Letizia Verga, Monza, Italy; Genetics, Department of Medicine and Surgery, University of Milan Bicocca, Monza, Italy

ANASTASIA CHATZIDIMITRIOU • Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece; Department of Molecular Medicine and Surgery, Karolinska Institute, Stockholm, Sweden


Humaine (IGH), Centre National de la Recherche Scientifique (CNRS), Université de Montpellier (UM), Montpellier, France


JAMES PETER STEWART • Patrick G Johnston Centre for Cancer Research, Queen's University Belfast, Belfast, UK

MICHAEL SVATON • CLIP - Childhood Leukaemia Investigation Prague, Department of Paediatric Haematology and Oncology, Second Faculty of Medicine, Charles University and University Hospital Motol, Prague, Czech Republic

FLORIAN THONIER • Inria, Rennes, France


DIEDE A. G. VAN BLADEL • Department of Pathology, Radboud University Medical Center, Nijmegen, The Netherlands; Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands

MANDY VAN BRAKEL • Laboratory of Tumor Immunology, Department of Medical Oncology, Erasmus MC-Cancer Institute, Rotterdam, The Netherlands

MIRJAM VAN DER BURG • Department of Pediatrics, Laboratory for Pediatric Immunology, Willem-Alexander Children's Hospital, Leiden University Medical Center, Leiden, The Netherlands


# The Advent of Precision Immunology: Immunogenetics at the Center of Immune Cell Analysis in Health and Disease

Anton W. Langerak

### Abstract

Adaptive immune cells (i.e., lymphocytes of the B and T lineage) are equipped with unique antigen receptors, which collectively form a highly diverse repertoire. Within the lymphocytes, the antigen receptor diversity is created at the DNA level through recombination processes in the immunoglobulin (IG) and T cell receptor (TR) genes that encode these receptors. This gives rise to an enormous immune repertoire (a.k.a. the "immunome") that can be studied in health and disease, both in a scientific and clinical context. In fact, the inherent distinctiveness of the IG/TR rearrangements on a per cell basis allows their usage as unique DNA fingerprints, which enables precision medicine, or for that matter "precision immunology." The field of (fundamental and translational) research on IG/TR repertoire diversity is the topic of the Immunogenetics volume in the Methods in Molecular Biology series.

Key words Immunoglobulin, T cell receptor, Immunogenetics, Immunome, Precision immunology

### 1 Introduction

Our current understanding of the diversity of antigen receptors started with the publication on "Somatic generation of antibody diversity" by Susumu Tonegawa in 1983 [1], which resulted in the Nobel Prize in Physiology or Medicine for the author in 1987. In this seminal publication, Tonegawa introduced the concept of genetic recombination mechanisms of V (variable), D (diversity), and J (joining) genes in the loci encoding the immunoglobulin (IG) chains, which, as was subsequently discovered, also applies to the T cell receptor (TR) loci. These recombinations lead to an enormous repertoire diversity of B and T lymphocytes, referred to as the "immunome." The research into the genetics of the immune cell repertoire has been termed "immunogenetics." Besides IG/TR gene diversity, the field of immunogenetics formally also includes diversity in the human leukocyte antigens (HLA), but this is largely beyond the scope of the current Immunogenetics volume in the Methods in Molecular Biology series.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_1, © The Author(s) 2022

### 2 Immunogenetics in the Hematology-Immunology Domain

B and T lymphocyte populations and their respective IG/TR repertoires are mostly studied in the context of immune diseases (autoimmune diseases, allergies, immune deficiencies) and immune responses (infections, inflammation, vaccinology, cancer), but also frequently in the context of hematological malignancies of immune cells (leukemias and lymphomas).

Irrespective of the application, it is important, when evaluating IG and TR repertoire diversity in B and T cell populations, to consider the repertoire data as being part of a spectrum ranging from broadly diverse (polyclonal), to restricted (oligoclonal), to dominant (clonal ± poly/oligoclonal background) (Fig. 1). This spectrum reflects the minimal to moderate to dominant outgrowth of B or T lymphocytes of a particular specificity, which are selected based on their antigen reactivity.

Immunogenetic analysis can provide in-depth insight into the diversity of immune cells and immune responses in the context of different research questions. Additionally, the diversity or clonality of the immune repertoire can also help to address clinical and diagnostic questions. In the hematological domain, this relates to the distinction between reactive lymphoproliferations (poly- to oligoclonality) and malignantly transformed lymphocytes (clonality) or to detection of minimal residual disease of a clone upon treatment (weak clonality in background). In other areas of medicine, immunogenetic analysis can shed light on proper or defective immune responses in infected and vaccinated individuals and/or can help to distinguish between disease entities (e.g., in due time for particular autoimmune IG/TR profiles).

Fig. 1 Spectrum of IG/TR immune repertoire diversity, ranging from diverse (polyclonal) to highly restricted (clonal), which can be disclosed using high-throughput sequencing technologies. (Adapted from Langerak, J Immunol 2017;198:3765 [2])

### 3 Immunogenetics Methods

Historically, immunogenetic analysis has been performed using low-resolution methodologies, such as Southern blot analysis, fragment analysis or spectratyping, and Sanger sequencing of cloned, rearranged IG/TR genes [3]. Even though these approaches enabled us to grasp the diversity of antigen receptors to some extent, they suffered from limitations in completely disclosing the depth and breadth of the IG/TR immune repertoire. The introduction of high-throughput technologies some 15 years ago allowed for higher-resolution immune repertoire analysis via massively parallel sequencing (Fig. 2). These next-generation sequencing methods have the advantage that thousands to millions of IG/TR rearrangement sequences can be analyzed in parallel, thus approximating the true IG/TR repertoire diversity much more closely. A further development has been the introduction of single-cell sequencing technologies (Fig. 2), allowing paired analysis of different IG or TR chains at the single-cell level and the combination of immune repertoire analysis with RNA sequencing-based cell characteristics (e.g., naïve vs. memory, activated or exhausted cells).

Fig. 2 Graphical representation of different sequencing approaches for IG/TR repertoire analysis. By means of traditional (Sanger) bulk sequencing, only the dominant immune repertoire (in green) can be identified over the background (grey), which strongly contrasts with the high-resolution output of many individual IG/TR rearrangements (represented by the different colors) through massively parallel sequencing. The additional advantage of single-cell sequencing technologies is that the high-resolution IG/TR repertoire analysis can be traced back to individual cells, which allows evaluation of paired IG or TR chains at the single-cell level and/or combination of immune repertoire and differentiation or maturation stage features

### 4 (Pre- and Post-)Analytical Aspects of Immunogenetics

As with any experimental method, immune repertoire analysis entails pre-analytical, analytical, and post-analytical phases. For immune repertoire studies, the pre-analytical considerations center on the choice of sample type, nucleic acid type, IG/TR targets, etc., whereas the analytical phase relates to the pros and cons of the applied method (next-generation sequencing, quantitative PCR, droplet digital PCR). Finally, the post-analytical phase involves the readouts and tools for data analysis, but also the immunoinformatics to accurately annotate the IG/TR sequences and the bioinformatic pipelines and platforms that allow sophisticated analysis of the IG/TR data and all of their characteristic features (gene usage, CDR3, somatic mutations, clustering, and clonal evolution and competition).

In this volume of the Methods in Molecular Biology series, all of the above aspects of the pre-analytical, analytical, and post-analytical phases of IG/TR repertoire analysis are addressed in different methodological chapters that together cover a spectrum of technologies, ranging from quantitative and droplet digital PCR approaches to various NGS methodologies such as amplicon-based, capture-based, and single-cell NGS. Additionally, bioinformatic approaches are discussed that allow for extraction of IG/TR repertoire sequences from -omics data sets, i.e., RNA sequencing, whole genome sequencing, and whole exome sequencing. Finally, several novel approaches in the immunogenetic domain are covered, concerning cell-free IG/TR analysis, analysis of germline areas of the TR loci, analysis of aberrantly rearranged IG genes leading to IG translocations, and engineering of TR sequences in view of adoptive therapy.

### 5 Immunogenetics at the Basis of Precision Immunology

Collectively, the chapters in this volume are a perfect illustration of the central position that immunogenetics has obtained in the hematology-immunology domain in both health and disease [2]. Immunogenetic profiles constitute physiological and pathophysiological signatures of cell populations, thereby allowing a more personalized approach in terms of immune responsiveness, diagnostics and classification, and even therapeutic choices [4]. This form of precision medicine involving immunogenetics could therefore best be referred to as "precision immunology" (Fig. 3). The future of immunogenetics is bright!

Fig. 3 Precision immunology through immunogenetic analysis. Characteristic IG/TR CDR3 profiles allow identification of individual patients. These profiles can help to define immune responsiveness, to support diagnosis and/or subclassification, or even to guide therapeutic choices

### References


Design and standardization of PCR primers and protocols for detection of clonal immunoglobulin and T-cell receptor gene recombinations in suspect lymphoproliferations: report of the BIOMED-2 concerted action BMH4-CT98-3936. Leukemia 17:2257–2317

4. Arnaout RA, Prak ETL, Schwab N, Rubelt F, Adaptive Immune Receptor Repertoire Community (2021) The future of blood testing is the immunome. Front Immunol 12:626793

# Next-Generation Sequencing-Based Clonality Detection of Immunoglobulin Gene Rearrangements in B-Cell Lymphoma

### Diede A. G. van Bladel, Jessica L. M. van der Last-Kempkes, Blanca Scheijen, and Patricia J. T. A. Groenen, on behalf of the EuroClonality Consortium

### Abstract

Immunoglobulin (IG) clonality assessment is a widely used supplementary test for the diagnosis of suspected lymphoid malignancies. The specific rearrangements of the IG heavy and light chain genes act as a unique hallmark of a B-cell lymphoma, a feature that is used in clonality assessment. The widely used BIOMED-2/EuroClonality IG clonality assay, visualized by GeneScanning or heteroduplex analysis, has an unprecedentedly high detection rate because of the complementarity of this approach. However, the BIOMED-2/EuroClonality clonality assays have been developed for the assessment of specimens with optimal DNA quality. Further improvements are required for the assessment of samples with suboptimal DNA quality, such as formalin-fixed paraffin-embedded (FFPE) specimens, or of specimens with a limited tumor burden. The EuroClonality-NGS Working Group recently developed a next-generation sequencing (NGS)-based clonality assay for the detection of the IG heavy and kappa light chain rearrangements, using the same complementary approach as in the conventional assay. By employing next-generation sequencing, both the sensitivity and specificity of the clonality assay have increased, which is not only very useful for diagnostic clonality testing but also allows robust comparison of clonality patterns in a patient with multiple lymphomas that have suboptimal DNA quality. Here, we describe the protocols for IG-NGS clonality assessment that are compatible with the Ion Torrent and Illumina sequencing platforms, including pre-analytical DNA isolation, the analytical phase, and the post-analytical data analysis.

Key words Clonality analysis, Next-generation sequencing, B-cell lymphoma, Immunoglobulin gene rearrangements, ARResT/Interrogate

### 1 Introduction

Clonality assessment of the immunoglobulin (IG) or T-cell receptor genes is a useful supplementary tool for the diagnosis of B-cell and T-cell lymphoid malignancies. A defining feature of cancer cells is that they originate from a single transformed cell. The malignant cells of a B-cell lymphoma all have the same rearranged IG DNA sequences encoding a unique antigen-receptor molecule, also called the B-cell receptor (BCR). Clonality assessment makes use of this feature. In patients suspected of having a B-cell lymphoma, clonality assessment enables demonstration of an expansion of clonally related B cells, all having the identical molecular footprint of the antigen receptor encoded by the IG genes.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_2, © The Author(s) 2022

### 1.1 Immunoglobulin Gene Rearrangements

The BCR consists of two IG heavy chains (IGH) and two light chains, IG kappa (IGK) or IG lambda (IGL), with unique nucleotide sequences at the antigen binding region that are generated during lymphoid development. The proper assembly of a functional BCR is controlled by several checkpoints at different stages of B-cell development [1–3]. Once a mature B cell has encountered an antigen, it will undergo somatic hypermutation (SHM) in the germinal center. During this process, which is mediated by the enzyme activation-induced cytidine deaminase (AID), random sequence alterations (mostly point mutations, but deletions or insertions can occur as well) are introduced to improve antigen binding, a phenomenon called affinity maturation [2, 3].

The BCR is generated by a stepwise process involving rearrangements of the different germline variable (V), diversity (D), and joining (J) IG genes, called V(D)J recombination. This process is initiated by the recombination-activating gene (RAG) products RAG1 and RAG2 [4, 5], which rely on the recognition of recombination signal sequences (RSSs) flanking the individual genes. V(D)J recombination starts with the IG heavy chain, by the recombination of one of the D genes with one of the J genes, followed by the subsequent joining of one of the V genes to the rearranged DJ gene (Fig. 1). This random recombination of V, D, and J genes generates the so-called combinatorial diversity. Imprecise joining of the genes by the activation of exonucleases, as well as the addition of non-template DNA nucleotides by the enzyme terminal deoxynucleotidyl transferase (TdT), results in junctional diversity, on top of the combinatorial diversity. As a consequence of the combinatorial and junctional diversity, only about one out of three VDJ rearrangements retains the reading frame and will be able to express a functional BCR. This high frequency of out-of-frame rearrangements may explain why many B lymphocytes have rearranged both their IGH genes, so-called biallelic IGH gene rearrangements. Lymphomas with biallelic gene rearrangements occur frequently, whereas lymphomas that are truly bi-clonal are rare [7].
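The arithmetic behind combinatorial diversity and the "one out of three" reading-frame argument can be illustrated with a minimal Python sketch. The gene counts and the trimming and addition ranges below are illustrative placeholders, not values taken from this chapter, and the model assumes that every combination of trimmed and added nucleotides is equally likely and that only the reading frame (not stop codons) determines productivity.

```python
from itertools import product


def combinatorial_diversity(n_v, n_d, n_j):
    """Number of distinct V-D-J gene combinations for one locus."""
    return n_v * n_d * n_j


def in_frame_fraction(max_trimmed, max_added):
    """Fraction of junctions whose net length change is a multiple of three.

    Assumes each (trimmed, added) pair is equally likely, which is a
    simplification of exonuclease trimming and TdT nucleotide addition.
    """
    outcomes = [
        (added - trimmed) % 3 == 0
        for trimmed, added in product(range(max_trimmed + 1), range(max_added + 1))
    ]
    return sum(outcomes) / len(outcomes)


# Placeholder gene counts (hypothetical, for illustration only).
print("V-D-J combinations:", combinatorial_diversity(40, 20, 6))
# Placeholder ranges: 0-8 nucleotides trimmed, 0-8 nucleotides added.
print("in-frame fraction:", round(in_frame_fraction(8, 8), 3))
```

Under these assumptions the in-frame fraction comes out at one third, because the net length change is spread evenly over the three possible reading frames.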

For the light chain (IGK or IGL), a direct V to J gene rearrangement takes place, where the IGK locus will first undergo gene rearrangement. When there is no productive IGKV-IGKJ rearrangement, additional rearrangements will occur that inactivate the IGK locus by removal of the IGKC region and the enhancers.

Fig. 1 Detection of V(D)J gene rearrangements at the immunoglobulin heavy chain locus. After a functional DJ rearrangement has been generated, a V gene is joined to this DJ fragment. Each B cell may generate one (productive rearrangement) or two (an unproductive and productive rearrangement) specific clonotypes that consist of one IGHV, IGHD, and IGHJ gene segment. The locations of the primers used for IG-NGS clonality assessment are indicated by arrows. For detection of IGHV-IGHD-IGHJ gene rearrangements, the forward primers are located in framework region 3 (VH FR3) and are combined with IGHJ reverse primers. The detection of unproductive, incomplete IGHD-IGHJ rearrangements makes use of forward IGHD primers (located 5′ of the IGHD genes) and reverse IGHJ primers, hence enabling detection of incompletely rearranged IGHD-IGHJ joining. Once an IGHV gene is recombined to the IGHD-IGHJ segment, the IGHD primer binding site will be removed. Successful amplification will result in DNA fragments that cover the junctional region with a specific amino acid length. Figure adapted from Scheijen et al., 2019 [6]

These rearrangements involve the kappa deleting element (KDE) sequence, which can rearrange either to one of the kappa V genes, thereby deleting the initial IGKV-IGKJ rearrangement and resulting in an IGKV-KDE rearrangement, or to an isolated recombination signal sequence (RSS) located in the J kappa-C kappa intron (intron RSS), resulting in an Intron RSS-KDE rearrangement [8] (Fig. 2). If there is no proper IGK rearrangement, the IGL genes will rearrange. Theoretically, all mature B-cell malignancies should possess IGK rearrangements, regardless of the light chain expression [9]. Based on the number of functional genes, the estimated number of unique BCRs generated by combinatorial diversity of both the heavy and light chain is 4.6 × 10<sup>6</sup> [10]. However, the actual number of unique receptors is lower, since not all genes are used at the same frequency, and not every heavy and light chain can pair to form a functional BCR.

The junctional diversity further increases BCR diversity by a factor 10.

B cells that have assembled a functional BCR will further diversify by undergoing SHM to extend the IG repertoire upon antigen recognition within the germinal center of a lymph node [2, 11]. When B cells fail or become autoreactive during this process, they will be silenced and eliminated [1, 3].

Fig. 2 IGK rearrangements involving the Kappa deleting element. IGK gene rearrangement starts with an initial IGKV-IGKJ recombination. If this results in a productive rearrangement, no subsequent recombination events will occur within the IGK locus. However, an unproductive IGKV-IGKJ rearrangement may lead to inactivation of the IGK locus through rearrangements with the Kappa deleting element (KDE) sequence. The first option is an Intron RSS-KDE recombination on the same allele (Allele A). The initially formed unproductive IGKV-IGKJ segment remains present on that allele. Both the unproductive IGKV-IGKJ and Intron RSS-KDE rearrangements are detectable with clonality analysis. The second option involves a recombination of an upstream IGKV gene with the KDE sequence, thereby deleting the preexisting unproductive IGKV-IGKJ rearrangement on that allele (Allele B). Potentially, up to four distinct IGK rearrangements can coexist in one B-cell clone. The locations of the primers used for IG-NGS clonality assessment are indicated by arrows. Figure adapted from Scheijen et al., 2019 [6]

### 1.2 Clonality Detection in B-Cell Lymphoma Based on BIOMED-2/EuroClonality Assays

Clonality assessment by detecting IG gene rearrangements is widely used for diagnostics, and multiple assays have been developed over the years, which differ in the level of sensitivity [12]. The current gold standard is the set of PCR-based BIOMED-2/EuroClonality assays, visualized with either GeneScan fragment analysis or heteroduplex analysis [13, 14]. In these assays, standardized PCR protocols are used that cover IGH and IGK gene rearrangements. These include complete IGHV-IGHD-IGHJ rearrangements but also incomplete IGHD-IGHJ rearrangements, which are not affected by somatic hypermutation. For IGK gene rearrangements, not only IGKV-IGKJ rearrangements are included but also rearrangements involving KDE, which are not affected by somatic hypermutation either. Notably, these occur on one or both alleles in virtually all IgLambda-positive B-cell malignancies and in one-third of the IgKappa-positive B-cell malignancies. The primers and protocols of the BIOMED-2/EuroClonality PCR assays allow detection of virtually all clonal B-cell proliferations, and the primer design has been based on family primers and consensus primers relevant for the IG genes. A clonal cell population gives rise to one or two dominant PCR products of a given size on GeneScan. A polyclonal cell population will result in a range of differently sized PCR fragments, corresponding to the presence of different V(D)J gene rearrangements showing a Gaussian distribution with respect to the number of inserted or deleted nucleotides in the junctional region.

The BIOMED-2/EuroClonality assays are used worldwide and have resulted in increased clonality detection of lymphoid malignancies [15, 16]. However, there are still some drawbacks that could potentially yield (mainly) false-negative results. The BIOMED-2/EuroClonality assays have been designed for high-quality DNA samples generating amplicons in the range of 150–400 bp. However, formalin-fixed paraffin-embedded (FFPE) tissue specimens, which are mostly used in a diagnostic setting, may yield DNA samples of inferior quality. Clonal rearrangements that correspond to relatively longer amplicons may therefore potentially be missed [13, 15, 17]. Furthermore, detection of minor clones in a background of nonmalignant B cells is highly dependent on the position of the clonal product within the Gaussian curve of the polyclonal background, where it can be difficult or even impossible to detect these minor clones.
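Why the position of a minor clone within the Gaussian background matters can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions: the fragment-size range, mean, standard deviation, and signal values are hypothetical and do not correspond to any real GeneScan profile or interpretation guideline.

```python
import math


def polyclonal_background(sizes, mean, sd, total_signal):
    """Spread total_signal over fragment sizes following a Gaussian shape."""
    weights = [math.exp(-0.5 * ((s - mean) / sd) ** 2) for s in sizes]
    norm = sum(weights)
    return {s: total_signal * w / norm for s, w in zip(sizes, weights)}


def peak_to_background(size, clone_signal, background):
    """Ratio of the signal at one size with and without the clonal product."""
    return (background[size] + clone_signal) / background[size]


# Hypothetical profile: fragment sizes 100-160 bp, background centred at 130 bp.
sizes = range(100, 161)
background = polyclonal_background(sizes, mean=130, sd=8, total_signal=10000)

# The same minor clone (hypothetical signal of 200 units) placed at two positions.
center = peak_to_background(130, 200, background)
tail = peak_to_background(155, 200, background)
print(f"clone at centre of curve: {center:.2f}x background")
print(f"clone in tail of curve:   {tail:.2f}x background")
```

In this toy model the identical clone barely rises above the background at the centre of the curve but stands out clearly in the tail, which mirrors the position dependence described above.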

### 1.3 NGS-Based Clonality Detection in B-Cell Lymphomas

To further improve the application potential of clonality assessment, the EuroClonality-NGS Working Group has developed a novel next-generation sequencing (NGS)-based clonality assay for detection of IG gene rearrangements (IG-NGS) [6], together with the bioinformatics tool ARResT/Interrogate [18]. New primers were designed for the incomplete and complete IGH gene rearrangements, the complete IGK rearrangements as well as for the IGK rearrangements involving KDE, again making use of the complementary approach that is one of the strengths of the conventional BIOMED-2/EuroClonality assays. The primer design for the NGS-based clonality assay is based on gene-specific primers for the relevant genes and, importantly, on the generation of shorter amplicons, which makes it more suitable for clonality detection in samples of inferior DNA quality. Furthermore, the IG-NGS assay immediately provides the nucleotide sequences of the identified clonotypes from both the malignant clone and the nonmalignant background B cells. Using this sequence information, reliable detection of minor clones is possible, resulting in a high sensitivity of the clonality analysis as recently described by Scheijen et al. [6]. Clonal rearrangements of lymphomas with a high tumor load can still be traced back when diluted to a concentration of 5% and 2.5% in a polyclonal background of tonsil DNA. Detection at the 2.5% level is not possible with the conventional assay combined with GeneScanning or heteroduplex analysis, because the clonal product will be blurred by the polyclonal background [13]. Furthermore, the sequence information, the design for suboptimal DNA specimens, and the sensitivity are extremely valuable for comparison of sequential lesions or multiple lymphomas at different locations in a single patient.
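The idea that sequence information makes a diluted clone traceable can be sketched as simple clonotype counting. This is a minimal illustration, not the ARResT/Interrogate algorithm: the function name, the junction sequences, and the read numbers are all hypothetical, and real data would additionally need error correction and gene annotation.

```python
from collections import Counter
from itertools import product


def clonotype_frequencies(junctions):
    """Return (sequence, read count, fraction of all reads), most abundant first."""
    counts = Counter(junctions)
    total = sum(counts.values())
    return [(seq, n, n / total) for seq, n in counts.most_common()]


# Synthetic example: one clonal junction at 2.5% of reads in a diverse background.
clone = "TGTGCGAGAGATTACTGG"  # made-up junction sequence
# 975 distinct placeholder sequences standing in for polyclonal background reads.
background = ["".join(p) for p in product("ACGT", repeat=5)][:975]
reads = background + [clone] * 25

top_seq, top_reads, top_fraction = clonotype_frequencies(reads)[0]
print(top_seq, top_reads, f"{top_fraction:.1%}")
```

Because every read carries its own sequence, the 25 identical reads are grouped into one clonotype regardless of amplicon length, whereas in a size-based readout they could be hidden among background fragments of similar size.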

### 1.4 Different NGS Platforms for Clonality Testing: Ion Torrent Versus Illumina

Similar to the BIOMED-2 approach, the IG-NGS clonality assay is based on a multiplex PCR to amplify the target regions, followed by the addition of adapters for sequencing. The targets detected in the NGS clonality assay include IGH (IGHV-IGHD-IGHJ and IGHD-IGHJ) and IGK (IGKV-IGKJ, IGKV-KDE, and Intron RSS-KDE) gene rearrangements. After purification of the PCR products, the library preparation is performed, followed by sequencing on Ion Torrent or Illumina platforms (Fig. 3).

The initial IG-NGS workflow described the protocols for the Ion Torrent platform [6], a technique that makes use of electrochemical detection of hydrogen ions that are released during DNA synthesis [19]. The Illumina platform is also widely used for NGS in diagnostic laboratories, and both platforms are very suitable for high-throughput NGS-based molecular assays. In contrast to Ion Torrent-based sequencing, Illumina employs fluorescently labeled nucleotides that are incorporated during complementary DNA strand synthesis [20]. Depending on the type of Illumina sequencer, this can be a 2-channel (e.g., MiniSeq, NextSeq, NovaSeq) or 4-channel chemistry (e.g., MiSeq, HiSeq).

The Ion Torrent and the Illumina sequencing technologies require specific adapters for sequencing and barcodes for sample identification. In the workflow that was developed for Ion Torrent sequencing, the adapters and barcodes are ligated to the amplicons (adapter ligation protocol). For Illumina sequencing platforms, the sequencing adapters need to be incorporated in the amplicon primers. Recently, the EuroClonality-NGS Working Group described a two-step PCR assay for minimal residual disease (MRD) target identification using an Illumina-compatible workflow [21]. With this approach, the barcoded adapter sequences are incorporated in the second PCR of the two-step PCR assay with universal barcoded M13-tailed primers. The workflow for clonality detection using the Illumina sequencing platform that will be described in this chapter is based on this previously described two-step PCR protocol [21], with some minor modifications in the first PCR reaction and purification steps of the amplicons as well as the PCR conditions of the Ion Torrent protocol (Table 1) (see Note 1).

Fig. 3 Schematic workflow for IG-NGS clonality assay. A multiplex PCR is performed on extracted DNA of specimens suspect for lymphoproliferations to amplify IGHV-IGHD-IGHJ, IGHD-IGHJ, IGKV-IGKJ, and IGKV/Intron RSS-KDE gene rearrangements. Library preparation for sequencing on Ion Torrent (left panel) or Illumina (right panel) is shown. The Ion Torrent library preparation is an adapter ligation protocol, requiring end repair of the obtained amplicons and the ligation of barcode and adapters to them and nick repair, followed by a final library amplification step. Library preparation for Illumina is a two-step PCR protocol in which the target-specific amplicons are generated using primers containing an M13 adapter, which is used in the second PCR to add specific barcodes to them. Obtained sequencing data is analyzed using the bioinformatics tool ARResT/Interrogate

### Table 1

PCR conditions for target amplification comparing different EuroClonality NGS protocols. The components of the PCR mixes and programs are shown for the Ion Torrent protocol for clonality detection, the first PCR step of the two-step Illumina protocol for clonality detection as described in this paper, and the previously published two-step Illumina protocol for marker identification [21]. Primer sequences and final concentrations are provided in Tables 2, 3, and 4



Illumina protocol for clonality detection and the Ion Torrent protocol use 3 IGHJ reverse primers for both the IGH-FR3 and IGHD reaction

<sup>c</sup> AmpliTaq Gold DNA polymerase

<sup>d</sup> Adjust the total reaction volume with MQ

In the subsequent paragraphs, we present a complete overview of the different steps of IG-NGS clonality analysis in suspected B-cell malignancies that are compatible with either Ion Torrent or Illumina sequencing platforms. For complete IGH rearrangements, this NGS approach uses framework-3 (FR3) primers, in contrast to the BIOMED-2/EuroClonality assay, which employs additional FR1 and FR2 primers that generate larger-sized products. Data analysis with ARResT/Interrogate and the technical interpretation and reporting of the obtained results will be addressed. It is of utmost importance that molecular clonality results are eventually interpreted in the context of available clinical, morphological, and immunophenotypic data. Detailed knowledge of the immunobiology of IG gene rearrangements is also mandatory to be able to correctly interpret the different molecular patterns.

### 2 Materials

2.1 General Materials and Equipment


### 2.2 DNA Isolation

1. Xylene (molecular biology quality grade).

2. TET lysis buffer: 10 mM Tris/HCl pH 8.5, 1 mM EDTA pH 8.0, 0.01% Tween-20.

### 2.3 IG-NGS Clonality Assays

2.3.1 Target Amplification and Purifications

1. dNTPs.

2.3.2 Ion Torrent Library Preparation and Sequencing

2.3.3 Illumina Library Preparation and Sequencing

	- 3. FastStart High Fidelity PCR system, dNTPack (Roche).
	- 4. Sequencing equipment and associated Illumina Reagent Kit (e.g., MiniSeq sequencer and MiniSeq Mid Output Kit).

### 3 Methods

### 3.1 Samples and Quality Controls

IG-NGS clonality analysis can be performed on DNA extracted from any preserved human lymphoid tissue. However, each sample type requires a specific extraction procedure for DNA isolation. We here describe DNA extraction methods for formalin-fixed paraffin-embedded (FFPE) and fresh frozen tissue using the Chelex method (FFPE), TSE (fresh frozen), and column-based extraction procedure of QIAGEN; equivalent isolation systems are also possible (see Note 2).

### Table 2

Primers included in the multiplex PCR reaction for NGS-based clonality assessment: Tube IGHV-FR3

\* For the Ion Torrent protocol, primers without the M13 sequence are used

\* For the Ion Torrent protocol, primers without the M13 sequence are used

To perform reliable clonality assessment, it is important to determine whether a representative tissue section is used and whether the obtained DNA is of sufficient quality (see Note 4), and to use a standardized DNA input per PCR. Furthermore, robust performance of the multiplex PCR reaction should be assessed by including control samples, such as a polyclonal control sample (e.g., tonsil or mononuclear peripheral blood cells) and a negative control (water), while preparing the samples for IG-NGS clonality assessment (see Note 5).

### 3.2 DNA Isolation

For isolation of genomic DNA from FFPE tissue, different methods are available. Two such protocols are described here: a commercially available DNA isolation kit (QIAGEN) and the Chelex method. Both protocols use a microcolumn purification of the extracted DNA; this is an important step in preparing DNA samples for clonality assays and is described in Subheading 3.2.3. Finally, a protocol for isolation of genomic DNA from fresh frozen tissue is described.

### Table 4

Primers included in the multiplex PCR reaction for NGS-based clonality assessment: Tube IGK

\* For the Ion Torrent protocol, primers without the M13 sequence are used


7. Carefully remove the ethanol.




3.2.2 DNA Extraction from Formalin-Fixed Paraffin-Embedded (FFPE) Tissue Starting with the Chelex Method

This Chelex-based DNA extraction protocol was developed as a common workflow that is suitable for the majority of the molecular tests used in diagnostics. However, for clonality assessment, it is important to purify the DNA obtained with this protocol before use in the clonality assay in order to obtain good quality results.

All steps are performed at room temperature, unless specified otherwise.


3.2.3 DNA Purification with QIAamp DNA Microcolumn

	- 1. Carefully transfer the entire lysate to the QIAamp MinElute column (in a 2 ml collection tube) and centrifuge at 6000 × g for 1 min (see Note 10).

All steps are performed at room temperature, unless specified otherwise.


3.2.4 DNA Extraction from Fresh Frozen Tissue: TSE Method


3.3 Ion Torrent Protocol for IG-NGS Clonality Assessment

3.3.1 Multiplex PCR for Amplification of IGH-FR3, IGHD, and IGK

3.3.2 Cleanup of IGH-FR3, IGHD, and IGK Amplicons


### Table 6 Protocol of the end repair reaction for Ion Torrent


3.3.3 End Repair of Amplicons

### Table 7 Protocol of the adapter ligation/nick repair reaction for Ion Torrent


### Table 8

### Adapter ligation program for Ion Torrent




### Table 9 Composition of the library amplification reaction for Ion Torrent

### Table 10 Library amplification PCR program for Ion Torrent

3.3.6 Ion Torrent Sequencing Run



3.4 Illumina Protocol for IG-NGS Clonality Assessment

This two-step PCR protocol is based on a previously published protocol for marker identification for MRD [21], with some modifications for the first PCR reaction (Table 1). Furthermore, the protocol described below is optimized for sequencing on a MiniSeq (Illumina), but other equipment may be used according to the instructions of the local Sequence Facility.

3.4.1 Multiplex PCR for Amplification of IGH-FR3, IGHD, and IGK

3.4.2 Cleanup of IGH-FR3, IGHD, and IGK Amplicons

	- 2. Pipette the pooled samples in a DNA LoBind plate and add 1.8 volumes (135 μl) of Agencourt AMPure XP magnetic beads per sample (see Note 19).
	- 3. Use a pipette to mix the solution thoroughly (avoid air bubbles), until the beads and sample are homogeneously mixed and incubate for 5 min at room temperature.
	- 4. Place the samples for 2–5 min in a magnetic stand until the solution is clear (see Note 20).
	- 5. Carefully remove the supernatant using a 200 μl pipette (see Note 21).
	- 6. Add 150 μl freshly made 70% ethanol per sample (see Note 22).
	- 7. Move the plate in the magnetic stand approximately 4 times from left to right, and make sure the bead pellet migrates and is washed clean.
	- 8. Carefully remove the supernatant using a 200 μl pipette (see Note 21).
	- 9. Repeat steps 6–8 once.
	- 10. Carefully remove any remaining supernatant using a 10 μl pipette, and air-dry the beads for 5 min to allow complete evaporation of residual ethanol (see Note 23).
	- 11. Resuspend the samples in 25 μl Low TE-buffer.
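The bead volume in step 2 is a fixed multiple of the pooled sample volume. As a minimal sketch of that arithmetic, the hypothetical helper below assumes a pooled sample volume of 75 μl, which is what 135 μl at a 1.8× ratio implies; the function name and the rounding are illustrative choices, not part of the protocol.

```python
def ampure_bead_volume(sample_volume_ul: float, ratio: float = 1.8) -> float:
    """Return the bead volume (in microliters) for a given sample volume.

    ratio is the bead-to-sample volume ratio (1.8 in the cleanup above).
    """
    if sample_volume_ul <= 0 or ratio <= 0:
        raise ValueError("sample volume and ratio must be positive")
    # Round to 0.1 microliter, a resolution most pipettes can deliver.
    return round(sample_volume_ul * ratio, 1)


# Assumed pooled volume of 75 microliters gives the 135 microliters of step 2.
print(ampure_bead_volume(75))
```

If a different volume is pooled, the same ratio can be applied so that the size selection of the cleanup stays comparable between runs.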

### Table 11

Composition of the barcode amplification reaction for Illumina. Primer sequences and final concentrations are provided in Table 5



### Table 12

PCR program for the barcode amplification reaction for Illumina. Table columns: Cycle, PCR step, Temperature, Time




tional clonality testing using BIOMED-2/EuroClonality assays. Select a sample, make sure the correct target is selected (i.e., choose "IG" under cell type for B-cell clonality assessment), the filter is set on 0–100% to include all detected clonotypes, and click on "report". The following information will be shown in the report that is generated:

1. First, an overview of some quality parameters is shown. The most important is the QC status: "Pass" when the data meets all quality criteria or "Fail" when the data does not meet all quality criteria. Under "QC report" it can be found why the QC failed and which target failed; please interpret these targets with caution.

Fig. 4 Clonotype annotation. A rearrangement (complete IGH and IGK rearrangements) is referred to as a clonotype notated as an immunoglobulin nucleotide sequence with a 5′ gene (V-gene), the junction, and the 3′ gene (J-gene). The junction consists of three parts: the first and last numbers are the amount of nucleotides that is removed from the 5′- or 3′-genes, respectively. The middle number is the amount of nucleotides that is present between the 5′- and 3′-genes and includes the so-called N-nucleotides that are added by the enzyme terminal deoxynucleotidyl transferase (TdT) during the V(D)J recombination process, as well as the D-gene in case of a complete VDJ rearrangement. Incomplete, nonfunctional IGH and IGK rearrangements (IGHD-IGHJ, IGKV-KDE, Intron RSS-KDE) are "artificially" described as clonotypes in ARResT/Interrogate in a similar way as shown here for complete IGH rearrangements, using the corresponding 5′- and 3′-genes and their junctions
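The annotation described in this caption maps naturally onto a small record type. The sketch below is only an illustration of that structure: the class, its field names, the gene names in the example, and the exact text layout produced by `label()` are assumptions and are not taken from ARResT/Interrogate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clonotype:
    """A rearrangement described by its 5' gene, junction, and 3' gene."""

    five_prime_gene: str      # e.g. a V gene (or D gene for incomplete IGH)
    five_prime_deleted: int   # nucleotides removed from the 5' gene
    inserted: int             # nucleotides between the 5' and 3' genes
    three_prime_deleted: int  # nucleotides removed from the 3' gene
    three_prime_gene: str     # e.g. a J gene (or KDE)

    def label(self) -> str:
        """Render the clonotype in a hypothetical 'gene -x/y/-z gene' layout."""
        junction = (
            f"-{self.five_prime_deleted}/{self.inserted}/-{self.three_prime_deleted}"
        )
        return f"{self.five_prime_gene} {junction} {self.three_prime_gene}"


# Illustrative values only: 2 nt deleted from the V gene, 15 nt in between,
# 4 nt deleted from the J gene.
example = Clonotype("IGHV3-23", 2, 15, 4, "IGHJ4")
print(example.label())
```

Because the record is frozen (immutable and hashable), identical clonotypes from different samples compare equal, which is convenient when checking whether two lesions share a rearrangement.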


With the "PDF" button that is present in the reporting function, the total report can be exported as a PDF file.

More advanced analyses can be performed using the "questions" function. In addition to the standard parameters (i.e., junction aa length and clonotype), also other ones can be chosen, like amplicon length or the 5′- or 3′-genes/primers to analyze the data in more detail. Furthermore, in contrast to the reporting section, it is possible to select 2 or more samples at the same time in the questions section, to directly compare the nucleotide sequences in either a bar chart or table. This is especially of added value when a clonal comparison has to be made for a patient with multiple tumors, for example. Using the questions function, the data can be visualized as follows:


Within ARResT/Interrogate, all bar charts (created within both the reporting and questions function) are "interactive," meaning that by clicking on one or more colored parts of a bar, the corresponding clonotypes are selected. A so-called minitable pops up at the top of the page, with the general information about the clonotype(s), but also the most popular full nucleotide sequence of the corresponding clonotype. This information can be downloaded using the download button. Further analysis of the most popular nucleotide sequence, but also all other sequences belonging to the same clonotype, can be done within the "forensics" function. Clicking the green "run tests" button opens this forensics section automatically; alternatively, go to this section manually. Here, the following analyses can be performed:


Fig. 5 Output data generated by ARResT/Interrogate with IG-NGS clonality assay. IG-NGS clonality profiles of a polyclonal (upper panel) and monoclonal (lower panel) sample are shown in bar charts generated by ARResT/Interrogate (target IGHV-IGHD-IGHJ FR3). On the y-axis, the abundance of detected clonotypes is shown in percentage and on the x-axis, the junction length is shown in amino acids (aa). Each bar represents clonotypes with the same junction aa length, and each color indicates a unique clonotype based on its nucleotide sequence. Only the 50 most abundant clonotypes are colored, and all other, less frequent gene rearrangements are merged and represented by the gray bars

After visualization of the results, the obtained clonality patterns of each sample, which is run under standardized conditions (input of DNA and number of samples per run), can be interpreted. It is strongly advised to include a polyclonal control sample in the run. A standardized input of an FFPE-derived polyclonal control sample under standardized run conditions should demonstrate a Gaussian curve with differently sized junctions of the gene rearrangement and a high variety of clonotypes represented by the presence of gray bars as shown in the top panel of Fig. 5, as well as the detection of the V/D/J gene families. Skewing of the curve to either short or long amplicon lengths could imply that the library preparation was not optimal and may interfere with the analysis of samples prepared within the same run. The same holds true for too few reads and/or clonotypes. Depending on the tumor load of a clonal sample, as well as the input of DNA (per PCR-library), a dominant clonotype will be present, as shown in the lower panel of Fig. 5.
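The profiles discussed above group clonotypes by junction length and express each as a percentage of all reads, with only the most abundant clonotypes shown individually. The following is a minimal sketch of that aggregation under stated assumptions: the function name, the input tuple layout, and the `"other"` key are hypothetical, clonotype identifiers are assumed to be unique, and this is not the ARResT/Interrogate implementation.

```python
from collections import defaultdict


def clonality_profile(clonotypes, top_n=50):
    """Aggregate clonotypes into per-junction-length abundance bars.

    clonotypes: iterable of (clonotype_id, junction_aa_length, read_count).
    Returns {junction_aa_length: {clonotype_id or "other": percent_of_reads}}.
    Clonotypes outside the top_n most abundant are merged under "other",
    mirroring the gray bars of the chart.
    """
    clonotypes = list(clonotypes)
    total = sum(count for _, _, count in clonotypes)
    if total == 0:
        return {}
    # Rank by read count; sorted() is stable, so ties keep input order.
    ranked = sorted(clonotypes, key=lambda c: c[2], reverse=True)
    top_ids = {cid for cid, _, _ in ranked[:top_n]}
    profile = defaultdict(lambda: defaultdict(float))
    for cid, length, count in clonotypes:
        key = cid if cid in top_ids else "other"
        profile[length][key] += 100.0 * count / total
    return {length: dict(bars) for length, bars in profile.items()}


# Toy example: one dominant clonotype (80% of reads) and two minor ones.
reads = [("A", 15, 80), ("B", 14, 10), ("C", 15, 10)]
print(clonality_profile(reads, top_n=2))
```

In a polyclonal sample the percentages would be spread over many junction lengths with a large merged fraction, whereas a clonal sample shows one clonotype taking up most of a single bar.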

For correct interpretation of the clonality assay per sample, several steps should be followed:


Guidelines for the technical interpretation of the obtained result per locus and rearrangement type, as well as for the molecular clonality conclusion, are under development. Furthermore, it should be stressed that the clonality results should be integrated with the clinical, morphological, and immunophenotypic data to make a final diagnosis.

### 4 Notes


that includes column-based purification is strongly recommended. Extraction methods that isolate both DNA and RNA in parallel are not suitable. RNA present in the DNA solution negatively influences the PCR reaction resulting in an abnormal, disturbed polyclonal pattern. The DNA should be quantified to enable standardized DNA input in the PCR.


adapters are used, the barcode ligation will be very inefficient and up to 85% of the generated reads will not be barcoded and are useless.


"reporting" and "questions") depends on the user mode. However, in each user mode, at least one of these functions is available. For specific questions regarding an ARResT/Interrogate account, please contact the ARResT team (contact@arrest.tools).

### Acknowledgments

The development of the NGS-based protocols for clonality detection was executed by laboratories within the EuroClonality-NGS Working Group, part of the EuroClonality consortium. A special thanks to Jos Rijntjes and Jeroen Luijks (Department of Pathology, Radboud University Medical Center, Nijmegen, The Netherlands) for technical assistance during the development of the two-step Illumina protocol for clonality detection. This project was funded by EuroClonality and the Dutch Cancer Society (KWF-11137). Figures are created with BioRender.com.

### References


of lymphoma diagnostics via PCR-based clonality testing: report of the BIOMED-2 concerted action BHM4-CT98-3936. Leukemia 21(2):201–206. https://doi.org/10.1038/sj.leu.2404467


MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218):53–59. https://doi.org/10.1038/nature07517

21. Bruggemann M, Kotrova M, Knecht H, Bartram J, Boudjogrha M, Bystry V, Fazio G, Fronkova E, Giraud M, Grioni A, Hancock J, Herrmann D, Jimenez C, Krejci A, Moppett J, Reigl T, Salson M, Scheijen B, Schwarz M, Songia S, Svaton M, van Dongen JJM, Villarese P, Wakeman S, Wright G, Cazzaniga G, Davi F, Garcia-Sanz R, Gonzalez D, Groenen P, Hummel M, Macintyre EA, Stamatopoulos K, Pott C, Trka J, Darzentas N, Langerak AW (2019) Standardized next-generation sequencing of immunoglobulin and T-cell receptor gene recombinations for MRD marker identification in acute lymphoblastic leukaemia; a EuroClonality-NGS validation study. Leukemia 33(9):2241–2253. https://doi.org/10.1038/s41375-019-0496-7

22. van den Brand M, Rijntjes J, Mobs M, Steinhilber J, van der Klift MY, Heezen KC, Kroeze LI, Reigl T, Porc J, Darzentas N, Luijks J, Scheijen B, Davi F, ElDaly H, Liu H, Anagnostopoulos I, Hummel M, Fend F, Langerak AW, Groenen P, EuroClonality NGSWG (2021) Next-generation sequencing-based clonality assessment of Ig gene rearrangements: A multicenter validation study by euroClonality-NGS. J Mol Diagn 23(9):1105–1115

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# One-Step Next-Generation Sequencing of Immunoglobulin and T-Cell Receptor Gene Recombinations for MRD Marker Identification in Acute Lymphoblastic Leukemia

Patrick Villarese, Chrystelle Abdo, Matthieu Bertrand, Florian Thonier, Mathieu Giraud, Mikaël Salson, and Elizabeth Macintyre

### Abstract

Within the EuroClonality-NGS group, immune repertoire analysis for target identification in lymphoid malignancies was initially developed using two-stage amplicon approaches, essentially as a progressive modification of preceding methods developed for Sanger sequencing. This approach has, however, limitations with respect to sample handling, adaptation to automation, and risk of contamination by amplicon products. We therefore developed one-step PCR amplicon methods with individual barcoding for batched analysis for IGH, IGK, TRD, TRG, and TRB rearrangements, followed by Vidjil-based data analysis.

Key words Next-generation sequencing, One step, T cell receptor, B cell receptor

### 1 Introduction

Recombination of the V(D)J genes of immunoglobulin (IG) and T cell receptor (TR) loci is an essential step in the differentiation of B and T cells, allowing the production of a unique antigen receptor which is present in all clonal progeny. As such, acute lymphoblastic leukemias (ALLs) are characterized by clonal, homogeneous IG/TR rearrangement patterns that are widely used for clonal tracking during evaluation of response to treatment, commonly referred to as quantification of minimal (or measurable) residual disease (MRD) [1]. The EuroMRD group has played a seminal role in developing, standardizing, and accompanying optimized use of IG/TR clonal markers in lymphoid malignancies, essentially using CDR3 clone-specific quantitation by PCR. Initial IG/TR target identification was based predominantly on EuroClonality/BIOMED-2 multiplex PCR-based protocols for IG/TR targets combined with heteroduplex analysis or fragment length (GeneScan) analysis, followed by Sanger sequencing and design of CDR3-specific PCR primers [2–4].

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_3, © The Author(s) 2022

With the development of NGS immunogenetics [5–9], the EuroClonality-NGS working group developed a standardized two-step multiplex amplicon approach to IG/TR target identification in ALL that enabled switching of sequencing adaptors and a reduction of the total number of primers required for individual sample identification in mixed libraries [10].

Two-step PCR approaches, however, have several limitations, particularly in MRD laboratories, where contamination by PCR products carries a risk of false-positive results. These limitations include more extensive sample handling, with consequent increased overall cost and risk of contamination, and reduced suitability for automation. We therefore developed a single-step PCR approach to screening for IG/TR rearrangements in lymphoid malignancies, as described here.

### 2 Materials




7. Prepare the primer mix for TRB VDJ (see Note 7).

Importantly, each primer mix should be prepared with the same index.







Table 2 Amplification protocols for different IG and TR targets



$$\text{conc (nM)} = \frac{\text{conc (ng/}\mu\text{L)} \times 10^{6}}{\text{size of library in base pairs} \times 660}$$

Option: One can verify the size of each library by electrophoresis on a Bioanalyzer 2100. Analyze 1 μL sample with the DNA High Sensitivity Agilent kit.

After migration, profiles and sizes should be as illustrated below (example is shown for TRB VDJ).

### 3.5 Pool Preparation (2 nM)


4. Transform ng/μL into nM with this formula:

$$\text{conc (nM)} = \frac{\text{conc (ng/}\mu\text{L)} \times 10^{6}}{\text{size of library in base pairs} \times 660}$$
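The conversion in step 4 can be scripted so that every library in a run is handled the same way. The sketch below simply applies the formula above; the function name and the example values (2 ng/μL, 400 bp) are illustrative assumptions rather than values from the protocol.

```python
def ng_per_ul_to_nM(conc_ng_per_ul: float, library_size_bp: float) -> float:
    """Convert a dsDNA library concentration from ng/uL to nM.

    Uses 660 g/mol as the average mass of one base pair, as in the formula
    of step 4.
    """
    if library_size_bp <= 0:
        raise ValueError("library size must be positive")
    return conc_ng_per_ul * 1e6 / (library_size_bp * 660.0)


# Hypothetical library: 2 ng/uL with an average fragment size of 400 bp.
print(round(ng_per_ul_to_nM(2.0, 400), 2))
```

Note that the result is only as good as the library size estimate, which is why verifying the size by electrophoresis, as suggested above, can be worthwhile.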

### 3.6 Denaturation Step

1. Normalize the library pool to 2 nM in Resuspension Buffer.
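Normalizing to 2 nM is a standard C1V1 = C2V2 dilution. As a minimal sketch, the hypothetical helper below returns the volumes of library pool and Resuspension Buffer to combine; the final volume is a free parameter chosen by the user and is not specified in the protocol.

```python
def dilution_volumes(stock_nM: float, target_nM: float = 2.0,
                     final_volume_ul: float = 10.0):
    """Return (pool_volume_ul, buffer_volume_ul) for diluting a library pool.

    Solves C1 * V1 = C2 * V2 for V1 and fills up to the final volume
    with buffer.
    """
    if stock_nM < target_nM:
        raise ValueError("pool is already below the target concentration")
    pool_ul = target_nM * final_volume_ul / stock_nM
    return pool_ul, final_volume_ul - pool_ul


# Hypothetical pool at 8 nM diluted to 2 nM in a final volume of 20 uL.
pool_ul, buffer_ul = dilution_volumes(8.0, final_volume_ul=20.0)
print(pool_ul, buffer_ul)
```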


Adding 10% PhiX control to the library pool:


### 3.7 Bioinformatic Analysis with the Vidjil Platform



Fig. 1 Adding patients and runs in Vidjil


Fig. 2 Adding samples in Vidjil. The rectangles refer to the different steps described in the main text

	- (a) Select the pre-process M + R2: Merge paired-end reads (A in Fig. 2).
	- (b) Click on Add other sample to have as many sample lines as required (B in Fig. 2).
	- (c) Add each sample one by one.
		- Select the FASTQ file for the R1 reads in the first field (C in Fig. 2).
		- Select the FASTQ file for the R2 reads in the second field (D in Fig. 2).
		- Enter the sampling date.
		- In the last field, type the last name of the patient and select the corresponding one in the list that appears (E in Fig. 2). This will associate the sample to the patient, which will then be available from run or patient.


	- 5. Submit the samples.
	- 6. Choose the configuration of the algorithm: "multi+inc+xxx." This is the advised configuration for target identification as it will detect both complete and incomplete recombinations (Fig. 3).
	- 7. Launch the analysis with the selected configuration for each sample.
	- 8. Click on reload, at the bottom left, to see the job status going through the different steps: QUEUED → ASSIGNED → RUNNING → COMPLETED. It is possible to launch several processes at the same time (some will wait in the QUEUED/ASSIGNED states).
	- 9. Once the jobs are completed, return to the patient list to visualize the results by clicking on the configuration name.
	- 10. Analyze the sample to determine the markers of interest (Fig. 4).
		- (a) The percentage of analyzed reads should normally be above 90%; otherwise the sequencing run may be of poor quality (A in Fig. 4).
			- In case this percentage is too low, investigate the reason why by clicking on the info button in the upper left panel (B in Fig. 4).
			- Specifically, check the percentage of reads that are classified as:
				- UNSEG only V/5′ (reads only matching V genes).
				- UNSEG only J/3′ (reads only matching J genes).
				- UNSEG too few V/J (reads matching no V or J gene).
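The check in step 10(a) amounts to comparing the fraction of analyzed reads with a 90% threshold and, if it falls short, ranking the unsegmented categories. The sketch below illustrates this on read counts entered by hand; the function, its arguments, and the assumption that "analyzed" means "not classified as UNSEG" are illustrative and do not use the Vidjil API.

```python
def analyzed_read_report(total_reads, unsegmented_counts, threshold_pct=90.0):
    """Summarize how many reads were analyzed and why others were not.

    unsegmented_counts maps an UNSEG category label to its read count.
    Returns the analyzed percentage, whether it meets the threshold, and
    the categories sorted from most to least frequent (as percentages).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    unsegmented = sum(unsegmented_counts.values())
    analyzed_pct = 100.0 * (total_reads - unsegmented) / total_reads
    causes = sorted(
        ((label, 100.0 * count / total_reads)
         for label, count in unsegmented_counts.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return {
        "analyzed_pct": analyzed_pct,
        "passes": analyzed_pct >= threshold_pct,
        "causes": causes,
    }


# Made-up counts for a run in which many reads match only a V gene.
report = analyzed_read_report(
    1000,
    {"UNSEG only V/5'": 150, "UNSEG only J/3'": 30, "UNSEG too few V/J": 20},
)
print(report["analyzed_pct"], report["passes"], report["causes"][0])
```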

Fig. 4 Analyzing the clonotypes in the Vidjil client. Clonotypes are viewed at the same time in a Genescan-like view, a grid view (depending on V/J genes) and in a list. Moreover, the sequences of the selected clonotypes appear at the bottom

	- (a) Select all the clonotypes with the same V and J genes as the studied clonotype.
	- (b) Align the sequences (D in Fig. 4).
	- (c) Remove the sequences that do not align properly with the studied clonotype.
	- (d) Realign the sequences.
	- (e) Repeat steps c and d until all the sequences align with only few differences.
	- (f) Cluster the aligned sequences (button cluster, E in Fig. 4).

### 4 Notes

	- (a) Tube A: combine primer for index D502.
		- Add 2 μL of each primer at 100 μM + 396 μL H2O; each primer is at 10 μM.


	- Add 2 μL of each primer at 100 μM + 90 μL H2O; each primer is at 10 μM.


	- Add 2 μL of each primer at 100 μM + 36 μL H2O; each primer is at 10 μM.

	- (a) Tube DH primer: combine primer for index D502.
		- Add 2 μL of each primer at 10 μM; each primer is at 10 μM.


(b) Tube JH primer: combine primer for index D701.

• Add 2 μL of each primer at 100 μM + 36 μL H2O; each primer is at 10 μM.

```
T7-JH consensus
T7-IGJH-137(faham)
```
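The primer mixes in these notes follow ordinary C1V1 = C2V2 dilution arithmetic. As a minimal sketch, the hypothetical helper below computes the final volume and concentrations of a mix, assuming the JH tube contains exactly the two primers listed above. Under that assumption, each primer ends up at 5 μM and the combined primer concentration is 10 μM, so the stated 10 μM may refer to the combined concentration.

```python
def primer_mix(n_primers: int, primer_volume_ul: float,
               stock_uM: float, water_ul: float) -> dict:
    """Compute volume and concentrations for an equal-volume primer mix.

    Every primer is added at the same volume from a stock of the same
    concentration, and the mix is topped up with water.
    """
    if n_primers <= 0:
        raise ValueError("at least one primer is required")
    total_ul = n_primers * primer_volume_ul + water_ul
    per_primer_uM = stock_uM * primer_volume_ul / total_ul
    return {
        "total_volume_ul": total_ul,
        "per_primer_uM": per_primer_uM,
        "combined_uM": per_primer_uM * n_primers,
    }


# Two primers, 2 uL each at 100 uM, plus 36 uL water (as for the JH tube).
print(primer_mix(2, 2.0, 100.0, 36.0))
```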
	- (a) Tube Vkappa: combine primer for index D502.
		- Add 5 μL of each primer at 100 μM + 585 μL H2O; each primer is at 10 μM.


	- Dilute 5 μL of each primer at 100 μM + 45 μL H2O.
	- Add 4 μL of each primer at 100 μM + 108 μL H2O; each primer is at 10 μM.


	- Dilute 5 μL of each primer at 100 μM + 45 μL H2O.
	- (a) Tube mix A TCRGV: combine primer for index D502.
		- Add 2 μL of each primer at 100 μM + 90 μL H2O; each primer is at 10 μM.


	- Add 2 μL of each primer at 100 μM + 36 μL H2O; each primer is at 10 μM.


	- Dilute 1 μL of each primer at 100 μM + 36 μL H2O.
	- Mix 7 μL of each primer at 20 μM.


	- (a) Tube mix VDD2: combine primer for index D502.
		- Mix 5 μL of each primer at 100 μM.


	- Mix 5 μL of each primer at 100 μM.


	- (a) Tube mix TRB DB: combine primer for index D502.
		- Mix 2 μL of each primer at 10 μM + 36 μL H2O.


	- Mix 2 μL of each primer at 10 μM + 252 μL H2O.


	- (a) Tube mix TRB VB: combine primer for index D502.
		- Mix each primer at 100 μM with the volume below:


	- Mix 2 μL of each primer at 10 μM + 252 μL H2O.


### References


rise of Immunoinformatics. Front Immunol 5: 22

15. Lefranc M-P, Giudicelli V, Duroux P, Jabado-Michaloud J, Folch G, Aouinti S et al (2015) IMGT®, the international ImMunoGeneTics information system® 25 years on. Nucleic Acids Res 43(Database issue):D413–D422


# Chapter 4

## Immunoglobulin/T-Cell Receptor Gene Rearrangement Analysis Using RNA-Seq

### Vincent H. J. van der Velden, Lorenz Bastian, Monika Brüggemann, Alina M. Hartmann, and Nikos Darzentas

### Abstract

Identification of immunoglobulin (IG) and T-cell receptor (TR) gene rearrangements in acute lymphoblastic leukemia (ALL) patients at initial presentation is crucial for monitoring of minimal residual disease (MRD) during subsequent follow-up and thereby for appropriate risk-group stratification. Here we describe how RNA-Seq data can be generated and subsequently analyzed with ARResT/Interrogate to identify possible MRD markers. In addition to the procedures, possible pitfalls will be discussed. Similar strategies can be employed for other lymphoid malignancies, such as lymphoma and myeloma.

Key words Minimal residual disease, Acute lymphoblastic leukemia, Immunoglobulin, T-cell receptor, Gene rearrangements, RNA-Seq, Whole exome sequencing, Whole genome sequencing, Marker identification

### 1 Introduction

Most clinical protocols for patients with acute lymphoblastic leukemia (ALL) nowadays include minimal residual disease (MRD) based stratification [1–4]. Molecular MRD analysis is, at least in Europe, most commonly used and is generally based on analysis of rearranged immunoglobulin (IG) and T-cell receptor (TR) genes according to international guidelines [5–8]. In a diagnostic setting, IG/TR gene rearrangements are generally identified using DNA-based PCR analysis, followed by classical Sanger sequencing or next-generation sequencing (NGS) [5, 8]. In recent years, whole transcriptome RNA sequencing (RNA-Seq) is increasingly used to identify fusion genes and to assign patients into distinct molecular subgroups according to the WHO 2016 classification, or for protocol-based clinical decisions [9]. Clearly, it would be beneficial if RNA-Seq data could also be used for the identification of IG/TR gene rearrangements pertaining to the leukemic clone. A recent study already showed that RNA-Seq data allowed the identification of IG heavy chain (IGH) gene rearrangements in approximately 90% of B-ALL patients [10].

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_4, © The Author(s) 2022

It should however be noted that the majority of ALL rearrangements is unproductive; this is in clear contrast to rearrangements present in normal B cells, which virtually all are functional. Therefore, caution is warranted in the analysis of RNA-Seq data for IG/TR marker screening in ALL (and in other lymphoproliferative disorders requiring multiple RNA/DNA analyses) [11, 12], and applying computational methods that only focus on productive rearrangements (e.g., like for most repertoire analyses) will clearly result in incomplete interpretation of IG/TR data for marker identification [13].
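To make the pitfall concrete, the sketch below applies a deliberately simplified productivity criterion (junction in frame and free of stop codons) and shows how a productive-only filter would discard candidate markers. The function name and the example sequences are made up for illustration, and real tools apply additional checks, so this is not how ARResT/Interrogate or any specific pipeline decides productivity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive_junction(junction_nt: str) -> bool:
    """Return True if the junction is in frame and contains no stop codon.

    Simplified criterion: the junction length must be a multiple of three
    and none of its codons may be a stop codon.
    """
    seq = junction_nt.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


# Made-up junctions: in frame, out of frame, and in frame with a stop codon.
candidates = {
    "in_frame": "TGTGCGAGAGATTACTGG",
    "out_of_frame": "TGTGCGAGAGATTACTG",
    "stop_codon": "TGTGCGTAAGATTACTGG",
}
kept = [name for name, seq in candidates.items() if is_productive_junction(seq)]
print(kept)  # only the in-frame junction survives a productive-only filter
```

For marker identification, all three junctions would remain valid clonal targets, which is why unproductive rearrangements should be reported rather than filtered out.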

In this chapter, we describe how RNA-Seq data can be obtained and subsequently evaluated using the ARResT/Interrogate immunoprofiling platform [arrest.tools/interrogate] to identify possible IG/TR markers. Similar strategies can likely be employed for other lymphoid malignancies, such as lymphoma and myeloma. Finally, comparable data analysis tools may be used for whole genome sequencing and whole exome sequencing data.

### 2 Materials

The following equipment, materials, and reagents (or equivalents) should be available:



The following supplies are specifically required for the "HS" workflow described in this protocol (see Note 1).


### 3 Methods

Be careful when handling RNA samples (see Note 2).

Here we describe the workflow for RNA-Seq using the Illumina® TruSeq Stranded mRNA Kit (Illumina® Document # 1000000040498 v00) Library Kit chemistries and workflows [https://support.illumina.com//sequencing/sequencing\_kits/truseq-stranded-mrna/documentation.html]. Before you proceed, please check carefully for any changes issued by the manufacturer regarding the kit or protocol (see Note 3).

### 3.1 RNA Isolation and Quality Assessment

Input RNA quality and quantity are essential to transcriptome sequencing. The Illumina TruSeq mRNA library kit requires 0.1–1 μg total RNA as input (see Note 4). To assess the RNA quality, use the Agilent RNA 6000 Nano Kit:

	- (a) RNA ladder aliquots, stable at −70 °C for extended time periods.
	- (b) Agilent RNA 6000 Nano gel matrix aliquots (65 μl), can be stored at 4 °C for 1 month (protect from light during use).

Fig. 1 RNA quality assessment by Bioanalyzer. RNA from bone marrow samples of patients with first diagnosis of ALL was isolated by silica columns (Qiagen, AllPrep) and subjected to microcapillary electrophoresis on the Agilent 2100 Bioanalyzer using the Agilent RNA 6000 Nano Kit as described. (a) Electropherogram representing a high-quality RNA sample (RIN 9.4). (b) Electropherogram representing a low-quality RNA sample with ongoing RNA degradation (RIN 4.2)

16. Vortex the chip horizontally in the IKA vortex (1 min, 2400 rpm), and insert it into the Bioanalyzer 2100 station within 5 min to perform the analysis using 2100 Expert Software.

Analysis will deliver a microcapillary electropherogram together with the measured RNA concentration and the RNA integrity number (RIN). The presence of a marker peak and two ribosomal RNA peaks (18S and 28S) indicates a successful measurement of RNA with at least intermediate quality. Integrity of RNA is quantified on a scale from 1 (poor) to 10 (best) by RIN, based on a proprietary algorithm developed by Agilent [14]. Figure 1 shows a good and a poor RIN example. In the poor RIN example, the ribosomal peaks are hardly detectable and RNA degradation is observed as a smear of RNA with decreasing size. Illumina TruSeq protocols recommend a RIN of 8.0 or higher to be used for library preps (see Note 7).


	- (a) 94 °C for 8 min
	- (b) Hold at 4 °C.

### 3.2.2 First Strand cDNA Synthesis

Purified RNA fragments are reverse transcribed to first strand cDNA using random hexamer primers.

	- (a) 25 °C for 10 min.
	- (b) 42 °C for 15 min.
	- (c) 70 °C for 15 min.
	- (d) Hold at 4 °C.

### 3.2.3 Second Strand cDNA Synthesis

To maintain strand specificity during cDNA synthesis and to remove the mRNA template, dTTP is replaced by dUTP in second strand cDNA synthesis. Second strand cDNA synthesis results in blunt-end double-stranded cDNA, which can be stored for 1 week (first safe stopping point).

	- 1. Add 2.5 μl Resuspension buffer to all wells of the adapter ligation plate and spin down (5 s, 600 × g).
	- 2. Pipette 12.5 μl A-Tailing Mix to each well and mix by shaking (2 min, 1800 rpm).
	- 3. Cover plate with Microseal "B" and spin down (1 min, 280 × g).
	- 4. Incubate on 37 °C microheating system (30 min), then transfer to 70 °C microheating system (5 min, lid closed), and cool down on ice (1 min).
	- 5. Spin down TruSeq® RNA CD Index Plate (1 min, 280 × g).
	- (a) 2.5 μl Resuspension buffer
	- (b) 2.5 μl Ligation mix
	- (c) 2.5 μl RNA Adapters from the Index adapter Plate (to each corresponding well).

And mix by shaking (2 min, 1800 rpm).


### 3.2.5 DNA Fragment Enrichment

PCR is used to amplify the library and to select for DNA fragments with successful adapter ligation.

	- (a) 98 °C for 30 s.
	- (b) 15 cycles of:
		- 98 °C for 10 s
		- 60 °C for 30 s
		- 72 °C for 30 s.
	- (c) 72 °C for 5 min.
	- (d) Hold at 4 °C.

### 3.2.6 Library Quality Check, Normalization, and Pooling

Library quantity and fragment size are determined using the Bioanalyzer. Indexed libraries are pooled prior to sequencing (see Note 12).

1. Quantify library concentration by qPCR as outlined in the manufacturer's protocol (Illumina Sequencing Library qPCR Quantification Guide, document # 11322363) [https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry\_documentation/qpcr/sequencing-library-qpcr-quantification-guide-11322363-c.pdf].
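The arithmetic behind normalization and pooling can be sketched in a few lines. The example below is an illustration only, for cases where the library concentration is available in ng/μl together with the mean fragment size from the Bioanalyzer: it converts the concentration to nM using the common approximation of 660 g/mol per base pair of double-stranded DNA, and then computes the dilution needed to reach a target concentration. The function names, the example values, and the 4 nM target are assumptions chosen for illustration and are not taken from the manufacturer's protocol.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA library concentration (ng/ul) to nM.

    Uses the approximation of 660 g/mol per base pair of dsDNA.
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment size > 0")
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def dilution_volumes_ul(stock_nm: float, target_nm: float, final_volume_ul: float):
    """Return (library volume, diluent volume) to reach target_nm in final_volume_ul."""
    if stock_nm < target_nm:
        raise ValueError("library is more dilute than the target concentration")
    library_ul = target_nm * final_volume_ul / stock_nm
    return library_ul, final_volume_ul - library_ul


# Hypothetical example: 3.3 ng/ul library with a mean fragment size of 300 bp,
# diluted to an assumed target of 4 nM in 20 ul.
molarity = library_molarity_nm(3.3, 300)
lib_ul, diluent_ul = dilution_volumes_ul(molarity, target_nm=4.0, final_volume_ul=20.0)
print(f"{molarity:.2f} nM; mix {lib_ul:.2f} ul library with {diluent_ul:.2f} ul diluent")
```

Once every indexed library has been brought to the same molar concentration, equal volumes can be combined to obtain an equimolar pool.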


for RNA-Seq reads. Nonetheless, it can be used to quickly and easily check per base sequencing quality, the number of input reads, and adapter content in the reads.

### 3.2.10 Adapter Trimming

When read length exceeds DNA insert size, a run can sequence beyond the DNA insert and read bases from the sequencing adapter. To prevent these bases from appearing in FASTQ files, the adapter sequence is trimmed from the 3′ ends of reads. Trimming the adapter sequence improves alignment accuracy and performance in Illumina FASTQ generation pipelines [https://support-docs.illumina.com/SHARE/AdapterSeq/DNAandRNACDIndexes.html].
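The principle of 3′ adapter trimming can be illustrated with a minimal sketch. The function below is not the Illumina pipeline: it only handles exact matches, whereas dedicated trimming software also tolerates sequencing errors. The adapter string used here is an assumed 13-base adapter prefix chosen for illustration; the adapter sequences that apply to a given kit should be taken from the Illumina document referenced above. The function name and the `min_overlap` parameter are hypothetical.

```python
# Assumed adapter prefix for illustration only; check the kit documentation.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Remove an adapter, or a 3'-terminal prefix of it, from the 3' end of a read.

    Exact matching only; real trimmers also allow mismatches.
    """
    # Case 1: the whole adapter occurs in the read, so the insert ends where it starts.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read stops part-way through the adapter. Look for the longest
    # adapter prefix (at least min_overlap bases) that matches the end of the read.
    max_len = min(len(adapter) - 1, len(read))
    for k in range(max_len, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter detected: return the read unchanged.
    return read


insert = "ACGTACGTTTGACCA"
print(trim_adapter(insert + ADAPTER + "ACAC"))  # read-through into the adapter
print(trim_adapter(insert + ADAPTER[:5]))       # read ends inside the adapter
print(trim_adapter(insert))                     # no adapter present
```

A small `min_overlap` removes more adapter remnants but also trims more reads by chance, so the value is a trade-off rather than a fixed rule.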

### 3.3 Bioinformatic Analysis of the Sequencing Data Using ARResT/Interrogate

We use the ARResT/Interrogate immunoprofiling platform for data analysis, which has been developed and validated within EuroClonality-NGS [16–18].


please stay updated via ARResT/Interrogate [arrest.tools/interrogate] and EuroClonality-NGS [euroclonalityngs.org] (see Note 19).

### 4 Notes


the indicated temperatures, that heating systems are preheated to the temperatures required, and that thermal cyclers have been programmed according to the given programs.


expression profiling only. Increasing sequencing depth to >30 million reads will improve the detection of subclonal and less covered markers.


Fig. 2 Comparison between IG/TR rearrangements detected by RNA-Seq and amplicon-based assays ("DNAamp") in 165 ALL patients [13]. Average number of rearrangements detected for the various IG/TR loci per case

gene rearrangements in leukemic rearrangements differ from transcription levels in reactive T cells. Higher read depth may facilitate identification of lowly expressed IG/TR mRNA.

19. Of note, IG/TR rearrangements may also be derived from whole exome sequencing (WES) or whole genome sequencing (WGS) data sets that, in contrast to RNA-Seq data, do not depend on the transcriptional level of rearrangements. This creates a clear advantage as was recently showcased in work introducing IgCaller for WGS-derived IGH data [19].

### Acknowledgments

We thank the members of EuroClonality and EuroMRD for their support, especially the participants of the WholeMark work package (Blanca Scheijen, Bastiaan Tops, Jan Trka, Karol Pál, Sonja Hänzelmann, Gianni Cazzaniga, Grazia Fazio, Simona Songia, and Anton W. Langerak).

### References


Grümayer ER et al (2007) Analysis of minimal residual disease by Ig/TCR gene rearrangements: guidelines for interpretation of real-time quantitative PCR data. Leukemia 21: 604–611


B-cell acute lymphoblastic leukemia using RNA-Seq. Leukemia 34:2418–2429


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Minimal Residual Disease Analysis by Monitoring Immunoglobulin and T-Cell Receptor Gene Rearrangements by Quantitative PCR and Droplet Digital PCR

### Irene Della Starza, Cornelia Eckert, Daniela Drandi, and Giovanni Cazzaniga, on behalf of the EuroMRD Consortium

### Abstract

Analysis of immunoglobulin and T-cell receptor gene rearrangements by real-time quantitative polymerase chain reaction (RQ-PCR) is the gold standard for sensitive and accurate minimal residual disease (MRD) monitoring; it has been extensively standardized and guidelines have been developed within the EuroMRD consortium (www.euromrd.org). However, new generations of PCR-based methods are emerging as potential alternatives to RQ-PCR, such as digital PCR technology (dPCR), the third-generation implementation of conventional PCR, which has the potential to overcome some of the limitations of RQ-PCR, for example by allowing the absolute quantification of nucleic acid targets without the need for a calibration curve. In recent years, droplet digital PCR (ddPCR) technology has been compared to RQ-PCR in several hematologic malignancies, showing its proficiency for MRD analysis. So far, no established guidelines for ddPCR MRD analysis and data interpretation have been defined and its potential is still under investigation. However, a major standardization effort is underway within the EuroMRD consortium (www.euromrd.org) for future application of ddPCR in standard clinical practice.

Key words Minimal residual disease, Immunoglobulin, T-cell receptor, Rearrangement, RQ-PCR, ddPCR

### 1 Introduction

After a single lymphoid cell undergoes clonal neoplastic transformation, all progeny leukemic cells will contain the same rearranged clonal Immunoglobulin (IG) and T-cell receptor (TR) genes, thus representing highly specific molecular targets for minimal residual disease (MRD) detection in lymphoproliferative disorders [1].

MRD monitoring has proven to be a compelling tool for guiding therapeutic choices, especially in acute lymphoblastic leukemia (ALL), the first neoplasm in which MRD was used to assess early response to therapy [2–6]. The availability of drug combinations capable of unprecedented complete clinical responses has led to growing interest in MRD assessment in other lymphoid malignancies as well, e.g., chronic lymphocytic leukemia, multiple myeloma, and mantle cell lymphoma [7–10].

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_5, © The Author(s) 2022

Currently, antigen-receptor gene analysis by real-time quantitative polymerase chain reaction (RQ-PCR) is the gold standard for sensitive and accurate MRD monitoring and has been extensively standardized within the EuroMRD consortium (www.euromrd.org), which established guidelines for the analysis and interpretation of RQ-PCR data [11] to favor a homogeneous application of MRD studies within different lymphoid malignancies and treatment protocols all over the world. However, the measurement of a dynamic process, such as the rate of target amplification, carries some intrinsic fluctuations that cannot be fully eliminated. The digital PCR technology (dPCR) [12], the new generation of conventional PCR, is based on partitioning by nanofluidics and emulsion chemistries, which allow a limiting dilution of DNA into individual (partitioned) PCR reactions. The DNA template can thus be randomly distributed, and Poisson statistics can be applied to quantify the DNA amount in positive partitions. In comparison with RQ-PCR, dPCR allows the quantification of nucleic acid targets without the need for calibration curves [13]. Moreover, it has the potential to overcome some of the limitations of RQ-PCR. Based on the dynamic nature of these two methods, dPCR appears more accurate than RQ-PCR with a greater amplification efficiency, since each sample is partitioned and each partition is analyzed individually, so small changes in fluorescence intensity are more readily detected [14, 15].
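The Poisson correction mentioned above can be sketched as follows. If templates are distributed randomly over the partitions, the mean number of copies per partition is λ = −ln(1 − k/n), where k is the number of positive partitions and n the total number of accepted partitions; dividing λ by the partition volume gives the concentration. The code is a minimal illustration of this calculation, not the vendor software. The function names are hypothetical, and the partition volume of 0.85 nl used in the example is an assumption that should be checked against the instrument documentation.

```python
import math


def copies_per_partition(positive: int, total: int) -> float:
    """Mean target copies per partition, estimated from the negative fraction."""
    if total <= 0 or not 0 <= positive <= total:
        raise ValueError("need total > 0 and 0 <= positive <= total")
    if positive == total:
        # No negative partitions left: the Poisson estimate is undefined.
        raise ValueError("all partitions positive: sample too concentrated")
    return -math.log(1.0 - positive / total)


def copies_per_ul(positive: int, total: int, partition_volume_nl: float) -> float:
    """Target concentration in copies per microliter of reaction."""
    if partition_volume_nl <= 0:
        raise ValueError("partition volume must be > 0")
    lam = copies_per_partition(positive, total)
    return lam / (partition_volume_nl * 1e-3)  # convert nl to ul


# Hypothetical well: 1500 positive droplets out of 15000 accepted droplets,
# with an assumed droplet volume of 0.85 nl.
print(round(copies_per_ul(1500, 15000, 0.85), 1), "copies/ul")
```

Because the estimate depends only on the fraction of negative partitions, no calibration curve is needed, but it breaks down when nearly all partitions are positive.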

Recently, droplet digital PCR (ddPCR) technology, a type of dPCR characterized by partitioning the sample in droplets, has been applied in comparison to RQ-PCR in several hematologic malignancies, and its additional technical and clinical value to the gold-standard RQ-PCR was demonstrated [16–22]. However, no established guidelines for ddPCR MRD analysis and interpretation have been defined so far, and its potential is still under investigation. A major standardization effort is underway within the ddPCR group of the EuroMRD consortium (www.euromrd.org) for its future application in standard clinical practice.

The PCR approach for IG/TR screening and RQ-PCR MRD analysis have been recently described in this book series, on behalf of the EuroMRD consortium [23]. Briefly, to identify IG/TR markers at diagnosis, either a standard multiplex-PCR/Sanger sequencing [23] or the new and more efficient NGS-based approaches [24, 25] can be applied to define the unique V-(D-)J junctional regions. Complementary patient- and allele-specific oligonucleotide (ASO) primers and common fluorescent probes must be designed for each target of any patient for its MRD monitoring. To perform the MRD relative quantification by RQ-PCR, amplification conditions and sensitivity testing for each ASO-primer are established on the diagnostic material serially diluted in normal mononuclear cells, before quantifying MRD in bone marrow samples collected during treatment. Interpretation guidelines developed and continuously refined within the EuroMRD group are fundamental for issuing comprehensive clinical reports and for comparing independent studies applying the IG/TR RQ-PCR MRD monitoring [11].

Since the PCR approach for IG/TR screening and RQ-PCR MRD analysis in lymphoproliferative disorders has been recently described [23], in this chapter we will focus on the ddPCR protocol.

### 2 Materials


Throughout this chapter, the Bio-Rad system (Bio-Rad Laboratories) is described. Alternative instruments will require adaptation of this protocol.

### 3 Methods

To identify IG/TR markers at diagnosis, either a standard multiplex PCR/Sanger sequencing [23] or the new and more efficient NGS-based approaches [24, 25] can be applied to define the unique V-(D-)J junctional regions. Complementary ASO primers and common fluorescent probes must be designed for each target of each patient, for MRD monitoring [23]. Several tools are available for assay design and optimization, such as Oligo Analyzer 3.1 (www.eu.idtdna.com), PrimerQuest (Integrated DNA Technologies, www.idtdna.com), Primer3Plus (www.primer3plus.com), or others.

### 3.1 ddPCR MRD Quantification for the Target Genes

No standard curve generation is needed for a ddPCR experiment setup. However, as for any kind of PCR experiment, a positive control is mandatory (e.g., either a 10<sup>−1</sup> or a 10<sup>−4</sup> dilution point run in two wells). Follow-up samples must be tested in triplicate (two replicates are acceptable only in cases with insufficient DNA or failed technical criteria in the third replicate).

To check for unspecific amplifications, nonspecific DNA controls (PB-MNC) should be run in 3 or 6 replicates (see Note 1) and a no template control (NTC) at least in duplicate, for each specific target quantification, respectively.

The specific oligonucleotide primers and probe, as selected based on available IG/TR targets and sensitivity testing, must be used (see Note 2).

1. Prepare the reaction mixture for each sample/well as follows:


	- (a) Load 20 μl of reaction mix and 70 μl of droplet generation oil into the proper DG8 cartridge wells.

### 3.2 DNA Quantification Using the Reference Gene

A reference gene must be tested to correct the MRD value in the actual follow-up sample based on the quantity of DNA loaded. Although no consensus has been reached on reference gene usage, the albumin gene is the most frequently used housekeeping control gene. Details on primers and probe concentrations to amplify a portion of the albumin gene as a reference are indicated in Note 2. The reference gene is recommended to be tested (in a single well) in the same ddPCR plate as for the target gene.
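The role of the reference gene can be illustrated with a simple normalization. The sketch below is one plausible way to relate target copies to the amount of DNA loaded and is not the EuroMRD calculation: it assumes two copies of the reference gene per diploid cell, and it assumes that both copy numbers refer to the same amount of input DNA (rescale first if the reference well was loaded with less DNA). The function and parameter names are hypothetical.

```python
def target_copies_per_cell(target_copies: float,
                           reference_copies: float,
                           reference_copies_per_cell: float = 2.0) -> float:
    """Target copies per cell equivalent, using a reference gene to estimate input.

    Assumption: the reference gene is present in two copies per diploid cell.
    """
    if target_copies < 0:
        raise ValueError("target copies cannot be negative")
    if reference_copies <= 0:
        raise ValueError("reference gene not detected: DNA input cannot be estimated")
    cell_equivalents = reference_copies / reference_copies_per_cell
    return target_copies / cell_equivalents


# Hypothetical follow-up sample: 12 target copies and 200000 albumin copies
# measured for the same amount of input DNA.
ratio = target_copies_per_cell(12, 200_000)
print(f"{ratio:.1e} target copies per cell equivalent")
```

Whether this ratio is reported directly or expressed relative to the diagnostic sample depends on the interpretation guidelines applied by the laboratory.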

1. Prepare the reaction mixture for each sample/well as follows:


	- (a) Load 20 μl of reaction mix and 70 μl of droplet generation oil into the proper DG8 cartridge wells.
	- (b) Carefully remove any bubbles formed in the DG8 cartridge "sample" well during sample loading.
	- (c) Attach the DG8 gasket and start droplet generation.

Fig. 1 ddPCR MRD quantification: schematic diagram of a ddPCR experiment. Step 1: the reaction mix is prepared with the same primer/probes as for the RQ-PCR assay. Both the reaction and the DNA samples are partitioned into 20,000 droplets of identical volume through a microfluidic system. Step 2: in a thermal cycler, 20,000 PCR reactions are amplified and fluorescence is the output during the reaction of polymerization. Step 3: a droplet reader analyzes each droplet individually and detects an increased fluorescence in positive droplets, which contain at least one copy of the target DNA


### 3.3 ddPCR Results Analysis

The analysis must be performed with QuantaSoft or QuantaSoft PRO according to the following criteria:


### 3.4 Interpretation of ddPCR MRD Results

An Excel sheet can be used to report all IG/TR target amplification values for all follow-up samples. While international standardization is still in progress, ddPCR results have so far been interpreted according to different guidelines [21, 22]. See Table 1 for the provisional EuroMRD guidelines.

Fig. 2 ddPCR results analysis: each droplet is plotted on the graph of fluorescence intensity versus droplet number (a). The concentration is calculated on the fraction of empty droplets (green bar), which is the fraction that does not contain any target DNA (b). Fraction of positive droplets is fitted to a Poisson algorithm to determine the absolute copy number, and results are presented in copies per μL (c). In case of few positive events in follow-up samples, NTC or PB-MNC wells, verify the consistency of the amplification signal by checking for the presence of positive droplets in channel 2. If a signal in ch2 is detected, in the same position of ch1 signal, this represents an unspecific amplification (false-positive signal) and must be excluded from the analysis (d). (Adapted from Della Starza I, et al Front Oncol. 2019 Aug 7;9:726)

### Table 1 Provisional EuroMRD guidelines for ddPCR

Interpretations must be incorporated into the clinical report. Although it has not been standardized so far, just as for RQ-PCR, a clinical report ideally should contain the following information for each follow-up sample analyzed: date and type of sampling, the actual MRD value, and the corresponding quantitative limit (QL). If the MRD value is positive but below the quantitative limit, the value can be reported as "POS < QL" (e.g., POS < 1.0 × 10<sup>−4</sup>). As already established for RQ-PCR results, this qualitative result cannot be further interpreted: it only means that the sample is positive and lower than the QL, but it cannot be quantified precisely and should not be used for clinical decisions, in particular not for upgrading the therapy, because of the intrinsic risk of false positivity. In case of negative MRD, the actual QL and the specific time point need to be considered for clinical interpretation and decision-making.
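The reporting rule described above can be written as a small helper. This is a sketch of the wording logic only: it assumes that the call of positivity or negativity has already been made according to the applicable interpretation guidelines, and it does not implement those criteria. The function name and the output format are hypothetical.

```python
def format_mrd_result(mrd_value: float, quantitative_limit: float) -> str:
    """Return the report string for one follow-up sample.

    mrd_value: MRD level as a fraction (0 means no target detected).
    quantitative_limit: QL of the assay, in the same units.
    """
    if quantitative_limit <= 0:
        raise ValueError("quantitative limit must be > 0")
    if mrd_value < 0:
        raise ValueError("MRD value cannot be negative")
    if mrd_value == 0:
        # Negative result: the QL is reported so it can be taken into account.
        return f"NEG (QL {quantitative_limit:.1e})"
    if mrd_value < quantitative_limit:
        # Positive but not quantifiable: qualitative result only.
        return f"POS < QL (POS < {quantitative_limit:.1e})"
    return f"POS {mrd_value:.1e}"


for value in (0.0, 3e-5, 2.5e-3):
    print(format_mrd_result(value, quantitative_limit=1e-4))
```

Keeping the QL in every report line makes negative and "POS < QL" results comparable across time points and assays.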

### 3.5 Conclusion

In recent years, many publications have reported on the application of ddPCR in different hematological diseases. Its intrinsic characteristics (accuracy, sensitivity, quantification without the need for a standard curve, etc.) also make this method attractive for MRD evaluation. However, at the moment, the use of ddPCR as an MRD molecular method in clinical protocols is prevented by the lack of published international guidelines for data interpretation, which are a fundamental requirement to ensure reproducibility and to compare MRD data across clinical protocols. For this reason, a major standardization effort is underway within the EuroMRD consortium, and five ddPCR QC rounds have so far been performed, involving 24 laboratories around the world [21]. The remaining challenges are to achieve this goal and to assess the prognostic relevance of ddPCR in large studies in view of its future application in clinical practice.

### 4 Notes


ddPCR reaction mix, adjusting properly with water. Importantly, before using the enzyme, verify that target sequences or primers and probes will not be damaged.

5. If one or two positive droplets appear in the PB-MNC wells just above the background signal, the threshold line can be set just above these droplets. Positive droplets in the PB-MNC samples with a higher amplitude than the cloud of the positive control are unspecific signals and must be omitted from the analysis.

### References


childhood oncology group. J Clin Oncol 34: 2591–2501


chimerism after allogeneic stem cell transplantation. Exp Hematol 43:462–468



# Quality Control for IG/TR Marker Identification and MRD Analysis

### Eva Fronkova, Michael Svaton, and Jan Trka

### Abstract

Selection of the proper target is crucial for clinically relevant monitoring of minimal residual disease (MRD) in patients with acute lymphoblastic leukemia using the quantitation of clonal-specific immunoreceptor (immunoglobulin/T cell receptor) gene rearrangements. Consequently, correct interpretation of the results of the entire analysis is of utmost importance. Here we present an overview of the quality control measures that need to be implemented into the process of marker identification, selection, and subsequent quantitation of the MRD level.

Key words Minimal residual disease, Acute lymphoblastic leukemia, Quality control, Next-generation sequencing, PCR

### 1 Introduction

Minimal residual disease (MRD) monitoring has become the standard tool for acute lymphoblastic leukemia (ALL) patient risk stratification. Development of the methodology, started by the leading pediatric international consortia, has led to the wide acceptance of this approach by pediatric and adult hematologists alike. Among all potentially available strategies for MRD follow-up analysis, detection and subsequent quantitation of immunoreceptor gene (immunoglobulin/T cell receptor; IG/TR) rearrangements have become the gold standard. IG/TR-based MRD monitoring is currently used not only in frontline treatment of ALL patients but also for the prediction of outcome after relapse of ALL and for follow-up analysis of patients before and after hematopoietic stem cell transplantation (SCT).

Because crucial treatment decisions are made based on the results of MRD measurement, the accuracy of the method is critical. At particular time-points of treatment, both false-negative and false-positive results may have serious consequences. Therefore, quality controls must be an integral part of this approach throughout all the critical procedures of MRD marker identification, selection, and follow-up analysis.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_6, © The Author(s) 2022

Here, we summarize the critical steps in marker identification and MRD analysis together with the description of related quality control measures.

### 2 IG/TR Marker Identification

### 2.1 PCR-Based Marker Identification

A classical approach of clonal marker identification includes PCR amplification, clonality assessment, and Sanger sequencing of PCR products. The strategy for choosing IG/TR markers for amplification differs based on the type of malignancy. In CLIP laboratories, we prefer to use separate singleplex PCR reactions for ALL (25 for B-ALL, 20 for T-ALL), as described by the BIOMED-1 consortium [1, 2], with frozen premixes including primer pairs and polymerase for each rearrangement, complemented by T cell receptor beta (TRB) detection via three multiplex PCR reactions, as described by the BIOMED-2 consortium [3].

### 2.1.1 Control Samples

Cell line or patient samples with respective rearrangements are used as positive control, and water is used as negative control to check for possible contamination. Using 20–25 single reactions, it is not possible to add positive and negative controls to each mix. One positive control and one negative control are used for each marker screening, with positive control changing (rotating) for each screening round to control all the PCR premixes.

### 2.1.2 Distinguishing Between Monoclonal and Polyclonal PCR Products

In case of positive amplification, it is necessary to distinguish monoclonal PCR products from oligo/polyclonal ones. This was previously done using heteroduplex analysis on polyacrylamide gels [3]. Currently, GeneScan analysis or technologies of automated electrophoresis (Agilent Bioanalyzer or similar) are preferred due to significantly reduced hands-on time. We use Agilent Bioanalyzer on-a-chip electrophoresis for clonality detection, because it does not require fluorescently labeled primers as in GeneScan, while providing a similar degree of size distinction. Moreover, PCR products can be directly used for further analysis. The TRB multiplex interpretation is difficult due to possible unspecific bands. Therefore, polyclonal control samples consisting of a mix of at least ten healthy donor "buffy coat" samples should be used for each TRB multiplex tube, together with positive controls and water, to discern nonspecific bands. The monoclonal products are then sequenced and clone-specific primers are designed (see below).

### 2.2 NGS-Based Marker Identification

Alternatively, and currently more frequently, the methods used for the screening of IG/TR gene rearrangements as clonal markers in ALL are based on next-generation sequencing (NGS), providing a rapid and full overview of the rearrangements present in the sample. These methods, usually based on amplicon sequencing for particular markers (IG/TR rearrangements), rely on multiplex PCR with a large number of specific primers, and thus standardized quality control is needed in routine practice to obtain reliable results. Among the noncommercial and thus freely available solutions, the EuroClonality-NGS assays and approaches, which were developed to standardize routine diagnostic practice for both the wet lab and bioinformatic parts of marker identification, are optimal [4].

### 2.2.1 Quality Control of the Library Preparation

To ensure that all possible IG/TR gene rearrangements that are present in the diagnostic DNA sample can be detected, a routine control of the PCR primer mixes should regularly be performed using a polyclonal quality control sample (PC-QC). A mixture of polyclonal DNA samples isolated from the PBMCs obtained from multiple healthy donors is easily accessible in routine laboratory practice and provides a diverse repertoire of IG/TR gene rearrangements. NGS library preparation from the PC-QC is required each time a new working dilution of the primer mix is prepared to test the correct performance of all primers and should be periodically repeated to assess stable primer mix composition over longer periods of time.

Standard quality control (QC) of the NGS library is required for each sequencing run and consists of gel electrophoresis of the final products to assess a good specific amplification of the library at the expected amplicon length and quantitation of the purified specific products.

For the purpose of assessing correct PCR amplification during each NGS library preparation, a central in-tube quality/quantitation control (cIT-QC) is used and added to the PCR reaction to undergo the whole process in parallel with the diagnostic sample. The cIT-QC consists of selected human B and T cell lines with defined IG/TR rearrangements [5] and serves as a positive control for all the IG/TR gene loci, including the ones that were not rearranged in the patient's malignant cells and would otherwise lack specific rearrangements. Reads from the cIT-QC are used during the bioinformatic analysis to confirm correct NGS library preparation and aid with the normalization of all the other reads to cell counts.

### 2.2.2 Data Analysis

A large number of specifically developed software tools exist for the analysis of IG/TR gene rearrangements, with the ARResT/Interrogate [6] and Vidjil [7] applications being developed in collaboration with the EuroClonality-NGS working group to be well suited for MRD marker identification, including the automatic quality control of the libraries prepared according to the EuroClonality-NGS working group protocols. An essential prerequisite for the analysis is sufficient sequencing coverage of the NGS libraries with good base quality for reliable identification of all IG/TR rearrangements present in the DNA sample, including the cIT-QC. This is taken into consideration during the bioinformatic analysis with these tools.

Fig. 1 Primer usage in a mixed polyclonal sample. Individual primers from the EuroClonality-NGS IGK-VJ-Kde primer mix are shown with 5′ primers on the x axis and 3′ primers in different colors. The y axis shows the relative abundance of reads identified with the respective primer sequence in the NGS library

Using correct primer annotation, the usage of specific 5′ and 3′ primers can be examined in each sample to assess their individual performance. An example of such analysis of IGK-VJ-Kde primer usage in a polyclonal sample is shown (Fig. 1). Although influenced by the gene usage in a healthy polyclonal repertoire, it is a reliable indicator of any errors that may have occurred during the primer mix preparation. Primer mix performance should be checked regularly using a PC-QC.
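A primer usage check of this kind can be sketched as a simple tally. The example assumes that each read has already been annotated with the 5′ and 3′ primer it starts and ends with; how this annotation is obtained is outside the scope of the sketch, and the code is not taken from ARResT/Interrogate or Vidjil. The primer names are hypothetical placeholders.

```python
from collections import Counter


def primer_usage(read_primers, expected_pairs):
    """Relative read abundance per expected (5' primer, 3' primer) pair.

    read_primers: iterable of (five_prime, three_prime) tuples, one per read.
    expected_pairs: primer pairs that should be present in the mix.
    Returns (usage, missing), where usage maps each expected pair to its
    fraction of all reads and missing lists expected pairs with no reads.
    """
    counts = Counter(read_primers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no annotated reads")
    usage = {pair: counts[pair] / total for pair in expected_pairs}
    missing = [pair for pair in expected_pairs if counts[pair] == 0]
    return usage, missing


# Hypothetical polyclonal control with placeholder primer names.
reads = [("V-fw-1", "J-rv-1")] * 6 + [("V-fw-2", "J-rv-1")] * 3 + [("V-fw-1", "Kde-rv")]
expected = [("V-fw-1", "J-rv-1"), ("V-fw-2", "J-rv-1"),
            ("V-fw-1", "Kde-rv"), ("V-fw-2", "Kde-rv")]
usage, missing = primer_usage(reads, expected)
for pair, fraction in usage.items():
    print(pair, f"{fraction:.0%}")
print("no reads for:", missing)
```

A pair without reads is not proof of a faulty primer, since gene usage in the polyclonal repertoire is uneven, but it is a reasonable trigger for a closer look at the primer mix.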

Reads corresponding to the cIT-QC are identified during the bioinformatic analysis and serve as an amplification control for each individual library. An automatic QC determines that all expected rearrangements of the cIT-QC are present in the respective libraries and a quantitation factor is calculated based on the DNA input of the cIT-QC as well as the patient's sample. A potential failure to detect some of the cIT-QC rearrangements may occur in a situation with a low coverage of the NGS library and a high infiltration of blasts in the patient's sample with monoclonal rearrangement. In such cases usually only some of the cIT-QC rearrangements are not covered and the MRD marker can still be clearly identified. In samples with limited polyclonal IG/TR background, the cIT-QC makes up a large proportion of reads.
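The idea of a read-to-cell quantitation factor can be illustrated with a simplified calculation. The sketch assumes a spike-in with a known number of cell equivalents per rearrangement and a linear relation between reads and input cells; it is not the algorithm implemented in ARResT/Interrogate or Vidjil, which may, for example, combine several cIT-QC rearrangements per locus. All names and numbers are hypothetical.

```python
def reads_to_cells(marker_reads: int, spike_in_reads: int, spike_in_cells: float) -> float:
    """Estimate cell equivalents carrying a rearrangement from its read count.

    The spike-in defines how many cells one read represents in this library.
    """
    if spike_in_reads <= 0:
        raise ValueError("no spike-in reads detected: library cannot be normalized")
    cells_per_read = spike_in_cells / spike_in_reads
    return marker_reads * cells_per_read


def marker_abundance(marker_reads: int, spike_in_reads: int,
                     spike_in_cells: float, sample_cells: float) -> float:
    """Fraction of sample cells estimated to carry the marker rearrangement."""
    if sample_cells <= 0:
        raise ValueError("sample cell input must be > 0")
    return reads_to_cells(marker_reads, spike_in_reads, spike_in_cells) / sample_cells


# Hypothetical library: 40 spike-in cells gave 200 reads, the candidate marker
# gave 45000 reads, and the patient DNA input corresponds to about 15000 cells.
print(f"{marker_abundance(45_000, 200, 40, 15_000):.0%} of cells carry the marker")
```

With very few spike-in reads the factor becomes unreliable, which matches the low-coverage, high-infiltration situation described above.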

2.3 Choosing Markers for MRD and Optimization of the Clonal-Specific RQ-PCR Systems

There has been much debate on the (preferential) selection of the most specific and stable markers. In real-life practice, however, prioritization of markers is rarely an issue: to keep up with routine diagnostic throughput, usually all available (mono)clonal markers identified by Sanger sequencing are used for clonal-specific primer design and subsequent RQ-PCR optimization. Sequential testing of potential markers and primers is not preferred, as the total time spent on the entire selection-optimization process must fit within the diagnostic window for MRD monitoring. Markers are therefore mostly selected based on their real, rather than predicted, performance during the optimization process.

However, with the advent of NGS-based marker identification, more information is available on every marker. First, the real abundance of the clonal marker in the analyzed DNA sample can be estimated based on the cIT-QC and the background, and second, and perhaps most importantly, its specificity can be confirmed against a large dataset of IG/TR rearrangements from other patients and polyclonal samples. Detailed description of this is well beyond the scope of this chapter.

Ultimately, the real performance of the selected clonal marker-primer in RQ-PCR is the criterion for its use in MRD monitoring.

The EuroMRD (former ESG-MRD-ALL) consortium has established strict criteria for defining sensitivity and specificity of RQ-PCR systems [8].

Reaching adequate sensitivity and specificity based on EuroMRD criteria represents a QC of a well-designed RQ-PCR system per se. Similar rearrangements in normal B and T cells are the source of possible false-positivity, and background amplification is unavoidable in some markers. The extent of nonspecific amplification (NSA) depends on the involved genes and the number of inserted and deleted nucleotides in the junction. It has been estimated that NSA occurs in 35% of IGH markers and in more than 90% of TCRG markers [9]. IGK markers are also highly prone to NSA. IGK-KDE rearrangements are recommended as first-choice markers due to their stability, but based on our NGS data, the presence of highly similar rearrangements with resulting NSA is extremely high in polyclonal controls (unpublished data).

Therefore, it is mandatory to use adequate polyclonal controls. At least six wells of polyclonal DNA (preferably from the PB samples of at least 10 healthy donors) should be run together with the MRD samples in the RQ-PCR assay. Usually, 2–3 specific primers are tested for each monoclonal rearrangement, and two independent markers with the lowest NSA and sufficient sensitivity are selected and further optimized if needed. To reduce NSA, the RQ-PCR conditions can be adjusted slightly, i.e., by increasing the annealing temperature by 2–4 °C or by titrating the primer concentration, usually by decreasing the concentration of the clonal marker-specific primer.


### 3 Interpretation of RQ-PCR MRD Analysis Results

The EuroMRD consortium has also defined and published guidelines for the correct interpretation of RQ-PCR MRD monitoring results. These criteria not only reflect the potential biological issues of the approach but also the clinical relevance of the result.

Consequently, the criteria for MRD positivity were defined more strictly for situations where possible false-positivity would lead to unjustified treatment intensification. This is typically the situation of an emerging molecular relapse, most commonly during regular follow-ups after stem cell transplantation (SCT). In the opposite situation, i.e., when treatment reduction would be the outcome of a false-negative MRD result, the criteria are intentionally stricter toward negativity [8].

In summary, a sample is considered to be MRD positive in the context of therapy reduction (e.g., risk group stratification into a lower risk group) if:

	- and.

A sample is considered to be MRD positive in the context of therapy intensification (e.g., therapeutic intervention after SCT) if:

	- and.

3.1 Identification of False-Positive and False-Negative Results

In an intra-laboratory setting, a newly emerged low MRD positivity remains a diagnostic challenge. Before the era of NGS methods, the extent of false-positivity was assessed only indirectly. Van der Velden et al. retested low-positive samples from different timepoints of ALL using MRD assays designed for different (irrelevant) markers and concluded that NSA differs between timepoints and markers and is mostly present in IGH markers, with background amplification in PB (buffy coats) in post-maintenance treatment phases. Their study concluded that the background for IGH markers was lowest at the end of induction treatment (day 33) and that the EuroMRD criteria sufficiently excluded most of the false-positives [10].

Our group focused on MRD positivity during the post-SCT period. Starting 140 days post-SCT, we frequently observed positive results fulfilling the EuroMRD criteria for therapy intensification in patients who turned negative in subsequent examinations. Using indirect methods, we showed that these positives were nonspecific and that their occurrence correlated with the intense B cell regeneration that is typical post-SCT [11]. With the development of NGS-based MRD methods, we expanded the previous cohort and reanalyzed post-SCT RQ-PCR-positive samples by NGS. The vast majority of RQ-PCR-positive samples from patients who subsequently did not progress to hematological relapse were negative by NGS. NGS sequences of amplified physiological rearrangements were highly similar to ASO primer sequences, suggesting that RQ-PCR amplification was not specific [12].

Based on these data, we decided to recheck every MRD result post-SCT that was concluded by RQ-PCR to be "positive, nonquantifiable." The size of the nonspecific RQ-PCR products is usually different from the expected size of the amplified marker. Therefore, it is helpful to keep RQ-PCR products and check their size using the Agilent Bioanalyzer together with products of the standard curve dilution (usually 10<sup>-1</sup> and 10<sup>-4</sup>) as size standard and with buffy coats that previously showed positive signals. Based on our experience, up to 30–40% of low-positive (nonquantifiable) RQ-PCR results can be identified as false-positive, because the length of the "clonal-specific" product differs from its original size and overlaps with buffy coat amplification (unpublished data). In the remaining cases, the sizes of all products including buffy coat are in the same size range and thus cannot be distinguished. With IG/TR NGS available, it is possible to reevaluate the remaining positive RQ-PCR result via NGS. However, to ensure that NGS has the same (or better) sensitivity as RQ-PCR, it is crucial to test the sensitivity of NGS using the diluted sample (e.g., 10<sup>-4</sup>), preferentially in a separate NGS run to avoid sample cross-contamination.

### 4 Conclusion

MRD monitoring using an IG/TR-based quantitation method is an elegant and clinically relevant approach. However, as several steps are prone to technical and interpretational errors, adequate quality control measures must be included throughout the process. Some of the basic and more advanced tips have been listed in this chapter. On top of these, intra-laboratory procedures and interlaboratory measures can be introduced as well.

### Acknowledgments

This work was supported by the Ministry of Health of the Czech Republic, grant NV20-03-00284.

### References


2254–2265. https://doi.org/10.1038/s41375-019-0499-4


12. Kotrova M, Van der Velden VHJ, van Dongen JJM, Formankova R, Sedlacek P, Brüggemann M et al (2017) Next-generation sequencing indicates false-positive MRD results and better predicts prognosis after SCT in patients with childhood ALL. Bone Marrow Transplant 52(7):962–968. https://doi.org/10.1038/bmt.2017.16

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Chapter 7

### cfDNA-Based NGS IG Analysis in Lymphoma

### Christiane Pott, Michaela Kotrova, Nikos Darzentas, Monika Brüggemann, and Mouhamad Khouja, on behalf of the EuroClonality-NGS Working Group

### Abstract

Liquid biopsy is a novel diagnostic approach first developed to characterize the molecular profile of solid tumors by analyzing body fluids. For cancer patients, it represents a noninvasive way to monitor the status of the solid tumor with respect to representative biomarkers. There is growing interest in the use of circulating tumor DNA (ctDNA) analysis in the diagnostic and prognostic fields of lymphomas as well. Clonal immunoglobulin (IG) gene rearrangements are fingerprints of the respective lymphoid malignancy and thus are highly suited as specific molecular targets for minimal residual disease (MRD) detection. Tracing the clonal IG rearrangement patterns in the ctDNA pool during treatment can be used for MRD assessment in B-cell lymphomas. Here, we describe a reproducible next-generation sequencing assay to identify and characterize clonal IG gene rearrangements for MRD detection in cell-free DNA.

Key words Cell-free DNA, Plasma, Immunoglobulin rearrangements, Therapy monitoring, Liquid biopsy, Minimal residual disease, Digital droplet PCR, Next-generation sequencing

### 1 Introduction

Circulating cell-free DNA (cfDNA) is fragmented extracellular DNA, which is released from apoptotic and necrotic cells in small fragments of <200 bp [1]. cfDNA is typically isolated from the blood stream; however, it is also possible to detect cfDNA in other biological fluids such as urine or cerebrospinal fluid [2–6]. Interestingly, in cancer patients, a fraction of 0.01–60% of the total cfDNA consists of circulating tumor DNA (ctDNA), which originates from neoplastic lesions [7].

Because ctDNA shares the biological features of the cellular DNA of the tumor, such as point mutations, gene amplifications, and, in lymphoma, immunoglobulin (IG) and T-cell receptor (TR) gene rearrangements, it is very attractive as a noninvasive biopsy for diagnostic approaches and for monitoring minimal residual disease (MRD).

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_7, © The Author(s) 2022

The ideal markers for MRD detection in B-cell lymphomas are clonal IG gene rearrangements. IG heavy chain (IGH) gene rearrangements are frequently used as targets because of their unique junctional regions. However, IGH rearrangements may be less reliable because of somatic hypermutation (SHM), which occurs mainly in the V gene regions during B-cell development and maturation and may result in mismatches at primer binding sites [8]. Alternatively, incomplete IGHD-IGHJ rearrangements and IGK gene rearrangements can be used as MRD targets, as both rearrangement types are mainly unmutated. Incomplete rearrangements in the IGH locus do not contain SHM in the majority of cases, because transcription only starts from the promoters in the V genes [9]. The finding of hypermutation in a small proportion of incomplete DJH rearrangements suggests important biological implications concerning the process of SHM. IGK gene rearrangements can also be an important complementary MRD target: in rearrangements involving the kappa deletion element (Kde), no SHM can occur after Kde recombination, since the deletion of the JK-CK introns removes the IGK enhancer that is essential for SHM [10].

PCR-based methods such as allele-specific real-time quantitative PCR or digital droplet (dd) PCR targeting the clonal IG rearrangements are currently the gold standard for MRD quantification in cfDNA but are limited to a sensitivity of 1E-05 in a polyclonal B-cell background. Since ctDNA is present at a very low overall amount in the peripheral blood, highly sensitive technologies are needed to detect MRD in cfDNA. Next-generation sequencing (NGS) of IG rearrangements (IG-NGS) can overcome the limitation of PCR-based approaches, with potentially higher sensitivity.

MRD assessment in cfDNA by IG-NGS requires the identification of the lymphoma-associated clonotypes in diagnostic tumor tissue. Therefore, fresh or formalin-fixed paraffin-embedded (FFPE) lymph node material or diagnostic peripheral blood or bone marrow with sufficient tumor infiltration is required for the initial marker identification. The EuroClonality-NGS working group has recently shown that IGH and IGK rearrangements are highly suitable for detecting clonality in frozen and FFPE tissue specimens [11]. Due to the small fragment size of cfDNA (~166 bp) and the high frequency of SHM in the variable heavy framework region 3 (IGHV-FR3), the EuroClonality-NGS IGHV-FR3 multiplex PCR was redesigned and optimized for the specific requirements of MRD detection in cfDNA. The IGHD-IGHJ and IGK (IGKV-IGKJ, IGKV-KDE, and intron RSS-KDE) primer sets remained unchanged; only the reaction conditions and primer concentrations were modified to facilitate balanced amplification of all rearrangements in this type of material as well.

As illustrated in Fig. 1, using a one-step NGS PCR protocol, clonal IG rearrangements are amplified in cfDNA and combined with molecular barcodes and sequencing adapters. The amplicons bind the flow cell of the Illumina MiSeq through the introduced adapters and are sequenced by synthesis. A standardized bioinformatic analysis of the high-throughput sequencing data allows the verification of clonal IG rearrangements and their precise quantitation. The bioinformatic platform ARResT/Interrogate (at http://arrest.tools/interrogate/), developed within the EuroClonality-NGS working group, allows identification of clonotypes and MRD follow-up in the same workflow.

The EuroClonality-NGS "central intra-tube quality/quantification control" (cIT-QC), comprising known copy numbers of clonal rearrangements, is added to each reaction to enable the quantification of ctDNA as a fraction of cfDNA and the correction of potential amplification biases. The cIT-QC is used to calculate the coverage of each single rearrangement copy in order to calculate the read coverage per cell and to determine the MRD level. The utility of this approach has been published recently [12].
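The spike-in normalization described above can be illustrated with a minimal sketch. It is not the ARResT/Interrogate implementation: the function names, the use of a single pooled reads-per-copy value, and the `input_genome_copies` argument (for example, a copy number measured by ddPCR) are assumptions made for illustration only.

```python
def reads_per_copy(cit_qc_reads: int, cit_qc_copies: int) -> float:
    """Average number of reads obtained per spiked-in cIT-QC copy."""
    if cit_qc_copies <= 0:
        raise ValueError("cIT-QC copy number must be positive")
    return cit_qc_reads / cit_qc_copies


def clonotype_fraction(clonotype_reads: int,
                       cit_qc_reads: int,
                       cit_qc_copies: int,
                       input_genome_copies: float) -> float:
    """Estimate the clonotype level as a fraction of the analyzed cfDNA.

    Assumption: reads scale linearly with template copies, so the
    clonotype copy number is its read count divided by reads-per-copy.
    """
    rpc = reads_per_copy(cit_qc_reads, cit_qc_copies)
    if rpc == 0:
        raise ValueError("no cIT-QC reads: library cannot be normalized")
    if input_genome_copies <= 0:
        raise ValueError("input copy number must be positive")
    clonotype_copies = clonotype_reads / rpc
    return clonotype_copies / input_genome_copies


# Hypothetical numbers: 2000 cIT-QC reads from 40 spiked copies,
# 500 clonotype reads, 10,000 genome copies of cfDNA in the reaction.
level = clonotype_fraction(500, 2000, 40, 10_000)
```

With these hypothetical inputs, each spiked copy yields 50 reads, so 500 clonotype reads correspond to about 10 copies, i.e., a level of 1E-03.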

Here we provide detailed instructions on amplicon sequencing of clonal IG rearrangements in cfDNA using modified EuroClonality-NGS protocols for IGH (VJ + DJ) and IGK (VJ + intron-Kde/V-Kde) (http://www.euroclonality.org/protocols/). The process of marker identification in FFPE samples or diagnostic bone marrow or peripheral blood is not part of this chapter; for that we refer to the publication of Scheijen et al. [11].

### 2 Materials

2.1 Sample Collection

Solutions must be prepared with double-distilled water (supplied as ultrapure water or purified to a resistivity of 18 MΩ·cm at 25 °C). All reagents should be stored at 18–25 °C unless otherwise indicated.

	- 2. QIAvac 24 Plus vacuum pump (QIAGEN) or any vacuum pump capable of producing a pressure of 800 to 900 mbar. Alternatively, use the Maxwell® RSC instrument (Promega) with the corresponding Maxwell® RSC ccfDNA Plasma Kit (Promega) for automated extraction.

Fig. 1 Schematic representation of the cfDNA-based NGS IG rearrangement analysis in lymphoma. Adapted from "Next Generation Sequencing (Illumina)," by BioRender.com (2021). Retrieved from https://app.biorender.com/biorender-templates


### Table 1

List of target primers used for quantification using digital droplet PCR as described previously [14]

6. DG8 Gaskets for QX200 Droplet Generator (Bio-Rad).

7. QX200™ Droplet Digital PCR system (Bio-Rad).


### 2.4 One-Step Next-Generation

1. FastStart™ High Fidelity reaction buffer (Roche) w/o MgCl2.

2. FastStart™ High Fidelity Taq polymerase (Roche).


1. Gel electrophoresis chamber.


### 3 Methods

3.1 Sample Preparation

All experimental procedures should be carried out at room temperature unless otherwise indicated.

1. Centrifuge blood collection tubes at 2000 × g for 10 min. If using S-Monovette® EDTA tubes, samples should be processed within 4 h of collection.

2. Carefully transfer the supernatant (plasma) into a 5 ml tube with a conical bottom without disturbing the buffy coat.

Table 2

List of target primer pools used for library preparation. Sequencing binding adapter (violet), barcoding variable sequence (XXXXXXXX), sequencing primer (blue), target-specific sequence (green)


	- 1. Prior to starting the extraction the following should be done:
		- (a) Equilibrate samples and buffers to room temperature (18–25 °C).
		- (b) Heat a water bath or heating block to 60 °C for use with 50 ml tubes.
		- (c) Heat a heating block to 56 °C for use with 2 ml Eppendorf tubes.
	- 2. Pipet 400 μl QIAGEN Proteinase K into a pre-labeled 50 ml tube.
	- 3. Add 4 ml plasma to the tube.
	- 4. Add 3.2 ml buffer ACL to the tube. Mix well by vortexing for 30 s.
	- 5. Incubate for 30 min at 60 °C.
	- 6. Add 7.2 ml buffer ACB, mix well by vortexing for 15–30 s.
	- 7. Incubate the mixture for 5 min on ice.
	- 8. Insert the QIAamp Mini column into the VacConnector on the QIAvac 24 Plus. Insert a 20 ml tube extender into the open QIAamp Mini column. Make sure that the tube extender is firmly inserted into the QIAamp Mini column to avoid leakage of the sample.
	- 9. Carefully pour the mixture from step 6 into the tube extender of the QIAamp Mini column. Set the vacuum pump to produce a vacuum of 800 mbar to 900 mbar until all lysates are drawn through (takes up to 15 min) (see Note 3).
	- 10. Release the pressure to 0 mbar, discard the tube extenders carefully and leave the QIAamp Mini columns attached to the VacConnector on the QIAvac 24 Plus.
	- 11. Add 600 μl washing buffer ACW1 to the QIAamp Mini column. Switch on the vacuum pump (800 mbar to 900 mbar) while the lid is open (see Note 3). After the entire washing buffer has been drawn through the column, switch the vacuum pump off and release the pressure to 0 mbar.
	- 12. Add 750 μl washing buffer ACW2 to the QIAamp Mini column. Switch on the vacuum pump (800 mbar to 900 mbar) while the lid is open (see Note 3). After the entire washing buffer has been drawn through the column, switch the vacuum pump off and release the pressure to 0 mbar.
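When less than 4 ml of plasma is available, the lysis reagents in steps 2–6 need to be adjusted. The sketch below simply scales the volumes stated above in proportion to the plasma volume. Linear scaling is an assumption and should be checked against the extraction kit handbook; the function and dictionary names are hypothetical, and the wash buffer volumes are deliberately left out because they are given per column.

```python
# Volumes stated in the protocol for 4 ml of plasma (in ml).
REFERENCE_PLASMA_ML = 4.0
REFERENCE_VOLUMES_ML = {
    "proteinase_k": 0.4,  # 400 µl QIAGEN Proteinase K
    "buffer_acl": 3.2,
    "buffer_acb": 7.2,
}


def scale_lysis_volumes(plasma_ml: float) -> dict:
    """Scale lysis reagent volumes linearly with the plasma volume.

    Assumption: reagent-to-plasma ratios stay constant; verify against
    the kit handbook before use.
    """
    if plasma_ml <= 0:
        raise ValueError("plasma volume must be positive")
    factor = plasma_ml / REFERENCE_PLASMA_ML
    return {name: round(vol * factor, 3)
            for name, vol in REFERENCE_VOLUMES_ML.items()}


# Example: reagent volumes (ml) for 2 ml of plasma.
volumes_for_2_ml = scale_lysis_volumes(2.0)
```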


3.3 Digital Droplet PCR (ddPCR)-Mediated Copy Number Quantification


### 3.4 NGS Library Preparation

For library preparation, keep the reagents on ice until use. Use precooled racks for preparing the reaction mixture. Alternatively, the reaction can be prepared with the tubes placed on ice. Use the EuroClonality-NGS cIT-QC with known copy numbers of clonal rearrangements in order to calculate the read coverage per cell (see Note 9 and [12]).




### Table 4 Thermal cycler profiles of targeted PCR reactions


Fig. 2 PCR products analyzed by agarose gel electrophoresis. The target-specific band size is around 250 bp in IGH-VJ, 300 bp in IGH-DJ, and 300 bp in IGK. Amplification reactions using buffy coat and distilled water served as positive and negative controls, respectively

and repeat until the agarose completely dissolves. Allow the mixture to cool to <60 °C. Add 1:10,000 Gel-Red, mix gently, pour the dissolved agarose into the casting form, put the comb in place, and wait until the gel sets (around 30 min).


280 bp to 290 bp for IGH-VJ (cfFR3), 250–270 bp for IGH-DJ, and around 300 bp for IGK.

7. Using the fragment length and the DNA concentration (ng/μl), convert the DNA concentration (ng/μl) to nM using the following formula:

$$\text{Concentration (nM)} = \frac{\text{Concentration (ng/}\mu\text{l)}}{\text{DNA fragment size (bp)} \times 650} \times 1{,}000{,}000$$
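The conversion in step 7 can be sketched as a small helper. The function name is hypothetical. The default of 650 g/mol per base pair is the average mass used in the formula above.

```python
def ng_per_ul_to_nM(conc_ng_per_ul: float,
                    fragment_size_bp: float,
                    avg_bp_mass: float = 650.0) -> float:
    """Convert a dsDNA library concentration from ng/µl to nM.

    nM = ng/µl / (fragment size in bp x average mass per bp) x 1,000,000
    """
    if fragment_size_bp <= 0:
        raise ValueError("fragment size must be positive")
    return conc_ng_per_ul / (fragment_size_bp * avg_bp_mass) * 1_000_000


# Example: a 300 bp library measured at 2 ng/µl is roughly 10.3 nM.
library_nM = ng_per_ul_to_nM(2.0, 300)
```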


3.5 Next-Generation Sequencing

The Illumina MiSeq instrument must be set up correctly before use. Post-run and maintenance wash procedures must be performed after each run.



7. Export the data for demultiplexing and processing.

3.6 Bioinformatic Analysis

Output data are analyzed using the previously described bioinformatic platform ARResT/Interrogate [12, 13]; see below.



Fig. 3 Messages and widgets related to cIT-QC (spike-ins)

10. Additional relevant widgets and messages should now appear in "questions" (hover over the "?" tooltip anchors for details). To see normalized abundances, check the "use" box (Fig. 3).

### 4 Notes



Fig. 4 Representation of the "minitable" with the selected clonotypes and full nt sequences

each reaction for the quantification of ctDNA as fraction of cfDNA and for correction of potential amplification biases [12].


### Acknowledgments

FKZ 01KT1807 TRANSCAN V-NOVEL by BMBF.

### References


T-cell receptor gene recombinations in suspect lymphoproliferations: report of the BIOMED-2 concerted action BMH4-CT98-3936. Leukemia 17:2257–2317



# Targeted Locus Amplification as Marker Screening Approach to Detect Immunoglobulin (IG) Translocations in B-Cell Non-Hodgkin Lymphomas

### Elisa Genuardi, Beatrice Alessandria, Aurora Maria Civita, and Simone Ferrero

### Abstract

Although MRD monitoring by the classic polymerase chain reaction (PCR) approach is a powerful outcome predictor, about 20% of mantle cell lymphoma (MCL) and 50% of follicular lymphoma (FL) patients still lack a molecular marker and are therefore not eligible for MRD monitoring. Targeted locus amplification (TLA), a new NGS technology, has proven to be a feasible marker screening approach able to identify uncommon B-cell leukemia/lymphoma 1 (BCL1) and B-cell leukemia/lymphoma 2 (BCL2) rearrangements in MCL and FL cases defined as having "no marker" by the classic PCR approach.

Key words Mantle cell lymphoma, Follicular lymphoma, Immunoglobulin, Translocations, Molecular marker, Next-generation sequencing, Targeted locus amplification

### 1 Introduction

Mantle cell lymphoma (MCL) and follicular lymphoma (FL) are non-Hodgkin lymphomas with an aggressive and an indolent clinical course, respectively [1]. Despite the high success rate of modern immunotherapies in the treatment of these patients, relapse at a variable time from disease presentation is still the rule, and the consequent acquisition of more aggressive behavior over time is common [2, 3]. Therefore, it is crucial to track the disease course by highly sensitive minimal residual disease (MRD) approaches, both to assess treatment efficacy and to identify patients at risk of relapse early [4]. In the last decade several prospective clinical trials revealed MRD as a strong outcome predictor in both MCL and FL [5–8].

Chromosomal translocations, which juxtapose oncogenes to the immunoglobulin (IG) regions, are ideal molecular markers for MRD in mature B-cell lymphoproliferative diseases. In detail, MCL and FL are characterized by chromosomal translocations that transpose the B-cell leukemia/lymphoma 1 (BCL1) and B-cell leukemia/lymphoma 2 (BCL2) genes, respectively, near the IG heavy chain (IGH) regions; t(11;14) and t(14;18) result in the overexpression of cyclin D1 (CCND1) and BCL2 proteins and in the constitutive activation of proliferative and antiapoptotic cellular pathways, respectively [9].

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_8, © The Author(s) 2022

Indeed, fluorescence in situ hybridization (FISH) has shown that almost 90% of MCL and 80% of FL cases harbor the translocation in the diagnostic tissues, but this technology is not sensitive enough to monitor MRD in follow-up samples [7, 10].

On the other hand, polymerase chain reaction (PCR), which can detect as few as one clonal cell among 100,000 analyzed cells, can overcome this limitation. Currently, due to its high level of international standardization, it represents the gold-standard approach for MRD monitoring in MCL and FL [7, 11].

The well-known t(11;14) and t(14;18) breakpoints involve, respectively, (a) the major translocation cluster (MTC), between the BCL1 region at 11q13 and the IGH locus at 14q32, and (b) the major breakpoint region (MBR) and the minor cluster region (mcr), between the BCL2 gene at 18q21 and the IGH region at 14q32.3. In FL, the MBR is most frequently involved (80% of the identified rearrangements), while the mcr is less frequently identified (~15%) [8]; some rare (<5% of cases) "minor" breakpoints involving regions 3′ and 5′ of the MBR and mcr, named 3′ MBR, 5′ mcr, and distal MBR, have also been described [12].

Moreover, nucleotides, also called N insertions, are randomly added between the juxtaposed chromosomal regions, establishing the tumor "fingerprint-like sequences" that are essential for designing allele-specific oligonucleotide (ASO) assays for MRD monitoring [13].

Although classic PCR approaches for marker screening and MRD monitoring have been defined and standardized within the EuroClonality-NGS and EuroMRD working groups (www.euroclonality.org), about 20% of MCL and 50% of FL patients still lack a molecular marker and are therefore not eligible for MRD monitoring.

In the last few years, IGH amplicon-based next-generation sequencing (NGS) applications successfully provided new scenarios in several hematological diseases, such as acute lymphoblastic leukemia [14, 15], multiple myeloma [16, 17], and different lymphomas [18, 19] describing those NGS approaches as feasible tools for marker identification and MRD monitoring, allowing clinical correlations in large patient populations.

NGS capture panels have also proved useful for the detection of multiple molecular targets, but their limited sensitivity hampers their application in the MRD context [20, 21].

Targeted locus amplification (TLA), an NGS-based technology first developed in 2014 by Cergentis B.V., allows the detection of structural variants not identified by classic PCR methods. The TLA protocol differs from NGS capture approaches in that it is based on the principle of cross-linking genomic regions that lie in physical proximity. Moreover, by employing the targeted enrichment of short, locus-specific sequences, it results in the sequencing of all single nucleotide variants (SNVs) and structural variants, such as chromosomal translocations [22, 23].

Since its first publication, the TLA approach has been employed in different contexts such as transgene detection, vector design, and novel SNV identification, making it a promising technology for onco-hematology as well [24–26]. In this context, the application of a multiplex TLA as a marker screening tool showed promising results in acute leukemia through the detection of cryptic rearrangements and multiple (un)known translocated genes involved in leukemia pathogenesis [27, 28].

Recently, the implementation of TLA targeting the fusion partners of the IGH enhancer revealed novel, uncommon BCL1 and BCL2 rearrangements in MCL and FL patients lacking an MRD molecular marker by the classic PCR marker screening approach [29]. The newly identified TLA rearrangements allowed the design of highly sensitive ASO MRD assays (up to 1E-05), thus opening the way to using this NGS technology to increase the number of lymphoma patients eligible for MRD monitoring in clinical trials.

Here we provide a detailed description of the TLA protocol as marker screening tool in MCL and FL patients, followed by an ASO MRD assay based on the TLA sequence (Fig. 1).

### 2 Materials

2.1 Reagents and Kits


Fig. 1 TLA library preparation workflow


### 2.2 Instruments and Software


### 3 Methods

3.1 Mantle Cell Lymphoma and Follicular Lymphoma Cell Collection

Collect bone marrow (BM) and peripheral blood (PB) samples in EDTA vacutainers, ranging from 2 to 7 ml and from 7 to 14 ml for BM and PB, respectively. Next, the red blood cell (RBC) lysis procedure is carried out to collect total white blood cells (WBC), as follows:


### 3.2 gDNA Extraction and Quality Control

gDNA is extracted from BM and PB dry cell pellets of 5–10 × 10<sup>6</sup> cells (see Note 1). High-purity gDNA is obtained using semiautomated or automated DNA extraction procedures, avoiding DNA cross-contamination among the samples. Here we describe the Maxwell® Rapid Sample Concentrator (RSC) Blood protocol.


### Table 1 Control gene PCR mix


### Table 2 Control gene thermal profile


After extraction, gDNA quantity (ng/μl) and quality (OD ratios A260/A280 and A260/A230) are evaluated with the NanoDrop 2000 Spectrophotometer (Thermo Scientific, Waltham, MA, USA). gDNA is stored at 4 °C or −20 °C until library preparation.

Next, a control gene (P53 exon 8) amplification is performed to further qualitatively check the gDNA [30].


Run the PCR products on a 2% agarose gel. The P53 exon 8 amplification signal should appear as a 150 bp band; samples without any signal should not be considered for TLA library preparation.

### 3.3 Targeted Locus Amplification (TLA) Library Preparation


TLA consists of different steps that allow gDNA cross-linking and circularization, followed by IGH target enrichment. The protocol outline takes 4 workdays; for more detailed information and technical support, please refer to Cergentis (www.Cergentis.com).

Day 1:


Day 2:

- Second enzymatic digestion and ligation: the large gDNA fragments are digested to obtain molecules suited for PCR amplification and then circularized.

Day 3:

- TLA PCR: the circularized gDNA molecules are amplified using the IGH enhancer complementary primer.

Day 4:


### 3.3.1 Assembly and Fixation

At least 5 μg of gDNA (see Note 2) is required for TLA library preparation.



Table 3.

### Table 3 TLA PCR mix


### Table 4 TLA PCR thermal profile



### 3.3.5 TLA Library Indexing

TLA library indexing is performed with the Nextera DNA Flex Library Prep kit (Illumina), which uses bead-linked transposomes to fragment and tag the TLA PCR products with adapter sequences, according to the Illumina protocol.

1. TLA library preparation is performed using at least 50–100 ng of TLA PCR products as starting material.

Fig. 2 TLA library profiles obtained using High sensitivity D1000 ScreenTape (Agilent)


Then, the ASO MRD assay based on the TLA sequence is designed as follows:

	- (a) Primer Tm ranges between 58 C and 62 C.
	- (b) Probe Tm is 10 C higher than primer Tm.
	- (c) It is recommended that the primer GC content is 40–60%.
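The three design criteria above can be checked programmatically. The following is a minimal sketch, not part of the published protocol: the function names, the example sequence, and the 1 °C tolerance on the probe/primer Tm difference are illustrative assumptions, and the Tm values are expected to come from whatever primer design software is in use.

```python
def gc_percent(seq: str) -> float:
    """Return the GC content of a nucleotide sequence as a percentage."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def check_aso_design(primer_seq: str, primer_tm: float, probe_tm: float,
                     tm_gap_tolerance: float = 1.0) -> dict:
    """Check an ASO primer/probe pair against criteria (a) to (c).

    Tm values are supplied by the caller; no Tm model is assumed here.
    tm_gap_tolerance is an illustrative assumption, not a protocol value.
    """
    gc = gc_percent(primer_seq)
    return {
        # (a) primer Tm between 58 and 62 degrees C
        "primer_tm_ok": 58.0 <= primer_tm <= 62.0,
        # (b) probe Tm about 10 degrees C above primer Tm
        "probe_tm_ok": abs((probe_tm - primer_tm) - 10.0) <= tm_gap_tolerance,
        # (c) primer GC content between 40 and 60 percent
        "primer_gc_ok": 40.0 <= gc <= 60.0,
        "primer_gc_percent": gc,
    }


# Hypothetical sequence and Tm values, for illustration only
report = check_aso_design("ATGCATGCAT", primer_tm=60.0, probe_tm=70.0)
print(report)
```

A design that fails any check is a candidate for redesign rather than an automatic rejection, since the GC content range is given as a recommendation.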

Tables 5 and 6 list the JH consensus primer and probe available for TLA validation, while the ASO forward MRD assay design is detailed in Fig. 3.

### Table 5 JH consensus reverse primer used in ASO forward MRD assay


### Table 6 JH consensus probe used in ASO forward MRD assay


TLA-BCL1 or TLA-BCL2 assay validation is performed using a highly sensitive quantification approach such as ASO quantitative PCR (ASO qPCR) [13]. A tenfold standard curve is set up starting from 500 ng of the BM and/or PB diagnostic sample, serially diluted in pooled polyclonal healthy gDNA or in gDNA from a cell line that does not carry the t(11;14) or t(18;14) translocation.

TLA-BCL1 or TLA-BCL2 is confirmed as an MRD molecular marker if the validation experiment achieves a sensitivity that allows the identification of 1 clonal cell among 100,000 analyzed cells, as defined by the EuroMRD guidelines for qPCR data interpretation [11].
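The arithmetic behind the tenfold standard curve and the 1-in-100,000 sensitivity criterion can be sketched as follows. This is an illustration only: the number of dilution points and the function names are assumptions, and the actual interpretation of positivity at each point follows the EuroMRD guidelines, which are not modeled here.

```python
from fractions import Fraction


def tenfold_series(start_ng: float, levels: int) -> list:
    """Return (dilution, ng of diagnostic DNA) for each standard curve point.

    Point 0 is the undiluted diagnostic sample; each next point is tenfold lower.
    """
    return [(Fraction(1, 10 ** i), start_ng / 10 ** i) for i in range(levels)]


def meets_sensitivity(lowest_positive_dilution: Fraction,
                      target: Fraction = Fraction(1, 100_000)) -> bool:
    """True if the lowest reproducibly positive dilution reaches the target."""
    return lowest_positive_dilution <= target


# Six points (assumed) starting from 500 ng of diagnostic DNA
for dilution, ng in tenfold_series(500, 6):
    print(f"dilution {dilution}: {ng:g} ng diagnostic DNA")
```

Exact fractions are used for the dilution factors so that the comparison with the 1/100,000 target does not depend on floating-point rounding.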

### 4 Notes


### References


Enhanced CHO clone screening: application of targeted locus amplification and nextgeneration sequencing technologies for cell line development. Biotechnol J 14:e1800371


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Immunoglobulin/T Cell Receptor Capture Strategy for Comprehensive Immunogenetics

### James Peter Stewart, Jana Gazdova, Shambhavi Srivastava, Julia Revolta, Louise Harewood, Manisha Maurya, Nikos Darzentas, and David Gonzalez

### Abstract

In the era of genomic medicine, targeted next generation sequencing (NGS) strategies are increasingly being adopted by clinical molecular diagnostic laboratories to identify genetic diagnostic and prognostic biomarkers in hemato-oncology. We describe the EuroClonality-NGS DNA Capture (EuroClonality-NDC) assay, which is designed to simultaneously detect B and T cell clonal rearrangements, translocations, copy number alterations, and sequence variants. The accompanying validated bioinformatics pipeline enables production of an integrated report. The combination of the laboratory protocol and bioinformatics pipeline in the EuroClonality-NDC minimizes the potential for human error, reduces economic costs compared to current molecular testing strategies, and should improve diagnostic outcomes.

Key words EuroClonality, Next generation sequencing, BIOMED-2, Immunoglobulin, T cell receptor, Copy number alteration, Translocation, Lymphoma

### 1 Introduction

Lymphoproliferative disorders (LPD) can be classified based on multiple parameters, including morphology, immunophenotyping, and genetic analysis. While a large number of lymphoproliferative disorders can be classified solely by assessment of morphology and immunophenotyping, there is an increasing role for the evaluation of genetic features, as evidenced by the publication of the updated WHO guidelines in 2016 [1]. The updated guidelines include a vast array of genomic alterations that can significantly improve the diagnostic criteria and the prognostic relevance of existing entities and have led to the introduction of new disease entities. This may result in a change in practice in clinical laboratories, with the validation of multiple molecular tests covering the required genetic alterations.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_9, © The Author(s) 2022

As all cells within a tumor are presumed to arise from a common clonal progenitor, lymphoid malignancies should exhibit clonal rearrangements of the immunoglobulin (IG) and/or T cell receptor (TR) loci. Detection of a clonal IG/TR rearrangement, which can aid in differentiating between a clonal B/T cell proliferation and a reactive hyperplasia, is typically performed using a PCR-based method with primers designed, optimized, and validated during the EuroClonality BIOMED-2 study [2, 3]. PCR products are commonly analyzed using either capillary electrophoresis (i.e., GeneScan) or heteroduplex analysis on polyacrylamide gels. Analysis of clonality using these detection methods is prone to subjectivity, particularly in cases with low tumor infiltration, where it can be difficult to distinguish a clonal peak within a polyclonal background, or for targets with limited complementarity-determining region 3 (CDR3) variability (e.g., IGKV-J rearrangements). Interpretation of the results is also subject to confounding variables such as the impact of somatic hypermutation (SHM) on detection of IGH, IGK, and IGL rearrangements. The presence of mutations within the binding sites of the PCR primers can prevent annealing, leading to false-negative results, which can be addressed in the majority of cases by performing PCR for alternate IG targets that are less prone to SHM, such as IGH D-J and IGK Kde rearrangements [3].

Translocations are another genetic alteration tested for in molecular pathology laboratories, as they are a hallmark of specific entities within non-Hodgkin lymphoma (NHL) as well as acute lymphoblastic leukemia (ALL) or plasma cell myeloma (PCM), among others. Translocations involving the IG/TR loci such as t(8;14)(q24;q32) in Burkitt lymphoma (BL), t(14;18)(q32;q21) in follicular lymphoma (FL), t(11;14)(q13;q32) in mantle cell lymphoma (MCL), and ALK translocations (to various partners such as NPM1, ATIC, and RANBP2) in anaplastic large cell lymphoma (ALCL) are part of the diagnostic testing regime for those particular lymphomas. Currently, translocations are commonly assessed by FISH or PCR methods, although a number of tests are often required, particularly in B-NHL types, to encompass different IG loci (IGH, IGK, and IGL) and to accommodate the large region where translocation breakpoints can occur. Multiple tests are often required to accurately define double/triple hit lymphomas, as the 2016 WHO guidelines established a new classification of high-grade B-NHL based on the presence of a MYC translocation along with BCL2 and/or BCL6 translocations.

Single-nucleotide DNA alterations and/or small insertions or deletions, traditionally detected using Sanger sequencing and more recently by amplicon and capture targeted NGS, can aid in diagnostic and prognostic classification of the disease [4–6]. Mutations can often show a higher frequency in particular LPD subtypes, such as mutations of TCF3 or ID3, which have been reported in 70% of sporadic BL, mutations of MYD88 in >90% of Waldenström's macroglobulinemia, or mutations of TET2, IDH2, and RHOA in a large percentage of angioimmunoblastic T cell lymphoma (AITL). From several sequencing studies, specific mutation profiles can define molecular subtypes, such as in diffuse large B cell lymphoma (DLBCL), where specific mutations are associated with the germinal center B-cell (GCB) and activated B cell (ABC) subtypes, which have pronounced survival differences with standard chemotherapy [7–9]. The presence of TP53 mutations in chronic lymphocytic leukemia (CLL) has been shown to be an independent prognostic factor and predictor of chemotherapy refractoriness [10]; similarly, NOTCH1 and SF3B1 mutations can be independent prognostic markers in CLL [11, 12]. Activating mutations of NOTCH1 are observed in approximately 60% of T-ALL cases and are reported to be associated with shorter survival in adults [13, 14].

Finally, copy number alterations (CNA) are also prevalent in LPD and can be associated with the underlying biology, with 17p deletion in CLL and PCM being associated with a less favorable outcome. The European Research Initiative on Chronic Lymphocytic Leukemia (ERIC) recommends analysis of del(17p) and TP53 gene mutations as an integral part of routine diagnostics for CLL patients requiring treatment [15].

The overarching objective of the EuroClonality-NDC is to enable a single NGS test to integrate genomic analyses that are currently performed by a number of molecular testing strategies. As part of the EuroClonality-NGS working group, we have developed the EuroClonality-NGS DNA capture assay (EuroClonality-NDC) to detect clinically relevant genetic alterations in LPD using a capture-hybridization approach. To achieve this objective, EuroClonality-NDC was designed to capture all functional variable (V), diversity (D), and joining (J) genes of the IG and TR loci along with additional probes to identify structural variants (SV) in the form of chromosomal translocations and detect CNA and somatic mutations. The accompanying purpose-built bioinformatics pipeline, ARResT/Interrogate, which was originally developed for amplicon assays, was customized and validated for the EuroClonality-NDC [16]. An optimized standard operating procedure (SOP), which has undergone a multi-site validation, ensures robust assay performance [17]. The development and validation of both the EuroClonality-NDC capture panel and the bioinformatics platform provides an end-to-end workflow which minimizes subjective interpretation of results. The methods detailed in this chapter relate to an updated version of the SOP to reflect recent improvements in library preparation and target enrichment.

### 2 Materials


### 2.5 Sequencing of Enriched DNA Library

The following products from Illumina, Inc. (San Diego, CA, USA) are required:


### 3 Methods

### 3.1 Genomic DNA Evaluation and Preparation for DNA Library Generation

	- 2. The gDNA concentration is assessed using the Qubit broad range assay. Manufacturer guidelines are followed with two modifications: (1) the standard/sample is added to the Qubit assay tubes first followed by the Qubit working solution, and (2) the incubation time prior to reading the standard/sample is 20 min.
	- 3. The gDNA integrity assessment is performed using the Genomic DNA ScreenTape Assay. Manufacturer guidelines are followed without any modifications.
	- 4. For the EuroClonality-NDC protocol, a positive control, a no template control (NTC), and 22 samples are processed in each batch (see Note 2). In well A1 of a 96-well PCR plate, place 100 ng of the positive control in a total of 35 μL, and in well A2, place 35 μL of the NTC.
	- 5. For the EuroClonality-NDC assay, 100 ng of high-molecular-weight genomic DNA is required. For genomic DNA extracted from formalin-fixed samples, 100 ng (average fragment size >1000 bp) or 200 ng (average fragment size <1000 bp) is used. In all cases the DNA is prepared in a total of 35 μL. Each sample to be prepared should be placed into a separate well of a 96-well PCR plate (see Note 3).
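The input rules in steps 4 and 5 can be turned into a small helper that returns the required DNA mass and the volumes needed to reach 35 μL. This is a sketch under stated assumptions: the function names are hypothetical, the diluent is not specified by the helper, and an average fragment size of exactly 1000 bp, which the protocol does not address, is treated here as the lower-quality case.

```python
from typing import Optional, Tuple

FINAL_VOLUME_UL = 35.0  # total volume per well stated in the protocol


def ndc_input_ng(formalin_fixed: bool,
                 avg_fragment_bp: Optional[float] = None) -> int:
    """Return the DNA input (ng) for one EuroClonality-NDC library."""
    if not formalin_fixed:
        return 100  # high-molecular-weight genomic DNA
    if avg_fragment_bp is None:
        raise ValueError("average fragment size is needed for formalin-fixed DNA")
    # Assumption: exactly 1000 bp is handled like fragments below 1000 bp
    return 100 if avg_fragment_bp > 1000 else 200


def dilution_volumes(input_ng: float, conc_ng_per_ul: float) -> Tuple[float, float]:
    """Return (DNA volume, diluent volume) in microliters for a 35 uL well."""
    dna_ul = input_ng / conc_ng_per_ul
    if dna_ul > FINAL_VOLUME_UL:
        raise ValueError("sample too dilute to fit the input in 35 uL")
    return dna_ul, FINAL_VOLUME_UL - dna_ul


needed = ndc_input_ng(formalin_fixed=True, avg_fragment_bp=600)
print(needed, dilution_volumes(needed, conc_ng_per_ul=20.0))
```

The helper raises an error when the measured concentration is too low to deliver the required mass in 35 μL, which is the point at which a sample would need to be concentrated or re-extracted.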

### 3.2 DNA Library Generation

	- (a) KAPA Frag Buffer (10×).
	- (b) End Repair & A-Tailing Buffer.
	- (c) Ligation Buffer.
	- (d) KAPA HiFi HotStart ReadyMix (2×).
	- (e) Library Amplification Primer Mix (10×).



### Table 2 End repair and A-tailing buffer program




Step | Temperature (°C) | Time (min) | Heated lid (°C)

to disturb the pellet.

### Table 3 Adapter ligation program


### Table 4 Pre-capture PCR amplification program



### 3.4 DNA Hybridization

	- 2. Remove the KAPA HyperPure beads from 4 °C and allow to equilibrate to room temperature for 30 min.
	- 3. For the EuroClonality-NDC protocol, 22 clinical samples are pooled, in equal amounts, into one hybridization reaction to achieve a total of 1.5 μg of DNA (i.e., 68.2 ng of each individual library). To achieve this, calculate the volume of each library that contains 68.2 ng. For the NTC, which should not have a measurable DNA concentration, add a volume equal to the average of the volumes added for the 22 sample libraries (see Note 14).
	- 4. Label a LoBind DNA 1.5 mL tube and add the required volume of each of the 22 libraries and the NTC to this tube. Calculate the total volume of the 22 pooled libraries plus the NTC library. If the total volume of libraries is <45 μL (i.e., libraries), then PCR grade water is added to adjust volume to 45 μL.
	- 5. To the pooled libraries, add 20 μL COT Human DNA. Vortex gently before spinning down briefly.
	- 6. Calculate the total volume of the 22 pooled libraries, the NTC library plus the 20 μL of COT DNA. The volume of beads required in the next step is 2× this total volume (i.e., if the total volume was calculated to be 75 μL, then 150 μL KAPA HyperPure beads will be required).
	- 7. Vortex the KAPA HyperPure beads until a homogenous solution is achieved.
	- 8. To the pooled libraries, add the volume of KAPA HyperPure beads calculated in step 6. Seal the tube and vortex vigorously for 10 s.
	- 9. Incubate the bead/sample mixture for 10 min at room temperature to allow the pooled libraries and COT Human DNA to bind to the beads.
	- 10. Place samples onto a magnetic stand and wait approximately 3 min for the solution to clear (see Note 10).
	- 11. Carefully remove and discard the supernatant taking care not to disturb the pellet.
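The pooling calculations in steps 3 to 6 can be sketched as a single function. The function and key names are hypothetical, and one point is an assumption rather than a protocol statement: any water added to reach 45 μL is counted in the total volume used to compute the bead volume, since it is physically in the tube.

```python
def pooling_plan(conc_ng_per_ul: dict, total_ng: float = 1500.0,
                 cot_ul: float = 20.0, min_pool_ul: float = 45.0) -> dict:
    """Compute per-library volumes, NTC volume, water, and bead volume.

    conc_ng_per_ul maps a library name to its measured concentration.
    """
    if not conc_ng_per_ul:
        raise ValueError("at least one library is required")
    if any(c <= 0 for c in conc_ng_per_ul.values()):
        raise ValueError("library concentrations must be positive")
    n = len(conc_ng_per_ul)
    per_library_ng = total_ng / n  # 1500 ng / 22 libraries is about 68.2 ng
    volumes = {name: per_library_ng / c for name, c in conc_ng_per_ul.items()}
    libraries_ul = sum(volumes.values())
    ntc_ul = libraries_ul / n  # NTC volume = average sample library volume
    pooled_ul = libraries_ul + ntc_ul
    water_ul = max(0.0, min_pool_ul - pooled_ul)  # top up to 45 uL if needed
    total_ul = pooled_ul + water_ul + cot_ul  # assumption: water is included
    return {
        "per_library_ng": per_library_ng,
        "library_volumes_ul": volumes,
        "ntc_ul": ntc_ul,
        "water_ul": water_ul,
        "total_ul": total_ul,
        "bead_ul": 2 * total_ul,  # beads are added at 2x the total volume
    }


# Hypothetical concentrations for 22 libraries, 30 ng/uL each
plan = pooling_plan({f"S{i:02d}": 30.0 for i in range(1, 23)})
print(round(plan["per_library_ng"], 1), round(plan["bead_ul"], 1))
```

Computing the plan before pipetting makes it easy to spot libraries whose concentration is too low to contribute their share in a practical volume.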




### Table 6 Preparation of post-hybridization wash buffers


### Table 7 Post-capture PCR amplification program



### 3.5 Quality Control of Enriched DNA Library

	- 1. The concentration of the amplified and enriched library is assessed using the Qubit high sensitivity assay. Manufacturer guidelines are followed with two modifications: (1) the standard/sample is added to the Qubit assay tubes first followed by the Qubit working solution, and (2) the incubation time prior to reading the standard/sample is 20 min.

	- (a) Protocol A (Standard Normalization Method).
	- (b) Final dilution of library is to 1.5 pM for Mid Output kits.
	- (c) Final PhiX (sequencing control) spike-in percentage is 1% of the final library and PhiX composition.

### 3.7 Bioinformatic Analysis of the Sequencing Data Using ARResT/Interrogate


### 4 Notes

	- (a) Employ different Illumina sequencing platforms.
	- (b) Utilize sequencing reagents with increased output.
	- (c) Alter the number of samples being applied to the flow cell for sequencing.

### Acknowledgments

We would like to acknowledge the EuroClonality-NGS working group and, in particular, those members that directly or indirectly contributed to the validation of the EuroClonality-NDC assay. The EuroClonality-NGS Working Group is an independent scientific subdivision of EuroClonality that aims at innovation, standardization, and education in the field of diagnostic clonality analysis. The revenues of the previously obtained patent (PCT/NL2003/000690), which is collectively owned by the EuroClonality Foundation and licensed to InVivoScribe, are exclusively used for EuroClonality activities, such as for covering costs of the Working Group meetings, collective Work Packages, and the EuroClonality Educational Workshops. The EuroClonality consortium operates under an umbrella of ESLHO, which is an official EHA Scientific Working Group.

### References


generation sequencing assay to measure tumour mutational burden and detect clinically actionable variants. Mol Diagn Ther 24(3):339–349. https://doi.org/10.1007/s40291-020-00462-x


1316–1320. https://doi.org/10.1038/ng.2469



# Immunoglobulin Gene Mutational Status Assessment by Next Generation Sequencing in Chronic Lymphocytic Leukemia

Anne Langlois de Septenville, Myriam Boudjoghra, Clotilde Bravetti, Marine Armand, Mikaël Salson, Mathieu Giraud, and Frederic Davi

### Abstract

B cell receptor (BcR) immunoglobulins (IG) display a tremendous diversity due to complex DNA rearrangements, the V(D)J recombination, further enhanced by the somatic hypermutation process. In chronic lymphocytic leukemia (CLL), the mutational load of the clonal BcR IG expressed by the leukemic cells constitutes an important prognostic and predictive biomarker. Here, we provide a reliable methodology capable of determining the mutational status of IG genes in CLL using high-throughput sequencing, starting from leukemic cell DNA or RNA.

Key words Chronic lymphocytic leukemia, Immunoglobulin genes, Next generation sequencing, Somatic hypermutation analysis, Mutational status

### 1 Introduction

Chronic lymphocytic leukemia (CLL) is a malignant clonal proliferation of mature B cells. It is the most frequent leukemia in adults in the Western world and is characterized by a marked clinical heterogeneity. For some patients, it is an indolent disease with no or only late need of treatment, while in others it displays an aggressive behavior requiring early initiation of therapy [1]. Many prognostic factors have been identified, and among them, the mutational status of the immunoglobulin heavy chain variable (IGHV) genes of the B cell receptor (BcR) has emerged as one of the most robust parameters [2]. It has several advantages as it is stable and can be evaluated at any time including at diagnosis and is independent of other clinical or biological factors [3]. In addition, it has also proved to be a predictive factor of response to chemoimmunotherapy [4, 5]. Therefore the recent guidelines from the International Workshop on CLL recommend that determination of the IGHV mutational status should be performed before treatment initiation both in clinical trials and in general practice [6].

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_10, © The Author(s) 2022

The BcR IG display huge diversity in their variable regions, which results from complex mechanisms: (1) assembly of variable (V), diversity (D), and joining (J) genes, (2) imprecise junction of these rearranged genes with random nucleotide insertions and deletions, and (3) pairing of heavy and light chains [7]. Further diversification occurs after antigen encounter by somatic hypermutation in the V regions coupled with affinity maturation of the BcR [8]. In tumors such as CLL, all leukemic cells bear the same clonal BcR, which reflects the developmental stage from which they derive and constitutes a biomarker of the disease.

Determination of IGHV mutational status is achieved by sequencing the IGHV gene from the clonal IGH rearrangement of the leukemic cells, followed by its comparison with the closest germline counterpart from which it derives [9]. An identity <98% classifies the CLL as "mutated," which is associated with a favorable outcome, while an identity ≥98% defines "unmutated" CLL and confers a poor prognosis [10, 11]. Since this initial observation in 1999, numerous studies have confirmed that unmutated CLL have shorter time-to-first treatment and overall survival when compared to mutated cases [2, 3]. In addition, large-scale repertoire analyses have shown that CLL display a skewed IG repertoire with a sizeable fraction of patients sharing quasi-identical IG variable heavy chain region sequences, a phenomenon termed BcR IG stereotypy [12]. Importantly, some of these CLL cases belonging to the same stereotypic groups (or subsets) may also share similar clinical and biological features, separating them from other patients with the same IGHV mutational status [13, 14]. Therefore, BcR IG stereotypy further refines the categorization into mutated or unmutated CLL.
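The classification rule can be written down in a few lines. This is a minimal sketch assuming the conventional 98% cut-off, with identity of 98% or higher read as unmutated; the function names are hypothetical, and in practice the identity value is taken from the output of a dedicated annotation tool rather than computed by hand.

```python
def ighv_identity_percent(n_identical: int, n_aligned: int) -> float:
    """Percent identity of the clonal IGHV sequence to its closest germline gene."""
    if n_aligned <= 0:
        raise ValueError("aligned length must be positive")
    return 100.0 * n_identical / n_aligned


def ighv_mutational_status(identity_percent: float, cutoff: float = 98.0) -> str:
    """Classify a CLL case as 'unmutated' (>= cutoff) or 'mutated' (< cutoff)."""
    return "unmutated" if identity_percent >= cutoff else "mutated"


# Example: 294 identical positions over 300 aligned nucleotides
identity = ighv_identity_percent(294, 300)
print(identity, ighv_mutational_status(identity))
```

Cases that fall close to the cut-off deserve particular care, since a single nucleotide difference can change the category.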

The European Research Initiative on CLL (ERIC) has published methodological guidelines and recommendations on how to perform and interpret IGHV mutational status in CLL [15]. The first step consists of polymerase chain reaction (PCR) amplification of clonal IGH rearrangements. Importantly, as the whole IGHV gene sequence is necessary for accurate calculation of the somatic hypermutation load, 5′ primers need to be positioned upstream, e.g., on the leader peptide. Both genomic DNA (gDNA) and RNA extracted from leukemic cells can serve as templates, with gDNA having the advantage of being a more robust material, simpler to obtain, and also a source for other genomic investigations. However, in a fraction of cases, amplification from gDNA is hampered by the presence of somatic hypermutation in the primer binding sites. Although starting from RNA requires an additional step of reverse transcription (RT), this can be a useful alternative or complementary approach, as it allows the use of primers binding to sequences that are less or not targeted by somatic hypermutation upstream and downstream of the IGHV-IGHD-IGHJ rearrangement, namely the L1 part of the leader region and the constant regions, respectively.

Sequencing of the IGH rearrangement amplicons was traditionally performed by Sanger methodology. However, with the constant advance of next generation sequencing (NGS) in the diagnostic field, there is a need to adapt this technology to IGHV mutational status determination [16, 17]. Here, we describe detailed protocols for NGS-based determination of the IGHV mutational status in CLL, starting from either gDNA or cDNA templates.


### Table 1

Primers for gDNA template (sequence composition: flow cell binding adapter\_[barcode]\_sequencing primer site\_gene-specific primer)


### Table 2

Primers for cDNA template (sequence composition: flow cell binding adapter\_[barcode]\_sequencing primer site\_gene-specific primer)



### 2.5 Quantification of Purified PCR Products

1. Quant-iT™ dsDNA High-Sensitivity Assay Kit (Thermo Fisher Scientific).



### 3 Methods



at 68 °C for 10 min; 12 °C on hold.

### Table 3 PCR mix for gDNA template


### Table 4 PCR mix for cDNA template


10. At this stage, the plate can be sealed and stored at −20 °C for later use.

### 3.4 PCR Product Purification

1. Preparation.


### 3.5 Quantification of Purified PCR Products (See Note 7)



D501 combination).



### 3.8 Bioinformatic Analysis on Vidjil Platform (See Note 9)


Fig. 1 Screenshot of results displayed in Vidjil. The Vidjil platform provides an interactive visualization of the antigen receptor repertoire from high-throughput data. The left panel lists the most frequent clonotypes, the most abundant being at the top (squared). By default, the 50 most frequent ones are displayed, the value being adjustable (from 5 to 100). The IGHV, IGHD, and IGHJ genes contributing to each clonotype are indicated, as well as the number of deleted/inserted nucleotides. Further information is available by clicking on the yellow triangles. At the top of the left panel, a summary of the sample sequencing quality data can be obtained by clicking on the "i" symbol. The top-right panel shows the size distribution of the clonotype average read length, simulating the traditional GeneScan view of clonality analysis. The bottom-right panel offers a representation of the clonotypes according to their size and IGHV and IGHJ gene composition. Note that, in the vast majority of cases, the CLL dominant IGH clonotype appears surrounded by multiple small variant ones, differing by minor nucleotide changes. The sequence of the selected clonotype appears at the very bottom, the IGHV, IGHD, and IGHJ genes being highlighted. By clicking on the bent arrow above (circled), the sequence is sent automatically to IMGT/V-QUEST, IgBlast, and ARResT/AssignSubsets for further analysis

	- (a) List of most abundant clonotypes on the left.
	- (b) Graphic visualization by abundance and read length (on top).
	- (c) Graphic visualization by abundance and V/J usage (on the bottom).

### 4 Notes


be purchased. ARResT/Interrogate, developed within the EuroClonality-NGS working group, is another well-adapted alternative [23].


### Acknowledgments

Anne Langlois de Septenville and Myriam Boudjoghra contributed equally to this work.

### References


novel prognostic indicators in chronic lymphocytic leukemia. Blood 94:1840–1847


and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res 36:W503–W508


recombinations from high-throughput sequencing. BMC Genomics 15:409



# NGS-Based B-Cell Receptor Repertoire Analysis in the Context of Inborn Errors of Immunity

### Pauline A. van Schouwenburg, Mirjam van der Burg, and Hanna IJspeert

### Abstract

Inborn errors of immunity (IEI) are genetic defects that can affect both the innate and the adaptive immune system. Patients with IEI usually present with recurrent infections, but many also suffer from immune dysregulation, autoimmunity, and malignancies.

Inborn errors of the immune system can cause defects in the development and selection of the B-cell receptor (BCR) repertoire. Patients with IEI can have a defect in one of the key processes of immune repertoire formation, such as V(D)J recombination, somatic hypermutation (SHM), class switch recombination (CSR), or (pre-)BCR signalling and proliferation. However, other genetic defects can also lead to quantitative and qualitative differences in the immune repertoire.

In this chapter, we will give an overview of protocols that can be used to study the immune repertoire in patients with IEI, provide considerations to take into account before setting up experiments, and discuss analysis of the immune repertoire data using Antigen Receptor Galaxy (ARGalaxy).

Key words Next generation sequencing, Primary immunodeficiency, B-cell receptor repertoire, Inborn errors of immunity

### 1 Introduction

At this moment, more than 450 monogenetic defects have been reported in patients with inborn errors of immunity (IEI) [1]. The most common forms of IEI are predominant B-cell disorders leading to primary antibody deficiencies. T-cell disorders also have an effect on the development and function of B cells, because T cells are required for further differentiation of B cells into memory B cells and plasma cells.

IEI can have a direct or indirect effect on the B-cell receptor (BCR) repertoire. Direct effects are found in patients with genetic defects in genes involved in one of the key processes in the formation or shaping of the B-cell repertoire: V(D)J recombination, somatic hypermutation (SHM), class switch recombination (CSR), and (pre-)BCR signalling and proliferation [2–4]. Indirect effects can also be found because recurrent infections and/or autoimmunity can shape the BCR repertoire in IEI patients [5].

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_11, © The Author(s) 2022

Fig. 1 Schematic overview of the workflow. Summary of the workflow for NGS-based B-cell receptor sequencing using primer-based amplification and analysis using the Antigen Receptor Galaxy (ARGalaxy) pipeline. Created with BioRender.com

The BCR can be studied in several different ways, largely depending on the research question that needs to be answered and on the availability of the material. We will discuss how the BCR repertoire can be studied by amplifying BCR rearrangements from either DNA or cDNA and how to analyze the data using the Antigen Receptor Galaxy (ARGalaxy) analysis tool (Fig. 1). These methods can be applied to every sample, but we will focus on considerations that will affect the setup of the experiments and the data analysis for patients with IEI.

### 1.1 Selection of the Type of Cell or Tissue

The BCR repertoire can be divided into three classes: the immature BCR repertoire, the naïve BCR repertoire, and the antigen-selected BCR repertoire (Fig. 2). The immature BCR repertoire is derived from precursor B cells that did not undergo selection and/or have not completed their BCR rearrangements. This repertoire is particularly interesting for studying BCR repertoire formation in developing precursor B cells and processes like V(D)J recombination or pre-BCR signalling. Since precursor B-cell development takes place in the bone marrow, the only tissue that can be used to study the immature BCR is bone marrow. The naïve BCR repertoire is derived from naïve B cells that have not been activated. These naïve B cells can be found in peripheral blood. Peripheral blood is the least invasive material to obtain and for most labs easily accessible. However, peripheral blood contains a mixture of B-cell subsets, including naïve, memory, and plasma cells. The antigen-selected repertoire is derived from B cells that have been activated by their antigen. These B cells will differentiate into memory B cells or plasma cells. The antigen-selected B cells can be found in peripheral blood or secondary lymphoid organs, such as spleen or lymph nodes. Because tissues contain a mixture of B-cell subsets, it might be relevant to sort the population of interest before performing immune repertoire analysis of the naïve BCR repertoire or the antigen-selected BCR repertoire.

Fig. 2 Overview of the B-cell receptor repertoires. The B-cell receptor (BCR) repertoire can be divided into immature BCR repertoire, naïve BCR repertoire, and antigen-selected BCR repertoire. Created with BioRender.com

### 1.2 DNA Versus RNA

BCR rearrangements can be amplified from either DNA or RNA (cDNA). DNA is more stable than RNA and can be isolated from smaller cell numbers. The advantage of DNA is that it allows the study of unproductive and incomplete (DH-JH) rearrangements, which is not possible with RNA. Furthermore, there is only one DNA copy of a given functional rearrangement per cell, whereas there are many RNA copies per rearrangement per cell. The number of RNA copies is much higher in plasma cells than in memory B cells. The advantage of RNA is that it allows analysis to be restricted to productive rearrangements and allows the constant gene to be studied. Furthermore, RNA is preferred when unique molecular identifiers (UMI) are used to identify single RNA molecules.

### 1.3 The Number of B Cells that Can Be Studied

The BCR repertoire has been studied for decades by amplifying BCR rearrangements, cloning, and Sanger sequencing. However, since the introduction of next generation sequencing, it is possible to study thousands or even millions of BCRs in a way that is less labor intensive. The challenge of this high-throughput method is to obtain enough B cells to study thousands or millions of BCR rearrangements, especially in patients with a B-cell deficiency. Therefore, in patients with IEI, the starting material is often mononuclear cells obtained from blood or bone marrow. When using mononuclear cells, it is advisable to determine the frequency of B cells, e.g., by flow cytometry, to be able to estimate the number of B-cell rearrangements that can be analyzed.
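The estimate described above is a simple product of cell number and B-cell frequency. The sketch below is illustrative only: the function names are hypothetical, and the conversion from DNA mass to cell equivalents uses an assumed value of about 6.6 pg of DNA per diploid human cell, which is a commonly used approximation and not a value given in this chapter.

```python
PG_DNA_PER_DIPLOID_CELL = 6.6  # assumption: approximate DNA content per cell


def cell_equivalents(dna_ng: float) -> float:
    """Approximate number of cells represented by a given mass of gDNA."""
    return dna_ng * 1000.0 / PG_DNA_PER_DIPLOID_CELL


def expected_b_cells(n_cells: float, b_cell_fraction: float) -> float:
    """Expected number of B cells, given the flow cytometry B-cell fraction."""
    if not 0.0 <= b_cell_fraction <= 1.0:
        raise ValueError("b_cell_fraction must be between 0 and 1")
    return n_cells * b_cell_fraction


# Hypothetical example: 100 ng of mononuclear cell DNA, 5% B cells
cells = cell_equivalents(100.0)
print(round(cells), round(expected_b_cells(cells, 0.05)))
```

When DNA is the template, the expected number of B cells is an upper bound on the number of distinct functional rearrangements that can be recovered, because each cell carries one DNA copy of its functional rearrangement.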

1.4 Location of the Primers

The IGH locus consists of >100 different variable (V), diversity (D), and joining (J) genes that are recombined to form a BCR. Fortunately, many of the genes have large sequence similarities and can therefore be subdivided into different families, such that primers specific for these gene families can be used in a multiplex PCR to amplify the repertoire. The forward primers can be located in the leader or in the framework regions (FR) of the VH genes. Preferably, the forward primers should not be located in the complementarity-determining regions (CDR), because these regions can have a high frequency of somatic hypermutations (SHM) that can decrease the binding efficiency of the primer. The location of the primers also depends on the information that is needed from the immune repertoire data. Primers in the leader sequence are least affected by SHM and provide the most accurate information about the hypomorphic alleles, but this results in a long amplicon that might not be suitable for all sequencing platforms. In this protocol, we use the 6 IGHV FR1, 7 IGHV FR2, or 7 IGHV FR3 forward primers adapted with the Rd1 adaptor for Illumina sequencing (Fig. 3) (Table 1) [6]. As reverse primer, a single primer in the JH gene is enough to cover all six functional JH genes (Table 1). However, when there is an interest in information about the (sub)class of the BCR, a primer in the constant (C) gene can be used. These rearrangements can only be amplified using cDNA as starting material. Since the amount of material is often limited in patients

Fig. 3 Overview of IGH locus with primers. The forward primers located in FR1, FR2, or FR3 are indicated. For B-cell receptor rearrangements amplified from DNA the JH consensus can be used. For amplification of the B-cell receptor rearrangements from cDNA, either the JH consensus, the CgCH1, IgHA R, or the Cm CH1 primers can be used

with IEI, using primers in the Cγ or Cα region also allows selection of rearrangements derived from Ig-switched memory B cells without the need for pre-sorting of these cells. Optionally, a reverse primer in the Cμ region can be used. Subsequently, the data can be separated into rearrangements that contain <2% SHM and are likely derived from naïve B cells and rearrangements that have >2% SHM, which are likely derived from memory B cells. The reverse primers should also be adapted by addition of the Rd2 adaptor for Illumina sequencing (Table 1).
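The SHM-based separation of Cμ-derived rearrangements can be illustrated with a minimal sketch. The function names and labels are hypothetical, and the handling of a rearrangement with exactly 2% SHM is an assumption, since the cut-off is only described as <2% versus >2%.

```python
# Split rearrangements by SHM frequency, using the 2% cut-off from the text.
# Assumption: a rearrangement with exactly 2% SHM is labelled "memory-like";
# the protocol does not say how this boundary case should be treated.
SHM_CUTOFF_PERCENT = 2.0


def shm_percent(mutations: int, bases_analysed: int) -> float:
    """Percentage of mutated positions in the analysed part of the V gene."""
    if bases_analysed <= 0:
        raise ValueError("bases_analysed must be positive")
    return 100.0 * mutations / bases_analysed


def classify(mutations: int, bases_analysed: int) -> str:
    """Label a rearrangement as likely naive-derived or likely memory-derived."""
    if shm_percent(mutations, bases_analysed) < SHM_CUTOFF_PERCENT:
        return "naive-like"
    return "memory-like"


if __name__ == "__main__":
    # Hypothetical reads: (number of mutations, number of bases analysed)
    for mutations, length in [(0, 250), (4, 250), (10, 250)]:
        print(mutations, length, classify(mutations, length))
```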

1.5 Choosing a Tool to Analyze the Immune Repertoire Data

Next generation sequencing of the BCR repertoire generates thousands of rearrangements and requires bioinformatics tools for analysis. In the last decade, many different analysis tools have been developed. Most tools help to annotate the rearrangements and to visualize the data. The choice of the tool greatly depends on the research question, and it is likely that multiple tools are needed to answer all questions. In this chapter, we will discuss the Antigen Receptor Galaxy (ARGalaxy) tool [7]. This tool is web-based and can be used to analyze many different qualitative measurements. It has two different pipelines: the immune repertoire pipeline, which allows the analysis of V, D, and J gene usage, CDR3 characteristics, and junction characteristics, and the SHM and CSR pipeline, which allows the analysis of SHM, antigen selection, and CSR. Depending on the research question, data can be analyzed with either one or both pipelines.

### 2 Materials

2.1 Amplification (VH-Cγ or VH-Cα from cDNA or VH-JH from DNA)



Table 1 Overview of primer sequences






### 2.2 Nested PCR

1. PCR cycler.


2.3 Merging, Trimming, and Alignment of Reads and Data Analysis


### 3 Methods

3.1 Amplification of VH-Cγ, VH-Cα, or VH-Cμ from cDNA

	- 2. Transfer PCR master mix into PCR reaction tubes (41 μl into each well).
	- 3. Deposit 1 μl of each primer (6 5′ primers and 1 Cγ or Cα primer per well) into the corresponding well (see Notes 2 and 4).
	- 4. Add 2 μl of 50 ng/μl DNA to the corresponding well, and carefully close the lids of the PCR tubes (see Notes 4 and 9).
	- 5. Run PCR at 95 °C for 7 min; 25–35 cycles of 94 °C for 30 s, 57 °C for 30 s, and 72 °C for 1 min; 72 °C for 10 min (see Note 6).
	- 6. Load 50 μl PCR product with 10 μl loading dye onto a 1% agarose gel in TBE buffer containing ethidium bromide and run for 1 h at 180 V.
	- 7. Visualize the DNA band under ultraviolet (UV) light (see Note 7), and cut the PCR band of approximately 500 bp from the gel using a scalpel (see Note 8).
	- 8. Purify the PCR product from the gel using the gel extraction kit. Follow the instructions in the manual and elute with 20 μl elution buffer.
	- 9. Continue with Subheading 3.3, Nested PCR.
	- 2. Run PCR at 95 °C for 5 min; 10 cycles of 98 °C for 20 s, 66 °C for 30 s, and 72 °C for 30 s; 72 °C for 1 min.

3.2 Amplification of VH-JH from DNA


1. Sequencing with the Illumina platform results in R1 and R2 reads that need to be merged before they can be aligned to a reference database. This merging can be done with PEAR, a paired-end read merger [8], which can be found under "pre-processing" at https://argalaxy.researchlumc.nl/.


3.5 Data Analysis Using the Immune Repertoire Pipeline in Antigen Receptor Galaxy (ARGalaxy) (See Note 13)

3.4 Merging, Trimming, and Alignment of Reads Using Galaxy

### Table 2

Example of the overview table of the Immune repertoire pipeline in ARGalaxy showing the number and percentage of (unique) productive and unproductive sequences per donor and per replicate. The definition of unique sequences is based on the clonal type definition filter setting chosen




Table 3 Overview of V, D, and J genes that can be affected in the B-cell receptor repertoire


Fig. 4 Examples of analyses with the immune repertoire pipeline. Naïve B cells have a higher frequency of IGHV4–34 and IGHJ6 compared to antigen-selected switched B cells (IGHG and IGHA) (a). The CDR3 length is shorter in antigen-selected switched B cells (IGHG and IGHA) compared to naïve B cells (b). Patients with ataxia telangiectasia (AT) or Nijmegen breakage syndrome have a reduced diversity of the naïve BCR repertoire (c). The number of samples analyzed is indicated per group. P-values <0.001 are indicated by \*\*\* and P-values <0.0001 are indicated by \*\*\*\*

### Table 4

Overview of junction characteristics of IEI patients with defects in the non-homologous end joining pathway


### 3.6 Data Analysis Using the SHM and CSR Tool in ARGalaxy (See Note 13)


Fig. 5 Examples of analyses with the SHM and CSR pipeline. The median frequency of somatic hypermutations (SHM) increases during childhood in both IGHG and IGHA antigen-selected switched B cells (a). The median frequency of SHM is reduced in patients with MSH6, PMS2, or UNG deficiency (b). Patients with defects in MSH2 and MSH6 have a strong reduction in mutations at A and T base pairs compared to controls. Patients with UNG deficiency have a strong reduction in transversion mutations at G and C base pairs (c). Patients with ataxia telangiectasia (AT) and common variable immunodeficiency (CVID) have a reduced frequency of IGHG2 and IGHG4 switched B cells (d). \*P < 0.05 and \*\*P < 0

### 4 Notes


sequences that share the same clonal type between the replicate" should be used if the overlap between at least two replicates within the same donor should be determined. Option 3 "Determine the clonality of the donor (minimal 3 replicates)" can be used to determine the number of overlapping sequences between at least three replicates within one donor and provides the clonality score described by Boyd et al. [17].


### References


26. Gupta NT, Vander Heiden JA, Uduman M, Gadala-Maria D, Yaari G, Kleinstein SH (2015) Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31 (20):3356–3358


# Generic Multiplex Digital PCR for Accurate Quantification of T Cells in Copy Number Stable and Unstable DNA Samples

### Rogier J. Nell, Willem H. Zoutman, Mieke Versluis, and Pieter A. van der Velden

### Abstract

An accurate T cell quantification is prognostically and therapeutically relevant in various clinical applications, including oncology care and research. In this chapter, we describe how T cell quantifications can be obtained from bulk DNA samples with a multiplex digital PCR experiment. The experimental setup includes the concurrent quantification of three different DNA targets within one reaction: a unique T cell DNA marker, a regional corrector, and a reference DNA marker. The T cell marker is biallelically absent in T cells due to VDJ rearrangements, while the reference is diploid in all cells. The so-called regional corrector makes it possible to correct for possible copy number alterations at the T cell marker locus in cancer cells. By mathematically integrating the measurements of all three markers, T cells can be accurately quantified in both copy number stable and unstable DNA samples.

Key words T cell quantification, Multiplex digital PCR, DNA markers, Copy number instability, Cancer

### 1 Introduction

T cells form an essential part of the human adaptive immune system. These cells are able to recognize and bind antigens via unique, antigen-specific cell-surface receptors, referred to as the T cell receptors (TCR). The enormous diversity of TCR molecules is generated by unique genetic mechanisms occurring during early maturation of these cells in the thymus [1, 2]. One of these mechanisms involves the rearrangement of the germline T cell receptor (TR) genes (i.e., TRD, TRG, TRB, and TRA) into a unique TR blueprint. The absolute presence of T cells varies between tissues and body fluids and is influenced by physiological and pathological conditions. For that reason, an accurate quantification of (infiltrated) T cells is relevant in various clinical applications, ranging from autoimmune disorders to infectious disease and cancer [3, 4].

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_12, © The Author(s) 2022

Traditionally, the presence of immune cells has been assessed by histological or cytological techniques such as immunohistochemistry or flow cytometry, depending on the nature of the input sample. These methods identify cells by making use of antibodies that can bind to T cell-specific epitopes, which should be available and accessible in the samples of interest [5, 6]. For that reason, these methodologies become problematic when the quality or quantity of specimens is limited [7].

More recently, high-resolution technologies (e.g., single-cell RNA sequencing and mass cytometry) have become available to study the presence of immune cells in mixed populations. These approaches, however, have even higher requirements concerning sample quality and quantity than traditional methods and remain financially and technically challenging for common use in research or diagnostics.

Alternatively, the presence of immune cells may be estimated from bulk "omics" data. Based on cell type-specific signature matrices, bulk gene expression or DNA methylation data can be computationally separated into its cellular components, a process called "deconvolution" [8, 9]. These approaches, however, are often less accurate when analyzing mixtures with unknown content or noise (such as cancer cells) and frequently show skewed or nonlinear relationships when compared against ground-truth measurements [8].

As another alternative, the abundance of T cells can be quantified by exploiting the genetic dissimilarities of the TR genes between T cells (i.e., rearranged) and non–T cells (i.e., in germline configuration). While various genomic approaches have been developed, these methods are usually very complex and not entirely quantitative. For example, multiplex PCR-based techniques, like the BIOMED-2 approach, only demonstrate relative differences in V(D)J gene usage and are performed to reveal the clonal expansion of specific T cells, rather than a general quantification of all T cells [2]. High-throughput sequencing can be used to analyze the full repertoire of V(D)J-rearranged TR genes. This approach is, however, relatively vulnerable to preferential amplification, which also limits the possibilities for an absolute quantification. Currently, one of the best solutions is the commercially available ImmunoSEQ™ Assay (Adaptive Biotechnologies). This sequencing-based method makes use of spiked synthetic control DNA, which represents a complete immune repertoire and is co-amplified with the target DNA. Such inline controls allow for the normalization of preferential amplification and offer a more accurate quantification of T cells, as recently demonstrated in melanoma and carcinoma [10, 11]. Nevertheless, it remains a complex, expensive, and time-consuming procedure to obtain a simple T cell quantification.

Fig. 1 Schematic overview of the availability of the T cell markers (ΔB in the TRB gene and ΔD in the TRD gene) and stable genomic reference (REF) in T cells and non–T cells. Due to T cell receptor rearrangements, ΔB and ΔD are biallelically absent in T cells specifically. In contrast, REF is present on both alleles in all cells

To overcome these hurdles, we developed a novel, digital PCR-based methodology to measure the abundance of T cells from a bulk DNA sample [12]. Our approach is based on unique, generic markers for rearranged TRB and TRD genes (named ΔB and ΔD, respectively) that facilitate a robust and simple T cell quantification. Due to TR rearrangements, mature T cells have completely lost ΔB and ΔD, whereas the markers are biallelically present in other cells (Fig. 1). By simply comparing the absolute abundance of ΔB or ΔD to a stable genomic reference DNA marker (abbreviated as "REF") that is biallelically present in all cells, the fraction of T cells can be determined based on bulk DNA [12]. Our method can be performed using only 20 ng of DNA and showed a highly accurate and linear relationship when compared to flow cytometry in blood samples from healthy donors and lymphoma patients (Fig. 2) [12]. Moreover, we successfully applied this approach to determine the T cell content of primary uveal melanomas [13]. Recently, it was used as part of assays to quantify the number of infected cells with human T cell leukemia virus type 1 (HTLV-1) and human immunodeficiency virus (HIV) [14, 15]. Furthermore, our methodology has translational applications in validating the purity of isolated or sorted populations of T cells and non–T cells [14].

A drawback of our approach lies in its sensitivity to pathogenic genetic alterations that affect the copy number of the various marker loci. While such variation is unusual in benign samples, copy number alterations (CNAs) are frequently seen in malignancies and premalignant conditions [16]. In such conditions, healthy T cells may be mixed with copy number unstable cancer cells, which can complicate the mathematical interpretation of the obtained marker quantifications. On the one hand, the genomic reference may be lost or gained as part of a chromosomal CNA. This problem

Fig. 2 Comparison of T cell quantifications in 30 peripheral blood mononuclear cell samples from healthy donors and lymphoma (Sézary syndrome) patients obtained by gold standard flow cytometry (measured by CD3+, x-axis) and digital PCR (measured by ΔB, y-axis) [12]. A strong and linear correlation is observed (Pearson R = 0.9607, p < 0.0001), demonstrating the high accuracy of our approach

was illustrated earlier and is a common pitfall of various molecular techniques [17]. However, it can be easily resolved by using a target on another chromosome as reference. The identification and selection of such a sample-specific stable reference may be supported by tumor-type specific knowledge about common copy number alterations. On the other hand, a CNA in admixed cancer cells may disturb the abundance of our T cell marker (ΔB in the TRB gene on chromosome 7q34 or ΔD in the TRD gene on chromosome 14q11.2). Consequently, this gain or loss of T cell marker DNA may be unjustly attributed to the absence or presence of T cells, leading to under- or overestimated fractions. The strict genomic locations of the T cell markers, however, prevent us from freely switching to another chromosome to overcome this problem. Based on copy number profiles of more than 10,000 cases spanning 31 tumor types from the TCGA pan-cancer dataset [16, 18], we previously showed that CNAs involving the ΔB and ΔD marker loci are present in ~24% and ~17% of the tumors [19]. These frequencies indicate that our original methodology (referred to as the classic model) gives incorrect T cell fractions in on average one out of four (ΔB) or one out of five (ΔD) of the cancer specimens.

As a robust solution for this problem, we developed an extension (referred to as the adjusted model) of our original experimental setup, which enables the recognition and adjustment of copy

Fig. 3 Workflow to obtain the fraction of T cells from a hypothetical cellular admixture consisting of 50% healthy T cells and 50% copy number unstable non–T cells with a gain of the ΔB T cell marker region at chromosome 7. Typically, 20 ng of isolated DNA is analyzed for T cell marker ΔB, regional corrector (RCΔB), and reference (REF) using digital PCR, which involves the compartmentalization of the complete PCR reaction (DNA and reagents) into a large number of nanoliter-sized droplets. For each of the DNA markers, the random distribution of the reaction mixture over the droplets results in a certain fraction of the droplets containing this target. PCR amplification only takes place in these droplets and results in a distinctive fluorescence intensity (droplets scored as "positive"). The number of positive droplets of all droplets can then be used to determine the abundances of all DNA targets, by which the T cell fractions can be calculated. Using the classic model, the T cell marker locus CNA is not recognized, and an incorrect T cell fraction of 25% is calculated. Following the adjusted model, the CNA is detected and adjusted for and a correct T cell fraction of 50% is calculated

number instability involving the T cell marker region. This enhanced approach relies on a so-called regional corrector that measures the copy number of the ΔB or ΔD marker locus. In contrast to the T cell marker, the regional corrector should not be deleted by TR rearrangements and is therefore biallelically present in all T cells. When a CNA in non–T cells involves the T cell marker region, the regional corrector will be affected likewise, allowing for a mathematical correction of the disrupted T cell quantification [19].

The actual quantifications of the T cell marker, regional corrector, and stable reference are obtained via custom-designed PCR assays (consisting of primers and fluorescently labelled probes) using digital PCR, as illustrated in Fig. 3. This technique involves the compartmentalization of a PCR reaction into a large number of small partitions, which are nanoliter-sized droplets when using the Droplet Digital™ PCR System (Bio-Rad Laboratories, Hercules, USA). For each of the DNA markers, the random distribution of the reaction mixture over the droplets results in a certain fraction of the droplets containing this target. PCR amplification only takes place in these droplets and results in a distinctive fluorescence intensity (droplets scored as "positive"). In contrast, droplets without initial presence of the target (but still containing nontarget DNA) remain unaltered and show a low background level of fluorescence (droplets scored as "negative" or "empty"). As usually an end-point PCR is carried out, all positive droplets will have a comparable fluorescence. "Digital" in digital PCR refers to this

dichotomous way of scoring: each droplet can only be positive or negative for a certain target. The number of positive droplets reflects the abundance of the measured DNA target: the more target molecules there are to distribute, the more droplets will be filled and eventually scored as positive. This relationship is, however, not linear, as the random DNA distribution can also lead to droplets containing more than one target molecule. Instead, the relation between positive droplets and number of targets follows a Poisson distribution [20, 21]. For that reason, the final phase of a digital PCR experiment consists of a mathematical interpretation of the experimental outcomes. The statistical uncertainty of the obtained results is usually presented with a 95% confidence interval.
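The Poisson correction can be sketched as follows. If a fraction p of all droplets is positive, the mean number of target copies per droplet is estimated as λ = −ln(1 − p). The function names are hypothetical, the droplet volume of 0.85 nL is an assumed nominal value for the droplet system, and the interval shown is one simple approximation (a Wald interval on p, transformed to λ) that is not necessarily the one implemented in any particular analysis software.

```python
import math

# Assumption: nominal droplet volume of 0.85 nL (expressed in microliters);
# check the value used by your own instrument and software.
DROPLET_VOLUME_UL = 0.85e-3


def copies_per_droplet(n_positive: int, n_total: int) -> float:
    """Poisson-corrected mean number of target copies per droplet."""
    if not 0 <= n_positive < n_total:
        raise ValueError("need 0 <= n_positive < n_total")
    return -math.log(1.0 - n_positive / n_total)


def copies_per_ul(n_positive: int, n_total: int,
                  droplet_volume_ul: float = DROPLET_VOLUME_UL) -> float:
    """Target concentration in copies per microliter of reaction."""
    return copies_per_droplet(n_positive, n_total) / droplet_volume_ul


def copies_per_droplet_ci(n_positive: int, n_total: int, z: float = 1.96):
    """Approximate 95% interval for the mean number of copies per droplet.

    A Wald interval is built on the positive fraction and both limits are
    then transformed with the same Poisson correction.
    """
    p = n_positive / n_total
    half_width = z * math.sqrt(p * (1.0 - p) / n_total)
    low_p = max(p - half_width, 0.0)
    high_p = min(p + half_width, 1.0 - 1.0 / n_total)  # avoid log(0)
    return -math.log(1.0 - low_p), -math.log(1.0 - high_p)


if __name__ == "__main__":
    # Hypothetical well: 4,000 positive droplets out of 15,000 accepted droplets
    print(copies_per_droplet(4000, 15000))
    print(copies_per_droplet_ci(4000, 15000))
```

The non-linearity is easy to see from the formula: doubling the fraction of positive droplets more than doubles the estimated number of target copies.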

In this chapter, we describe how T cell quantifications can be obtained from bulk DNA samples using multiplex digital PCR. The experimental setup includes the concurrent quantification of three different DNA targets within one reaction: one of the unique T cell DNA markers (ΔB or ΔD), a regional corrector, and an independent reference DNA marker. By mathematically integrating the measurements of all three markers, T cells can be accurately quantified in both copy number stable and unstable DNA samples, as we previously validated [12, 19].

### 2 Materials



### Table 1 Overview of PCR assays


### 3 Methods

In Subheadings 3.1 and 3.2, we introduce the multiplex experimental setup to quantify T cells in a sample of interest. In Subheadings 3.4, 3.5, and 3.6, we describe the complete workflow to perform the experiments. Finally, in Subheading 3.7, we discuss the legitimacy of the approach in the analysis of samples with a lymphoproliferative component.

### 3.1 Choosing an Experimental Setup

The experimental setup to quantify T cells in (possibly) copy number unstable DNA samples involves the measurement of three distinct DNA targets: a T cell marker, its regional corrector, and a stable genomic reference. We previously identified two DNA targets (ΔB and ΔD) that fulfill the role of generic T cell marker: in mature T cells, both markers are biallelically absent [12]. Both assays have been successfully applied to quantify the proportion of T cells [12–15, 17], and for that reason, either ΔB or ΔD can be used in the experimental setup.

The regional corrector is used to quantify the (possibly altered) copy number of the chosen T cell marker locus. Therefore, it should measure a DNA target located in close genomic proximity to the T cell marker, but its abundance should not be altered due to TR rearrangements. For ΔB, we recently validated a regional corrector targeting TRBC2, the secondary constant domain and last region of the TRB gene complex [19]. As TRBC2 is not deleted as part of VDJ rearrangements, it can be considered the closest genomic locus functioning as regional corrector for ΔB. For ΔD, a candidate regional corrector may be found in the constant gene of the TRA gene, as TRD itself is located within TRA and consequently may be lost due to TR rearrangements [22].

The stable genomic reference should measure a copy number invariant DNA target that is biallelically present in all cells. This marker measures the total number of genomes (and thus cells) and is used to normalize the relative loss of the T cell marker. Hereby, the T cell fraction (i.e., fraction of all cells that is a T cell) can be calculated. The selection of a stable reference in cancer specimens may be guided by tumor-type specific information or measurements in individual samples, as we illustrated previously [17].

The experimental setup of this protocol consists of T cell marker ΔB with regional corrector TRBC2 (both on chromosome 7q34) and stable reference TTC5 (chromosome 14q11.2).

### 3.2 Multiplex Digital PCR

In traditional multiplex (q)PCR reactions, multiple targets of interest can be analyzed simultaneously by using differentially colored fluorescent probes. The QX200™ Droplet Digital™ PCR System is, however, limited to the detection in two optical channels (i.e., FAM and HEX/VIC). Still, it is possible to measure more than two targets in a single digital PCR reaction. By varying the concentration of same-colored probes, distinct probes (and thus distinct targets) may be identified based on different end-point fluorescence intensities [23, 24]. Here, we make use of this strategy to measure the regional corrector (single FAM-labelled assay), the T cell marker (HEX-labelled assay, low concentration), and the genomic reference (HEX-labelled assay, high concentration) in a triplex reaction (see Fig. 3). As the signal intensities may differ between assay batches or dilutions, optimization and validation experiments are needed for each channel with more than one assay. Consequently, in our setup, the mixing of two HEX-labelled assays (ΔB and TTC5) should be optimized.
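In a triplex reaction, every droplet is positive or negative for each of the three targets, so a well-separated 2D plot contains 2 × 2 × 2 = 8 droplet clusters. The sketch below, with hypothetical names and invented counts, shows how the number of positive droplets per target follows from the droplet counts of these clusters.

```python
from itertools import product

# Order of targets within each cluster tuple.
# TRBC2: FAM; dB (the T cell marker): HEX, low; TTC5: HEX, high.
TARGETS = ("TRBC2", "dB", "TTC5")


def all_clusters():
    """Every positive/negative combination of the three targets (8 clusters)."""
    return list(product((False, True), repeat=len(TARGETS)))


def positives_per_target(cluster_counts):
    """Sum the droplets of all clusters in which a given target is positive.

    cluster_counts maps a tuple of three booleans (one per target, in the
    order of TARGETS) to the number of droplets assigned to that cluster.
    """
    totals = {target: 0 for target in TARGETS}
    for status, count in cluster_counts.items():
        for target, is_positive in zip(TARGETS, status):
            if is_positive:
                totals[target] += count
    return totals


if __name__ == "__main__":
    # Invented droplet counts, for illustration only
    counts = {cluster: 100 for cluster in all_clusters()}
    counts[(False, False, False)] = 12000  # empty droplets
    print(positives_per_target(counts))
    print(sum(counts.values()))  # total number of droplets
```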

	- 2. Bring the primer/probe mixes of the ΔB, regional corrector, and reference assays to room temperature, and mix thoroughly by pulse-vortexing the tube. Centrifuge briefly to ensure all contents are collected at the bottom of the tube.

### 3.3 Pre-PCR Preparation of Reaction Mixture

Fig. 4 2D plot of 1 × 2 multiplex digital PCR analyzing regional corrector TRBC2 on channel 1 (FAM) and T cell marker ΔB (assay with lowest fluorescence) and stable reference TTC5 (assay with highest fluorescence) on channel 2 (HEX) in a healthy, copy number stable PBMC sample. Initially (a) clusters are overlapping, but after optimization (b) all eight clusters are separated. To validate that correct quantifications are obtained with this multiplex, the concentration ratios [ΔB]/[TTC5] and [TRBC2]/[TTC5] are compared with the results obtained in separate duplex experiments (c)

	- (a) 11.0 μL ddPCR™ Supermix for Probes (No dUTP)
	- (b) 1.0 μL TRBC2 assay primer/probe mix
	- (c) 0.8 μL ΔB assay primer/probe mix (optimized input; see Subheading 3.2)
	- (d) 1.4 μL TTC5 reference assay primer/probe mix (optimized input; see Subheading 3.2)
	- (e) 1.0 μL of 2 U/μL HaeIII restriction enzyme, diluted in 1× CutSmart® buffer (see Note 6)
	- (f) 20 ng DNA (see Note 7)
	- (g) Nuclease-free H2O up to a total volume of 22.0 μL.
	- 2. After the droplet generation has finished, remove the plate from the cooling block immediately, and cover it with a heatsealed foil seal, for example, using the PX1™ PCR Plate Sealer. As the generated droplets are fragile in this stage, it is advised to handle the plate with care and to proceed with the next step directly.
	- 3. Place the plate with the generated droplets into a T100™ Thermal Cycler or comparable programmable PCR cycler suitable for the described 96-well plates. The PCR amplification should be carried out with a lid temperature of 105 °C, a ramp rate set to 2 °C/s, and the reaction volume set to 40 μL, using the following protocol:
		- (a) 10 min at 95 °C
		- (b) 30 s at 94 °C and 1 min at 60 °C, for 40 cycles
		- (c) 10 min at 98 °C
		- (d) 30 min at 4 °C and (optional) cooling at 12 °C until droplet reading (see Note 11).
	- 2. Create a template in the experimental setting section of the QuantaSoft™ software (see Fig. 5), and follow further instructions as given in the user manual to start droplet reading (see Note 10).
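The per-reaction volumes of the reaction mixture itemized above (11.0 μL Supermix, 1.0 μL TRBC2 assay, 0.8 μL ΔB assay, 1.4 μL TTC5 assay, 1.0 μL HaeIII, 20 ng DNA, and water up to 22.0 μL) lend themselves to a small pipetting calculation. The following is a minimal sketch with hypothetical function names; the 10% pipetting excess is an assumption and not part of the protocol, and the assay volumes are the optimized inputs of this setup, which may differ for other assay batches.

```python
# Per-reaction volumes in microliters, taken from the reaction mixture list.
REAGENTS_UL = {
    "ddPCR Supermix for Probes (No dUTP)": 11.0,
    "TRBC2 assay": 1.0,
    "dB assay": 0.8,
    "TTC5 assay": 1.4,
    "HaeIII (2 U/uL)": 1.0,
}
TOTAL_VOLUME_UL = 22.0
DNA_INPUT_NG = 20.0


def water_volume_ul(dna_conc_ng_per_ul: float) -> float:
    """Nuclease-free water needed to bring one reaction to 22.0 uL."""
    dna_ul = DNA_INPUT_NG / dna_conc_ng_per_ul
    water = TOTAL_VOLUME_UL - sum(REAGENTS_UL.values()) - dna_ul
    if water < 0:
        raise ValueError("DNA is too dilute to fit 20 ng into the reaction")
    return round(water, 2)


def master_mix(n_reactions: int, excess: float = 0.10) -> dict:
    """Reagent volumes for n reactions, with an assumed pipetting excess."""
    factor = n_reactions * (1.0 + excess)
    return {name: round(vol * factor, 1) for name, vol in REAGENTS_UL.items()}


if __name__ == "__main__":
    print(water_volume_ul(10.0))  # DNA at 10 ng/uL
    print(master_mix(8))
```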

### 3.4 Droplet Generation and PCR Amplification

Fig. 5 Example of defining the well template settings for our multiplex experimental setup in the QuantaSoft™ software

Fig. 6 Example of the analysis of our multiplex experimental setup in Roodcom WebAnalysis. Here, 20 ng DNA from a malignant melanoma is analyzed. In the panel "2D plot," eight distinct clusters are detected that are used to calculate the concentrations of the individual targets (in the panel "Concentrations"). When the experimental format is set to "T cell multiplex" (MP-TCF), the T cell fractions according to the classic and adjusted model and their associated 95% confidence interval are calculated automatically (in the panel "Results"). In this tumor sample, TTC5 represents the stable genomic reference, but a chromosomal gain of the T cell marker region causes [ΔB] and [TRBC2] to be higher than [TTC5]. Consequently, the T cell fraction according to the classic model is negative (−14.6%), which is incorrect and biologically impossible. Using the adjusted model, however, the CNA is detected and properly normalized, leading to a positive and correct T cell fraction (13.1%)

(third-party) software applications have been developed and are available for downstream analysis, e.g., QuantaSoft Analysis Pro or QX Manager (both Bio-Rad), ddPCRclust [25] or Roodcom WebAnalysis (https://www.roodcom.nl). Here, we make use of Roodcom WebAnalysis to analyze the data (see Fig. 6).

2. As discussed in Subheading 3.2, an optimized triplex reaction will result in 2<sup>3</sup> = 8 distinct clusters of droplets. Although thresholding may have been carried out automatically by the software, manual examination and, if necessary, adjustment are recommended. To validate that correct quantifications are

### Table 2

Formulas to calculate the T-cell fraction (TCF) and its 95% confidence interval [TCFlow; TCFhigh] according to the classic model, without correction for CNAs affecting the T cell marker locus, based on the absolute numbers of droplets scored positive for markers ΔB (nΔB+) and REF (nREF+) and the total number of droplets analyzed (ntotal)


obtained, we propose various control experiments to be performed next to the samples of interest (see Note 8). For a general evaluation of the experimental performance, we advise to follow the "MiQE" guidelines described by Huggett et al. [26].


$$\text{T cell fraction} = 1 - \frac{[\Delta \text{B}]}{[\text{TTC5}]}$$

### Table 3

Formulas to calculate the T cell fraction (TCF) and its 95% confidence interval [TCFlow; TCFhigh] according to the adjusted model, with correction for CNAs affecting the T cell marker locus, based on the absolute numbers of droplets scored positive for markers ΔB (nΔB+), RCΔB (nRCΔB+), and REF (nREF+) and the total number of droplets analyzed (ntotal)


The T cell fraction following the adjusted model (which automatically corrects possible copy number alterations involving the T cell marker locus) can be calculated as follows:

$$\text{T cell fraction} = \frac{[\text{TRBC2}] - [\Delta \text{B}]}{[\text{TTC5}]}$$

When the experimental format is set to "T cell multiplex" (MP-TCF) in Roodcom WebAnalysis, these results are automatically available in the panel "Results" (see Fig. 6).

3. We recommend constructing confidence intervals for each obtained T cell fraction (see Tables 2 and 3 for all formulas [19, 20]). These intervals, usually with a confidence level of 95%, indicate the precision of the calculated fractions: the wider the interval, the more uncertain the quantification is. The absolute width of such a confidence interval depends on several factors, including the amount of DNA input, the total number of accepted droplets, and the copy number of the T cell marker locus. Generally speaking, the width of the interval can be decreased by analyzing more DNA (when available). In Roodcom WebAnalysis, these confidence intervals are automatically presented in the panel "Results" (see Fig. 6).
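To make the difference between the classic and adjusted model concrete, the sketch below recomputes the hypothetical admixture of Fig. 3 (50% healthy T cells and 50% non–T cells with a gain of the T cell marker region). The function names are hypothetical, and the gain is assumed to be a single extra copy (three copies of both ΔB and TRBC2 in the non–T cells), which is consistent with the fractions quoted in the figure legend. Mean copy numbers per cell are used in place of measured concentrations, which is valid because only concentration ratios enter the formulas.

```python
def tcf_classic(conc_db: float, conc_ref: float) -> float:
    """Classic model: T cell fraction = 1 - [dB] / [REF]."""
    return 1.0 - conc_db / conc_ref


def tcf_adjusted(conc_db: float, conc_rc: float, conc_ref: float) -> float:
    """Adjusted model: T cell fraction = ([RC] - [dB]) / [REF]."""
    return (conc_rc - conc_db) / conc_ref


def mean_copies(t_fraction: float, copies_in_t: float,
                copies_in_non_t: float) -> float:
    """Mean copy number per cell of a marker in a T / non-T cell admixture."""
    return t_fraction * copies_in_t + (1.0 - t_fraction) * copies_in_non_t


if __name__ == "__main__":
    t_fraction = 0.5
    # Assumed copy numbers per cell (T cells, non-T cells with a one-copy gain)
    db = mean_copies(t_fraction, 0, 3)   # marker lost in T cells, gained in tumor
    rc = mean_copies(t_fraction, 2, 3)   # corrector retained in T cells, gained in tumor
    ref = mean_copies(t_fraction, 2, 2)  # reference diploid in all cells
    print(tcf_classic(db, ref))          # 0.25: the gain masks half of the T cells
    print(tcf_adjusted(db, rc, ref))     # 0.5: the true T cell fraction
```

In a copy number stable sample, the regional corrector and the reference have equal concentrations, and both models return the same fraction.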

### 3.7 Analysis of Samples with a Lymphoproliferative Component

This protocol is designed for the quantification of copy number stable T cells, mixed with (potentially) unstable non–T cells. For that reason, particular care should be taken in the analysis of samples of lymphoproliferative origin, such as T and B cell lymphomas and leukemias. The maturation stage at onset of the malignancy, clonality, and genetic stability of (pre-)T cell malignancies may have different consequences for the availability of our T cell markers. Whereas mature T cell proliferations have undergone VDJ rearrangement and have generally lost the marker on both alleles, the TR genes in immature T cell proliferations might be incompletely rearranged. As a result, our T cell markers may be mono- or biallelically present in (malignant) T cells, which then do not follow our mathematical model. Moreover, in lymphoid malignancies, TR gene rearrangements are not restricted to the T cell lineage. For example, in precursor-B-acute lymphoblastic leukemias and in acute myeloid leukemias, cross-lineage rearrangements of TR genes are found [27, 28]. As a consequence, our T cell markers may be deleted in these admixed leukemic B cells, resulting in an overestimation of the T cell fraction. Therefore, it is generally recommended to ensure the absence of any of these alterations when analyzing samples with a lymphoproliferative component.

### 4 Notes

	- (a) Adhesive foils are used to cover the plate before droplet generation and can be removed easily. These foils should be pierceable by the Automatic Droplet Generator.
	- (b) Heat-sealed foils are used to cover the plate after droplet generation and during the PCR. These foils should be compatible with the heating steps in the PCR thermal cycler and should be pierceable by the Droplet Reader. Make sure that only a single foil is used and that the plate is sealed completely.

(900 nM) and probes (250 nM) are usually mixed together (the "assay") and can be stored in the fridge (short term) or the freezer (long term).

	- (a) DNA from 100% non–T cells (e.g., cultured fibroblasts, copy number stable).
	- (b) DNA from >95% T cells (e.g., purified/sorted T cells from blood, copy number stable).
	- (c) DNA from 100% T cells (e.g., a T cell line, copy number stable).
	- (d) DNA from 100% non–T cells with a CNA affecting the T cell marker locus (e.g., a pure cancer cell line).
	- (e) DNA from samples with a known T cell fraction (e.g., blood samples measured for T cell content using flow cytometry).

with (control) samples. To reduce the number of pipetting actions and enhance the practical performance, the PCR reaction mixture (without DNA) can usually be prepared for multiple wells at once. As a final step, the DNA samples can then be added directly to the various wells.


### References


DNA proviral load and T cells from blood and respiratory exudates sampled in a remote setting. J Clin Microbiol 57(2):e01063-18


quantitative T cell receptor gene rearrangement studies and gene expression profiling. J Exp Med 201(11):1715–1723


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Gene Engineering T Cells with T-Cell Receptor for Adoptive Therapy

### Dian Kortleve, Mandy van Brakel, Rebecca Wijers, Reno Debets, and Dora Hammerl

### Abstract

Prior to clinical testing of adoptive T-cell therapy with T-cell receptor (TCR)-engineered T cells, TCRs need to be retrieved, annotated, gene-transferred, and extensively tested in vitro to accurately assess specificity and sensitivity of target recognition. Here, we present a fundamental series of protocols that cover critical preclinical parameters, thereby enabling the selection of candidate TCRs for clinical testing.

Key words T-cell receptor, T-cell engineering, TCR cloning, TCR annotation, Gene transfer, In vitro assays, Specificity, Sensitivity

### 1 Introduction

Adoptive therapy with T-cell receptor (TCR)-engineered T cells is based on the insertion of genes encoding a TCR directed against a predefined tumor antigen into the patient's T cells, which are then re-infused into the patient. Once transferred to the patient, TCR-engineered T cells specifically migrate toward and kill tumor cells that express this antigen. The promises and challenges of this form of immunotherapy are reviewed elsewhere [1, 2]. Here we provide an overview of steps and details of laboratory protocols necessary to obtain and test TCRs, thereby providing a platform for the identification and selection of those TCRs amenable for further preclinical studies and, when successful, clinical studies.

Epitope-specific T cells and their corresponding TCRs are generally retrieved from tumor-infiltrating lymphocytes (TILs) or peripheral blood mononuclear cells (PBMCs) derived from either patients or healthy donors. In some cases, frequencies of epitope-specific T cells can be amplified in co-culture systems with antigen-presenting cells (not part of this chapter, but well described in Theaker et al. and Wölfl et al. [3, 4]). Epitope-specific T cells can be detected and isolated by fluorescence-activated cell sorting (FACS) using peptide-major histocompatibility complexes (pMHC). Then, the RNA of these sorted T cells can be isolated. In Subheadings 2.1 and 3.1, we present materials and protocols to obtain, sequence, and identify TCRα and β chains from RNA isolated from pMHC-sorted T cells. TCRα and β sequences are identified with the SMARTer RACE cDNA Amplification Kit (Takara Bio) and Sanger sequencing, after which sequences are annotated using the IMGT database and the HighV-QUEST tool.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_13, © The Author(s) 2022

Depending on the presence and frequency of T-cell clones, a variable number of TCRα and β chains are identified, and single α and β chains can be co-introduced into T cells to test TCRαβ heterodimers. In Subheadings 2.2 and 3.2, we present materials and protocols to introduce TCRαβ genes into T cells. TCRα and β chains that are molecularly connected with a 2A linker are cloned into an expression vector and retrovirally transduced into T cells. To this end, packaging cells are transfected with the TCRα and β genes as well as retroviral helper constructs, which will enable the secretion of virus particles with RNA encoding the TCR gene construct. PBMCs from healthy donors are activated with stimulatory antibodies and/or cytokines and incubated with the virus particles, leading to a stable integration of TCR genes.

TCR-transduced T cells can be validated in vitro. In Subheadings 2.3 and 3.3, we present materials and protocols to assess TCR surface expression and sensitivity as well as the specificity of T cells expressing an epitope-specific TCR. The surface expression of the TCR transgene is measured using pMHC multimers at the single cell level via flow cytometry. Functional avidity of T cells expressing such TCR transgenes can be determined by measuring IFNγ secretion upon co-culture of these T cells with antigen-presenting cells (APCs) loaded with different concentrations of the cognate epitope. Additionally, the specificity of TCR transgene-expressing T cells is determined by identifying the recognition motif of the TCR, i.e., those amino acids and their positions in the cognate epitope that are critical for recognition by this particular TCR. The more stringent the TCR recognition motif (i.e., the more amino acid residues critically contribute to the epitope's recognition), the lower the chance that the TCR is cross-reactive. Finally, tumor cell recognition assays can be performed to test whether the TCR can recognize epitopes that are the product of endogenous antigen processing and presentation by tumor cells. Extensive in vitro testing of the TCR using sensitivity and specificity assays is crucial to assess its potential clinical value [5]. Collectively, the protocols below provide a stepwise approach to identify TCRαβ sequences, introduce the TCR into T cells, and characterize the TCR in vitro (see Fig. 1).
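To make the recognition-motif idea concrete, the sketch below lists the single amino acid mutants of an epitope that such a scan would test, one per position. It rests on assumptions not stated in this excerpt: each residue is replaced by alanine, a native alanine is replaced by glycine, and the nine-residue peptide used in the test is an arbitrary placeholder rather than a real epitope. The function name is hypothetical.

```python
def single_substitution_variants(epitope: str, replacement: str = "A", fallback: str = "G"):
    """Return (position, original, substitute, variant) for every position.

    Assumption: residues are replaced by `replacement` (alanine by default);
    positions that already carry that residue receive `fallback` instead.
    Positions are 1-based, as is usual for peptide numbering.
    """
    variants = []
    for pos, residue in enumerate(epitope, start=1):
        substitute = fallback if residue == replacement else replacement
        mutant = epitope[:pos - 1] + substitute + epitope[pos:]
        variants.append((pos, residue, substitute, mutant))
    return variants
```

Positions whose mutant peptide no longer triggers IFNγ production would then be read as part of the recognition motif.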

Fig. 1 A stepwise methodological approach to gene-engineered T cells with T-cell receptors for adoptive therapy. The steps are threefold: identification, gene introduction into T cells, and in vitro characterization of TCRαβ sequences (The illustration is created with BioRender.com)

### 2 Materials

2.1 Identification of TCR from RNA Isolated from pMHC-Positive T Cells

	- GSP1α: GATTACGCCAAGCTTGTTTTGTCTGTGATATACACA.

GSP1β: GATTACGCCAAGCTTTGCACCTCCTTCCCATTCACCCACCAGCTCAGCTC.

4. Nested primers: prepare a 10 μM stock in sterile dH2O. Store at −20 °C.


Forward M13: GTAAAACGACGGCCAGT.

Reverse M13: CAGGAAACAGCTATGAC.

	- 2. Peripheral blood mononuclear cells (PBMCs) (see Note 8).
	- 3. DMEM++++: DMEM medium supplemented with 10% fetal bovine serum (FBS), nonessential amino acids, 200 mM L-glutamine, and 1% penicillin-streptomycin (PS).
	- 4. RPMI HepesHuS++: RPMI medium supplemented with 25 mM Hepes, 6% human serum (see Note 9), 200 mM L-glutamine, and 1% PS.
	- 5. RPMI HepesFBS++: RPMI medium supplemented with 25 mM Hepes, 10% FBS, 200 mM L-glutamine, and 1% PS.
	- 6. PBS.
	- 7. PBS/1% FBS: add 5 mL FBS to 500 mL PBS. Store at 4 °C.
	- 8. PBS/0.1% gelatin: add 25 mL 2% gelatin solution to 500 mL PBS.
	- 9. Trypsin/EDTA.
	- 10. Hygromycin B and diphtheria toxin.
	- 11. Promega Calcium Phosphate Transfection Kit.
	- 12. TCR construct in expression vector (e.g., the pMP71 vector).
	- 13. pHIT60 and pColtGalV helper constructs.
	- 14. Ficoll-Paque plus (density: 1.077 g/mL).
	- 15. 10 μg/mL OKT-3 (anti-CD3 MoAb) in PBS stored at −80 °C.
	- 16. Retronectin: 12 μg/mL in dH2O stored at −20 °C or −80 °C (see Note 10).
	- 17. 100 IU/mL IL-2 (during transduction) and 360 IU/mL IL-2 (during culture).
	- 18. Trypan blue (TB) for cell counting.
	- 19. Hemacytometer counter and cover slips.
	- 20. Light microscope.
	- 21. T75 culture flasks.
	- 22. 0.45 μm filter.
	- 23. 10 mL syringes.
	- 24. 50 mL tubes.
	- 25. 50 mL Leucosep tubes.
	- 26. Non-tissue culture (NTC) 24-well plate.
	- 27. Parafilm.


### 3.1 Identification of TCR from RNA Isolated from pMHC-Positive T Cells

3.1.1 RACE-Ready cDNA, PCR, Cloning, and TCR Sequencing


to Eppendorf tube 2, mix by pipetting up and down, and briefly spin down. Incubate Eppendorf tube 2 for 90 min at 42 °C followed by 10 min at 70 °C.


Five cycles: 94 °C, 30 s; 72 °C, 1.5 min.

Five cycles: 94 °C, 30 s; 68 °C, 30 s; 72 °C, 1.5 min.

20 cycles: 94 °C, 30 s; 65 °C, 30 s; 72 °C, 1.5 min

10. Perform nested PCR on the RACE PCR products from step 9 of Subheading 3.1.1 (see Note 19). Mix 1 μL RACE PCR product with 1 μL nested universal primer, 1 μL NP1α- or β-primer, 22 μL dH2O, and 25 μL 2× Q5 master mix in a PCR tube, and perform nested PCR according to the following settings:

25 cycles: 94 °C, 30 s; 65 °C, 30 s; 72 °C, 1.5 min


Fig. 2 Correct size of amplified TCRα and β products after nested PCR. Correct size is around 800 bp. Intrinsically the TCRβ chain is larger than the TCRα chain; however, due to the design of RACE primers, the TCRα chain fragment is slightly larger after nested PCR

DNA to an Eppendorf tube. Add 1 μL linearized pRACE vector and 2 μL In-Fusion HD premix to the Eppendorf tube, and mix by vortexing. As a negative control, prepare an empty vector, replacing the eluted DNA with 7 μL dH2O. The reaction of the positive control provided by the manufacturer consists of 1 μL pUC19 vector, 2 μL 2 kb control insert, 2 μL In-Fusion HD premix, and 5 μL dH2O. Incubate the reactions for 15 min at 50 °C, and transfer to ice (see Note 21).

	- (a) 1/10: 25 μL of culture +50 μL SOC medium
	- (b) 1/100: 2.5 μL of culture +50 μL SOC medium
	- (c) Left over.
	- 1 cycle: 95 °C, 5 min

25 cycles: 95 °C, 30 s; 55 °C, 30 s; 72 °C, 1 min

	- 2. Copy the sequence in plain text or FASTA format.
	- 3. Classify the TCR-V, D, and J genes with the IMGT database and the HighV-QUEST tool (http://www.imgt.org/IMGT\_vquest/vquest). Submit the sequence by copy/paste, select Homo sapiens in the species section, and select the α (TRA) or β (TRB) sequence in the type of receptor/locus section. The TCR-V, D, and J genes are classified according to the most recent Lefranc nomenclature (see Note 24).
	- 4. Determine whether the constant region of the β chain is TCRβ constant 1 (Cβ1) or 2 (Cβ2). Align TCR-Cβ of interest with Cβ1 or Cβ2 sequences as reported in https://www.ncbi.nlm.nih.gov/nuccore.
	- 5. Determine the reading frame using the Expasy tool (https://web.expasy.org/translate/). Use Verbose as output format, and determine the in-frame sequence. If the sequence has multiple start codons that are in-frame, choose the start codon that is at the exact 5′ end of the leader sequence according to SignalP (http://www.cbs.dtu.dk/services/SignalP/).
	- 6. Design the TCRαβ sequence according to scheme below (see Note 25).

NotI—GCCACC (Kozak sequence) TCRVβ—Cβ1 or 2 without stop codon—T2A linker—TCRVα—Cα—stop codon—EcoRI
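The construct scheme above can be sketched as a small assembly helper. This is an illustration under stated assumptions, not the authors' cloning workflow: the NotI (GCGGCCGC) and EcoRI (GAATTC) recognition sequences and the Kozak sequence are standard, but the T2A nucleotide sequence must be supplied by the user (codon choices differ between constructs), TGA is an arbitrary choice of stop codon when the α chain lacks one, and all function names are hypothetical.

```python
NOT_I = "GCGGCCGC"   # NotI recognition site
ECO_RI = "GAATTC"    # EcoRI recognition site
KOZAK = "GCCACC"     # Kozak sequence placed directly before the start codon
STOP_CODONS = {"TAA", "TAG", "TGA"}


def _checked(cds: str) -> str:
    """Upper-case a coding sequence and require whole codons."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return cds


def strip_stop(cds: str) -> str:
    """Remove a terminal stop codon so translation continues into the linker."""
    cds = _checked(cds)
    return cds[:-3] if cds[-3:] in STOP_CODONS else cds


def ensure_stop(cds: str, stop: str = "TGA") -> str:
    """Append a stop codon if the sequence does not already end with one."""
    cds = _checked(cds)
    return cds if cds[-3:] in STOP_CODONS else cds + stop


def assemble_tcr_cassette(beta_cds: str, alpha_cds: str, t2a_linker: str) -> str:
    """NotI, Kozak, V-beta/C-beta (no stop), T2A, V-alpha/C-alpha, stop, EcoRI."""
    insert = KOZAK + strip_stop(beta_cds) + _checked(t2a_linker) + ensure_stop(alpha_cds)
    for name, site in (("NotI", NOT_I), ("EcoRI", ECO_RI)):
        if site in insert:
            raise ValueError(f"internal {name} site would interfere with cloning")
    return NOT_I + insert + ECO_RI
```

Rejecting inserts that carry an internal NotI or EcoRI site is a design choice of this sketch; in practice such a site could instead be removed by a silent mutation.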

	- 2. Wash the adherent 293T and Phoenix-Amp cell lines with PBS, and loosen the cells with 2 mL trypsin/EDTA at 37 °C.

### 3.2 Gene Transfer of TCR into T Cells

3.2.1 Packaging TCR Viruses

	- (a) 10–15 μg TCR construct.
	- (b) 5 μg of each helper construct pHIT60 and pColtGalV.
	- (c) Add dH2O to a volume of 500 μL.
	- (d) Add 62 μL CaCl2

3.2.2 Activation of Peripheral Blood Mononuclear Cells (PBMCs)


3.2.3 Transduction of PBMCs

	- 2. Block the wells with 1 mL PBS/2% FBS for 30 min at 37 °C.
	- 3. Harvest virus supernatant from the transfected packaging cells (step 11 of Subheading 3.2.1), and filter through a 0.45 μm filter using a 10 mL syringe into a 50 mL tube.
	- 4. Add 100 IU/mL IL-2 to the filtered virus supernatant.
	- 5. Add 10 mL fresh RPMI HepesFBS++ to the packaging cells to start a second production round of TCR-encoding virus particles.
	- 6. Aspirate the PBS/2% FBS from the wells of the 24-well plate (step 2 of Subheading 3.2.3), and add 0.3 mL virus supernatant to each well.
	- 7. Centrifuge for 15 min at 1000 × g with slow deceleration settings.
	- 8. Harvest the activated PBMCs (step 12 of Subheading 3.2.2) by pipetting the cells up and down and using a cell scraper to scrape the cells loose. Transfer the cells to a 50 mL tube.
	- 9. Centrifuge and add RPMI HepesHuS++ to the cells.

### 3.3 In Vitro Validation of TCR

3.3.1 Surface Expression of TCR Transgene

	- (a) 1 μL anti-CD3 FITC
	- (b) 2.5 μL 1/80 diluted anti-CD8 APC
	- (c) FACS buffer to make a total volume of 10 μL.

### 3.3.2 Sensitivity of TCR Transgene


Transgene

Fig. 3 Pipetting scheme to assess the TCR transgene's sensitivity (a) and specificity (b) for its cognate epitope. Different concentrations of the cognate epitope (a) or different single amino acid mutants of the cognate epitope (illustrated with an example sequence) (b) are incubated with BSM or T2 cells prior to co-culture with TCR- or mock-transduced T cells

that should not be recognized by the TCR-transduced T cells as a negative or background control. An example of the layout for the 96-well plate with different epitope concentrations and controls is shown in Fig. 3a.

	- 2. Centrifuge and resuspend T cells in CTX medium at a final concentration of 0.6 × 10<sup>6</sup> cells/mL.
	- 3. Harvest BSM or T2 cells.
	- 4. Centrifuge and resuspend BSM cells in CTX medium at a final concentration of 0.2 × 10<sup>6</sup> cells/mL.
	- 5. Transfer 1 mL of BSM cells to each FACS tube.
	- 2. Harvest TCR- and mock-transduced T cells.
	- 3. Centrifuge and resuspend T cells in CTX medium at a final concentration of 0.6 × 10<sup>6</sup> cells/mL.
	- 4. Harvest target cells from step 1 of Subheading 3.3.4.
	- 5. Centrifuge and resuspend target cells in CTX medium at a final concentration of 0.2 × 10<sup>6</sup> cells/mL.
	- 6. Add 100 μL of T cells to each well (in triplicates) of a 96-well TCT round bottom plate.
	- 7. Add 100 μL of target cells to each well (in triplicates), making a total volume of 200 μL.
	- 8. Centrifuge the plate for 2 min at 220 × g with slow deceleration settings.
	- 9. Incubate the plate for 16–24 h at 37 °C/5% CO2.
	- 10. Centrifuge the plate for 2 min at 220 × g with slow deceleration settings.
	- 11. Harvest supernatant which can be used to measure IFNγ levels with ELISA-based methods as a readout for TCR transgene-mediated recognition of target cells (see Note 36).
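The cell numbers implied by the co-culture steps above can be checked with a few lines of arithmetic. The concentrations and volumes are taken from the protocol; the helper name is hypothetical, and integer inputs are used so that the result is exact.

```python
def cells_per_well(cells_per_ml: int, volume_ul: int) -> float:
    """Number of cells delivered when `volume_ul` microliters are pipetted."""
    return cells_per_ml * volume_ul / 1000


# Values from the steps above: 100 uL of each suspension per well.
t_cells = cells_per_well(600_000, 100)       # T cells at 0.6 x 10^6 cells/mL
target_cells = cells_per_well(200_000, 100)  # target cells at 0.2 x 10^6 cells/mL
effector_to_target = t_cells / target_cells
```

With these settings each well receives 60,000 T cells and 20,000 target cells, i.e., three T cells per target cell.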

### 4 Notes


of interest (in these examples, epitopes bound by HLA-A2 are considered), also other cell lines with other HLA alleles can be used.


$$\text{Concentration/mL} = \frac{\text{counted cells} \times \text{dilution factor} \times 10^{3}}{\text{number of squares counted} \times \text{surface area per square in mm}^{2} \times \text{depth in mm}}$$
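The counting formula can be written as a short function. This is a sketch: the default square area (1 mm²) and chamber depth (0.1 mm) are assumptions that hold for many counting chambers but should be checked against the chamber actually used, and the function name is hypothetical. The factor 10³ converts cells per mm³ (that is, per μL) into cells per mL.

```python
def cells_per_ml(counted_cells: int, dilution_factor: float, squares_counted: int,
                 area_per_square_mm2: float = 1.0, depth_mm: float = 0.1) -> float:
    """Cell concentration per mL from a hemacytometer count."""
    counted_volume_ul = squares_counted * area_per_square_mm2 * depth_mm  # mm^3 equals uL
    if counted_volume_ul <= 0:
        raise ValueError("counted volume must be positive")
    return counted_cells * dilution_factor * 1e3 / counted_volume_ul
```

For example, 100 cells counted in four squares after a twofold dilution in trypan blue corresponds to about 5 × 10⁵ cells/mL under the default chamber dimensions.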


36. To determine whether a T-cell response against target cell lines is meaningful, IFNγ levels between TCR T cells and mock T cells should be compared. Significant differences in IFNγ levels can be tested with the Mann-Whitney test.
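The comparison described in Note 36 can be sketched with an exact (permutation-based) Mann-Whitney test that needs only the Python standard library. This is a minimal illustration rather than a validated statistics routine: the function names are hypothetical, the IFNγ values used to try it out must come from your own ELISA, and full enumeration is only practical for small groups such as replicate wells.

```python
from itertools import combinations


def u_statistic(x, y):
    """U for sample x: each pair (xi, yj) scores 1 if xi > yj and 0.5 if tied."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def mann_whitney_exact(x, y):
    """Return (U, two-sided exact p-value) by enumerating all group assignments."""
    pooled = list(x) + list(y)
    n_x, n = len(x), len(pooled)
    expected = n_x * (n - n_x) / 2.0          # mean of U under the null hypothesis
    observed = abs(u_statistic(x, y) - expected)
    hits = total = 0
    for idx in combinations(range(n), n_x):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # Count assignments at least as extreme as the observed one.
        if abs(u_statistic(group_x, group_y) - expected) >= observed - 1e-12:
            hits += 1
    return u_statistic(x, y), hits / total
```

Note that with three wells per condition there are only 20 possible assignments, so the smallest attainable two-sided exact p-value is 2/20 = 0.1; more replicates per group are needed before this test can reach p < 0.05.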

### References


specific fibronectin fragments increases genetic transduction of mammalian cells. Nat Med 2: 876–882


Transplantation 38:401–406. https://doi.org/10.1097/00007890-198410000-00017



# Combined Analysis of Transcriptome and T-Cell Receptor Alpha and Beta (TRA/TRB) Repertoire in Paucicellular Samples at the Single-Cell Level

### Nicolle H. R. Litjens, Anton W. Langerak, Zakia Azmani, Xander den Dekker, Michiel G. H. Betjes, Rutger W. W. Brouwer, and Wilfred F. J. van IJcken

### Abstract

With the advent of next-generation sequencing (NGS) methodologies, the total repertoires of B and T cells can be disclosed in much more detail than ever before. Even though many of these strategies provide in-depth and high-resolution information on the immunoglobulin (IG) and/or T-cell receptor (TR) repertoire, one clear disadvantage is that the IG/TR profiles cannot be connected to individual cells. Single-cell technologies, in contrast, allow the IG/TR repertoire to be studied at the individual cell level. This is especially relevant in cell samples in which considerable heterogeneity of the cell population is expected. By combining the IG/TR repertoire with transcriptome data, the reactivity of the B or T cell can be associated with activation or maturation stages. An additional advantage of such single-cell technologies is that the combination of both IG or both TR chains can be studied on a per cell basis, which better reflects the antigen receptor reactivity of cells. Here we present the ICELL8 single-cell method for the parallel analysis of the TR repertoire and transcriptome, which is especially useful in samples that contain relatively few cells.

Key words T-cell receptor alpha, T-cell receptor beta, Repertoire, Transcriptome, Single cell, Next-generation sequencing

### 1 Introduction

T cells recognize antigens via unique T-cell receptor (TCR) molecules. Approximately 95% of T cells express a TCRαβ receptor, consisting of a TCRα and a TCRβ chain, whereas the remaining 5% possess a TCRγδ receptor, consisting of a TCRγ and a TCRδ chain. All four TCR chains are highly diverse in their variable domains. Diversity in these variable domains arises from complex recombination processes involving V, D, and J genes in the TCR chain-encoding loci [1]. In this way the V(D)J recombination process generates a huge TR repertoire diversity, which is especially apparent in the V(D)J junction. The V(D)J junction is one of the complementarity-determining regions (i.e., CDR3) of the variable domain, which collectively mediate the specific recognition of antigens. Estimates of the number of possible different TCRαβ receptors amount to 10<sup>12</sup> molecules [2, 3]. Importantly, whereas antigen-inexperienced or naïve T cells have a broad, unselected TCR repertoire [4], antigen-experienced or memory T cells generally contain narrower TCR repertoires, mostly consisting of particular antigen-selected specificities.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_14, © The Author(s) 2022

Historically, TCR repertoire diversity assays have mostly focused on TCRβ (TRB) chain profiling. Varying from DNA- [5] or RNA-based [6] TRB bulk sequencing assays to flow cytometry-based single-cell TCRVβ approaches [7], all suffer from drawbacks. A major disadvantage of bulk sequencing approaches is the large number of cells required, whereas flow cytometry-based TCRVβ assays suffer from the limitation that the 24 different TCRVβ antibodies collectively cover only 70% of the normal human TCRVβ repertoire. Moreover, neither of these approaches allows evaluation of the actual composition of the total TCRαβ receptor, as no information on TCRα (TRA) profiles is obtained. Most importantly perhaps, with any of these approaches, it remains difficult to examine changes in TCRαβ repertoire diversity within a heterogeneous pool of T cells or a low-abundance population such as antigen-specific T cells without purifying them first and/or acquiring large enough numbers of cells.

Over the last 5 years, single-cell transcriptomics has become a popular approach, as it allows the detection of heterogeneity in gene expression among individual cells and the discovery of small subpopulations [8]. The combination of single-cell transcriptomics with TR transcript sequencing provides gene expression and TCR repertoire information at the single-cell level. Several platforms exist for combined single-cell TCR repertoire and transcriptomics analysis, including 10x Genomics and more recently the ICELL8 single-cell system [9, 10]. Typically, single-cell transcriptomics requires 5–10 K cells [11–13], but little is known about the possibilities of single-cell-based molecular tools for interrogating clinically relevant paucicellular samples [9, 10].

Here we describe a method for the combined evaluation of the transcriptome and TRA/TRB repertoire at the single-cell level in clinical samples with low cell numbers. The method covers all the steps from cell dispensation using the ICELL8 single-cell system, double cDNA preparation at the single-cell level, parallel sequencing of transcript and TRA/TRB sequencing libraries, to data evaluation.

### 2 Materials


13. ICELL8 Human TCR a/b Profiling Reagent Kit (Takara Bio).

	- 2. Paired end 600-Cycle Sequencing Kit (Illumina).
	- 3. HiSeq Rapid SBS Kit v2 (50 cycles) or equivalent (for sequencers other than HiSeq1500, 2500) (Illumina).
	- 4. PhiX (Illumina).

### 2.4 Analysis

Linux (virtual) machine with at least 16 GB RAM and the following software installed:


### 3 Methods

3.1 Sample Preparation

	- 2. Bring DNase medium and FBS-HI to 37 °C.
	- 3. Prepare AutoMacs running buffer by adding 50 mL of 15% BSA to 1450 mL of AutoMacs rinsing solution, and bring buffer to room temperature (see Note 1).
	- 4. Start up the AutoMacs Pro Cell Sorter according to manufacturer's instruction.
	- 5. Add 5 mL of DNase medium and 1 mL of FBS-HI per vial (1.8 mL) of 10–20 million PBMC to a 15 mL polypropylene tube (use maximal two vials of PBMC per 15 mL tube).
	- 6. Take the required number of vials of peripheral blood mononuclear cells (PBMC) from liquid nitrogen storage or a −150 °C freezer (see Note 2).
	- 7. Thaw PBMC at 37 °C until only a small clump of ice remains.
	- 8. Add the PBMC suspension dropwise to the 15 mL tube containing 5 mL of DNase medium and 1 mL of FBS-HI.
	- 9. Centrifuge at 900 × g for 10 min.

### 3.2 Cell Dispensation

3.2.1 Dispense Instrument Pre-checks

3.2.2 Staining of Cell Suspension


3.2.3 Dilution and Dispensation of Cells

	- 2. Proceed up to "Specify sample names," then return to the Summary tab, and determine the Poisson value for each sample position (see Note 14).

processing and analysis via RT-PCR. See manufacturer's instructions (Chapter D) on https://www.takarabio.com/documents/User Manual/ICELL8 Human TCR ab Profiling User Manual/ICELL8 Human TCR ab Profiling User Manual\_072219.pdf (see Note 15).

2. When instructed to use the Manual triage function, use this function for the following:

Exclude some wells that were falsely marked as candidate wells.

Include a lot of wells that were not included by the software:

	- "State" (see Note 16)
	- "HasDeadCells"
	- "Cells1".
	- (a) Go to the "Wells" tab and select all wells.
	- (b) Copy (Ctrl-c) and paste (Ctrl-v) the data into a spreadsheet.
	- (c) In the spreadsheet, sort by "For dispense," and delete rows that contain "For dispense- FALSE."
	- (d) Highlight wells that have duplicate barcodes by using the conditional format function.
	- (e) Select the wells that need to be excluded.
	- (f) Switch to the CellSelect software and go to "Wells" tab.
	- (g) Manually highlight the wells that need to be excluded, and exclude these wells.
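The spreadsheet steps above (keep the wells marked for dispense, then flag wells that share a barcode) can also be scripted. The sketch below is an assumption-laden illustration: it expects the rows copied from the "Wells" tab as dictionaries, and the column names "Barcode" and "For dispense" as well as the function name are hypothetical and should be adapted to the actual CellSelect export.

```python
import csv
import io
from collections import Counter


def wells_with_duplicate_barcodes(rows, barcode_col="Barcode", dispense_col="For dispense"):
    """Return the selected wells whose barcode occurs more than once.

    rows: iterable of dicts, e.g. from csv.DictReader on the pasted well table.
    Only rows with dispense_col equal to TRUE (case-insensitive) are considered.
    """
    selected = [r for r in rows if r[dispense_col].strip().upper() == "TRUE"]
    counts = Counter(r[barcode_col] for r in selected)
    return [r for r in selected if counts[r[barcode_col]] > 1]


def read_pasted_table(text: str):
    """Parse tab-separated text (as pasted from a spreadsheet) into row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

The wells returned by this function are the candidates to exclude manually in CellSelect.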

3.4 Cleanup and Concentration After Full-Length cDNA Extraction

1. Purify and concentrate the amplified full-length cDNA extraction with the Gel and PCR Cleanup Kit according to the manufacturer's instructions.

2. Purify the concentrated full-length cDNA eluted from the Gel and PCR Cleanup Kit with the AMPure XP beads.


3.6 Preparation of TCR a/b Library by Semi-Nested PCR


Fig. 1 Typical BioAnalyzer output of full-length cDNA, showing a broad peak spanning ~400 bp to ~6000 bp


### 3.7 Purification of TCR a/b Library

1. Purify and size-select the TCR library with the AMPure XP beads.


3.8 Validation and Quantification of TCR a/b Library

Fig. 2 Typical BioAnalyzer output of purified TCR library, showing a broad peak (550–1200 bp) with a maximum between ~700 bp and ~900 bp

peak spanning 550–1200 bp, with a maximum between ~700 bp and ~900 bp (Fig. 2).

	- 2. Incubate the reaction in a thermal cycler, using the program in Table 2.
	- 3. Add 5 μL of neutralize tagment (NT) buffer to each well.
	- 4. Pipette up and down five times to mix.

3.9 Preparation of 5′ Differential Expression (5′ DE) Library

	- 2. First purification: add 1.0 volume (50 μL) of AMPure XP beads to the previous PCR product (~50 μL). Mix by pipetting and spin down briefly.
	- 3. Incubate the sample at room temperature for 5 min.

### Table 1 Tagmentation reaction


### Table 2

### Thermal cycler program of tagmentation reaction


### Table 3 Nextera XT PCR reaction


### Table 4

### Thermal cycler program for Nextera XT PCR reaction



### 3.11 Validation and Quantification of 5′ DE Library

Fig. 3 Typical BioAnalyzer output of purified 5′ DE library, showing a broad peak spanning 200–1400 bp

3.12 Next-Generation Sequencing


for subsequent TCR analysis (see Note 31). For the transcriptome library, a minimal yield of 50 M clusters is advised to ensure sufficient data for transcriptome analysis.

### 3.13 Data Analysis

In the following sections, code will be typeset in a monospaced font.

	- 2. Make a new working directory in which to perform the TCR assignment, and copy the newly generated FastQ files and the well list obtained during well selection over.
	- 3. Go into the new working directory.
	- 4. Create a directory labeled demultiplexed.
	- 5. Create individual FastQ files per well using pysc (https://github.com/erasmusmc-center-for-biomics/pysc). Both the data start (base 14) and barcode position (read 1 bases 0 to 10) need to be specified while running this tool, as well as a filename pattern for the output reads. The following command will place the FastQ files per well in the demultiplexed directory:

```
# Split the TCR reads into one FastQ pair per well, written to demultiplexed/
python3 /data/Software/python/pysc/bin/pysc demultiplex \
 --read_1 TCR_S1_L001_R1_001.fastq.gz \
 --read_2 TCR_S1_L001_R2_001.fastq.gz \
 --well-list welllist.txt \
 --output-read-1 "demultiplexed/{sample}_{row}_{column}_R1.fastq" \
 --output-read-2 "demultiplexed/{sample}_{row}_{column}_R2.fastq" \
 --well-barcode-read 1 \
 --well-barcode-start 0 \
 --well-barcode-end 10 \
 --data-start 14
```
6. Run the TCR snakemake workflow (https://github.com/erasmusmc-center-for-biomics/tcr-workflows) which removes sequence adapters introduced during the sample preparation,



3.13.4 Projecting Gene Expression Data on Cell Coordinates

Fig. 4 Results from the PCA (a) and Jackstraw (b) analyses

3.13.5 Projecting VDJ Usage Data on Cell Coordinates


Fig. 5 Projections of single cells on a two-dimensional plane based on their expression profiles using t-SNE (a) and UMAP (b)

### 4 Notes


Samples containing a lot of dead cells, for example, 50%, do not perform well in the single-cell RNA sequencing. Typically viability should be more than 80%.

4. Both untouched and positive selection of T cells/cells of interest work for downstream processing/combined analysis of transcript and TRA/TRB sequencing libraries.

Fig. 6 The number of cells with the same TCR ordered from the most to the least abundant receptor per sample

5. When more than 25 million cells are available, consider running a second separation tube instead of increasing the volume of the tube. This will result in better purities of enriched fractions. For more than one separation, use a quick rinse in between samples when they are of the same origin, and rinse if you want to separate a sample of a different origin.

Fig. 7 Expression of CD3d (ENSG00000167286) projected on cells placed with t-SNE (a) and UMAP (b)


Fig. 8 TCR V(D)J composition projected on cells placed with t-SNE (a, c) and UMAP (b, d) for TRA (a, b) and TRB (c, d)

suspension per sample (100 μL for the blank chip and 200 μL for the barcoded chip). The blank chip can be re-used. Keep the unused wells in the source plate empty. After dispensing, the wells in the chip that correspond to the empty wells of the source plate will stay clean for the next blank dispense.

13. When prompted for "Run CellSelect with images from: C:\Wafergen\WafergenData For Chip: <Chip ID>?", click NO. After imaging, keep the blank chip: it might still contain some empty sample positions, or it can be used as a balance chip during centrifugation. Store the printed chip at −80 °C; this step can also be performed with non-pre-chilled holders.


```
From: C:\Wafergen\WafergenData\<Chip ID>
To: C:\Users\ICELL8\Desktop\Analyzed Images\<Chip ID>
```
A maximum of 1728 wells can be selected for further dispense. The downselect function is generally not required.

	- Cluster (+LowConfidence): might contain some cells to include
	- Good: might contain some cells to exclude (empty or duplicate)
	- HasDeadCells (+LowConfidence): might contain many cells to include

(Cells are sometimes incorrectly marked as "dead" due to false detection of well center; dead cells could be included for analysis as well.)


1 μL to load onto the Denovix instrument. The measured concentration is an overestimate because of optical impurities. To compensate for this effect, divide the measured concentration by 2.


```
zcat \
  {fc_1}/{x}_S{y}_L001_R{z}_001.fastq.gz \
  {fc_n}/{x}_S{y}_L001_R{z}_001.fastq.gz | \
  gzip -c > {dir}/{x}_R{z}.fastq.gz
```
Please note that different samples must not be merged with each other, and read 1 must not be merged with read 2. The order of the flow cells must be the same when merging read 1 and when merging read 2. The sample number can differ between flow cells and should not be taken into account while merging.
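The merging rule above can also be sketched in Python. This is a minimal illustration and not part of the original protocol: the function names `merge_fastq_gz` and `merge_sample` are hypothetical, and the caller is assumed to supply one (read 1, read 2) pair of paths per flow cell, already in the desired flow-cell order.

```python
import gzip
import shutil
from pathlib import Path


def merge_fastq_gz(inputs, output):
    """Decompress each input in the given order and recompress into one file."""
    output = Path(output)
    output.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(output, "wb") as dst:
        for path in inputs:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, dst)
    return output


def merge_sample(flow_cell_files, sample, out_dir):
    """Merge read 1 and read 2 separately for one sample.

    flow_cell_files: list of (r1_path, r2_path) tuples, one per flow cell.
    Both reads are taken from the same list, so the flow-cell order is
    identical for read 1 and read 2 (hypothetical helper, for illustration).
    """
    out_dir = Path(out_dir)
    r1 = merge_fastq_gz([pair[0] for pair in flow_cell_files],
                        out_dir / f"{sample}_R1.fastq.gz")
    r2 = merge_fastq_gz([pair[1] for pair in flow_cell_files],
                        out_dir / f"{sample}_R2.fastq.gz")
    return r1, r2
```

Because both output files are built from the same list of pairs, read 1 and read 2 cannot end up in a different flow-cell order, which is the property the note asks for.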


35. TCR sequences are considered identical when they are composed of the same V, D, and J genes and have the same CDR3 sequence. As cells should not be counted twice, only the most abundant sequence for either TRA or TRB is taken into account.
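The counting rule in this note can be illustrated with a short Python sketch. It is an assumption-laden example, not the authors' pipeline: the record layout (dictionaries with the keys `cell`, `locus`, `v`, `d`, `j`, `cdr3`, and `reads`) and the function names are hypothetical, and read count is used as the measure of abundance within a cell.

```python
from collections import Counter


def top_chain_per_cell(records, locus):
    """Keep, for each cell, only the most abundant sequence of one locus."""
    best = {}
    for rec in records:
        if rec["locus"] != locus:
            continue
        current = best.get(rec["cell"])
        if current is None or rec["reads"] > current["reads"]:
            best[rec["cell"]] = rec
    return best


def clonotype_counts(records, locus):
    """Count cells per clonotype, defined as identical V, D, J, and CDR3.

    Returns (clonotype, n_cells) pairs ordered from most to least abundant.
    """
    best = top_chain_per_cell(records, locus)
    counts = Counter(
        (rec["v"], rec["d"], rec["j"], rec["cdr3"]) for rec in best.values()
    )
    return counts.most_common()
```

The ordered output corresponds to the kind of ranking shown in Fig. 6 (cells with the same TCR, ordered from the most to the least abundant receptor).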

### Acknowledgments

We would like to thank Maaike de Bie, Amy van der List, and Mariska Klepper for technical assistance.

### References


assay for identification and multi-parameter flow cytometric analysis of alloreactive T cells. Clin Exp Immunol 174(1):179–191

21. Litjens NHR, Langerak AW, van der List ACJ, Klepper M, de Bie M, Azmani Z, den Dekker AT, Brouwer RWW, Betjes MGH, Van IJcken WFJ (2020) Validation of a combined transcriptome and T cell receptor alpha/beta (TRA/TRB) repertoire assay at the single cell level for paucicellular samples. Front Immunol 11:1999

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Chapter 15

# AIRR Community Guide to Planning and Performing AIRR-Seq Experiments

### Anne Eugster, Magnolia L. Bostick, Nidhi Gupta, Encarnita Mariotti-Ferrandiz, Gloria Kraus, Wenzhao Meng, Cinque Soto, Johannes Trück, Ulrik Stervbo, and Eline T. Luning Prak, on behalf of the AIRR Community

### Abstract

The development of high-throughput sequencing of adaptive immune receptor repertoires (AIRR-seq of IG and TR rearrangements) has provided a new frontier for in-depth analysis of the immune system. The last decade has witnessed an explosion in protocols, experimental methodologies, and computational tools. In this chapter, we discuss the major considerations in planning a successful AIRR-seq experiment together with basic strategies for controlling and evaluating the outcome of the experiment. Members of the AIRR Community have authored several chapters in this edition, which cover step-by-step instructions to successfully conduct, analyze, and share an AIRR-seq project.

Key words AIRR-seq, Immunoglobulin, Antibody, T-cell receptor, Immune repertoire, V(D)J recombination, Next-generation sequencing

### 1 Introduction

Next-generation sequencing of adaptive immune receptor repertoires (AIRR-seq of immunoglobulin (IG) and T-cell receptor (TR) rearrangements) has provided a new frontier for in-depth analysis of the immune system. The Adaptive Immune Receptor Repertoire (AIRR) Community was founded with the goal of developing standards for AIRR-seq studies to enable analysis and sharing of AIRR-seq data. In this book, members of the AIRR Community and colleagues have contributed sample methods for immune repertoire profiling studies. These AIRR Community chapters cover experimental (wet lab) and computational (dry lab) methods and encompass all of the many facets of the AIRR Community. While much of our focus in these chapters is on how to adequately control, standardize, annotate, and share data, we found it impossible to discuss these attributes of AIRR-seq data without also describing the types of data sets that are generated and then integrating those descriptions with data analysis for commonly encountered use cases. In the companion AIRR Community data analysis chapters, information is provided about study design, data analysis, data use, and the AIRR data commons and how data can be reused and shared. In this chapter we describe how to plan and perform AIRR-seq experiments.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_15, © The Author(s) 2022

### 2 Planning the Experiment

Understanding the dynamics, selection, and pathology of immune responses has been greatly aided in recent years by next-generation sequencing (NGS)-based approaches to studying the adaptive immune receptor repertoire (AIRR) [1–3]. The AIRR Community is focused on the standardization, sharing, and re-use of these repertoire data [4]. The AIRR is the collection of distinct B-cell and T-cell clones (cells that are derived from a common progenitor cell) that are found in an individual. Each clone is associated with a distinct antigen receptor, which is a B-cell receptor (BCR or IG) or a TR. The DNA sequences that encode IG or TR are very diverse. This diversity is achieved through the recombination of variable (V), diversity (D), and joining (J) gene segments [5, 6]. Moreover, somatic hypermutation (SHM) provides further diversification of IG repertoires through DNA mutation [7, 8]. In addition to facilitating the sampling of diverse and complex immune repertoires, AIRR-seq has opened the door for systematic analysis and comparison of immune responses across different individuals and disease conditions [9–12]. The immune repertoire is dynamic and changes in its composition and diversity with age [13, 14], in different anatomic sites [15], and under diverse conditions such as malignancy, autoimmunity, immunodeficiency, infection, or vaccination [9, 13, 16–21]. In addition to comparing different individuals, AIRR-seq is also a powerful method for studying the evolution of immune responses or tracking specific B- or T-cell populations over time within individuals [22]. For example, clonal expansions can be identified, quantified, and monitored [23]. AIRR-seq studies not only enhance our ability to understand how to diagnose and monitor diseases but also can inform therapeutic approaches [12, 24–31].

When designing a study that leverages AIRR-seq data, several factors must be considered, including the subjects, the sample types, the manner in which the samples are processed, and the timeline. The types of samples, their numbers, and the budget often drive the types of questions that can be asked and answered using AIRR-seq. Once a suitable question has been defined and appropriate samples have been identified, the next major branch point in the decision-making process involves the selection of AIRR-seq methods. In this section, we provide a brief overview of the most important considerations when selecting one or more AIRR-seq methods for a research study or clinical evaluation.

2.1 Organisms

This chapter focuses on samples from humans, but samples from other vertebrates or synthetic libraries (such as phage display [32]) can also be studied. If one is planning an experiment with nonhuman or synthetic samples, it is worth considering whether there are established protocols (such as PCR primer sets) and analysis pipelines (including adequate libraries of validated germline gene sequences for animal species that are not frequently studied) for downstream analysis.

With respect to samples derived from humans, there are several considerations [4, 33]. First, are the samples coming from individuals who have given consent for a research study? If not, one should check with the local institutional review board (IRB) or other regulatory body and/or with the investigator who supplies the samples for guidance on whether samples can be studied or if additional regulatory approvals may be required for full analysis and/or sharing of the data. Second, the study design will be impacted by the availability of samples from individuals in different comparison groups or by the availability of samples that are collected over time from the same individuals. Depending on the research question, resources, and time horizon for the project, study participants may be recruited who have a particular disease (in which case the phase of the disease and prior or current therapies may be important). If studying immune responses, longitudinal collections from the same individual at multiple time points and synchronization of those time points across the study cohort may be important to study changes in clonal abundance or, in the case of B cells, the level of SHM within clonal lineages. Demographic characteristics of the individuals in the group under study (including but not limited to age, sex, geographical origin, and disease history) and the availability of one or more appropriately matched control groups are additional considerations. For TR-based sequencing, it is also useful to consider the HLA type, as HLA can have a major impact on TRBV gene usage [34]. Finally, if published data are going to be used for comparison, compatibility of the assay platforms and sample types is important.

2.2 Samples and Processing

Studies on humans are often limited by sample availability. The most commonly used sample is peripheral blood, which serves as starting material for a range of different sample types including whole blood (drawn into a tube with an anticoagulant such as EDTA), peripheral blood mononuclear cells (PBMCs, which are typically isolated by centrifugation over a Ficoll gradient), or plasma (the liquid portion of anticoagulated whole blood, which is typically prepared by centrifugation and stored frozen in aliquots for isolation of cell-free DNA). Samples from other body fluids such as cerebrospinal fluid or bronchoalveolar lavage may also provide important insights if sampled in certain disease states. Tissue samples can be obtained from fine-needle aspirations (where sample quantities may be very limited, particularly if the same samples are being used for both clinical and research purposes) or from biopsies, where larger amounts of tissue can be sampled. In the case of the bone marrow, the aspirate is typically used for the evaluation of clonally expanded populations. In some cases, it is possible to obtain multiple tissues (surveillance biopsies for transplant rejection or bone marrow samples) as well as peripheral blood from the same individual over time. Finally, different tissues can be accessed from the same individual in organ donors or living individuals, as has been described for studies of human tissue-based immunity [35] and in certain disease states, such as type 1 diabetes, lupus, or rheumatoid arthritis [36–43]. From most of these samples, either total cells or isolated cell subsets (obtained after cell sorting using flow cytometry or magnetic bead-based methods) can be analyzed. The sample size and purity of the cell population of interest are important to consider when designing the experiment and interpreting the results.

How samples are processed is a critical consideration for the design of AIRR-seq experiments. Bulk sequencing methods can use samples that are formalin-fixed, lysed, or non-viably cryopreserved. Fixation significantly reduces the quality of the input nucleic acid and may require larger amounts of input DNA or RNA as well as protocols that use shorter amplicons (such as primers that are positioned in FR3 instead of FR1). The longer a sample sits in a fixative or is stored as a formalin-fixed paraffin-embedded (FFPE) tissue section, the poorer the template quality becomes. If it is possible to obtain snap frozen tissues that are not fixed, this is preferable. For certain cell types, such as diffuse large B-cell lymphoma, using tissue sections may provide a higher yield of cells of interest than single-cell suspensions [44]. For single-cell-based methods, viable cells are essential and typically consist of either freshly isolated cells or cryopreserved cells. In the case of cryopreserved cells, one needs to consider whether the method of initial sample preparation has influenced the recovery or phenotype of the cell population of interest.

Cell sorting or enrichment with magnetic beads can be used to selectively recover larger numbers of cells of interest, for example, antigen-specific T cells identified by multimer staining, but these methods can also result in significant loss of sample. Sorting time should be kept to a minimum for plate-based single-cell methods, as cell viability decreases rapidly in the plate; ideally, the time from the addition of a live/dead staining solution to the end of the sort should not exceed 30 min. If longer sorting times are necessary, as is often the case for rare cells, cells can be sorted into PCR strips instead. For droplet sequencing-based single-cell methods, batches of 1000–20,000 cells are usually collected in PCR tubes that need to be coated to ensure complete recovery of the cells for further processing.

2.3 Bulk vs. Single-Cell Sequencing

There are two complementary approaches to analyze the AIRR by sequencing, and the choice between them is usually driven by the number of cells available and the research question. On the one hand, bulk AIRR-seq methods allow systematic and global analysis of TR and IG repertoires from as few as 1000 cells to hundreds of thousands of cells or more. Bulk methods provide information about the TR (usually alpha + beta) or IG (heavy + light) rearrangements, although the pairing information is lost during the cell lysis step. On the other hand, single-cell AIRR-seq offers the possibility to reconstruct paired chain information for each TR or IG. However, most single-cell methods use lower cell input numbers (usually <20,000 cells, due to constraints in costs associated with kits and sequencing). Hence single-cell approaches, when used on bulk populations, generally tend to be focused on specific cell subsets or antigen-enriched cells to ensure sufficient sampling of the population of interest. In some cases, for example, when multiple samples with different amounts of cell inputs are available from the same individual, it may be preferable to use a tiered approach. For example, one might rely on bulk sequencing to get a view of the overall clonal landscape and then leverage single-cell sequencing to gain detailed insights into the association of specific clones (with paired chain information) and cell phenotypes (either through flow cytometry or by single-cell RNA-seq). The single-cell approach is discussed in detail in the AIRR Community chapter (Chapter 20).

2.4 Template Amplification from DNA vs. RNA

Bulk AIRR-seq can be performed on libraries that have been generated from either genomic DNA (gDNA) or RNA. gDNA-based methods are exclusively based on multiplex PCR approaches, where primers targeting the different V genes (or leader regions) and J genes are combined in the same reaction. Advantages of DNA-based sequencing are the stability of the template and its parsimonious nature (one template per cell), which allows for studies in which large numbers of cells are studied at modest cost. Disadvantages include the potential for primer bias, as PCR primers are usually positioned in the V gene and J gene (due to constraints on sequence length) and the potential loss of amplification in heavily mutated IG sequences. The bulk DNA approach is discussed in the AIRR Community chapter (Chapter 18).

Messenger RNA-based methods can be based on multiplex PCR (with either V and J primer combinations or V and constant region (C) primer combinations), or they can use rapid amplification of cDNA ends (RACE)-PCR. Advantages of RNA-based sequencing are (1) more "shots on goal" with RNA than DNA (with individual B/T cells harboring multiple RNA copies vs. only a single DNA copy), allowing for higher yield of amplicons when there are low cell numbers; (2) reduced PCR bias with primers that are in the constant region; (3) the incorporation of unique molecular identifiers (UMIs) at the cDNA synthesis step (allowing for the generation of high-fidelity consensus sequences); and (4) the ability to generate data on the constant region usage for isotyping. Disadvantages of RNA-based sequencing methods include greater cost associated with the higher sequencing depths that are required (particularly if UMIs are used) and biases introduced by differences in transcript abundance in different cell types (if mixed rather than sorted populations are used for input). In the AIRR Community chapter (Chapter 19), we focus on the mRNA-based approach to AIRR-seq.
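To make the idea of UMI-based consensus sequences concrete, the following Python sketch groups reads by UMI and takes a per-position majority vote. It is a deliberately simplified illustration under stated assumptions, not the method used in Chapter 19: the function name `umi_consensus` and the `min_reads` parameter are hypothetical, reads are assumed to be already trimmed and oriented, and reads whose length differs from the most common length in their UMI group are simply discarded rather than aligned.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_reads=2):
    """Build one consensus sequence per UMI by per-position majority vote.

    reads: iterable of (umi, sequence) pairs.
    min_reads: UMI groups with fewer reads are skipped (illustrative cutoff).
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        if len(seqs) < min_reads:
            continue
        # Keep only reads of the most common length so columns line up.
        length = Counter(len(s) for s in seqs).most_common(1)[0][0]
        seqs = [s for s in seqs if len(s) == length]
        # Majority base at each position across the reads of this UMI.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus
```

Requiring several reads per UMI is one reason why UMI-based protocols need greater sequencing depth, as noted above.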

2.5 Commercial Kit vs. Homebrew Bulk Methods

Several commercial kits are now available to generate AIRR-seq data. Currently available commercial kits include gDNA-based methods (e.g., Adaptive Biotechnologies, iRepertoire) as well as mRNA-based methods (e.g., Illumina, Takara Bio, iRepertoire, MiLaboratory). Advantages of commercial-grade AIRR-seq assays are that kit reagents are produced following standards and rigorous quality controls such as qualifying primers, controlling for contamination, and verifying yield and amplification standards. Some vendors obtain certification in meeting rigorous quality standards in their laboratories that manufacture reagents, such as those set forth by the International Organization for Standardization (e.g., ISO 9001). In addition, service providers such as Adaptive Biotechnologies and iRepertoire offer large data sets for comparison and a series of user-friendly data analysis tools. Some disadvantages of commercial methods are that kits are expensive and that these assays are sometimes not easily adapted to specific experimental needs. On the other hand, with homebrew assays, there is considerable variation in assay linearity and reproducibility (e.g., see ref. 45), and it can take months or even years to set up robust, well-validated assays that are then also not easy to adjust. The use of commercially available kits for in-house experiments can be a compromise to ensure reliability of the reagents and protocol customization.

2.6 Single Cell: Index Sorting and Bead-Based Emulsion Approaches

Single-cell AIRR-seq (scAIRR-seq), like any other single-cell sequencing technology, relies on partitioning each cell. In early protocols, cells were index sorted into plates, and multiplex PCR was used to amplify both chains of immune receptors of a cell concomitantly [46, 47]. The emergence of single-cell RNA-seq (scRNA-seq) has provided another tool for AIRR-seq. Many protocols to recover and sequence mRNA from single cells have been developed and differ in their approaches for cell capture, cDNA synthesis (full-length or tag-based) and amplification (only PCR or PCR following reverse transcription), and library preparation steps [48]. Probably the most frequently used current commercial protocol for sequencing small cell numbers leverages the scSMARTer technology. With this approach, paired IG/TR information became accessible by combining full-length scRNA-seq amplification approaches with the development of de novo assembly-based bioinformatics tools (TraCeR, scTCRseq, TRAPeS, VDJPuzzle) [49–52]. Unfortunately, these approaches remain computationally intensive and relatively costly and are constrained with respect to cell throughput. More recently, bead-based emulsion methods have been developed for higher-throughput single-cell sequencing, allowing access to repertoires of tens of thousands of cells [53]. The formation of droplets in an oil-water emulsion using microfluidics allows single-cell encapsulation, barcoding, and the production of cDNA from each cell and culminates in parallel sequencing of the transcriptomes of thousands of cells [54]. These approaches have been adapted to sequence both chains of a TR or IG in parallel [55] and are available commercially via the 10x Genomics platform (Chromium 10x), thereby allowing the processing of samples of 5 × 10<sup>2</sup> to 1.5 × 10<sup>4</sup> cells. In addition to paired immune receptor data, it is also possible to obtain scRNA-seq data. Similar approaches are also commercially available, including the BD Rhapsody VDJ CDR3 protocol, which relies on cell compartmentation by microwells and allows processing of 1 × 10<sup>3</sup> to 4 × 10<sup>4</sup> cells, and the Takara Bio ICELL8 Single-Cell System, which can process ~1 × 10<sup>3</sup> cells. Recent progress on the throughput of single-cell sorting has been described with CelliGO, which combines cell encapsulation in droplets through microfluidics [56], but sequencing costs are still limiting the widespread adoption of these approaches.

2.7 Cost

Finally, cost may influence the choice of a particular protocol. There are many factors that contribute to the cost of AIRR-seq data generation. For example, the number of samples, the cost of sequencing, the sequencing depth, and the number of cells analyzed per sample are all important considerations. Furthermore, the choice between service providers, commercial kits, and "homebrew" methods will influence costs. In general, gDNA analysis is the most cost-effective method, because it usually requires the lowest sequencing depth with the largest representation of cells per sample, whereas single-cell analysis is on the opposite end of the spectrum, with bulk cDNA sequencing in the middle [45].

2.8 Overview of Companion AIRR Community Method Chapters

The correct choice of method for a given experimental question is crucial and has to be carefully evaluated. The companion AIRR Community method chapters concern (1) "Bulk gDNA Sequencing of Antibody Heavy-Chain Gene Rearrangements for Detection and Analysis of B-Cell Clone Distribution" (Chapter 18), (2) "Bulk Sequencing from mRNA with UMI for Evaluation of B-Cell Isotype and Clonal Evolution" (Chapter 19), (3) "Single-Cell Analysis and Tracking of Antigen-Specific T Cells: Integrating Paired-Chain AIRR-Seq and Transcriptome Sequencing" (Chapter 20), and (4) "Quality Control: Chain Pairing Precision and Monitoring of Cross-Sample Contamination" (Chapter 21). These chapters illustrate four basic workflows for AIRR-seq, with a focus on IG for bulk sequencing, TR for single-cell sequencing, and IG and TR replicate analyses for quality control. The four methods are summarized in Table 1 and are discussed further below.

### Table 1

### Overview of highlighted use cases in associated chapters

In Chapter 18, we illustrate, using a homebrew method with primer sequences adapted for NGS from the BIOMED2 immunoglobulin heavy-chain (IGH) PCR assays [57], how to evaluate the clonal landscape, including clone size distributions, clonal lineage analysis, and tracking of clones in different samples from the same individual. This method uses multiplex PCR and can be scaled to very high cell inputs as described [15]. The method shown uses long reads that are adequate for robust IGHV gene alignment and SHM evaluation but can also be performed with shorter reads, depending upon the sample type and DNA quality.

In Chapter 19, IGH rearrangements are amplified from bulk RNA with UMIs incorporated at the cDNA synthesis step for the generation of high-fidelity consensus sequences using a commercial kit from Takara Bio. This method can be used for low to moderate throughput analysis of antigen-enriched cell populations, for evaluation of SHM, selection, and isotype usage.

In Chapter 20, two different but parallel workflows are used to analyze single cells, both for paired TR transcripts as well as for their transcriptome, using two commercial kits, one from Takara Bio and one from 10x Genomics. Single-cell technologies can use multiplex or RACE-based amplification and can generate long high-quality reads that can be mapped to individual cells but can also be based on AIRR target enrichment. One kit allows for the analysis of small numbers of antigen-enriched, index-sorted cells, which is useful when the cells of interest are present at very low frequencies in the overall sample, while the other kit allows for the analysis of larger cell numbers, providing insights into the overall T-cell repertoire as well as into other immune cell populations, if desired. The combination of paired-chain information and RNA-seq data can provide insights into the nature of the different T-cell populations that are found among expanded clones in various disease settings. Furthermore, through clonal overlap analysis, the data from the antigen-enriched cells can be integrated with the larger data set to further characterize the populations with respect to antigen binding.

In Chapter 21, two workflows are presented. The first is for the isolation of CD27+ memory B cells and their expansion in replicate cultures in vitro, using a cell line that expresses CD40L and a cocktail of cytokines. The second workflow is for the isolation of CD8+ T cells and their expansion using CD3/CD28 and IL-2 stimulation. The generation of these expanded cell cultures provides a larger input of more readily resampled cells that can be used as reference libraries for IG- or TR-paired chain combinations, respectively, as well as providing diverse libraries for the evaluation of within-sample reproducibility.

### 3 Interpreting the Results


samples that are put through the same workflow can be used to compare the entire AIRR-seq procedure in one assay run to another run, to help identify and control for batch effects. Bead purification and/or further gel purification can be performed to remove primer dimers, which can swamp sequencing runs and reduce the fraction of informative reads. Capillary electropherograms (e.g., Bioanalyzer) can be used to evaluate library quality, while KAPA quantitation and real-time PCR can be performed to quantify the library. For the sequencing run, the clustering density is important (as described in the individual protocol chapters). Another helpful metric is the fraction of reads that have quality scores of 30 or higher (projected sequencing error rates of 1 per 1000 nucleotides or lower).
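The relationship between a quality score of 30 and an error rate of 1 per 1000 follows from the Phred definition, Q = −10 log10(p). The sketch below shows the conversion and a simple per-base Q30 fraction. It assumes the common Phred+33 ASCII encoding of FASTQ quality strings; the function names are hypothetical, and the metric here is computed per base, whereas sequencing instruments may report it in other ways.

```python
def phred_to_error(q):
    """Convert a Phred quality score to the projected error probability."""
    return 10 ** (-q / 10)


def fraction_q30(quality_strings, threshold=30, offset=33):
    """Fraction of bases with quality >= threshold.

    quality_strings: iterable of FASTQ quality lines (Phred+33 assumed).
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for char in qual:
            total += 1
            if ord(char) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0
```

For example, in Phred+33 the character `?` encodes Q30 and `I` encodes Q40, so both count toward the Q30 fraction, whereas `5` (Q20) does not.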

3.3 Clonal Recovery

The quality and type of sample have significant effects on the efficiency of amplification and clonal yield. FFPE tissue samples yield ~10-fold fewer clones than the same tissue snap frozen without fixation. Furthermore, the longer a tissue sits in FFPE, the poorer the sample quality becomes. For FFPE samples, using larger amounts of input DNA or RNA into the initial amplification can improve clonal recovery, as can the use of primers that target shorter amplicons (e.g., primers that flank the CDR3 sequence such as FR3 and JH [58]). Another reason for low numbers of clones is that the initial amplification may use primers that do not capture a high enough fraction of the rearrangements in the sample. With RNA as the starting material, there is bias toward recovering more templates from cells that are activated. Plasma cells, for example, can produce ~100 times as much IG RNA as naive B cells [59]. Primers that amplify DNA are not subject to this problem, but can have other issues, such as the potential for nonuniform amplification of different templates. To correct for PCR bias, some assays use internal calibrators [60, 61]. Amplification of IG rearrangements has an additional challenge if these are highly somatically hypermutated. One hint that this may be occurring is if there is an elevated frequency of nonproductive rearrangements (from a bulk gDNA amplification). Alternative approaches in this situation are to amplify templates that are less prone to SHM such as the leader region in the VH genes or focus on RNA-based sequencing with primers that extend from the constant region [15]. Another approach is to amplify alternative loci (such as light chains, which have about half the level of SHM of heavy chains [62], RS (recombining sequence also known as kappa deleting element) rearrangements [63], or DJ rearrangements [58]).

3.4 PCR Cycle Number

For RNA-based protocols, the gene expression of each IG/TR chain can vary significantly from one cell to another. Therefore, it is challenging to predict how many cycles of PCR will amplify sufficient material for downstream sequencing without overamplification, which produces significant off-target PCR products. One approach is to focus on sorted cell populations to control for the effects of different transcript levels. In addition, one can amplify each chain of interest (e.g., IgH, IgK, IgL, etc.) separately, with different library index combinations for each chain. This can allow for separate optimization of cycling conditions for each chain, as discussed in Chapter 19. It is also possible that the suggested number of cycles will not generate enough material for downstream sequencing. If there is insufficient material for sequencing, we recommend increasing the number of cycles. Conversely, if the library yield is too high, the number of cycles in the library PCR amplification (e.g., PCR2 in Chapter 19) can be decreased.

3.5 Sensitivity

The sensitivity of an AIRR-seq experiment can be determined by titrating spike-ins, such as mixing cells with a known gene rearrangement into a diverse sample at different ratios, as described by Barennes and colleagues [45]. The linearity of the titration also reveals the range of clone concentrations where the method is quantitative or semiquantitative. The threshold of detection of the assay depends upon the biological question being asked, but if rare clonotypes need to be detected (as is the case for detection of minimal residual disease), then it is important to power the analysis on clone sizes. This can be accomplished experimentally by running multiple biological replicates (independent PCR amplifications) on the same sample and determining the fraction of rearrangements that can be repeatedly sampled in two, three, four, or more replicates, as described previously [15, 64]. Using within-sample clonal overlap as a maximal estimate, one can then evaluate (with greater rigor) the expected overlap between one sample and a different sample [15]. If sensitivity falls below the level required, there are several potential reasons for this including poor-quality sample, too few cells (of the relevant type) in the sample, too small a sample, or a clone size that is too small to be detected. The depth of sequencing can also influence the detection of clones, particularly if one uses rigorous cutoffs for clone size or requires a minimum number of UMIs per clone.
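The replicate-based estimate described above can be sketched as follows. This is an illustrative calculation only, not the analysis used in refs. 15 and 64: the function name `replicate_recurrence` is hypothetical, and each replicate is assumed to be represented as a set of clone identifiers (for example, V gene, J gene, and CDR3 combined into one key).

```python
from collections import Counter


def replicate_recurrence(replicates):
    """Fraction of clones detected in at least k replicates, for each k.

    replicates: list of iterables of clone identifiers, one per replicate.
    Returns {k: fraction}, where k runs from 1 to the number of replicates.
    """
    seen = Counter()
    for replicate in replicates:
        seen.update(set(replicate))  # count each clone once per replicate
    total = len(seen)
    if total == 0:
        return {}
    return {
        k: sum(1 for n in seen.values() if n >= k) / total
        for k in range(1, len(replicates) + 1)
    }
```

With three replicates containing clones {a, b, c}, {a, b}, and {a, d}, half of the four observed clones are resampled in at least two replicates and one quarter in all three. Clones that recur across replicates are the ones for which overlap with a different sample can be evaluated with some confidence.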


(which is nearly impossible to achieve by chance, particularly for IG sequences [65]). Spurious clonal overlap between different individuals can arise through mixing of samples prior to nucleic acid amplification, by erroneous assignment of sample barcodes, by PCR contamination, by cross-clustering of samples in the same flow cell, or some combination of these difficulties. Sample mixing can occur during flow cytometry if the instrument is not rigorously flushed between samples. Samples that are assigned the wrong barcode will associate with the "wrong" individual. If samples come from different species, processing with the wrong pipeline (including the wrong database for reference germline genes) will result in sequences that have very low levels of sequence homology to the (incorrect) germline genes. If this occurs, an IgBLAST [66] search with a few sequences will quickly resolve to which species the genes correspond. With PCR contamination, one may see spurious amplification in the negative control samples (such as water or fibroblast DNA). PCR contamination can also often result in high-copy sequences that are shared by multiple subjects in the same experiment. In contrast, with cross-clustering, there is often a very high-copy sequence in one sample and then a low number of copies of that same sequence in an unrelated individual. There are several process controls that can reduce the risk of contamination. First, there should be physically separate areas for pre- and post-PCR workstations. Second, primers with different barcodes can be used for diagnostic samples (where high-copy clones might be present) vs. MRD samples. Unique dual indices can be used to control for sequencing barcode crosstalk [67]. Third, when in doubt and if more samples are available, repeat the experiment to confirm the results.
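As a rough screening aid, the two sharing patterns described above (high-copy sequences in several subjects vs. one high-copy sequence plus a trace in another subject) can be flagged programmatically. The sketch below is only a heuristic illustration: the function name, the labels, and the `high` and `low` copy-number thresholds are arbitrary assumptions that would need to be tuned to the assay, and a flag is a prompt for follow-up, not proof of a particular cause.

```python
from collections import defaultdict


def flag_shared_clonotypes(copies, high=1000, low=10):
    """Label clonotypes that occur in more than one subject.

    copies: {subject: {clonotype: copy_number}}.
    high, low: illustrative copy-number cutoffs (assumptions, not standards).
    """
    by_clone = defaultdict(dict)
    for subject, clones in copies.items():
        for clone, n in clones.items():
            by_clone[clone][subject] = n

    flags = {}
    for clone, per_subject in by_clone.items():
        if len(per_subject) < 2:
            continue  # not shared between subjects
        counts = sorted(per_subject.values(), reverse=True)
        if counts[1] >= high:
            # High copy in at least two subjects: pattern noted for PCR contamination.
            flags[clone] = "multi_high"
        elif counts[0] >= high and counts[1] <= low:
            # One high-copy source plus a trace elsewhere: pattern noted for cross-clustering.
            flags[clone] = "high_plus_trace"
        else:
            flags[clone] = "shared_other"
    return flags
```

Any flagged clonotype should be checked against the negative controls and, if material allows, confirmed by repeating the experiment.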

3.8 Spurious Amplification Products Sometimes one obtains unexpected sequences due to technical artifacts. Large clonal expansions can appear with PCR jackpots. In the case of gDNA, independent PCR amplifications of the same sample are sampling different gene rearrangements. If the same expanded clone is present in both biological replicates, it is far more likely to be due to a bona fide expansion than to a PCR jackpot. Another artifact is a hybrid PCR product. With hybrid PCR products, templates with partial sequence homology can cross-amplify [68]. Hybrid products will tend to share sequences at either the 5′ or 3′ end and then exhibit a sharp boundary where the templates crossed over into the other sequence. One way to distinguish hybrid products from gene conversion events or biological variants in V gene sequences or potential convergence (with sharing of CDR3 sequences) is to amplify sequences with TRBV or IGHV gene-specific primers and see if the same products can be recreated. In addition, using protocols with fewer PCR cycles can sometimes be helpful in reducing spurious amplification products.

### 3.9 Data Reporting The AIRR Community has published a series of data and experimental metadata sharing standards called MiAIRR [33]. The MiAIRR data standards guide the publication, curation, and sharing of AIRR-seq data and metadata and consist of six high-level data sets for study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences. All current data fields in the MiAIRR standard can be accessed here: https://docs.airr-community.org/en/stable/miairr/data\_elements.html.

More details on how to annotate and report AIRR-seq data and metadata are provided in the AIRR Community companion method chapter "Data sharing and re-use" (Chapter 23).

### 4 Conclusion

In this chapter, we have given an overview of the considerations needed to plan and execute a successful AIRR-seq experiment. We have also broadly discussed basic strategies for controlling and evaluating the adequacy of the experiment. Each topic touched upon in this chapter is explored in depth in the corresponding AIRR Community companion chapters.

### Acknowledgments

This work is supported by NIH research grants awarded to E.L.P. (AI144288, AI106697, P30-AI0450080, P30-CA016520). U.S. was supported by grants from Mercator Stiftung, Germany; German Research Foundation, Germany (DFG, grant 397650460); BMBF e:KID, Germany (01ZX1612A); and BMBF NoChro, Germany (FKZ 13GW0338B). A.E. and G.K. are supported by grants from the Deutsche Forschungsgemeinschaft (BO 3429/3-1 and BO 3429/4-1) and the BMBF (RESET-AID). E.M.F. contributions were funded by iMAP (ANR-16-RHUS-0001), Transimmunom LabEX (ANR-11-IDEX-0004-02), TriPoD ERC Research Advanced Grant (Fp7-IdEAS-ErC-322856), AIR-MI (ANR-18-ECVD-0001), iReceptorPlus (H2020 Research and Innovation Programme 825821), and SirocCo (ANR-21-CO12-0005-01) grants. J.T. was supported by the Swiss National Science Foundation (Ambizione-SCORE: PZ00P3\_161147 and PZ00P3\_183777).

The authors thank Andrew Farmer for constructive criticism of the manuscript.

E.L.P. is the former Chair of the Adaptive Immune Receptor Repertoire Community, receives research funding from Roche Diagnostics and Janssen Pharmaceuticals for projects unrelated to the methods presented in this chapter, and is consulting or an advisor for Roche Diagnostics, Enpicom, the Antibody Society, IEDB, and the American Autoimmune Related Diseases Association. J.T. is consulting or an advisor for Enpicom and Merck, Sharp & Dohme (MSD).

### References


https://doi.org/10.1111/j.1365-2567.2011.03527.x


Transplant 13(11):2842–2854. https://doi.org/10.1111/ajt.12431


receptor repertoires after adoptive transfer of expanded allogeneic regulatory T cells. Clin Exp Immunol 187(2):316–324. https://doi.org/10.1111/cei.12887


T-cell repertoire following autologous hematopoietic stem cell transplantation for treatment of type 1 diabetes using high-throughput sequencing. Pediatr Diabetes 19(7):1229–1237. https://doi.org/10.1111/pedi.12728


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Adaptive Immune Receptor Repertoire (AIRR) Community Guide to TR and IG Gene Annotation

### Lmar Babrak, Susanna Marquez, Christian E. Busse, William D. Lees, Enkelejda Miho, Mats Ohlin, Aaron M. Rosenfeld, Ulrik Stervbo, Corey T. Watson, and Chaim A. Schramm, on behalf of the AIRR Community

### Abstract

High-throughput sequencing of adaptive immune receptor repertoires (AIRR, i.e., IG and TR) has revolutionized the ability to carry out large-scale experiments to study the adaptive immune response. Since the method was first introduced in 2009, AIRR sequencing (AIRR-Seq) has been applied to survey the immune state of individuals, identify antigen-specific or immune-state-associated signatures of immune responses, study the development of the antibody immune response, and guide the development of vaccines and antibody therapies. Recent advancements in the technology include sequencing at the single-cell level and in parallel with gene expression, which allows the introduction of multi-omics approaches to understand in detail the adaptive immune response. Analyzing AIRR-seq data can prove challenging even with high-quality sequencing, in part due to the many steps involved and the need to parameterize each step. In this chapter, we outline key factors to consider when preprocessing raw AIRR-Seq data and annotating the genetic origins of the rearranged receptors. We also highlight a number of common difficulties in AIRR-seq data processing and provide strategies to address them.

Key words AIRR-Seq, B-cell receptor, Germline database, Gene annotation, Preprocessing, Single-cell sequencing, T-cell receptor

### 1 Introduction

Once an Adaptive Immune Receptor Repertoire sequencing (AIRR-seq, please see the AIRR Community glossary at doi: https://doi.org/10.5281/zenodo.5095381 for definitions of key terms) experiment has been successfully designed and carried out (see the discussion in Chap. 15), attention turns to analyzing the data collected to produce biological insights. Many of the same

Lmar Babrak and Susanna Marquez are shared first authors.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_16, © The Author(s) 2022

Fig. 1 AIRR-seq decision points. The different ways an AIRR-seq experiment can be constructed. Each choice has implications both for the experimental methodology and for the design of an appropriate analysis strategy

factors that influenced choices in experimental design will be important in planning the computational approach as well. AIRR-seq data to be analyzed may have been generated from genomic DNA or mRNA, with or without unique molecular identifiers (UMIs), and in bulk or single-cell context, as described in Chap. 15. Each of these alternatives may require (or preclude) the use of certain software tools and influence the interpretation of the analysis. In addition, thought must be given to what computational and storage resources will be necessary given the size of the dataset and the intended analysis.

A clear first decision point in AIRR-seq data analysis is whether IG or TR repertoires are being analyzed (Fig. 1). While many tools such as MiXCR [1], IMGT [2], and others (Table 1) can handle both types of data, some are specific to one or the other. In addition, interest in specialized inquiries like phylogenetic analysis of IGs or calculation of clonal dynamics may require additional specific tools. In such a case, it may be useful to work within a particular ecosystem like Immcantation (http://immcantation.org), VDJServer [18], or SONAR [12], which provide several tools for a thorough analysis from quality control to clonal analysis, to facilitate smooth workflows.

The most critical set of considerations revolve around the origins of the molecules that were actually loaded into the sequencer (see Chap. 15). They may have been initially amplified from genomic DNA or from mRNA; the former results in exactly





one initial copy of each productive V(D)J rearrangement in a cell, while the latter starts with several or many copies and may vary with cell type and activation state. When amplifying mRNA, the initial molecules may also be labeled with UMIs, which enable the correction of errors introduced by PCR and/or sequencing by identifying reads that are derived from the same original molecule. Of note, while the usage of UMIs enables experimental error correction, their usage necessitates a considerably larger sequencing depth due to consensus read building (for a more nuanced discussion, see, e.g., [20, 21]). UMIs may also be used when sequencing DNA, but that is currently less common in practice. UMIs can also be used to improve quantification, by collapsing apparent expansions due to differential amplification. Some specialized UMI protocols may also require particular matched software tools to fully utilize the advantages of those schemes [22]. Without UMIs, it is advisable to cluster highly similar reads to avoid overcounting, particularly for IG sequences, where errors and somatic hypermutation (SHM) are often indistinguishable.

It is also important to think about how molecules from the full repertoire get included into the pool to be amplified for sequencing. For mRNA-derived libraries, in particular, the efficiency of cDNA generation can be a significant bottleneck and may vary depending on the enzymes and protocol used in the reverse transcription (RT) reaction [23, 24]. The efficiency of the RT reaction can lead to a bias toward abundant species in the repertoire and concomitant dropout of rare ones. In addition, because of the diversity of V and J genes and their surrounding genetic context, many protocols use pools of primers to capture the full repertoire [25]. However, these primers may have different efficiencies in amplifying their respective targets, and some genes might be targeted by more than one primer in a pool. Other protocols circumvent this problem by adding 5′ anchors during reverse transcription [26]. In addition, IGs with high SHM can lose their ability to bind to an intended primer, resulting in the depletion of these sequences from the measured repertoire.

Recently, several high-throughput technologies have become popular for conducting AIRR-seq at single-cell resolution. These provide the most accurate, direct measurements of repertoire statistics and allow more biologically accurate definitions of clones. To do so, however, requires analysis tools that are capable of keeping heavy/light, alpha/beta, or gamma/delta chain sequences linked. The AIRR Community [27] (https://www.antibodysociety.org/the-airr-community/) is developing standardized representations for "receptors" and "cells" to facilitate these analyses and ensure data portability. In addition, single-cell IG and TR data can be easily linked to transcriptomic and other measurements for more comprehensive analyses.

The sequencing technology used must also be taken into account. Illumina paired-end sequencing requires an additional preprocessing step to reassemble the amplicon, and this may result in a bias against longer sequences, with less overlap between the two reads. Meanwhile, more error-prone long-read technologies require extra attention to quality control.

This chapter aims to guide bioinformaticians through the first steps in repertoire analysis, specifically the considerations and preparation of raw data for subsequent repertoire analysis (see Chap. 17). Firstly, this chapter provides in-depth information on the materials necessary to conduct the analysis, including computational resources for data preparation, available software tools, and germline database information (Fig. 2). The main portion of the chapter then discusses the considerations on data preprocessing and annotation of raw sequences with a reference germline database.

Fig. 2 Process overview. Conceptual steps in designing an AIRR-seq analysis, proceeding from raw inputs to annotated sequences for downstream analysis

### 2 Materials

### 2.1 Computing Resources

AIRR-seq data are usually large and require specialized analysis methods and software tools. A typical Illumina MiSeq sequencing run generates 20–30 million 2 × 300 bp paired-end sequence reads, which roughly corresponds to 15 GB of sequence data to be processed. Other platforms like NextSeq, which is useful in projects where the full V gene is not needed, create about 400 million 2 × 150 bp paired-end reads. Because of the size of the datasets, the analysis can be computationally expensive, particularly the early analysis steps like preprocessing and gene annotation that process the majority of the sequence data. A standard desktop PC may take 3–5 days of constant processing for a single MiSeq run, so dedicated high-performance computational resources may be required. The institution may provide a cluster with high-performance computers for running analysis jobs. Commercial services like Amazon Web Services or Google Cloud can provide access to compute resources. However, this may come at added cost and could carry privacy concerns. Alternatively, there are free computing resources available. For AIRR-seq data, VDJServer provides free access to high-performance computing at the Texas Advanced Computing Center (TACC) through a graphical user interface [18]. VDJServer has also parallelized execution for tools such as IgBLAST, so more compute resources are utilized as the size of the input data grows. Analysis that takes days on a desktop PC might take only a few hours on VDJServer. An example workflow is provided in the AIRR Community Chap. 22 with instructions about using VDJServer for immune repertoire analysis.

2.2 Software Tools Many tools are available for the first steps in AIRR-seq analysis [28–31]. Table 1 highlights several of the more commonly used programs. These are noted particularly because they support standardized AIRR data representations and are mostly free and open source, two key criteria among the AIRR software guidelines (https://docs.airr-community.org/en/stable/swtools/airr\_swtools\_standard.html). When deciding which software tools to use to analyze data, besides computational requirements and expertise of the user, we recommend taking into consideration whether these tools use the AIRR Community standards and are AIRR-compliant. Tools that use the standard can easily be incorporated into complex workflows with other tools that share the same data format. Selecting AIRR-compliant software adds an additional layer of transparency to the analysis, because the source code (1) is available for inspection on a publicly available repository, (2) uses a versioning system, (3) has been tested, and (4) is available as a container (Docker, Singularity), among other quality requirements. The use of AIRR standards and of AIRR-compliant software supports the transparency, reproducibility, and rigor of research results.

2.3 Germline Databases IG and TR germline databases are a requirement for accurate AIRR-seq analyses, regardless of the technique used (e.g., single cell vs. bulk). These databases guide the assignment of sequences to known and novel IG and TR genes/alleles, facilitating downstream sequence annotation and the accurate assessment of various repertoire features (e.g., gene/allele usage, SHM, clonal assignment, etc.; see AIRR Community Chaps. 18–20 for more detail). A germline database should ideally contain the most comprehensive and accurate set of possible IG/TR V, D, and J genes and alleles that best represent the genomic content of an organism. There are various sources of reference germline databases available, and occasionally a tool is limited by which database can be used for a particular analysis. Thus, the use of a particular database, or a combination of databases, may vary depending on the experimental objectives, as well as the particular species in which the AIRR-seq data has been generated. We therefore recommend investing effort in obtaining as accurate a database as possible. Table 2 describes currently available databases, focusing on those that are in active development.



IMGT [2] provides the most commonly used reference genome databases, but even for species of substantial research interest, these do not represent species diversity and can contain sequences reported in error [35, 36]. For TR genes and for IG genes from nonhuman species, however, few or no satisfactory alternatives exist. Ongoing initiatives seek to remedy this by continuously improving germline databases across species. Several programs are available to infer personalized databases from AIRR-seq data for each experimental subject (Table 1). VDJbase (https://www.vdjbase.org) is a resource that brings together AIRR-seq and genomic information to study population diversity and identify previously unreported alleles [34]. In 2019, the AIRR Community established the IARC (Inferred Allele Review Committee) to evaluate, document, and name human IGH alleles inferred from AIRR-seq data [37], and it is anticipated that this approach will be extended to other species and loci over time. The IARC's work is supported and published by OGRDB (the Open Germline Receptor Database, https://ogrdb.airr-community.org), which provides full information regarding alleles, metadata on the repertoires from which they originated, and ref. 32.

### 3 Methods

Preprocessing and gene annotation of AIRR-seq data takes as input the sequencing files and returns a set of high-quality sequences for which V, D, and J allele calls can be made and structural elements can be identified. After further quality control filtering steps, a final set of sequences is selected and can be used to carry out more in-depth analyses (see Chap. 17). All steps should be carefully documented to maintain data provenance and allow the analysis to be reproduced; the AIRR Community has defined a set of MiAIRR data processing fields to standardize the representation of analysis steps [38]. Below, we outline the concepts involved in each phase of analysis and then supply detailed protocols, applying them to common use cases. We also provide further information on reporting and sharing AIRR-seq data.

3.1 Preprocessing While there are several experimental technologies available for AIRR-seq studies from different experimental setups, most approaches typically produce the same raw data file format (.fastq) and share the ultimate goal of obtaining a final set of reads of high quality, particularly in the complementarity-determining region 3 (CDR3), representative of each B or T cell in the repertoire. The general steps that need to be performed include (1) filtering reads (e.g., removing PhiX spike-ins, short reads, and reads with a low Phred score or excessive ambiguous base calls), (2) identifying and removing primers and sequencing barcodes (if present), (3) building consensus sequences (using UMI or cell barcodes, if present), (4) merging mate pairs (if using a paired-end protocol), (5) masking low-quality positions, (6) annotating with constant (C) region (if present), and (7) collapsing duplicate sequences. For some of these steps, considerations and adjustments need to be made depending on whether the data are from genomic DNA or RNA, B cells or T cells, bulk or single cell, and paired or unpaired chains, and on whether UMIs have been used (Fig. 1).
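The read-filtering step (1) can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by any particular tool: it assumes Phred+33 quality encoding, and the thresholds (minimum length, minimum mean quality, maximum fraction of ambiguous bases) are hypothetical defaults that should be tuned to the library at hand.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(seq, quality, min_length=100, min_mean_q=20, max_n_fraction=0.1):
    """Return True if a read passes length, ambiguity, and mean-quality checks."""
    if not seq or len(seq) != len(quality) or len(seq) < min_length:
        return False
    if seq.upper().count("N") / len(seq) > max_n_fraction:
        return False
    return mean_phred(quality) >= min_mean_q


# Hypothetical reads as (sequence, quality) pairs.
good = ("ACGT" * 30, "I" * 120)                   # 120 nt, Q40 throughout
low_quality = ("ACGT" * 30, "#" * 120)            # 120 nt, Q2 throughout
ambiguous = ("N" * 20 + "ACGT" * 25, "I" * 120)   # 20 of 120 bases are N
too_short = ("ACGT" * 5, "I" * 20)                # only 20 nt

reads = (good, low_quality, ambiguous, too_short)
kept = [read for read in reads if passes_filter(*read)]
print(len(kept), "of", len(reads), "reads kept")
```

Dedicated preprocessing tools apply the same kind of checks at scale and record the parameters used, which is preferable for real datasets.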

In the following, we describe the important considerations to be made when preprocessing AIRR-seq samples.

3.1.1 Filtering by Sequence or by Clone Current NGS methods introduce occasional base-call errors which may not be detectable from the associated quality scores. A common approach to avoid incorporating these sequences in downstream analyses is to threshold data based on the frequency of reads. This does not eliminate such errors but can reduce their influence on gross metrics of the underlying immune repertoire. To remove spurious sequences, a common approach taken, e.g., by MiXCR [1] and SONAR [12], is to collapse identical or near-identical sequences and drop those with fewer than a specified number of reads (usually two or three). This approach is preferred where individual sequences may be of low quality, for instance, if sequencing depth is low. However, this approach to filtering can result in nonuniform loss of data when libraries of different sequencing depths are compared. Alternatively, instead of a preprocessing step, all sequences passing quality control checks can be grouped into clones using the regular workflows described in the AIRR Community method Chaps. 18 and 19, and then clones that include fewer than the specified number of unique sequences are removed prior to downstream analysis. This may be appropriate for high-quality sequences, such as with UMIs and sufficient sequencing depth for robust error correction. Without this correction, errors in the CDR3 can lead to the inference of spurious clones.
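The sequence-level filter described above (collapse identical sequences, then drop those supported by too few reads) can be sketched as follows. This is a simplified illustration that collapses only exactly identical sequences; the function name is hypothetical, and it does not reproduce how MiXCR or SONAR handle near-identical sequences.

```python
from collections import Counter


def collapse_and_threshold(reads, min_count=2):
    """Collapse identical sequences and keep those seen at least min_count times."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_count}


# Hypothetical reads: "AAC" is a singleton, possibly a sequencing error of "AAA".
reads = ["AAA", "AAA", "AAC", "GGG", "GGG", "GGG"]
kept = collapse_and_threshold(reads, min_count=2)
print(kept)
```

Because the threshold is an absolute read count, a library sequenced at half the depth would lose more genuine low-abundance sequences with the same `min_count`, which is the nonuniform loss noted above.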

3.1.2 Read Length-Related Effects Long paired-end reads provide useful information for reliable V gene assignment as well as more comprehensive mapping of SHM in the case of IG gene rearrangements [39]. As read length increases, the quality of base calls degrades as sequences are generated, but paired-end sequencing allows for computational alignment of the overlapping regions. After alignment, sequencing errors at the ends of the sequences can be reduced as the higher-quality base call for each position that overlaps can be used. However, for longer sequences such as with RNA libraries capturing the constant region, the read length on the sequencer may need to be increased, reducing the overlapping portion of the 5′ and 3′ reads, resulting in a bias against sequences encoding longer CDR3. Further complicating this issue, a common procedure is to trim the ends of reads of low-quality stretches of base calls, such as with generic tools like fastx-toolkit or pRESTO's FilterSeq trimqual [4]. This can in turn reduce the number of full-length high-quality sequences. On the other hand, with RNA-based sequencing, UMIs can be incorporated at the cDNA synthesis step, and, when coupled with very deep sequencing, these can be used for error correction through the construction of consensus sequences that share the same UMI. There is, however, a trade-off between the sequencing depth required for adequate coverage of UMIs and the number of independent sequences that can be sampled.
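The use of the higher-quality base call within the overlap can be illustrated with a short Python sketch. It assumes that the two reads have already been aligned, so that the overlap length is known, and that the second read has been reverse-complemented into the orientation of the first; the function names are hypothetical, and real mate-pair assemblers also search for the overlap and score mismatches.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (uppercase A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_overlap(r1, q1, r2rc, q2rc, overlap):
    """Merge two oriented reads, keeping the higher-quality base at each overlapping position."""
    if overlap <= 0 or overlap > min(len(r1), len(r2rc)):
        raise ValueError("overlap must be between 1 and the shorter read length")
    mid_seq, mid_qual = [], []
    pairs = zip(r1[-overlap:], q1[-overlap:], r2rc[:overlap], q2rc[:overlap])
    for base1, qual1, base2, qual2 in pairs:
        # Quality characters share one offset, so comparing them compares Phred scores.
        if qual1 >= qual2:
            mid_seq.append(base1)
            mid_qual.append(qual1)
        else:
            mid_seq.append(base2)
            mid_qual.append(qual2)
    seq = r1[:-overlap] + "".join(mid_seq) + r2rc[overlap:]
    qual = q1[:-overlap] + "".join(mid_qual) + q2rc[overlap:]
    return seq, qual


# Hypothetical 16-nt amplicon; read 1 carries a low-quality error at its last base.
amplicon = "ACGTACGTGGCCTTAA"
r1, q1 = "ACGTACGTGA", "IIIIIIII##"
r2rc, q2rc = amplicon[6:], "##IIIIIIII"
merged_seq, merged_qual = merge_overlap(r1, q1, r2rc, q2rc, overlap=4)
print(merged_seq)
```

The shorter the overlap, the fewer positions benefit from this correction, which is one reason why long amplicons are harder to assemble reliably.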

Long reads covering the entire variable region can also be generated using alternative sequencing platforms, such as those offered by Pacific Biosciences and Ion Torrent [31, 40–43]. These offer the additional advantage of being able to capture large enough parts of the C-region to be able to distinguish between subtypes of IgG. However, lower throughput on these platforms limits the depth of sampling that can be achieved.

Short reads are sometimes used to generate large quantities of data on CDR3 sequences, as sequencing short reads can be done on higher-throughput sequencers at lower cost. This strategy is particularly common for TR rearrangement analysis on gDNA using commercial platforms such as Adaptive. Short reads may be required if the template is of low quality, as sometimes occurs in formalin-fixed paraffin-embedded samples. Short reads can sometimes compromise TRBV gene assignments but are particularly problematic for IGH gene rearrangements with SHM. Short IGHV gene sequences result in larger numbers of ambiguous V gene assignments which can cause erroneous clustering of unrelated sequences into clones.

gDNA vs. mRNA templates. When using genomic DNA as starting material, each cell contributes a fixed number of IG or TR templates, providing a parsimonious and cost-effective means of profiling large numbers of cells. gDNA-based sequencing will also capture far more nonproductive gene rearrangements than mRNA-based sequencing. With RNA, nonproductive rearrangements are subjected to nonsense-mediated degradation (although some nonproductive rearrangements can be recovered). gDNA is also more stable than RNA. On the other hand, RNA-based sequencing is more sensitive, with more templates per cell. With mRNA-based sequencing, cells contribute different numbers of templates, based upon cell subset-specific differences in transcript abundance. With mRNA-based libraries, cells can be grouped into subsets using immunophenotyping or single-cell RNA-seq to control for these differences. In the case of IG data where primers can be designed to capture the C-regions, each read can be annotated with its isotype using, for example, pRESTO's MaskPrimers routine. Further, unlike gDNA, it is straightforward to incorporate unique molecular identifiers (UMIs) at the RNA to cDNA synthesis step. Each UMI, which should be unique to an individual original cDNA template, can be processed with pRESTO's BuildConsensus to generate consensus sequences which can nearly eliminate sequencing error given sufficient sequencing depth [44, 45]. MiXCR, SONAR, and other packages also offer similar tools. The necessary depth might be difficult to achieve, though, for instance, in cases of vastly different expression levels or with samples of large size.
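The idea behind UMI-based error correction can be shown with a minimal majority-vote consensus in Python. This is a sketch of the general principle only, not the algorithm implemented in pRESTO's BuildConsensus or in any other package: it assumes that reads sharing a UMI are already aligned and of equal length, ignores base qualities, and uses a hypothetical minimum of two reads per UMI.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_reads=2):
    """Build a per-UMI majority-vote consensus from (umi, sequence) pairs."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    consensus = {}
    for umi, seqs in groups.items():
        if len(seqs) < min_reads:
            continue  # too few reads to correct errors for this molecule
        # Keep only reads of the most common length so columns line up.
        length = Counter(len(s) for s in seqs).most_common(1)[0][0]
        seqs = [s for s in seqs if len(s) == length]
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


# Hypothetical reads: the third read of UMI "AAT" carries one sequencing error.
reads = [
    ("AAT", "ACGTAC"), ("AAT", "ACGTAC"), ("AAT", "ACGAAC"),
    ("GGC", "TTGCAA"), ("GGC", "TTGCAA"),
    ("CCA", "ACGTAC"),
]
consensus = umi_consensus(reads)
print(consensus)
```

The singleton UMI "CCA" is discarded, which illustrates why UMI protocols need deeper sequencing: every original molecule must be read several times before it contributes to the final dataset.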

3.1.3 Productive Vs. Nonproductive Rearrangements For each sample, the fraction of productive rearrangements can be an informative metric. On average, it can be expected that approximately 80% of TRB rearrangements and approximately 85% of IGH rearrangements sequenced from mature T or B cells will be productive [46]. Lower frequencies of productive rearrangements can be observed in immature lymphocytes, where selection has not yet been imposed on cells without productive rearrangements [47]. Lower frequencies of productive rearrangements can also be seen in sequencing libraries that are of poor quality. Nonproductive sequences also can be used as a baseline estimator of gene usage frequency in rearrangement [48, 49] and compared to productive sequences to investigate the effects of tolerance checkpoints on the AIRR [50, 51]. With such comparisons, it may be useful to remove clonal lineages that contain both productive and nonproductive versions of the same rearrangement, as sequencing errors can cause a sequence to appear nonproductive. Nonproductive rearrangements are sometimes also useful for identifying clonal expansions in tumors, particularly if tumors harbor SHM that may interfere with primer binding (the nonproductive rearrangements are usually un-mutated). Nonproductive rearrangements can be found in lymphocytes that have undergone multiple rounds of V(D)J recombination, as can occur with receptor editing; the presence of more than one rearrangement is particularly common with IG light chains [52, 53]. Finally, it is important to computationally filter nonproductive sequences for general analyses, if one is making claims about selected repertoires.
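The fraction of productive rearrangements can be computed directly from annotated sequences. The sketch below uses a deliberately simplified definition, assumed here for illustration: a junction nucleotide sequence is called productive if its length is a multiple of three and it contains no in-frame stop codon. Annotation tools assess productivity over the whole V(D)J reading frame, so their calls should be preferred when available.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(junction_nt):
    """Simplified check: in-frame junction without an in-frame stop codon."""
    seq = junction_nt.upper()
    if not seq or len(seq) % 3 != 0:
        return False
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


def productive_fraction(junctions):
    """Fraction of junctions passing the simplified productivity check."""
    junctions = list(junctions)
    if not junctions:
        return 0.0
    return sum(is_productive(j) for j in junctions) / len(junctions)


# Hypothetical junctions, written codon by codon for readability.
junctions = [
    "TGT GCC AGC AGC TTA GGG CAG GAG ACC CAG TAC TTC".replace(" ", ""),  # productive
    "TGT GCC AGC AGC TTA GGC AGG AGA CCC AGT ACT TC".replace(" ", ""),   # out of frame
    "TGT GCC AGC TGA GGG CAG TAC TTC".replace(" ", ""),                  # in-frame stop
    "TGT GCC AGC AGT GAC AGG GGC TAC GAG CAG TAC TTC".replace(" ", ""),  # productive
]
print(productive_fraction(junctions))
```

The last junction contains the letters "TGA" across a codon boundary but no in-frame stop codon, so it is correctly retained as productive.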

3.2 Gene Annotation After preprocessing AIRR sequences for good-quality and relevant reads, sequences need to be accurately aligned and annotated to an appropriate reference germline database. This process identifies the V, D, and J genes; CDRs; and framework regions (FWRs) for each sequence in the repertoire. There are numerous annotation tools for IG and TR sequences that are freely available to users, including popular programs such as IgBLAST [10] and IMGT/HighV-QUEST (Table 1) [8]. Depending on the tools, different tool-specific algorithms (e.g., Smith-Waterman) assign the best match among a set of genes in a user-defined reference germline database. Accurate alignment is very important for subsequent analyses such as the identification of SHM for IGs, clustering of clonal groups, and determination of IG/TR diversity. Alignment algorithms have been demonstrated to influence the outcome of V, D, and J gene assignments, even when identical input sequences, tool parameters, and reference germline databases are chosen [31]. Furthermore, differences in the length of alleles of genes in databases may force algorithms to output an incorrect best match in the gene annotation process. To complicate matters, some tools provide alignments to multiple (often highly similar) genes and leave it to users to choose which of the ambiguous calls is most appropriate.
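Ambiguous calls can be detected programmatically before deciding how to treat them. The following Python sketch reads a small tab-separated table with the AIRR Rearrangement fields `sequence_id` and `v_call`, and assumes that ties are reported as a comma-separated list of alleles, as some annotation tools do; the helper names and the example rows are hypothetical.

```python
import csv
import io


def split_calls(call):
    """Split a comma-separated call field into individual allele names."""
    return [c.strip() for c in call.split(",") if c.strip()]


def gene_of(allele):
    """Gene name of an allele, e.g. 'IGHV3-23*01' -> 'IGHV3-23'."""
    return allele.split("*")[0]


def summarize_v_call(call):
    """Describe whether a call is ambiguous at the allele or gene level."""
    alleles = split_calls(call)
    genes = sorted({gene_of(a) for a in alleles})
    return {
        "alleles": alleles,
        "genes": genes,
        "ambiguous_allele": len(alleles) > 1,
        "ambiguous_gene": len(genes) > 1,
    }


# Hypothetical annotation output in tab-separated form.
AIRR_TSV = (
    "sequence_id\tv_call\n"
    "read1\tIGHV3-23*01\n"
    "read2\tIGHV3-23*01,IGHV3-23*04\n"
    "read3\tIGHV1-69*01,IGHV1-69D*01\n"
)
rows = list(csv.DictReader(io.StringIO(AIRR_TSV), delimiter="\t"))
summary = {row["sequence_id"]: summarize_v_call(row["v_call"]) for row in rows}
for sequence_id, info in summary.items():
    print(sequence_id, info["genes"], info["ambiguous_allele"], info["ambiguous_gene"])
```

Whether calls that are ambiguous between duplicated genes (such as the third row) should be merged, kept, or excluded depends on the downstream analysis and should be documented.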

Schemes for IGs and TRs that number amino acid residues facilitate sequence comparisons, protein structure modelling, and engineering [54]. Although many schemes have been proposed and different schemes are employed by different tools, only five schemes are commonly used. Three are specifically for IGs: Kabat [55], Chothia [56], and enhanced Chothia [57]. Two more can be used for both IGs and TRs: IMGT [58] and AHo [59]. Conversion tables and tools like ANARCI [60] can be used to translate between schemes. CDR boundaries can differ substantially between different numbering schemes: care is needed when comparing results from different studies [54]. In repertoire studies, the IMGT numbering scheme is widely used and supported, and its use is recommended in the absence of other considerations.

One more barrier to direct comparison is the identification in some studies and tools of the "junction" and in others of the CDR3. In IMGT terminology, the junction includes the second conserved cysteine of the V gene and the conserved tryptophan or phenylalanine of the J gene, while the CDR3 omits these residues. The AIRR Community data representation standard uses "junction"; however, it is not universally accepted [31].
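Following the IMGT definitions given above, converting between the two representations amounts to trimming or retaining the flanking conserved residues. A minimal Python sketch, with hypothetical function names and an illustrative TRB junction, is:

```python
def junction_to_cdr3_nt(junction_nt):
    """Drop the first and last codon (conserved Cys and Trp/Phe) of a nucleotide junction."""
    if len(junction_nt) < 6:
        raise ValueError("junction is too short to contain both anchor codons")
    return junction_nt[3:-3]


def junction_to_cdr3_aa(junction_aa):
    """Drop the first and last residue (conserved Cys and Trp/Phe) of an amino acid junction."""
    if len(junction_aa) < 2:
        raise ValueError("junction is too short to contain both anchor residues")
    return junction_aa[1:-1]


# Illustrative junction: C ASSLGQETQY F
junction_aa = "CASSLGQETQYF"
junction_nt = "TGTGCCAGCAGCTTAGGGCAGGAGACCCAGTACTTC"
print(junction_to_cdr3_aa(junction_aa))
print(junction_to_cdr3_nt(junction_nt))
```

Applying such a conversion consistently, and stating which of the two definitions is reported, avoids off-by-two-residue discrepancies when length distributions or sequence overlaps are compared between studies.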

Accurate annotation requires an accurate and comprehensive germline database. As noted above, even the currently available human database does not as yet meet this criterion [15, 61], and databases for other species are often partial and based solely on the analysis of a single animal [36, 62–65]. Fortunately, scientific need has resulted in the determination of new germline gene sets [36, 40, 66, 67], but these are not necessarily implemented by public germline gene databases in a timely fashion. The impact of missing or incorrect information in the database will depend upon the nature of the analysis, but one overall point to note is that the databases are updated frequently, and changes in the database can impact results [31]. It is therefore important that an analysis is conducted using a single, consistent, and up-to-date version of the database and that the version (or download date) is recorded for reproducibility. Germline databases are sometimes installed automatically with annotation tools: where that is the case, researchers should check if the installed version meets these requirements, and update it if necessary.
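One simple way to record which germline set was used is to store its retrieval date together with a checksum of the file contents. The sketch below is one possible approach rather than a community standard; the record layout, the function name, and the example FASTA entries are hypothetical.

```python
import hashlib


def describe_germline_set(name, fasta_text, retrieved):
    """Summarize a germline FASTA file for the analysis log."""
    return {
        "name": name,
        "retrieved": retrieved,  # download date as recorded by the analyst
        "n_sequences": sum(line.startswith(">") for line in fasta_text.splitlines()),
        "sha256": hashlib.sha256(fasta_text.encode("utf-8")).hexdigest(),
    }


# Hypothetical two-allele germline set.
EXAMPLE_FASTA = ">EXAMPLE-V1*01\nACGTACGTAC\n>EXAMPLE-V1*02\nACGTACGTAT\n"
record = describe_germline_set("example_germline_set", EXAMPLE_FASTA, "2022-01-01")
print(record["n_sequences"], record["sha256"][:12])
```

Comparing checksums between analyses quickly reveals whether an automatically installed database differs from the one used previously, even when no version number is available.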

In a repertoire from a single individual, although structural variation and gene duplication give rise to frequent exceptions, we would expect to see a maximum of two alleles of most germline receptor genes: one from the paternal and one from the maternal chromosome. When used with an extensive germline database, annotation tools that are based on sequence similarity tend to call a biologically implausible number of alleles in B-cell repertoires, particularly in repertoires that are highly mutated, and will make a large number of indeterminate calls, where the tool is unable to determine the likely germline allele unambiguously. Tools are available that improve allele calls by using probabilistic methods to infer the individual's "personalized" germline set; such tools can also infer the presence of alleles in the individual that were not listed in the annotation tool's germline database [15–17, 68, 69]. While the use of a comprehensive germline database is important in the first instance, the determination of a personalized germline set and re-annotation with just that set is recommended where allele assignment is important, for example, when clonal inference is employed. Personalization can also compensate to some extent for deficiencies in the germline database.
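A quick diagnostic for biologically implausible allele calls is to count, for each gene, how many alleles are supported by a meaningful share of the reads. The Python sketch below is a crude screen and not a substitute for dedicated germline inference tools: it skips ambiguous calls, and the 5% frequency cutoff and the limit of two alleles are hypothetical parameters that do not account for duplicated genes.

```python
from collections import Counter, defaultdict


def alleles_per_gene(v_calls, min_fraction=0.05):
    """Return {gene: sorted alleles each supported by at least min_fraction of that gene's reads}."""
    counts = defaultdict(Counter)
    for call in v_calls:
        if "," in call:
            continue  # skip ambiguous calls
        gene, _, allele = call.partition("*")
        if allele:
            counts[gene][call] += 1
    result = {}
    for gene, allele_counts in counts.items():
        total = sum(allele_counts.values())
        result[gene] = sorted(
            name for name, n in allele_counts.items() if n / total >= min_fraction
        )
    return result


def implausible_genes(per_gene, max_alleles=2):
    """Genes with more supported alleles than expected for a diploid individual."""
    return sorted(g for g, alleles in per_gene.items() if len(alleles) > max_alleles)


# Hypothetical allele calls from one individual.
v_calls = (
    ["IGHV1-69*01"] * 50 + ["IGHV1-69*06"] * 45 + ["IGHV1-69*02"] * 3
    + ["IGHV3-23*01"] * 40 + ["IGHV3-23*04"] * 30 + ["IGHV3-23*02"] * 30
)
per_gene = alleles_per_gene(v_calls)
print(implausible_genes(per_gene))
```

Genes flagged in this way are candidates for closer inspection, for example by inferring a personalized germline set and re-annotating the repertoire against it.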

The choice of annotation tool also depends on the computer skill set of the user. IMGT/HIGHV-QUEST and IgBLAST provide easy-to-use web platforms, suited for researchers who prefer a graphical user interface. Other tools, such as the stand-alone version of IgBLAST [10], MiXCR [1], and partis [11], require additional computer expertise, because they need to be installed and are run from the terminal. The advantage of such tools is that they provide more flexibility and can be integrated into automated workflows.

### 4 Conclusion

In this chapter, we present important considerations involved in the first steps in the preparation of raw data after sequencing and guide bioinformaticians in choosing the appropriate parameters for preprocessing and annotation. These first steps are required for the subsequent repertoire analysis, described in Chap. 17, as choices made in these first steps have serious implications for the types of data analyses that can be performed and for the accuracy of the results. After the completion of this chapter, the bioinformatician is ready to begin the in-depth analysis of repertoire features specific to the question at hand.

### Acknowledgments

The authors would like to thank Eline T. Luning Prak for the constructive criticism of the manuscript.

### References


lymphoproliferations: report of the BIOMED-2 concerted action BMH4-CT98-3936. Leukemia 17:2257–2317. https://doi.org/10.1038/sj.leu.2403202


Slow delivery immunization enhances HIV neutralizing antibody and germinal center responses via modulation of immunodominance. Cell 177:1153–1171.e28. https://doi.org/10.1016/j.cell.2019.04.012


repertoires. Front Immunol 10:2541. https://doi.org/10.3389/fimmu.2019.02541


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Adaptive Immune Receptor Repertoire (AIRR) Community Guide to Repertoire Analysis

Susanna Marquez, Lmar Babrak, Victor Greiff, Kenneth B. Hoehn, William D. Lees, Eline T. Luning Prak, Enkelejda Miho, Aaron M. Rosenfeld, Chaim A. Schramm, and Ulrik Stervbo, on behalf of the AIRR Community

### Abstract

Adaptive immune receptor repertoires (AIRRs) are rich with information that can be mined for insights into the workings of the immune system. Gene usage, CDR3 properties, clonal lineage structure, and sequence diversity are all capable of revealing the dynamic immune response to perturbation by disease, vaccination, or other interventions. Here we focus on a conceptual introduction to the many aspects of repertoire analysis and orient the reader toward the uses and advantages of each. Along the way, we note some of the many software tools that have been developed for these investigations and link the ideas discussed to chapters on methods provided elsewhere in this volume.

Key words AIRR-seq, B-cell receptor, T-cell receptor, Analysis, Clonal structure

### 1 Introduction

Once an adaptive immune receptor repertoire (AIRR) experiment has been carried out and the data has been appropriately preprocessed and annotated (see chapter "AIRR Community Guide to TR and IG Gene Annotation"), the next step is to plan a course of analysis to answer the questions posed by the experiment. As AIRRs are complex datasets that can contain thousands or even millions of sequences, it is important to have a working familiarity with the type of information each analysis can provide, as well as the limitations of an analysis. Here we provide an introduction to a variety of widely used techniques and discuss their applicability. In other chapters in this volume, we provide detailed experimental protocols and instructions to perform such analyses for the purpose of addressing specific biological questions. For a definition of terms used throughout this chapter, please see the AIRR Community glossary of terms, available at https://zenodo.org/record/5095381.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_17, © The Author(s) 2022

### 2 Materials

A breathtaking array of computational tools is available for repertoire analysis. These range from bespoke command line tools, written in various programming languages and requiring facility with a Linux terminal, to software with fully developed graphical interfaces and no requirement for programming skills of any kind. Thus, a key factor in choosing which programs to use will be the skill level and comfort of the user. Moreover, most tools can perform only a narrow range of analyses, so matching the implementation to the desired goal is also a critical consideration. In addition, thought must be given to the computational resources necessary for repertoire analysis, including both storage and processing.

A comprehensive listing of the available software is out of the scope of this conceptual introduction, but the interested reader is directed to some recent reviews [1–4]. Here we focus on a small selection of commonly used tools, especially those which comply with AIRR Community guidelines for reproducibility and interoperability (https://docs.airr-community.org/en/stable/swtools/airr\_swtools\_standard.html). These are highlighted in Table 1, and several are discussed in more detail below and in other chapters in this volume, where we demonstrate their application to common analytical tasks.

### 3 Methods

In this section we introduce some of the most frequently used methods to analyze AIRRs and suggest computational tools that can perform such analysis. Some of the methods are applicable to both IG and TR, and some are specific. In addition, the selection of the method and the interpretation of the results can depend on the specific biological state; for instance, some samples might be expanded from solid tumors, others from antigen-specific cells isolated from peripheral blood or from whole blood from healthy and diseased patients. The theoretical framework presented here can be used to interpret the results of the practical methods detailed in the AIRR Community chapters "Bulk gDNA Sequencing of Antibody Heavy Chain Gene Rearrangements for Detection and Analysis of B-Cell Clone Distribution," "Bulk Sequencing From mRNA With UMI for Evaluation of B-Cell Isotype and Clonal Evolution and Single-Cell Analysis," and "Tracking of Antigen-Specific T Cells: Integrating Paired-Chain AIRR-Seq and Transcriptome Sequencing," all in this volume.




### Table 1 (continued)


#### 3.1 Gene Usage

The V gene is the most diverse gene of the TR and IG loci. This is driven especially by variation in the first and second complementarity-determining regions (CDR1 and CDR2) of the genes, which contribute to the specificity and affinity of the immune receptor. Differences in the distribution of V genes used in the rearranged repertoire can indicate an antigen-specific response or unusual clonal expansions and can be evaluated with the function compareVGeneDistributions of the sumrep R package [20] (https://github.com/matsengrp/sumrep). The D and J genes contribute strongly to the CDR3, and their usage can be compared using compareDGeneDistributions and compareJGeneDistributions. Skewing of the V-J usage can be revealed by plotting the V-J combinations as a heatmap (Fig. 1a). The distribution of V-J and V-D-J usage can be compared between two repertoires using the functions compareVJDistributions and compareVDJDistributions in sumrep.
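To make the V-J heatmap idea concrete, the following Python sketch tabulates V-J usage from a list of gene calls. It is a minimal illustration under stated assumptions: the input is a plain list of `(V gene, J gene)` pairs, and the function names are hypothetical; it does not reproduce the sumrep functions named above.

```python
from collections import Counter

def vj_usage(pairs):
    """Return {(v_gene, j_gene): fraction of rearrangements} for (V, J) pairs."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def usage_matrix(usage):
    """Arrange V-J usage as rows (V genes) by columns (J genes) for a heatmap."""
    v_genes = sorted({v for v, _ in usage})
    j_genes = sorted({j for _, j in usage})
    matrix = [[usage.get((v, j), 0.0) for j in j_genes] for v in v_genes]
    return v_genes, j_genes, matrix

# Toy repertoire of eight rearrangements.
pairs = ([("TRBV20-1", "TRBJ2-7")] * 3 + [("TRBV20-1", "TRBJ1-1")] +
         [("TRBV5-1", "TRBJ2-7")] * 4)
usage = vj_usage(pairs)
v_genes, j_genes, matrix = usage_matrix(usage)
for v, row in zip(v_genes, matrix):
    print(v, row)
```

The resulting matrix can be passed to any plotting library; a combination that dominates one sample but not another is a candidate for skewed usage.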


Due to the randomness in the addition and deletion of nucleotides during the rearrangement of the receptor, CDR3 lengths are distributed around a mean value (Fig. 1b). A change in this distribution can signify an expansion of cells with a particular immune receptor.
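As a small illustration, the CDR3 length distribution of a repertoire can be tabulated as follows. This is a sketch in Python that assumes the CDR3s are available as amino acid strings; the function name is hypothetical.

```python
from collections import Counter

def cdr3_length_distribution(cdr3s):
    """Return {length: fraction of sequences}, ordered by increasing length."""
    counts = Counter(len(s) for s in cdr3s)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

cdr3s = ["CASSLGQETQYF", "CASSPGTGELFF", "CASSQDRGNTEAFF", "CASRGYEQYF"]
print(cdr3_length_distribution(cdr3s))
```

Comparing such distributions between samples, or against a reference repertoire, shows whether a particular length is over-represented.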

Fig. 1 Data visualizations. Examples of different data visualizations to gain insights into the AIRR. The plot title describes the basic analysis. Smp = sample. For further details, please refer to the main text

Different receptors specific for the same epitope can be expected to share motifs [27, 28]. Such motifs can be a few identical amino acids or amino acids with similar physical properties. Apart from properties like size, charge, and polarity, the properties of amino acids can be described by different factors derived through dimensionality reduction of a larger number of properties. Atchley [29] factors comprise five numerical descriptions, and Kidera [30] factors comprise ten numerical descriptions.

The R package sumrep [20] provides functions to compare the CDR3 properties of two repertoires, such as the CDR3 length and a number of amino acid physicochemical properties [31, 32].

#### 3.3 Clonal Lineages

A clone or a clonal lineage comprises a group of T or B cells descended from the same original naive ancestor. As such, all cells in a clone contain the same set of rearrangements. An important part of AIRR-seq analysis is computationally reconstructing these relationships from the sequences obtained. For TRs, the exercise is relatively straightforward, as only PCR and sequencing error need to be accounted for. With IGs, however, somatic hypermutation can significantly obscure the ancestry of a particular sequence [33], and so more complex strategies are required (see Subheading 3.9.2).

When analyzing bulk AIRR-seq data, in which native pairing between heavy and light, alpha and beta, and gamma and delta chains is lost, clones are sometimes defined based on a single chain. This may be sufficient for IGH and TRB rearrangements, which are more diverse and contain most of the information needed to group sequences into clonal lineages [34]. However, care should still be taken in interpreting such data. Many different definitions of clonally related sequences have been offered in the literature (e.g., see the work by Kotouza and co-workers [35]), and methods to infer clones from AIRR-seq data are under active investigation [6, 12, 14, 18, 36].

The distribution of clone sizes in an AIRR can be informative of underlying biology. One visualization is to plot clone ranks from high to low on the x-axis and the associated frequencies on the y-axis (Fig. 1c) to reveal clonal expansion. A closer look at the top x clones (Fig. 1d) likewise helps to identify clonal expansion. When plotting the log of rank against the frequency (Fig. 1e), the slope reveals the distribution of clones, such that the steeper the slope, the less evenly distributed the repertoire. The function estimateAbundance in the R package Alakazam [6] can estimate clonal abundance with confidence intervals obtained by bootstrapping.

An alternative visualization divides clonal frequencies into different groups ("binning") and sums the frequencies in each bin (Fig. 1f). The binning is essentially arbitrary, but binning the clone frequencies into the bins [0.0, 0.001], [0.001, 0.01], [0.01, 0.1], and [0.1, 1] is widely used. Binning by rank is an alternative, where the bins [1, 10], [11, 100], [101, 1000], and [1001, inf] are common.
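The rank-frequency and binning views can be sketched in a few lines of Python. The example below is illustrative: the function names are hypothetical, and the bins are treated as half-open intervals (lower, upper], which is an assumption made here so that a frequency lying exactly on a shared edge is counted only once.

```python
def clone_frequencies(clone_counts):
    """Clone frequencies sorted from the largest clone (rank 1) to the smallest."""
    total = sum(clone_counts)
    return sorted((c / total for c in clone_counts), reverse=True)

def bin_by_frequency(freqs, edges=(0.001, 0.01, 0.1, 1.0)):
    """Sum clone frequencies per bin.

    Each clone is placed in the first bin whose upper edge it does not exceed,
    i.e. the bins are [0, 0.001], (0.001, 0.01], (0.01, 0.1], (0.1, 1].
    """
    sums = [0.0] * len(edges)
    for f in freqs:
        for i, upper in enumerate(edges):
            if f <= upper:
                sums[i] += f
                break
    return sums

# Toy repertoire: sequence counts for ten clones (1000 sequences in total).
counts = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1]
freqs = clone_frequencies(counts)
for rank, f in enumerate(freqs, start=1):
    print(rank, f)                 # points of the rank-frequency plot
print(bin_by_frequency(freqs))     # summed frequency per bin
```

In this toy repertoire most of the summed frequency falls in the highest bin, which is the pattern expected for a strongly expanded repertoire.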

#### 3.4 Diversity

The concept of diversity unites two properties of a repertoire, namely, the number of distinct clones and their distribution. As such, diversity describes the composition and state of a repertoire. For instance, a repertoire derived from a completely naive cell population is much more diverse both in terms of distinct clones and their distribution compared to the repertoire of antigen-specific memory cells.

There are numerous sampling factors that are important to consider when measuring diversity. Perhaps the most important is whether a sample is derived from gDNA or mRNA [37]. As discussed in Subheading 3.1 in chapter "AIRR Community Guide to TR and IG Gene Annotation," in the case of gDNA, each sampled cell contributes one or two templates, while the number of templates in mRNA data will be skewed by cell subset-specific transcript abundance. In the former case, diversity measures will be influenced substantially less by the underlying subset distribution than in the latter. For both, one can measure diversity weighted by copy number or by clone number. For DNA data, using copy number-weighted diversity measures can give a sense of how similar sequences are in the underlying repertoire, while using an unweighted measure will indicate how similar clones are. With RNA, using copy number-weighted measures will give a general measure of how similar large clones are, and unweighted measures will give a measure of how similar all clones are.

Another consideration when analyzing diversity is the depth of sequencing, that is, the proportion of clones that were sequenced compared to how many were actually in the sample. Assessing appropriate sequencing depth is no trivial task, but it is very important, as undersampling can lead to false conclusions. Rarefaction curves [38] can help to evaluate whether a repertoire is near full sampling depth. In this visualization, the number of distinct clones is plotted for a given subsample size (Fig. 1g). If the number of distinct clones plateaus, the repertoire is near full sampling depth. Conversely, the absence of a plateau indicates that the sampling depth of the repertoire is shallow.
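A rarefaction curve can be computed analytically instead of by repeated random subsampling. The Python sketch below uses the standard expectation for sampling without replacement, in which a clone with count c out of N sequences is missed by a subsample of size n with probability C(N − c, n) / C(N, n). The function name is hypothetical, and clone counts are assumed to be sequence counts per clone.

```python
from math import comb

def expected_distinct_clones(clone_counts, n):
    """Expected number of distinct clones in a random subsample of n sequences.

    Sampling is without replacement from the observed sequences.
    """
    total = sum(clone_counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must be between 0 and the total count")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when the subsample cannot avoid the clone.
    return sum(1 - comb(total - c, n) / denom for c in clone_counts)

counts = [5, 3, 1, 1]
curve = [(n, expected_distinct_clones(counts, n)) for n in range(0, 11)]
for n, s in curve:
    print(n, round(s, 3))
```

A curve that is still rising steeply at the full sample size suggests that many clones in the sample remain unobserved.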

Another use of rarefaction is an estimation of the total number of clones from the sample. To achieve this, libraries from the sample of interest must be run in replicates, where more replicates give a more accurate estimate of total clones [39].

There are a large number of diversity metrics. These metrics are unified by Hill numbers, which are calculated over a range of diversity orders to generate a smooth curve (Fig. 1h) [40–42]. The function calcDiversity in the R package Alakazam estimates the Hill numbers for a repertoire. The same function also makes calculation of particular diversity indices straightforward. The function compareHillNumbers in the R package sumrep compares one or more Hill numbers of two repertoires. Newer approaches toward diversity metrics specific for AIRR make use of Hill numbers combined with a functional similarity matrix [43].
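For orientation, the Hill number of order q for clone frequencies p_i is (Σ p_i^q)^(1/(1−q)), with the limit q → 1 given by the exponential of the Shannon entropy. The Python sketch below implements this textbook definition directly; it is not the Alakazam implementation, which additionally involves resampling, and the function name is hypothetical.

```python
from math import exp, log

def hill_number(clone_counts, q):
    """Hill number (effective number of clones) of order q."""
    total = sum(clone_counts)
    p = [c / total for c in clone_counts if c > 0]
    if abs(q - 1.0) < 1e-9:
        # Limit q -> 1: exponential of the Shannon entropy.
        return exp(-sum(pi * log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

even = [10, 10, 10, 10]     # four equally sized clones
skewed = [97, 1, 1, 1]      # one dominant clone
for q in (0, 1, 2):
    print(q, hill_number(even, q), hill_number(skewed, q))
```

Order 0 gives the number of distinct clones (richness), while higher orders give increasing weight to large clones, so the skewed repertoire drops toward an effective size of one as q grows.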

#### 3.5 Similarity of AIRR Sequences

The similarity of AIRR sequences directly influences antigen recognition breadth: the more dissimilar the receptors are, the larger is the antigen space covered. One major approach to interrogate and measure AIRR sequence similarity is network analysis (Fig. 1i) [44–50]. Networks allow investigation of sequence similarity and thereby add a complementary layer of information to repertoire diversity analysis. Sequence networks are built by defining each nucleotide or amino acid sequence as a node. Two nodes are connected with an edge if a certain similarity condition is satisfied, which is typically defined as a string distance (e.g., Levenshtein/edit distance). A commonly used distance for both IG and TR is one amino acid difference [44]. For B cells, networks representing amino acid distances of up to 12 amino acids have been reported [47]. Building a sequence similarity network is computationally expensive. This challenge has been approached by at least two methods that allow the construction of large-scale networks from millions of AIRR sequences [47, 51].
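The construction described above can be sketched as follows. This Python example is a naive illustration with hypothetical function names: it compares every pair of sequences, so its cost grows quadratically with repertoire size, which is exactly why dedicated large-scale methods are needed for real data.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def similarity_edges(sequences, max_distance=1):
    """Edges between distinct sequences within `max_distance` edits of each other."""
    nodes = sorted(set(sequences))
    return [(a, b) for a, b in combinations(nodes, 2)
            if levenshtein(a, b) <= max_distance]

cdr3s = ["CASSLGQETQYF", "CASSLGQETQFF", "CASSLGETQYF", "CASRGYEQYF"]
for edge in similarity_edges(cdr3s):
    print(edge)
```

The edge list can be loaded into igraph, Cytoscape, or Gephi for visualization, or used directly to compute degree and other graph properties.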

Although networks of a few thousand nodes may be visualized using software suites such as igraph, Cytoscape, and Gephi [52, 53], the visual interpretation of networks becomes indiscernible with a size of >10<sup>2</sup> nodes. Furthermore, the visualization of networks does not provide quantitative information regarding the network similarity architecture. To address this problem, graph properties and network analysis have recently been employed to quantify the architecture of large-scale AIRR networks [47]. Architecture analytics may be subdivided into properties that capture the repertoire at the global level (generally one coefficient per network) and those that describe the repertoire at the local level (one coefficient per sequence per repertoire). These network measures may be used to identify enrichment of network clusters (Fig. 1i), potentially originating from an ongoing immune response [46, 47].

To increase precision in isolating immune-associated AIRR sequences and clusters, network analysis may be coupled with AIRR generation probabilities [45]. More generally, it has been observed that sequences that tend to show increased sharing across individuals (discussed in Subheading 3.7) are also more connected within a repertoire [45, 47, 48] and confer robustness on its architecture with respect to network properties [47].

Recently, sequence similarity and diversity analysis have been combined, providing further insights into AIRR architecture [43].

#### 3.6 Similarity among Repertoires

Similarity indices measure the similarity of two populations by not only considering the number of shared clones but also taking clone count or frequency into account (Fig. 1j). Similarity is sometimes calculated as dissimilarity (for historical reasons), but the index is always in the range of [0, 1]. It is therefore important to indicate the meaning of 0 and 1 to avoid confusion. One of the most popular indices is called Morisita-Horn, implemented in the function vegdist in the R package vegan [54]. Numerically, the observed overlaps are usually small, but given the size of the potential repertoire being sampled, the a priori chance of any overlap is very small. Alternatively, the CDR3s shared between samples can be plotted as a true/false heatmap (Fig. 1k). This is particularly useful when tracking clones over time or assessing the specificity of transplant-infiltrating cells [55, 56].
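As a worked illustration, the Morisita-Horn index for two repertoires with clone counts x_i and y_i and totals X and Y is 2 Σ x_i y_i / ((Σ x_i²/X² + Σ y_i²/Y²) X Y). The Python sketch below computes it as a similarity, where 1 means identical clone proportions and 0 means no shared clones; the function name is hypothetical. To the best of our knowledge, vegdist reports the corresponding dissimilarity (one minus this value), which illustrates why the meaning of 0 and 1 should always be stated.

```python
def morisita_horn(counts_a, counts_b):
    """Morisita-Horn similarity between two repertoires given {clone: count} dicts."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # Only clones present in both repertoires contribute to the numerator.
    shared = sum(counts_a[c] * counts_b[c] for c in counts_a.keys() & counts_b.keys())
    d_a = sum(n * n for n in counts_a.values()) / total_a ** 2
    d_b = sum(n * n for n in counts_b.values()) / total_b ** 2
    return 2 * shared / ((d_a + d_b) * total_a * total_b)

rep1 = {"CASSLGQETQYF": 10, "CASSPGTGELFF": 30}
rep2 = {"CASSLGQETQYF": 1, "CASSPGTGELFF": 3}    # same proportions, lower depth
rep3 = {"CASRGYEQYF": 7}                         # no clones in common
print(morisita_horn(rep1, rep2))
print(morisita_horn(rep1, rep3))
```

Because the index works on proportions, two samples of the same repertoire sequenced at different depths still score close to 1.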

Similarities on other parameters such as different amino acid properties as well as pairwise CDR3 distance and GC content can be compared between repertoires by the function compareRepertoires in the R package sumrep.

Other proposed similarity measures make use of feature counting [57], while another B-cell-specific similarity metric focuses on identical CDR3 length together with identical V and J genes considered within and between repertoires [58].

#### 3.7 Public Clones

Though not clones in a true biological sense, the existence of identical TRs and identical or closely similar IGs in multiple individuals due to convergent rearrangement has been noted on several occasions [59–61]. Such rearrangements are termed public clones and can yield insights into common selection patterns, which in turn can elucidate how the immune system responds to disease and if there are commonalities between individuals. The ability to identify public clones in an AIRR depends on the sequencing depth and the number of individuals tested [62, 63]. In addition, the meaning of a public immune receptor must be assessed in the context of the likelihood for it to be generated [8, 13]. Receptors with shorter CDR3s are more likely to be generated by chance and can overlap even between individuals with no exposures in common [60, 64, 65] and do not necessarily indicate a convergent response in multiple individuals to similar antigens. Sequences that share the same (preferably longer) CDR3 amino acid sequence but have different nucleotide sequences are more convincing as candidate public clones, as differences in the nucleotide sequences may indicate independent generation with convergent selection [66].
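The last criterion, a shared CDR3 amino acid sequence encoded by different nucleotide sequences, can be checked with a simple grouping step. The Python sketch below assumes records of the form `(individual, CDR3 amino acids, CDR3 nucleotides)`; the record layout and function name are hypothetical, and no generation probability is taken into account.

```python
from collections import defaultdict

def public_cdr3s(records, min_individuals=2):
    """Group records by CDR3 amino acid sequence and keep those seen in
    at least `min_individuals` individuals.

    Returns {cdr3_aa: {"individuals": set, "nt_variants": set}}.
    """
    seen = defaultdict(lambda: {"individuals": set(), "nt_variants": set()})
    for individual, cdr3_aa, cdr3_nt in records:
        seen[cdr3_aa]["individuals"].add(individual)
        seen[cdr3_aa]["nt_variants"].add(cdr3_nt)
    return {aa: info for aa, info in seen.items()
            if len(info["individuals"]) >= min_individuals}

records = [
    ("donor1", "CASSLF", "TGTGCCAGCAGCTTGTTC"),
    ("donor2", "CASSLF", "TGCGCTAGTTCACTCTTT"),   # same amino acids, different codons
    ("donor1", "CASRF", "TGTGCCAGCAGGTTC"),       # private to donor1
]
for aa, info in public_cdr3s(records).items():
    print(aa, sorted(info["individuals"]), len(info["nt_variants"]))
```

Candidates with several distinct nucleotide variants are the more convincing ones; very short CDR3s such as those in this toy example would in practice need to be weighed against their high probability of being generated by chance.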

Functionally identical IGs can be identified by allowing some degree of difference in the CDR3. There is no well-defined cutoff that ensures the capture of a majority of receptors with identical specificities without including IGs of unrelated specificity in a particular collection of public IGs. A commonly used cutoff is 10–20% amino acid difference in the CDR3 [67–70]. Although a less restrictive cutoff might detect more divergent public clones [71], care must be taken to avoid identification of spurious public immune receptors [72]. Cross-contamination and index hopping on the sequencer further complicate the identification of public clones [73], and suitable definitions and analysis parameters may be helpful.

#### 3.8 Detection and Monitoring of Cross-Sample Contamination Events

Despite strict quality assurance and control measures, PCR-based sample cross-contamination can occur at any time. Environmental contamination events are expected to arise from the presence of remaining DNA amplicons, which can be re-amplified and incorporated into new, unrelated libraries [74]. PCR contaminations can lead to major losses of reagents, time, and samples, and rapid detection and isolation are critical to the health of an AIRR-seq research laboratory. There are several experimental precautions that can reduce contamination, including separate work areas and different sample barcodes, as illustrated in the AIRR Community chapter "Quality Control: Chain Pairing Precision and Monitoring of Cross-Sample Contamination."

#### 3.9 B-Cell-Specific Aspects

##### 3.9.1 IG SHM Analysis

SHM is the process driving the affinity maturation of IGs during the adaptive immune response [75]. Mutations are introduced at a rate of ~10<sup>-3</sup> mutations per base pair per division. These mutations are not randomly distributed along the IG but accumulate more in hotspots and CDRs, whereas coldspots and framework regions are disfavored for mutation. Furthermore, substitution profiles may be germline gene-directed [76–79], possibly as a consequence of specific features of the encoded protein sequence. Understanding SHM biases is key to developing better tools to reconstruct lineages, quantify selection pressure, and generate realistic simulated sequence data [9, 79, 80].

To better understand the distribution of targets for SHM, it is possible, for instance, to use the R package sumrep, which provides two functions, getHotspotCountDistribution and getColdspotCountDistribution, to compute the distribution of hotspot and coldspot motifs in the repertoire. In addition, sumrep interfaces with the R package SHazaM [6], which calculates a mutability model for the likelihood that the center base in a 5-mer is mutated (the function getMutabilityModel). The associated function getSubstitutionModel provides the relative probabilities that the center base in a 5-mer is mutated into each of the other three nucleotides. SHazaM also provides methods for quantifying selection pressure and assessing whether it has contributed to the nature of the specific IG repertoire during antigenic stimulation [81].
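Counting motif occurrences is straightforward to illustrate. The Python sketch below counts IUPAC-coded motifs in a nucleotide sequence, including overlapping occurrences. The motifs used in the example (WRC and its reverse complement GYW as hotspots, SYC as a coldspot) are commonly described SHM motifs chosen here as an assumption; they are not necessarily the defaults of the sumrep functions named above, and the function name is hypothetical.

```python
import re

# IUPAC nucleotide codes needed for the motifs used below.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def count_motif(sequence, motif):
    """Count occurrences of an IUPAC motif in a sequence, overlaps included."""
    # A zero-width lookahead matches at every start position of the motif,
    # so overlapping occurrences are all counted.
    pattern = "(?=" + "".join(IUPAC[base] for base in motif) + ")"
    return len(re.findall(pattern, sequence.upper()))

seq = "AGCTACAGCT"
print("WRC hotspots:", count_motif(seq, "WRC"))
print("GYW hotspots:", count_motif(seq, "GYW"))
print("SYC coldspots:", count_motif(seq, "SYC"))
```

Applying the function to every sequence in a repertoire yields a per-sequence count distribution that can be compared between repertoires.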

##### 3.9.2 Identification of B-Cell Clones

As noted above, B-cell clones can be inferred from AIRR-seq data by analyzing their CDR3s and/or mutation patterns (Fig. 1l). Repertoires usually consist of hundreds or thousands of clonal lineages. Due to the presence of SHM, members of a B-cell clone cannot be identified solely based on identical CDR3s. There are many methods available to group IGs into clonal lineages (Table 1), but all generally attempt to computationally group sequences which likely share a common progenitor. However, different approaches can drastically change the interpretation of the underlying IG immune repertoire.

Some approaches begin by grouping sequences by their CDR3 independent of their V, D, or J gene usage [22]. Other software first groups sequences by gene (generally just V and J due to the difficulty in D gene annotation) and CDR3 length, after which sequences similar in the CDR3 are grouped into clonal lineages [12, 19, 82, 83]. SCOPer does a similar grouping, but then evaluates the similarity by analyzing shared SHM in the V and J genes [84]. Finally, some pipelines use common mutations in the body of the V gene to group sequences from the same clonal lineage [36, 85]. It is also possible to combine these approaches, but this section focuses on each independently.

Each approach has potential benefits and flaws. Initially grouping sequences by CDR3, either by identity or hierarchical clustering, can result in inflated copy number and sequence counts for common CDR3s (in particular those of short length that incorporate few non-templated bases) which may have arisen independently and utilize different genes. However, this method can be beneficial as some gene calls may be incorrect (in particular when annotation of sequences has not been made using a personalized repertoire as defined above), and similar CDR3 amino-acid sequences, especially those with long lengths, can indicate that sequences are related.

Grouping sequences by both gene annotation and CDR3 length prior to inferring clonal lineages can be beneficial for a number of reasons. Because V gene annotation is generally robust to sequencing error, sequences with similar CDR3s but different V gene assignments are unlikely to derive from the same rearrangement. Binning by gene annotation can therefore prevent erroneous clonal groupings. It also eases the computational burden, as CDR3 identity only needs calculation among smaller sets of sequences. Similar advantages apply to binning by CDR3 length as well, since distance metrics can be calculated more efficiently without the need for alignment. While insertions and deletions can occur as part of SHM, they are relatively rare [86, 87] and can be neglected in many cases.

Once sequences have been binned, hierarchical clustering is a common technique for identifying clonally related sequences [82]. This requires a choice of linkage (e.g., single, average) to define the distance between groups of sequences and a threshold for cutting the hierarchy into discrete groups. A convenient way to set the threshold is to analyze the distribution of distances between nearest neighbors. This distribution is typically bimodal, with the first mode representing sequences in the same clonal lineage, while the second mode represents sequences that do not have any relatives in the data. If the distribution for a particular sample is not bimodal, a set of external sequences from a different subject can be used to establish the threshold [82]. While the threshold for separating the two modes can sometimes be established by visual inspection of the distribution, there are algorithmic methods to determine it more consistently [18].
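The binning and clustering steps described above can be sketched as follows. This Python example is a simplified illustration, not the algorithm of any cited tool: it assumes records of the form `(sequence id, V gene, J gene, junction nucleotides)` with non-empty junctions, uses single linkage on the fraction of mismatching junction positions, and takes a fixed threshold of 0.15 purely as a placeholder. In practice the threshold would be chosen from the nearest-neighbor distance distribution, which the second helper computes. All function names are hypothetical.

```python
from collections import defaultdict

def hamming_fraction(a, b):
    """Fraction of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def nearest_neighbor_distances(junctions):
    """Distance from each junction to its closest neighbor (equal lengths assumed)."""
    return [min(hamming_fraction(a, b) for k, b in enumerate(junctions) if k != i)
            for i, a in enumerate(junctions)]

def assign_clones(records, threshold=0.15):
    """Single-linkage clonal grouping within V gene / J gene / junction length bins.

    Returns {seq_id: clone_id}.
    """
    bins = defaultdict(list)
    for seq_id, v_gene, j_gene, junction in records:
        bins[(v_gene, j_gene, len(junction))].append((seq_id, junction))

    clone_of, next_clone = {}, 0
    for key in sorted(bins):
        members = bins[key]
        parent = list(range(len(members)))       # union-find forest for this bin

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]    # path compression
                i = parent[i]
            return i

        # Link every pair of junctions closer than the threshold.
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                if hamming_fraction(members[a][1], members[b][1]) <= threshold:
                    parent[find(a)] = find(b)

        labels = {}
        for i, (seq_id, _) in enumerate(members):
            root = find(i)
            if root not in labels:
                labels[root] = next_clone
                next_clone += 1
            clone_of[seq_id] = labels[root]
    return clone_of

records = [
    ("s1", "IGHV3-23", "IGHJ4", "TGTGCGAGAGATTACTGG"),
    ("s2", "IGHV3-23", "IGHJ4", "TGTGCGAGAGATTATTGG"),   # one mismatch to s1
    ("s3", "IGHV3-23", "IGHJ4", "TGTACCCGTCATGGCTGG"),   # same bin, distant junction
    ("s4", "IGHV1-69", "IGHJ4", "TGTGCGAGAGATTACTGG"),   # different V gene
]
print(assign_clones(records))
print(nearest_neighbor_distances([r[3] for r in records[:3]]))
```

Note that s4 is kept apart from s1 despite an identical junction, which is the intended effect of binning by gene annotation; with incorrect gene calls the same behavior would split a true clone.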

The last common approach is to group sequences into clones by common mutations in the body of the V gene. This can be done by constructing clonal lineages directly or by inspecting the k-mers of each sequence [36, 88]. Unlike methods that first separate sequences by gene call and junction length, this method takes advantage of infrequent mutations to group sequences into clones. This can be beneficial in certain circumstances. First, this method does not rely on proper gene calling or sequence alignment, which can be difficult in samples containing highly mutated populations or more generally due to sequencing error. Additionally, it is not sensitive to junction length, allowing sequences that have accumulated insertions and deletions to be grouped into clones [89, 90]. This method requires defining the minimum number of shared mutations needed to group two sequences into the same clone. A fixed value can be used, or the value can be determined dynamically based on the distribution of distances between each pair of sequences.

##### 3.9.3 IG Affinity Maturation

The reconstruction and analysis of IG clonal lineage trees is a powerful method to understand the immune response, affinity maturation, and the generation of broadly neutralizing antibodies (bnAbs) [91–93]. Within a B-cell clonal lineage, B cells descended from a shared common ancestor evolve through SHM and antigen-driven selection. While standard algorithms for inferring phylogenetic trees using maximum parsimony and maximum likelihood [94] are often employed, these approaches can be improved [80]. In particular, the unique biology of B cells can present problems for standard phylogenetic approaches and has led to the development of B-cell-specific phylogenetic tools. One cause of the problems is that SHM is enzymatically driven and biased by hotspot and coldspot motifs. This violates the assumption of independent evolution among sites that many likelihood-based phylogenetic methods rely on. To address this challenge, more context-aware phylogenetic methods, such as IgPhyML [9, 10], have been developed. While context-aware models of SHM clearly improve estimates of phylogenetic model parameters used to detect antigen-driven selection [10], it is less clear how much they improve estimates of tree topology and branch lengths [95]. Another problem is that while standard phylogenetic models consider clonal lineages individually, IG repertoires often contain hundreds of independent clones. The use of repertoire-wide models, which allow some parameters to be shared among these multiple clonal lineages, can improve model precision significantly [10].

One important application of B-cell phylogenetics is estimating the series of mutations leading from a clone's unmutated germline ancestor to a sequence of interest, such as a known bnAb sequence. While standard phylogenetic methods can reconstruct intermediate sequences, they are less appropriate for reconstructing the germline ancestral sequence because they do not take into account the biology of V(D)J rearrangement. This has led to the development of tools such as Cloanalyst and linearham [96, 97] that improve the reconstruction of these sequences by combining phylogenetic models with models of V(D)J rearrangement. Another feature of B-cell clonal lineages is that reconstructed intermediate sequences are often identical to observed IG sequences. Some tools, such as IgTree [98] and Alakazam [6], use this fact to simplify the visualization of these lineage trees by collapsing observed and sampled intermediate nodes. Finally, lineage trees containing B cells from multiple tissues, isotypes, and timepoints have the potential to be used to make inferences about how B-cell migration, isotype switching, and evolution over time occur. Multiple analyses have used lineage trees for this purpose [33, 40, 99, 100], and generalized tools for making these inferences from B-cell repertoires, such as Dowser and PopTree, are an area of active development [7].

#### 3.10 T-Cell-Specific Aspects

There is growing evidence that TR repertoire perturbations can serve as a biomarker of immune response toward some solid tumors [101–103] and pathogens such as Epstein-Barr virus (EBV), cytomegalovirus (CMV), Ebola, and SARS-CoV-2 [104–108]. Challenges with studying T-cell repertoires include the dependence of T-cell interactions on the major histocompatibility complex (MHC) [109], changes in TRBV usage based on MHC, and significant differences in TRBV usage and clonality between CD4+ and CD8+ repertoires [110–112].

Antigen-specific TCRs can be isolated either by sorting MHC-tetramer-positive cells or by sorting activated cells after stimulation with overlapping peptide pools. Staining with tetramers requires knowledge of the correct epitope in the right MHC context, and T cells with high affinity tend to be recovered with the highest efficiency. Therefore, tetramer staining sometimes fails to identify some of the relevant TCRs [113]. Stimulation with overlapping peptide pools, on the other hand, can lead to isolation of non-peptide-specific T cells due to bystander activation [114]. The TRs of the antigen-enriched cells can be compared to samples from different timepoints to track the frequency of clones of interest [104, 106].

### 4 Conclusion

In this chapter, we have provided a brief overview of diverse, widely used techniques to uncover biological information in AIRR-seq data. These techniques can be applied to all of the AIRR-seq data created using the methodologies described in this book. They further form the basis for selecting the optimal experimental protocol to address the biological question and choosing the computational methods used in the analysis.

### Acknowledgments

The authors would like to thank Mats Ohlin for the constructive criticism of the manuscript. US was supported by grants from Mercator Stiftung, Germany; German Research Foundation, Germany (DFG, grant 397650460); BMBF e:KID, Germany (01ZX1612A); and BMBF NoChro, Germany (FKZ 13GW0338B).

### References


https://doi.org/10.1534/genetics.116.196303

data. Bioinformatics 36:4817–4818. https://doi.org/10.1093/bioinformatics/btaa611

repertoires. Nat Commun 10:1321. https://doi.org/10.1038/s41467-019-09278-8

https://doi.org/10.1186/s12859-017-1556-5

Mascola JR et al (2017) Gene-specific substitution profiles describe the types and frequencies of amino acid changes during antibody somatic hypermutation. Front Immunol 8:537. https://doi.org/10.3389/fimmu.2017.00537

approach. J Mol Evol 17:368–376. https://doi.org/10.1007/BF01734359

generation sequencing allows complex differential diagnosis of T cell-related pathology: NGS allows complex differential diagnosis. Am J Transplant 13:2842–2854. https://doi.org/10.1111/ajt.12431

(2018) Peptide-MHC class I tetramers can fail to detect relevant functional T cell clonotypes and underestimate antigen-reactive T cell populations. J Immunol 200:2263–2279. https://doi.org/10.4049/jimmunol.1700242

114. Martin MD, Jensen IJ, Ishizuka AS, Lefebvre M, Shan Q, Xue H-H et al (2019) Bystander responses impact accurate detection of murine and human antigen-specific CD8 T cells. J Clin Invest 129:3894–3908. https://doi.org/10.1172/JCI124443

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Bulk gDNA Sequencing of Antibody Heavy-Chain Gene Rearrangements for Detection and Analysis of B-Cell Clone Distribution: A Method by the AIRR Community

### Aaron M. Rosenfeld, Wenzhao Meng, Kalisse I. Horne, Elaine C. Chen, Davide Bagnara, Ulrik Stervbo, and Eline T. Luning Prak, on behalf of the AIRR Community

### Abstract

In this method we illustrate how to amplify, sequence, and analyze antibody/immunoglobulin (IG) heavy-chain gene rearrangements from genomic DNA that is derived from bulk populations of cells by next-generation sequencing (NGS). We focus on human source material and illustrate how bulk gDNA-based sequencing can be used to examine clonal architecture and networks in different samples that are sequenced from the same individual. Although bulk gDNA-based sequencing can be performed on either IG heavy (IGH) or kappa/lambda light (IGK/IGL) chains, we focus here on IGH gene rearrangements because IG heavy chains are more diverse, tend to harbor higher levels of somatic hypermutations (SHM), and are more reliable for clone identification and tracking. We also provide a procedure, including code, and detailed instructions for processing and annotation of the NGS data. From these data we show how to identify expanded clones, visualize the overall clonal landscape, and track clonal lineages in different samples from the same individual. This method has a broad range of applications, including the identification and monitoring of expanded clones, the analysis of blood and tissue-based clonal networks, and the study of immune responses including clonal evolution.

Key words Antibody, Clone, Lineage, Immune repertoire profiling, Immunoglobulin, V(D)J recombination, Next-generation sequencing

### 1 Introduction

Antibodies or immunoglobulins (IGs) on B cells are generated through somatic recombination of variable (V), diversity (D), and joining (J) genes [1, 2] and further diversified through somatic hypermutation (SHM) [3, 4]. The collection of different B cells in an individual, also known as the immune repertoire, is complex, containing many different B cells with different antibodies. B cells that derive from the same progenitor are clonally related and harbor gene rearrangements that are identical or have very similar nucleotide sequences (differing only by SHM or sequencing errors). The grouping of antibody gene rearrangement sequences into clones provides a means of characterizing the immune repertoire with respect to the distribution, size, complexity, and dynamics of clones in different cell types and tissues [5–7].

Aaron M. Rosenfeld and Wenzhao Meng are shared first authors.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_18, © The Author(s) 2022

Here we describe a homebrew method, with primer sequences adapted for NGS from the BIOMED2 IG heavy-chain (IGH) PCR assays [8], to evaluate samples for evidence of B-cell clonal expansion and track clones in bulk gDNA samples. Similar methods exist as commercial services (e.g., Adaptive Biotechnologies, iRepertoire), and there are also similar homebrew methods for the analysis of T-cell AIRR-seq data (e.g., [9]). This homebrew method for IGH rearrangements uses multiplex PCR and can be scaled to very high cell inputs as described in [10]. DNA is more robust than RNA and has a simpler relationship to cell numbers (one template per cell) than RNA. For these reasons, bulk gDNA-based sequencing is typically the method of choice for clinical-grade assays to evaluate malignant clonal expansions [11], as well as the in-depth study of clones in different tissues to study clonal networks in the body [10]. The method shown uses long reads that are adequate for robust IGHV gene alignment and evaluation of SHM, but this method can also be performed with shorter reads, depending upon the sample type and DNA quality.

In this chapter, we also illustrate how to use pRESTO [12] and ImmuneDB [13] to analyze sequencing data generated following the wet bench protocol. In this dry bench analysis, we describe how to filter the raw read data, group highly similar rearrangements into clones using both the IGHV gene and CDR3 sequences, estimate clone size distributions, and track clones of interest in other samples.

### 2 Materials

### 2.1 Primers

All IG gene amplification primers are synthesized by Integrated DNA Technologies, and HPLC purification is recommended for sequences that are longer than 60 bp and any sequence that contains one or more "Ns" (random nucleotides). Dual indices are provided to distinguish clone identification from tracking primers (see Note 1).

1. Human (Hu) IGH amplification primers for clone identification:

NexteraR2-Hu-VH1-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGGCCTCAGTGAAGGTCTCCTGCAAG

NexteraR2-Hu-VH2-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGTCTGGTCCTACGCTGGTGAAACCC

NexteraR2-Hu-VH3-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCTGGGGGGTCCCTGAGACTCTCCTG

NexteraR2-Hu-VH4-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCTTCGGAGACCCTGTCCCTCACCTG

NexteraR2-Hu-VH5-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCGGGGAGTCTCTGAAGATCTCCTGT

NexteraR2-Hu-VH6-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGTCGCAGACCCTCTCACTCACCTGTG

NexteraR1-Hu-JHmix1: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGTACGTNCTTACCTGAGGAGACGGTGACC

NexteraR1-Hu-JHmix2: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCTGCNCTTACCTGAGGAGACGGTGACC

NexteraR1-Hu-JHmix3: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGAGNCTTACCTGAGGAGACGGTGACC

2. Hu IGH amplification for clone tracking:

These primers use dual ID barcodes to distinguish them from the identification sample amplicons (the bold font indicates the barcode sequences).

NexteraR2-Barcoded-Hu-VH1-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG**AGGCTATA**GGCCTCAGTGAAGGTCTCCTGCAAG

NexteraR2-Barcoded-Hu-VH2-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG**GCCTCTAT**GTCTGGTCCTACGCTGGTGAAACCC

NexteraR2-Barcoded-Hu-VH3-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG**AGGATAGG**CTGGGGGGTCCCTGAGACTCTCCTG

NexteraR2-Barcoded-Hu-VH4-FW1: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG**TCAGAGCC**CTTCGGAGACCCTGTCCCTCACCTG



### 3 Methods

The major steps of the wet bench procedure are outlined in Fig. 1.


Fig. 1 Workflow for IGH sequencing from bulk gDNA. (a) Starting from PBMCs, bone marrow aspirate, or formalin-fixed paraffin-embedded samples, gDNA is extracted from bulk populations. (b) Next, IGH gene rearrangements are amplified from gDNA using primer cocktails in FR1 and JH along with Illumina adapters. V = variable, D = diversity, and J = joining genes. (c) Amplicons from this first round of PCR are purified using AMPure beads and (d) subjected to second-round amplification using primers that include sample barcodes (see primers in Subheading 2.1 for DNA sequence information). (e) Sequencing libraries are subjected to further purification, size selection, quality control, and pooling prior to loading onto the sequencer

the expected lymphocyte yield is less than 50,000 cells and use a DNA LoBind tube. If the expected yield is more than 50,000 cells per population, sort the cells into sorting buffer, centrifuge the cells, remove the supernatant, and resuspend the cell pellet in a cell lysis buffer (add 300 μl cell lysis buffer for up to two million cells). DNA is extracted from whole blood, bone marrow, or sorted cells using the protocols in the Gentra Puregene (Qiagen) handbook, following the manufacturer's recommendations. Outlined below is the protocol (with notes) for 3 ml of whole blood.


9. Resuspend DNA in 100 μl of TE (low EDTA) buffer (10 mM Tris and 0.1 mM EDTA), and check the DNA quality using NanoDrop. The OD260/OD280 ratio should be close to 1.8. If the DNA concentration is <100 ng/μl using the NanoDrop instrument, repeat the DNA concentration measurement using Qubit HS DNA Kit for a more accurate measurement.

Before beginning, make sure that all of the workstations are clean, and perform all template amplification procedures in a separate pre-PCR area. Aliquot all primers (equimolar mixture of primers for both VH and JH primer mixes), PCR-grade water, and PCR master mix buffers before use. The PCR product that is amplified from gDNA is shown in Fig. 1b.

1. Use water and PCR master mix from Qiagen Multiplex PCR Kit, and prepare the PCR mix (see Notes 10–13):


2. Thermal cycling. If using plates, use microseal B adhesive seal. Perform a quick spin of the plate before loading onto the thermal cycler, and run the following program: First PCR program.


Stopping point: Amplified samples can be stored at 4 °C for up to 48 h.

3. Agarose gel electrophoresis of PCR products (see Note 14). Gel electrophoresis is performed to ensure that the first-round PCR has generated a sufficient quantity of amplicons of the correct length and that there is no evidence of contamination in the negative controls.

### 3.3 Template Amplification and Initial Quality Control

	- (a) Mix an equal volume of AMPure beads with amplicons (in this case, 20 μl of beads, see Note 15).
	- (b) Mix the beads and amplicons together by pipetting up and down 20 times. Incubate the mixture at room temperature for 1 min. Set the plate on the magnet for 5 min until the mixture is clear.
	- (c) Keep the PCR plate on the magnet, remove the supernatant, and discard.
	- (d) Wash the beads by adding 180 μl of fresh 85% ethanol, do not mix, incubate at room temperature for 30 s, and remove and discard the supernatant.
	- (e) Use P2/P10 extra-long tips to remove the residual ethanol from each well, and air dry at room temperature for up to 5 min. Note: Do not allow beads to air dry for more than 5 min.
	- (f) Remove the PCR plate from the magnet, add 40 μl of TE (low EDTA) buffer into each sample well. Mix by pipetting up and down ten times to resuspend the beads. Incubate at room temperature for 2 min.
	- (g) Return the PCR plate to the magnet, and incubate at room temperature for 5 min. With the plate on the magnet, transfer 38 μl of the eluates to a new 96-well PCR plate. Stopping point: At this step, the new plate with the purified first PCR amplicons can be sealed and stored at −20 °C for later use.

### 3.4 Second-Round PCR and Product Purification

In this section of the protocol, the bead-purified amplicons from the first step are amplified using primers that are tagged with Illumina barcodes. A schematic illustration of the PCR product is shown in Fig. 1d. All procedures for preparing the PCR mix are performed in the pre-PCR room, except for the addition of the first-round PCR amplicons, which is performed in a PCR hood in the post-PCR room. Aliquot all primers (Nextera XT index primers), PCR-grade water, and PCR master mix buffers before use.

1. Use water and PCR master mix from the Qiagen Multiplex PCR Kit, and prepare the PCR mix:


2. Run the second-round PCR program:


Stopping point: Amplified samples can be stored at 4 °C for up to 48 h.

	- (a) Add equal volumes (typically ~5 μl) of the individual sample amplicons (replicates) together into a "pooled library" for sequencing. Samples can be pooled together at this stage, because the amplicons have sample-specific barcodes.
	- (b) Prepare a 2% agarose gel, and load 5 μl of the second-round PCR amplification mixture. The expected amplicon size on the gel is ~510 bp, and this band should be present in the positive control sample. If the water or fibroblast controls show amplification products, the second-round PCR experiment needs to be rerun. Stopping point: The second-round PCR samples can be stored at −20 °C in a post-PCR freezer for later use (see Note 17).
	- (a) Run the pooled samples on a 2% agarose gel with a low-voltage setting (~60 V) to allow the amplicons to migrate slowly on the gel.
	- (b) After 3 h of gel running, cut out the band of the expected size (510 bp) under long wavelength UV light to minimize DNA damage. Weigh the gel slice in a 1.5 ml Eppendorf tube.
	- (c) Add 3 volumes of buffer QG to 1 volume of gel (100 mg gel corresponds to ~100 μl of liquid volume). The maximum amount of gel per spin column is 400 mg. Incubate at 50 °C for 10 min (invert the tube to help dissolve gel) or until the gel slice has dissolved completely.
	- (d) If the color of the mixture is orange or violet, add 10 μl of 3 M sodium acetate until the color turns yellow. Add 1 gel volume of isopropanol to the sample, and mix by inverting the tube ten times.
	- (e) Apply 750 μl of the gel-isopropanol mixture to a QIAquick spin column in the provided 2 ml collection tube, and centrifuge at 17,900 × g for 1 min.
	- (f) Discard the flow-through, and place the QIAquick column back into the same tube.
	- (g) Apply the rest of the mixture (if any is remaining) to the same column, and repeat steps 4e and 4f.
	- (h) Add 750 μl buffer PE to the QIAquick column, and centrifuge at 17,900 × g for 1 min to wash the column. Discard flow-through, and place the QIAquick column back into the same collection tube.
	- (i) Centrifuge the QIAquick column for 1 min to remove the residual wash buffer, and place the QIAquick column into a clean 1.5 ml Eppendorf tube.
	- (j) Add 50 μl buffer EB to the center of the QIAquick membrane, let the column stand for 2–3 min, and then centrifuge for 1 min. Stopping point: Gel-purified product (the eluate in the clean 1.5 ml Eppendorf tube) can be stored at −20 °C in the post-PCR freezer for later use.

### 3.5 Library Pooling, Purification, and Quantification


$$\frac{A\ \mu\text{l} \times 50\ \text{nM}}{34\ \text{samples}} = \frac{10\ \mu\text{l} \times 35\ \text{nM}}{46\ \text{samples}}$$

$$A = 5.17\ \mu\text{l}$$

The concentration of the final pooled library is determined by Qubit and calculated as molarity (see Note 19).

	- 2. Prepare 4 nM of the final pooled sequencing library by diluting the concentrated one with TE (low EDTA).
	- 3. Mix 5 μl of 0.2 N NaOH and 5 μl of 4 nM library by pipetting up and down 20 times in a 1.5 ml DNA LoBind tube. Denature at room temperature for 5 min.
	- 4. Add 990 μl prechilled HT1 (from the MiSeq Kit), and incubate on ice immediately. The final concentration for the denatured library is 20 pM.
	- 5. Prepare 20 pM of PhiX. Mix 2 μl of PhiX control with 3 μl of TE (low EDTA) in a 1.5 ml DNA LoBind tube by pipetting. Add 5 μl of freshly diluted 0.2 N NaOH, mix by pipetting up and down 20 times, and incubate at room temperature for 5 min. Next, add 990 μl prechilled HT1 (from the MiSeq Kit), and incubate on ice immediately (see Note 20).
	- 6. To spike in 10% PhiX into the final sequencing library, take 100 μl of the 20 pM denatured library out and discard, and add in 100 μl of 20 pM denatured PhiX. This will yield 20 pM of the final sequencing library with 10% PhiX (see Note 21). Load 600 μl of this library into the pre-thawed MiSeq cartridge (MiSeq® Reagent Kit v3, 2 × 300 cycles). The run takes 2.5 days to complete.

7. General sequencing run QC. For the MiSeq (2 × 300 cycle) v3 Kit, the optimal raw cluster density is 1200–1400 K/mm<sup>2</sup> (Illumina provides additional details on clustering density online). The percentage of reads for the entire run that have Q scores above 30 (Q30, 1 in 1000 base calls may be incorrect) should be at least 70%. Finally, the percentage of clusters passing filter (PF%) should be ≥ 80%. If a run does not pass all three of these thresholds, the sequencing should be repeated. Under passing conditions, each replicate has on average 100,000 to 300,000 valid reads (using pRESTO processing with Q30 filtering; see the following sections for data analysis).
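The pooling and dilution arithmetic in this section can be checked with a few lines of Python. This is a minimal sketch, not part of the published protocol: the function names are hypothetical, and it assumes that the pooling equation balances the amount of library contributed per sample and that the denatured library is brought to a total volume of 1000 μl (5 μl library, 5 μl NaOH, 990 μl HT1).

```python
def equimolar_volume(ref_volume_ul, ref_conc_nm, ref_samples, conc_nm, samples):
    """Volume (ul) of a second library that contributes the same amount
    of DNA per sample as the reference library."""
    per_sample = ref_volume_ul * ref_conc_nm / ref_samples
    return per_sample * samples / conc_nm


def diluted_conc_pm(stock_nm, stock_ul, total_ul):
    """Concentration in pM after diluting stock_ul of a stock (in nM)
    into a total volume of total_ul."""
    return stock_nm * 1000.0 * stock_ul / total_ul


# Pooling example from the text: library A (50 nM, 34 samples) balanced
# against 10 ul of a 35 nM library containing 46 samples.
volume_a = equimolar_volume(10, 35, 46, conc_nm=50, samples=34)
print(f"A = {volume_a:.2f} ul")

# Denaturation: 5 ul of 4 nM library + 5 ul NaOH + 990 ul HT1.
library_pm = diluted_conc_pm(4, 5, 5 + 5 + 990)
print(f"Denatured library = {library_pm:.0f} pM")

# PhiX spike-in: 100 ul of the 1000 ul library is replaced with PhiX.
phix_fraction = 100 / 1000
print(f"PhiX fraction = {phix_fraction:.0%}")
```

Working through the numbers this way makes it easier to adapt the volumes when the library concentrations or the number of samples per library differ from the example.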

### 3.7 Software Installation

Before processing raw sequencing data, analysis software must be installed as follows:


### 3.8 Raw Data Processing

Raw data from NGS platforms are generally output in a format providing base calls for each read along with a quality score for each base. Depending on the sequencing method, there are a number of different steps to transform and filter these data into a format that is readily available for further analyses. In general, if reads are paired, the matching 5′ and 3′ reads must be aligned to form full-length sequences. Specifically, each pair of reads is iteratively compared until the maximal number of overlapping nucleotides is found. Nucleotides in the overlapping segment that do not match are assigned the base from whichever read has a higher-quality score.
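The paired-read assembly described above can be illustrated with a simplified Python sketch. It is not the algorithm used by pRESTO, which scores candidate overlaps more carefully; the function names, the `min_overlap` and `max_error` parameters, and the toy reads are all assumptions made for illustration. The sketch tries the longest overlap first, accepts the first overlap whose mismatch rate is within the allowed error, and resolves mismatches in favor of the base with the higher quality score.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def assemble_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=8, max_error=0.1):
    """Merge a forward read with a reverse read that has already been
    reverse-complemented. Returns (sequence, qualities), or None if no
    acceptable overlap is found."""
    for overlap in range(min(len(fwd), len(rev)), min_overlap - 1, -1):
        start = len(fwd) - overlap
        mismatches = sum(a != b for a, b in zip(fwd[start:], rev[:overlap]))
        if mismatches / overlap > max_error:
            continue
        seq = list(fwd[:start])
        qual = list(fwd_qual[:start])
        for i in range(overlap):
            # Keep the base call with the higher quality score.
            if fwd_qual[start + i] >= rev_qual[i]:
                seq.append(fwd[start + i])
                qual.append(fwd_qual[start + i])
            else:
                seq.append(rev[i])
                qual.append(rev_qual[i])
        seq.extend(rev[overlap:])
        qual.extend(rev_qual[overlap:])
        return "".join(seq), qual
    return None


# Toy example: the reverse read carries one low-quality miscall in the overlap.
fwd = "AAACCCGGGTTTACGT"
fwd_qual = [30] * 16
fwd_qual[11] = 35
raw_rev = "ATGATGACGTTAACCC"  # as sequenced, i.e., on the opposite strand
raw_rev_qual = [30] * 16
raw_rev_qual[10] = 10

rev = reverse_complement(raw_rev)
rev_qual = raw_rev_qual[::-1]
merged, merged_qual = assemble_pair(fwd, fwd_qual, rev, rev_qual)
print(merged)
```

In practice the assembly should be left to a dedicated tool; the sketch is only meant to make the overlap search and the quality-based mismatch resolution concrete.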

Following this, short and low-quality sequences should be removed, as they do not provide sufficient information to make accurate gene calls. Then, primer sequences that were incorporated into the DNA/RNA templates should be masked so as not to skew later mutation analyses. Individual base calls with low confidence (generally a Phred score < 20 or < 30) should be masked to reduce their influence on downstream analyses. Finally, genes should be annotated with IgBLAST for downstream processing. The commands for this entire process, assuming paired input files from an Illumina-based sequencing platform and applying a Phred quality score filter of 30, are as follows:


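```
# Match R1 and R2 records so that mate pairs are in the same order
PairSeq.py -1 *R1*.fastq -2 *R2*.fastq
# Assemble each read pair into one full-length sequence
AssemblePairs.py align -1 *R1*_pair-pass.fastq \
 -2 *R2*_pair-pass.fastq \
 --coord illumina
# Remove reads with a low mean quality score
FilterSeq.py quality -s *assemble-pass.fastq
# Trim reads where the quality in a 20-nt window drops below 30
FilterSeq.py trimqual -s *quality-pass.fastq -q 30 --win 20
# Remove sequences shorter than 100 nt
FilterSeq.py length -s *trimqual-pass.fastq -n 100
# Mask base calls with a quality score below 30
FilterSeq.py maskqual -s *length-pass.fastq -q 30
# Remove sequences with more than 10 missing (masked) bases
FilterSeq.py missing -s *maskqual-pass.fastq -n 10
```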
3. Move the quality-controlled data into a new directory. The remaining steps of this method only use the final resulting files which will end in missing-pass.fastq. These files should now be moved to a location to mount into the ImmuneDB Docker container.

```
mkdir $HOME/immunedb_share/input
mv *missing-pass.fastq $HOME/immunedb_share/input
```
	- (a) Run the docker container. To begin an interactive session, run the following:

```
docker run -v $HOME/immunedb_share:/share \
```
One should see output similar to the following, after which a terminal prompt will be shown:

```
Moving MySQL to Volume
* Starting MariaDB database server mysqld [ OK ]
Setting up database
Starting webserver
```
(b) Run IgBLAST on the QC'd FASTQ files. In the Docker container, a helper script run\_igblast.sh can be used to annotate sequences. Reference genes are provided for humans and mice for IGH, IGL, IGK, TRA, and TRB. In this protocol, we will focus on human IGH. Run the following:

run\_igblast.sh human IGH /share/input /share/input

mkdir -p /share/sequences

mv /share/input/\*.fast[aq] /share/sequences

After this step, TSV files annotated in AIRR format [15] will be located in the Docker container at /share/input (which is also accessible at \$HOME/immunedb\_share/input on the host).

### 3.9 Importing Metadata and Sequence Data into ImmuneDB

	- (a) Create a template metadata file. Although a metadata file is simply a TSV which could be created manually, ImmuneDB provides a helper script to create a template as follows:

cd /share/input

immunedb\_metadata --use-filenames

(b) Add relevant metadata. With the command above, a metadata file with one row per file will be generated, and the sample name for each file will be set to the filename stripped of its extension.

On the host, open the metadata file in a spreadsheet editor. The headers included by default are required; file\_name and sample\_name will already be filled in from the previous step, but study\_name and subject must be filled in (see Note 22).
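For projects with many samples, the required columns can also be filled in programmatically instead of in a spreadsheet editor. The following is a minimal sketch using only the Python standard library. The column order of the template, the sample names, the study name, and the rule that the subject is the first underscore-separated token of the sample name are all hypothetical and should be adapted to the actual metadata file.

```python
import csv
import io


def fill_metadata(tsv_text, study_name, subject_for_sample):
    """Return the metadata TSV text with study_name and subject filled in.

    subject_for_sample is a function mapping a sample name to a subject ID.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row["study_name"] = study_name
        row["subject"] = subject_for_sample(row["sample_name"])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical template with the four required headers.
template = (
    "file_name\tstudy_name\tsample_name\tsubject\n"
    "D101_blood_rep1.tsv\t\tD101_blood_rep1\t\n"
    "D101_spleen_rep1.tsv\t\tD101_spleen_rep1\t\n"
)

filled = fill_metadata(template, "MyStudy", lambda name: name.split("_")[0])
print(filled)
```

To apply this to a real file, read its contents into a string, pass it to `fill_metadata`, and write the result back before running the import.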

	- (a) Create a database for the project. The first step is to create a database into which the AIRR-compliant sequencing data annotated by IgBLAST will be stored. For this method we will call the database my\_db, but it can be any valid name for a MySQL database (see Note 23).

immunedb\_admin create my\_db /share/configs

(b) Import the annotated data and trace duplicate sequences. The next commands import all the annotated sequences into the previously created database and collapse duplicate reads within and between samples. Counting duplicates is useful for downstream filtering and clone size estimation.

```
immunedb_import /share/configs/my_db.json airr \
 /root/germlines/igblast/human/IGHV.gapped.fasta \
 /root/germlines/igblast/human/IGHJ.gapped.fasta \
 /share/input \
 --trim-to 80
immunedb_collapse /share/configs/my_db.json
```
One important parameter in the previous commands is --trim-to. This masks the bases on the 5′ end of each read, up to the given position, with the ambiguity character N. This prevents the primer sequences, which are incorporated into the resulting reads, from being counted in downstream mutational analyses. The value of 80 was chosen for this chapter due to the use of framework 1 (FWR1) primers. If different primers are used, the IMGT position of the 3′ end of the primer sequence should be used instead.

1. Once the data are imported and collapsed, sequences likely originating from a common progenitor cell can be grouped into clones.

immunedb\_clones /share/configs/my\_db.json cluster

The default parameters used by ImmuneDB to specify clonally related sequences are the use of the same IGHV and IGHJ genes, the same CDR3 length, and at least 85% amino acid sequence similarity in the CDR3 (see Note 24).

2. Calculating statistics. To make downstream analyses more efficient, ImmuneDB pre-calculates a number of statistics about clones and samples (see Note 25).

immunedb\_clone\_stats /share/configs/my\_db.json

immunedb\_sample\_stats /share/configs/my\_db.json

3. Create lineage trees for each clone. Optionally, lineage trees can be constructed for each clone. Like clonal inference, this process has many parameters, and the following is for general use and may need to be tweaked depending on sequencing depth, error rates, and the underlying biological samples:

### 3.10 Clonal Inference from Sequencing Data and General Statistics

More details on clonal lineages can be found in Subheading 3.3 of the chapter "AIRR Community Guide to Repertoire Analysis."
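To make the default clonal grouping criteria concrete (same IGHV gene, same IGHJ gene, same CDR3 length, and at least 85% CDR3 amino acid similarity), the sketch below groups rearrangements with a simple single-linkage rule. It is an illustration only and not ImmuneDB's implementation; the function names, the tuple layout, and the example CDR3 sequences are assumptions.

```python
from collections import defaultdict


def aa_similarity(a, b):
    """Fraction of identical positions between two equal-length, non-empty CDR3s."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def group_clones(rearrangements, min_similarity=0.85):
    """Group (v_gene, j_gene, cdr3_aa) tuples into clones.

    Returns a list of clones, each a list of indices into rearrangements.
    """
    # Only rearrangements with the same V gene, J gene, and CDR3 length
    # are ever compared with each other.
    buckets = defaultdict(list)
    for idx, (v_gene, j_gene, cdr3) in enumerate(rearrangements):
        buckets[(v_gene, j_gene, len(cdr3))].append(idx)

    clones = []
    for members in buckets.values():
        bucket_clones = []
        for idx in members:
            cdr3 = rearrangements[idx][2]
            # Find every existing clone containing a sufficiently similar CDR3.
            matches = [
                clone
                for clone in bucket_clones
                if any(
                    aa_similarity(cdr3, rearrangements[m][2]) >= min_similarity
                    for m in clone
                )
            ]
            merged = [idx]
            for clone in matches:
                merged.extend(clone)
                bucket_clones.remove(clone)
            bucket_clones.append(merged)
        clones.extend(bucket_clones)
    return clones


example = [
    ("IGHV3-23", "IGHJ4", "CARDYYGSGSYFDYW"),
    ("IGHV3-23", "IGHJ4", "CARDYYGSGSYFDHW"),  # one substitution: same clone
    ("IGHV3-23", "IGHJ4", "CARGGWELLRYFDYW"),  # same V/J/length, dissimilar CDR3
    ("IGHV1-69", "IGHJ4", "CARDYYGSGSYFDYW"),  # different IGHV gene
]
clones = sorted(sorted(clone) for clone in group_clones(example))
print(clones)
```

Single linkage is one of several possible choices here; stricter linkage rules or comparison against a clone consensus would split or merge borderline sequences differently.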

### 3.11 Analysis of Clone Numbers and Size Distributions

1. Sample clone count (see Note 26). One can do a quick "back-of-the-envelope" calculation to estimate the maximal number of expected unique IGH rearrangements in a bulk gDNA sequencing experiment using the equation below [16] if the nanogram input is known:

$$\text{Max. number of rearrangements} = \frac{(\text{ng input}) \times (1000\ \text{pg/ng}) \times (1.4\ \text{rearrangements/cell})}{6.7\ \text{pg/cell}}$$

Or, equivalently, about 150 cells per nanogram of input DNA. These equations assume that 100% of the cells in the samples are the B or T cells of interest, that there is quantitative recovery of all possible rearrangements, and that each cell has an average of 1.4 rearrangements (due to some cells having more than one IGH or TRB rearrangement [17], see Note 27). Obtaining fewer or more clones than expected can reveal potential technical or analytical problems with the experiment or data analysis pipeline, respectively (see Note 28).

	- (a) Histogram of top-ranked clones. As shown in Fig. 2a, one can plot the copy number fraction of the 20 clones in a sample that have the highest copy numbers. Investigating the top copy number clones in datasets can highlight expanded clones as compared to the overall repertoire, giving insight into a range of different biological processes (see Note 30). In healthy individuals, expanded B-cell clones in the peripheral blood generally have copy numbers within the same order of magnitude as non-expanded clones (see Note 31).
	- (b) Dx index. One can compute the fraction of sequence copies that are occupied by the top x clones in a sequencing library: Dx is the fraction of total copies occupied by the top x clones. A common value of x is 20 [10] which, when looking at the copy number distribution, reveals whether there are one or more dominating clones.
	- (c) Clone rank plot. Unlike the top-ranked clone plot and Dx index, clone rank plots provide a snapshot of the clone size distribution in the entire repertoire. Clone rank plots achieve this by segregating clones by rank as shown in Fig. 2b. In such plots, each column represents a sample, or a pool of samples, and the height of each bar represents the proportion of copies in the given clonal range bracket. In this example, the red bars show the proportion of sequence copies in the top ten ranked clones. In oligoclonal repertoires, both the Dx index and the clone rank plot show that the top copy clones contain the majority of copies. In contrast, for polyclonal repertoires, clone rank plots can provide a nuanced view of clonal abundance by stratifying clones into categories based on their copy number distributions.

Fig. 2 Clone visualization scheme. All plots are illustrative. (a) Top clone plot. An example plot showing the size of the top clones as measured by copy number in two samples, one shown in blue and one in yellow. Each set of columns represents the clone of a given rank, and the y-axis shows the copy number frequency as a fraction of the entire sample. (b) Clone rank plot. An example of a clone rank plot for two samples. Each bar represents a sample; each color represents the copy number fraction for a bin of clones of a given range of ranks (sizes) with lighter blue indicating higher-ranked (larger) clones and darker blue representing lower-ranked (smaller) clones. A generally darker sample indicates that the majority of clones are not expanded, and a lighter sample indicates a more oligoclonal repertoire. (c) Rarefaction curves. Illustrative rarefaction curves for two hypothetical samples showing sufficient and insufficient sampling. The x-axis indicates number of clones, and the y-axis indicates the measured number of total (unique) clones. Curves in which the number of distinct clones continues to increase as the number of sampled clones increases indicate potential undersampling (blue), whereas curves that begin to plateau (black) indicate the sampled clones are becoming more representative of the true underlying clonal population. (d) Clonal string plots visualizing the degree of clonal overlap between three samples. Each row represents a clone and each column a sample (smp). The presence of a clone in a given sample is indicated by blue and its absence by gray. Only clones that overlap in two or more samples are shown. (e) Venn diagram. Three different hypothetical samples (demarcated by the blue, yellow, and black circles) from the same individual. Numbers indicate clone counts that are found uniquely in one, two, or three of the samples. (f) Clonal lineage. An inferred hypothetical lineage of clonally related sequences. Each blue node represents a unique sequence, and the yellow node represents the nearest germline reference sequence. The edge length between two nodes indicates the total number of accumulated mutations from the parent sequence to the child sequence
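The clone count estimate and the size-distribution summaries described in this section can be sketched in a few lines of Python. The function names and the toy copy numbers are hypothetical; the constants (1.4 rearrangements per cell, 6.7 pg of DNA per cell) are those given in the equation above, and Dx is computed here as the fraction of copies in the top x clones.

```python
def max_rearrangements(ng_input, rearrangements_per_cell=1.4, pg_per_cell=6.7):
    """Upper bound on the number of unique rearrangements in a gDNA input."""
    return ng_input * 1000.0 * rearrangements_per_cell / pg_per_cell


def top_clone_fractions(copy_numbers, n=20):
    """Copy number fractions of the n largest clones, largest first."""
    total = sum(copy_numbers)
    ranked = sorted(copy_numbers, reverse=True)
    return [copies / total for copies in ranked[:n]]


def dx_index(copy_numbers, x=20):
    """Fraction of all copies that belong to the top x clones."""
    return sum(top_clone_fractions(copy_numbers, x))


# 100 ng of input DNA
print(round(max_rearrangements(100)))

# Toy repertoire: three expanded clones and 200 single-copy clones
copies = [500, 200, 100] + [1] * 200
print(f"D20 = {dx_index(copies):.3f}")
```

The fractions returned by `top_clone_fractions` are the values that would be drawn as bars in a top clone plot, and binning the ranked fractions by rank range would give the segments of a clone rank plot.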

### 3.12 Clonal Overlap Analysis

Determining how many samples or replicates are necessary to sufficiently reveal the clonal landscape of the underlying immune repertoire is challenging. Undersampling a repertoire can lead to underpowered analyses and false biological conclusions (e.g., claiming lack of overlap), whereas oversampling can be expensive and time-consuming.

	- (a) Clone definitions for the evaluation of clonal overlap. Most frequently used are clonal annotations or shared CDR3 amino acid sequences. In ImmuneDB, for example, clones are annotated with a unique clone ID that can be scanned across all of the samples in a given subject, allowing for the construction of clonal networks across all of the different samples in an individual. Alternatively, one can trace the consensus CDR3 amino acid sequence of each clone through samples to determine overlap.

(b) The Jaccard index [19] is the cardinality of the intersection of two samples divided by the cardinality of the union of the same samples. Specifically, for two (potentially overlapping) sets of clones A and B, the Jaccard index J is calculated with

$$J = \frac{|A \cap B|}{|A \cup B|}$$

(c) Cosine similarity. The cosine similarity also gives an indication of overlap between samples. However, unlike the Jaccard index, it takes into account clone size rather than only presence or absence in samples. For each of the two samples to compare, a one-dimensional vector is constructed, the values of which indicate the size of each clone in copies. The order of clones must be the same in both vectors. Specifically, given two vectors of clone sizes from two samples, A and B, each of length n, the cosine similarity S is defined as

$$S = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$
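Both overlap measures are straightforward to compute once each sample is represented by its clone IDs and copy numbers. The sketch below is a minimal illustration using only the Python standard library; the function names and the two toy samples are hypothetical. Clones that are absent from a sample are given a size of zero so that both vectors follow the same clone order.

```python
import math


def jaccard_index(clones_a, clones_b):
    """Size of the intersection of two clone sets divided by the size of their union."""
    a, b = set(clones_a), set(clones_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_similarity(sizes_a, sizes_b):
    """Cosine similarity of two samples given as {clone ID: copies} dictionaries."""
    clone_ids = sorted(set(sizes_a) | set(sizes_b))
    vec_a = [sizes_a.get(clone, 0) for clone in clone_ids]
    vec_b = [sizes_b.get(clone, 0) for clone in clone_ids]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sample_1 = {"clone1": 10, "clone2": 5, "clone3": 1}
sample_2 = {"clone2": 5, "clone3": 10, "clone4": 2}

jaccard = jaccard_index(sample_1, sample_2)
cosine = cosine_similarity(sample_1, sample_2)
print(f"Jaccard index = {jaccard:.2f}, cosine similarity = {cosine:.3f}")
```

In this toy example half of the clones are shared, but the shared clones differ strongly in size between the samples, so the cosine similarity is noticeably lower than the Jaccard index.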


sequences but instead attempts to construct a tree which requires the minimum number of total mutations. Both have positives and negatives. For example, neighbor joining can create trees which are not optimal (e.g., mutations occurring multiple times or incorrectly grouping clades), but it is computationally more efficient than maximum parsimony. Maximum parsimony, however, guarantees some properties of the tree such as minimizing its height, but is computationally intractable to calculate for large clonal lineages.

### 4 Notes


method presented in this chapter uses Qubit for concentration measurement and uses 20 pM of the final library based on the Qubit calculation. Bioanalyzer and KAPA quantification may give different concentrations, and the optimal input library concentrations calculated based on these methods may differ.


sequencing such that only a few of the available rearrangements are being amplified, or a filtering procedure that results in an unacceptably large fraction of the data being removed or a clone collapsing procedure that groups unrelated sequences together into the same clones, under-calling the number of different clones. If, on the other hand, one obtains more clones than the predicted maximum number, there may be an issue with the computational pipeline in terms of how clones are defined. For example, if a very high level of sequence similarity is used on a sample enriched for memory B cells with high levels of SHM, clonally related sequences may be grouped falsely into separate clones.


or other factors). The term nondominant is used in case there are expanded clones with more than one amplifiable IGH rearrangement, for example, one productive and one nonproductive IGH gene rearrangement in the same cell.

32. If multiple replicates are not available for the dataset of interest, one can also computationally resample the dataset, mimicking the effect of multiple replicates [34].

### Acknowledgments

This work is supported by NIH research grants awarded to ELP (AI144288, AI106697, P30-AI0450080, P30-CA016520). US is supported by grants from Mercator Stiftung, the German Research Foundation (DFG 397650460), BMBF e:KID (01ZX1612A), and BMBF NoChro (FKZ 13GW0338B). The authors thank members of the AIRR Community Biological Resources Working Group and Diagnostics Working Group for helpful discussions and feedback on the manuscript.

ELP is the director of the Human Immunology Core facility at the University of Pennsylvania, which uses this protocol. She is also the former Chair of the AIRR Community, receives research funding from Roche Diagnostics and Janssen Pharmaceuticals for projects unrelated to the method presented in this chapter, and serves as a consultant or advisor for Roche Diagnostics, Enpicom, the Antibody Society, IEDB, and the American Autoimmune Related Diseases Association.

### References


rabbit lymphoid tissues. J Exp Med 122(5): 853–876. https://doi.org/10.1084/jem.122.5.853


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Bulk Sequencing from mRNA with UMI for Evaluation of B-Cell Isotype and Clonal Evolution: A Method by the AIRR Community

### Nidhi Gupta, Susanna Marquez, Cinque Soto, Elaine C. Chen, Magnolia L. Bostick, Ulrik Stervbo, and Andrew Farmer

### Abstract

During the course of an immune response to a virus such as influenza, B cells undergo activation, clonal expansion, isotype switching, and somatic hypermutation (SHM). Members of an antigen-experienced B-cell clone can have different sequence features including SHM in the immunoglobulin heavy-chain V (IGHV) gene and can use the same IGHV gene in combination with different constant regions or isotypes (e.g., IgM, IgG, IgA). To study these features of expanded clones in an immune response by AIRR-seq, we provide a bulk RNA-based sequencing experimental procedure with unique molecular identifiers (UMIs) and the accompanying bioinformatics analytical workflow.

Key words BCR, B cells, Repertoire, Bulk RNA, Sequencing, AIRR, Immunoglobulin, Bulk RNA sequencing, UMI, Heavy and light chain

### 1 Introduction

This protocol enables users to generate indexed libraries with full-length transcripts that are ready for sequencing on Illumina platforms (Fig. 1). It allows for the analysis of both immunoglobulin heavy (IGH) and kappa/lambda light-chain (IGK/IGL) gene rearrangements and has a sample input range from 10 ng to 1 μg of total RNA from peripheral blood mononuclear cells (PBMCs) or 1 to 100 ng of total RNA from purified B cells.

The protocol leverages SMART technology (switching mechanism at 5' end of RNA template) and employs a 5' RACE-like approach to capture complete V(D)J variable regions of BCR/IG transcripts. It also incorporates unique molecular identifiers (UMIs). First-strand cDNA synthesis is oligo-dT primed and

Nidhi Gupta and Susanna Marquez are shared first authors.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_19, © The Author(s) 2022

Fig. 1 Overview of the SMARTer human BCR procedure. cDNA is synthesized from RNA isolated from PBMCs or B cells, followed by two rounds of PCR and finally purified and pooled to prepare the libraries for sequencing

catalyzed by SMARTScribe™ reverse transcriptase (RT), which adds non-templated nucleotides at the 5' end of each mRNA template. The SMART UMI Oligo anneals to these non-templated nucleotides, serves as a template for incorporation of a PCR handle into the first-strand cDNA, and uniquely tags each cDNA molecule with a UMI (UMIs allow for the generation of consensus sequences during data analysis, thereby minimizing PCR and sequencing errors). Following reverse transcription, two rounds of PCR are performed to amplify cDNAs. To capture the entire V(D)J region, primers in these PCRs anneal to sequence added by the SMART UMI Oligo at the 5' end and the IG constant region(s) at the 3' end. The second PCR takes the product from the first PCR as a template and uses semi-nested primers to amplify the entire IG variable region and a small portion of the constant region (Fig. 2).

We also provide a computational workflow to analyze the sequencing data with the Immcantation framework (immcantation.org). The workflow covers preprocessing, isotype assignment, quality control and filtering, gene annotation, gene usage, population structure determination, and lineage reconstruction.

### 2 Materials

2.1 General Reagents

All components are available in SMARTer Human BCR IgG IgM / IgK/IgL Profiling Kit (Takara Bio, see Note 1).


Fig. 2 A schematic of dT-primed first-strand cDNA synthesis followed by two rounds of successive PCR for amplification of cDNA sequences. After post-PCR purification, size selection, and quality analysis, the library is ready for sequencing


### 2.2 Primers

2.2.1 Human BCR Indexing Primer Set HT for Illumina Sequences

Illumina indexes are incorporated into human BCR profiling libraries through both forward and reverse PCR primers. The corresponding Illumina indexes are listed below.



### 3 Methods


protocol.

Fig. 3 Bulk RNA sequencing with unique molecular identifiers (UMIs). (a) This protocol begins with a single-cell suspension that can be isolated from whole blood, peripheral blood mononuclear cells, or by purification using either magnetic beads or flow cytometry. Total RNA is extracted from the cell population(s) of interest. (b) cDNA is reverse transcribed from RNA, and unique molecular identifiers (UMIs) are included in the SMART Oligo, tagging each parental cDNA molecule. (c) In PCR1, IgG, IgM, IG kappa, and IG lambda chains (including both the variable and part of the constant region) are separately amplified (only the IGH are shown). (d) In PCR2, Illumina indices are added to generate the sequencing libraries. (e) After size selection, QC, and normalization of library input, libraries are sequenced using the Illumina platform


column as described in the following step. Do not centrifuge the lysate after addition of binding solution before loading it onto the column in order to avoid pelleting the precipitate.

	- (a) First wash: Add 200 μL buffer WB1 to the NucleoSpin® RNA Plus Column. Centrifuge for 15 s at 11,000 × g. Discard the flowthrough with the collection tube, and place the column into a new 2 mL collection tube.
	- (b) Second wash: Add 600 μL buffer WB2 to the NucleoSpin® RNA Plus Column. Centrifuge for 15 s at 11,000 × g. Discard flowthrough, and place the column back into the collection tube.
	- (c) Third wash: Add 250 μL buffer WB2 to the NucleoSpin® RNA Plus Column. Centrifuge for 2 min at 11,000 × g to dry the membrane completely. Place the column into a nuclease-free 1.5 mL collection tube.

### 3.3 First-Strand cDNA Synthesis

First-strand cDNA synthesis (from RNA) is primed by the dT Primer. Here we illustrate cDNA synthesis using the SMART UMI Oligo for template switching at the 5' end of the transcript.



a Control RNA is supplied at a concentration of 1 μg/μL. It should be thawed on ice and diluted serially in nuclease-free water



a Ensure the First-Strand Buffer is completely in solution. Vortex gently to remove any cloudiness before use


Stopping point: The tubes can be stored at 4 °C overnight.


Table 1 Cycling guidelines based on amount of starting material

a If the number of cycles generates an insufficient library for sequencing, repeat PCR2 with more cycles

### 3.4 First-Round Amplification

Semi-nested PCR amplifies the entire V(D)J region and a portion of the constant region of IG cDNA(s) and incorporates adapters and barcodes for Illumina sequencing platforms. Expression of different IG chains can vary significantly among B-cell populations. Thus, we recommend separately amplifying each chain of interest. Table 1 provides PCR cycling recommendations, but optimal parameters may vary for different sample types, input amounts, and thermal cyclers. We recommend trying a range of cycle numbers to determine the minimum number necessary to obtain the desired yield.

In the first round of PCR amplification, also referred to as PCR1, one performs separate IgG/IgM/IgK/IgL amplification. This PCR selectively amplifies full-length BCR V(D)J regions from first-strand cDNA. A portion of the first-strand cDNA is used for each amplification reaction. The hBCR PCR1 Universal Forward primer anneals to the 5' end of transcripts via the SMART UMI Oligo sequence. The hBCR PCR1 IgG/IgM/IgK/IgL reverse primers anneal to sequences in the constant regions of IG heavy- and light-chain cDNAs.

1. Thaw 5× PrimeSTAR GXL buffer, dNTP mix, primers, and nuclease-free water on ice. Gently vortex each reagent to mix and centrifuge briefly. Store on ice. Remove the PrimeSTAR GXL DNA polymerase from the freezer immediately before use, gently pipet to mix, centrifuge briefly, and store on ice.

2. Prepare a PCR1 Master Mix for each IgG/IgM/IgK/IgL chain of interest, by combining the following in the order shown, on ice. Gently vortex to mix and centrifuge briefly (see Note 4).



### 3.5 Second-Round PCR Amplification

In the second round of PCR amplification, termed PCR2, sequencing libraries are generated. PCR2 further amplifies the full-length IG V(D)J regions and adds Illumina indexes using a semi-nested approach. The hBCR PCR2 Universal Forward 1–12 primers add P7/i7 index sequences. The hBCR PCR2 IgG/IgM/IgK/IgL reverse 1–4 primers anneal to the constant region of the IG sequence and add P5/i5 index sequences (see Note 5).

shown, on ice. Gently vortex to mix and centrifuge briefly (see Note 6).



Stopping point: The tubes may be stored at 4 °C overnight.

### 3.6 Purification of Amplified Libraries

Here we illustrate amplified library purification using NucleoMag NGS clean-up and size select beads (see Note 7).

	- Stopping point: The tubes may be stored at 4 °C overnight.

### 3.7 Library Validation

To assess the success of library preparation, purification, and size selection, we recommend quantifying the libraries with a Qubit dsDNA HS Kit and evaluating the libraries' size distributions with an Agilent 2100 Bioanalyzer and the DNA 1000 Kit.

### 3.8 Pooling of Samples to Generate Libraries for Sequencing

Following library validation by Qubit and Bioanalyzer, the desired library pools should be prepared for the sequencing run. Prior to pooling, libraries must be carefully quantified. By combining the quantification obtained with the Qubit with the average library size determined by the Bioanalyzer, the concentration in ng/μL can be converted to nM. The following web tool is convenient for the conversion: http://www.molbiol.edu.ru/eng/scripts/01\_07.html. Alternatively, libraries can be quantified by qPCR using the Library Quantification Kit from Takara Bio.

Most Illumina sequencing workflows, including that of the MiSeq instrument that we recommend for this protocol, require libraries at a final concentration of 4 nM.
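As an illustration of the conversion from ng/μL to nM, the following Python sketch applies the commonly used approximation of 660 g/mol per base pair of double-stranded DNA. The constant, the function names, and the example values are assumptions introduced here for illustration and are not taken from the kit manual; the 20 μL final volume is likewise arbitrary.

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of dsDNA (assumption)

def ng_per_ul_to_nM(conc_ng_per_ul, mean_size_bp):
    """Convert a dsDNA concentration in ng/uL to nM for a given mean fragment size."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_size_bp)

def dilution_volumes(stock_nM, target_nM=4.0, final_volume_ul=20.0):
    """Return (library volume, diluent volume) in uL to reach target_nM."""
    if stock_nM < target_nM:
        raise ValueError("library is already below the target concentration")
    library_ul = target_nM * final_volume_ul / stock_nM
    return library_ul, final_volume_ul - library_ul

# Hypothetical library: 6.6 ng/uL by Qubit, 1000 bp mean size by Bioanalyzer
stock = ng_per_ul_to_nM(6.6, 1000)
library_ul, diluent_ul = dilution_volumes(stock)
print(round(stock, 2), round(library_ul, 2), round(diluent_ul, 2))
```

The same arithmetic can be repeated for every library before pooling, so that each one enters the pool at the same molarity.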

Prepare a pool of 4 nM as follows:


You should also plan to include a 10% PhiX control spike-in (PhiX Control v3, Illumina). The addition of the PhiX control is essential to increase the nucleotide diversity and achieve high-quality data generation (see Note 14).

Fig. 4 Validation of IG heavy- and IG light (kappa or lambda)-chain libraries from human spleen that were generated using the SMARTer Human BCR Profiling Kit. Purified and size-selected libraries were analyzed on an Agilent 2100 Bioanalyzer (Panels A–H). Panels A, C, E, and G show broad peaks between ~500 and 1200 bp and maximal peaks in the range of ~600–900 bp (typical results for a library generated from spleen RNA). No-RNA control (NRC) samples (Panels B, D, F, and H) show no library produced and a flat Bioanalyzer profile within the predicted amplicon range of 500–1200 bp

Sequencing should be performed on an Illumina MiSeq sequencer using the 600-cycle MiSeq Reagent Kit v3 with paired-end, 2 × 300 base pair reads. When relying on Qubit quantification, we recommend diluting the pooled denatured libraries to a final concentration of 12.5 pM to achieve optimal cluster density. If using qPCR for quantification, one may need to use a lower final concentration.

Fig. 5 SMARTer human BCR IgG IgM H/K/L profiling library structure. First 19 nt from Read2 can be trimmed off if UMI analysis is not performed

The complexity of the human IG repertoire varies from person to person. We generally recommend a minimum of 200,000 reads for IG heavy-chain libraries (IgG and IgM) and a minimum of 500,000 reads for IG light-chain libraries (IGK and IGL), each from an input of 10 ng PBMC RNA (or 1 ng B-cell RNA). For libraries generated from >10 ng PBMC RNA, a higher sequencing depth is recommended. However, the optimal conditions may vary for different sample types, sample masses, sample complexities, and desired outcomes. We recommend sequencing at a higher depth first and then downsampling to determine the optimal sequencing depth.
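One way to judge whether a given depth is sufficient is to downsample the reads and count how many distinct sequences are still observed. The Python sketch below is a simplified illustration of this idea, assuming that each read has already been reduced to a label such as a UMI or a clonotype; the function name and the toy data are hypothetical.

```python
import random

def unique_at_depths(read_labels, depths, seed=0):
    """Count distinct labels observed in random subsamples of the reads.

    Each subsample is a prefix of a single random shuffle, so it is drawn
    without replacement and the counts cannot decrease with depth.
    """
    rng = random.Random(seed)
    shuffled = list(read_labels)
    rng.shuffle(shuffled)
    counts = {}
    for depth in depths:
        if depth <= len(shuffled):
            counts[depth] = len(set(shuffled[:depth]))
    return counts

# Toy data: 100 reads from four labels of unequal abundance
reads = ["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["D"] * 5
print(unique_at_depths(reads, [10, 50, 100]))
```

If the number of distinct labels stops increasing well before the full depth is reached, additional sequencing is unlikely to reveal many new sequences for that library.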

As shown in Fig. 5, a human BCR profiling library contains a 12-nucleotide UMI that can be used to create consensus reads for sequences that share the same UMI, allowing correction of sequencing errors.

Upon completion of a sequencing run, data can be analyzed with Takara Bio Cogent NGS Immune Profiler Software or other software. In the following sections, we provide a workflow to analyze data with Immcantation, a suite that provides tools to perform preprocessing, population structure determination, and repertoire analysis. Immcantation is certified as compliant with AIRR Community software guidelines.


Fig. 6 Overview of data processing and analysis steps

identifier (UMI) consensus sequences, assembly of paired-end reads, and identification of duplicate sequences.

1. Remove PhiX

If spike-in PhiX was not removed by the sequencing facility, it is recommended [2] to filter out these reads.

2. Understand the Read Layout

It is important to have a good understanding of the read layout and know what region each read covers, where the primers and barcodes are located, and how long they are. In this example, R1 starts in the constant region of the rearranged sequence, and R2 starts upstream of the V region. See Fig. 5 for details on the read layout.

Primers from the vendor are not available. To identify isotypes, the consensus sequences of the constant region can be used as primers. They are available online from the protocols/Universal directory in the Immcantation repository (https://bitbucket.org/kleinstein/immcantation). These sequences were created by analyzing the first 30 nucleotides of the human constant region sequences available from IMGT.

3. Obtain the Software

Immcantation, with its dependencies, accessory scripts, and IgBLAST [4] and IMGT [3] reference germlines, is available as a Docker container on Docker Hub under immcantation/suite:x.x.x, where x.x.x stands for a release number. This protocol uses container release 4.3.0.

To start an interactive session inside the container and share local files in the current working directory with the /data folder in the container, use:

docker run -it -v \$(pwd):/data:z --workdir /data immcantation/suite:4.3.0 bash

Once inside the container, you can use the commands versions report and builds report to know the versions of the software installed.

If you type pwd, you should get the result /data, as expected after starting the container with --workdir /data. If you type ls, you should see the files that you have in the local directory from which you launched the container. From inside the container session, create the output directories presto and logs, and verify that these folders also become available locally on your computer:

```
mkdir presto
mkdir logs
```
4. Remove Low-Quality Sequences

To remove reads with a mean quality lower than 20, use the commands:

```
FilterSeq.py quality -s data/S5_R1.fastq -q 20 --nproc 8 \
FilterSeq.py quality -s data/S5_R2.fastq -q 20 --nproc 8 \
```
Output data files for the constant region reads will use the prefix CRR, and data files for the V region reads will use the prefix VRR.
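To make the filtering criterion concrete, the following Python sketch computes the mean Phred score of a FASTQ quality string and applies a threshold of 20. It is a simplified illustration and not the pRESTO implementation; it assumes the Phred+33 encoding used in current Illumina FASTQ files, and the function names are hypothetical.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality_string:
        return 0.0  # convention chosen here for an empty read
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)

def passes_quality(quality_string, min_mean=20):
    """True if the read's mean quality is at least min_mean."""
    return mean_phred(quality_string) >= min_mean

# 'I' encodes Q40 and '#' encodes Q2 in Phred+33
print(mean_phred("IIII"), passes_quality("II##"), passes_quality("I###"))
```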

5. Identify Primers and UMI

The next step is to remove or mask primers and to extract the UMI barcodes from the sequences while keeping this information as annotations in the FASTQ file headers. We recommend masking or removing primers so that sequencing errors in the primers do not affect downstream analyses. Here we remove barcodes and primers. The kit used to generate the data has a 12-nucleotide-long UMI (--start 12), followed by a linker sequence and a template switch (--len 7). With this command, pRESTO will extract the first 12 bp and annotate the FASTQ file header with the field BARCODE.

```
MaskPrimers.py extract -s presto/VRR_quality-pass.fastq \
 --start 12 --len 7 --barcode --bf BARCODE --mode cut \
 --log "logs/primers-vrr.log" \
 --outname VRR --outdir presto
```
An example output FASTQ header is as follows:

@M03355:144:000000000-CH2WP:1:1104:17528:20342 2:N:0:CGCTCATT+TATAGCCT|PRIMER=GTACGGG|BARCODE=TTGAAGTTATTC
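The effect of this step on a single read can be sketched in Python as follows. The sketch assumes the layout described above (12 nt of UMI followed by a 7-nt linker and template-switch region) and only mimics the resulting header; it is not the pRESTO implementation, and the function names are hypothetical.

```python
UMI_LENGTH = 12     # corresponds to --start 12 in the command above
LINKER_LENGTH = 7   # corresponds to --len 7

def extract_umi(header, sequence, quality):
    """Cut the UMI and linker from the read start and annotate the header."""
    barcode = sequence[:UMI_LENGTH]
    primer = sequence[UMI_LENGTH:UMI_LENGTH + LINKER_LENGTH]
    cut = UMI_LENGTH + LINKER_LENGTH
    new_header = f"{header}|PRIMER={primer}|BARCODE={barcode}"
    return new_header, sequence[cut:], quality[cut:]

def parse_annotations(header):
    """Split an annotated header into the read name and a field dictionary."""
    name, *fields = header.split("|")
    return name, dict(field.split("=", 1) for field in fields)

# Toy read: UMI + linker + 8 nt of insert, all at quality 'I'
seq = "TTGAAGTTATTC" + "GTACGGG" + "ACGTACGT"
header, trimmed_seq, trimmed_qual = extract_umi("@read1", seq, "I" * len(seq))
print(header)
print(parse_annotations(header)[1]["BARCODE"], trimmed_seq)
```

Storing the UMI in the header rather than in the sequence keeps it available for grouping reads in later steps without affecting alignment of the V(D)J region.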

6. Annotate R1 with Internal C-Region

Use the following command to annotate the CRR FASTQ file with a constant region call. This step requires a reference FASTA file containing the reverse-complement of short sequences from the front of CH-1. The C-region sequences (-p) are available in the container. For each sequence, MaskPrimers.py align will look for good matches (maximum error of --maxerror 0.3) to the reference sequences in the first 100 nucleotides (--maxlen 100). The matching and preceding region will be cut out from the sequence. The matching sequence name will be added as an annotation into the FASTQ header, under the field C\_CALL.

```
MaskPrimers.py align -s presto/CRR_quality-pass.fastq \
 -p /usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta \
 --maxlen 100 --maxerror 0.3 \
 --mode cut --skiprc --pf C_CALL \
 --log "logs/cregion.log" --outname "CRR" --nproc 8
```
An example output FASTQ header is as follows:

@M03355:144:000000000-CH2WP:1:2116:18550:17244 1:N:0:CGCTCATT+TATAGCCT|SEQORIENT=F|C\_CALL=IGHM

pRESTO tools save logs that can be converted into tabulated files with ParseLog.py. It is useful to use these files to generate diagnostic plots. To extract the information to make figures, inspect the C\_CALLs made, and identify the starting position of the match, use the command below. It will create a tabulated file with the fields ID, PRIMER, ERROR, and PRSTART, which can be used to create such plots.

ParseLog.py -l "logs/cregion.log" -f ID PRIMER ERROR PRSTART --outdir logs

Once the log has been converted to a tabulated file, it can be easily loaded into R, to count the different isotypes that have been identified:

cregion\_table <- read.delim("logs/cregion\_table.tab")

table(cregion\_table\$PRIMER)

Example output:

| IGHA | IGHD | IGHE | IGHG | IGHM | IGKC | IGLC1 | IGLC3 |
|---|---|---|---|---|---|---|---|
| 4 | 32 | 510 | 242981 | 220640 | 244015 | 250521 | 40647 |

These results match the expectations for this experimental protocol, because it uses a kit designed for IgM, IgG, IgK, and IgL. The isotype count can also be visualized (Fig. 7a):

Fig. 7 Evaluation of B-cell isotype and clonal evolution. (a) Count and position of isotype primers. (b) Gene usage by isotype. (c) Reconstructed lineage tree

```
# Plot isotype primer position
# color_palette must already be defined as a vector of colors, one per primer
library(ggplot2)
cprimer_plot <- ggplot(cregion_table, aes(x=PRSTART, color=PRIMER)) +
 geom_freqpoly(size = 0.5, binwidth=1) +
 scale_color_manual(values = color_palette) +
 theme_minimal() +
 labs(x = "C-region alignment start", y = "Count", colour = "Primer") +
 theme(legend.key.height = unit(0.1, "lines"),
 legend.key.width = unit(0.5, "lines"))
cprimer_plot
```
7. Copy Annotations Between Reads

Propagation of annotations between mate pairs is accomplished with PairSeq.py, which also removes unpaired reads and sorts mate pairs in both files. In this example, the UMI barcode is part of read VRR, and C\_CALL is part of read CRR. We need to transfer this information to be able to build consensus sequences for groups of reads sharing the same UMI and C\_CALL.

```
PairSeq.py -1 presto/VRR_primers-pass.fastq \
 -2 presto/CRR_primers-pass.fastq \
 --1f BARCODE --2f C_CALL --coord illumina
```
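The logic of this step can be sketched in Python on headers alone. The sketch assumes that mates share the part of the Illumina read name before the first space and that annotations follow the name separated by "|" characters, as in the example headers above; it is an illustration rather than the PairSeq.py implementation, and all function names are hypothetical.

```python
def split_header(header):
    """Split a header into the read name and its annotation fields."""
    name, *fields = header.split("|")
    return name, dict(field.split("=", 1) for field in fields)

def illumina_coord(header):
    """Return the part of the read name shared by both mates."""
    name, _ = split_header(header)
    return name.split(" ")[0]

def join_header(name, annotations):
    return "|".join([name] + [f"{k}={v}" for k, v in annotations.items()])

def pair_headers(headers_1, headers_2, fields_1=("BARCODE",), fields_2=("C_CALL",)):
    """Keep only paired reads and copy the requested fields between mates."""
    mates = {illumina_coord(h): h for h in headers_2}
    paired = []
    for header_1 in headers_1:
        header_2 = mates.get(illumina_coord(header_1))
        if header_2 is None:
            continue  # a read without a mate is dropped
        name_1, ann_1 = split_header(header_1)
        name_2, ann_2 = split_header(header_2)
        to_2 = {k: ann_1[k] for k in fields_1 if k in ann_1}
        to_1 = {k: ann_2[k] for k in fields_2 if k in ann_2}
        ann_1.update(to_1)
        ann_2.update(to_2)
        paired.append((join_header(name_1, ann_1), join_header(name_2, ann_2)))
    return paired

# Toy headers: read2 has no mate in the CRR file and is therefore removed
vrr = ["@read1 2:N:0:AAAA|PRIMER=GTACGGG|BARCODE=TTGAAGTTATTC",
       "@read2 2:N:0:AAAA|PRIMER=GTACGGG|BARCODE=ACGTACGTACGT"]
crr = ["@read1 1:N:0:AAAA|SEQORIENT=F|C_CALL=IGHM"]
for pair in pair_headers(vrr, crr):
    print(pair)
```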
8. Generation of UMI Consensus Sequences

If UMIs are available, it is possible to correct sequencing errors while maintaining true mutations introduced by SHM. Reads sharing a UMI barcode originated from the same RNA molecule. Ideally, if the primers used are different enough, and the UMIs have enough diversity, each UMI will represent one mRNA molecule, and each mRNA molecule will be represented by one UMI. BuildConsensus.py can then be used to generate a consensus sequence for a set of aligned reads sharing the same UMI. Finding more than one primer in a UMI group suggests that sequences may not be aligned, as we expect reads originating from the same mRNA molecule to be amplified with the same primer. If the multiplex pool contains similar primers, they could be incorporated into the same UMI group during amplification, and the reads will have variations in the start positions. This situation can be mitigated by first aligning the reads.
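The principle of UMI-based error correction can be illustrated with a deliberately simplified Python sketch that groups reads by UMI and takes a per-position majority vote. It assumes that the reads within a group are already aligned and of equal length, and it ignores base qualities, which the actual BuildConsensus.py tool takes into account; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def group_by_umi(reads):
    """Group (umi, sequence) tuples into {umi: [sequences]}."""
    groups = defaultdict(list)
    for umi, sequence in reads:
        groups[umi].append(sequence)
    return groups

def majority_consensus(sequences):
    """Per-position majority vote over aligned, equal-length sequences.

    Ties are resolved in favor of the base seen first in the column.
    """
    consensus = []
    for column in zip(*sequences):
        base, _ = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Toy data: the third read of UMI AAA carries a single error (G>T)
reads = [("AAA", "ACGT"), ("AAA", "ACGT"), ("AAA", "ACTT"), ("CCC", "GGGG")]
consensus = {umi: majority_consensus(seqs)
             for umi, seqs in group_by_umi(reads).items()}
print(consensus)
```

A true somatic mutation is present in every read derived from the same mRNA molecule and therefore survives the vote, whereas an error introduced in a single read is outvoted.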

(a) Multiple Align UID Read Groups

If the reads are not aligned, a correction strategy is to use MUSCLE [5] and AlignSets.py to perform a multiple alignment of each UMI read group, before generating the consensus sequence in the next step.

AlignSets.py muscle -s "presto/VRR\_primers-pass\_pair-pass.fastq" --exec /usr/local/bin/muscle --nproc 8 --log "logs/align-vrr.log" --outname "VRR"

AlignSets.py muscle -s "presto/CRR\_primers-pass\_pair-pass.fastq" --exec /usr/local/bin/muscle --nproc 8 --log "logs/align-crr.log" --outname "CRR"

(b) Build the Consensus Sequence

BuildConsensus.py groups sequences sharing the same barcode and builds a consensus sequence for each group. If the average mismatch rate within a UMI group is larger than 0.1 (--maxerror 0.1), the group is discarded. Sequences with the same barcode originated from the same mRNA molecule, so they should also have the same isotype. --pf C\_CALL and --prcons 0.6 are used to require that at least 60% of the reads in a UMI group have the same C\_CALL.

```
BuildConsensus.py -s presto/CRR_align-pass.fastq \
 --bf BARCODE --pf C_CALL --prcons 0.6 \
 -n 1 -q 0 --maxerror 0.1 --maxgap 0.5 \
 --nproc 8 --log "logs/consensus-crr.log" \
 --outdir presto --outname "CRR"
```
Example output:

@TGTTGGTTGGGT|CONSCOUNT=5|PRCONS=IGHM|PRFREQ=0.8333333333333334

CONSCOUNT shows the number of sequences that contributed to build the consensus. In the example above, the consensus isotype (PRCONS) is IGHM, with a frequency of 0.83. In the starting UMI group, there were six sequences, and one of them was an IGKC. This sequence was not used to build the consensus.
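The relationship between these header fields can be reproduced with a short Python sketch that tallies the C\_CALL annotations of one UMI group. It is an illustration of the arithmetic only, not the pRESTO implementation; the function name and the returned dictionary are hypothetical.

```python
from collections import Counter

def consensus_annotation(calls, min_freq=0.6):
    """Return the majority call of a UMI group, or None below min_freq."""
    top_call, top_count = Counter(calls).most_common(1)[0]
    freq = top_count / len(calls)
    if freq < min_freq:
        return None  # no call reaches the required frequency
    # Only reads carrying the majority call contribute to the consensus
    return {"PRCONS": top_call, "PRFREQ": freq, "CONSCOUNT": top_count}

# UMI group from the example above: five IGHM reads and one IGKC read
group = ["IGHM"] * 5 + ["IGKC"]
print(consensus_annotation(group))
```

With a threshold of 0.6, a group split evenly between two isotypes would not receive a consensus call.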

The same process needs to be repeated for the other reads:

```
BuildConsensus.py -s presto/VRR_align-pass.fastq \
 --bf BARCODE --pf C_CALL --prcons 0.6 \
 -n 1 -q 0 --maxerror 0.1 --maxgap 0.5 \
 --nproc 8 --log "logs/consensus-vrr.log" \
 --outdir presto --outname "VRR"
```
9. Synchronize Reads

This step puts pairs of reads in the same order.

```
PairSeq.py -1 "presto/VRR_consensus-pass.fastq" -2 \
 "presto/CRR_consensus-pass.fastq" \
 --coord presto
```
10. Assemble Pairs

Consensus sequences are paired in two steps, starting with joining overlapping mate pairs. For read pairs failing this step, the tool proceeds to perform a reference guided alignment, using ungapped V-segment reference sequences to properly space nonoverlapping reads.

```
AssemblePairs.py sequential -1 "presto/VRR_consensus-pass_pair-pass.fastq" \
\
```
Example output:

@ACTAGGGTTCAT|CONSCOUNT=4,4|PRCONS=IGHM

PRCONS is the consensus C\_CALL from the CRR file.

11. Mask Low-Quality Positions

Positions with a low consensus quality can be masked with Ns.

```
FilterSeq.py maskqual -s presto/S5_assemble-pass.fastq -q 30 --nproc 8 \
```
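Masking can be pictured with the following Python sketch, which replaces every base whose quality is below the threshold with N. It assumes Phred+33 quality encoding and is a simplified illustration, not the pRESTO implementation; the function name is hypothetical.

```python
def mask_low_quality(sequence, quality, min_quality=30, offset=33):
    """Replace bases with a Phred score below min_quality by N."""
    return "".join(
        base if ord(char) - offset >= min_quality else "N"
        for base, char in zip(sequence, quality)
    )

# 'I' encodes Q40, '?' encodes Q30, and '#' encodes Q2 in Phred+33
print(mask_low_quality("ACGT", "I#I?"))
```

Masking keeps the read length and the position of every other base unchanged, so a masked sequence can still be aligned to germline genes.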
12. Track the Number of Sequences that Contributed to the Consensus

It is important to know the number of unique sequences that contributed to build the consensus, as this information will be used in a later step.

```
ParseHeaders.py collapse -s presto/S5-MQ_maskqual-pass.fastq
mv "presto/S5-final_reheader.fastq" "presto/S5-final_total.fastq"
```
13. Collapse Duplicates

The goal is to remove duplicated sequences so that the repertoire retains one representative sequence per cell. The arguments "-n 0 --inner" determine how N and gap characters are handled. In this example, we allow 0 ambiguous characters, ignoring any continuous N or gap characters that occur at either end of the sequence. "--uf" specifies the fields that are used to define groups of unique sequences. "--cf CONSCOUNT" requests that the field CONSCOUNT be copied, and "--act sum" sums it, so that the final unique sequence has a CONSCOUNT equal to the sum of the CONSCOUNT values of the collapsed sequences.

```
CollapseSeq.py -s "presto/S5-final_total.fastq" -n 0 \
 --uf PRCONS --cf CONSCOUNT --act sum --inner \
 --keepmiss --outname "S5-final"
```
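The bookkeeping performed during collapsing can be sketched in Python as follows. The sketch only merges exact duplicates that share the same PRCONS value and does not reproduce the handling of N or gap characters; the record layout and the function name are assumptions made for illustration.

```python
def collapse_duplicates(records):
    """Merge identical sequences with the same PRCONS.

    CONSCOUNT is summed over the merged records, and DUPCOUNT stores the
    number of records that were merged.
    """
    collapsed = {}
    for record in records:
        key = (record["sequence"], record["PRCONS"])
        if key not in collapsed:
            collapsed[key] = {**record, "CONSCOUNT": 0, "DUPCOUNT": 0}
        collapsed[key]["CONSCOUNT"] += record["CONSCOUNT"]
        collapsed[key]["DUPCOUNT"] += 1
    return list(collapsed.values())

# Toy records: two IGHM duplicates and one IGHG sequence
records = [
    {"sequence": "ACGT", "PRCONS": "IGHM", "CONSCOUNT": 2},
    {"sequence": "ACGT", "PRCONS": "IGHM", "CONSCOUNT": 3},
    {"sequence": "ACGT", "PRCONS": "IGHG", "CONSCOUNT": 1},
]
unique = collapse_duplicates(records)
at_least_two = [r for r in unique if r["CONSCOUNT"] >= 2]
print(unique)
print(len(at_least_two))
```

The last two lines of the sketch show how the summed CONSCOUNT can be used to keep only sequences supported by at least two reads.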
14. Subset to Sequences Seen at Least Twice

We recommend filtering the data to focus the analysis on sequences with at least two contributing reads. A sequence with a CONSCOUNT of 1 was generated from a UMI group containing a single read, so consensus building could not correct any sequencing errors in it.

```
SplitSeq.py group -s presto/S5-final_collapse-unique.fastq \
 -f CONSCOUNT --num 2
```
15. Explore the Logs

All pRESTO tools provide the option to generate detailed logs that can be used to generate diagnostic plots. The log files can be converted to tabulated text files with ParseLog.py. The tabulated text files can be loaded into R or Python to generate plots.

(a) Obtain Tabulated Data

The output files are parsed to generate tables of data for the repertoire.

```
ParseHeaders.py table -s "presto/S5-final_total.fastq" \
ParseHeaders.py table -s "presto/S5-final_collapse-unique.fastq" \
ParseHeaders.py table -s "presto/S5-final_collapse-unique_atleast-2.fastq" \
 -f ID PRCONS CONSCOUNT DUPCOUNT --outname "final-unique-atleast2" \
```
To see a summary of the final isotype assignments:

```
log <- read.delim("logs/final-unique-atleast2_headers.tab")
table(log$PRCONS)
```
| IGHE | IGHG | IGHM | IGKC | IGLC1 | IGLC3 |
|---|---|---|---|---|---|
| 1 | 48485 | 50955 | 37578 | 46879 | 7609 |
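The same summary can be obtained in Python with the standard library. The sketch below reads a tab-delimited table and counts the values of one column; the in-memory example table stands in for a real headers file, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter

def count_column(handle, column):
    """Count the values of one column in a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)

# Toy table with the same columns as the headers table created above
example = io.StringIO(
    "ID\tPRCONS\tCONSCOUNT\tDUPCOUNT\n"
    "seq1\tIGHM\t4\t1\n"
    "seq2\tIGHG\t2\t1\n"
    "seq3\tIGHM\t7\t2\n"
)
print(count_column(example, "PRCONS"))
```

For a file on disk, the handle would be obtained with open() on the path of the tabulated file instead of io.StringIO.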

(b) Process the Log Files Generated at Each Step

Log files are also parsed into tabulated files.

```
ParseLog.py -l "logs/primers-vrr.log" -f ID BARCODE ERROR \
 --outdir logs
ParseLog.py -l "logs/consensus-vrr.log" "logs/consensus-crr.log" \
 -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT PRFREQ ERROR \
 --outdir logs
ParseLog.py -l "logs/assemble.log" \
 -f ID REFID LENGTH OVERLAP GAP ERROR PVALUE EVALUE1 EVALUE2 IDENTITY FIELDS1 FIELDS2 \
 --outdir logs
ParseLog.py -l "logs/maskqual.log" -f ID MASKED \
 --outdir logs
```
### 3.11 Gene Annotation

Raw sequences which have passed general quality control filters should then be annotated with gene information: V, D, and J genes for IGH sequences and only V and J genes for IGK/IGL sequences. The IgBLAST executable and the reference database are available in the Immcantation container.

1. Convert FASTQ to FASTA

IgBLAST takes a FASTA file as input. The FASTQ files obtained at the end of the raw data processing section need to be converted to FASTA format. Simultaneously, rename PRCONS to C\_CALL.

ParseHeaders.py rename -s presto/S5-final\_collapse-unique\_atleast-2.fastq --fasta -f PRCONS -k C\_CALL

2. Run IgBLAST

The wrapper tool AssignGenes.py, from Change-O [6], uses IgBLAST, and a reference database created with germlines from IMGT, to make V(D)J allele calls.

mkdir changeo

AssignGenes.py igblast -s presto/S5-final\_collapse-unique\_atleast-2\_reheader.fasta --organism human --loci ig -b /usr/local/share/igblast --format blast --nproc 8 --outdir changeo --outname "S5"

3. Data Standardization

IgBLAST's results need to be converted into an AIRR-formatted file (https://immcantation.readthedocs.io/en/stable/datastandards.html) suitable for downstream analysis.

```
MakeDb.py igblast -s presto/S5-final_collapse-unique_atleast-2_reheader.fasta \
 -i changeo/S5_igblast.fmt7 \
```
Some sequences don't pass this MakeDb step with these settings. This could be because a junction could not be identified, there are Ns in the junction, there is a stop codon, or the reads are partial, among other possible reasons.

### 3.12 Quality Control After Gene Assignment

Once sequences have been annotated with allele calls, and the aligned rearranged sequence is available, further collapsing of duplicates and removal of low-quality sequences is possible. Here we demonstrate how to perform some common additional QC steps using R and Immcantation tools (alakazam [6]). The goal is to keep sequences with at least 200 informative positions, with coherent gene and locus calls, and with a limited number of ambiguous nucleotides. It is also common to focus the analysis on productive sequences. Here we will keep productive sequences and will remove sequences with a junction length that is not a multiple of three. Finally, chimeric reads will be identified and removed.

1. Identify Short Sequences.

```
library(airr)
library(alakazam)
library(stringi)
library(dplyr)
airr <- read_rearrangement("changeo/S5_db-pass.tsv")
# Min length 200 nt
long_seq <- stri_count(airr[['sequence_alignment']],regex="[^-.N]") >= 200
```
2. Identify Reads with Coherent Gene, Primer, and Isotype Calls

The goal is to remove sequences with incoherent gene and isotype calls. For example, a sequence that has a heavy-chain V gene assigned but an IG light-chain constant region will be removed.

```
# Keep reads with coherent gene, primer and isotype calls
same_locus <- getLocus(airr[['v_call']]) == airr[['locus']] &
 getLocus(airr[['c_call']]) == airr[['locus']]
```
3. Identify Reads with an Acceptable Number of Ambiguous Nucleotides.

```
# Max 10% N
num_n <- stri_count(airr[['sequence_alignment']],fixed="N")
len <- stri_count(airr[['sequence_alignment']],regex="[^-.]")
low_n <- num_n/len <= 0.10
```
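The ambiguity filter divides the number of N characters by the number of non-gap positions and keeps reads at or below 10%. As a minimal sketch of that arithmetic (in Python rather than R, with hypothetical function names), and assuming that an all-gap string should be reported as 0.0 instead of a missing value:

```python
def ambiguous_fraction(aligned_seq: str) -> float:
    """Fraction of N among non-gap positions (0.0 for an all-gap string)."""
    non_gap = [base for base in aligned_seq if base not in "-."]
    if not non_gap:
        return 0.0
    return non_gap.count("N") / len(non_gap)


def has_few_ambiguous(aligned_seq: str, max_fraction: float = 0.10) -> bool:
    """True when at most max_fraction of the non-gap positions are N."""
    return ambiguous_fraction(aligned_seq) <= max_fraction


# One N in ten non-gap positions is exactly at the 10% limit and is kept.
print(has_few_ambiguous("ACGTACGTAN....."))
```

Gap characters are excluded from the denominator so that heavily gapped alignments are not favored.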
### 4. Identify Productive Sequences.

```
prod <- airr[['productive']]
```
5. Identify Sequences with Junction Length Multiple of Three.

```
m3 <- airr[['junction_length']] %% 3 == 0
```

### 6. Filter and Save.

```
filter_pass <- long_seq &
 same_locus &
 low_n &
 prod &
 m3
write_rearrangement(airr[filter_pass,], file="changeo/S5_filter-pass.tsv")
```
### 7. Reconstruct Germline Sequences

Identify the V(D)J germline sequences from which each of the sequences is derived. These germlines will be used to analyze mutation patterns in a sliding window to identify chimeric sequences.

```
CreateGermlines.py -d changeo/S5_filter-pass.tsv \
 -g dmask --format airr
```
### 8. Identify and Remove Chimeric Sequences

Chimeric sequences can be identified by analyzing their mutation frequencies. The function slideWindowDb, from shazam [6], identifies which sequences in the repertoire contain excessive mutations in a given length of consecutive nucleotides (a "window") when compared to their respective germline sequence.

```
library(airr)
library(shazam)
airr <- read_rearrangement("changeo/S5_filter-pass_germ-pass.tsv")
is_chimeric <- slideWindowDb(
 airr,
 sequenceColumn = "sequence_alignment",
 germlineColumn = "germline_alignment_d_mask",
 mutThresh=6,
 windowSize=10
)
table(is_chimeric)
airr <- airr[!is_chimeric,]
```
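The sliding-window idea can be illustrated independently of shazam. The Python sketch below is a simplified stand-in, not the slideWindowDb implementation: it assumes that a position counts as mutated when sequence and germline differ and neither carries a gap or N, and that a sequence is flagged when any window of `window_size` consecutive positions holds at least `mut_thresh` mutations. All names are hypothetical.

```python
def mutation_flags(seq: str, germline: str) -> list:
    """1 where both positions are unambiguous bases and differ, else 0."""
    skip = set("-.N")
    return [
        int(s != g and s not in skip and g not in skip)
        for s, g in zip(seq, germline)
    ]


def has_dense_window(seq: str, germline: str,
                     mut_thresh: int = 6, window_size: int = 10) -> bool:
    """True if any window of window_size positions has >= mut_thresh mutations."""
    flags = mutation_flags(seq, germline)
    if len(flags) < window_size:
        return False
    count = sum(flags[:window_size])
    if count >= mut_thresh:
        return True
    # Slide the window one position at a time, updating the running count.
    for i in range(window_size, len(flags)):
        count += flags[i] - flags[i - window_size]
        if count >= mut_thresh:
            return True
    return False


germline = "A" * 30
clustered = "A" * 5 + "C" * 6 + "A" * 19   # six adjacent mutations
scattered = "CAAAA" * 6                    # six mutations, five positions apart
print(has_dense_window(clustered, germline), has_dense_window(scattered, germline))
```

Both toy sequences carry six mutations, but only the clustered one is flagged, which is the pattern expected when a read switches template partway through.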
### 9. Collapse Duplicates

Once the sequences in the repertoire are aligned following the IMGT scheme, further collapsing of duplicate sequences can be done with the function collapseDuplicates.

```
library(dplyr)
num_fields <- c("consensus_count", "duplicate_count")
# Data come from one sample, so there is no need to add
# sample identifier groups
collapse_groups <- c("v_gene",
 "j_gene",
 "junction_length",
 "c_call",
 "productive")
airr <- airr %>%
 mutate(v_gene=getGene(v_call),
 j_gene=getGene(j_call)) %>%
 group_by(.dots=collapse_groups) %>%
 do(collapseDuplicates(.,
 id = "sequence_id",
 seq = "sequence_alignment",
 text_fields = NULL,
 num_fields = num_fields,
 seq_fields = NULL,
 add_count = TRUE,
 ignore = c("N", "-", ".", "?"),
 sep = ",",
 dry = FALSE,
 verbose = FALSE
 )) %>%
 ungroup() %>%
 select(-v_gene, -j_gene)
```
### 3.13 Identify Clonally Related Sequences

The goal is to partition sequences into clonal lineages. Each clonal lineage is a group of sequences derived from the same original cell. There are several methods to identify clonal lineages (see Subheading 3.9.2: Identification of B-Cell Clones in the chapter "AIRR Community Guide to Repertoire Analysis"). Here, we first group by V gene, J gene, and junction length. Then we compare the junctions and apply a threshold to separate sequences into clonal lineages.

1. Calculate the Distance-to-Nearest Distribution

Hierarchical clustering requires a measure of distance between pairs of sequences and a choice of linkage to define the distance between groups of sequences. The result is a hierarchy, and a threshold is needed to cut the tree into clonal groups.

```
# Subset to heavy chain sequences
airr_heavy <- airr %>%
 filter(locus == "IGH")
# Group by V gene, J gene and junction length, and calculate the distance
# to the nearest sequence in the group
# nproc: number of cores; params$nproc assumes an R Markdown parameter named nproc
airr_heavy <- distToNearest(airr_heavy, sequenceColumn="junction",
 vCallColumn="v_call", jCallColumn="j_call",
 model="ham", first=FALSE, normalize="len",
 nproc=params$nproc)
write_rearrangement(airr_heavy, file="changeo/S5_heavy_collapse-pass.tsv")
```
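To make the distance calculation concrete, the following Python sketch computes a length-normalized Hamming distance between junctions and, for each sequence, the smallest non-zero distance to another sequence in the same V gene, J gene, and junction length group. It is a simplified illustration under those assumptions, not the shazam implementation; the record layout (dictionaries with `v_gene`, `j_gene`, and `junction` keys) and the function names are hypothetical.

```python
from collections import defaultdict


def normalized_hamming(a: str, b: str) -> float:
    """Hamming distance between equal-length strings, divided by the length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dist_to_nearest(records: list) -> list:
    """Smallest non-zero distance to a group member, or None if there is none."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = (rec["v_gene"], rec["j_gene"], len(rec["junction"]))
        groups[key].append(idx)

    nearest = [None] * len(records)
    for members in groups.values():
        for i in members:
            dists = [
                normalized_hamming(records[i]["junction"], records[j]["junction"])
                for j in members if j != i
            ]
            non_zero = [d for d in dists if d > 0]
            nearest[i] = min(non_zero) if non_zero else None
    return nearest


toy = [
    {"v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "AAAAAAAAAA"},
    {"v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "AAAAAAAAAC"},
    {"v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "CCCCCCCCCC"},
    {"v_gene": "IGHV3-23", "j_gene": "IGHJ4", "junction": "AAAAAAAAAA"},
]
print(dist_to_nearest(toy))
```

Sequences that are alone in their group receive no distance, which is why they do not contribute to the distribution analyzed in the next step.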
2. Find a Threshold

It is possible to determine a threshold by analyzing the distribution of the distances. The distribution is usually bimodal. The first mode represents sequences that have a close relative. The second mode is representative of sequences without clonal relatives. The goal is to select a threshold that separates the two modes.

```
threshold <- findThreshold(airr_heavy[['dist_nearest']],
method="density")
plot(threshold, binwidth=0.02, silent=FALSE)
# Round to two decimal places (assumed precision)
clone_threshold <- round(threshold@threshold, 2)
clone_threshold
```
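### 3. Identify Clonally Related Sequences

Once a threshold is selected, it is applied to identify groups of related sequences. DefineClones.py takes the distance threshold through its --dist argument, which does not appear in the command as printed here: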

```
DefineClones.py -d changeo/S5_heavy_collapse-pass.tsv --model ham \
 --outname S5 --outdir changeo --format airr --log "logs/clone.log"
```
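How a distance threshold turns into clonal groups can be sketched as follows. This Python example assumes single-linkage clustering within one V gene, J gene, and junction length group, with two junctions linked when their normalized Hamming distance is at or below the threshold. It illustrates the principle only and is not the DefineClones.py code; the function names and toy junctions are hypothetical.

```python
def normalized_hamming(a: str, b: str) -> float:
    """Hamming distance between equal-length strings, divided by the length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_clones(junctions: list, threshold: float) -> list:
    """Assign a clone label to each junction using single linkage."""
    parent = list(range(len(junctions)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(junctions)):
        for j in range(i + 1, len(junctions)):
            if normalized_hamming(junctions[i], junctions[j]) <= threshold:
                parent[find(j)] = find(i)

    # Number the clones in order of first appearance, starting at 1.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(junctions))]


toy_junctions = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC", "CCCCCCCCCC"]
print(single_linkage_clones(toy_junctions, threshold=0.15))
```

With single linkage the first and third junctions end up in the same clone although they differ at 20% of positions, because each is within the threshold of the second junction.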
### 4. Reconstruct Clonal Germline

The next step is to identify the V(D)J germline sequences from which each of the observed sequences is derived. These germlines are used as the reference to analyze mutations.

```
CreateGermlines.py -d changeo/S5_clone-pass.tsv \
 -g dmask --format airr --cloned --outname S5-airr
```
### 3.14 Gene Usage by Isotype

When isotype information is available, it is possible to investigate biases in gene usage at the isotype level.

```
library(alakazam)
library(airr)
library(dplyr)
airr <- read_rearrangement("changeo/S5-airr_germ-pass.tsv")
# Gene usage by Isotype for one sample with only heavy chain data
v_usage_isotype <- countGenes(airr, "v_call", group="c_call", fill=T)
most_used_v <- v_usage_isotype %>%
 filter(c_call != "IGHE") %>%
 group_by(c_call) %>%
 slice_max(., seq_freq, n=1)
# Plot the most used V gene(s)
# color_palette: named vector of isotype colors, assumed to be defined
# earlier in the workflow
library(ggplot2)
library(scales)
gene_usage_plot <- ggplot(v_usage_isotype %>%
 filter(c_call != "IGHE" & gene %in% most_used_v[['gene']]),
 aes(x=gene,y=seq_freq, color=c_call)) +
 scale_color_manual(values=color_palette) +
 scale_y_continuous(labels=percent) +
 geom_point(size=2) + theme_minimal() +
 xlab("Gene") + ylab("Frequency") +
 guides(color=guide_legend(title="Isotype"))
gene_usage_plot
```
The gene usage by isotype can be visualized in Fig. 7b.
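The quantity plotted is, for each isotype, the fraction of sequences assigned to each V gene. The Python sketch below reproduces that calculation on a few toy records. It assumes that the gene name is taken from the first call with the allele suffix removed and that frequencies are computed within each isotype; the record layout and function names are hypothetical and are not part of alakazam.

```python
from collections import Counter, defaultdict


def gene_of(call: str) -> str:
    """Gene name of the first call, e.g. 'IGHV1-2*02,IGHV1-2*04' -> 'IGHV1-2'."""
    return call.split(",")[0].split("*")[0]


def usage_by_isotype(records: list) -> dict:
    """Map each isotype to {gene: fraction of that isotype's sequences}."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["c_call"]][gene_of(rec["v_call"])] += 1
    return {
        isotype: {gene: n / sum(genes.values()) for gene, n in genes.items()}
        for isotype, genes in counts.items()
    }


def most_used(usage: dict) -> dict:
    """Most frequent gene per isotype."""
    return {iso: max(freqs, key=freqs.get) for iso, freqs in usage.items()}


toy = [
    {"v_call": "IGHV1-2*02,IGHV1-2*04", "c_call": "IGHM"},
    {"v_call": "IGHV1-2*02", "c_call": "IGHM"},
    {"v_call": "IGHV1-2*04", "c_call": "IGHM"},
    {"v_call": "IGHV3-23*01", "c_call": "IGHM"},
    {"v_call": "IGHV3-23*01", "c_call": "IGHG"},
]
usage = usage_by_isotype(toy)
print(usage, most_used(usage))
```

Because frequencies are normalized within each isotype, isotypes with few sequences can show large frequencies from very small counts, which is worth keeping in mind when reading the plot.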

### 3.15 Clonal Lineage Tree Analysis

Dowser [7] provides tools for building and visualizing IG lineage trees using multiple methods and implements statistical tests for discrete trait analysis of B-cell migration, differentiation, and isotype switching.

1. Format

First, data must be formatted into a data table of AIRR clone objects. The formatClones function will change non-nucleotide characters to N characters, collapse sequences that are either identical or differ only by ambiguous characters, and remove uninformative sequence sites in which all sequences have N characters.

```
library(dowser)
clones <- formatClones(airr, traits=c("c_call"),
 num_fields=c("duplicate_count"), columns=c("d_call"),
 minseq=10)
```

### 2. Build the Trees

There are several lineage reconstruction methods implemented in dowser. Maximum parsimony trees (topologies that minimize the number of mutations needed along the tree) can be built with the getTrees function.

```
# build maximum parsimony trees
clones <- getTrees(clones)
```
### 3. Visualize

The function plotTrees makes plotting lineages easy. Branch lengths by default represent the number of mutations per site between nodes. It is also possible to show numerical or categorical information associated with the tree tips, such as the duplicate count or the isotype.

```
# Plot the trees. Save them in a list of plots.
# Use tip metadata: c_call and duplicate_count
tree_plots <- plotTrees(clones, tips="c_call",
 tipsize="duplicate_count",
 tip_palette=c(color_palette, "Germline"="#000000"))
```
Example output is given in Fig. 7c.

### 4 Notes


### Acknowledgments

The authors would like to thank Eline Luning Prak for assistance with manuscript content and formatting of text and Chaim Schramm, Johannes Truck, and Wenwen Xiang for constructive criticism of the manuscript. Conflict of interest: NG and AF are employees at Takara Bio, Inc., San Jose, CA, USA, that produces the kit described in this protocol.

### References


for human and mouse immunoglobulin and T cell receptor genes. Nucleic Acids Res 33: D256–D261. https://doi.org/10.1093/nar/gki010



# Single-Cell Analysis and Tracking of Antigen-Specific T Cells: Integrating Paired Chain AIRR-Seq and Transcriptome Sequencing: A Method by the AIRR Community

### Nidhi Gupta, Ida Lindeman, Susanne Reinhardt, Encarnita Mariotti-Ferrandiz, Kevin Mujangi-Ebeka, Kristen Martins-Taylor, and Anne Eugster

### Abstract

Single-cell adaptive immune receptor repertoire sequencing (scAIRR-seq) offers the possibility to access the nucleotide sequences of paired receptor chains from T-cell receptors (TCR) or B-cell receptors (BCR). Here we describe two protocols and the downstream bioinformatic approaches that facilitate the integrated analysis of paired T-cell receptor (TR) alpha/beta (TRA/TRB) AIRR-seq, RNA sequencing (RNAseq), immunophenotyping, and antigen-binding information. To illustrate the methodologies with a use case, we describe how to identify, characterize, and track SARS-CoV-2-specific T cells over multiple time points following infection with the virus. The first method allows the analysis of pools of memory CD8<sup>+</sup> cells, identifying expansions and contractions of clones of interest. The second method allows the study of rare or antigen-specific cells and allows studying their changes over time.

Key words Single-cell sequencing, TR gene, IG gene, Rearrangement, Transcriptome, 10x Genomics, SMART-seq, Multi-omic analysis

### 1 Introduction

Single-cell adaptive immune receptor repertoire sequencing (scAIRR-seq) aims at describing the sequences of T-cell receptor (TR) or immunoglobulin (IG) rearrangements at the single-cell level. scAIRR-seq has been used since the mid-1990s [1, 2] and has seen rapidly increasing adoption by the scientific community over the last few years [3]. This has been facilitated by a plethora of protocols, commercial kits, and platforms as well as by the associated software tools developed in the last decade as discussed in detail in the two AIRR Community commentary chapters in this volume. Workflows for scAIRR-seq distinguish themselves from bulk methods by several features: First, they preserve the chain-pairing information of the complete IG/TR, which is critical for the experimental reconstruction and measurement of receptor reactivities. Second, they allow an unbiased view of clonal expansion, as observed sequences can be attributed to and normalized by the individual cell from which they originate. Third, using the individual cell as a common reference point, they allow the integration with other single-cell resolution data, such as transcriptome, cell-surface phenotype, and antigen specificities. All of this, however, comes at a reduced throughput and lower sensitivity for the detection of rare clones when compared to bulk sequencing.

Nidhi Gupta, Ida Lindeman, and Susanne Reinhardt are shared first authors.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_20, © The Author(s) 2022

The vast majority of currently used scAIRR-seq methods, including the two presented here (Fig. 1), maintain the compartmentalization of the cells throughout the process, either physically or via barcoding [4]. The 10x Genomics Chromium is a microfluidic-based platform, which allows the encapsulation and barcoding of 3000 to 10,000 cells at a time. The Chromium Next GEM Single Cell V(D)J Reagent Kits described here allow the generation of three libraries (Fig. 2): (1) full-length, paired AIRR sequences, (2) the cell transcriptome (derived from all polyadenylated transcripts), and (3) feature barcodes linked to the surface protein expression and antigen specificity (e.g., CITE-seq) as well as barcoding of libraries for multiplexing (e.g., hash-tagging). Single-cell SMART-seq is a method to collect single-cell AIRR-seq and gene expression data from cells which are sorted into 96-well plates (Fig. 3 and Fig. 4), allowing the analysis of rare cells. Single-cell SMART-seq is based on the SMART® (switching mechanism at 5′ end of RNA template) technology. The SMART-Seq Single Cell Kit used to generate mRNA-seq libraries is particularly useful for the analysis of cells with very low RNA content, such as PBMCs.

In both methods, AIRR-seq as well as transcriptome sequences are obtained from RNA, making use of the template-switching activity of reverse transcriptase to enrich for full-length cDNAs and to add defined PCR adapters directly to both ends of the first-strand cDNA. This ensures that the final cDNA libraries contain the 5′ end of the mRNA and maintain a true representation of the original mRNA transcripts. These factors are critical for AIRR-seq and transcriptome sequencing.

A use case for these two methods is to identify, characterize, and track SARS-CoV-2- or other virus-specific T cells over multiple time points following infection, remission, or vaccination. The Chromium Next GEM Single Cell V(D)J method can be used to broadly analyze thousands of memory CD8<sup>+</sup> cells from several time points of an individual, before and after the immunizing event, in a multiplexed manner (through hash-tagging). This allows identifying expansions and contractions of clones of interest (through AIRR-seq), phenotyping them (through transcriptome analysis and feature barcode analysis), and correlating them to disease status. Clones or cells of interest can be defined through their activation state or their antigen specificity and are isolated by flow cytometry after surface marker or multimer staining, respectively. Clones or cells of interest can be activated CD8<sup>+</sup> CD25<sup>+</sup> CD137<sup>+</sup> T cells [5] from several time points during and after a viral infection, or cells stimulated with an antigen of interest. Single-cell SMART-seq of these often rare clones after isolation gives access to their AIRR data that can then be matched to the data obtained from Chromium Next GEM Single Cell V(D)J. Here we provide protocols and detailed information for the generation, processing, and analysis of scAIRR-seq and associated data produced with the two platforms described (Fig. 5).

Fig. 1 Schematic illustrating the main characteristics of single-cell paired-chain AIRR-seq. (a) A blood sample is processed by Ficoll gradient centrifugation to obtain PBMCs. (b) PBMCs are stained and cells of interest sorted by FACS. As described in Subheading 3.1, using the 10x Genomics fluidics system (panel c), cells are processed for transcript barcoding. After encapsulation in a droplet, a "GEM" is created (panel d). Gel bead primers containing a 10x barcode, a UMI, and a template switch oligo (TSO) bind to the transcripts after cell lysis (panel e). Gel bead primers also capture the cell surface feature barcodes. Barcoded transcripts and feature barcodes are then reverse transcribed, and through size selection and enrichment, a library containing amplified AIRR (panel f), a library containing whole cell transcriptome, and one containing feature barcodes are prepared through the sequential additions of primers containing the P5 and P7 sites required for sequencing. As described in Subheading 3.2, cells are deposited into a plate (panel g). Here reverse transcription, cDNA amplification, and preparation of a library containing amplified AIRR and of a library containing whole cell transcriptome take place through the sequential additions of primers that include the P5 and P7 sites required for sequencing by the SMART-seq method (panel h)

Fig. 2 Overview of the main steps of the Chromium Next GEM Single-Cell procedure: The creation of droplets (left), in which the RNA is captured and barcoded, is followed by breaking the GEMs, the amplification of cDNA, the fractionation that allows separation of cDNA from feature barcodes from cellular cDNA, and finally by the preparation of the three libraries

Fig. 3 Overview of main steps for SMART-seq scTCR procedure: Stimulated cells are sorted into PCR plates, followed by cDNA synthesis, two rounds of PCR and purification, and are finally pooled to prepare the library for sequencing

Fig. 4 Schematic of the technology in the SMART-Seq Single-Cell Kit. Non-templated nucleotides (indicated by Xs) added by the SMARTScribe II reverse transcriptase (RT) hybridize to the SMART-Seq single-cell template-switching oligonucleotide (SMART-Seq sc TSO), which provides a new template for the RT. The SMART adapters used for amplification during PCR, added by the oligo(dT) primer (3′ SMART-Seq CDS Primer II A) and the TSO, are indicated in green. Chemical modifications to block ligation (if using a ligation-based library preparation method) are present on some primers (indicated by black stars)

### 2 Materials

### 2.1 10x Genomics Chromium Next GEM Single-Cell V(D)J Kit


cDNA Synthesis and Amplification

Fig. 5 Overview of the main steps of the analysis of single-cell AIRR-seq data. Libraries created with the 10x Genomics technology (upper panel) are processed using the Cell Ranger software. In brief, sequencing libraries are demultiplexed before TR sequences are extracted and annotated. The quality of each library is assessed, and TR sequences may be combined with transcriptional and feature libraries for an in-depth integrated analysis. For libraries created by plate-based sequencing technologies such as SMART-seq (lower panel), TR sequences are computationally reconstructed with TraCeR. Low-quality cells or potential doublets may be filtered out, before clonally related cells are identified and visualized in clonal networks


10x Genomics Kits and Reagents (10X Genomics, Unless Mentioned)

	- 13. Chromium Single Cell 5' Feature Barcode Library Kit, 16 rxn.

Other Supplies


Bead Purifications

27. NucleoMag NGS cleanup and size select (see Note 2; Takara Bio) or the AMPure XP PCR Purification Kit (Beckman Coulter).

For cDNA and Illumina Library Quantification and Preparation


Cell Preparation

34. Benzonase (10 U/ml).

### 2.2 Single-Cell SMART-Seq

The same equipment and supplies are used as for 10x Chromium, except for the following:

General Lab Equipment

1. 96-well PCR chiller rack, such as IsoFreeze PCR Rack, or 96-well aluminum block.

Sample Preparation


Bead Purifications


Sequencing Library Generation


cDNA Synthesis (Takara Bio Unless Otherwise Specified)


Nextera Library Preparation (Illumina Unless Otherwise Mentioned)

32. Amplicon Tagment Mix (ATM).


Nextera Indices


### 2.3 10x Genomics Data Processing and Analysis

2.4 Single-Cell SMART-Seq Data Processing and Analysis


### 3 Methods

### 3.1 10x Genomics Chromium Next GEM Single-Cell V(D)J Kit

3.1.1 Coat Tubes for Cell Sort and Count Cells

Before starting, please refer to considerations regarding the kits used (see Note 3), sample multiplexing (see Note 4), and surface protein detection (see Note 5).

	- 2. After staining and washing (see Note 6), sort cells into the tube (see Note 7), and use 1–2 μl from the sample to verify the cell quality and number (see Note 8) under a light microscope. Proceed to loading the chip taking into account the time required for the sort (see Note 9).

### 3.1.2 Load Next GEM Chip G

To avoid contamination, this section should be carried out on a separate bench dedicated to RNA/cell work.


Stopping point: At this point the samples can be stored at 4 °C for up to 72 h or at −20 °C for up to a week. If samples were frozen, keep them at room temperature for 10 min before continuing. The aqueous phase will look translucent (rather than clear).

### 3.1.3 Post GEM Cleanup and cDNA Amplification

To avoid contamination, this section should be carried out on a separate bench dedicated to RNA work.

### 3.1.4 cDNA and Feature Barcode Amplification

	- 2. Add 65 μl of amplification mix to each tube containing 35 μl GEM-RT product. Mix and centrifuge.
	- 3. Perform amplification as follows (lid temperature: 105 °C, reaction volume 100 μl).

98 °C, 45 s; 14 cycles of (98 °C, 20 s; 68 °C, 30 s; 72 °C, 1 min); 72 °C, 1 min; 4 °C, hold. The number of cycles depends on cell size and the number of cells recovered.

Stopping point: At this point the samples can be stored at 4 °C for up to 72 h.

### 3.1.5 Feature Barcode and cDNA Fractionation by Size Selection

In this section the amplified feature barcode fraction is separated from the amplified cDNA by size selection, so that both fractions can be further processed separately. These steps should be carried out on a bench dedicated to cDNA work.

1. Add 60 μl (0.6×) of resuspended SPRIselect beads to the amplification tube. Mix well, pulse-spin the tube, and incubate for 5 min at room temperature. Place the tube on a 10x magnet (high position) to separate beads from supernatant.


Stopping point: At this point the samples can be stored at 4 °C for up to 72 h or at −20 °C for up to 4 weeks.

### 3.1.6 Library Construction

The following steps describe the preparation of three types of libraries: (a) a feature barcode library, which yields information on cell surface proteins (features) or hash-tags and is made from the purified feature barcode fraction; (b) AIRR libraries (TR and/or IG), which require target enrichment and are made from the purified cDNA fraction; and (c) 5′ gene expression libraries, which are made from the purified cDNA fraction.

Feature Barcode Library Construction by Index-PCR and Purification

The following steps should be carried out on a separate, post-PCR-dedicated bench.

	- 2. Add 5 μl feature barcode sample fraction, mix, pulse-spin, and start the following PCR program (lid temperature: 105 °C): 20 °C, hold; 98 °C, 45 s; eight cycles of (98 °C, 20 s; 54 °C, 30 s; 72 °C, 20 s); 72 °C, 1 min; 4 °C, hold.


Stopping point: At this point the samples can be stored at 4 °C for up to 72 h or at −20 °C for long-term storage.

	- 2. Perform amplification as follows (lid temperature: 105 °C, reaction volume 100 μl): 98 °C, 45 s; x\* cycles of (98 °C, 20 s; 67 °C, 30 s; 72 °C, 1 min); 72 °C, 1 min; 4 °C, hold. \*Six cycles for IG and ten cycles for TR.

Stopping point: At this point the samples can be stored at 4 °C for up to 72 h.

The following steps should be carried out on a bench dedicated to amplified cDNA.


Target Enrichment for AIRR Libraries

	- 20 °C, hold; 98 °C, 45 s; six cycles of (98 °C, 20 s; 67 °C, 30 s; 72 °C, 1 min); 72 °C, 1 min; 4 °C, hold.

Stopping point: At this point the samples can be stored at 4 °C for up to 72 h.


5′ Gene Expression and AIRR Library Construction: Fragmentation, Adaptor Ligation, and Library Amplification (See Note 13)

18. Quantify samples using the FA NGS Standard Sensitivity in a region of 200–5500 bp. Alternatively, an Agilent Bioanalyzer High-Sensitivity chip can be used.

Stopping point: At this point the samples can be stored at −20 °C for up to 1 week (or 4 °C for up to 72 h).


1 min; 4 °C, hold. \*Cycles for cDNA depend on the input (1–25 ng cDNA: 14–16 cycles; 26–50 ng cDNA: 10–14 cycles); AIRR: 8 cycles.

Stopping point: At this point the samples can be stored at 4 °C for up to 72 h.


### 3.2 Single-Cell SMART-Seq

### 3.2.1 Cell Sorting and cDNA Synthesis

Due to the sensitivity of these protocols, cells should be collected under clean-room conditions to avoid contamination. The whole process of cDNA synthesis should be carried out in a PCR clean workstation under clean-room conditions.

Buffer Preparations

	- 2. Seal the plate/tube strips with an aluminum foil seal or PCR strip caps. Ensure the plate/tube strips are sealed firmly to minimize any evaporation.
	- 3. Immediately after sorting the cells and sealing the plate, spin briefly to collect the cells at the bottom of each well in the PSS.
	- 4. Place the plate on dry ice to flash-freeze the sorted cells (see Note 22).

Fig. 6 Fragment analysis of the final libraries. Top: example of the distribution of a cDNA library. Middle: example of the distribution of a feature barcode library. Bottom: example of the distribution of an AIRR library

Preparing Controls

See below for guidelines on setting up your positive and negative controls alongside your cell samples.

	- 2. Place the samples on ice, and add any necessary remaining reagents, including 1 μl of 3′ SMART-Seq CDS Primer II A (see Note 24). Mix well by gently vortexing, and then spin the tube(s) briefly to collect the contents at the bottom of the tube (see Note 25).
	- 3. Immediately incubate the tubes at 72 °C in a preheated, hot-lid thermal cycler for 3 min.
	- 4. Prepare RT master mix while the samples are incubating. Prepare enough for all the reactions, plus 10% of the total reaction mix volume, by mixing at room temperature 422.4 μl SMART-Seq sc First-Strand Buffer, 105.6 μl SMART-Seq sc TSO, 52.8 μl RNase inhibitor (40 U/μl), and 211.2 μl SMARTScribe II reverse transcriptase to a total volume of 792 μl for 96 wells (includes 10% overage). Add the SMARTScribe II reverse transcriptase just prior to use (in step 7 of this section).
	- 5. Immediately after the 3-min incubation at 72 °C, place the samples on ice for at least 2 min (but no more than 10 min).
	- 6. Preheat the thermal cycler to 42 °C.
	- 7. Add the SMARTScribe II reverse transcriptase to the RT master mix. Mix well by gently vortexing, and then spin the tube briefly in a mini-centrifuge to collect the contents at the bottom of the tube.
	- 8. Add 7.5 μl of the RT master mix to each sample. Mix the contents of the tubes by gently vortexing, and spin briefly to collect the contents at the bottom of the tubes.
	- 9. Place the tubes in a thermal cycler with a heated lid, preheated to 42 °C. Run the following program: 42 °C, 180 min; 70 °C, 10 s; 4 °C, hold.

Stopping point: The tubes can be stored at 4 °C overnight.
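The master mix volumes in step 4 follow from scaling per-reaction volumes to 96 wells with 10% overage. The Python sketch below shows that arithmetic so the mix can be rescaled for a different number of wells. The per-reaction volumes are not stated in the text; they are derived here by dividing the listed totals by 105.6 (96 wells × 1.1) and should be checked against the kit manual. The function name is hypothetical.

```python
# Per-reaction volumes in μl, derived from the 96-well totals in step 4
# (an assumption; verify against the kit manual before use).
PER_REACTION_UL = {
    "SMART-Seq sc First-Strand Buffer": 4.0,
    "SMART-Seq sc TSO": 1.0,
    "RNase inhibitor (40 U/μl)": 0.5,
    "SMARTScribe II reverse transcriptase": 2.0,
}


def rt_master_mix(n_reactions: int, overage: float = 0.10) -> dict:
    """Volume of each reagent (μl) for n_reactions plus fractional overage."""
    scale = n_reactions * (1 + overage)
    return {
        reagent: round(volume * scale, 1)
        for reagent, volume in PER_REACTION_UL.items()
    }


mix = rt_master_mix(96)
for reagent, volume in mix.items():
    print(f"{reagent}: {volume} μl")
print(f"Total: {round(sum(mix.values()), 1)} μl")
```

The per-reaction volumes sum to 7.5 μl, which matches the volume of RT master mix added to each sample in step 8.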

### 3.2.2 cDNA Amplification by LD PCR

The PCR primers amplify the cDNA by priming to the sequences introduced by the 3′ SMART-Seq CDS Primer II A and the SMART-Seq sc TSO.


Remove the SeqAmp DNA polymerase from the freezer, gently mix the tube without vortexing, and add to the master mix just before use. Mix the master mix well by vortexing gently, and spin the tube briefly to collect the contents at the bottom of the tube.


Stopping point: The tubes may be stored at 4 °C overnight.

### 3.2.3 Purification of Amplified cDNA

	- 2. If you are performing purification with the Thermo Fisher Magnetic Stand-96 (recommended if processing 48–96 samples), cDNA samples need to be transferred to a 96-well V-bottom plate. Distribute 40 μl of beads (see Note 28) to each well of the 96-well V-bottom plate, and then use a multichannel pipette to transfer the cDNA. Pipette the entire volume up and down at least ten times to mix thoroughly. Proceed to step 3 of this section.


3.2.4 Validation Using the Agilent 2100 Bioanalyzer

3.2.5 Library Preparation for Next-Generation Sequencing

Dilute and Prepare cDNA for Tagmentation

The following sections describe a modified Illumina Nextera XT DNA library preparation protocol that has been fully validated to work with the SMART-Seq Single-Cell Kit. The reaction size has been reduced to a quarter volume of what is recommended by Illumina.

	- 2. Warm Tagment DNA Buffer and NT Buffer to room temperature. Visually inspect NT Buffer to ensure that there is no precipitate. If there is a precipitate, vortex the buffer until all particles are resuspended.
	- 3. After thawing, gently invert the tubes 3–5 times, followed by centrifuging the tubes briefly, to ensure all reagents are adequately mixed.
	- 4. Label a new 96-well PCR plate "Library Prep."
	- 5. In a 1.5-ml PCR tube, prepare tagmentation premix by mixing 300 μl Tagment DNA Buffer and 150 μl Amplicon Tagment Mix to a total volume of 450 μl (calculated based on a 25% excess). Vortex gently for 20 s and centrifuge the tube briefly.
	- 6. Distribute 3.75 μl of the tagmentation premix into each well of the "Library Prep" plate.
	- 7. Transfer 1.25 μl of each diluted cDNA sample to the "Library Prep" plate.
	- 8. Seal the plate and vortex at medium speed for 20 s. Centrifuge at 2000 × g for 5 min to remove bubbles.
	- 9. Place the "Library Prep" plate in a thermal cycler with a heated lid, and run the following program: 55 °C, 10 min; 10 °C, hold.
	- 10. Once the thermal cycler reaches 10 °C, pipette 1.25 μl of NT Buffer into each of the tagmented samples to neutralize the samples (see Note 31).
	- 11. Seal the plate and vortex at medium speed, and then centrifuge at 2000 × g for 1 min.
	- 12. Incubate at room temperature for 5 min.

Select appropriate Index 1 (N7xx) and Index 2 (S5xx) primers for the number of samples in your experiment (see Note 33).

Amplify the Tagmented cDNA


Samples can be left overnight in the thermal cycler at 4 °C. If not processed within the next day, freeze the PCR products at −20 °C.

Pooling and Purification of Amplified Libraries

	- 2. Pool the libraries by pipetting a fixed volume from each sample into a 1.5-ml tube or PCR tube. Volumes between 2 μl and 8 μl are appropriate. Do not use less than 2 μl per sample to ensure greater accuracy. For example, to pool 96 libraries, add 2 μl of each library (total 192 μl) and 154 μl of beads to a 1.5-ml tube; the bead volume is approximately 80% of the total pool volume.
	- 3. Add a volume of beads representing 80% of the volume of the pooled libraries. If cleaning up libraries individually, add 40 μl of beads to each 50-μl sample.
	- 4. Mix well by vortexing or pipetting the entire mixture up and down ten times (see Note 35).
	- 5. Incubate at room temperature for 5 min to let the cDNA libraries bind to the beads.
	- 6. Briefly spin the sample to collect the liquid from the side of the tube. Place the tube on a magnetic stand for ~2 min or until the liquid appears completely clear, and there are no beads left in the supernatant.
	- 7. While the samples are on the magnetic separation device, remove and discard the supernatant. Take care not to disturb the beads.
	- 8. Keep the samples on the magnetic separation device. Add 200 μl of fresh 80% ethanol to each sample without disturbing the beads. Incubate for 30 s, and then remove and discard the supernatant, taking care not to disturb the beads. The cDNA remains bound to the beads during washing.
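The bead volume used for the pooled cleanup is simple arithmetic on the pool volume. As a small sketch (Python, hypothetical function name), assuming the 0.8 bead-to-pool ratio and the 2–8 μl per-library range given in the steps above:

```python
def pooled_bead_volume(n_libraries: int, ul_per_library: float,
                       bead_ratio: float = 0.8) -> tuple:
    """Return (pool volume, bead volume) in μl for a pooled cleanup."""
    if not 2 <= ul_per_library <= 8:
        raise ValueError("use between 2 and 8 μl per library")
    pool_ul = n_libraries * ul_per_library
    return pool_ul, pool_ul * bead_ratio


pool_ul, bead_ul = pooled_bead_volume(96, 2)
print(f"Pool: {pool_ul} μl, beads: {round(bead_ul)} μl")
```

For 96 libraries at 2 μl each, this gives a 192 μl pool and about 154 μl of beads, in line with the worked example in the pooling step.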


### 3.2.6 Sequencing

Sequence the SMART-Seq single-cell library with Illumina sequencing (see Note 37).

### 3.3 10x Genomics Chromium Next GEM Single-Cell V(D)J Kit Data Processing and Analysis

10x Genomics data are analyzed using the Cell Ranger software, provided by 10x Genomics free of charge. Cell Ranger allows (1) demultiplexing of the raw sequencing data, (2) quality control of the raw data obtained, (3) alignment of the raw data to the reference genome, and (4) preparation of data matrices for further in-depth analyses using dimensionality reduction methods. Three Cell Ranger pipelines are now available: cellranger count (for transcriptome and feature data), cellranger vdj (for AIRR data), and cellranger multi (which does the integrative analysis of transcriptome, AIRR, and feature data). Loupe Browser and Loupe VDJ Browser, which are also available via the 10x Genomics website, provide a complementary set of analysis tools. Note that the Loupe Browsers are only available for Windows or macOS environments.

3.3.1 Setup

Cell Ranger can be installed in a folder named "cellranger" in the home directory. Before running Cell Ranger, ensure that this folder is included in the PATH environment variable: export PATH=\$PATH:\$HOME/cellranger


	- 1. cellranger vdj --id LBA-01-VDJ --sample LBA-01-Sample --fastqs data/VDJ --reference data/refdata/refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0.

Arguments (see Note 40):

	- (a) --id: The folder that will contain the output of the pipeline (here: LBA-01-GEX)
	- (b) --sample: Sample name as specified in the FASTQ file (here: LBA-01-GEX-Sample)
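As a minimal sketch, the `cellranger vdj` call shown above can be assembled from Python, which is convenient when several samples are processed in a loop. The function name `build_vdj_command` is hypothetical, the reference path is a placeholder, and only the flags shown in the command above are used.

```python
import shlex
import subprocess


def build_vdj_command(run_id, sample, fastq_dir, reference_dir):
    """Return the argument list for one `cellranger vdj` run.

    Only the four flags used in the text are set here.
    """
    return [
        "cellranger", "vdj",
        "--id", run_id,                # output folder of the pipeline
        "--sample", sample,            # sample name as given in the FASTQ file names
        "--fastqs", fastq_dir,         # folder containing the FASTQ files
        "--reference", reference_dir,  # V(D)J reference folder
    ]


if __name__ == "__main__":
    cmd = build_vdj_command(
        "LBA-01-VDJ", "LBA-01-Sample", "data/VDJ",
        "data/refdata/path_to_vdj_reference",  # placeholder path (assumption)
    )
    print(shlex.join(cmd))
    # To launch the run, cellranger must be on the PATH (see Setup):
    # subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to confirm the sample names and paths before a long run is started.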


dows and https://docs.docker.com/docker-for-mac/#advanced for Mac.


docker run -it --rm -v \$PWD:/scratch -w /scratch teichlab/tracer assemble [options] <file\_1> [<file\_2>] <cell\_name> <output\_directory>

The main arguments expected by TraCeR are as follows:

<file\_1>: FASTQ file containing #1 mates from paired-end sequencing or all reads from single-end sequencing. If paired-end sequencing is used, provide #2 mates after the #1 mates FASTQ file.

<cell\_name>: Name of the cell chosen by the user. This name will be used in file names and labels.

<output\_directory>: Directory for output. The cell-specific output from the assemble mode will be found in <output\_directory>/<cell\_name>.

TraCeR also accepts several options, which are detailed at https://github.com/Teichlab/tracer and in [8].

2. Reconstruct TRA, TRB, TRD, and TRG rearrangements from paired-end data using one processor core by running the following command (here for a hypothetical example dataset consisting of T cells from humans):

docker run -it --rm -v \$PWD:/scratch -w /scratch teichlab/tracer assemble cell\_1\_R1.fq.gz cell\_1\_R2.fq.gz cell\_1 Exp\_1 -c my\_config\_file -s Hsap --loci A B G D -m assembly
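Because assemble is run once per cell, it can help to generate the per-cell commands programmatically. The following is a sketch only: the function name `tracer_assemble_command` and the file naming pattern `<cell>_R1.fq.gz` / `<cell>_R2.fq.gz` are assumptions for illustration, while the docker invocation and options mirror the example command above.

```python
import os
import shlex


def tracer_assemble_command(fastq_r1, fastq_r2, cell_name, output_dir,
                            config="my_config_file", species="Hsap",
                            loci=("A", "B", "G", "D")):
    """Build the docker command for one cell, mirroring the example in the text."""
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{os.getcwd()}:/scratch", "-w", "/scratch",
        "teichlab/tracer", "assemble",
        fastq_r1, fastq_r2, cell_name, output_dir,
        "-c", config, "-s", species,
        "--loci", *loci,   # loci are passed as separate arguments
        "-m", "assembly",
    ]


if __name__ == "__main__":
    # Hypothetical cell names; FASTQ names are assumed to follow <cell>_R1/_R2.
    for cell in ["cell_1", "cell_2", "cell_3"]:
        cmd = tracer_assemble_command(
            f"{cell}_R1.fq.gz", f"{cell}_R2.fq.gz", cell, "Exp_1"
        )
        print(shlex.join(cmd))
```

The printed lines can be collected into a shell script or submitted as separate jobs on a cluster.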

3.4.3 Reconstruct TR Sequences with the Assemble Mode

3.4.4 Identify Clonally Related Cells with the Summarise Mode


docker run -it --rm -v \$PWD:/scratch -w /scratch teichlab/tracer summarise [options] <input\_dir>

For a hypothetical example dataset consisting of T cells from humans, the following command could be run using one processor core:

docker run -it --rm -v \$PWD:/scratch -w /scratch teichlab/tracer summarise Exp\_1 -c my\_config\_file -s Hsap --loci A B G D -g svg -u

	- 2. Remove likely cell doublets/multiplets affecting clonal assignments, with more or less strict criteria depending on the dataset and biological questions (see Note 49), by visually inspecting the clonal graph outputs created by TraCeR summarise run with the -u flag. Likely doublets/multiplets can be seen as cells with two or more sets of rearranged TRA and TRB chains (or TRD and TRG), connecting smaller clone groups that otherwise do not share rearranged sequences with each other. If such likely doublets/multiplets exist in the data, we recommend removing the result directories from assemble for these cells and rerunning TraCeR summarise mode without the -u flag.
	- 3. Remove likely cell doublets/multiplets/contaminations by opening TCR\_summary.txt in the unfiltered summary folder and looking at the section named #Cells with more than two recombinants for a locus. Move the assemble result folder for any cell with more than three reconstructed TR rearrangements for any locus to the directory containing cells to be filtered out.




3.4.7 In-Depth Analysis Further analysis can be performed using additional tools (see Note 44).
Fig. 7 Overview of clonality of a T-cell population based on TR sequences reconstructed by TraCeR. Each cell is represented by a node, while each reconstructed TR sequence is represented by a horizontal line (a) or a sequence identifier (b; only showing the largest clone group), colored according to chain type. Cells sharing identical TR sequences are connected with edges colored by chain type

### 4 Notes


v11-chemistry) carefully, and follow all instructions regarding general reagent handling, Chromium Next GEM Chip handling, assembly, loading, and all other technical instructions. The evolving kit versions (v1, v1.1, v2) differ with respect to certain volumes and concentrations. The following protocol is based on v1.1.


https://support.10xgenomics.com/single-cell-gene-expression/sample-prep/doc/demonstrated-protocol-cell-surface-protein-labeling-for-single-cell-rna-sequencing-protocols

https://support.10xgenomics.com/single-cell-vdj/sample-prep/doc/demonstrated-protocol-cell-labeling-with-dextramers-for-single-cell-rna-sequencing-protocols).



### Table 1 Cell recovery as a function of cell concentration

Top rows (bold): μl cell suspension; bottom rows: μl PBS or nuclease-free water (NFW)


These libraries include a P5 part that binds to the flow cell, the primer binding site for read 1, which contains a 16 bp 10x barcode to identify the cell assignment, followed by a 10-mer UMI for counting the transcripts, the TSO, and the poly-A stretch. The transcript insert follows and is sequenced in read 2, followed by a region for the sequencing primer, the i7 index, and the P7 part that binds to the flow cell. The minimum sequencing lengths are 26 bp for read 1 and 91 bp for read 2. Sequencing these libraries produces standard Illumina BCL data. The optimal sequencing depth is 25–30 K reads/cell for cDNA libraries and 10 K reads/cell for AIRR and feature barcode libraries. For library loading we recommend the following: MiSeq (2 × 150 bp reads): 15 pM; NovaSeq in XP mode: cDNA library: 250 pM; feature barcode library: 190–200 pM; AIRR library: 300 pM; NovaSeq in standard mode: cDNA library: 450 pM; feature barcode library: 300 pM; AIRR library: 500 pM.
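The read 1 layout described here (16 bp cell barcode followed by a 10-mer UMI, hence the 26 bp minimum) can be illustrated with a short sketch. Cell Ranger performs this step internally; the function `split_read1` below is hypothetical and only shows how the two elements are positioned within read 1.

```python
BARCODE_LEN = 16  # length of the 10x cell barcode, as stated in the note
UMI_LEN = 10      # length of the UMI, as stated in the note


def split_read1(read1_seq):
    """Split a read 1 sequence into (cell_barcode, umi).

    Bases beyond the first 26 are ignored in this illustration.
    """
    if len(read1_seq) < BARCODE_LEN + UMI_LEN:
        raise ValueError("read 1 is shorter than barcode + UMI (26 bp)")
    barcode = read1_seq[:BARCODE_LEN]
    umi = read1_seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi


if __name__ == "__main__":
    # Made-up sequence for illustration only.
    barcode, umi = split_read1("ACGTACGTACGTACGT" + "TTTTGGGGCC" + "TTTCTT")
    print(barcode, umi)
```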


efficiency of cDNA synthesis; however, it is safe to store the cells for several weeks prior to cDNA synthesis.

23. Control cells should be in PSS described above. PSS does not contain the 3′ SMART-Seq CDS Primer II A, so it must be added here.

The Control Total RNA is supplied at a concentration of 1 μg/μl. It should be diluted to match the concentration of your test sample using serial dilutions. For positive and negative controls, replace the cell sample with 2 μl of the diluted control RNA and water, respectively.


Cycling guidelines based on cell type and pg RNA per cell: PBMCs (1–5 pg), 20 cycles; Jurkat cells (5 pg), 17 cycles; and lymphoblastoid cells (2–15 pg), 17–19 cycles.


1.25 μl), but any input between 100 and 300 pg will work. If all samples are correctly quantified and normalized to a uniform input amount before Nextera XT library preparation, sequencing libraries can be pooled before cleanup, and a relatively uniform amount of sequencing reads will be obtained. However, sample-to-sample read coverage varies, and one may observe some underrepresented or overrepresented samples with the pooling option. Always use a minimum of 2 μl of cDNA to make dilutions. Samples containing less than 100 pg/μl can still be used without dilution, but one may get fewer reads than for other samples if pooled for cleanup. If negative controls are going to be sequenced, they should be used without dilution.


Antibody Heavy-Chain Gene Rearrangements for the Detection and Analysis of B-Cell Clone Distribution" and "Bulk Sequencing from mRNA with UMI for Evaluation of B-Cell Isotype and Clonal Evolution," in Note 15 and on the Illumina website.


each cell but may be performed for multiple cells in parallel if running TraCeR on a computational cluster.


### Acknowledgments

We thank Magnolia Bostick, Christian Busse, Eline T. Luning Prak, Gloria Kraus, Chaim Schramm, Nicolas Tchitchek, Ulrik Stervbo, and Johannes Trück for helpful discussions and inspiration around the manuscript and for editing and proofreading and Elaine Chen for contributing figures. EMF and KME were funded by iMAP (ANR-16-RHUS-0001), Transimmunom LabEX (ANR-11-IDEX-0004-02), TriPoD ERC Research Advanced Grant (Fp7-IdEAS-ErC-322856), AIR-MI (ANR-18-ECVD-0001), iReceptor-Plus (H2020 Research and Innovation Programme 825821), and SirocCo (ANR-21-CO12-0005-01) grants. AE is supported by grants from the Deutsche Forschungsgemeinschaft (BO 3429/3-1 and BO 3429/4-1) and the BMBF (RESET-AID). IL is funded by KG Jebsen (project SKGJ-MED-017). Conflict of interest: AE, EMF, KME, IL, NG, and SR declare no conflict of interest. KM is an employee at 10x Genomics, Pleasanton, CA, USA, and NG is an employee at Takara Bio, Mountain View, CA, USA. Both companies produce a kit described in this protocol.

### References


https://doi.org/10.3389/fimmu.2019.02568


receptor-sequencing data. Bioinformatics 36: 4817–4818

13. Samir J, Rizzetto S, Gupta M, Luciani F (2020) Exploring and analysing single cell multi-omics data with VDJView. BMC Med Genomics 13:29

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Quality Control: Chain Pairing Precision and Monitoring of Cross-Sample Contamination: A Method by the AIRR Community

### Cheng-Yu Chung, Matías Gutiérrez-González, Sheila N. López Acevedo, Ahmed S. Fahad, and Brandon J. DeKosky, on behalf of the AIRR Community

### Abstract

New approaches in high-throughput analysis of immune receptor repertoires are enabling major advances in immunology and for the discovery of precision immunotherapeutics. Commensurate with growth of the field, there has been an increased need for the establishment of techniques for quality control of immune receptor data. Our laboratory has standardized the use of multiple quality control techniques in immunoglobulin (IG) and T-cell receptor (TR) sequencing experiments to ensure quality control throughout diverse experimental conditions. These quality control methods can also validate the development of new technological approaches and accelerate the training of laboratory personnel. This chapter describes multiple quality control techniques, including split-replicate cell preparations that enable repeat analyses and bioinformatic methods to quantify and ensure high sample quality. We hope that these quality control approaches can accelerate the technical adoption and validated use of unpaired and natively paired immune receptor data.

Key words B-cell receptor, T-cell receptor, Next-generation sequencing, PCR, Single-cell analysis, Replicate analysis, Quality control

### 1 Introduction

Recent developments in single-cell technologies have made it possible to capture the sequences of both heavy and light chains of immunoglobulins (IG) at high throughput [1–3]. These high-throughput methods require single-cell isolation approaches to enable the identification of the IG heavy and light chain cognate pairs. Clonal spikes of immortalized B-cell lines can provide some measure of IG heavy and light chain pairing quality control in single-cell assays, but the expression level of clonal spikes is often very different from that of native B cells in a given sample. One method

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_21, © The Author(s) 2022

to accurately analyze the quality of single-cell analyses relies on the determination of B-cell IG heavy and light chain pairs in split-replicate samples. By processing two or more replicates of the same sample, experimental or technical performance and reproducibility can be assessed for methods development, for generating high-quality data, and for sample-specific quality control. For highest experimental accuracy and statistical validity, it is important that both replicates are treated in exactly the same way throughout all stages of the experiment and subsequent data analysis.

B-cell replicate analyses can be performed with two aliquots of the same patient sample (e.g., split aliquots of the same blood PBMCs), which is often the simplest method. Some information can also be obtained using different tissue samples from the same individual or animal (e.g., comparing spleen and bone marrow compartments from the same mouse), although with some statistical compromises, because samples from different cell sources are not true replicates. A robust experimental analysis of split replicates can be performed after in vitro B-cell expansion, which provides a large pool of expanded B cells that is distributed across the two replicates. This approach enables precise determination of IG heavy and light chain pairing accuracy using a given single-cell technique. The first experimental section of this methods article describes an effective way to generate expanded B-cell populations that are ready for split-replicate analyses and associated statistical determinations of pairing precision.

In addition to IG sequence analysis, the high-throughput sequencing of T-cell receptor (TR) gene rearrangements provides insights into dynamic cellular adaptive immune responses. TR screening can also be useful for discovery of therapeutic T cells or evaluations of T cell–based therapies [4, 5]. Single-cell TR sequencing approaches further accelerate progress by yielding paired alpha and beta TR chain information. Methods development and quality control of single T-cell sequencing are facilitated by split-replicate studies for high-quality statistical determination of technical single-cell accuracy. The second experimental protocol describes the retrieval of frozen cell samples, the purification of a T-cell subset of interest (here we use CD8+ T cells as a demonstration), and T-cell expansion in vitro prior to TR sequencing.

One important approach for quality control of single-cell sequencing experiments is the use of technical split-replicate samples, which provide a powerful tool to measure the reproducibility of a technique or assay. Split replicates can also provide major advantages for experimental training of new individuals and for the sample quality control of critical samples. While not strictly necessary, we find that in vitro expansion of B or T cells prior to analysis can provide a major boost to the number of overlapping clones in a given split-replicate analysis, thereby enhancing statistical accuracy of the pairing precision analysis, although the in vitro expansion can alter the original distribution of clonal frequency within a dataset. To determine the single-cell chain pairing precision, we assume that an IG heavy-light chain pair found across multiple replicates is a true positive, while an IG heavy chain paired with a different IG light chain found across replicates is a false positive in at least one replicate [2, 6]. We base our pairing precision analyses on the CDR3 nucleotide sequence for highest accuracy, as many IG heavy or light chain CDR3s can be encoded by similar amino acid sequences across individuals but still derive from unique V(D)J rearrangement events. In the third experimental protocol here, we describe the evaluation of single-cell pairing precision for both IG and TR sequences using a common bioinformatic approach.

A major source of potential experimental error is PCR contamination, which can occasionally appear in any research group and must be eliminated quickly and effectively to avoid continued spread. In particular, it is important to track experimental samples for the presence of potential PCR contamination across an entire laboratory and group of researchers as a means of ensuring high sample quality on an ongoing basis. Our final protocol describes the construction and analysis of a database of samples previously analyzed in a laboratory or research group to monitor for cross-sample contamination in new samples. The database can be as simple as a collected set of files containing the information needed, and ongoing additions to the database permit facile monitoring and analysis for any potential PCR contamination events to ensure high quality control.

The four protocols provided here describe both experimental and bioinformatic methods that help ensure robust and rigorous data from next-generation sequencing technologies. We believe that these key quality control methods can be useful for other laboratories and can accelerate the growth of sequence data and associated information derived from single-cell adaptive immune receptor sequencing techniques.

### 2 Materials

2.1 B-Cell Stimulation to Generate Split-Replicate Cell Samples


2.2 T-Cell Stimulation to Generate Split-Replicate Cell Samples


2.3 Technical Precision Analysis of Paired IG Heavy/Light or Paired TR Alpha/ Beta Sequencing

2.4 Laboratory-Scale Global Detection and Monitoring of Cross-Sample Contamination Events

	- 2. PCR\_QC\_analysis.py requires the software dependencies Python v3.6 and pandas 0.25. The script has not been tested in older versions of python or pandas.

### 3 Methods

3.1 B-Cell Stimulation for the Generation of Split-Replicate Samples as Repeated Analyses

Here, we describe a protocol for CD27+ antigen-experienced B-cell retrieval via magnetic-activated cell sorting (MACS) that uses magnetic beads that are coated with antibodies or enzymes associated with surface markers of our targeted cells. Alternatively, flow cytometry can be used to isolate high-purity B-cell populations of interest. Next, B-cell activation by cells presenting CD40 ligand (CD40L), along with other cytokines, is performed to induce B-cell proliferation in vitro [2, 6–8]. This robust selection and stimulation protocol yields a substantial B-cell population, normally expanded two- or threefold after a 5-day culture period, ready for subsequent split-replicate analyses and single B-cell quality control studies. In this section, steps 1–13 describe human B-cell enrichment without CD43 depletion, while steps 14–29 describe CD27+ memory B-cell selection, and finally, steps 30–38 describe the procedures for cell culture for in vitro expansion.




T-cell populations can be divided as replicates prior to single-cell analyses for robust statistical determination of technical performance. To obtain primary T-cell populations, MACS or immunophenotyping can be used. MACS is fast, easy to scale, and may have higher viability post-purification than FACS. Alternatively, FACS-based selection permits more complex cell subset isolations, including fluorochrome-labeled multimeric peptide-MHC screening [9], peptide-pulsed antigen-presenting cells [10], and peptide megapools [11].

Direct TR sequencing analysis after T-cell isolation using MACS for split-replicate TR analysis is feasible, although the overlap will not be as high as for in vitro expanded T-cell populations. Larger cell numbers following stimulation can allow for more complete coverage of the TR repertoire when multiple assays are designed (e.g., staining cells for multiple peptide/MHC targets). The TRα/TRβ chain pairings of split-replicate samples can be used to test the statistical accuracy of single-cell TR sequencing. The following protocol describes the collection and in vitro expansion of T-cell populations to generate split-replicate populations for single-cell technical analysis studies. In this section, steps 1–7 detail how to thaw frozen PBMC or splenocyte samples that contain the desired T-cell populations, while steps 8–15 describe how to purify CD8+ T cells from PBMC samples, and finally steps 16–18 describe how to expand the T cells in vitro.


3.2 T-Cell Stimulation for the Generation of Split-Replicate Samples as Repeated Analyses


Here, we describe how to run the software for automated pairing precision analysis.

1. Execute the command:

```
bash precision_calculator.sh <REPLICATE_FILE1> <REPLICATE_FILE2>
```
	- (a) Extract and count repeated CDR-H3/CDR-β3 sequences. Only CDR-H3/CDR-β3 sequences that overlap between replicates are counted.
	- (b) Count true positives (TP): CDR-H3/CDR-β3 paired with exact-match CDR-L3/CDR-α3 sequences. An exact match is equal length and identical sequence.
	- (c) Count false positives (FP): CDR-H3/CDR-β3 paired with different CDR-L3/CDR-α3 sequences, defined by different lengths, in at least one replicate.
	- (d) Count TP and FP among CDR-H3/CDR-β3 with mismatched CDR-L3/CDR-α3.

Clonally expanded variants can have mutated CDR-L3/CDR-α3 sequences due either to somatic hypermutation and sequencing error (IG) or to sequencing error alone (TR), but they still clearly derive from the same V-J rearrangement and represent different variants of the same BCR/TCR cell clone. CDR-L3/CDR-α3 sequences with equal length and not more than 20% mismatches can generally be considered TP, while equal-length CDR-L3/CDR-α3 sequences with more than 20% mismatches can be considered FP.

The Hamming distance is then used to calculate the degree of mismatch:

$$\%\ \text{CDR-L3 difference} = \frac{(\text{Number of mismatches})}{(\text{Sequence length})}$$
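The mismatch rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of precision\_calculator.sh: the function names are hypothetical, and the 20% cutoff and the treatment of unequal lengths as FP are taken from the text.

```python
def hamming_fraction(seq_a, seq_b):
    """Fraction of mismatched positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires equal-length sequences")
    if not seq_a:
        raise ValueError("empty sequence")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def classify_pairing(light_r1, light_r2, max_fraction=0.20):
    """Classify the partner chains of one shared CDR-H3 (or CDR-beta3).

    light_r1 / light_r2: CDR-L3 (or CDR-alpha3) nucleotide sequences
    observed with that heavy (or beta) chain in replicates 1 and 2.
    """
    if len(light_r1) != len(light_r2):
        return "FP"  # different lengths: counted as a false positive
    if hamming_fraction(light_r1, light_r2) <= max_fraction:
        return "TP"  # identical, or not more than 20% mismatches
    return "FP"


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    print(classify_pairing("TGTCAGCAGT", "TGTCAGCAGA"))  # 1 of 10 positions differs
    print(classify_pairing("TGTCAGCAGT", "TGTCAGCAG"))   # lengths differ
```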

3.3 Monitoring the IG Heavy/Light and TR Alpha/Beta Technical Pairing Precision Using Split-Replicate Single-Cell Sequencing Samples

(e) The script will then calculate and report the chain pairing precision in the following fashion (see Notes 13 and 14):

The pairing precision (P) is calculated from the number of TP and FP [12]:

$$P = \frac{(\text{TP})}{(\text{TP} + \text{FP})}$$

Therefore, the collective precision of two independent technical replicates (R1 and R2) is as follows [2, 6]:

$$P_{\text{R1 and R2}} = \frac{(\text{TP}_{\text{R1 and R2}})}{(\text{TP}_{\text{R1 and R2}} + \text{FP}_{\text{R1 and R2}})}$$

The probability of independent events is equal to the product of the independent event probabilities. Moreover, the precisions of technical replicates 1 and 2 (P<sub>R1</sub> and P<sub>R2</sub>) are considered to be equal, a property of technical replicates:

$$P_{\text{R1 and R2}} = P_{\text{R1}} \times P_{\text{R2}} = P^2$$

Combine the two equations above and solve for P to estimate the precision of a single analysis:

$$P_{\text{R1 and R2}} = P^2 = \frac{(\text{TP}_{\text{R1 and R2}})}{(\text{TP}_{\text{R1 and R2}} + \text{FP}_{\text{R1 and R2}})}, \qquad P = \sqrt{\frac{(\text{TP}_{\text{R1 and R2}})}{(\text{TP}_{\text{R1 and R2}} + \text{FP}_{\text{R1 and R2}})}}$$
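As a small sketch of this calculation, the ratio TP/(TP + FP) measured across the two replicates estimates the squared precision, so the single-analysis precision is its square root. The function name `single_analysis_precision` and the counts in the usage example are hypothetical.

```python
import math


def single_analysis_precision(tp, fp):
    """Estimate the pairing precision P of a single analysis.

    tp, fp: true and false positive counts among CDR-H3 (or CDR-beta3)
    sequences shared by two technical replicates. TP / (TP + FP)
    estimates P squared, so P is its square root.
    """
    total = tp + fp
    if total == 0:
        raise ValueError("no overlapping sequences to evaluate")
    return math.sqrt(tp / total)


if __name__ == "__main__":
    # Hypothetical counts: 900 TP and 100 FP among 1000 shared sequences.
    print(f"{single_analysis_precision(900, 100):.3f}")
```

Note that the single-analysis estimate is always at least as high as the raw two-replicate ratio, because an error in either replicate is enough to produce a false positive.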

In this section, steps 1 and 2 detail how to prepare the data files, while steps 3 and 4 describe how to perform the analysis for large-scale contamination monitoring.

	- 2. Prepare a metadata file containing experimental information for each sample. This file must have a column named file that contains the file names, which is used to match the contamination analysis output. Although the script does not have specific metadata requirements, we recommend that the metadata file include all MiAIRR information and use AIRR standard names whenever possible.

3.4 Laboratory-Scale Global Detection and Monitoring of Cross-Sample Contamination Events

3. Execute the command:

```
python PCR_QC_analysis.py <sequencing_files_pattern> <metadata_file>
```
	- (a) Search for all desired files using a pattern expansion strategy. For example, using the term <PBMC> as <sequencing\_files\_pattern> variable will match all files with the phrase "PBMC" in the current directory. Then, the script will create all pairwise combinations of matched files (see Note 15).
	- (b) CDR-H3 nucleotide sequences are matched across files. In this example the number of matches is divided by the total number of clones. The protocol does not discriminate between potential convergent or public responses and actual cross-contamination events. Presence of shared CDR-H3s should be assessed on a case-by-case basis, considering the nature of each experiment and the extent of shared sequences. Unrelated samples from literature or from different laboratories can be used to set a minimum threshold for convergent responses within your database. Cross-contamination events will be readily detectable and will exceed background levels established for convergent responses by a substantial margin.
	- (c) For IG sequences, clones are also binned by antibody isotype to allow a closer analysis of potential contaminations.
	- (d) Output overlap fractions are processed as pairwise comparisons and annotated from an external database containing important file metadata. As another control, the provided script also reports the fraction of shared CDRs for a single file. The sum result of the constituent fractions for a complete repertoire should always be 1.

For a pair of unrelated samples, the level of shared CDR-H3 sequences should be close to zero. However, a low level of shared sequences can be expected, so a low threshold can be applied (e.g., <1 in 10<sup>4</sup> clones) [13], and any overlap should be assessed on a case-by-case basis. For the CDR-L3, a lower diversity translates into a higher number of public sequences [14], which should be taken into consideration when interpreting results. In the case of paired VH/VL sequences, the threshold for CDR-H3 must be lower than that used for CDR-L3, since the chance of a shared CDR-H3 is much lower than that of a shared CDR-L3. Some samples may show low-level overlapping CDR-H3s simply from being analyzed on the same sequencing run, for example, as a result of index hopping during Illumina sequencing. If contamination across samples from different species is observed (e.g., human/rhesus), a V(D)J gene annotation tool (e.g., IgBLAST) with a search database that includes both species can accurately reflect the level of cross-contamination and help with robust identification of the source of contamination. Similar approaches could be used with cell barcodes or index barcodes, provided that sufficient barcode space is available and that index barcodes are rarely or never reused for a substantial period of time, to enable robust tracking and sequence overlap analysis.
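The pairwise comparison described in this section can be sketched as follows. This is not the code of PCR\_QC\_analysis.py: the function names are hypothetical, and the choice of denominator (the number of distinct clones in the two samples combined) is an assumption, since the text only states that matches are divided by the total number of clones. The default threshold of 1 in 10<sup>4</sup> is the example value given above.

```python
from itertools import combinations


def overlap_fraction(clones_a, clones_b):
    """Shared CDR-H3 nucleotide sequences divided by all distinct clones.

    Assumption: the denominator is the union of both samples.
    """
    set_a, set_b = set(clones_a), set(clones_b)
    total = len(set_a | set_b)
    return len(set_a & set_b) / total if total else 0.0


def pairwise_overlaps(samples):
    """Yield (file_a, file_b, fraction) for every pair of samples.

    samples: dict mapping a file name to its list of CDR-H3 sequences.
    """
    for (name_a, clones_a), (name_b, clones_b) in combinations(
            sorted(samples.items()), 2):
        yield name_a, name_b, overlap_fraction(clones_a, clones_b)


def flag_pairs(samples, threshold=1e-4):
    """Return the sample pairs whose overlap exceeds the threshold."""
    return [(a, b, f) for a, b, f in pairwise_overlaps(samples)
            if f > threshold]


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = {
        "PBMC_A": ["AAA", "CCC"],
        "PBMC_B": ["CCC", "GGG"],
        "PBMC_C": ["TTT"],
    }
    for pair in flag_pairs(toy):
        print(pair)
```

Flagged pairs still need to be interpreted case by case, since this comparison alone cannot distinguish public or convergent sequences and index hopping from true cross-contamination.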

### 4 Notes

	- (a) Total CDR-H3 sequences overlapping in both replicates: 1000.
	- (b) CDR-H3 observed to be exactly matched with the same CDR-L3s in R1 and R2 (TP): 950.
	- (c) CDR-H3 observed to be paired with CDR-L3s of different lengths in R1 or R2 (FP): 30.
	- (d) CDR-H3 with matched length, but an inexact CDR-L3 nt sequence match in R1 or R2: 20.

Analyzing the 20 mismatched light chain sequences using a Hamming distance formula:


$$P = \sqrt{\frac{965}{(965+35)}} = 98.2\%$$
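The totals in the expression above can be traced back to counts (a)–(d). As a hedged reconstruction, the values 965 and 35 imply that 15 of the 20 inexact matches fell within the 20% Hamming threshold and were counted as TP, while the remaining 5 exceeded it and were counted as FP; this 15/5 split is inferred from the totals and is not stated explicitly in the note:

$$\text{TP} = 950 + 15 = 965, \qquad \text{FP} = 30 + 5 = 35, \qquad \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{965}{1000} = 0.965, \qquad P = \sqrt{0.965} \approx 0.982$$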


### Acknowledgments

We thank Susanna Marquez, Eline T. Luning Prak, Chaim Schramm, and Ulrik Stervbo for assistance with the manuscript and David Price and Daniel Douek for scientific guidance. This work was supported by the University of Kansas Departments of Pharmaceutical Chemistry and Chemical Engineering, the KU Cancer Center, the US Department of Defense W81XWH1810296, and by NIH grants DP5OD023118, P20GM103418, R21AI143407, and R21AI144408.

### References


(2014) Developmental pathway for potent V1V2-directed HIV-neutralizing antibodies. Nature 509:55–62



# Immune Repertoire Analysis on High-Performance Computing Using VDJServer V1: A Method by the AIRR Community

### Scott Christley, Ulrik Stervbo, and Lindsay G. Cowell, on behalf of the AIRR Community

### Abstract

AIRR-seq data sets are usually large and require specialized analysis methods and software tools. A typical Illumina MiSeq sequencing run generates 20–30 million 2 × 300 bp paired-end sequence reads, which roughly corresponds to 15 GB of sequence data to be processed. Other platforms like NextSeq, which is useful in projects where the full V gene is not needed, create about 400 million 2 × 150 bp paired-end reads. Because of the size of the data sets, the analysis can be computationally expensive, particularly the early analysis steps like preprocessing and gene annotation that process the majority of the sequence data. A standard desktop PC may take 3–5 days of constant processing for a single MiSeq run, so dedicated high-performance computational resources may be required.

VDJServer provides free access to high-performance computing (HPC) at the Texas Advanced Computing Center (TACC) through a graphical user interface (Christley et al. Front Immunol 9:976, 2018). VDJServer is a cloud-based analysis portal for immune repertoire sequence data that provides access to a suite of tools for a complete analysis workflow, including modules for preprocessing and quality control of sequence reads, V(D)J gene assignment, repertoire characterization, and repertoire comparison. Furthermore, VDJServer has parallelized execution for tools such as IgBLAST, so more compute resources are utilized as the size of the input data grows. Analysis that takes days on a desktop PC might take only a few hours on VDJServer. VDJServer is a free, publicly available, and open-source licensed resource. Here, we describe the workflow for performing immune repertoire analysis on VDJServer's high-performance computing.

Key words AIRR-Seq, B-cell receptor, T-cell receptor, High-performance computing, Cloud computing

### 1 Introduction

Immune repertoire sequencing produces large, highly complex data sets that require specialized analysis methods and software tools. We developed VDJServer to address critical barriers in broader adoption of immune repertoire sequencing, namely, the

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_22, © The Author(s) 2022

lack of a complete, start-to-finish analysis pipeline, the lack of a data management infrastructure, and limited access for many researchers to high-performance computing (HPC) resources. VDJServer fills these gaps, specifically providing (1) an open suite of interoperable repertoire analysis tools that allows users to upload a set of sequences and pass them through a seamless workflow that executes all steps in an analysis, (2) access to sophisticated analysis tools running in an HPC environment, (3) interactive visualization capabilities for exploratory analysis, (4) a data management infrastructure, and (5) a graphical user interface to facilitate use by experimental and clinical research groups that lack extensive bioinformatics expertise.

Here, we describe the workflow for performing immune repertoire analysis on VDJServer's high-performance computing. The major steps of the workflow include creating a project to hold sequencing data and analysis results, uploading and preparing immune repertoire sequencing files, preprocessing the raw sequence data, performing V(D)J assignment and annotation of the processed sequences, defining study metadata and analysis comparison groups, performing repertoire characterization and comparison, and visualizing and downloading analysis results.

### 2 Materials

VDJServer requires a user account with a valid email address to access the system. Creating an account and using the VDJServer resources are both free. Accounts are used to ensure that data and results are private and secure. Create an account at https://vdjserver.org to get started. Contact VDJServer with any questions or concerns by using the Feedback option on the website or by sending an email to vdjserver@utsouthwestern.edu.

### 3 Methods

For researchers without access to high-performance computing (HPC), VDJServer provides free access to the Texas Advanced Computing Center (TACC) through a standard web browser via a graphical user interface [1]. A suite of tools for a complete analysis workflow is provided, including modules for preprocessing and quality control of sequence reads, V(D)J gene assignment, repertoire characterization, and repertoire comparison (Fig. 1). VDJServer incorporates analysis software from the Immcantation suite [2, 3], VDJPipe [4], and other interoperability tools [5, 6]. Germline gene sets for human and mouse are derived from IMGT [7], and a draft germline set for Indian-origin rhesus macaque IG is also provided [8]. VDJServer provides the Community Data Portal for publicly sharing data and analysis results, and studies can be published to the AIRR Data Commons [9], which is not covered in this workflow. To publicly share data, please see the AIRR Community method chapter entitled "Data Sharing and Re-Use." Here, we discuss the different steps in the immune repertoire analysis workflow using VDJServer.

Fig. 1 Workflow for immune repertoire analysis on high-performance computing using VDJServer V1


### 3.2 Upload Files into Project

From the Upload and Browse Project Data page, click on the Upload button and select files to be uploaded from the local computer, from Dropbox, or from a URL (ftp/http); multiple files can be selected (see Note 2). Click the Start button to start uploading. Upload FASTQ sequence read files (compress with gzip for faster upload), FASTA files with barcode and primer sequences, TSV files containing metadata, or any other file to be associated with the project.
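The gzip recommendation above can be scripted when many read files need to be prepared. The following is a minimal sketch using only the Python standard library; the function names and the assumption that uncompressed reads end in `.fastq` are illustrative and not part of VDJServer.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(path: Path) -> Path:
    """Write a gzip-compressed copy of `path` next to it and return the new path."""
    target = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        # Stream the file in chunks so large read files are never held in memory.
        shutil.copyfileobj(src, dst)
    return target


def compress_all(directory: Path) -> list[Path]:
    """Compress every uncompressed *.fastq file found directly in `directory`."""
    return [gzip_fastq(p) for p in sorted(directory.glob("*.fastq"))]
```

Calling `compress_all(Path("reads"))` on a local folder (a hypothetical path) leaves the original files in place and returns the list of `.fastq.gz` files to select in the upload dialog.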

### 3.3 Set File Attributes and Link Paired-End Read Files

VDJServer attempts to detect the file type (FASTQ, FASTA, AIRR TSV, etc.) from the file extension, but this can be changed with the File Type setting. Use Barcode or Primer for files containing those sequences. For paired-end read sequencing files, set the Read Direction on each file to either the Forward or Reverse orientation, and then link the two files together on the Link Paired Read Files page. The forward orientation refers to the V gene end of the template, and reverse orientation refers to the J gene or constant region end of the template. Correct orientation is necessary for proper matching of barcodes and primers. Linked files will show together as a pair on the Upload and Browse Project Data page.

### 3.4 Preprocessing with VDJPipe or pRESTO

From the Upload and Browse Project Data page, select sequence read files for preprocessing by clicking on the checkbox next to each file. Click on the Run Job button and select either VDJPipe or pRESTO; a job submission screen will be displayed. VDJPipe and pRESTO have similar capabilities. pRESTO should be used for data with unique molecular identifiers (UMIs); otherwise, VDJPipe is significantly faster (up to 20 times) on larger data sets. A single workflow is available for pRESTO, while VDJPipe offers a number of customized workflows. VDJPipe's single-function workflows perform individual preprocessing steps, while the complete workflow performs all steps. If unsure about which filtering parameters to use for preprocessing, such as length or quality settings, it is useful to run VDJPipe's Sequence Statistics workflow, which visualizes the length, quality, and nucleotide distributions of the read data. The job submission screen provides parameters, with default values, that can be changed for the individual preprocessing steps (see Note 3). Finally, click the Launch Job button to submit the preprocessing job to the TACC supercomputer. The user will receive an email when the job is finished.

### 3.5 Review Preprocessing Statistics

When the preprocessing job is complete, the job on the View Analyses and Results page will change from an In Progress label to a View Output button. Click the button to show the View Output page, which has three main sections: job output files, analysis charts, and log files. Job Output Files provides a list of output files generated from the preprocessing job, Analysis Charts provides visualizations for pre- and post-filtering statistics, and Log Files contains job error logs and workflow provenance metadata. The provided visualizations include:


Use the Analysis Charts to review the preprocessing results; they show the pre- and post-filtering statistics, which indicate how preprocessing has affected the data. If preprocessing removed too many reads, or too few, run a new preprocessing job with looser or more stringent parameters, respectively. The job log files include summary logs that report the number of reads processed during each preprocessing step.
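When deciding how to loosen or tighten length and quality filters, it can help to see how many reads a candidate threshold pair would keep. The sketch below is an illustration only and is not how VDJPipe or pRESTO compute their statistics; it assumes a four-line-per-record FASTQ file with Phred+33 quality encoding, and the function names are hypothetical.

```python
import gzip
from statistics import mean


def open_text(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_stats(path, phred_offset=33):
    """Return one (length, mean Phred quality) tuple per read."""
    stats = []
    with open_text(path) as handle:
        while True:
            header = handle.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().strip()
            scores = [ord(char) - phred_offset for char in quality]
            stats.append((len(sequence), mean(scores) if scores else 0.0))
    return stats


def fraction_passing(stats, min_length, min_quality):
    """Fraction of reads that meet both the length and the mean-quality cutoff."""
    if not stats:
        return 0.0
    kept = sum(1 for length, quality in stats
               if length >= min_length and quality >= min_quality)
    return kept / len(stats)
```

Evaluating `fraction_passing` over a few candidate thresholds gives a rough idea of how many reads each setting would discard before a new job is submitted.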

### 3.6 Make Job Output Files Available in Project Data Area

Once satisfied with the preprocessing results, the appropriate job output files need to be made available in the project data area so that they can be selected as input for additional analysis jobs. This can be done in two ways. The first is on the View Analyses and Results page: click the Job Actions button for the job and select Include Job Output, which makes all output files available. Conversely, selecting Exclude Job Output from the Job Actions button removes all of the job's output files from the project data area. The second way is to make individual job output files available from the View Output page for the job by clicking the Make Available in Project Data Area button next to each file. Clicking that button again will remove the file from the project data area. Job output files available in the project data area will show in their own section on the Upload and Browse Project Data page, grouped together by job with the job name as a title.

### 3.7 Gene Annotation with IgBLAST

Select files for IgBLAST processing, either job output files or uploaded FASTA files, on the Upload and Browse Project Data page by clicking on the checkbox next to each file (see Note 4). Click on the Run Job button and select IgBLAST; a job submission screen will be displayed. Select the organism species (human, mouse, or rhesus macaque), the strain (if appropriate), and the sequence type (IG or TR). VDJServer maintains separate germline databases, so processing multiple sequence types and/or organism species requires running multiple IgBLAST jobs. Finally, click the Launch Job button to submit the IgBLAST job to the TACC supercomputer. The user will receive an email when the job is finished.

As with all analysis jobs on VDJServer, job status is shown on the View Analyses and Results page, and the job output is available with the View Output button. Multiple output formats are provided including VDJServer's custom RepSum TSV, VDJML, Change-O TSV, and AIRR TSV. Individual files can be downloaded by clicking on the filename, or all output files can be downloaded by clicking on Archive of Output Files in the log file section. It is recommended that AIRR TSV files be used for any custom analysis, as they contain the most comprehensive annotations and are interoperable with many AIRR-seq tools.
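As an illustration of custom analysis on a downloaded AIRR TSV file, the sketch below tallies V gene usage with the Python standard library. It assumes the file carries the standard AIRR Rearrangement columns `v_call` and `productive` (with booleans written as `T`/`F`); the function name and the choice to keep only the first listed call are assumptions of this example, not VDJServer behavior.

```python
import csv
from collections import Counter


def v_gene_usage(airr_tsv_path, productive_only=True):
    """Count V gene usage in an AIRR rearrangement TSV file."""
    counts = Counter()
    with open(airr_tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            productive = str(row.get("productive") or "").upper()
            if productive_only and productive not in ("T", "TRUE"):
                continue
            v_call = row.get("v_call") or ""
            if not v_call:
                continue  # no V gene assignment for this sequence
            # v_call may hold several comma-separated alleles; keep the first
            # and drop the allele suffix (IGHV1-69*01 -> IGHV1-69).
            gene = v_call.split(",")[0].split("*")[0]
            counts[gene] += 1
    return counts
```

The returned counter can be turned into frequencies or written back out as a table; the AIRR Community also distributes reference libraries for reading these files, which may be preferable for larger projects.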

### 3.8 Define MiAIRR Study Metadata and Repertoire Comparison Groups

By this point in the workflow, raw sequence data has been preprocessed, and sequences have been annotated. To get the greatest utility from repertoire analysis, it is recommended that metadata be entered and comparison groups be defined, though this is not strictly necessary because individual files can be analyzed in isolation. Entering metadata also has the benefit of providing MiAIRR compliance when it is time to publish the study. Metadata is entered on the Metadata Entry page and consists of the six MiAIRR components: study, subject, diagnosis, sample, cell processing, and nucleic acid processing. VDJServer adds a seventh component with sample groups for doing group comparisons. Metadata can be manually entered on the page, but it is typically more efficient to prepare the metadata in a separate spreadsheet file and then import that spreadsheet into VDJServer (see Note 5). To do this, go to the appropriate section on the Metadata Entry page, click on the Metadata Actions button, and select Export to File. Open the spreadsheet file in Excel or another program, use one row for each entry, and fill in the values for each column. Save the file as Tab-delimited Text and upload the file into the project. Finally, on the Metadata Entry page, click on the Metadata Actions button and select Import From File. A panel will be shown where the user can pick the file to import and choose to either replace or append the current metadata.
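For studies with many subjects or samples, the exported spreadsheet can also be filled in by script rather than by hand. The following is a rough sketch that assumes the exported file is tab-delimited with column names on its first line; the function name is hypothetical, and the actual column names must be taken from the file that VDJServer exports.

```python
import csv


def fill_template(template_path, rows, out_path):
    """Write `rows` (a list of dicts) under the header of an exported template."""
    with open(template_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    # Refuse columns the template does not define, so typos are caught
    # before the file is imported.
    unknown = {key for row in rows for key in row} - set(header)
    if unknown:
        raise ValueError(f"columns not in template: {sorted(unknown)}")
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=header,
                                delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Columns that a row does not supply are left empty, and the resulting file can be uploaded into the project and imported in the same way as a spreadsheet saved as Tab-delimited Text.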

Sample groups are a specialized feature of VDJServer that allows sample repertoires to be grouped together for performing intragroup and intergroup comparisons. Sample groups are defined by using one or more grouping operations. These grouping operations include:


### 3.9 Repertoire Characterization and Comparison with RepCalc

RepCalc performs a wide variety of analysis functions including clonal assignment, gene usage, gene combination usage, CDR3 length distribution and amino acid properties, CDR3 and clonal sharing and uniqueness, clonal abundance, diversity profile, and B cell–specific mutation analysis and clonal lineage. RepCalc uses a combination of tools to perform the analyses including VDJServer's custom repertoire summarization and Change-O, Alakazam, and SHazaM from the Immcantation suite.
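To give a sense of what a diversity profile reports, the sketch below computes Hill numbers, a standard family of diversity indices, from a list of clone sizes: order 0 is clonal richness, order 1 is the exponential of Shannon entropy, and order 2 is the inverse Simpson index. This is a generic textbook calculation offered as an illustration; it is not RepCalc's or Alakazam's implementation, which may differ (for example, in how repertoires are resampled), and the function names are hypothetical.

```python
import math


def hill_diversity(clone_sizes, q):
    """Hill number of order q for a list of clone sizes (sequence counts)."""
    total = sum(clone_sizes)
    if total <= 0:
        raise ValueError("clone_sizes must contain at least one positive count")
    freqs = [n / total for n in clone_sizes if n > 0]
    if math.isclose(q, 1.0):
        # The general formula is undefined at q = 1; its limit is exp(Shannon entropy).
        return math.exp(-sum(p * math.log(p) for p in freqs))
    return sum(p ** q for p in freqs) ** (1.0 / (1.0 - q))


def diversity_profile(clone_sizes, orders=(0, 1, 2)):
    """Map each diversity order q to its Hill number."""
    return {q: hill_diversity(clone_sizes, q) for q in orders}
```

For a repertoire of equally sized clones every order returns the number of clones, while a repertoire dominated by one expanded clone shows a profile that drops quickly as q increases.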

To run RepCalc, no files need to be selected on the Upload and Browse Project Data page; instead, RepCalc will directly access the appropriate output files from a previous IgBLAST job. Click the Run Job button and select RepCalc; a job submission screen will be displayed. Pick the IgBLAST job to use as input. If study metadata was defined, the screen will indicate its availability and automatically perform group comparison; otherwise, RepCalc will only perform analysis on individual files. Change the default values to include or exclude specific analysis functions. Finally, click the Launch Job button to submit the job to the TACC supercomputer. The user will receive an email when the job is finished.

### 3.10 Visualize Analysis Results and Download Data

When the RepCalc job has completed successfully, click the View Output button on the View Analyses and Results page to display the analysis results. For RepCalc jobs, the View Output page has three main sections: job output files, analysis charts, and log files. Job Output Files provides a list of clonal assignment output files, Analysis Charts provides analysis visualizations, and Log Files contains job error logs and workflow provenance metadata. RepCalc produces a set of interactive analysis charts:


Each chart provides three pop-up lists for selecting files, sample repertoires, or sample groups to be displayed on the chart. Chart figures can be downloaded by clicking on the Download Chart button, which will generate a figure identical to the chart being displayed in the browser, and the data for the chart can be downloaded by clicking on the Download Data button. Not all analysis output has an associated visualization, but all output can be downloaded by clicking on Archive of Output Files in the log file section, with the data provided in TSV format for easy import into Excel and other tools.

### 4 Notes

field to narrow the list and verify. Another technique is to upload the files in batches, e.g., 20 files at a time, and check after each batch that all the files got uploaded.


### References


a file format with tools for capturing the results of inferring immune receptor rearrangements. BMC Bioinformatics 17:333


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Data Sharing and Reuse: A Method by the AIRR Community

### Brian D. Corrie, Scott Christley, Christian E. Busse, Lindsay G. Cowell, Kira C. M. Neller, Florian Rubelt, and Nicholas Schwab, on behalf of the AIRR Community

### Abstract

High-throughput sequencing of adaptive immune receptor repertoires (AIRR, i.e., IG and TR) has revolutionized the ability to study the adaptive immune response via large-scale experiments. Since 2009, AIRR sequencing (AIRR-seq) has been widely applied to survey the immune state of individuals (see "The AIRR Community Guide to Repertoire Analysis" chapter for details). One of the goals of the AIRR Community is to make the resulting AIRR-seq data FAIR (Findable, Accessible, Interoperable, and Reusable) (Wilkinson et al. Sci Data 3:1–9, 2016), with a primary goal of making it easy for the research community to reuse AIRR-seq data (Breden et al. Front Immunol 8:1418, 2017; Scott and Breden. Curr Opin Syst Biol 24:71–77, 2020). The basis for this is the MiAIRR data standard (Rubelt et al. Nat Immunol 18:1274–1278, 2017). For long-term preservation, it is recommended that researchers store their sequence read data in an INSDC repository. At the same time, the AIRR Community has established the AIRR Data Commons (Christley et al. Front Big Data 3:22, 2020), a distributed set of AIRR-compliant repositories that store the critically important annotated AIRR-seq data based on the MiAIRR standard, making the data findable, interoperable, and, because the data are annotated, more valuable in its reuse. Here, we build on the other AIRR Community chapters and illustrate how these principles and standards can be incorporated into AIRR-seq data analysis workflows. We discuss the importance of careful curation of metadata to ensure reproducibility and facilitate data sharing and reuse, and we illustrate how data can be shared via the AIRR Data Commons.

Key words AIRR-seq, B-cell receptor, Immunoglobulin, T-cell receptor, FAIR data, Data sharing, Data reuse

### 1 Introduction

Once an adaptive immune receptor repertoire sequencing (AIRR-seq; see Table 1 of the "AIRR Community Guide to TR and IG Gene Annotation" chapter for a glossary of terms) experiment has been successfully designed and carried out (see the "AIRR Community Guide to Planning and Performing AIRR-seq Experiments" chapter) and the data have been processed and analyzed for the experimental purpose of the study (see the "AIRR Community Guide to Repertoire Analysis" chapter), it is necessary to consider how to report on and share the AIRR-seq data from that study according to FAIR data principles. The FAIR data principles, which state that data should be Findable, Accessible, Interoperable, and Reusable [1], provide a number of benefits to the research community and to individual researchers. The principles ensure that the generated data can easily be reused within the laboratory that generated the data, thus maximizing the potential of the data for that lab. Externally, the principles can increase the visibility and recognition of the work and thereby attract new partnerships within the research community and with policy makers (see Note 1). In the scientific community at large, the FAIR data principles support scientific transparency and reproducibility, increasing the rigor of scientific results. Additionally, they facilitate data reuse for the exploration of new questions, particularly for questions that benefit from the ability to integrate data originally generated in different studies.

Brian D. Corrie and Scott Christley share first authorship.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_23, © The Author(s) 2022

### Table 1 AIRR-seq-related data repository resources

### 1.1 Experimental Reporting: Minimal Information Standards

The primary purpose of the AIRR Community-endorsed MiAIRR Standard [4] is to establish a community-based standard for the recording and reporting of experimental results involving AIRR-seq data. The MiAIRR paper states that such a standard is considered necessary "... for the interpretation and comparison of AIRR-seq experiments conducted by different groups" and is a critical component of making AIRR-seq "interoperable" and "reusable" (the "I" and "R" in FAIR) (see Note 2). It is also critical for scientific transparency and reproducibility. The MiAIRR standard covers six high-level sets of data, including recommendations on how to capture data and metadata at the following levels: (1) study/subject/diagnosis, (2) sample collection, (3) sample processing and sequencing, (4) raw sequences, (5) data processing, and (6) sequence annotations. At each level, there is a set of metadata fields that are recommended for consideration when designing AIRR-seq studies and curating the metadata during the performance and reporting of such a study.

### 1.2 Data Sharing: Data Formats

Although minimal information standards are necessary, they are not sufficient to completely enable data sharing, interoperability, and reuse. Subsequent to the establishment of the MiAIRR Standards, the AIRR Community established a set of computable specifications and accompanying file formats that facilitate sharing of data and support analysis tool interoperability. This includes a file format for AIRR-seq rearrangement data [6] and a file format for study and repertoire metadata [5]. In addition, the AIRR Community has established a software certification process that provides a "badge"-based system for tool developers to certify that their software supports the AIRR Standards as specified on the AIRR Software Compliance web page (https://docs.airr-community.org/en/stable/swtools/airr\_swtools\_standard.html).

### 1.3 Data Sharing: AIRR Data Commons

Critical to the concept of FAIR is making AIRR-seq data "Findable" and "Accessible." The AIRR Community has established the AIRR Data Commons (ADC) [5], a network of geographically distributed AIRR-compliant repositories that adhere to the AIRR Standards. Of particular importance in creating the ADC is the establishment of the AIRR Data Commons web API (ADC API) [5] for finding, querying, and exploring data in AIRR-compliant repositories. The ADC API is a web-based query API that makes AIRR-seq studies and their associated annotated sequence data findable and accessible (the "F" and "A" in FAIR) (see Note 3 in regard to challenges in data discovery). Because the ADC API utilizes the MiAIRR data standard and AIRR file formats, the ADC also promotes and facilitates interoperability and data reuse (the "I" and "R" in FAIR), thereby supporting reproducibility, data integration, and meta-analysis. The ADC has grown by an order of magnitude since 2018, from just under 400 million annotated adaptive immune receptor rearrangements in late 2018 to its current size of five distributed repositories with over 60 studies, 6000 repertoires, and 4 billion rearrangements available for data exploration and download. Of the five distributed repositories, there are two community repositories in Canada (the iReceptor Public Archive (IPA) and iReceptor COVID-19 repositories managed by iReceptor [7]), one community repository in the United States (managed by VDJServer [8]), and two research group-specific repositories: the i3 AIRR repository at Sorbonne University in France and the VDJBase AIRR-seq repository at Bar Ilan University in Israel. We expect two new repositories to be added to the ADC in the near future, with ImmuneDB [9] at the University of Haifa, Israel, and sciReptor [10] at DKFZ in Germany working on implementations of the ADC API for their respective repositories.

The ADC can be searched using curl at the command line or interactively using a web user interface through the iReceptor Gateway [7]. In the near future, interactive search through the VDJServer web interface will also be possible [8].
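Command-line searches of this kind send a JSON query document to a repository's ADC API endpoint. The sketch below shows how such queries might be assembled and posted from Python instead of curl. It assumes the query syntax described in the AIRR Community's ADC API documentation (a `filters` object with `op` and `content` keys, posted to `/airr/v1/repertoire` or `/airr/v1/rearrangement`); the helper names, the base URL `https://adc.example.org`, and the repertoire identifiers are placeholders, and field support should be checked against each repository.

```python
import json
import urllib.request


def field_filter(field, value, op="="):
    """A single ADC API filter clause on one field."""
    return {"op": op, "content": {"field": field, "value": value}}


def and_filter(*filters):
    """Combine several filter clauses so that all of them must match."""
    return {"op": "and", "content": list(filters)}


def build_query(filters, fields=None, size=None):
    """Assemble the JSON body of an ADC API query."""
    query = {"filters": filters}
    if fields:
        query["fields"] = list(fields)
    if size is not None:
        query["size"] = size
    return query


def post_query(base_url, endpoint, query, timeout=60):
    """POST the query to <base_url>/airr/v1/<endpoint> and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/airr/v1/{endpoint}",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Repertoires whose PCR target locus is IGH.
repertoire_query = build_query(
    field_filter("sample.pcr_target.pcr_target_locus", "IGH"),
    fields=["repertoire_id", "study.study_id"],
)

# Rearrangements from two (placeholder) repertoires with a given V allele call.
rearrangement_query = build_query(
    and_filter(
        field_filter("repertoire_id", ["example-rep-1", "example-rep-2"], op="in"),
        field_filter("v_call", "IGHV1-69*01"),
    ),
    size=100,
)

# Example call (requires network access and a real repository URL):
# reply = post_query("https://adc.example.org", "repertoire", repertoire_query)
```

Because the same query document is meant to work against every ADC-compliant repository, one would typically loop `post_query` over a list of repository base URLs and merge the replies.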

As a result of the COVID-19 pandemic, the AIRR Community made a call for open data [11]. In response to this call, the iReceptor and VDJServer groups have collaborated with a number of researchers to curate publicly available COVID-19 AIRR-seq data in the ADC [3]. As of the first quarter of 2021, there are 13 studies, over 3500 repertoires, and over 1 billion rearrangements from COVID-19 studies available in the ADC.

### 2 Materials

### 2.1 Study Design and Reproducibility

A diverse range of clinically important conditions (including infections, vaccinations, autoimmune diseases, transplants, transfusion reactions, aging, and cancers) can lead to unique, measurable AIRR responses [12]. Therefore, AIRR-seq can be used not only to advance our understanding of disease and how the immune system responds but also to provide a unique opportunity for diagnostic and prognostic approaches. Analysis of AIRR-seq data provides the opportunity for advancing personalized medicine in the form of a highly multiplexed diagnostic tool, with the potential of a near-universal blood test. Sample processing, sequencing methodology, and bioinformatic analysis are all critical to generating reliable and meaningful data. However, bioinformatics is the key component when using AIRR-seq to approach clinical questions. While bioinformatic analysis is powerful enough for the current era of personalized immuno-medicine, data science, and data-driven patient care, there is a particularly high need for standardization and reproducibility. Even though the first AIRR-based applications are already in clinical use, for example, in leukemia and COVID-19 [13–15], many challenges remain before AIRR-seq-based blood testing can become a useful component in daily clinical practice. Many aspects in addition to the sample processing (please refer to the "AIRR Community Guide to TR and IG Gene Annotation," "AIRR Community Guide to Planning and Performing AIRR-Seq Experiments," and "AIRR Community Guide to Repertoire Analysis" chapters for details) need to be considered.

First, immune responses (particularly in investigations of infections or vaccine responses) follow very time-sensitive kinetics, with initial responses detectable within days and durations of weeks up to life-long. Similarly, sample choice including both tissue sources (e.g., peripheral blood or tumor tissue) and cell subset (e.g., memory B cells, effector T cells) will heavily influence results. In all cases, every aspect of any AIRR-seq based approach for diagnostic purposes will have to be rigorously validated, as potential regulatory approval will be contingent upon successful validation. Finally, obtaining meaningful data at scale requires a sample set with detailed clinical annotations, while still maintaining adherence to patient privacy directives and related legislation. Designing an AIRR-seq-based study with these key points in mind will ultimately determine the range of conclusions which can be obtained from it. The power of bioinformatic analysis depends not only on high quality sample processing and sequencing methods (see the "AIRR Community Guide to Planning and Performing AIRR-Seq Experiments" chapter) but also equally on careful study design.

In addition to validation for clinical uses, ensuring that bioinformatic analyses are reproducible is important for all research. This can be challenging, as analysis of AIRR-seq data is lengthy and requires the use of specialized software, with settings that can be project-specific and reference germlines that change over time as new alleles are discovered [2]. To make results reproducible, we recommend always recording the versions of the software and germline sets used, as well as the reasoning that led to the choice of one setting over another, in the metadata of the analysis, such as an AIRR-compliant Repertoire file (https://docs.airr-community.org/en/stable/datarep/metadata.html). Analysis environments such as VDJServer often capture analysis metadata automatically. For custom scripts, we recommend documenting the code and avoiding new names for fields that are already described in the AIRR Community standards.
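For custom scripts, one way to follow this recommendation is to write the provenance into a data processing record that uses existing MiAIRR field names. The sketch below builds such a record; the field names (`data_processing_id`, `primary_annotation`, `software_versions`, `germline_database`, `data_processing_protocols`) are taken from the AIRR Repertoire metadata schema as we understand it, while the function name and all example values are purely illustrative.

```python
import json


def data_processing_record(processing_id, software_versions,
                           germline_database, protocol_notes, primary=True):
    """Build one MiAIRR-style data processing entry for a Repertoire file."""
    return {
        "data_processing_id": processing_id,
        "primary_annotation": primary,
        # MiAIRR stores software versions as free text, so the
        # tool -> version mapping is flattened into one string.
        "software_versions": "; ".join(
            f"{tool} {version}"
            for tool, version in sorted(software_versions.items())
        ),
        "germline_database": germline_database,
        "data_processing_protocols": protocol_notes,
    }


# Illustrative values only; record whatever was actually used.
record = data_processing_record(
    processing_id="custom-analysis-1",
    software_versions={"IgBLAST": "1.17.1", "pRESTO": "0.7.0"},
    germline_database="IMGT reference set, downloaded 2021-03-01",
    protocol_notes="Minimum read quality 20 chosen after reviewing read statistics.",
)
print(json.dumps(record, indent=2))
```

Keeping the record next to the analysis output, or merging it into the `data_processing` list of the corresponding repertoire, means the information is already in a shareable form when the study is published.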

### 2.2 Software Tools

Many tools are available for AIRR-seq analysis [16–19]. Table 1 highlights several of the more commonly used programs that are free and open source and support standardized AIRR data representations, which facilitates data sharing via the ADC.

### 3 Methods

### 3.1 How to Share AIRR-Seq Data: General Information

The AIRR Community provides substantial documentation around the processes of sharing and finding AIRR-seq data. We summarize these processes below. For more detailed descriptions, please refer to the AIRR Community documentation web site (https://docs.airr-community.org/en/stable/standards/data\_submission.html). Refer to Notes 1 and 3 for the costs and benefits of sharing AIRR-seq data.

### 3.1.1 Curating Using MiAIRR

One of the most critical steps in sharing data is to ensure that studies are curated with appropriate terms as specified in the MiAIRR Standard. The AIRR Community has published the list of MiAIRR fields (https://docs.airr-community.org/en/stable/miairr/data\_elements.html) with field definitions, types, and examples. In addition, the Center for Expanded Data Annotation and Retrieval (CEDAR, https://metadatacenter.org) project has created a MiAIRR-compliant web user interface (CEDAR for AIRR, or CAIRR [20]; https://docs.airr-community.org/en/stable/cairr/overview.html) for capturing and entering AIRR-seq study metadata. Projects such as iReceptor [7] and VDJServer [8] provide tools for MiAIRR metadata curation as part of their platforms, ranging from template metadata spreadsheets to web interfaces for capturing MiAIRR study metadata.

### 3.1.2 Storing AIRR-Compliant Data in INSDC Repositories

In order to promote Open Science, the AIRR Community recommends that the source sequence data from studies be stored in a sustainable repository such as those maintained by the International Nucleotide Sequence Database Collaboration (INSDC) (e.g., NCBI SRA or EBI's ENA). The AIRR Community has collaborated with NCBI to create a protocol for storing MiAIRR-compliant study metadata in the NCBI resources [4]. This protocol maps MiAIRR field names to metadata in NCBI entities such as BioProject and BioSample. In addition, the CAIRR pipeline [20] supports and facilitates publication of MiAIRR-compliant metadata to the NCBI.

### 3.1.3 Publishing Your Data in the AIRR Data Commons

The primary difference between data in the INSDC repositories and the data in the ADC is that the ADC repositories store data that has gone through quality control and annotation pipelines as described in the "AIRR Community Guide to Repertoire Analysis" and "AIRR Community Guide to TR and IG Gene Annotation" chapters (see Note 5). The annotation process compares the expressed sequences to a reference database (see the "AIRR Community Guide to TR and IG Gene Annotation" chapter) and identifies the most likely V, D, and J genes that contributed to the expressed genes. By sharing this processed data, other researchers avoid having to rerun these computationally complex and sometimes expensive annotation pipelines. Additionally, this facilitates querying of AIRR-seq data based on the annotations, such as querying for sequences that use a particular V gene or contain a particular CDR3 sequence.

Although there are a number of large repositories in the ADC that curate data from a broad range of research groups, there are also multiple mechanisms by which data generators can themselves publicly share their AIRR-seq data into the ADC. Researchers can (1) collaborate with one of the existing ADC repositories to publish their data, (2) self-publish data into those ADC repositories that provide such a service, (3) install and manage their own ADC repository, or (4) implement the ADC API against an existing repository making their own repository ADC-compliant. These options are described in more detail below.

### Collaborating with an Existing ADC Repository Provider

A number of the large repository providers in the ADC (e.g., iReceptor, VDJServer) are community repositories that curate and store data on behalf of the community. Although it does not scale for these groups to curate and store data from the hundreds of AIRR-seq studies that are currently published each year, these large repository providers often collaborate with researchers to help them publish their AIRR-seq data. For example, in answer to the AIRR Community's call for sharing COVID-19 data, both the iReceptor and VDJServer repositories collaborated with a number of research groups to curate and share their COVID-19 studies [3]. Additionally, users who have analyzed their data using the VDJServer analysis portal can work with the VDJServer team to directly share their project data into the ADC.

### Installing and Running Your Own ADC-Compliant Repository

Developing, installing, and managing an AIRR-seq repository to facilitate data sharing can be challenging, but for some groups, this may be the best option. For example, large research groups that manage and process their own data and have the bioinformatics and technical expertise to manage database platforms may want to manage their own repository. In addition, groups that want to more closely manage the stewardship of their data (due to ethics requirements) may also want to operate their own AIRR-compliant repository.

To enable this, the iReceptor Project has developed a software stack called the iReceptor Turnkey [7] that is designed to make the download, installation, and management of an AIRR-compliant repository as straightforward as possible. The software is open source and uses container-based software management (Docker) to implement an AIRR-compliant database, a data curation service, and an ADC API web service for querying the database. As a result, a research group can download and install the software, curate their data, and easily have an AIRR-compliant repository that can participate as a member in the ADC. Such a repository would then automatically be searched by tools that search the ADC, such as the iReceptor Gateway. Currently, the iReceptor COVID-19, the i3 AIRR Sorbonne University repository, and the VDJBase Bar Ilan repository are all using the iReceptor Turnkey software as the platform for their ADC repositories. For more information on installing an iReceptor Turnkey, see Subheading 3.3.4.

### Implementing the ADC API in an Existing Repository

Several groups have preexisting repositories that already contain AIRR-seq data, having been developed prior to or in parallel with the ADC [9, 10, 21–24]. In order to interoperate with repositories in the ADC, it is necessary to perform a data transformation to bring the data and metadata into compatible formats. Alternatively, a repository can implement the ADC API, thereby avoiding the need to transform data. Although implementing such an API is not a trivial task, if a research group has a significant investment in an existing repository technology and wants to add their data to the ADC, this is one practical option. The AIRR Community has developed a reference implementation for the ADC API (https://github.com/airr-community/adc-api). This implementation provides a JavaScript-based ADC API web implementation that performs simple queries against an AIRR-compliant MongoDB repository. This provides a framework for implementing the ADC API against an existing repository. In addition, the AIRR Community provides an extensive suite of test queries (https://github.com/airr-community/adc-api-tests) to help implementers ensure that their ADC API implementation is compliant with the ADC API specification. Although nontrivial, we know through iReceptor and VDJServer that this approach works, and indeed a number of the repositories described above are currently working on implementations of the ADC API (e.g., ImmuneDB [9] and sciReptor [10]). We expect them to be searchable as part of the ADC in the near future.

3.1.4 Sharing Data Through a Non-ADC but AIRR-Compliant Repository All ADC repositories are AIRR-compliant, but it is possible to be compliant with the AIRR formats and not be part of the ADC. The key difference between AIRR Standards-compliant repositories and those that are part of the ADC is that ADC repositories implement the ADC API for queries, while AIRR-compliant repositories do not. AIRR-compliant repositories store valuable and useful annotated AIRR-seq data in an AIRR-compliant format and use the AIRR Standard file formats for data exchange, but cannot be directly queried by external clients using the ADC API. There are a number of repositories of this type. For example, the Observed Antibody Space (OAS) [21] repository has over 1 billion annotated sequences from a number of AIRR-seq studies that can be queried and downloaded. Data from OAS is interoperable with data from repositories in the ADC, but ADC queries do not work against the OAS repository.

In addition, it is possible to download, install, and curate your own data into an AIRR-compliant repository. ImmuneDB (http://immunedb.com/) [9] allows the curation of AIRR-seq study data and the sharing of that data through a web interface or password-protected MySQL repository. Installation through the Docker image is simple, and once installed, ImmuneDB allows users to annotate raw data, load previously annotated data, and share that data using its web interface. ImmuneDB uses AIRR-compliant file formats for data exchange and is currently working on implementing the ADC API for queries. We expect these services to be provided as part of the ADC in the near future.

### 3.2 Finding Data in the AIRR Data Commons

There are two primary mechanisms for finding, downloading, analyzing, and reusing data in the ADC: using the ADC API or using a web-based user interface.

3.2.1 Using the ADC API The ADC API [5] is the primary mechanism to search the ADC and is required for a repository to be part of the ADC. The ADC API specifies a rich query language that allows researchers to pose complex queries across all keywords defined in the MiAIRR data standard. The same query will work on all ADC-compliant repositories, providing a consistent mechanism to identify data sets of interest. Queries can be made against repertoire metadata at the study, subject, and sample level, including how the sample was obtained, prepared for sequencing, and processed after sequencing, via the ADC web API repertoire query end point. Queries can be made at the sequence annotation or rearrangement level, such as for V, D, or J gene annotations, via the ADC web API rearrangement query end point. Once a set of repertoires of interest is identified (e.g., all repertoires generated using primers that target IGH genes), it is possible to filter the rearrangements from those repertoires based on sequence annotation fields, such as for specific V gene calls or CDR3 amino acid sequences. Finally, once a data set is identified, it is possible to download that data in an AIRR-compliant file format.

Fig. 1 Using the AIRR Data Commons API. Searching two repositories in the AIRR Data Commons for all human repertoires

An example query, which searches two of the repositories in the ADC for all human repertoires, is given in Fig. 1. The AIRR Community provides documentation for using the ADC API (https://docs.airr-community.org/en/stable/api/adc\_api.html), including a number of additional example queries.
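As a rough illustration of the two query levels described above, the following Python sketch assembles a repertoire-level filter and a rearrangement-level filter as plain dictionaries and serializes them to JSON. The helper names (`field_filter`, `all_of`, `rearrangement_query`) are hypothetical, the repertoire ID is a placeholder, and `v_call` is assumed here to be the Rearrangement field that holds the V gene call; the ADC API documentation remains the authoritative reference for fields and operators.

```python
import json


def field_filter(op, field, value):
    """Build a single ADC API filter clause (hypothetical helper)."""
    return {"op": op, "content": {"field": field, "value": value}}


def all_of(*clauses):
    """Combine several clauses with the logical 'and' operator."""
    return {"op": "and", "content": list(clauses)}


# Repertoire-level query: repertoires whose PCR target locus is IGH.
repertoire_query = {
    "filters": field_filter("=", "sample.pcr_target.pcr_target_locus", "IGH")
}


def rearrangement_query(repertoire_id, v_call, size=10):
    """Rearrangement-level query restricted to one repertoire and one V call."""
    return {
        "filters": all_of(
            field_filter("=", "repertoire_id", repertoire_id),
            field_filter("=", "v_call", v_call),
        ),
        "size": size,
    }


# Placeholder values, for illustration only.
example = rearrangement_query("example-repertoire-id", "IGHV1-69*01")
print(json.dumps(repertoire_query, indent=2))
print(json.dumps(example, indent=2))
```

The first dictionary would be sent to a repertoire end point and the second to a rearrangement end point of the same repository.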

There are many repositories in the ADC, and when using the ADC API, it is necessary to query each one independently and federate the query results. As such, use of the ADC API is targeted at users who are comfortable writing code that uses web API queries. Subheading 3.4 contains examples of how to query VDJServer and iReceptor from the command line, Python, and R.
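The federation step might look something like the Python sketch below, which sends the same repertoire query to a user-maintained list of repository base URLs and merges the responses. The function names are hypothetical, and the sketch assumes that each response carries its records under a `Repertoire` key, as in the examples in Subheading 3.4. The `send` argument exists only so that the network call can be replaced, for example during testing.

```python
import json
import urllib.request


def post_json(url, query):
    """POST a query (a dict) as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def federate_repertoires(base_urls, query, send=post_json):
    """Send one repertoire query to every repository and merge the results.

    Each returned repertoire is tagged with the repository it came from.
    Repositories that fail are recorded rather than aborting the search.
    """
    merged, failures = [], {}
    for base_url in base_urls:
        try:
            response = send(base_url.rstrip("/") + "/repertoire", query)
        except Exception as error:  # network or decoding problem
            failures[base_url] = str(error)
            continue
        for repertoire in response.get("Repertoire", []):
            merged.append({"repository": base_url, "repertoire": repertoire})
    return merged, failures
```

For instance, calling `federate_repertoires(["https://vdjserver.org/airr/v1", "http://covid19-1.ireceptor.org/airr/v1"], query)` would query the two repositories used in Subheading 3.4; the list itself has to be maintained by the user.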

3.2.2 Using a Web-Based User Interface Web-based user interfaces (UIs) for the ADC are targeted at the more general AIRR-seq data user. User interfaces such as the iReceptor Gateway [7] are designed to hide the complexity of the fact that the user is querying multiple, international repositories. Web-based UIs typically implement a specific workflow, through web-based forms and menus, providing the user with the ability to issue complex queries across the entire ADC. These queries are usually targeted at a specific scientific use case and workflow, allowing the web-based UI to optimize the queries performed and to provide a simple UI for that specific use case.

For example, the iReceptor Gateway currently implements two data exploration workflows:

1. One that allows the user to generate complex queries that span the rich study, subject, sample, and processing metadata of the MiAIRR standard to find specific repertoires of interest across thousands of repertoires.

2. One that allows users to generate simple sequence annotation queries across gene and other sequence annotation fields to find specific sequences of interest across billions of sequences.

Queries are sent out by the iReceptor Gateway to each repository in the ADC, then results are federated and presented in a manner that helps the user find data of interest.

Typical workflows might be to iteratively search for data of interest from the entire ADC. For example, a researcher might start by limiting the data to all subjects that were diagnosed with COVID-19, then search for IG data sets (IGH, IGK, or IGL data), and finally refine that search to those data sets that only have paired IG heavy and light chain data (see Note 6 for more discussion on finding data of interest). For this example, each step is accomplished by a few UI interactions (choosing menu items, typing in keywords), drilling down from over 6000 repertoires, 60 studies, and over 4 billion annotated sequences to the two such studies currently included in the ADC, which together comprise approximately 245,000 annotated sequences from 87 repertoires found in two international repositories. Once data of interest are found, the user can visualize a number of statistics on each repertoire (such as V, D, or J gene usage and CDR3 length distribution), search for a sequence annotation feature (such as search for a specific V, D, or J gene or CDR3), or request that the iReceptor Gateway federate and download the annotated sequence data and the repertoire metadata for further analysis and reuse (see Note 7 for considerations when combining data from different studies). A screenshot of such a query, as implemented in the iReceptor Gateway, is given in Fig. 2.
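The per-repertoire statistics mentioned above can also be computed locally once annotated sequences have been downloaded. The Python sketch below is one possible approach, not the iReceptor Gateway's implementation: it assumes each rearrangement is a dictionary with `v_call` and `junction_aa` fields, and the three toy records are invented for illustration. The junction length counted here covers the CDR3 together with its two conserved flanking residues.

```python
from collections import Counter


def v_gene(v_call):
    """Reduce a V call such as 'IGHV1-69*02,IGHV1-69D*01' to its first gene name."""
    first_call = v_call.split(",")[0].strip()
    return first_call.split("*")[0]


def repertoire_summary(rearrangements):
    """Count V gene usage and junction amino acid lengths."""
    gene_usage = Counter()
    junction_lengths = Counter()
    for rearrangement in rearrangements:
        if rearrangement.get("v_call"):
            gene_usage[v_gene(rearrangement["v_call"])] += 1
        if rearrangement.get("junction_aa"):
            junction_lengths[len(rearrangement["junction_aa"])] += 1
    return gene_usage, junction_lengths


# Invented toy records, for illustration only.
toy = [
    {"v_call": "IGHV1-69*01", "junction_aa": "CARDYW"},
    {"v_call": "IGHV1-69*02,IGHV1-69D*01", "junction_aa": "CARGGYW"},
    {"v_call": "IGHV3-23*01", "junction_aa": "CAKDYW"},
]
usage, lengths = repertoire_summary(toy)
print(usage.most_common())
print(sorted(lengths.items()))
```

Taking only the first of several comma-separated calls is a simplification; an analysis that cares about ambiguous gene assignments would need to handle them explicitly.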


Fig. 2 iReceptor Gateway

Fig. 3 Workflow sharing of AIRR-seq data

### 3.3 Methods for Sharing of AIRR-Seq Data

The AIRR Community provides processes for sharing AIRR-seq data. The following subsections provide detailed instructions for some of the common processes. An overview is given in Fig. 3.

3.3.1 Submission of AIRR-Seq Data to NCBI BioProject, BioSample, and SRA

At a minimum, raw sequence read data should be submitted to an INSDC repository such as NCBI for long-term archival along with study and sample metadata. The normal NCBI submission form is used, but, in place of the standard NCBI spreadsheets, the AIRR Community XLS spreadsheets, which contain MiAIRR 1.0 compliant data elements, are used:


Download these spreadsheets from the AIRR Community GitHub repository (https://github.com/airr-community/airr-standards/tree/master/NCBI\_implementation/templates\_XLS), fill them out with sample and sequencing run metadata, and then upload them as part of the NCBI submission. Here are the detailed steps:


When your submission is published, the BioSample and SRA records will contain all of the MiAIRR metadata provided in the spreadsheets.

Even with raw sequencing data available in an INSDC repository, it is not immediately useful as the data needs to be preprocessed and annotated before it can be used for immune repertoire analysis. It is strongly encouraged to make postprocessed data (annotated sequences, clonal analysis, clonal lineage) available. One method is to publicly share the data so that it can be downloaded; Subheading 3.3.2 describes how to do that with the VDJServer Community Data Portal. The other method is to publish your AIRR-seq data in the ADC; Subheading 3.3.3 describes publishing in VDJServer's repository, and Subheading 3.3.4 describes running your own repository using the iReceptor Turnkey.


Create Project Click on the Add Project button to create a new project and give the project a name. To help users identify the project, use a descriptive name such as the title of the study publication. After the project is created, go to the Metadata Entry page and fill out the Project/ Study Metadata with a long study description (e.g., abstract of paper), PI and contact information, grant information, publication identifiers (e.g., Pubmed ID), and the BioProject ID. Click the Save Project Metadata button to save changes. It is not necessary to enter metadata for the other sections.

Upload Files into Project From the Upload and Browse Project Data page, click on the Upload button and select files from the local computer, from Dropbox, or from a URL (ftp/http), to be uploaded; multiple files can be selected. Click the Start button to start uploading.

Publish Project On the Project Settings page, copy/paste the VDJServer UUID. This is a long identifier with numbers, letters, and dashes that uniquely identifies the project. Provide this UUID in the Data Availability section of the publication so users can directly search and find the project. Finally, click the Project Actions button and select Publish Project. This will initiate publishing, and one will receive an email when the project is publicly available. Changes cannot be directly made to a published project, but as project owner one can unpublish the project at any time to correct information or files. On the Community Data page, find the project and go to the Project Settings page. Click the Project Actions button and select Unpublish Project; one will receive an email when the project has been unpublished. Make the necessary corrections to the project then publish it to make it publicly available again.

3.3.3 Publish the AIRR-Seq Study in the ADC with VDJServer Publishing the AIRR-seq study in the ADC with VDJServer is not a completely automated process; there are a number of manual validation steps that need to be performed. Furthermore, loading the data into the repository database can take hours, days, or even a week depending upon the size of the data; therefore, the load process is initiated by a VDJServer administrator. The basic requirements include:


3.3.4 Install a Local Repository with the iReceptor Turnkey

The iReceptor Turnkey [7] is a self-contained AIRR-compliant database platform that makes it easy for a research group to curate and share their own data. The iReceptor Turnkey software (https://github.com/sfu-ireceptor/turnkey-service-php/blob/master/README.md) is open source and is available for download via GitHub. The software uses Docker containers to manage the installation and includes a container for the repository itself (MongoDB), a container for the web service that implements the ADC API to query the repository, and a container to load data into the repository. It assumes that one is installing the software on a Unix platform and has appropriate privileges to install software.

Installation is a simple four-part process. More detailed instructions for installing and managing an iReceptor Turnkey repository are available on the iReceptor Turnkey GitHub site.

Download the Software Downloading the software is straightforward with the following command:

```
git clone --branch production-v3 https://github.com/sfu-ireceptor/turnkey-service-php.git
```
This downloads the v3.0 production release (June 2020), which includes the Docker configuration files and the basic commands that one uses to control and manage the repository.

Install the Software Installing the software is simple through a provided installation script. Note that this installation script installs Docker and docker-compose and downloads multiple Docker images from Docker Hub. Total time estimate: 5–10 min.

cd turnkey-service-php

scripts/install\_turnkey.sh

After this step, the script should report that the system is installed and running. If the software is running correctly, one should be able to query the repository using typical command line URL software such as curl. Because no data has yet been loaded, the query below will return an empty data set.

curl --data "{}" "http://localhost/airr/v1/repertoire"

Loading AIRR-Seq Data This is typically a two-step process; first it is necessary to load repertoire metadata that describes the study, subject, and samples that are in the study. The input file can either be an AIRR Repertoire JSON file or a simple comma separated file with each column header mapping to an AIRR Standard field name. We have provided some simple test data to test out the installation. To load the test repertoire metadata, simply issue the following command:

scripts/load\_metadata.sh ireceptor test\_data/PRJNA330606\_Wang\_1\_sample\_metadata.csv

One can check that it worked with the following command once again:

curl --data "{}" "http://localhost/airr/v1/repertoire"

One should now see a single repertoire returned as a JSON object.

Next, load a set of sequence annotation files. There is typically one set of AIRR-seq annotation files loaded for each row in the metadata file loaded above. Again, an example sequence annotation file with 1000 rearrangements generated using the MiXCR annotation tool is provided and can be loaded using the following command:

scripts/load\_rearrangements.sh mixcr test\_data/SRR4084215\_aa\_mixcr\_annotation\_1000\_lines.txt

Finally, check that the rearrangements were loaded correctly.

curl --data "{}" "http://localhost/airr/v1/rearrangement"

One now has a running AIRR-compliant repository, containing one repertoire with 1000 sequence annotations loaded.
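The same check can be scripted. The Python sketch below mirrors the curl commands above by posting an empty query (`{}`) to an end point and counting the records in each list-valued key of the response. The function names are hypothetical, and the sketch assumes that the response is a JSON object with `Repertoire` or `Rearrangement` lists alongside an `Info` object.

```python
import json
import urllib.request


def fetch_all(base_url, endpoint):
    """POST an empty query ('{}') to an ADC API end point and decode the JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/" + endpoint, data=b"{}"
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(payload):
    """Return the number of records under each list-valued key of a response."""
    return {
        key: len(value) for key, value in payload.items() if isinstance(value, list)
    }
```

With the Turnkey running and the test data loaded, `summarize(fetch_all("http://localhost/airr/v1", "repertoire"))` would be expected to report a single repertoire, and the same call against the `rearrangement` end point should report a non-empty list.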

Domain Name, Security, and Public Access to Your Repository

The first three steps provide an AIRR-compliant repository for local access. If the repository machine has a publicly accessible IP address, it can be added to the iReceptor Gateway to test and verify that the AIRR-seq data can be queried. Contact the iReceptor team (support@ireceptor.org) to enable the repository in the iReceptor Gateway.

To make the repository publicly accessible on the Internet, it should be given a domain name and an SSL certificate for https access. A domain name (e.g., repository.example.org) can be acquired through any number of companies, but one should contact their own institution regarding domain name policies as the institution may be able to provide and manage a domain name. Acquiring an SSL certificate is highly recommended because many modern browsers have restrictive policies that will prevent users from accessing the repository through a nonsecure connection. Similar to the domain name, an SSL certificate can be acquired through any number of companies, but one should contact their own institution as they may be able to acquire and manage SSL certificates.

Finally, the repository can be added to the list of repositories in the ADC on the AIRR Community documentation website (https://docs.airr-community.org/en/stable/api/adc.html). File an issue at the AIRR Standards GitHub (https://github.com/airr-community/airr-standards) with information about the repository to get it added to the list.

### 3.4 Methods to Query AIRR-Seq Data in the ADC

The ADC API provides a programmatic method for accessing data from the ADC. This is possible through any programming language or tool that supports web queries. Examples are given below for performing queries from the Unix command line, Python, and R. Note that the ADC consists of a large number of repositories, and when using the ADC API, it is necessary to send the query of interest to all repositories in the ADC if one wants all data in the ADC that meets the query constraints. The examples below use only a single repository. For a complete list of ADC repositories please refer to the AIRR Community ADC web page (https://docs.airr-community.org/en/stable/resources/adc\_support.html) or the AIRR Data Commons repository list on FAIRsharing.org (https://fairsharing.org/biodbcore/?q=AIRR).

Queries are sent to ADC API repositories using the ADC API query language (in JSON format) and the responses from the API are provided in JSON as well. For more information on using the ADC API, please refer to the AIRR Community ADC API web page (https://docs.airr-community.org/en/stable/api/adc\_api.html).

3.4.1 Using the UNIX Curl Command The Unix curl command can be used to send a query to any ADC-compliant repository. For example, the following command asks the iReceptor repository at http://covid19-1.ireceptor.org for all repertoires from subjects whose species is the Homo sapiens ontology ID:

```
curl -s --data \
  '{"filters": {"op":"=", "content":
    {"field":"subject.species.id", "value":"NCBITAXON:9606"}}}' \
  http://covid19-1.ireceptor.org/airr/v1/repertoire
```
3.4.2 Query ADC API with R A more complex example, using the R programming language and querying the same repository but looking for subjects with a COVID-19 disease diagnosis and specifically IG data, is given below. In addition, this example also takes the results of the initial repertoire query to retrieve a small number of rearrangements from a single repertoire that was returned in the first query.

```
# Load required libraries
library(yaml)
library(httr)
library(dplyr)
library(jsonlite)
# Find Covid-19 repertoires
repertoire_api <-
  'http://covid19-1.ireceptor.org/airr/v1/repertoire'
# Set up a query for IG sequences from humans with COVID-19
# See https://docs.airr-community.org/en/stable/datarep/metadata.html
# for further information on the fields and values
query_repertoires <- '{
   "filters":{
      "op":"and",
      "content": [
        {
            "op":"=",
            "content": {
                 "field":"subject.species.id",
                 "value":"NCBITAXON:9606"
            }
        },
            {
            "op":"in",
            "content": {
        "field":"sample.pcr_target.pcr_target_locus",
                 "value": ["IGH",
                         "IGK",
                         "IGL"]
            }
            },
            {
            "op":"=",
            "content": {
        "field":"subject.diagnosis.disease_diagnosis",
                 "value":"DOID:0080600"
            }
            }
      ]
   }
}'
repertoires_response <-
  POST(url = repertoire_api, body = query_repertoires)
repertoires <-
  jsonlite::fromJSON(
      httr::content(
      repertoires_response,
      as = "text",
      encoding = "UTF-8"),
      simplifyDataFrame = TRUE)
selected_repertoires_id <-
  unique(repertoires$Repertoire$repertoire_id)
```
Once the repertoire\_id is known, it is possible to use a loop to retrieve the sequence data. This example shows how to retrieve three sequences from the first Covid-19 repertoire returned from the previous query:

```
# API's Rearrangement endpoint url
rearrangements_api <-
  "http://covid19-1.ireceptor.org/airr/v1/rearrangement"
# Prepare the query
rearrangement_query <-
  paste0(
    '{"filters": {"op":"=","content": {"field":"repertoire_id","value":"',
    selected_repertoires_id[1],
    '"}},"size":3}'
  )
# Submit the query
rearrangement_response <- POST(rearrangements_api,
                               body = rearrangement_query)
# Parse the response
rearrangement <-
  jsonlite::fromJSON(
    httr::content(
      rearrangement_response,
      as = "text",
      encoding = "UTF-8"),
    simplifyDataFrame = TRUE
  )
# Explore the response:
# General information
rearrangement$Info
# Inspect the first 3 rows and columns of the Rearrangement
rearrangement$Rearrangement[1:3, 1:3]
```
3.4.3 Query ADC API with Python A slightly different query using the Python programming language is provided below. This queries the VDJServer repository (https://vdjserver.org) for all repertoires that contain TRB data from a specific study (with Study ID PRJNA300878) and then writes that data to a file. A second query in this example downloads 1000 productive rearrangements from a single repertoire from the same repository.

```
import airr
import requests
# This study is stored at VDJServer data repository
host_url = 'https://vdjserver.org/airr/v1'
#
# Query the repertoire endpoint
#
# POST data is sent with the query. Here we construct an object for
# the query ((study_id == "PRJNA300878") AND (locus == "TRB"))
query = {
    "filters": {
        "op": "and",
        "content": [
            {
                "op": "=",
                "content": {
                    "field": "study.study_id",
                    "value": "PRJNA300878"
                }
            },
            {
                "op": "=",
                "content": {
                    "field": "sample.pcr_target.pcr_target_locus",
                    "value": "TRB"
                }
            }
        ]
    }
}
# Send the query
resp = requests.post(host_url + '/repertoire', json=query)
# The data is returned as JSON, use AIRR library to write out data
data = resp.json()
airr.write_repertoire('repertoires.airr.json',
                      data['Repertoire'], info=data['Info'])
# Construct a query to retrieve the 1000 productive sequences from the
# repertoire with repertoire_id == 2603354229190496746-242ac113-0001-012
query = {
    "filters": {
        "op": "and",
        "content": [
            {
                "op": "=",
                "content": {
                    "field": "repertoire_id",
                    "value": "2603354229190496746-242ac113-0001-012"
                }
            },
            {
                "op": "=",
                "content": {
                    "field": "productive",
                    "value": True
                }
            }
        ]
    },
    "size": 1000,
    "from": 0
}
# Send the query
resp = requests.post(host_url + '/rearrangement', json=query)
data = resp.json()
rearrangements = data['Rearrangement']
```
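The `size` and `from` parameters used above can also be combined to page through a larger result set. The sketch below shows one way to do this; the function name and the injected `send` callable are hypothetical, and a repository may cap the number of records returned per request, in which case a smaller `page_size` would be needed.

```python
def iter_rearrangements(send, url, filters, page_size=1000, max_records=None):
    """Yield rearrangements page by page using the 'from' and 'size' parameters.

    `send(url, query)` must POST the query and return the decoded JSON response.
    """
    start, yielded = 0, 0
    while max_records is None or yielded < max_records:
        query = {"filters": filters, "from": start, "size": page_size}
        page = send(url, query).get("Rearrangement", [])
        for record in page:
            if max_records is not None and yielded >= max_records:
                return
            yield record
            yielded += 1
        if len(page) < page_size:
            return  # last (possibly empty) page reached
        start += page_size
```

With the `requests` library used in the example above, `send` could be `lambda url, query: requests.post(url, json=query).json()`, and `filters` would be the `"filters"` object of the second query.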
4 Notes

1. The "research value" of data: Even though the ADC contains over 4 billion annotated sequences from over 6000 repertoires and 60 studies, this is a small fraction of the AIRR-seq data that has been produced. As part of its AIRR-seq data curation effort, the AIRR Community has been attempting to document publicly available AIRR-seq data sets (data that are available in SRA/ENA repositories) on the B-T.cr web site (https://b-t.cr/t/publicly-available-airr-seq-data-sets-bcells/470). Currently, there are 110 B-cell and 109 T-cell studies listed with known publicly available AIRR-seq data. Only 60 of these 219 studies are currently available in the ADC. As a result, it is likely that a researcher looking for a very specific type of AIRR-seq data (e.g., data from a rare disease with certain subject characteristics) may not be able to find it. This limit is primarily driven by the fact that not enough data has been shared in an easily usable form.

It is no surprise that the data available in the ADC reflects the priorities of researchers who took the time to curate this data. For example, the iReceptor Public Archive (IPA) repositories are currently cancer focused, as the iReceptor project chose cancer AIRR-seq studies as an important and broadly valuable resource for the general community. These studies were primarily curated in 2018 and 2019. Similarly, in response to the COVID-19 pandemic, the AIRR Community made a significant effort to curate COVID-19 AIRR-seq studies, resulting in 13 studies, 3500 repertoires, and over 1 billion sequence annotations being made available in 2020 from COVID-19 patients.

This is unfortunately a dilemma, with the costs of curating data for sharing being balanced against the rewards and value of that data, in turn driving the amount and type of data that is currently available for reuse. For example, the importance and value of COVID-19 data and its reuse in terms of fighting the COVID-19 pandemic has compelled researchers to share their data, even in some cases in the preprint stage. This, combined with standards for sharing being in place (the AIRR Standards) and readily available resources in which to store and share this data (e.g., iReceptor and VDJServer), has made COVID-19 data the single largest available disease diagnosis in the ADC [3]. This illustrates that by working toward a common goal of data accessibility, it is possible to resolve the conflicting costs of data sharing against the value of that shared data, leading to an increase in data reuse.

2. Interoperability and reusability: Although the AIRR Standards provide mechanisms to enable reuse of AIRR-seq data, challenges remain. These standards have varying levels of rigor applied to the AIRR fields. For example, many fields are defined as ontology terms, coming from well-known and widely used ontologies such as the NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy) and the Disease Ontology (https://disease-ontology.org). Other terms come from AIRR-defined controlled vocabularies (e.g., pcr\_target\_locus, as used above in the ADC query API, comes from a controlled vocabulary of IGH, IGI, IGK, IGL, TRA, TRB, TRD, TRG). Such rigor makes it possible to be very precise about sharing, exploring, and reusing data across studies. In particular, it is possible to compare such metadata computationally, taking away the need for an expert to interpret such fields.

At the same time, it is not possible to be so precise about all fields. The AIRR-seq world is evolving rapidly, and agreement around rigor on all terms in the standard is not feasible and in many cases is not desirable. Researchers need flexibility to describe parts of their study in ways that we are not currently able to capture precisely. As the domain matures, more rigor will be applied to more fields in the standard, but in some cases, it is still challenging to compare two repertoires in an interoperable way. This is most evident in the fields that describe data processing. Given the complexities discussed around preprocessing, annotation, and analysis of AIRR-seq data, standardization around these fields and their contents is yet to be determined. Although the MiAIRR data standard has fields to capture this information, the community is working toward a concise specification for describing these processes in detail.
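The computational comparison of ontology and controlled-vocabulary fields described in this note can be sketched in a few lines of Python. The vocabulary below is the pcr\_target\_locus list given above; the nested `subject`/`species` layout follows the `subject.species.id` field used in the curl query in Subheading 3.4.1, while the `label` values, the function names, and the two example records are assumptions made for illustration.

```python
# Controlled vocabulary for pcr_target_locus, as listed in this note.
PCR_TARGET_LOCI = {"IGH", "IGI", "IGK", "IGL", "TRA", "TRB", "TRD", "TRG"}


def invalid_loci(loci):
    """Return the values that are not in the pcr_target_locus vocabulary."""
    return sorted(set(loci) - PCR_TARGET_LOCI)


def same_species(repertoire_a, repertoire_b):
    """Compare two repertoires by species ontology ID rather than by free text."""
    id_a = repertoire_a["subject"]["species"]["id"]
    id_b = repertoire_b["subject"]["species"]["id"]
    return id_a == id_b


# Invented example records: the labels differ, but the ontology IDs agree.
human_1 = {"subject": {"species": {"id": "NCBITAXON:9606", "label": "Homo sapiens"}}}
human_2 = {"subject": {"species": {"id": "NCBITAXON:9606", "label": "human"}}}

print(invalid_loci(["IGH", "igk", "TCRB"]))
print(same_species(human_1, human_2))
```

Comparing identifiers instead of labels is what removes the need for an expert to decide whether two differently worded entries describe the same thing.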

3. Knowing where to look for data: When working at the ADC API level, one challenge in searching for data across the ADC is knowing which repositories are actually part of the ADC. Currently, there is no central registry that provides programmatic access to a list of the repositories in the ADC; at present, the main resource providing this information is the AIRR Community ADC documentation page (https://docs.airr-community.org/en/stable/resources/adc\_support.html), which provides links to the existing repository providers. The AIRR Community also maintains a registry of repositories on FAIRsharing.org (https://fairsharing.org/biodbcore/?q=AIRR). If searching for data in the ADC using a programmatic interface (Python, R), then it is necessary for the end user to manage the list of repositories and send queries to the appropriate repositories as required.

Web-based user interfaces, such as the iReceptor Gateway, typically hide the fact that there are multiple repositories being queried, and these platforms maintain an internal registry so the end user does not need to worry about where the data resides.

The AIRR Community is working on the specification and establishment of a central registry that will provide programmatic access to a list of repositories known to support the ADC API.

4. The time and cost of data sharing: As evidenced by the nuances discussed in both this and the "AIRR-Community Guide: Planning and Performing AIRR-Seq Experiments" chapter, the process of defining and performing a study involving AIRR-seq data is incredibly complex. The MiAIRR and other AIRR Standards are designed to guide researchers in capturing study/data processing features important for data reuse. Because of this completeness, the standards necessarily contain a lot of detail. Although the AIRR Community encourages researchers to be as complete as possible in documenting their study design and process, the more critical aspect of data sharing and reuse is that when one does document a part of a study, it is done in a standards-compliant way. Although the MiAIRR Standard has a large number of fields, all of them considered "important," only a small subset of these fields is "essential/required" for a study to be AIRR-compliant.

There is a balance to be struck between the time it takes to capture study metadata at an appropriate level, both for internal requirements to perform the study and external requirements for data sharing and reuse. The MiAIRR Standard was created to help on both of these fronts, providing researchers with a comprehensive list of metadata fields that they might consider when designing a study and fields that enable comparison of study methods and guide decisions on data reuse. One of the main reasons that it is currently costly in terms of time and effort to reuse data (see below) is because the data has not been curated in a manner that facilitates this reuse.

5. The challenges of starting with FASTA/FASTQ: Although the INSDC sequence archive repositories (SRA, ENA, etc.) are a critical resource for long-term storage of raw AIRR-seq data, it can be challenging for some researchers to reuse this data. This challenge comes from the need to transform data as stored in SRA/ENA to data that can be reused in analysis. As discussed in the "AIRR Community Guide: TR and IG Gene Annotation" chapter, the transformation of raw sequence data to annotated data, which is in turn the basis for data reuse in analysis, can be complex. If all data reuse requires starting from FASTA/FASTQ sequence files, each case of reuse requires the reannotation of the data, including the expertise and time to run the annotation pipelines. Even for experienced bioinformaticians, lack of data preprocessing information (e.g., primer/barcode sequences or thresholds for trimming/filtering) adds significant time to data reannotation and can potentially impact reproducibility of results. In some cases, researchers may need to contact the data generator directly to obtain critical details such as barcodes for sample demultiplexing; depending on the level of collaboration, this could pose a barrier to data reuse. In addition, unless the AIRR-to-NCBI pipeline (Subheading 3.1.2) was used to process study metadata, it is unlikely that the study, subject, and sample metadata will be stored in a standards-based and reusable format. Mapping study metadata and rerunning annotation pipelines is error prone, and this process needs to be redone by each researcher wanting to reuse a data set obtained from INSDC sequence archives.

The AIRR Community has attempted to minimize this cost, firstly by providing a process to store the critical study, subject, and sample metadata in the INSDC repositories and secondly by providing a mechanism (the AIRR Standards and the AIRR Data Commons) by which researchers can curate and share annotated AIRR-seq data. By storing AIRR-seq data in a standard-based, curated, and annotated form, each AIRR-seq data set can be annotated and stored once, and then reused by many others without the costly overhead of reannotation. Because the AIRR Standard supports the ability to curate the annotation process, a data set from a single study can be annotated multiple times using different annotation tools, and users can differentiate between these annotations and choose the one that they think is the most appropriate for their data reuse.


Regarding differences in experimental protocols, each of the decision points outlined in the "AIRR Community Guide to TR and IG Gene Annotation" chapter, Fig. 1 (single cell versus bulk sequencing, gDNA versus mRNA as the starting molecule, and whether UMIs were used), as well as differences in sequencing depth, sequencing error rates, read length, and the primers used can impact the data in ways that can make certain conclusions invalid. For example, differences in hybridization and amplification efficiencies between primers in a primer set can influence gene usage estimates. Thus, combining or comparing gene usage estimates between studies that used different primer sets can result in gene usage differences that are attributable to experimental artifacts and do not reflect true repertoire differences. The potential impacts of specific experimental protocol choices on analysis conclusions are briefly outlined here and in Table 2. All of these experimental protocol differences can impact the number of unique receptor sequences observed in a sample, particularly whether rare sequences are observed. This can in turn result in artifacts when computing common diversity metrics, such as clonality and repertoire overlap, and when constructing sequence networks. Similarly, protocol differences that affect relative abundances of starting templates, including whether gDNA or mRNA was used, primer differences, and whether UMIs were used, can result in artifacts when computing diversity and clonality measures and when analyzing gene usage. Finally, differences in the use of UMIs, sequencing platform error rates, and primers can result in artifacts for somatic hypermutation analyses.

Differences in preprocessing and sequence annotation protocols can similarly result in artifacts during re- or meta-analysis (Table 2). Differences in sequence filtering and deduplication can impact the number of unique AIRR sequences observed in a sample, as well as estimates of starting template relative abundances, in turn impacting diversity, clonality, and overlap measures, as well as sequence networks and gene usage estimates. The germline gene database and alignment algorithm used for germline gene annotations can impact germline gene assignments, thereby impacting gene usage estimates and somatic hypermutation analyses. These can also impact diversity, clonality, and other analyses when they are conducted at the annotation rather than sequence level (i.e., defining unique rearrangements according to their V gene, J gene, and CDR3 sequence rather than their full-length sequence). Finally, assigning clonal membership can depend on the clonal assignment algorithm used and could thereby impact the results of IG affinity maturation.
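The difference between annotation-level and sequence-level analyses can be illustrated with a minimal Python sketch that counts unique rearrangements both ways. The record keys (`v_gene`, `j_gene`, `cdr3`, `sequence`) and the function name are hypothetical and chosen only for illustration; they are not a prescribed field set.

```python
def count_unique_rearrangements(records, level="annotation"):
    """Count unique rearrangements in a list of dict records.

    level="annotation": uniqueness is defined by (V gene, J gene, CDR3).
    level="sequence":   uniqueness is defined by the full-length sequence.
    The dictionary keys used here are illustrative assumptions.
    """
    if level == "annotation":
        keys = {(r["v_gene"], r["j_gene"], r["cdr3"]) for r in records}
    elif level == "sequence":
        keys = {r["sequence"] for r in records}
    else:
        raise ValueError("level must be 'annotation' or 'sequence'")
    return len(keys)
```

Two reads that share the same V gene, J gene, and CDR3 but differ elsewhere in the V-REGION count once at the annotation level and twice at the sequence level, which is why germline database or aligner differences can shift diversity estimates computed at the annotation level.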

### Table 2 Considerations in re- and meta-analysis




Upper panel—Experimental, preprocessing, and rearrangement annotation protocol differences are shown in rows, and the impacted AIRR-seq data set features are shown in columns. X indicates when a data set feature is expected to be impacted by a particular protocol difference. Lower panel—Analysis types are shown in rows. X indicates when an analysis type is expected to be impacted by the data set feature in the corresponding column. O indicates that an analysis type is expected to be impacted when the analysis is conducted at the annotation- rather than sequence-level (e.g., V gene, J gene, and CDR3 sequence rather than full-length sequence)

The above and Table 2 are not meant to be comprehensive, but instead serve as a guide when designing analyses that combine data sets or results across different studies. When selecting studies for such an analysis and formulating the research questions that can be reliably addressed, it is important to identify differences in experimental, preprocessing and annotation protocols and understand how these protocol differences can affect relevant data set features and analyses. It is recommended to choose research questions and analysis methods that are independent of any protocol differences where possible. Furthermore, common experimental design best practices are encouraged, such as ensuring that protocol differences do not partition with treatment groups and incorporating methods to control for and/or estimate their effects, as one would batch effects.

### Acknowledgments

We would like to thank our colleagues from the AIRR Community, who have dedicated many hours to the development of the community and the standards and initiatives on which this chapter is based. In particular, we would like to thank the authors of the other AIRR Community chapters in this volume, with a special thanks to Susanna Marquez, William Lees, and Ulrik Stervbo who assisted with content and editing of this chapter.



Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Chapter 24

# IMGT® Immunoinformatics Tools for Standardized V-DOMAIN Analysis

### Véronique Giudicelli, Patrice Duroux, Maël Rollin, Safa Aouinti, Géraldine Folch, Joumana Jabado-Michaloud, Marie-Paule Lefranc, and Sofia Kossida

### Abstract

The variable domains (V-DOMAIN) of the antigen receptors, immunoglobulins (IG) or antibodies and T cell receptors (TR), which specifically recognize the antigens show a huge diversity in their sequences. This diversity results from the complex mechanisms involved in the synthesis of these domains at the DNA level (rearrangements of the variable (V), diversity (D), and joining (J) genes; N-diversity; and, for the IG, somatic hypermutations). The recognition of V, D, and J as "genes" and their entry in databases mark the creation of IMGT by Marie-Paule Lefranc, and the origin of immunoinformatics in 1989. For 30 years, IMGT®, the international ImMunoGeneTics information system® http://www.imgt.org, has implemented databases and developed tools for IG and TR immunoinformatics, based on the IMGT Scientific chart rules and IMGT-ONTOLOGY concepts and axioms, and more particularly, the princeps ones: IMGT genes and alleles (CLASSIFICATION axiom) and the IMGT unique numbering and IMGT Collier de Perles (NUMEROTATION axiom). This chapter describes the online tools for the characterization and annotation of the expressed V-DOMAIN sequences: (a) IMGT/V-QUEST analyzes in detail IG and TR rearranged nucleotide sequences, (b) IMGT/HighV-QUEST is its high throughput version, which includes a module for the identification of IMGT clonotypes and generates immunoprofiles of expressed V, D, and J genes and alleles, (c) IMGT/StatClonotype performs the pairwise comparison of IMGT/HighV-QUEST immunoprofiles, (d) IMGT/DomainGapAlign analyzes amino acid sequences and is frequently used in antibody engineering and humanization, and (e) IMGT/Collier-de-Perles provides two-dimensional (2D) graphical representations of V-DOMAIN, bridging the gap between sequences and 3D structures. These IMGT® tools are widely used in repertoire analyses of the adaptive immune responses in normal and pathological situations and in the design of engineered IG and TR for therapeutic applications.

Key words IMGT, Immunogenetics, Immunoinformatics, Immunoglobulin, Antibody, T cell receptor, V-DOMAIN sequence analysis, Adaptive immune repertoire, IMGT Collier de Perles, IMGT-ONTOLOGY

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_24, © The Author(s) 2022

### 1 Introduction

Immunoglobulins (IG) or antibodies [1, 2] and T cell receptors (TR) [3] are antigen receptors of the adaptive immune responses in vertebrates with jaws (gnathostomata) [4]. The huge diversity of the variable domains (V-DOMAIN) of the IG and TR chains of the immune repertoires results from several mechanisms that occur during their synthesis [1–4]. In particular, the combinatorial diversity depends on the number of variable (V), diversity (D), and joining (J) genes found in the IG and TR loci, which potentially can rearrange to form V-DOMAIN encoded by V-(D)-J regions [1–4]. It is the recognition of the V, D, and J as "genes" and their entry in databases that marked the creation of IMGT in 1989 by Marie-Paule Lefranc (Université de Montpellier, CNRS) at Human Gene Mapping 10 (HGM10) and that is at the origin of immunoinformatics, a new science at the interface between immunogenetics and bioinformatics [4].

Other mechanisms of diversity comprise the junctional diversity, with exonuclease trimming at the ends of the V, D, and J genes and the random addition of nontemplated nucleotides, preferably "g" and "c," by the terminal deoxynucleotidyl transferase (TdT) encoded by the DNA nucleotidylexotransferase (DNTT) gene, creating the N-regions [1–3], and, for IG, somatic hypermutations [1, 2]. For 30 years, IMGT®, the international ImMunoGeneTics information system® http://www.imgt.org, has implemented databases and developed tools for IG and TR immunoinformatics [5], based on the IMGT Scientific chart rules (see Subheading 2) and IMGT-ONTOLOGY concepts and axioms [6, 7], and more particularly, the princeps ones: IMGT genes and alleles (CLASSIFICATION axiom) [8–12] and the IMGT unique numbering [13–17] and IMGT Collier de Perles [18–21] (NUMEROTATION axiom). This chapter describes the online analysis tools for the characterization and annotation of the expressed V-DOMAIN nucleotide (nt) and amino acid (AA) sequences, available from the "IMGT tools" section of the IMGT® Home page. Protocols for their use and the description of the main results are presented for the following tools:

	- (a) IMGT/V-QUEST [22, 23] is the IMGT® online tool for the analysis of IG and TR nucleotide rearranged sequences (see Subheading 3).
	- (b) IMGT/HighV-QUEST [24–27], the high throughput version of IMGT/V-QUEST, can analyze sets of up to one million sequences. It includes a module for the identification of IMGT clonotypes (AA) and the generation of IG and TR gene profiles for the diversity and expression of IMGT clonotypes (AA) (see Subheading 4).
	- (c) IMGT/StatClonotype [28, 29] is a standalone package that performs statistical pairwise comparisons of IMGT clonotype (AA) diversity or expression between two IMGT/HighV-QUEST result sets (see Subheading 5).
	- (d) IMGT/DomainGapAlign [30, 31] analyzes domain AA sequences and two-dimensional (2D) structures, and its results are used in antibody engineering and humanization [32, 33] (see Subheading 6).
	- (e) The IMGT/Collier-de-Perles tool [21] generates IMGT Colliers de Perles graphical 2D representations for AA domain sequences [18–20] (see Subheading 7). It is available from the IMGT Home page and is also automatically launched by IMGT/V-QUEST and IMGT/DomainGapAlign.

### 2 IMGT Scientific Chart Rules for the Analysis of the V-DOMAIN

2.1 IMGT Gene and Allele Nomenclature and IMGT Reference Directory Sets

The IMGT gene names of the IG and TR V, D, J, and C genes [1–4] were approved by the Human Genome Organization (HUGO) Nomenclature Committee (HGNC) in 1999 [8, 9, 12] and were endorsed by the WHO-IUIS Nomenclature Subcommittee for IG and TR [10, 11]. IMGT gene and allele names are based on the concepts of classification of IMGT-ONTOLOGY "Group," "Subgroup," "Gene," and "Allele" [1–4, 10–12]. Alleles are the polymorphic variants of a gene: they are identified by their IMGT reference sequence, which corresponds to the coding V-REGION, D-REGION, J-REGION, and C-REGION sequence at the nucleotide level of V, D, J, and C gene alleles, respectively. IMGT reference directory sets include the allele IMGT reference sequences from functional (F) genes and alleles, open reading frame (ORF), and pseudogenes (P) [5]. IMGT germline V, D, and J genes and alleles, with their characteristics, their reference sequence and other sequences from the literature are managed in IMGT/GENE-DB [34] and in IMGT Repertoire (IG and TR) Gene tables and Alignments of alleles Web resources [1–4]. The tools for V-DOMAIN analysis compare user sequences with IMGT reference directory sets for the identification of V, D, and J genes and alleles and the evaluation of mutations and AA changes.

2.2 IMGT Unique Numbering for the IG and TR V Domains

An IG or TR V-DOMAIN comprises about 100 amino acids and is made of nine antiparallel beta strands (A, B, C, C′, C″, D, E, F, and G) linked by beta turns (AB, CC′, C″D, DE, and EF) or loops (BC, C′C″, and FG) [35]. At the structural level, they form a sandwich of two sheets closely packed against each other through hydrophobic interactions and joined together by a disulfide bridge between 1st-CYS at position 23 in B-STRAND (in the first sheet) and 2nd-CYS at position 104 in F-STRAND (in the second sheet) [13].

The IMGT unique numbering for IG and TR V-DOMAIN [13] delimits (1) the four framework regions: FR1-IMGT (A and B strands, from positions 1 to 26), FR2-IMGT (C and C′ strands, from positions 39 to 55), FR3-IMGT (C″, D, E, and F strands, from positions 66 to 104), and FR4-IMGT (G strand, from positions 118 to 128), and (2) the three hypervariable or complementarity determining regions involved in the ligand recognition: CDR1-IMGT (BC loop, positions 27 to 38), CDR2-IMGT (C′C″ loop, positions 56 to 65), and CDR3-IMGT (FG loop, positions 105 to 117, with additional positions 112.1, 111.1, 112.2, etc., if longer than 13 codons (or AA)). FR-IMGT positions, which delimit the three CDR-IMGT, are designated as anchors: they are 26 and 39, 55 and 66, and 104 and 118, respectively (Fig. 1), and shown as squares in IMGT Colliers de Perles [18–21]. According to the IMGT unique numbering [13], a V-DOMAIN is characterized by five highly conserved AA: 1st-CYS 23, tryptophan 41 (CONSERVED-TRP), hydrophobic amino acid 89, 2nd-CYS 104, and J-PHE or J-TRP 118 of the J-MOTIF (F/W-G-X-G, 118–121, where F is phenylalanine, W tryptophan, G glycine, and X any AA). The three CDR-IMGT lengths characterize a V-DOMAIN. By convention, they are indicated between brackets, separated by dots (for example [8.8.13]). The CDR1-IMGT and CDR2-IMGT are encoded by the V-REGION, whereas the CDR3-IMGT results from the V-(D)-J rearrangement. The IMGT Collier de Perles [18–20] can be generated by the IMGT/Collier-de-Perles tool [21] (see Subheading 7).

Fig. 1 Protein displays of IG and TR V-DOMAIN based on the IMGT unique numbering for V-DOMAIN [13]. The V-DOMAIN translations were obtained from the analysis by IMGT/V-QUEST [22, 23] (see Subheading 3) of the nucleotide sequences of shown accession numbers in IMGT/LIGM-DB [36]. The identification of FR-IMGT and CDR-IMGT and of beta strands and loops was performed by IMGT/DomainGapAlign [30, 31] (see Subheading 6), which provides a standardized delimitation whatever the species, the receptor type, and the chain type. CDR-IMGT lengths are indicated between brackets, separated by dots (column on the right). 1st-CYS 23 and 2nd-CYS 104 are in pink, and W 41, hydrophobic AA 89, and W or F 118 are in blue. Taxons are in the IMGT 6 or 9-letter abbreviation: Homsap for Homo sapiens, Canlupfam for Canis lupus familiaris (dog), Musmus for Mus musculus (mouse), Turtru for Tursiops truncatus (dolphin), Macmul for Macaca mulatta (Rhesus monkey), and Felcat for Felis catus (cat)
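The FR-IMGT and CDR-IMGT delimitations of the IMGT unique numbering can be expressed as a simple lookup. The following Python sketch is only an illustration of the position ranges quoted in the text; the function and constant names are hypothetical and are not part of any IMGT® tool. Additional CDR3-IMGT positions such as 111.1 or 112.1 are handled by their integer part, without checking that such a position is valid.

```python
# Position ranges of the IMGT unique numbering for V-DOMAIN, as given in the text.
IMGT_REGIONS = [
    ("FR1-IMGT", 1, 26),
    ("CDR1-IMGT", 27, 38),
    ("FR2-IMGT", 39, 55),
    ("CDR2-IMGT", 56, 65),
    ("FR3-IMGT", 66, 104),
    ("CDR3-IMGT", 105, 117),
    ("FR4-IMGT", 118, 128),
]

# FR-IMGT anchor positions that delimit the three CDR-IMGT.
ANCHORS = {26, 39, 55, 66, 104, 118}


def imgt_region(position):
    """Return the FR-IMGT or CDR-IMGT label for an IMGT position.

    `position` may be an int (e.g., 104) or a string such as "111.1";
    only the integer part is used for the lookup.
    """
    base = int(str(position).split(".")[0])
    for label, start, end in IMGT_REGIONS:
        if start <= base <= end:
            return label
    raise ValueError(f"position {position!r} is outside 1-128")


def format_cdr_lengths(cdr1, cdr2, cdr3):
    """Format the three CDR-IMGT lengths by convention, e.g., [8.8.13]."""
    return f"[{cdr1}.{cdr2}.{cdr3}]"
```

For example, `imgt_region(104)` returns `"FR3-IMGT"` (2nd-CYS 104 is an anchor, not part of the CDR3-IMGT), and `format_cdr_lengths(8, 8, 13)` returns `"[8.8.13]"`.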

Fig. 2 Graphical representation or prototypes of IG and TR V-DOMAIN with IMGT labels at the nucleotide level. (a) V-D-J-REGION. (b) V-J-REGION [1–4]. The JUNCTION encompasses 2nd-CYS 104, CDR3-IMGT, and J-TRP or J-PHE 118, and its length is therefore two AA longer than CDR3-IMGT. Potential palindromic nucleotides ("P") identified in case of untrimmed V, D, and/or J regions during the DNA rearrangement are not shown (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

2.3 IMGT Standardized Labels and Sequence Description

The IMGT tools, which perform the analysis of sequences, provide the description of the V-DOMAIN with IMGT standardized labels (written in capital letters). The V-DOMAIN corresponds either to a V-D-J-REGION (in IG heavy (IGH), TR beta (TRB), and TR delta (TRD) chains) (Fig. 2a) or to a V-J-REGION (in IG light lambda (IGL) and IG kappa (IGK), TR alpha (TRA), and TR gamma (TRG) chains) (Fig. 2b), encoded by V-D-J or V-J rearrangements, respectively.

The V-DOMAIN labels according to the chain type or locus are: VH, V-KAPPA, V-LAMBDA for the IGH, IGK, and IGL, respectively, and V-ALPHA, V-BETA, V-DELTA, V-GAMMA for the TRA, TRB, TRD, and TRG, respectively [1–4].
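As a compact summary of the labels above, the following Python sketch maps each locus to its V-DOMAIN label and to the type of region that encodes it. The dictionary and function names are hypothetical; only the labels and the locus groupings are taken from the text.

```python
# V-DOMAIN label per locus, as listed in the text.
V_DOMAIN_LABELS = {
    "IGH": "VH",
    "IGK": "V-KAPPA",
    "IGL": "V-LAMBDA",
    "TRA": "V-ALPHA",
    "TRB": "V-BETA",
    "TRD": "V-DELTA",
    "TRG": "V-GAMMA",
}

# Loci whose V-DOMAIN results from a V-D-J rearrangement.
VDJ_LOCI = {"IGH", "TRB", "TRD"}


def describe_v_domain(locus):
    """Return (V-DOMAIN label, region label) for a locus such as 'IGH'."""
    label = V_DOMAIN_LABELS[locus]
    region = "V-D-J-REGION" if locus in VDJ_LOCI else "V-J-REGION"
    return label, region
```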

2.4 IMGT Functionality of IG and TR Genes and Alleles and of Rearranged Sequences

The "Functionality" concept identifies the functionality based on the configuration of the IG and TR genes. The functionality of the germline (V, D, and J) and undefined (C) IG and TR genes and alleles, defined on the same criteria as conventional genes and alleles, is either functional (F), open reading frame (ORF), or pseudogene (P). The functionality of the IG and TR V-(D)-J rearranged sequences is either "productive" (no stop codon and in-frame JUNCTION (2nd-CYS 104 and J-TRP/J-PHE 118 in the same reading frame)) or "unproductive" (stop codons and/or out-of-frame JUNCTION) [2].
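The rule for rearranged sequences can be written as a two-input decision. This Python sketch is a minimal illustration of the two cases defined above and assumes that the presence of stop codons and the JUNCTION frame have already been determined; the function name is hypothetical, and the sketch does not reproduce how IMGT/V-QUEST itself evaluates functionality.

```python
def rearranged_functionality(has_stop_codon, junction_in_frame):
    """Classify a V-(D)-J rearranged sequence from two precomputed facts.

    "productive":   no stop codon and in-frame JUNCTION.
    "unproductive": stop codon(s) and/or out-of-frame JUNCTION.
    """
    if not has_stop_codon and junction_in_frame:
        return "productive"
    return "unproductive"
```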

### 3 IMGT/V-QUEST

IMGT/V-QUEST [22, 23] identifies the V, D, and J genes and alleles in IG and TR V domains. It characterizes the nucleotide (nt) mutations and amino acid (AA) changes resulting from somatic hypermutations in IG V-REGION. It provides a detailed characterization of the V-D-J or V-J junctions by the integrated IMGT/JunctionAnalysis tool [37, 38] and the full annotation of the V-DOMAIN with IMGT labels by IMGT/Automat [39, 40].

3.1 IMGT/V-QUEST Sequence Submission

The top of the IMGT/V-QUEST Welcome page (Fig. 3) provides two links: the first one gives access to the list of the IMGT/V-QUEST reference directory sets to which the users' own sequences can be compared (see Note 1), and the second one provides examples of human rearranged sequences to test the tool. The page then includes five sections to configure the analysis:

3.1.1 Your Selection

	- 1. Select the species or taxon first and then the receptor type or locus (see Note 2) in the lists.

3.1.2 Sequence Submission

IMGT/V-QUEST analyzes up to 50 FASTA formatted IG or TR rearranged nucleotide sequences per run, from either genomic DNA or cDNA.

	- 1. Enter the sequences in the text area "Type (or copy/paste) your nucleotide sequence(s) in FASTA format".

Fig. 3 IMGT/V-QUEST Welcome page with the five sections: "Your selection," "Sequence submission," "Display results," "Advanced parameters," and "Advanced functionalities" [22, 23]
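Because IMGT/V-QUEST accepts up to 50 FASTA formatted sequences per run, larger sets need to be split before they are pasted into the submission text area (or submitted to IMGT/HighV-QUEST instead). The following Python sketch shows one way to prepare such batches locally; the function names are hypothetical, and the sketch does not interact with the IMGT® web site.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batch_fasta(records, batch_size=50):
    """Return FASTA strings, each holding at most `batch_size` records."""
    batches = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        batches.append("".join(f">{h}\n{s}\n" for h, s in chunk))
    return batches
```

With 120 input sequences, `batch_fasta` returns three strings containing 50, 50, and 20 records, each of which can be submitted as one run.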

3.1.3 Display Results

	- 1. Select "A. Detailed view" to get the results for each sequence individually. Results consist of a "Result summary" with the main results of the analysis and 14 detailed result sections that can be selectively checked or unchecked by the user (see Note 4). In sequence alignments, the number of IMGT reference sequences aligned with the user sequence (five by default) can be modified from 1 to 20.
	- 2. Select "B. Synthesis view" to display the sequences that express the same V gene and allele aligned together. Results include a "Summary table" with the main results of the analysis which can be ordered by "V-GENE and allele name" (default) or by the sequence "input" order. There are eight detailed result sections that can be checked or unchecked (see Note 5).
	- 3. Select "C. Excel file" to download the results, either in a spreadsheet (default) or as a zip archive. The results may include 11 sheets (or text files in the zip archive (see Note 6)), which can be checked or unchecked. The 12th sheet (or text file) is available if the option "Analysis of single chain Fragment variable (scFv)" is selected in "Advanced functionalities" (see Subheading 3.1.5) [41]. An alternative is to display the content of one given sheet in your browser ("Display 1 CSV file in your browser") or to "Download AIRR formatted results" as a zip archive (see Note 7) [42, 43].
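Once the AIRR formatted results have been downloaded and unzipped, they can be read as a tab-separated table. The sketch below uses only the Python standard library and assumes the AIRR Rearrangement conventions (tab-delimited columns such as `sequence_id`, `v_call`, `j_call`, and `productive`, with booleans written as `T`/`F`). The in-memory example table is invented for illustration: the first row reuses the gene names quoted for seq\_1 in this chapter, the second row is fictitious, and the exact content of these fields in real IMGT/V-QUEST output is not assumed here.

```python
import csv
import io


def load_airr_rearrangements(handle):
    """Read an AIRR Rearrangement TSV from an open text handle."""
    return list(csv.DictReader(handle, delimiter="\t"))


def productive_rows(rows):
    """Keep rows whose 'productive' field is true ('T' in AIRR TSV)."""
    return [row for row in rows if row.get("productive") == "T"]


# Invented example table; replace io.StringIO(...) with open(path) for a file.
EXAMPLE_TSV = (
    "sequence_id\tv_call\tj_call\tproductive\n"
    "seq_1\tIGHV3-9*01\tIGHJ4*02\tT\n"
    "seq_x\tIGHV1-2*02\tIGHJ6*02\tF\n"
)

rows = load_airr_rearrangements(io.StringIO(EXAMPLE_TSV))
print([row["sequence_id"] for row in productive_rows(rows)])  # prints ['seq_1']
```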

3.1.4 Advanced Parameters

The default values of the advanced parameters are used by IMGT/V-QUEST for classical analyses [22, 23]. They may be modified for specific studies and/or unusual sequences. The user may:


insertions and/or deletions. Selecting "Yes" makes it possible to identify the somatic hypermutations by nucleotide insertions and deletions in the V-REGION, which may occur in normal and malignant cells [44], and/or potential sequencing errors.

	- (a) "Nb of accepted D-GENE" (number of D genes searched by the tool in IGH, TRB or TRD junctions).
	- (b) "Nb of accepted mutations" in 3′ V-REGION, D-REGION, and 5′ J-REGION: by default, 2, 4, and 2 mutations are accepted in the 3′ V-REGION, D-REGION, and 5′ J-REGION, respectively, for IGH, and 7 in the 3′ V-REGION and 5′ J-REGION for IGK and IGL junctions (see Note 9). By default, no mutation is accepted for the TR junctions.
	- (a) "Nb of nucleotides to exclude in 5′ of the V-REGION for the evaluation of the number of mutations" (useful in case of primer specific nucleotides).
	- (b) "Nb of nucleotides to add (or exclude) in 3′ of the V-REGION for the evaluation of the alignment score" (useful in case of low (or high) exonuclease activity).
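The effect of excluding 5′ nucleotides from the mutation count can be illustrated with a small Python sketch. It assumes two already aligned, ungapped, equal-length nucleotide strings (user V-REGION and closest germline V-REGION) and simply ignores the first positions, for instance because they derive from a primer. The function name and the handling of "n" are assumptions made for the example and do not describe the internal algorithm of IMGT/V-QUEST.

```python
def count_mutations(user_seq, germline_seq, exclude_5prime=0):
    """Count nt differences, ignoring the first `exclude_5prime` positions.

    Both sequences are assumed to be aligned, ungapped, and of equal
    length. Positions with 'n' in either sequence are not counted.
    """
    if len(user_seq) != len(germline_seq):
        raise ValueError("sequences must have the same length")
    count = 0
    pairs = zip(user_seq.lower(), germline_seq.lower())
    for index, (user_nt, germline_nt) in enumerate(pairs):
        if index < exclude_5prime:
            continue  # e.g., primer-derived nucleotides
        if user_nt == "n" or germline_nt == "n":
            continue  # undetermined nucleotide, not a mutation
        if user_nt != germline_nt:
            count += 1
    return count
```

With `user_seq="acgtacgt"` and `germline_seq="acgaacga"`, the function counts two differences, or one if the first four nucleotides are excluded.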

3.1.5 Advanced Functionalities

"Advanced functionalities" [22, 23] corresponds to specific analyses, with additional dedicated results, for engineered/artificial sequences, and for the search of specific sequences for clinical applications. The user may:


3.2 IMGT/V-QUEST Results for A. Detailed View

The page "A. Detailed results for the IMGT/V-QUEST analyzed sequences" [22, 23] indicates at the top the number of analyzed sequences and the list of sequence identifiers, with links that lead directly to the corresponding individual results. Individual results include the FASTA submitted sequence and the "Result summary" of the analysis, followed by the detailed result sections selected in the Welcome page. Importantly, the result sections make it possible to explore in depth the results of the analysis regarding the identification of V, (D), J genes and alleles, the description of the V-DOMAIN with the delimitation of FR-IMGT and CDR-IMGT, and the characterization of the mutations.

3.2.1 Sequence and Result Summary

The numbers of 5′ trimmed-n and 3′ trimmed-n removed from the submitted sequence before the analysis, if any (see Note 3), the sequence length, the sequence analysis category (see Note 12), and the IMGT reference directory set with which the sequence was compared (e.g., Homo sapiens (human) IG set) are indicated above the submitted sequence provided in FASTA format [22, 23] (Fig. 4). The part of the sequence corresponding to the V-DOMAIN is underlined in green. If a sequence was submitted in antisense orientation, it is reverse-complemented and displayed, together with the results, in the V gene sense orientation.
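For readers who wish to reproduce the orientation change locally, reverse-complementing a nucleotide sequence takes a few lines of Python. This is a generic sketch (the function name is hypothetical); it handles a, c, g, t, and n in either case and leaves any other character unchanged.

```python
# Translation table for complementing a, c, g, t, n (lower and upper case).
_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

For example, `reverse_complement("aacg")` returns `"cgtt"`, and applying the function twice returns the original sequence.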

The "Result summary" provides the main characteristics of the analyzed sequence [22, 23]:


IMGT/V-QUEST provides warnings (not shown) that appear as notes in red to alert the user, if potential insertions or deletions are suspected in the V-REGION (see Note 15), or if other possibilities for the J-GENE and allele names are identified. Users are encouraged to check alignments in related detailed result sections.

Below the "Result summary," notes in black (not shown) may appear to indicate:



Fig. 4 IMGT/V-QUEST "Detailed results" [22, 23]. The parameters of the analysis are recalled on the top of the page. The first part of the "Detailed results" for "seq\_1" (IMGT/LIGM-DB [36] accession number X81732) includes the sequence in FASTA format (the first 57 nt not underlined in green are not part of the V-DOMAIN) and the "Result summary." Seq\_1 functionality is "productive." This human IGHV sequence expresses the IGHV3-9\*01, IGHD3-3\*01, and IGHJ4\*02 genes and alleles. The lengths of the four FR-IMGT are 25, 17, 38, and 11. The lengths of the three CDR-IMGT are 8, 8, and 13. The JUNCTION length is 45 nt, and the decryption [45] shows that it is composed of 8 nt for the 3′ V-REGION (5 nt were trimmed from the germline V during DNA rearrangement), 3 nt for the N1-REGION, 17 nt for the D-REGION (7 nt in 5′ and 7 nt in 3′ were trimmed from the germline D), 6 nt for the N2-REGION, and 11 nt for the 5′ J-REGION (6 nt were trimmed from the germline J)
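The numbers in the Fig. 4 legend can be checked with simple arithmetic: for an in-frame JUNCTION, the length in nucleotides is three times the CDR3-IMGT length plus the two anchor codons (2nd-CYS 104 and J-TRP/J-PHE 118), and the components of the decryption add up to the same value. The Python helpers below are a hypothetical illustration of this check, not a re-implementation of IMGT/JunctionAnalysis.

```python
def junction_length_nt(cdr3_length_aa):
    """In-frame JUNCTION length in nt: CDR3-IMGT plus the two anchors."""
    return 3 * (cdr3_length_aa + 2)


def decryption_total(v3_nt, n1_nt, d_nt, n2_nt, j5_nt):
    """Sum of the JUNCTION components of a V-D-J rearrangement, in nt."""
    return v3_nt + n1_nt + d_nt + n2_nt + j5_nt


# seq_1 (Fig. 4): CDR3-IMGT of 13 AA; components 8, 3, 17, 6, and 11 nt.
assert junction_length_nt(13) == 45
assert decryption_total(8, 3, 17, 6, 11) == 45
```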

3.2.2 Detailed Result Sections

If selected in the Welcome page, the 14 detailed result sections are displayed [22, 23]. They make it possible to verify, detail, and complete the "Result summary."

	- (a) The "Analysis of the JUNCTION" (Fig. 5) [22, 23] shows the details of the junction at the nucleotide level with delimitation of the IMGT labels (Fig. 2 in Subheading 2.3). Dots indicate the number of nucleotides trimmed at the germline V, D, and J gene ends. Vmut, Dmut, and Jmut indicate the number of mutations in the 3′ V-REGION, D-REGION, and 5′ J-REGION, respectively, and the corresponding mutated nucleotides are underlined in the sequence. "Ngc" corresponds to the ratio of the number of g+c nucleotides to the total number of N nucleotides. The JUNCTION decryption is also provided [45] (see Note 14). If "Eligible D genes" is selected (not shown), all D genes that match the junction are displayed below with their corresponding scores.
	- (b) The "Translation of the JUNCTION" displays the AA JUNCTION with AA colored according to the eleven IMGT physicochemical classes [46] (see Note 18), the JUNCTION frame ('+' for in-frame, and '-' for out-of-frame), the CDR3-IMGT length, the molecular mass, the isoelectric point (pI), and a link to detailed physicochemical descriptors (not shown). Gaps (represented by dots) are inserted in "out-of-frame" JUNCTION to maintain the J-REGION frame, and the corresponding codon, which cannot be translated, is represented by "#" in AA translation (not shown).
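Two of the values described above are easy to recompute from the nucleotide strings. The Python sketch below computes an "Ngc"-like ratio from the N-region sequences and a frame symbol from the JUNCTION length. Both functions are hypothetical simplifications: the ratio is returned as a fraction (the display format used by the tool is not assumed), and the frame is inferred only from whether the ungapped JUNCTION length is a multiple of three.

```python
def ngc_ratio(n_regions):
    """Ratio of g+c nucleotides to all N nucleotides.

    `n_regions` is a list of N-region sequences (e.g., [N1, N2]).
    Returns None when there are no N nucleotides.
    """
    n_nt = "".join(n_regions).lower()
    if not n_nt:
        return None
    gc_count = sum(1 for base in n_nt if base in "gc")
    return gc_count / len(n_nt)


def junction_frame(junction_nt):
    """Return '+' if the ungapped JUNCTION length is a multiple of 3, else '-'."""
    return "+" if len(junction_nt) % 3 == 0 else "-"
```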



Fig. 5 IMGT/V-QUEST "Detailed results" [22, 23]. Results of IMGT/JunctionAnalysis [37, 38] for IMGT/LIGM-DB accession number AB063867. This human IGH sequence results from the rearrangement of IGHV3-11\*05 F, IGHD6-13\*01 F, and IGHJ4\*02 F. The JUNCTION is in-frame. The length of the CDR3-IMGT is 14 AA (42 nt). The JUNCTION length is 48 nt, and the decryption [45] shows that it is composed of 9 nt for the 3′ V-REGION (2 nt were trimmed from the germline V), 6 nt for the N1-REGION, 17 nt for the D-REGION (1 nt in 5′ and 3 nt in 3′ were trimmed from the germline D), 3 nt for the N2-REGION, and 13 nt for the 5′ J-REGION (4 nt were trimmed from the germline J)

FASTA format with the formatted header required as input by IMGT/JunctionAnalysis online [37, 38]. These results are provided even if IMGT/JunctionAnalysis gives no results.

	- (a) "6. V-REGION alignment according to the IMGT unique numbering" for the nt sequences with the FR-IMGT and CDR-IMGT delimitations according to the IMGT unique numbering [13].
	- (b) "7. V-REGION translation" for the nt sequence and its AA translation, aligned with the closest germline V-REGION.
	- (a) "9. V-REGION mutation and AA change table" lists the nt mutations and, if nonsilent, the corresponding AA changes. They are described for each FR-IMGT and CDR-IMGT with their nt and codon positions according to the IMGT unique numbering [13]. In parentheses, the "AA class Change Type" indicates if, between germline AA and replaced AA, the hydropathy, volume, and physicochemical properties have been conserved (+) or not (-) according to the IMGT physicochemical classes [46].
	- (b) "10. V-REGION mutation and AA change statistics" comprises two tables for the detailed and complete characterization of nt mutations and AA changes: "Nucleotide (nt) mutations" table quantifies nt positions with or without gaps, the identical nt, the total number of mutations, and the silent and nonsilent ones for the V-REGION and per FR-IMGT and CDR-IMGT. It then details the same evaluation for the four types of transitions and of the eight types of transversions. "Amino acid (AA) changes" table quantifies the codons or amino acid positions, with or without gaps, the unchanged AA, and AA changes for the V-REGION and per FR-IMGT and CDR-IMGT (see Note 19). It then evaluates the number of changes in 4 "AA class Similarity Degree": "Very similar" (the three properties hydropathy, volume, and physicochemical properties are conserved), "Similar" (one of the three properties is changed), "Dissimilar" (two of the three properties are changed), and "Very dissimilar" (the three properties are changed).
	- (c) "11. V-REGION mutation hot spots" shows the localization of the hot spot patterns (a/t)a (or wa) and (a/g)g(c/t)(a/t) (or rgyw) and their complementary reverse motifs t(a/t) (or tw) and (a/t)(a/g)c(c/t) (or wrcy) in the closest germline V gene and allele. Finally, this section includes a table for the "Correlation between V-REGION mutations, AA changes, codons changes, and hotspot motifs." It provides a synthesis for each mutation: the position in nt, the AA change and its position according to the AA numbering [13, 16, 17], the AA class Change Type, the germline and mutated codon, and the corresponding hotspot if any. An illustration is provided in Fig. 6.
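The four hot spot patterns can be located in any nucleotide string with regular expressions. The following Python sketch is a generic illustration (names are hypothetical): it reports 1-based positions relative to the string that is passed in, not positions in the IMGT unique numbering, and it uses a lookahead so that overlapping occurrences are all found.

```python
import re

# Hot spot motifs from the text, written as character classes.
HOTSPOT_PATTERNS = {
    "wa": "[at]a",
    "rgyw": "[ag]g[ct][at]",
    "tw": "t[at]",
    "wrcy": "[at][ag]c[ct]",
}


def find_hotspots(seq):
    """Return sorted (motif, start, end, matched_nt) tuples, 1-based inclusive."""
    seq = seq.lower()
    hits = []
    for name, pattern in HOTSPOT_PATTERNS.items():
        # The lookahead makes overlapping matches visible to finditer.
        for match in re.finditer(f"(?=({pattern}))", seq):
            start = match.start() + 1
            end = start + len(match.group(1)) - 1
            hits.append((name, start, end, match.group(1)))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

For the four nucleotides "ggtt", the function reports one rgyw hit covering positions 1 to 4 and one tw hit covering positions 3 to 4.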

	- (a) "12. V-REGION and V-(D)-J-REGION" provides nt and AA FASTA sequences with gaps according to the IMGT unique numbering [13, 16, 17] of the V-REGION (nt sequence with access to the IMGT/PhyloGene tool [47]) and of V-J or V-D-J-REGION. In case of out-of-frame junctions V-J or V-D-J-REGION, a note is added, and the V-J or V-D-J-REGION is shown in red.
	- (b) "13. Annotation by IMGT/Automat" provides a full automatic annotation for the V-J-REGION or V-D-J-REGION by IMGT/Automat [39, 40] with IMGT labels (see Subheading 2.3).

3.2.3 Sequence and Result Summary with the Search for Insertions and Deletions in V-REGION

The insertions and/or deletions that are detected by using the "Advanced parameters" and "Search for insertions" are described in the "Result summary" row [22, 23] (Fig. 7) with their localization in FR-IMGT or CDR-IMGT, the number of inserted or deleted nt, and, for insertions, the inserted nucleotides, the presence or absence of frameshift, the V-REGION codon from which the insertion or deletion starts, and the nt position in the user sequence.

3.2.4 Top of Detailed Results for the Analysis of single chain Fragment variable (scFv)

The Advanced functionality "Analysis of single chain Fragment variable (scFv)" [41] allows the analysis of scFv sequences from phage-display combinatorial libraries [48, 49]. IMGT/V-QUEST [22, 23] identifies, localizes, and characterizes the two V-DOMAIN of a scFv (Fig. 8). At the top of the result page, the number of analyzed sequences and the number of identified V-DOMAIN are indicated. V-DOMAIN identifiers are automatically generated by adding to the sequence identifier a suffix composed of an underscore plus a letter for the locus (H, K, L for IGH, IGK, IGL or A, B, D, G for TRA, TRB, TRD, TRG, respectively). Below the list of V-DOMAIN identifiers is a table that indicates the positions of each V-DOMAIN and of the linker in the identified scFv. The detailed analysis of each individual V-DOMAIN is then provided classically.




Fig. 6 IMGT/V-QUEST "Detailed results" [22, 23]. Correlation between V-REGION mutations, AA changes, codons changes, and hotspots motifs in FR1-IMGT of seq\_3 (accession number AJ006165 of IMGT/LIGM-DB [36]). (a) "7. V-REGION translation" and (b) "11. V-REGION mutation hotspots." Only the FR1-IMGT parts of the results are displayed: the two silent mutations g48>a (G16) and t54>c (S18) are shown in green and light blue rectangles, respectively. The nonsilent mutation g74>c (shown in the red rectangle) leads to the AA dissimilar change G25>A (hydropathy and physicochemical properties are not conserved). The codon "ggt" in position 73–75 is changed to "gct." The nt mutation occurs in the hotspot "ggtt" in position 73–76



Fig. 7 IMGT/V-QUEST "Detailed results" [22, 23]. Sequence and Result summary with "Search for insertions and deletions." An insertion of 18 nt is identified in seq\_4 (IMGT/LIGM-DB [36] accession number MG950400) from CDR2-IMGT position 56 (from nt 151 in the submitted sequence). The insertion is shown in capital letters

### 3.3 IMGT/V-QUEST Results for B. Synthesis View

At the top of the page, the parameters used for the analysis are recalled, and the number of analyzed sequences is indicated. The results include a summary table and potentially eight detailed result sections if selected by the user.

3.3.1 Summary Table

The "Summary table" (Fig. 9) displays one row for each input sequence with the corresponding results in 22 columns [22, 23]: (1) the sequence order in the submission; (2) the sequence identifier (Sequence ID); (3) the name of the closest V-GENE and allele; (4) the functionality of the sequence (when found, the presence of stop codons is indicated); (5) the V-REGION score; (6) the V-REGION percentage of identity with, between parentheses, the ratio of number of identical nucleotides (nt)/number of aligned nt; (7) the name of the closest J-GENE and allele; (8) the J-REGION score; (9) the J-REGION percentage of identity and the ratio of number of identical nucleotides (nt)/number of aligned nt; and provided according to the IMGT/JunctionAnalysis results [37, 38] (10) the D-GENE and allele name; (11) the D reading frame; (12) the CDR-IMGT lengths; (13) the AA JUNCTION; and (14) the JUNCTION frame (in the absence of results of IMGT/JunctionAnalysis, only the AA JUNCTION defined by IMGT/V-QUEST [22, 23] is displayed); (15) the JUNCTION nt length and decryption [45]; (16) the number of missing nt in 5' partial V-REGION; (17) the number of uncertain nt; (18) the number of missing nt in 3' partial J-REGION; (19) and (20) the numbers of 5' and 3' trimmed 'n' nucleotides; (21) the length of the sequence; and (22) the sequence analysis category (see Note 12). Clicking on the sequence ID provides the corresponding Detailed View in a separate tab (depending on your browser). Warnings in red may be indicated to highlight specific features of the sequence (Fig. 9).

3.3.2 Detailed Analysis of the JUNCTION

A link to access the IMGT/JunctionAnalysis [37, 38] results is provided for sequences of the same locus. AA translations are aligned on the longest CDR3 length according to the IMGT unique numbering [13] (Fig. 10).

3.3.3 Detailed Result Sections for Alignment of Sequences Expressing the Same V Gene and Allele

In "Alignment with the closest alleles" below the summary table, the V genes and alleles are listed with the number of assigned sequences in parentheses [22, 23]. Click on the associated link to reach the corresponding detailed result sections. They provide six different displays (if all were selected) of alignment of sequences

Fig. 7 (continued) in the FASTA sequence. IMGT/V-QUEST then performs a classical analysis (for gene and allele identification, analysis of the JUNCTION, evaluation of nt mutation, and AA changes) after removal of the insertion(s) and addition of gaps to replace the deletions. The evaluation of the identity percentage added in square brackets includes each insertion or deletion as an additional mutation



Fig. 8 IMGT/V-QUEST "Detailed results" [22, 23]. Top of Detailed results for the "Advanced functionality" "Analysis of single chain fragment variable (scFv)". scFv\_1 and scFv\_2 sequences correspond to the accession numbers AJ006120 and AF117956 in the IMGT/LIGM-DB database [36]. The 5' V-DOMAIN are VH and 3' V-DOMAIN are V-KAPPA for both scFv. The detailed results are then provided for each domain



Fig. 9 IMGT/V-QUEST Synthesis view [22, 23]. (a) The 12 first columns of the "Summary table": the four analyzed sequences are shown in the "V-GENE and allele name" order in the Summary table. The sequence IDs are accession numbers of IMGT/LIGM-DB [36]. The hyperlinks give access to the corresponding results in "A Detailed view." The two sequences assigned to IGHV3-9 and the two sequences assigned to IGHV5-51 will be, respectively, aligned together in the detailed result sections 1 to 6. In the column "J-GENE and allele," a warning "(a)" in red indicates that other IGHJ genes and alleles may be solutions for the sequence 3 (not shown). (b) The last 12 columns of the "Summary table": the V-DOMAIN of sequence 1 is partial: 5 nt are missing in the 3' part of the J-REGION

that express the same V gene and alleles: "1. Alignment for V-GENE," "2. V-REGION alignment according to the IMGT unique numbering" [13], "3. V-REGION translation," and three different formats for the "V-REGION protein display." Section "7. V-REGION most frequently occurring AA per position and per FR-IMGT and CDR-IMGT" shows the most frequent AA in sequences expressing the same V genes and alleles per FR-IMGT and CDR-IMGT and per position according to the IMGT unique numbering [13].

### 3.4 IMGT/V-QUEST Output for Excel File

"Excel file" allows the users to open and save a spreadsheet including the results of the IMGT/V-QUEST analysis [22, 23]. The file contains 11 sheets or 12 for the Advanced Functionality "Analysis of single chain Fragment variable (scFv)" [41] (see Subheading 4.3 for the detail of their content in IMGT/HighV-QUEST sequence analysis results).


Fig. 10 IMGT/V-QUEST Synthesis view [22, 23]. Results of IMGT/JunctionAnalysis for four IGH junctions [37, 38]: "Analysis of the JUNCTIONs" displays for the four junctions, the sequences of the IMGT labels 3' V-REGION, N1, D-REGION, N2, and 5' J-REGION. The mutated nt are underlined. "Translation of the JUNCTIONs" displays the four junctions aligned per position according to the IMGT unique numbering [13]

### 4 IMGT/HighV-QUEST

IMGT/HighV-QUEST [24–27] is the high throughput version of IMGT/V-QUEST [22, 23]. It is freely available for academics, but it requires the user's registration. This allows the tool to automatically notify the users of the availability of the results. A link to the "New user" form is provided in the IMGT/HighV-QUEST welcome page. When the user logs in, the tool uses reCAPTCHA (https://developers.google.com/recaptcha) to protect the site from spam and abuse.

IMGT/HighV-QUEST provides two main functionalities [24–27]:


Subheading 4.4), and use "Statistics history" page for the download of IMGT clonotypes results (see Subheading 4.5)).

Links to the four pages are displayed on the top of the IMGT/HighV-QUEST web interface.

### 4.1 IMGT/HighV-QUEST Sequence Set Submission

IMGT/HighV-QUEST Search page (Fig. 11) is provided when the user logs in. It includes the four following sections [24–27]:

4.1.1 The Sequence Submission Form

1. Provide an analysis title, select the species (see Note 1), and the receptor type or locus (see Note 2) as for IMGT/V-QUEST.
2. Upload a simple text-formatted file containing your FASTA sequences (up to 1,000,000 of IG or TR rearranged sequences can be submitted in a single run).
3. When an analysis is launched ("Start" button), it is firstly dispatched and queued on the IMGT servers and is then performed depending on the available resources. Choose to be notified by e-mail "when analysis is queued" and/or "when analysis is completed" (selected by default).

4.1.2 Display Results

1. Select Result format: the default "CSV" result format includes 11 (or 12 with the Advanced Functionality "Analysis of single chain Fragment variable (scFv)") CSV files equivalent to those provided by the "Excel file" of IMGT/V-QUEST. Result format "AIRR" [42, 43] (see Note 7) or "Both formats" can be selected.
2. The individual result files (equivalent to IMGT/V-QUEST "Detailed view" in text format) can be included in the results for submissions of maximally 200,000 sequences only.

4.1.3 Advanced Parameters

The analysis can be customized with exactly the same advanced parameters as proposed by IMGT/V-QUEST (see Subheading 3.1.4).

4.1.4 Advanced Functionalities

"Analysis of single-chain Fragment variable (scFv)" [41] can be included in the analysis (default is "no") (see Subheading 3.1.5).

Fig. 11 The IMGT/HighV-QUEST Search page [24–27]

### 4.2 IMGT/HighV-QUEST Analysis History Page: Follow-Up and Download of Results

The "Analysis history" page allows the user to check the status of the submitted analyses [24–27]. A table displays for each of them its title, its status (queued, running, or completed), the submission date, the number of submitted sequences, the species and the receptor type or locus (as selected by the user), and the actions that can be performed. When the analysis is completed, the user can download the results as a single archive file in TXZ format (commonly supported by archive tools for Windows and other operating systems). The availability of the results is guaranteed for two weeks after the analysis is completed. After that, the files can be removed by the system. In that case, "File removed" is indicated in red instead of the archive logo.

A user may delete an analysis at any time except if it is used by the second module "Statistics" of IMGT/HighV-QUEST. In such cases, "Used by Statistics" is indicated in place of the "delete" button.

### 4.3 IMGT/HighV-QUEST Sequence Analysis Results

The content of the TXZ file depends on the selected "Result format" ("CSV," "AIRR," or "Both formats") [24–27]:

1. "CSV" format contains a tar folder (which needs to be extracted by an archive tool) with 11 (or 12) files (equivalent to the results of the Excel file provided by the classical IMGT/V-QUEST) in CSV format, and, if selected in the IMGT/HighV-QUEST Search page, one subfolder with individual result files, in text format for each sequence (equivalent to the classical IMGT/V-QUEST "Detailed view" results (see Subheading 3.2)).

The content of each CSV file is indicated in Table 1 [27].
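For users who prefer to script the extraction rather than use a desktop archive tool, the downloaded TXZ archive can be unpacked with the Python standard library. This is a minimal sketch under the assumption that the archive is an xz-compressed tar file; the function name `extract_results` and the paths are hypothetical, and the names of the extracted files are whatever the archive contains.

```python
import tarfile
from pathlib import Path


def extract_results(archive_path, dest_dir):
    """Unpack an xz-compressed tar archive and list the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), mode="r:xz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

The returned list can then be passed to any CSV reader; the results are only guaranteed to be available for two weeks, so extracting and archiving them locally soon after completion is prudent.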


### 4.4 IMGT/HighV-QUEST Launch Statistics Page for the Evaluation of IMGT Clonotypes

An IMGT clonotype (AA) is defined by a unique V-(D)-J-rearrangement (V and J genes and alleles), with a unique CDR3-IMGT amino acid sequence and the presence of the conserved anchors C 104 and W/F 118 [26]. An IMGT clonotype (AA) is linked to one or more IMGT clonotype (nt): they are defined by a unique V-(D)-J-rearrangement with a unique CDR3-IMGT nucleotide sequence, whose translation corresponds to the CDR3-IMGT of the IMGT clonotype (AA) [26]. When IMGT/HighV-QUEST "Statistics" is launched, the tool evaluates IMGT clonotypes in batches of analyzed sequence sets per locus and provides immunoprofiles for IMGT clonotypes (AA) diversity (number of different IMGT clonotypes per V, D, and J gene and allele) and expression (number of sequences assigned to IMGT clonotypes (AA) per V, D, and J gene and allele) [26]. Moreover, IMGT/HighV-QUEST can perform the comparison of multiple batches and provide the list of the IMGT clonotype (AA), which are common to two or more batches [26].
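The definitions above can be illustrated with a short grouping routine. This is a simplified sketch and not the IMGT/HighV-QUEST algorithm (which also selects representative sequences and applies further filters): the record keys `v`, `j`, `cdr3_nt`, `anchor_104`, and `anchor_118` are hypothetical, and diversity and expression are computed per V gene and allele only.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in t, c, a, g order
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate a nucleotide string; unknown codons become 'X'."""
    nt = nt.lower()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def group_clonotypes(records):
    """Map (V, J, CDR3 AA) -> Counter of CDR3 nt sequences (clonotypes (nt))."""
    clonotypes = defaultdict(Counter)
    for rec in records:
        # Keep only sequences with the conserved anchors C 104 and W/F 118
        if rec["anchor_104"] != "C" or rec["anchor_118"] not in ("W", "F"):
            continue
        cdr3_nt = rec["cdr3_nt"].lower()
        clonotypes[(rec["v"], rec["j"], translate(cdr3_nt))][cdr3_nt] += 1
    return clonotypes


def diversity_and_expression(clonotypes):
    """Per V gene: number of clonotypes (AA) and number of assigned sequences."""
    diversity, expression = Counter(), Counter()
    for (v, _j, _cdr3_aa), nt_counts in clonotypes.items():
        diversity[v] += 1
        expression[v] += sum(nt_counts.values())
    return diversity, expression
```

Two CDR3-IMGT nucleotide sequences that differ only by silent mutations translate to the same amino acid sequence and are therefore grouped as distinct IMGT clonotypes (nt) under a single IMGT clonotype (AA).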


Table 1 The IMGT/HighV-QUEST CSV files with the number of columns and result content [27]


Table 1 (continued)

| File number | Result type | File name | Number of columns | Results content (see Note 20) |
|---|---|---|---|---|
| #10 | | "V-REGION-mutation-hotspots" | 8 | Hotspot motifs (a/t)a, t(a/t), (a/g)g(c/t)(a/t), and (a/t)(a/g)c(c/t) detected in the closest germline V-REGION with their localization in FR-IMGT and CDR-IMGT |
| #11 | "Parameters" | "Parameters" | | Date of the analysis; IMGT/V-QUEST program version, IMGT/V-QUEST reference directory release; parameters used for the analysis: species, receptor type or locus, IMGT reference directory set, advanced parameters, advanced functionalities |
| #12 | Sequence description | scFv | 40 | Available only for the advanced functionality "Analysis of single chain Fragment variable (scFv)," one line per scFv: positions and length of the V-(D)-J-REGION, CDR\_lengths, JUNCTION for the 2 V-DOMAIN of the scFv, positions and length of the linker [41] |

For launching statistics, the following steps should be followed:


### 4.5 IMGT/HighV-QUEST Statistics History Page: Follow-Up and Download of Statistics

The "Statistics history" page allows the user to follow the status of the submitted statistics analysis and to download the results once completed [24–27]. The IMGT/HighV-QUEST statistical output is provided as a zip file.

4.5.1 Results Sections to be Displayed in the User Web Browser

The IMGT/HighV-QUEST statistics output [26] is organized in the sections listed in Table 2 (see also http://www.imgt.org/HighV-QUEST/doc.action#statistical-outputs-results).

The illustration of the content of file 4.2.1 is shown in Fig. 12: it shows the first seven most expressed IMGT clonotypes (AA) of a list of 27,080.

4.5.2 "Data" Directory

Importantly, the archive includes a "data" directory: it contains text files named 'stats\_xxx' where 'xxx' is composed of 'batch name'\_'locus'. They include the list of all the IMGT clonotypes (AA) (that are displayed through html sections of Table 2) and their characteristics separated by tabulations. These files include the fields needed by the external IMGT/StatClonotype [28, 29] tool (see Subheading 5). Their content is described in the IMGT/HighV-QUEST Documentation at http://www.imgt.org/HighV-QUEST/doc.action#datastatsxxx.
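The tab-separated files of the "data" directory can be read with the Python standard library before they are passed to other tools. The sketch below is illustrative only: the helper names are hypothetical, the naming pattern 'stats\_batch name\_locus' is taken from the text (with the assumption that the locus is the last underscore-separated part), and no assumption is made about the column layout, which is described in the IMGT/HighV-QUEST Documentation.

```python
import csv
from pathlib import Path


def parse_stats_name(path):
    """Split a 'stats_<batch name>_<locus>' file name into (batch, locus)."""
    stem = Path(path).stem
    prefix, _, rest = stem.partition("_")
    if prefix != "stats" or "_" not in rest:
        raise ValueError("not a stats_<batch>_<locus> file name: " + stem)
    # The batch name may itself contain underscores; the locus comes last
    batch, locus = rest.rsplit("_", 1)
    return batch, locus


def read_stats(path):
    """Return the rows of a tab-separated stats file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

For example, a file named stats\_S3\_IGH.txt is interpreted as batch "S3" and locus "IGH".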






Fig. 12 Top of the file 4.2.1 'IMGT clonotypes (AA) per Nb' [26]. (a) List of IMGT clonotypes (AA) ordered by decreasing number of assigned sequences. The first seven IMGT clonotypes (AA) are shown. The table provides the Exp. ID (IMGT clonotype (AA) identifier in the set), the numbers of "1 copy," "More than one," and the total. The IMGT clonotype (AA) definition includes the names of the V, D, and J genes and alleles, the CDR3-IMGT length, the AA CDR3-IMGT, and the anchors. The IMGT clonotype (AA) representative sequence is characterized by the identity percentage with the closest V gene and allele, the sequence length, the sequence functionality, and a link to the FASTA sequence. An additional link allows the display of all "1 copy" sequences assigned to the IMGT clonotype (AA). The batch S3 results from the analysis of the run SRR1168790 available on Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra). (b) Example of IMGT clonotypes (nt) linked to the IMGT clonotype (AA) #7 (extracted from file 4.2.2) to which 96 "1 copy" sequences were assigned. Ninety-four of them are assigned to the same IMGT clonotype (nt) with a CDR3-IMGT of 42 nucleotides "gcgtgtgacgtccagacgtcacaatatgtagcttttgactac". Two other IMGT clonotypes (nt) (with one sequence each) are also linked to #7. One shows a mutation (t>c) on the nt 6 of the CDR3 and the second a mutation (t>c) on the nt 33 of the CDR3 (shown in red in the figure)

### 5 IMGT/StatClonotype



Fig. 13 IMGT/StatClonotype Welcome page [28, 29]. The files stats\_S3\_IGH.txt (IMGT/HighV-QUEST set 1) and stats\_S4\_IGH.txt (IMGT/HighV-QUEST set 2) were uploaded from the "data" directory of IMGT/HighV-QUEST statistical output (obtained from the runs SRR1168790 and SRR1168789, respectively, available on Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra)). The range for CDR3-IMGT lengths is 4 to 36. IMGT/HighV-QUEST set 1 includes 27,075 IMGT clonotypes (AA) to which 44,446 sequences were assigned


statistical procedures (not shown).


Fig. 14 IMGT/StatClonotype Multiple testing procedures plots for genes [28, 29]. In the left panel, "IMGT clonotype (AA) diversity," "Single gene," and "Hide null or smallest gene occurrences" were selected (not shown). "Multiple testing procedures plots" displays an interactive line graph on the left and a scatter plot on the right for genes. Hovering the mouse over the interactive line graph on the left allows the display of the exact number of significant differences in proportions, that is 21 for a Type I error α = 0.05 with the multiple testing procedure BH. On the right, the scatter plot shows the coordinates of z-score and log\_10 (SidakSS) for IGHV1-69, for which the difference in proportion is significant whatever the multitesting procedure as indicated in the table. The graphs can be saved in PNG, JPG, or PDF and the tables in CSV format

confidence intervals (CI), for genes (on the top) (Fig. 15) and for alleles (at the bottom) (see Note 27). In synthesis graphs for genes, IMGT gene names are ordered by their positions in the locus with their known functionalities. Below the normalized bar graph are listed the not ordered genes (not shown). They are grouped and shown at the bottom of the gene list in the synthesis graph. The values for the normalized proportions of genes (or alleles) in set 1 and set 2, the differences in proportions, the lower and upper bounds of the confidence intervals for differences in proportions, and the Test interpretation are recorded in "Statistical test results" tab Tables.
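The quantities named in this section (difference in proportions, z-score, and adjusted significance with procedures such as BH or Sidak single-step) can be sketched generically as follows. This is a textbook two-proportion z-test with pooled variance and standard p-value adjustments, offered as an illustration only; it is not claimed to reproduce the exact computations of IMGT/StatClonotype, and all function names are hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Return (difference in proportions, z-score, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se if se > 0 else 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p1 - p2, z, p_value


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def sidak_single_step(p_values):
    """Sidak single-step adjusted p-values."""
    m = len(p_values)
    return [1 - (1 - p) ** m for p in p_values]
```

Here `x1` and `x2` would be, for one gene, the numbers of IMGT clonotypes (AA) (diversity) or of assigned sequences (expression) in set 1 and set 2, and `n1` and `n2` the corresponding totals; a gene is then called significant when its adjusted p-value is below the chosen Type I error α.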

5. "CDR-IMGT lengths" tab [28, 29] displays, in the right panel, interactive bar graphs for set 1 and set 2 showing the distribution of the number of IMGT clonotypes (AA) (for IMGT clonotype (AA) diversity) or of the number of sequences assigned to IMGT clonotypes (AA) (for IMGT clonotype (AA) expression), per CDR-IMGT length (see Note 28). The left panel allows the user to choose the CDR-IMGT (CDR1-IMGT, CDR2-IMGT, or CDR3-IMGT) and to select the length of the CDR-IMGT for the "List of IMGT clonotypes (AA) with selected CDR3-IMGT length" displayed below the bar graphs for sets 1 and 2.

Fig. 15 IMGT/StatClonotype synthesis graph for IMGT clonotype (AA) diversity per V gene [28, 29]. It displays visual comparison of the normalized proportions of IMGT clonotype (AA) diversity of the IGHV genes between sets 1 and 2. For example, the diversity of IMGT clonotypes (AA) expressing the IGHV1-2, IGHV4-39, and IGHV1-69 genes is significantly higher in set 1 than in set 2 whatever the multiple testing procedure. In the left panel, "Single gene" and "Hide null or smallest gene occurrences" were selected. Synthesis graphs are downloadable in PNG, JPG, or PDF format

6. "CDR-IMGT AA Properties" tab displays the distribution of the IMGT classes [46] of the 20 amino acids at CDR-IMGT (CDR1-IMGT, CDR2-IMGT, or CDR3-IMGT) positions in sets 1 and 2 for a given CDR-IMGT length. The left panel allows (1) to select the IMGT classes to be displayed for the amino acids: "20 amino acids," "Physicochemical" (Fig. 16), "Hydropathy," "Volume," "Chemical," "Charge," "Hydrogen donor or acceptor atoms," and "Polarity"; (2) to show results by absolute values (number of occurrences of an amino acid (or IMGT amino acid class) at a given position, for a given CDR length) or percentages; and (3) to modify the length and width of the graphs. The major right panel includes, for each set, a table with numbers (or percentages) of the amino acids (or IMGT amino acid classes) (in rows), at a given position (in columns). The table includes a row for undefined amino acids ("X") and for stop codons. The tables can be downloaded as CSV files. The corresponding graphical representation is shown as an interactive bar graph to visualize the amino acid distribution per position. At the bottom of the page, the variability plots based on the indexes according to "Shannon entropy," "Wu-Kabat variability," or "Simpson index" with tables for numerical values are displayed. Comparisons of two sets are useful in detecting the characteristics of amino acids at positions important for the V domain antibody diversity or, by contrast, for maintaining its structure.

Fig. 16 IMGT/StatClonotype CDR-IMGT AA properties distribution [28, 29]. Examples for the CDR3-IMGT of length 15 in set 1: (a) "20 amino acids" and (b) "Physicochemical"

7. "V-D-J gene associations" tab [28, 29] displays interactive heat maps to represent V-J, V-D, or D-J gene associations in set 1 and set 2. The left panel allows (1) to display the dendrogram for V-J, V-D, or D-J gene association, (2) to get the results with clustering or not, (3) to get the results in normalized values, and (4) to select the color palettes. The major right panel includes the interactive heat maps. If "Results with clustering" is selected, a double Ward hierarchical clustering with Euclidean distance is performed (this classification operates simultaneously on the lines and columns of a matrix intersecting two different types of genes); otherwise, heat maps are shown without dendrograms and ordering. Such an analysis makes it possible to detect genes with similar diversity or expression profiles, which can be further explored for given and/or related specificities in immune repertoire comparative analysis. Under the heat maps, tables crossing the V-J, V-D, or D-J gene occurrences in set 1 and set 2 are given.
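The three variability indexes named in the "CDR-IMGT AA Properties" tab can be computed per position from a set of CDR-IMGT amino acid sequences of the same length. The sketch below uses common textbook definitions, which are assumptions here and may differ in detail from those used by IMGT/StatClonotype: Shannon entropy in bits, Simpson index as the sum of squared frequencies, and Wu-Kabat variability as N × k / n (N sequences, k different amino acids, n occurrences of the most common one).

```python
import math
from collections import Counter


def position_variability(residues):
    """Variability indexes for the amino acids observed at one position."""
    counts = Counter(residues)
    total = sum(counts.values())
    freqs = [count / total for count in counts.values()]
    return {
        "shannon": -sum(p * math.log2(p) for p in freqs),
        "simpson": sum(p * p for p in freqs),
        "wu_kabat": total * len(counts) / max(counts.values()),
    }


def variability_per_position(cdr_sequences):
    """Apply position_variability to each column of equal-length sequences."""
    length = len(cdr_sequences[0])
    if any(len(seq) != length for seq in cdr_sequences):
        raise ValueError("all sequences must have the same CDR-IMGT length")
    return [
        position_variability([seq[i] for seq in cdr_sequences])
        for i in range(length)
    ]
```

A fully conserved position gives a Shannon entropy of 0, a Simpson index of 1, and a Wu-Kabat variability of 1; higher entropy and Wu-Kabat values indicate more diverse positions.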

### 6 IMGT/DomainGapAlign

IMGT/DomainGapAlign [30, 31] analyzes the amino acid sequences of the IG and TR V-DOMAIN (see Note 29). IMGT/DomainGapAlign identifies the closest V and J genes and alleles of the user's amino acid domain sequences by comparison with the IMGT reference directory sets composed of the translations of the germline V and J regions of the genes managed in IMGT/GENE-DB [34]. The reference amino acid sequences are available by querying IMGT/DomainDisplay (IMGT® Home page, http://www.imgt.org). Importantly, IMGT/DomainGapAlign can analyze V-DOMAIN from different species and different loci in a single run. The tool gaps the sequences, numbers the AA of each V-DOMAIN, and provides the delimitations of the FR-IMGT and CDR-IMGT and those of the beta strands and loops by applying the IMGT unique numbering [13]. It also characterizes the amino acid changes (see Note 30).

### 6.1 IMGT/DomainGapAlign Query and Customization of the Analysis

6.1.1 Standard Parameters and Sequence(s)


Select the number of alignments displayed for each V-DOMAIN in the results (default is 5).

5. Check "IMGT Colliers de Perles" [21] to include the IMGT Collier de Perles [18–20] in the results (see Subheading 7)


Fig. 17 IMGT/DomainGapAlign Welcome page [30, 31]


Fig. 18 IMGT/DomainGapAlign Results [30, 31]. Top of the result page: the VH domain of daclizumab 3nfp\_H chain (PDB code 3nfp of IMGT/3Dstructure-DB [30, 50, 51]) is compared with the Homo sapiens reference directory. It is aligned with the human IGHV1-46\*01 and IGHJ4\*01 genes and alleles


A second table is displayed for J genes and alleles with the species, the IMGT J gene and allele name, the number of the domain, the Smith-Waterman alignment score, the identity percentage, and the overlap.


Below are displayed two additional parallel tables: on the left the "AA changes in strands and loops" and on the right the "AA changes in FR-IMGT and CDR-IMGT" with the number of different AA, the description of the AA change with the "AA class Change Type" (+) or not (-) (for hydropathy, volume and physicochemical characteristics [46] according to the AA IMGT classes), and "AA class Similarity Degree" (very similar, similar, dissimilar, and very dissimilar).

5. IMGT Colliers de Perles [18–20] (see Subheading 7) are shown, if selected, on one or two layers, without or with AA change positions shown in pink circles (or squares for CDR-IMGT anchors).
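The "AA class Similarity Degree" reported for each AA change follows from counting how many of the three properties (hydropathy, volume, physicochemical) change class. The sketch below illustrates that rule; the class memberships are an assumption transcribed here from the IMGT amino acid classes [46] and should be checked against that reference before any real use, and the function name `similarity_degree` is hypothetical.

```python
# Assumed IMGT amino acid classes (to be verified against reference [46])
PROPERTIES = {
    "hydropathy": {
        "hydrophobic": "ACILMFWV", "neutral": "GHPSTY", "hydrophilic": "RNDQEK",
    },
    "volume": {
        "very small": "AGS", "small": "NDCPT", "medium": "QEHV",
        "large": "RILKM", "very large": "FWY",
    },
    "physicochemical": {
        "aliphatic": "AILV", "proline": "P", "tyrosine": "Y",
        "tryptophan": "W", "glycine": "G", "phenylalanine": "F",
        "sulfur": "CM", "hydroxyl": "ST", "acidic": "DE",
        "amide": "NQ", "basic": "RHK",
    },
}
DEGREES = ("very similar", "similar", "dissimilar", "very dissimilar")


def class_of(aa, classes):
    """Name of the class that contains the amino acid (one-letter code)."""
    for name, members in classes.items():
        if aa in members:
            return name
    raise ValueError("unknown amino acid: " + aa)


def similarity_degree(germline_aa, changed_aa):
    """Return (similarity degree, list of properties that are not conserved)."""
    changed = [
        prop for prop, classes in PROPERTIES.items()
        if class_of(germline_aa, classes) != class_of(changed_aa, classes)
    ]
    return DEGREES[len(changed)], changed
```

Under these assumed classes, the change G>A is classified as dissimilar because hydropathy and physicochemical classes change while the volume class is conserved, which agrees with the G25>A example of Fig. 6.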

### 7 IMGT/Collier-de-Perles

The IMGT/Collier-de-Perles tool [21] generates "IMGT Colliers de Perles" [18–20]. For V-DOMAIN, IMGT Colliers de Perles are obtained on one or two layers, provided that the V-DOMAIN (AA) sequence is gapped according to the IMGT unique numbering [13] (see Note 32). Resulting IMGT Colliers de Perles show the standardized delimitation of FR-IMGT and CDR-IMGT, and of beta strands with their orientation in the IG and TR V-DOMAIN, allowing the visualization of the amino acids, which are important for a 3D structural configuration and bridging the gap between sequences and structures.

Fig. 19 IMGT/DomainGapAlign Results [30, 31]. Bottom of the result page for VH domain of daclizumab 3nfp\_H chain (PDB code 3nfp of IMGT/3Dstructure-DB [30, 50, 51]): the CDR-IMGT lengths are [8.8.9] with a total of three AA changes. The FR-IMGT lengths are [25.17.38.11] with a total of 14 AA changes

### 7.1 IMGT/Collier-de-Perles Launched from IMGT Sequence Analysis Tools

	- 2. Starting from a V-DOMAIN amino acid sequence, use IMGT/DomainGapAlign [30, 31] (see Subheading 6) to generate the IMGT Colliers de Perles and select "IMGT Colliers de Perles" in the submission form (see Note 33).

### 7.2 IMGT/Collier-de-Perles Submission Interface

	- 1. Select the "Domain type" ("Variable (V)"), the number of layers for the IMGT Collier de Perles representation (1 or 2) (see Note 34).
	- 2. Select the "CDR-IMGT color type" [46] according to the locus of the sequence (1 for IGH, TRB, or TRD sequences and 2 for IGK, IGL, TRA or TRG sequences) and the "Background color," which will be applied to the FR-IMGT positions (see Note 35).
	- 3. Enter the CDR3-IMGT length.
	- 4. Enter the gapped AA sequence without any header.
	- 5. In case of detected amino acid insertions compared with the IMGT unique numbering for V domain [13], provide in "Amino acid insertions" the position that precedes the insertion, its length in AA, and the numbering label for each inserted position.
	- 6. A title for the resulting IMGT Collier de Perles can be optionally provided.
	- 7. Click on "Show" to launch the tool.
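Before entering a gapped AA sequence and its CDR3-IMGT length in the form, the FR-IMGT and CDR-IMGT lengths can be checked with a few lines of code. This sketch assumes the standard V-DOMAIN delimitations of the IMGT unique numbering [13] (FR1-IMGT 1–26, CDR1-IMGT 27–38, FR2-IMGT 39–55, CDR2-IMGT 56–65, FR3-IMGT 66–104, CDR3-IMGT 105–117, FR4-IMGT 118–128), a gapped string with exactly one character per position 1 to 128 (so a CDR3-IMGT of at most 13 AA and no insertions), and gaps written as dots or dashes. The function names are hypothetical.

```python
# (label, first position, last position) in the IMGT unique numbering
REGIONS = (
    ("FR1-IMGT", 1, 26), ("CDR1-IMGT", 27, 38), ("FR2-IMGT", 39, 55),
    ("CDR2-IMGT", 56, 65), ("FR3-IMGT", 66, 104), ("CDR3-IMGT", 105, 117),
    ("FR4-IMGT", 118, 128),
)


def region_lengths(gapped_aa, gap_chars=".-"):
    """Count the amino acids (non-gap characters) in each FR-IMGT and CDR-IMGT."""
    if len(gapped_aa) != 128:
        raise ValueError("expected one character per IMGT position 1-128")
    return {
        name: sum(1 for ch in gapped_aa[start - 1:end] if ch not in gap_chars)
        for name, start, end in REGIONS
    }


def lengths_label(lengths, prefix):
    """Format lengths in the bracketed style, e.g. '[8.8.9]' for the CDR-IMGT."""
    values = [str(n) for name, n in lengths.items() if name.startswith(prefix)]
    return "[" + ".".join(values) + "]"
```

The value reported for CDR3-IMGT is the one to enter in the "CDR3-IMGT length" field; sequences with a longer CDR3-IMGT or with insertions need the additional positions handled separately.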

### 7.3 IMGT/Collier-de-Perles Results

The IMGT Collier de Perles for a V-DOMAIN [18–21] displays the graphical representation of a V-DOMAIN with one position (1 AA) per bead (circle or square). Numbers allow an easy delimitation of the FR-IMGT, of the CDR-IMGT, and of the beta strands and the localization of the conserved amino acids. The anchor positions of CDR-IMGT are shown as squares (see Subheading 2.2). The hatched positions represent gaps according to the IMGT unique numbering for V domain [13]. AA written in red letters indicate the five conserved positions in V-DOMAIN (1st-CYS 23, CONSERVED-TRP 41, hydrophobic 89, 2nd-CYS 104 and J-TRP or J-PHE 118). CDR-IMGT are colored according to "IMGT CDR-IMGT color type" [46] of the corresponding locus and FR-IMGT according to the "Background color" (see Note 35) selected by the user. The orientations of the nine beta-strands are indicated at the bottom of the IMGT/Collier-de-Perles. Illustrations of IMGT/Collier-de-Perles output are shown in Fig. 21.

Fig. 20 The IMGT/Collier-de-Perles Welcome page [21]

Fig. 21 IMGT Colliers de Perles for V-DOMAIN [18–21]. (a–d) Background color is "50% Hydrophobic positions," and Proline (P) is in yellow [46]. (a) IMGT Collier de Perles on one layer generated from the IMGT/V-QUEST [22, 23] analysis of a VH (nt) (accession number X81732 of IMGT/LIGM-DB [36]). (b) IMGT Collier de Perles on two layers of a VH with hydrogen bonds between the amino acids of the C, C', C'', F, and G strands and those of the CDR-IMGT (daclizumab 3nfp\_H, PDB code 3nfp of IMGT/3Dstructure-DB [30, 50, 51]). (c, d) IMGT Colliers de Perles with AA changes of a V-LAMBDA domain generated from the IMGT/DomainGapAlign [30, 31] analysis (translation of the AF063723 IMGT/LIGM-DB accession number), (c) on one layer, and (d) on two layers. (e, f) IMGT/Collier-de-Perles [21] results on one layer for the entry code p00149 from IMGT/2Dstructure-DB (e) with background color "IGH 80% hydropathy classes" [46] and (f) with background color "IGH 80% physicochemical classes" [46]

### 8 Notes


deletions in V-REGION," and (4) analysis on complementary reverse sequence with "Search for insertions and deletions in V-REGION" and corrections if any.


in 5' of the V-REGION for the evaluation of the number of mutations") before launching the analysis.


### Acknowledgements

We are very grateful to Gérard Lefranc, founder of the Laboratoire d'ImmunoGénétique Moléculaire LIGM (Université de Montpellier and CNRS), for his unique contribution in the creation of IMGT® in 1989 and his unwavering support for these 30 years. We thank all members of the IMGT® team for their expertise and constant motivation. IMGT® was funded in part by the BIOMED1 (BIOCT930038), Biotechnology BIOTECH2 (BIO4CT960037), 5th PCRDT Quality of Life and Management of Living Resources (QLG2-2000-01287), and 6th PCRDT Information Science and Technology (ImmunoGrid, FP6 IST-028069) programs of the European Union (EU). IMGT® received financial support from the GIS IBiSA, the Agence Nationale de la Recherche (ANR) Labex MabImprove (ANR-10-LABX-53-01), the Région Occitanie Languedoc-Roussillon (Grand Plateau Technique pour la Recherche (GPTR)), and BioCampus Montpellier. IMGT® is currently supported by the Centre National de la Recherche Scientifique (CNRS), the Ministère de l'Enseignement Supérieur, de la Recherche et de l'Innovation (MESRI), the University of Montpellier, and the French Infrastructure Institut Français de Bioinformatique (IFB) ANR-11-INBS-0013. IMGT® is a registered trademark of CNRS. IMGT® is a member of the International Medical Informatics Association (IMIA) and a member of the Global Alliance for Genomics and Health (GA4GH). This work was granted access to the High Performance Computing (HPC) resources of Meso@LR and of Centre Informatique National de l'Enseignement Supérieur (CINES), to Très Grand Centre de Calcul (TGCC) of the Commissariat à l'Energie Atomique et aux Energies Alternatives (CEA) and to Institut du développement et des ressources en informatique scientifique (IDRIS) under the allocation 036029 (2010-2022) made by GENCI (Grand Equipement National de Calcul Intensif).

### References


Harb Protoc 2011:726–736. https://doi.org/10.1101/pdb.prot5635


Recognit 17:17–32. https://doi.org/10.1002/jmr.647


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# IMGT/3Dstructure-DB: T-Cell Receptor TR Paratope and Peptide/Major Histocompatibility pMH Contact Sites and Epitope

### Marie-Paule Lefranc and Gérard Lefranc

### Abstract

T-cell receptors (TR), the antigen receptors of T cells, specifically recognize peptides presented by the major histocompatibility (MH) proteins, as peptide/MH (pMH), on the cell surface. The structural characterization of the trimolecular TR/pMH complexes is crucial to the fields of immunology, vaccination, and immunotherapy. IMGT/3Dstructure-DB is the three-dimensional (3-D) structure database of IMGT®, the international ImMunoGeneTics information system®. By its creation, IMGT® marks the advent of immunoinformatics, which emerged at the interface between immunogenetics and bioinformatics. The IMGT® immunoglobulin (IG) and TR gene and allele nomenclature (CLASSIFICATION axiom) and the IMGT unique numbering and IMGT/Collier-de-Perles (NUMEROTATION axiom) are the two founding breakthroughs of immunoinformatics. IMGT-ONTOLOGY concepts and IMGT Scientific chart rules generated from these axioms allowed IMGT® to bridge genes, structures, and functions. IMGT/3Dstructure-DB contains 3-D structures of IG or antibodies, TR and MH proteins of the adaptive immune responses of jawed vertebrates (gnathostomata), IG or TR complexes with antigens (IG/Ag, TR/pMH), related proteins of the immune system of any species belonging to the IG and MH superfamilies, and fusion proteins for immune applications. The focus of this chapter is on the TR V domains and MH G domains and the contact analysis comparison in TR/pMH interactions. Standardized molecular characterization includes "IMGT pMH contact sites" for peptide and MH groove interactions and "IMGT paratopes and epitopes" for TR/pMH complexes. Data are available in IMGT/3Dstructure-DB, at the IMGT Home page http://www.imgt.org.

Key words IMGT, T-cell receptor, CDR-IMGT, Major histocompatibility, Paratope, Epitope, TR/pMH, IMGT-ONTOLOGY, Immunoinformatics, IMGT/3Dstructure-DB

### 1 Introduction

The adaptive immune responses were acquired by jawed vertebrates (or gnathostomata) more than 450 million years ago and are found in all extant jawed vertebrate species from fishes to humans [1]. The adaptive immune responses are characterized by a remarkable specificity and memory, which are the properties of the B and T cells owing to an extreme diversity of their antigen receptors [1]. The

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_25, © The Author(s) 2022

specific antigen receptors comprise the immunoglobulins (IG) or antibodies of the B cells and plasma cells [2–5] and the T-cell receptors (TR) [6]. Whereas the IG recognize antigens in their native (unprocessed) form, the TR recognize processed antigens, which are presented as peptides by the highly polymorphic major histocompatibility (MH) proteins (in humans HLA for human leukocyte antigens, encoded by genes in the MHC locus) (Fig. 1). T cells are involved in cell-mediated immune response, against a stress of viral, bacterial, fungal, or tumoral origin and identify antigenic peptides presented by the MH proteins as

Fig. 1 A T-cell receptor (TR)/peptide-major histocompatibility 1 (pMH1) complex. A TR (here, TR-ALPHA\_BETA) is shown (on top, upside down) in complex with an MH (here, MH1) presenting a peptide in its groove [1]. In vivo, a TR is anchored in the membrane of a T cell as part of the signaling T-cell receptor (TcR = TR + CD3). A TR is made of two chains, each comprising a variable domain (V-DOMAIN) at the N-terminal end and a constant domain (C-DOMAIN) at the C-terminal end. The domains are V-ALPHA and C-ALPHA for the TR-ALPHA chain and V-BETA and C-BETA for the TR-BETA chain. An MH1 is made of the I-ALPHA chain with two G-DOMAIN (G-ALPHA1 and G-ALPHA2) and a C-LIKE-DOMAIN (C-LIKE), noncovalently associated with the B2M (a C-LIKE-DOMAIN). In this representation (with G-ALPHA1 on the left, G-ALPHA2 and B2M on the right), the peptide is oriented in the groove from front of the figure to back. The TR/pMH1 complex structure is 3qfj from IMGT/3Dstructure-DB (http://www.imgt.org). (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

peptide/MH (pMH) on the cell surface [1]. The recognition and signal transduction are carried out by the multiprotein bifunctional T-cell receptor (TcR) assembly that comprises the TR responsible for the specific pMH recognition plus the associated transmembrane signaling CD3 proteins [6]. The TcR is itself associated, in the immunological synapse, with the CD4 or CD8 coreceptors, with the activating CD28 and inhibitory CTLA4 costimulatory proteins, with the CD2 adhesion molecule, and with intracellular kinases. The CD8 expressed on most cytotoxic T cells binds the MH class I (MH1) that is expressed ubiquitously on cells of the organism [7]. The CD4 expressed on most helper T cells binds the MH class II (MH2) that is expressed by professional antigen presenting cells (dendritic cells, macrophages, monocytes, and B cells) [7].

IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org, is the global reference in immunogenetics and immunoinformatics [1], founded in 1989 by Marie-Paule Lefranc at Montpellier (Université de Montpellier and CNRS). It is a high quality integrated knowledge resource comprising 7 databases, 17 online tools, and more than 25,000 pages of web resources [8–11]. IMGT® is specialized in the sequences, structures, and genetic data of the IG, TR, and MH of human and other vertebrate species, in the immunoglobulin superfamily (IgSF) and the MH superfamily (MhSF) of vertebrates and invertebrates, and in related proteins of the immune system (RPI), fusion proteins for immune applications, and composite proteins for clinical applications [1, 8–11]. IMGT/3Dstructure-DB [12–14] is the three-dimensional (3-D) structure database of IMGT®. This database provides the standardized IMGT annotation and analysis of the 3-D structures of the TR, pMH, and TR/pMH complexes and comprises detailed molecular characterization and description of their interactions [15–17]. The standardized analysis is based on the concepts of IMGT-ONTOLOGY, the first ontology in immunogenetics and immunoinformatics [18–24]. The IMGT-ONTOLOGY concepts are generated from seven axioms [25–31], of which the CLASSIFICATION axiom (IG and TR gene and allele nomenclature) (see Note 1) at the birth of IMGT® and immunoinformatics [1, 25] and the NUMEROTATION axiom (IMGT unique numbering [7, 26, 27, 32–35] and IMGT Colliers de Perles [28, 29, 36–39]) allow bridging sequences, structures, and functions. The IMGT unique numbering for variable (V) domain includes the IG and TR V-DOMAIN and the V-like domains of IgSF other than IG and TR [32–34]. The IMGT unique numbering for constant (C) domain includes the IG and TR C-DOMAIN and the C-like domains of IgSF other than IG and TR [35].
The IMGT unique numbering for G domain includes the groove (G) domains of the MH G-DOMAIN and the G-like domains of MhSF other than MH (or RPI-MH1Like) [7]. The IMGT/DomainGapAlign tool [13, 40, 41] analyzes the amino acid sequences of the V, C, and G domains using the IMGT unique numbering [7, 34, 35] and provides a direct link to the IMGT/Collier-de-Perles tool [39]. The IMGT Scientific chart rules provide a standardized description of the contact analysis [15–17] and comparison of TR/pMH complexes and their interactions, irrespective of the TR chains and domains, the MH class (MH1 or MH2), or the species (Homo sapiens, Mus musculus, etc.). Eleven "IMGT pMH contact sites" were defined for the comparison of pMH interactions, regardless of the peptide lengths [15–17]. The "IMGT pMH contact sites" visualize the interactions between the amino acids (AA) (see Note 2) of the peptide and those of the MH groove based on the contact analysis. They are a useful asset in peptide vaccine design and epitope prediction, and they precisely identify and visualize AA of the peptide located in the MH2 groove. The standardized "IMGT paratope and epitope" for TR/pMH complexes comprises the TR paratope and the pMH epitope, determined from contact analysis, in IMGT/3Dstructure-DB, at the IMGT Home page http://www.imgt.org.

### 2 TR and MH Standardized Description in IMGT/3Dstructure-DB

### 2.1 TR and MH Chains and Domains

2.1.1 TR Chains and Domains

The TR is made of two chains, an alpha chain (TR-ALPHA) and a beta chain (TR-BETA) for the TR-ALPHA\_BETA receptor and a gamma chain (TR-GAMMA) and a delta chain (TR-DELTA) for the TR-GAMMA\_DELTA receptor [6] (Table 1). Each complete TR chain comprises an extracellular region made up of a V-DOMAIN (for instance, V-ALPHA for the alpha chain) and a C-DOMAIN (for instance, C-ALPHA for the alpha chain), a

### Table 1

IMGT standardized labels for the DESCRIPTION of the T-cell receptors (TR) and of their chains and domains. IMGT® labels (concepts of description) are written in capital letters [1]


<sup>a</sup> The TR chain C-REGION also includes the CONNECTING-REGION (CO), the TRANSMEMBRANE-REGION (TM), and the CYTOPLASMIC-REGION (CY), which are not present in the 3-D structures (IMGT® http://www.imgt.org, IMGT Scientific chart > 1. Sequence and 3D structure identification and description > Correspondence between labels for IG and TR domains in IMGT/3Dstructure-DB and IMGT/LIGM-DB)

connecting region (CONNECTING-REGION (CO)), a transmembrane region (TRANSMEMBRANE-REGION (TM)), and a short cytoplasmic region (CYTOPLASMIC-REGION (CY)) [6, 7] (Fig. 2, Table 1). The TR V domains that are directly involved in the TR/pMH interactions are described in Subheading 2.2.

2.1.2 MH Chains and Domains

The MH1 is formed by the association of a heavy chain (I-ALPHA) and a light chain (beta-2-microglobulin or B2M). The MH2 is a heterodimer formed by the association of an alpha chain (II-ALPHA) and a beta chain (II-BETA) [7] (Table 2). The I-ALPHA chain of the MH1 and the II-ALPHA and II-BETA chains of the MH2 comprise an extracellular region, made of three domains for the MH1 chains and of two domains for the MH2 chains, and CO, TM, and CY regions [7] (Fig. 2, Table 2). The I-ALPHA chain comprises two groove domains (G-DOMAIN), G-ALPHA1 [D1] and G-ALPHA2 [D2], and one C-LIKE domain [D3] [7]. The B2M corresponds to a single C-LIKE domain. The II-ALPHA chain and the II-BETA chain each comprise two domains, G-ALPHA [D1] and C-LIKE [D2] for the II-ALPHA chain, and G-BETA [D1] and C-LIKE [D2] for the II-BETA chain [7] (Fig. 2). Only the extracellular region that corresponds to these domains has been crystallized. The MH G domains that are directly involved in the TR/pMH interactions are described in Subheading 2.3.

### 2.2 TR V Domains

2.2.1 Definition

A V domain [32–34] comprises about 100 AA and is made of nine antiparallel beta strands (A, B, C, C', C'', D, E, F, and G) linked by beta turns (AB, CC', C''D, DE, and EF) or loops (BC, C'C'', and FG) and forming a sandwich of two sheets (Table 3). The sheets are closely packed against each other through hydrophobic interactions giving a hydrophobic core and joined together by a disulfide bridge between first-CYS at position 23 in the B-STRAND in the first sheet and the second-CYS 104 in the F-STRAND in the second sheet [34]. The V domain type includes the V-DOMAIN of the TR (and IG), which corresponds to the V-J-REGION or V-D-J-REGION encoded by V-(D)-J rearrangements [1–6, 36], and the V-LIKE-DOMAIN of the IgSF other than IG and TR [37–44]. In a V-DOMAIN, the three hypervariable loops BC, C'C'', and FG involved in the ligand (antigen for IG or pMH for TR) recognition are designated as complementarity determining regions (CDR-IMGT) [1–6].

2.2.2 IMGT Unique Numbering for V Domain

The V domain strands and loops and their delimitations and lengths are based on the IMGT unique numbering for V domain (V-DOMAIN and V-LIKE-DOMAIN) [33, 34] (Table 3). In the IG and TR V-DOMAIN, the G-STRAND is the C-terminal part of the J-REGION, with J-PHE or J-TRP 118 and the canonical motif F/W-G-X-G at positions 118–121 [1]. The loop length (number of AA (or codons), that is, the number of occupied positions) is a

Fig. 2 T-cell receptor/peptide/MH complexes with MH class I (TR/pMH1) and MH class II (TR/pMH2). (a) 3-D structures of TR/pMH1 and TR/pMH2. (b) Schematic representation of TR/pMH1 and TR/pMH2. The TR (TR-ALPHA and TR-BETA chains), the MH1 (I-ALPHA and B2M chains), and the MH2 (II-ALPHA and II-BETA chains) are shown with the extracellular domains (V-ALPHA and C-ALPHA for the TR-ALPHA chain; V-BETA and C-BETA for the TR-BETA chain; G-ALPHA1, G-ALPHA2, and C-LIKE for the I-ALPHA chain; C-LIKE for B2M; G-ALPHA and C-LIKE for the II-ALPHA chain; G-BETA and C-LIKE for the II-BETA chain), and the connecting, transmembrane, and cytoplasmic regions. [D1], [D2], and [D3] indicate the domains. Arrows indicate the peptide localization in the MH groove made of two G-DOMAIN [7]. In these representations (with G-ALPHA1 on the right, G-ALPHA2 and B2M on the left), the peptide is oriented in the groove from back of the figures to front. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

### Table 2

IMGT-standardized labels for the DESCRIPTION of the major histocompatibility (MH) and of their chains and domains. IMGT® labels (concepts of description) are written in capital letters [1]


<sup>a</sup> The I-ALPHA, II-ALPHA, and II-BETA chains include, at the C-terminal end of the C-LIKE-DOMAIN, the CONNECTING-REGION (CO), the TRANSMEMBRANE-REGION (TM), and the CYTOPLASMIC-REGION (CY), which are not present in the 3-D structures

### Table 3

V domain strands and loops, IMGT positions and lengths, based on the IMGT unique numbering for V domains (V-DOMAIN and V-LIKE-DOMAIN) [33, 34]


<sup>a</sup> IMGT® labels (concepts of description) are written in capital letters

<sup>b</sup> In number of AA (or codons)

<sup>c</sup> See Subheading 2.4

<sup>d</sup> In the IG and TR V-DOMAIN, the G-STRAND (or FR4-IMGT) is the C-terminal part of the J-REGION, with J-PHE or J-TRP 118 and the canonical motif F/W-G-X-G at positions 118–121. The JUNCTION refers to the CDR3-IMGT plus the two anchors second-CYS 104 and J-PHE or J-TRP 118 [1]

crucial and original concept of IMGT-ONTOLOGY. The lengths of the loops BC (or CDR1-IMGT), C'C'' (or CDR2-IMGT), and FG (or CDR3-IMGT) characterize the V-DOMAIN (Table 3). They are delimited by anchor positions (see Note 3). The BC loop (or CDR1-IMGT) comprises positions 27–38, the C'C'' loop (or CDR2-IMGT) positions 56–65, and the FG loop (or CDR3-IMGT) positions 105–117. In a V-DOMAIN, the CDR3-IMGT that encompasses the V-(D)-J junction resulting from V-J or V-D-J rearrangements [1] is more variable in sequence and length than the CDR1-IMGT and CDR2-IMGT that are encoded by the V-REGION only. The lengths of the three loops BC, C'C'', and FG are shown in number of AA (or codons), in brackets and separated by dots. For example, [9.6.9] means that the BC, C'C'', and FG loops (or CDR1-IMGT, CDR2-IMGT, and CDR3-IMGT for a V-DOMAIN) have a length of 9, 6, and 9 AA (or codons), respectively.
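The bracketed notation can be illustrated with a minimal sketch. It assumes that a V domain has already been numbered according to the IMGT unique numbering and is held as a mapping from integer position to one-letter AA, with unoccupied positions absent or marked by a dot. This input format, the function name `cdr_imgt_lengths`, and the toy residues are hypothetical and are not part of any IMGT® tool; loops longer than the position ranges quoted above are not handled.

```python
# Loop delimitations quoted in the text: BC (CDR1-IMGT) 27-38,
# C'C'' (CDR2-IMGT) 56-65, and FG (CDR3-IMGT) 105-117.
CDR_IMGT_RANGES = {
    "CDR1-IMGT": (27, 38),
    "CDR2-IMGT": (56, 65),
    "CDR3-IMGT": (105, 117),
}

GAP_SYMBOLS = {".", "-", ""}  # assumed markers for unoccupied positions


def cdr_imgt_lengths(numbered_domain):
    """Return the CDR-IMGT lengths as a bracketed, dot-separated string.

    numbered_domain: dict {IMGT position (int): one-letter AA}.
    The length of a loop is the number of occupied positions in its range.
    """
    lengths = []
    for start, end in CDR_IMGT_RANGES.values():
        occupied = sum(
            1
            for pos in range(start, end + 1)
            if numbered_domain.get(pos, ".") not in GAP_SYMBOLS
        )
        lengths.append(occupied)
    return "[" + ".".join(str(n) for n in lengths) + "]"


# Toy domain: only the occupancy matters here, the residues are invented
# and the occupied positions are not placed as a real gapped domain would be.
toy_domain = {
    pos: "A"
    for pos in list(range(27, 36)) + list(range(56, 62)) + list(range(105, 114))
}
print(cdr_imgt_lengths(toy_domain))  # prints [9.6.9]
```

Counting occupied positions, rather than subtracting the range limits, mirrors the definition of the loop length given above.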

2.2.3 IMGT Colliers de Perles for V Domain

The V domain nine strands are indicated, with their orientation, in the IMGT Colliers de Perles [28, 29, 32–34, 36–39], which are IMGT 2D graphical representations based on the IMGT unique numbering. IMGT Colliers de Perles of the TR V-ALPHA and V-BETA domains from 1ao7 (see Note 4), a TR/pMH1 3-D structure complex, are shown as examples (Fig. 3). The V-ALPHA and V-BETA domains share the main conserved characteristics of the V-DOMAIN, which are the disulfide bridge between cysteine 23 (first-CYS) and cysteine 104 (second-CYS), and the three other hydrophobic core residues tryptophan 41 (CONSERVED-TRP), leucine (or hydrophobic) 89, and phenylalanine 118 (J-PHE) (see Note 5). In Fig. 3, the V-ALPHA (1ao7\_D chain; [6.6.11]) has a CDR1-IMGT and a CDR2-IMGT of 6 AA and a CDR3-IMGT of 11 AA, whereas the V-BETA (1ao7\_E chain; [5.6.14]) has a CDR1-IMGT, CDR2-IMGT, and CDR3-IMGT of 5, 6, and 14 AA, respectively (Subheading 2.2.2) (see Note 6). In IMGT/3Dstructure-DB, the IMGT genes and alleles that contribute to the V-DOMAIN are determined automatically by IMGT/DomainGapAlign [13, 40, 41], based on the standardized IMGT nomenclature [1, 2, 6] and IMGT unique numbering [34]. Thus, the V-ALPHA of 1ao7\_D corresponds to Homo sapiens TRAV12-2\*02-TRAJ24\*02 and the V-BETA of 1ao7\_E corresponds to Homo sapiens TRBV6-5\*01-(TRBD2)-TRBJ2-7\*01 [16, 17].
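The conserved characteristics listed above lend themselves to a simple check on a numbered domain. The sketch below is only an illustration under stated assumptions: the input is a hypothetical mapping from IMGT position to one-letter AA, the function name `check_conserved` is invented, and the set of hydrophobic AA accepted at position 89 is taken from the legend of Fig. 3 (I, V, L, F, C, M, A). The toy residues are not taken from 1ao7.

```python
# Hydrophobic AA as listed in the Fig. 3 legend (positive hydropathy index).
HYDROPHOBIC = set("IVLFCMA")

# Conserved V-DOMAIN characteristics named in the text:
# position -> (label, allowed one-letter AA).
CONSERVED = {
    23: ("first-CYS", {"C"}),
    41: ("CONSERVED-TRP", {"W"}),
    89: ("hydrophobic 89", HYDROPHOBIC),
    104: ("second-CYS", {"C"}),
    118: ("J-PHE or J-TRP", {"F", "W"}),
}


def check_conserved(numbered_domain):
    """Report, per label, whether the expected AA is found at its position.

    numbered_domain: dict {IMGT position (int): one-letter AA} (assumed format).
    """
    return {
        label: numbered_domain.get(pos) in allowed
        for pos, (label, allowed) in CONSERVED.items()
    }


# Toy input restricted to the five positions of interest (invented data).
toy_domain = {23: "C", 41: "W", 89: "L", 104: "C", 118: "F"}
for label, ok in check_conserved(toy_domain).items():
    print(f"{label}: {'found' if ok else 'missing'}")
```

A missing second-CYS 104, for instance, would be reported as such, which signals that the disulfide bridge between positions 23 and 104 cannot form.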

### 2.3 MH G Domains

2.3.1 Definition

A G domain [7] comprises about 90 AA and is made of a sheet of four antiparallel beta strands linked by turns and of a helix (Table 4); the helix sits on the beta strands, its axis forming an angle of about 40 degrees with the strands [16, 17]. Two G domains are needed to form the MhSF groove made of a "floor" and two "walls" [7]. Each G domain contributes by its four strands

Fig. 3 IMGT/Collier-de-Perles for TR V domain (V-DOMAIN). (a) IMGT/Collier-de-Perles for TR V-ALPHA (chain 1ao7\_D). The CDR-IMGT lengths are [6.6.11]. (b) IMGT/Collier-de-Perles for TR V-BETA (chain 1ao7\_E). The CDR-IMGT lengths are [5.6.14]. AA are shown in the one-letter abbreviation (see Note 2). Positions at which hydrophobic AA (hydropathy index with positive value: I, V, L, F, C, M, A) and tryptophan (W) are found in more than 50% of analyzed sequences are shown in blue, online. All proline (P) are shown in yellow, online. Anchor positions are shown in squares (see Note 3). Arrows indicate the direction of the beta strands [28, 29]. Hatched circles correspond to missing positions according to the IMGT unique numbering for V domain [33, 34]. IMGT color menu for CDR1-IMGT, CDR2-IMGT, and CDR3-IMGT is blue, green, and greenblue, for V-ALPHA, and red, orange, and purple, for V-BETA (see Note 6). IMGT/Collier-de-Perles are shown on one layer (on the left hand side) and two layers (on the right hand side). The IMGT Colliers de Perles on two layers show, in the forefront, the GFCC'C'' strands and, in the back, the ABED strands. Hydrogen bonds (from the IMGT/3Dstructure-DB entry) are shown in green, online. Only those between the AA of the C, C', C'', F, and G strands (in the forefront) and those of the CDR-IMGT are shown here. IMGT/Collier-de-Perles are from IMGT/3Dstructure-DB, http://www.imgt.org. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

and turns to half of the groove floor and by its helix to one wall of the groove [7, 16, 17]. The G domain type includes the G-DOMAIN of the MH [7] and the G-LIKE-DOMAIN of the MhSF other than MH or RPI-MH1Like [7, 45, 46] (see Note 7).

### Table 4

G domain strands, turns, and helix, IMGT positions and lengths, based on the IMGT unique numbering for G domains (G-DOMAIN and G-LIKE-DOMAIN) [7]


<sup>a</sup> IMGT® labels (concepts of description) are written in capital letters

<sup>b</sup> In number of AA (or codons)

<sup>c</sup> See Subheading 2.4

<sup>d</sup> For details on the characteristic Residue@Position and additional positions, see Ref. [7]

<sup>e</sup> Or 9 in some G-BETA

<sup>f</sup> Or 0 in some G-ALPHA2-LIKE [7]


the G-DOMAIN are determined automatically by IMGT/

Fig. 4 IMGT/Collier-de-Perles of MH G domains (G-DOMAIN). (a) MH1 G-ALPHA1 and G-ALPHA2 domains from 1ao7 (I-ALPHA chain 1ao7\_A). (b) MH2 G-ALPHA and G-BETA domains from 1j8h (II-ALPHA chain 1j8h\_A and II-BETA chain 1j8h\_B, respectively). AA positions and gaps (hatched positions) are according to the IMGT unique numbering for G domain [7]. Positions 61A, 61B, and 72A are characteristic of the G-ALPHA2 and G-BETA domains (and are not reported in the G-ALPHA1 and G-ALPHA IMGT/Collier-de-Perles) [7]. IMGT/Collier-de-Perles are from IMGT/3Dstructure-DB, http://www.imgt.org. G-domain terminal hatched positions (MH1 G-ALPHA1 91 and 92 and MH2 G-BETA 90, 91, and 92) are not reported in online IMGT/Collier-de-Perles. The IMGT Colliers de Perles can also be obtained, with the sequences gapped by IMGT/DomainGapAlign [40, 41], using the IMGT/Collier-de-Perles tool [39]. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

DomainGapAlign [13, 40, 41], based on the standardized IMGT nomenclature and numbering [1, 7]. Thus the G-ALPHA1 and G-ALPHA2 of 1ao7\_A are encoded by HLA-A\*0201 [15–17].

### 2.4 Residue@Position and Atom Pair Contacts

"Residue@Position" is an IMGT® concept of numerotation that numbers the position of a given residue (or by extension that of a conserved property AA class [47]), based on the IMGT unique numbering (see Note 8). A "Residue@Position" (R@P) is defined by the position numbering according to the IMGT unique numbering [7, 34, 35], the residue name (in the three-letter abbreviation and/or in the one-letter abbreviation) (see Note 2), the IMGT domain label (Tables 1 and 2), and either the gene and allele name for AA sequences (see Note 1), or the "IMGT chain ID" for 3-D structures. In IMGT/3Dstructure-DB, a "Residue@Position" is described in a "Residue@Position card" (Fig. 5) that provides information on its characteristics (see Note 9) and the list of the other R@P with which it interacts [16, 17]. Each interaction is characterized by the total number of "atom pair contacts" (see Note 10) and, as selected by the user for display, the number of atom pair contacts per type ("noncovalent," "polar," "hydrogen bond," "non polar," "covalent," or "disulfide") and/or per category ("(BB) Backbone/backbone," "(SS) Side chain/side chain," "(BS) Backbone/side chain," and "(SB) Side chain/backbone") [16, 17].
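The four elements that define an R@P, and the per-category counts of atom pair contacts, can be pictured as a small data structure. The following is a sketch for illustration only: the class names, field names, and label layout are hypothetical and do not reproduce the IMGT/3Dstructure-DB schema. The R@P shown is the one of Fig. 5, whereas the partner and the contact counts are invented placeholders. The sketch also assumes that every atom pair contact falls in exactly one of the four categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidueAtPosition:
    """One R@P, described by the four elements listed in the text."""
    position: str  # IMGT unique numbering, kept as text since "61A" occurs
    residue3: str  # three-letter abbreviation
    residue1: str  # one-letter abbreviation
    domain: str    # IMGT domain label, e.g. "G-ALPHA2"
    source: str    # IMGT chain ID (3-D structure) or gene and allele name

    def label(self) -> str:
        return (f"{self.position} {self.residue3} ({self.residue1}) "
                f"{self.domain} {self.source}")


# Categories of atom pair contacts named in the text.
CATEGORIES = ("BB", "SS", "BS", "SB")


@dataclass
class Interaction:
    """Contacts between an R@P and one partner R@P (hypothetical container)."""
    partner: ResidueAtPosition
    per_category: dict  # e.g. {"SS": 3}; categories not listed count as 0

    def total_atom_pair_contacts(self) -> int:
        # Assumption: the four categories partition the atom pair contacts.
        return sum(self.per_category.get(cat, 0) for cat in CATEGORIES)


rp = ResidueAtPosition("61A", "ALA", "A", "G-ALPHA2", "1ao7_A")  # from Fig. 5
placeholder = ResidueAtPosition("0", "XAA", "X", "(Ligand)", "toy_chain")  # invented
contact = Interaction(placeholder, {"BB": 1, "SS": 3, "BS": 1})  # invented counts
print(rp.label())
print(contact.total_atom_pair_contacts())
```

Keeping the position as text rather than as an integer is a deliberate choice, since additional positions such as 61A, 61B, and 72A occur in the G domains.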

### 3 IMGT pMH Contact Analysis

### 3.1 IMGT pMH Contact Sites Definition and Determination

"IMGT pMH contact sites" [15–17] highlight the contacts between the amino acids of a presented peptide and those of the floor and helix walls of the MH groove, in 3-D structures of pMH and TR/pMH complexes [12–14]. The "IMGT pMH contact sites" are visualized in IMGT Colliers de Perles for G-DOMAIN [7]. The "IMGT pMH contact sites" provide a standardized comparison of the interactions between a presented peptide and the MH, regardless of the MH class (MH1 or MH2), the G domain (G-ALPHA1, G-ALPHA2, G-ALPHA, and G-BETA), and the peptide length. The "IMGT pMH contact sites" also allow one to precisely identify the AA that are effectively bound in the MH groove. This is particularly informative for the peptides bound to MH2 as these peptides can be much longer than the actual groove length with the N-terminal and C-terminal ends extending outside the groove [7]. In order to deal with different peptide lengths in the groove, 11 standard "IMGT pMH contact sites" were defined (C1–C11) [15–17] (Fig. 6). They correspond to a theoretical maximum length of 11 AA in the groove. This means that, in 3-D structures, some (usually two or three) "IMGT pMH contact sites" are absent as peptides are shorter than 11 AA (usually nine or eight AA long).




Fig. 5 IMGT Residue@Position card. The "Residue@Position: 61A—ALA (A)—G-ALPHA2—1ao7\_A" is defined by the position numbering ("61A") according to the IMGT unique numbering for G domain [7], the residue name in the three-letter abbreviation and in the one-letter abbreviation for AA ("ALA (A)") (see Note 2), the IMGT domain label (G-ALPHA2) (Table 2) and the IMGT chain ID (1ao7\_A) (see Note 4). The list of atom pair contacts shows that this R@P interacts with 5 R@P of the same domain (G-ALPHA2) and, of interest for the TR/pMH interactions, with 4 R@P of the V-BETA and one of the peptide (Ligand). The "Residue@Position" card is from IMGT/3Dstructure-DB, http://www.imgt.org. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

The peptide binding mode to MH1 is characterized by the N-terminal and C-terminal peptide ends docked deeply with the C1 and C11 contact sites (red and pink, respectively, in the IMGT/Collier-de-Perles) and by the peptide length that mechanically constrains the peptide conformation in the groove. Thus, for a peptide of 10 AA, one "IMGT pMH contact site" is absent (C2), and for a peptide of 9 AA, two "IMGT pMH contact sites" are absent (C2 and C7), whereas for a peptide of 8 AA, three "IMGT pMH contact sites" are absent (C2, C7, and C8) [15–17] (see Note 11).


Fig. 6 Standard "IMGT pMH contact sites." Eleven standard "IMGT pMH contact sites" (C1 to C11) were defined for the standardized analysis and comparison of pMH interactions [16, 17]. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

The peptide binding mode to MH2 is different, with the peptide lying in the groove. Thus, for nine amino acids lying in an MH2 groove, C2 is present but there are no C7 and C8. For a given 3-D structure in IMGT/3Dstructure-DB, the determination of the "IMGT pMH contact sites" combines contact analysis between the peptide and the MH (in a pMH or in a TR/pMH complex), with an interaction scoring function (see Note 12). The MH AA that are automatically selected, with the highest score, are listed and displayed in "IMGT/Collier-de-Perles with pMH contact sites." The characterization of the "IMGT pMH contact sites" based on contact analysis has superseded the previous identification of "pockets" in the MH groove (see Note 13).
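The rules stated above for absent contact sites can be summarized in a short sketch. It encodes only the cases given in the text (MH1 with 8, 9, or 10 AA, MH2 with 9 AA in the groove), plus the 11-AA case implied by the theoretical maximum, and it raises an error for any other combination. The function name `sites_for_groove` is hypothetical. Pairing the AA in the groove with the remaining sites in sequential order is an assumption; it is consistent with the legends of Figs. 7 and 8, where glycine G4 (MH1) and asparagine N5 (MH2) both correspond to C5. The scoring function (Note 12) is not modeled, so a site returned here may still be undisplayed in a given 3-D structure.

```python
ALL_SITES = [f"C{i}" for i in range(1, 12)]  # C1 to C11

# Absent "IMGT pMH contact sites" per MH class and number of AA in the groove.
ABSENT_SITES = {
    ("MH1", 11): set(),                # implied by the theoretical maximum
    ("MH1", 10): {"C2"},
    ("MH1", 9): {"C2", "C7"},
    ("MH1", 8): {"C2", "C7", "C8"},
    ("MH2", 9): {"C7", "C8"},
}


def sites_for_groove(mh_class, groove_aa):
    """Map the expected contact sites to the AA bound in the groove.

    mh_class: "MH1" or "MH2"; groove_aa: one-letter sequence of the AA
    located in the groove (not necessarily the whole peptide for MH2).
    """
    key = (mh_class, len(groove_aa))
    if key not in ABSENT_SITES:
        raise ValueError(f"no rule stated in the text for {key}")
    present = [site for site in ALL_SITES if site not in ABSENT_SITES[key]]
    # Assumption: sites and AA are paired in order along the groove.
    return dict(zip(present, groove_aa))


mh1 = sites_for_groove("MH1", "LLFGYPVYV")  # 9-AA peptide of 1ao7 (Fig. 7)
mh2 = sites_for_groove("MH2", "YVKQNTLKL")  # 9 AA in the groove of 1j8h (Fig. 8)
print(list(mh1))  # C2 and C7 are absent
print(list(mh2))  # C7 and C8 are absent
```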

### 3.2 Access to IMGT pMH Contact Sites


Fig. 7 "IMGT pMH contact sites" between MH1 and a 9-AA peptide. (a) "IMGT pMH contact sites" for MH1 (human HLA-A\*0201, 1ao7\_A) and peptide 1ao7\_C. The numbers 1–9 refer to the peptide AA numbering (LLFGYPVYV). C1–C11 refer to the "IMGT pMH contact sites" (there are no C2 and C7 in agreement with MH1 binding a 9-AA peptide). In that 3-D structure, there is no C5 because the glycine G4 score is too low. The G-ALPHA1 and G-ALPHA2 AA positions assigned automatically to the "IMGT pMH contact sites" are listed. (b) "IMGT/Collier-de-Perles with pMH contact sites." View is from above the cleft, with G-ALPHA1 on top and G-ALPHA2 on bottom. (c) Groove 3-D structure. The groove is shown with and without the peptide (on the left and right hand side, respectively). The IMGT Color menu for "IMGT pMH contact sites" is used in (a), (b), and (c). (a) and (b) are from IMGT/3Dstructure-DB, http://www.imgt.org. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

3.2.1 IMGT pMH Contact Sites for pMH1

An example of "IMGT pMH contact sites" for pMH1 is shown in Fig. 7. In that 3-D structure of a TR/pMH1 complex (1ao7), the groove made by the G-ALPHA1 and G-ALPHA2 of the I-ALPHA chain (1ao7\_A) binds a 9-AA peptide (1ao7\_C). "IMGT pMH contact sites" results provide first a table, which shows the positions 1–9 of the peptide AA (each AA is clickable, giving access to its Residue@Position card). Nine of the 11 C1–C11 contact sites are displayed, C2 and C7 being absent, in agreement with a 9-AA peptide bound in an MH1 groove (see Subheading 3.1). The G-ALPHA1 and G-ALPHA2 AA positions that contribute to each "IMGT pMH contact site" are listed. For example, G-ALPHA1 59 and G-ALPHA2 73, 77, and 81 contribute to the "IMGT pMH contact site" C1 that predominantly interacts with leucine (L) 1 of the peptide (N-terminal end) (Fig. 7a). The "IMGT pMH contact sites" are displayed in "IMGT/Collier-de-Perles with pMH contact sites" (Fig. 7b). Clicking on one residue in the IMGT/Collier-de-Perles gives access to its "IMGT Residue@Position card" (see Subheading 2.4). The 3-D structure, with or without peptide, is shown in Fig. 7c.

3.2.2 IMGT pMH Contact Sites for pMH2

An example of "IMGT pMH contact sites" for pMH2 is shown in Fig. 8. In that 3-D structure of a TR/pMH2 complex (1j8h), the groove made by the G-ALPHA (of the II-ALPHA chain) (1j8h\_A) and the G-BETA (of the II-BETA chain) binds a 13-AA peptide (1j8h\_C). "IMGT pMH contact sites" results provide first a table, which shows the AA 1–9 in the groove (each AA is clickable, giving access to its Residue@Position card). However, in contrast to MH1 (see Subheading 3.2.1), the nine AA shown in Fig. 8a only correspond to the central part of the peptide. Indeed, the peptide bound to MH2 is longer than the length of the groove and extends outside its N-terminal and C-terminal ends, as the MH2 groove is "open" at both ends [7]. One major breakthrough of the "IMGT pMH contact sites" is the identification of the AA that are located in the MH2 groove [15–17]. Whereas the peptide (1j8h\_C) is 13 AA long (PKYVKQNTLKLAT), the "IMGT pMH contact sites" results allow one to determine that the 9 AA in the MH2 groove are YVKQNTLKL (Fig. 8a). Nine of the 11 C1–C11 contact sites are displayed, C7 and C8 being absent, in agreement with 9 AA inside an MH2 groove (see Subheading 3.1). The G-ALPHA and G-BETA AA positions that contribute to each "IMGT pMH contact site" are listed. They are visualized in the "IMGT/Collier-de-Perles with pMH contact sites" (Fig. 8b). Clicking on one residue in the IMGT Colliers de Perles gives access to its "IMGT Residue@Position card" (see Subheading 2.4). The 3-D structure, with or without peptide, is shown in Fig. 8c.
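Once the AA in the groove are known, the parts of the peptide that extend outside the groove follow directly. The sketch below illustrates this with the 1j8h peptide quoted above. It does not determine the AA in the groove, which result from the contact analysis in IMGT/3Dstructure-DB; it only locates a given groove segment within the full peptide. The function name `groove_flanks` is hypothetical.

```python
def groove_flanks(peptide, groove_aa):
    """Split a peptide into (N-terminal overhang, AA in groove, C-terminal overhang).

    groove_aa must be supplied (for instance, from the "IMGT pMH contact sites"
    results); it is only located here, not predicted.
    """
    start = peptide.find(groove_aa)
    if start < 0:
        raise ValueError("groove segment not found in peptide")
    end = start + len(groove_aa)
    return peptide[:start], groove_aa, peptide[end:]


n_term, core, c_term = groove_flanks("PKYVKQNTLKLAT", "YVKQNTLKL")
print(n_term, core, c_term)  # prints PK YVKQNTLKL AT
```

For this peptide, two AA extend outside the groove at each end, in agreement with an MH2 groove that is open at both ends.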

### 4 IMGT/3Dstructure-DB Domain Pair Contacts

### 4.1 IMGT/3Dstructure-DB Domain Pair Contacts (Overview)

"IMGT/3Dstructure-DB Domain pair contacts (overview)" (Fig. 9) is accessed by clicking on "Domain contacts (overview)" of "Contact analysis" in an IMGT/3Dstructure-DB card. The example shown in Fig. 9 is that of the TR/pMH1 structure 1ao7. Eight "Domain pair contacts" are of interest for TR/pMH interactions, two for pMH1 (see Subheading 4.1.1) and six for TR/pMH1 (see Subheading 4.1.2). Similar results are obtained for the

Fig. 8 "IMGT pMH contact sites" between MH2 and 9 AA in the groove. (a) "IMGT pMH contact sites" for MH2 (human HLA-DRA\*0101\_HLA-DRB1\*0401) (1j8h\_A-1j8h\_B) and a 13 AA long peptide (1j8h\_C). The numbers 1–9 refer to the AA numbering in the groove (YVKQNTLKL) as determined by the "IMGT pMH contact sites." C1–C11 refer to the "IMGT pMH contact sites" (there are no C7 and C8 in agreement with MH2 binding 9 AA in the groove). In that 3-D structure, there is no C5 because the asparagine N5 score is too low. The G-ALPHA and G-BETA AA positions assigned automatically to the "IMGT pMH contact sites" are listed. (b) "IMGT/Collier-de-Perles with pMH contact sites." View is from above the cleft, with G-ALPHA on top and G-BETA on bottom. (c) Groove 3-D structure. The groove is shown with and without the peptide (on the left and right hand side, respectively). The IMGT Color menu for "IMGT pMH contact sites" is used in (a), (b), and (c). (a) and (b) are from IMGT/3Dstructure-DB, http://www.imgt.org. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

TR/pMH2 structure, e.g., 1j8h (the only difference being the names of the G-DOMAIN, G-ALPHA and G-BETA instead of G-ALPHA1 and G-ALPHA2) (not further detailed here).

4.1.1 Domain Pair Contacts for pMH1 Interactions

The two domain pair contacts for pMH1 interactions are "(Ligand)/G-ALPHA1" and "(Ligand)/G-ALPHA2" (Fig. 9). Thus, for the pMH1 interactions in 1ao7, the domain pair "(Ligand)/G-ALPHA1" shows that 26 residues are involved,


Fig. 9 IMGT/3Dstructure-DB Domain pair contacts (overview). The IMGT/3Dstructure-DB entry is the TR/pMH1 3-D structure 1ao7. The domain partners considered are designated as "Unit 1" and "Unit 2." The number of residue pair contacts, the number of residues involved (total, from Unit 1 and from Unit 2), the number of total atom pair contacts, and, as selected by the user for the display, the number of contacts per type and/or by category are provided. "(Ligand)" refers to the peptide. Two red frames highlight the domain pair contacts for pMH interactions. Two blue rectangles highlight the domain pair contacts for TR/pMH interactions, three for V-ALPHA and three for V-BETA. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

8 AA of the peptide (Ligand) interacting with 18 AA of G-ALPHA1 (creating 29 residue pair contacts with a total of 305 atom pair contacts). Similarly, the domain pair "(Ligand)/G-ALPHA2" shows that 24 residues are involved, 8 AA of the peptide interacting with 16 AA of G-ALPHA2 (creating 26 residue pair contacts with a total of 281 atom pair contacts).
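The counts reported in the overview (residue pair contacts, residues involved per unit, and atom pair contacts) are related in a simple way, which the following sketch makes explicit. The `PairContact` structure, the `summarize` function, and the toy contact list are hypothetical and are not taken from 1ao7; only the final arithmetic check uses the numbers quoted above.

```python
from typing import NamedTuple


class PairContact(NamedTuple):
    unit1_residue: str   # residue of Unit 1, e.g., a peptide ("Ligand") position
    unit2_residue: str   # residue of Unit 2, e.g., a G-ALPHA1 position
    atom_contacts: int   # number of atom pair contacts for this residue pair


def summarize(contacts):
    """Derive overview counts from a list of residue pair contacts."""
    unit1 = {c.unit1_residue for c in contacts}
    unit2 = {c.unit2_residue for c in contacts}
    return {
        "residue_pair_contacts": len(contacts),
        "residues_unit1": len(unit1),
        "residues_unit2": len(unit2),
        "residues_total": len(unit1) + len(unit2),
        "atom_pair_contacts": sum(c.atom_contacts for c in contacts),
    }


# Hypothetical toy data, for illustration only.
toy = [
    PairContact("pep1", "G1_59", 12),
    PairContact("pep1", "G1_63", 7),
    PairContact("pep2", "G1_63", 9),
    PairContact("pep2", "G1_66", 4),
]
summary = summarize(toy)

# Consistency of the totals quoted in the text for 1ao7.
ligand_g_alpha1_total = 8 + 18   # residues involved in "(Ligand)/G-ALPHA1"
ligand_g_alpha2_total = 8 + 16   # residues involved in "(Ligand)/G-ALPHA2"
```

A residue can take part in several residue pair contacts, which is why the number of residue pair contacts (29 for "(Ligand)/G-ALPHA1") can exceed the number of residues involved (26).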

4.1.2 Domain Pair Contacts for TR/pMH1 Interactions

The six domain pair contacts for TR/pMH1 interactions include three domain pairs involving V-ALPHA and three domain pairs involving V-BETA (Fig. 9). The TR/pMH1 interactions in 1ao7 are the following:



Fig. 10 pMH1 interactions. (a) Interactions "G-ALPHA1/(Ligand)" of 1ao7. (b) Interactions "G-ALPHA2/(Ligand)" of 1ao7. Clicking on a R@P link gives access to the corresponding IMGT Residue@Position card. "(Ligand)" refers to the peptide. The contact analysis of the TR/pMH 3-D structure 1ao7 is from IMGT/3Dstructure-DB, http://www.imgt.org. The "IMGT pMH contact sites" for G-ALPHA1 (a) and G-ALPHA2 (b) were added on the right hand side of the figure, for a comparison with Fig. 7. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)


4.3 IMGT Paratope and Epitope

IMGT paratope and epitope are concepts of the "SpecificityType" in IMGT-ONTOLOGY [48–50]. Paratope, or "antigen-binding site," identifies the part of the V-DOMAIN of an IG or antibody ("IG paratope") or of a TR ("TR paratope") that, respectively, recognizes (binds to) the antigen (Ag) or the peptide/major histocompatibility (pMH) ("epitope" or "antigenic determinant") [49]. Epitope, or "antigenic determinant," identifies the part of the antigen (Ag) or of the peptide/major histocompatibility


Fig. 11 TR V-ALPHA/pMH1 interactions. (a) Interactions "V-ALPHA/G-ALPHA1" of 1ao7. (b) Interactions "V-ALPHA/G-ALPHA2" of 1ao7. (c) Interactions "V-ALPHA/(Ligand)" of 1ao7. Clicking on a R@P link gives access to the corresponding IMGT Residue@Position card. "(Ligand)" refers to the peptide. The contact analysis of the TR/pMH 3-D structure 1ao7 is from IMGT/3Dstructure-DB, http://www.imgt.org. The IMGT color menu is blue, green, and greenblue for CDR1-IMGT, CDR2-IMGT, and CDR3-IMGT, respectively (see Note 6). (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)



(pMH) that is recognized by the paratope of the V-DOMAIN of an IG or antibody or of a TR, respectively [50].

The amino acids that constitute the TR paratope belong to the paired V domains of a TR (V-alpha and V-beta for a TR-alpha\_beta, V-gamma and V-delta for a TR-gamma\_delta), and more precisely to the CDR-IMGT [49]. Among the CDR-IMGT, the CDR3-IMGT, which results from the V-J and V-D-J junction, plays the major role in TR/pMH interactions [15–17]. T-cell epitopes are usually identified as "linear" when referring to the processed peptide (p) presented in the groove of the MH proteins. However, in IMGT-ONTOLOGY, the "T-cell epitope" concept is identified as "discontinuous" as it comprises amino acids of the MH that bind to the TR V domains [50]. Thus, in a TR/pMH complex, the AA in contact at the interface between the TR and the pMH constitute the paratope on the TR surface and the epitope on the pMH surface (Fig. 13). In IMGT/3Dstructure-DB, the "IMGT paratope and epitope" for TR/pMH complexes are determined by combining contact analysis (Table 5) with an interaction scoring function, which roughly complies with the true mean energy ratio [15–17]. A standardized description of the "IMGT paratope and

Fig. 12 TR V-BETA/pMH1 interactions. (a) Interactions "V-BETA/G-ALPHA1" of 1ao7. (b) Interactions "V-BETA/G-ALPHA2" of 1ao7. (c) Interactions "V-BETA/(Ligand)." Clicking on a R@P link gives access to the corresponding IMGT Residue@Position card. "(Ligand)" refers to the peptide. The contact analysis of the TR/pMH 3-D structure (1ao7) is from IMGT/3Dstructure-DB, http://www.imgt.org. R@P belonging to the CDR1-IMGT are in red and those belonging to the CDR3-IMGT are in purple according to the IMGT color menu (see Note 6). (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)



epitope" is provided. Thus, the pMH1 epitope of 1ao7 (Fig. 13) comprises AA of G-ALPHA1 and G-ALPHA2 (1ao7\_A) (HLA-A\*0201) and of the peptide (1ao7\_C, Tax peptide 11–19). Twenty-four AA form the pMH1 epitope: sixteen from the MH1 (six from G-ALPHA1 and ten from G-ALPHA2) and eight from the peptide. Each AA that belongs to the epitope is characterized by its position according to the IMGT unique numbering for G domain [7] and by its position in the peptide.
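The position labels used in Fig. 13, such as "E (58G1\_A)" or "GGRP (112.1V1-114V1\_E)", follow a regular pattern that can be split into its parts programmatically. The sketch below is one possible way to do so; the regular expression and the function name `parse_label` are assumptions made for illustration and are not part of IMGT/3Dstructure-DB. Positions are kept as strings on purpose, since the IMGT unique numbering includes positions such as 61A or 112.1 that are not plain integers.

```python
import re

# AA letters, then "(startDOMAIN[-endDOMAIN]_chain)"; DOMAIN is e.g. G1, G2, V1.
LABEL = re.compile(
    r"^(?P<aa>[A-Z]+)\s*\("
    r"(?P<start>\d+(?:\.\d+)?[A-Z]?)\s*(?P<start_domain>[GV]\d)"
    r"(?:-(?P<end>\d+(?:\.\d+)?[A-Z]?)\s*(?P<end_domain>[GV]\d))?"
    r"_(?P<chain>\w+)\)$"
)


def parse_label(label: str) -> dict:
    """Split a paratope/epitope label into residues, positions, domain, chain."""
    match = LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    parts = match.groupdict()
    # A single-residue label has no explicit end position.
    if parts["end"] is None:
        parts["end"] = parts["start"]
        parts["end_domain"] = parts["start_domain"]
    return parts


single = parse_label("E (58G1_A)")
stretch = parse_label("AAH (61G2-62G2_A)")
paratope = parse_label("GGRP (112.1V1-114V1_E)")
```

Note that the number of residues in a stretch cannot be derived by subtracting the end positions: "AAH" spans positions 61, 61A, and 62, and "GGRP" spans 112.1, 112, 113, and 114.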

The TR paratope of 1ao7 (T-cell receptor A6) (Fig. 13) comprises AA of V-ALPHA (1ao7\_D chain) and of V-BETA (1ao7\_E chain). Sixteen AA of the TR (11 from V-ALPHA and 5 from V-BETA) form the paratope. The IMGT/Collier-de-Perles (Fig. 3) show that nine out of the 11 AA of the V-ALPHA paratope belong to the CDR-IMGT (D27, R28, G29, and Q37 to the CDR1-IMGT; Y57 to the CDR2-IMGT; T108, D109, W113, and G114 to the CDR3-IMGT) and that the five AA of the V-BETA paratope belong to the CDR3-IMGT and are localized at the top of the loop. Clicking on "Epitope IMGT Residue@Position cards" and "Paratope IMGT Residue@Position cards" (Fig. 13) provides detailed contacts for each AA belonging to the epitope and paratope, respectively. IMGT paratope and epitope are


Fig. 13 "IMGT paratope and epitope" of an IMGT TR/pMH complex. Each AA that belongs to the pMH epitope is characterized by its position in the peptide or in the G domains according to the IMGT unique numbering [7]. For example, "E (58G1\_A)" means that the glutamate (E) is at position 58 of the G-ALPHA1 domain (1ao7\_A), and "AAH (61G2-62G2\_A)" means that the alanine (A), alanine (A), and histidine (H) are at positions 61, 61A, and 62 of the G-ALPHA2 domain (1ao7\_A) (see also Fig. 4a). Each AA that belongs to the TR paratope is characterized by its position in the V domains according to the IMGT unique numbering [32–34]. Thus, "DRG (27V1-29V1\_D1)" means that the aspartate (D), arginine (R), and glycine (G) are at positions 27, 28, and 29 of the V domain 1 of 1ao7\_D (V-ALPHA) (see also Fig. 3a). In the same way, "GGRP (112.1V1-114V1\_E)" means that the glycine (G), glycine (G), arginine (R), and proline (P) are at positions 112.1, 112, 113, and 114 of the V domain 1 of 1ao7\_E (V-BETA) (see also Fig. 3b). The "IMGT paratope and epitope" analysis of the TR/pMH1 3-D structure (1ao7) is from IMGT/3Dstructure-DB, http://www.imgt.org. (With permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org)

determined automatically for the TR/pMH 3D structures in IMGT/3Dstructure-DB (see Note 15). Clicking on the "References and links" tag in the IMGT/3Dstructure-DB card gives access to external links. Links to the Immune Epitope Database (IEDB) [51, 52] are provided. Clicking on the "IMGT numbering comparison" displays, per chain, a table providing the correspondence between the IMGT unique numbering per domain and the PDB numbering of the chain entry.

### 4.4 Bridging IMGT Clonotype (AA), TR-Mimic Antibody and Paratope

4.4.1 IMGT Clonotype (AA) Repertoire and TR Paratope

Next-generation sequencing (NGS) data, analyzed by IMGT/HighV-QUEST, provide a standardized characterization of the TR repertoire diversity and expression in normal (e.g., before and after vaccination) and pathological situations. The results include the IMGT variable (V), diversity (D), and joining (J) gene and allele names (identified at the nucleotide level) and the identification of the JUNCTION lengths and amino acid sequences, which, together, characterize the IMGT clonotypes (AA) [53]. TR V domain analysis, using IMGT nomenclature [1, 6] and IMGT unique numbering [34] for both NGS and 3-D structures of TR/pMH complexes

Table 5 Amino acids of the TR paratope (V-ALPHA, V-BETA) and of the pMH1 epitope (G-ALPHA1, peptide, and G-ALPHA2) of 1ao7 based on Contact analysis from IMGT/3Dstructure-DB, http://www.imgt.org [12–14]. (A) TR V-ALPHA paratope – pMH1 epitope. (B) TR V-BETA paratope – pMH1 epitope. Amino acid positions of the TR V-ALPHA and V-BETA are according to the IMGT unique numbering for V-DOMAIN [33, 34]. Amino acid positions of the G-ALPHA1 and G-ALPHA2 are according to the IMGT unique numbering for G-DOMAIN [7]. The list of contact analysis below is complete. Differences observed in visual displays are due to filters, based on contact types or scores



[7, 12–14], provides a paradigm for bridging IMGT clonotypes (AA) of NGS repertoires and TR paratope CDR-IMGT (particularly CDR3-IMGT) delimitations.
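To make the notion of an IMGT clonotype (AA) more concrete, the sketch below groups annotated sequences by V, D, and J gene and allele names together with the JUNCTION amino acid sequence, and counts the sequences assigned to each group. This is a simplified illustration and not the IMGT/HighV-QUEST implementation; the record layout (`v`, `d`, `j`, `junction_aa`) and the toy records, including the made-up junction sequences, are hypothetical.

```python
from collections import Counter


def clonotype_key(record: dict) -> tuple:
    """Key of a clonotype (AA): gene and allele names plus JUNCTION (AA).

    The JUNCTION length is implied by the JUNCTION sequence itself.
    """
    return (record["v"], record.get("d"), record["j"], record["junction_aa"])


def count_clonotypes(records) -> Counter:
    """Count the sequences assigned to each clonotype (AA)."""
    return Counter(clonotype_key(r) for r in records)


# Hypothetical toy records (junctions are invented for the example).
records = [
    {"v": "TRBV6-5*01", "d": "TRBD2*01", "j": "TRBJ2-7*01", "junction_aa": "CASSAGGAYEQYF"},
    {"v": "TRBV6-5*01", "d": "TRBD2*01", "j": "TRBJ2-7*01", "junction_aa": "CASSAGGAYEQYF"},
    {"v": "TRBV6-5*01", "d": "TRBD2*01", "j": "TRBJ2-7*01", "junction_aa": "CASSTGGAYEQYF"},
]
counts = count_clonotypes(records)
top_key, top_count = counts.most_common(1)[0]
```

Because the same IMGT nomenclature and numbering are used for NGS repertoires and for 3-D structures, a key of this kind can in principle be compared with the CDR3-IMGT of a TR paratope described in IMGT/3Dstructure-DB.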

4.4.2 TR-Mimic Antibody Paratope

The 3-D structure of an engineered TR-mimic antibody and that of a TR targeting peptide-HLA were recently compared: the IG Fab 3M4E5 and the TR 1G4\_a58b61 are receptors targeting the NY-ESO-1 peptide presented by HLA-A\*02:01 [54]. The pMH contacts of the NY-ESO-1 peptide SLLMWITQC with the MH1 HLA-A\*02:01 groove are similar in the two peptide-HLA complexes, as expected [55]. The paratope of the IG Fab (TR-mimic antibody) includes amino acids of VH [8.8.12] and V-LAMBDA [9.3.9], whereas the paratope of the TR, classically, includes amino acids of V-BETA [5.6.12] and V-ALPHA [6.7.13]. The IMGT unique numbering for V-DOMAIN [34] was used for the four domains in the description of the paratopes. Similarly, in both 3-D structures, the IMGT unique numbering for G-DOMAIN [7] was used for the description of the pMH epitope, which comprises the G-ALPHA1 helix, the peptide, and the G-ALPHA2 helix [55].

### 5 Availability and Citation

Authors who use IMGT® databases and tools are encouraged to cite this article and to quote the IMGT® Home page, http://www.imgt.org. Online access to IMGT® databases and tools is freely available for academics and under licenses and contracts for companies.

### 6 Notes

1. Since the creation of IMGT® in 1989, at New Haven during the tenth Human Genome Mapping Workshop (HGM10), the standardized classification and nomenclature of the IG and TR of human and other vertebrate species have been under the responsibility of the IMGT Nomenclature Committee (IMGT-NC). In 1995, following the first demonstration online of the nucleotide database IMGT/LIGM-DB at the ninth International Congress of Immunology in San Francisco, IMGT-NC became the World Health Organization-International Union of Immunological Societies (WHO-IUIS)/IMGT Nomenclature SubCommittee for IG and TR. IMGT® gene and allele names are based on the concepts of classification of "Group," "Subgroup," "Gene," and "Allele," generated from the IMGT-ONTOLOGY CLASSIFICATION axiom. The IMGT® gene nomenclature for IG and TR genes was approved at the international level by the Human Genome Organisation (HUGO) Nomenclature Committee (HGNC) in 1999 and by the WHO-IUIS [56, 57]. The IMGT® IG and TR gene names [2, 6, 58, 59] are the official reference for the vertebrate genome projects and, as such, have been entered in IMGT/GENE-DB, the IMGT® gene database [60], in National Center for Biotechnology Information (NCBI) Gene [61], in European Bioinformatics Institute (EBI) Ensembl, and in the Vega Genome Browser (Wellcome Trust Sanger Institute).



15. In January 2021, IMGT/3Dstructure-DB [12–14] contained 239 TR/pMH complexes (of which 186 are TR/pMH1 and 53 are TR/pMH2) [15–17].


### Acknowledgements

We thank Patrice Duroux for the IMGT/3Dstructure-DB database and associated tools computing management, Anjana Kushwaha for the IMGT/3Dstructure-DB entries biocuration, and François Ehrenmann for help with the 3-D structure figures. We are grateful to the IMGT® team for its expertise and constant motivation. We thank Cold Spring Harbor Protocol Press for the pdf of the IMGT Booklet available in IMGT references. IMGT® is a registered trademark of CNRS. IMGT® is a member of the International Medical Informatics Association (IMIA) and a member of the Global Alliance for Genomics and Health (GA4GH). All figures are used with permission from M-P. Lefranc and G. Lefranc, LIGM, Founders and Authors of IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org.

Funding: IMGT® was funded in part by the BIOMED1 (BIOCT930038), Biotechnology BIOTECH2 (BIO4CT960037), fifth PCRDT Quality of Life and Management of Living Resources (QLG2-2000-01287), and sixth PCRDT Information Science and Technology (ImmunoGrid, FP6 IST-028069) programs of the European Union (EU). IMGT® received financial support from the GIS IBiSA, the Agence Nationale de la Recherche (ANR) Labex MabImprove (ANR-10-LABX-53-01), the Région Occitanie Languedoc-Roussillon (Grand Plateau Technique pour la Recherche (GPTR)), and BioCampus Montpellier. IMGT® is currently supported by the Centre National de la Recherche Scientifique (CNRS), the Ministère de l'Enseignement Supérieur, de la Recherche et de l'Innovation (MESRI), and the University of Montpellier.

### References


immunoglobulins (IG), T cell receptors (TR), and conventional genes. Cold Spring Harb Protoc 6:604–613. https://doi.org/10.1101/pdb.ip82


Mol Biol 882:605–633. https://doi.org/10.1007/978-1-61779-842-9\_33


D339–D343. https://doi.org/10.1093/nar/gky1006


John Wiley and Sons, Hoboken N.J, pp A.1O.1–A.1O.23


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Chapter 26

# ARResT/Interrogate Immunoprofiling Platform: Concepts, Workflows, and Insights

### Nikos Darzentas

### Abstract

ARResT/Interrogate was built within the EuroClonality-NGS working group to meet the challenge of developing and applying assays for the high-throughput sequence-based profiling of immunoglobulin (IG) and T-cell receptor (TR) repertoires. We herein present basic concepts, outline the main workflow, delve into EuroClonality-NGS-specific aspects, and share insights from our experiences with the platform.

Key words Immunoglobulin, Antigen receptors, IG, TR, Bioinformatics, Pipeline, Sequence analysis

### 1 Introduction

Immunoglobulins (IG) and T-cell receptors (TR) are highly adaptive molecular receptors involved in antigen recognition and enormously variable immunological responses. The advent of sequence-based profiling of IG and TR repertoires has been instrumental for understanding such responses, both normal and pathologic, the latter encompassing a wide range of diseases with an underlying immune cause. This unprecedented capability has also brought along novel and unique challenges [1]; this chapter will cover the bioinformatic one, from the perspective of the ARResT/ Interrogate immunoprofiling platform.

ARResT (abbreviation of Antigen Receptors Research Tool, http://bat.infspire.org/arrest) comprises a handful of tools developed over the years within focused groups. It originated in the days of Sanger sequence analysis toward delineating subsets of stereotyped antigen receptor sequences in chronic lymphocytic leukemia (CLL) [2, 3].

ARResT/Interrogate [http://arrest.tools/interrogate] was built from the ground up within the EuroClonality-NGS working group [http://euroclonality.org] to initially support the development of the group's NGS assays and eventually to apply them in

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_26, © The Author(s) 2022

research and clinical applications [4, 5]. ARResT/Interrogate is able to: automatically paired-end-join and concatenate input files; use spreadsheet sample sheets to make data and metadata available to itself and the user; identify, tag, trim, and report on primer sequences (and primer dimers); annotate and identify all rearrangement types (or 'junction classes') of all IG/TR loci; offer powerful interactive tools to the user for mining results; identify, filter, and use the EuroClonality-NGS central in-tube quality/quantification control (cIT-QC, or spike-ins) for abundance normalization; generally support EuroClonality-NGS assays, also with bespoke analytical and visual functionalities; and provide detailed logs and feedback to the user.

ARResT/Interrogate will be continuously updated, and therefore bioinformatic and user interface details included herein may not stay the same over time. We advise readers/users to seek the latest information on the ARResT/Interrogate browser [http://arrest.tools/interrogate] and on the EuroClonality-NGS website [http://euroclonality.org]. For the same reason, we chose not to focus on application-specific methods, also because they are covered in other chapters in this book. Still, the general concepts and workflows included in this chapter should remain valid, as should the Notes below, which draw on our years of experience both developing and using ARResT/Interrogate.

1.1 Design

ARResT/Interrogate consists of the pipeline and the browser (its user interface). The browser features four main "panels" for logically organized and ordered steps (Fig. 1):


There is also the "HQ" panel that offers introductory text and specific notes and advice (in separate "tabs"). There are more panels to serve special applications, e.g., clonality assessment, but they are by default hidden and may be accessed by switching "user modes" with the widget on the top left (set at "Interrogate.simple" in Fig. 1).


Fig. 1 ARResT/Interrogate browser (user interface), with "panels," "tabs," and the "user mode" selection widget

1.2 Primers

ARResT/Interrogate is able to identify, tag, trim, and report on primer sequences (and primer dimers), including making the results available for a fully interactive analysis. The trimming allows for less artificial sequence data to be processed more accurately and more efficiently, while the reporting allows for the primer-based results to be directly used for quality control and development. See also Notes 2 and 3 about trimming.

1.3 Rearrangements and Junctions

ARResT/Interrogate is able to annotate and identify all rearrangement types of all IG/TR loci. We call these rearrangement types "junction classes." They include "complete," e.g., IG's VJ:Vh-(Dh)-Jh; "incomplete," e.g., TR's DJ:Db-Jb; and "other," e.g., IG's Vk-Kde or intron-Kde (Table 1). For junction classes with no biologically relevant junctional anchors (i.e., residues that define the CDR3 region, as per IMGT), we decided to introduce virtual ones; this enables consistent and informative results across all junction classes, assisting the user to focus on the most variable part of the rearrangement. For the D genes in DJ, VD, and DD incomplete junction classes, we use recombination signal sequence (RSS) heptamers: the last triplet of the heptamer in 5′ and the first triplet of the heptamer in 3′. For the intron RSS in the IGK locus, we use a CCC triplet between the EuroClonality-NGS primer and the RSS heptamer, while for Kde in the IGK locus the final triplet after the RSS heptamer and before the EuroClonality-NGS primer is used. In the majority of cases, these anchors are far enough from the junctional point to allow for nucleotide trimming without affecting their presence, but ARResT/Interrogate is in any case able to report rearrangements even with the anchors trimmed or mutated. This is also true for normal anchors in complete rearrangements.

Anchors overview:

5′ side of junction:

- V genes: C aa = TG[CT] nt.
- D genes: V aa = GT[any] nt, the last triplet of the 5′ heptamer.
- intron: P aa = CCC nt, a triplet between primer and heptamer.

3′ side of junction:

- J genes: W aa = TGG nt or F aa = TT[CT] nt.
- D genes: H aa = CA[CT] nt, the first triplet of the 3′ heptamer.
- Kde: R aa = CGA nt, final triplet after heptamer and before primer.
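The anchors overview above can be written down as a small lookup of nucleotide patterns. The sketch below is only an illustration of how such triplets could be tested; the dictionary layout and the function name `is_anchor` are hypothetical and do not reflect the internal implementation of ARResT/Interrogate.

```python
import re

# Anchor triplets per junction side and segment type, as listed in the overview.
# Each entry: (amino acid(s), nucleotide pattern for the triplet).
ANCHORS = {
    "5'": {
        "V": ("C", re.compile(r"TG[CT]")),
        "D": ("V", re.compile(r"GT[ACGT]")),   # last triplet of the 5' heptamer
        "intron": ("P", re.compile(r"CCC")),
    },
    "3'": {
        "J": ("W/F", re.compile(r"TGG|TT[CT]")),
        "D": ("H", re.compile(r"CA[CT]")),     # first triplet of the 3' heptamer
        "Kde": ("R", re.compile(r"CGA")),
    },
}


def is_anchor(side: str, segment: str, triplet: str) -> bool:
    """Return True if the triplet matches the anchor of this side and segment."""
    _, pattern = ANCHORS[side][segment]
    return pattern.fullmatch(triplet.upper()) is not None


checks = {
    "V_TGT": is_anchor("5'", "V", "TGT"),
    "J_TTC": is_anchor("3'", "J", "ttc"),
    "J_TGA": is_anchor("3'", "J", "TGA"),
    "D5_GTG": is_anchor("5'", "D", "GTG"),
    "D3_CAC": is_anchor("3'", "D", "CAC"),
}
```

A strict pattern match such as this one would miss anchors that are trimmed or mutated, which, as stated above, ARResT/Interrogate is nevertheless able to handle.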

### Table 1

Junction classes supported by ARResT/Interrogate and the EuroClonality-NGS amplicon and capture assays


### 2 Materials

	- 2. Sample sequences should be uploaded in FASTQ (preferably) or FASTA format, preferably also compressed in gzip format (extension ".gz"). Also see Note 4.
	- 2. The ARResT/Interrogate sample sheet offers a number of predefined columns (i.e., ARResT/Interrogate expects these column names for the information to be used properly) and the possibility to add many others with flexible column names.
	- 3. The most important predefined columns (again, do not change the column names or use them for other purposes) are:
		- (a) Sample: required—unique for every sample and part or whole of the sample's sequence filenames.
		- (b) Cells: number of cells, based on amount of DNA of, e.g., patient, to be used for quantification.
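Before uploading, it can help to check a sample sheet against the two predefined columns described above. The following sketch assumes a comma-separated sheet with a header row and treats column names case-insensitively; both are assumptions, as are the function name `check_sample_sheet` and the toy sheet and filenames. It is not part of ARResT/Interrogate.

```python
import csv
import io


def check_sample_sheet(sheet_text: str, filenames: list) -> list:
    """Return a list of problems found in a sample sheet (empty if none)."""
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(sheet_text))
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        row = {k.strip().lower(): (v or "").strip() for k, v in raw.items() if k}
        sample = row.get("sample", "")
        if not sample:
            problems.append(f"line {line_no}: missing sample name")
        elif sample in seen:
            problems.append(f"line {line_no}: duplicate sample '{sample}'")
        elif not any(sample in name for name in filenames):
            problems.append(f"line {line_no}: no sequence file contains '{sample}'")
        seen.add(sample)
        cells = row.get("cells", "")
        if cells and not (cells.isdigit() and int(cells) > 0):
            problems.append(f"line {line_no}: cells must be a positive integer, got '{cells}'")
    return problems


# Hypothetical sheet with one predefined pair of columns and one flexible column.
sheet = "sample,cells,tissue\nsample1,15000,blood\nsample1,abc,marrow\nsample3,,blood\n"
files = ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"]
problems = check_sample_sheet(sheet, files)
```

In this toy example the second row reuses a sample name and has a non-numeric cell count, and the third row names a sample for which no sequence file was provided.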


Fig. 2 Sample sheet example. Red are predefined columns, and blue are flexible columns


### 3 Methods

3.1 A Basic Workflow

	- (a) Create a new analysis or select an existing one, otherwise the "default" will be used, which is OK. Also see Note 6.
	- (b) Upload sample sequences in compressed FASTQ/A format (see Subheading 2).
	- (c) The default scenario ("ARResT.profile") should work fine in any case. One may select a different user mode or pipeline scenario, especially when deploying EuroClonality-NGS assays (see Subheading 3.2).
	- (d) One may use one's own primer sequences by uploading them in uncompressed FASTA format and selecting them under "scenario options" (there are instructions on the user interface). In general, please study primers (e.g., see Notes 3, 7, and 8 as to why). Also see Note 9.
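For users who receive uncompressed FASTQ files, compression to the ".gz" format mentioned above can be done with any standard tool. A minimal sketch using only the Python standard library is given below; the function name `gzip_file` and the toy read are hypothetical, and the example writes to a temporary directory.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(src) -> Path:
    """Compress a file to <name>.gz next to the original and return the new path."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Toy FASTQ file with a single made-up read, written to a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
fastq = tmp_dir / "sample1_R1.fastq"
fastq.write_text("@read1\nACGT\n+\nIIII\n")

compressed = gzip_file(fastq)
with gzip.open(compressed, "rt") as handle:
    roundtrip = handle.read()
```

Primer sequences, in contrast, are expected in uncompressed FASTA format, as stated in step (d).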
	- (a) Select results in the drop-down widget, select filtering level (see Note 11), click "load results".
	- (b) One may browse the run and sample reports, paying attention to quality control (QC) information, alarms (with our hints and tips for possible causes and solutions), and basic numbers such as the percentage of reads with junction (see Note 7), which are also color-coded to provide visual feedback. Alarms include:
		- Low number of reads "5′ primed in R1" or "3′ primed in R2", indicating wrong or missing primers, noisy reads (i.e., compromised primer alignment), etc.
		- Low number of reads "3′ primed in R1" or "5′ primed in R2", indicating long or trimmed amplicons (with FR1 or FR2 primers, for example) not covered by the sequenced read length, or wrong or missing primers.
		- High number of reads "short": sequence artifacts are generally an explanation, and if primers were used with the pipeline, one may also see an alarm about primer dimers.
	- (a) The main series of widgets are split into "select" on the left and "filter" on the right (Fig. 3).
	- (b) Note that if samples are "QC-failed" (see Note 12), they will not be available here by default; uncheck appropriate widget in "samples options" to include them back in.


Fig. 3 "Select" and "filter" widgets on the "questions" panel


Fig. 4 "Table" visualization of the "questions" panel


Fig. 5 The "minitable" for tabulation and downloading of selected features and their most popular sequences

	- (a) One may retrieve and download all stored sequences of the clonotype in the "sequences" tab. The sequence variation that will undoubtedly appear in the retrieved sequences could be biological variability (e.g., somatic hypermutation) or technical noise including PCR or sequencing errors or amplification by different primers (see Note 3). Also, the retrieved sequences are not necessarily all possible sequences from the original sample, as we mainly avoid storing sequences supported by a single read unless they are the only representative of a combination of features.

	- (b) The "tests" tab (also accessible via "runs tests" in the "questions" panel) offers the possibility to annotate the sequences in different ways. When checking the "Interrogate" option, one will get more color-coded information than what is available in "questions," including D genes and more detailed segmentation. "AssignSubsets" provides access to ARResT/AssignSubsets for assignment of IGH rearrangements to major stereotyped subsets of chronic lymphocytic leukemia (CLL) [http://bat.infspire.org/arrest/assignsubsets] [3].

### 3.2 EuroClonality-NGS Assays

We will now provide more information on EuroClonality-NGS-specific aspects.

	- (a) The EuroClonality-NGS amplicon assay uses eight tubes for the eight EuroClonality-NGS primer sets: IGH-VJ-FR [1–3], IGH-DJ, IGK-VJ-Kde, intron-Kde, TRB-VJ, TRB-DJ, TRD, and TRG.
	- (b) It is useful for ARResT/Interrogate to know the primer set of the sample, and therefore we try to auto-detect it; otherwise the sample is considered "pooled." If the sample is not pooled, the primer set names listed above should be used in the sample name, bookended by \_, e.g., sample1\_IGH-VJ-FR1\_[...] or sample2\_IGK-VJ-Kde\_[...]. If the primer set is still detected wrongly, the sample name should be edited so as to steer the detection either way. Another way is to use a sample sheet and its "primer set" column (see Subheading 2.3).
	- (c) Starting from version 1.90, ARResT/Interrogate specifically tags rearrangements that do not match the primer set (e.g., a VJ:Vg-Jg in an IGK tube) as contamination; this is one of the advantages of the EuroClonality-NGS assays using one primer set per tube.
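The naming convention described in step (b) can be checked before upload with a few lines of code. The sketch below looks for a primer set name bookended by underscores in a sample name and otherwise reports "pooled." It is a hypothetical helper, not the auto-detection used by ARResT/Interrogate, and the expansion of the IGH-VJ-FR primer sets into FR1, FR2, and FR3 is an assumption based on the example names given above.

```python
# Primer set names from the list above; the three IGH-VJ-FR variants are assumed.
PRIMER_SETS = (
    "IGH-VJ-FR1", "IGH-VJ-FR2", "IGH-VJ-FR3", "IGH-DJ",
    "IGK-VJ-Kde", "intron-Kde", "TRB-VJ", "TRB-DJ", "TRD", "TRG",
)


def detect_primer_set(sample_name: str) -> str:
    """Return the primer set bookended by '_' in the name, else 'pooled'."""
    hits = [ps for ps in PRIMER_SETS if f"_{ps}_" in sample_name]
    # Zero or several matches are both treated as undetermined.
    return hits[0] if len(hits) == 1 else "pooled"


detected = {
    name: detect_primer_set(name)
    for name in (
        "sample1_IGH-VJ-FR1_S1",
        "sample2_IGK-VJ-Kde_S2",
        "sample3_S3",
        "sample4_TRB-VJ",        # not bookended on the right
    )
}
```

A name such as sample4\_TRB-VJ is reported as "pooled" here because the primer set name is not followed by an underscore, which illustrates why the bookending matters.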
	- (a) If one uses spike-ins and wants to access normalized values (i.e., number of cells instead of number of reads), it is also necessary to provide the number of cells (derived from the DNA amount) in the sample, e.g., ~15,000 cells from 100 ng of DNA; this will be used as the denominator for the ratio calculation. There is a widget in the "processing" panel and the same in the "questions" panel (Fig. 6), which sets the value for all samples; if different values need to be set for different samples, this needs to be done with a sample sheet and its "cells" column (see Subheading 2.3). Do not include spike-in cells in those numbers.
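The figure of ~15,000 cells from 100 ng of DNA, and the role of the cell number as denominator, can be illustrated with a short calculation. The sketch below assumes about 6.6 pg of DNA per diploid human cell and a simple proportional use of the spike-in reads; both are assumptions made for illustration, and the exact normalization performed by ARResT/Interrogate may differ. All read and spike-in numbers are hypothetical.

```python
PG_DNA_PER_DIPLOID_CELL = 6.6  # assumption: approximate DNA content of a human cell


def cells_from_dna(nanograms: float) -> int:
    """Estimate the number of cells corresponding to an amount of DNA."""
    return round(nanograms * 1000 / PG_DNA_PER_DIPLOID_CELL)


def normalized_fraction(clonotype_reads: int, spike_in_reads: int,
                        spike_in_cells: int, sample_cells: int) -> float:
    """Convert reads to cells via the spike-ins, then divide by the sample cells."""
    reads_per_cell = spike_in_reads / spike_in_cells
    clonotype_cells = clonotype_reads / reads_per_cell
    return clonotype_cells / sample_cells


estimated_cells = cells_from_dna(100)

# Hypothetical run: 1,200 spike-in reads for 120 spike-in cells, 5,000 clonotype reads.
fraction = normalized_fraction(
    clonotype_reads=5000, spike_in_reads=1200, spike_in_cells=120, sample_cells=15000
)
```

In this toy case, ten reads correspond to one cell, so the clonotype represents about 500 cells, or roughly 3.3% of the 15,000 cells in the sample. As noted above, spike-in cells are not counted in the sample cell number.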


Fig. 6 Messages and widgets related to cIT-QC (spike-ins)

	- (a) It is important to select the appropriate user modes to properly analyze data from EuroClonality-NGS assays. One of the automations is the preset of appropriate pipeline scenarios in the "processing" panel with the aforementioned primers and spike-ins.
	- (b) Switch to the "Interrogate.EC-NGS marker identification" user mode for the assays described in [6]. These assays involve one primer set per tube, plus spike-ins in each tube.
	- (c) Switch to the "Interrogate.EC-NGS clonality assessment" user mode for the assays described in [7]. These assays pool the primer set tubes after PCR but before sequencing; therefore, ARResT/Interrogate needs to computationally separate them before calculating abundances. There are currently no spike-ins included. This user mode also enables a bespoke panel, "reporting," in which ARResT/Interrogate separates the different primer sets from the pooled data sample, creating one view for each. See the VJ:Vh-(Dh)-Jh and DJ:Dh-Jh views in Fig. 7 (the latter is shown partially and with a faint red background because of its low number of reads, 121, which is itself displayed on a dark red background).

Fig. 7 Views from the bespoke "reporting" panel of the "Interrogate.EC-NGS clonality assessment" user mode, with two of the pooled primer sets separated, normalized, and presented to the user

### 4 Notes


Amplification by different primers annealing on the same template may result in slightly different sequences of, e.g., the same clonotype. Keep that in mind when looking at combinations of primers and clonotypes, or when retrieving sequences of such a clonotype.

Amplification by different primers annealing on the same template that results in the same sequence and length means that, to fully study primers, one needs to disable primer trimming so that the sequences remain separate; otherwise, only one primer is remembered per unique sequence, which might not represent the full picture. To do this, enable the "primer\_ext" pipeline option, or use the "ARResT.profile.primer\_ext" scenario as a template, or email contact@arrest.tools.



Fig. 8 View of the sample report with the "postmortem" section expanded—most abundant examples of sequences with and without junction are shown

bookended by the IGK-INTR-A-1 and IGK-J-A-1 EuroClonality-NGS primers, which actually do not make sense as a pair (IGK intron and IGK J). The second example has reverse reads; it had to go through a more sensitive workflow ("retried") and ended up with an unsafe IGHJ gene assignment ("unsafeJ"), and only had the 5′ IGHV primer on the sequence.


### References


receptors: molecular and computational evidence. Leukemia 24:125–132


interrogate: an interactive immunoprofiler for IG/TR NGS data. Bioinformatics 33:435–437


sequencing of immunoglobulin and T-cell receptor gene recombinations for MRD marker identification in acute lymphoblastic leukaemia; a EuroClonality-NGS validation study. Leukemia 33:2241–2253

7. Scheijen B, Meijers RWJ, Rijntjes J, van der Klift MY, Möbs M, Steinhilber J et al (2019) Next-generation sequencing of immunoglobulin gene rearrangements for clonality assessment: a technical feasibility study by EuroClonality-NGS. Leukemia 33:2227–2240

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# Purpose-Built Immunoinformatics for BcR IG/TR Repertoire Data Analysis

### Chrysi Galigalidou, Laura Zaragoza-Infante, Anastasia Chatzidimitriou, Kostas Stamatopoulos, Fotis Psomopoulos, and Andreas Agathangelidis

### Abstract

The study of antigen receptor gene repertoires using next-generation sequencing (NGS) technologies has disclosed an unprecedented depth of complexity, requiring novel computational and analytical solutions. Several bioinformatics workflows have been developed to this end, including the T-cell receptor/immunoglobulin profiler (TRIP), a web application implemented in R Shiny and specifically designed for comprehensive repertoire analysis, which is the focus of this chapter. TRIP performs robust immunoprofiling analysis by extracting and processing the IMGT/HighV-QUEST output through a series of functions, with a multilevel process of data filtering ensuring that only high-quality, biologically relevant data are analyzed. Subsequently, it provides in-depth analysis of antigen receptor gene rearrangements, including (a) clonality assessment; (b) extraction of variable (V), diversity (D), and joining (J) gene repertoires; (c) CDR3 characterization at both the nucleotide and amino acid level; and (d) somatic hypermutation analysis, in the case of immunoglobulin gene rearrangements. Notably, TRIP enables a high level of customization through the integration of various options in key aspects of the analysis, such as clonotype definition and computation, hence allowing for flexibility without compromising accuracy.

Key words Antigen receptor, B-cell receptor, Immunoglobulin, T-cell receptor, Immunoinformatics, Clonality, Immune repertoire, Somatic hypermutation

### 1 Introduction

Profiling the B-cell receptor immunoglobulin (BcR IG) and T-cell receptor (TR) gene repertoires using next-generation sequencing (NGS) technologies has advanced our understanding of various clinical conditions and biological processes, ranging from infections and vaccination to autoimmunity and malignancy. NGS immunogenetics has applications in both diagnostics (e.g., assessment of clonality in samples investigated for a possible lymphoproliferation or detection of minimal residual disease in patients with lymphoid malignancies) and research [1, 2]. To date, several pipelines that perform both BcR IG/TR sequence annotation and meta-data analysis have been made publicly available [3–6]; notable examples include the "IMGT/StatClonotype" tool [7, 8], the MiXCR software [9], the Vidjil platform [10], and the ARResT/Interrogate application [11], among others.

Chrysi Galigalidou and Laura Zaragoza-Infante are equal first authors.

Fotis Psomopoulos and Andreas Agathangelidis are equal senior authors.

Anton W. Langerak (ed.), Immunogenetics: Methods and Protocols, Methods in Molecular Biology, vol. 2453, https://doi.org/10.1007/978-1-0716-2115-8\_27, © The Author(s) 2022

Our contribution in this field concerns the T-cell receptor/immunoglobulin profiler (TRIP) software [12], which was designed in order to enable the comprehensive characterization of BcR IG and TR gene repertoires based on an integrated, robust, and user-friendly interface. TRIP has been utilized in projects on hematological malignancies, such as chronic lymphocytic leukemia (CLL) and multiple myeloma (MM) [13–16], as well as other contexts, e.g., infections [17, 18], providing valuable insight into the selection forces that shape the architecture of the respective immune repertoires.

This chapter focuses on the features of TRIP, highlighting how the functionalities offered by this software address the challenges of repertoire analysis in both diagnostic and, particularly, research settings.

### 2 Data Processing


Each read in a FASTQ file is described by four lines:

1. The first line begins with the "@" symbol followed by a read identifier, which is assigned during the sequencing process.
2. The second line contains the nucleotide sequence of the read.
3. The third line contains a "+" symbol, used as a line separator.
4. The fourth line encodes the quality of each base of the sequence as a Phred quality score, one ASCII character per base. The numerical value of each quality score can be retrieved from ASCII charts.
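The four-line layout described above can be illustrated with a minimal Python sketch. It assumes the Phred+33 encoding (the offset of 33 is an assumption that holds for Sanger and recent Illumina output but should be checked for the platform at hand); the function name `parse_fastq` and the example read are hypothetical.

```python
from typing import Iterator, List, Tuple

# Assumption: Phred+33 encoding (Sanger / Illumina 1.8 and later).
PHRED_OFFSET = 33


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, separator, quality = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Each ASCII character of the fourth line encodes one Phred score.
        scores = [ord(ch) - PHRED_OFFSET for ch in quality]
        yield header[1:], seq, scores


# Hypothetical single-read example.
example = ["@read_1\n", "ACGT\n", "+\n", "II5#\n"]
records = list(parse_fastq(example))
```

Here the quality string "II5#" decodes to the scores 40, 40, 20, and 2, which is the lookup that an ASCII chart would otherwise provide.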

Additional information about FASTQ files is provided at the following link: https://www.ebi.ac.uk/ega/submission/sequence#fastq\_format.

In the case of sequencing on an Illumina platform, the bcl2fastq2 software is the one most commonly used for demultiplexing sequencing data and for masking the adaptor sequences and/or UMIs (unique molecular identifiers), if present. Moreover, bcl2fastq2 converts base call (BCL) files, the default format of raw data obtained from Illumina sequencers, into FASTQ files. Some sequencing platforms, such as the MiniSeq or MiSeq, provide the option to automatically convert BCL files to the FASTQ format.

In case another sequencing platform is used, it is necessary to follow the instructions specified for each scenario, check the data format of the sequencer server output files, and transform them to FASTQ.

### 2.2 Filtering of the Raw Data

As a first step, quality filters should be applied to all reads in the FASTQ file(s) in order to ensure that only high-quality data will be subjected to further analysis. A set of filtering parameters can be selected according to the type of data and the design of the experiment. The reads that do not fulfill all the requirements will be filtered out. The most common parameters are related to the read length, the quality score of each individual nucleotide, and the overall quality score of each read.

The level of strictness of the parameters is chosen according to the overall quality of the NGS run and the minimum quality threshold that would allow the extraction of biologically meaningful results depending on the project design.

Indicative examples of parameters for the analysis of BcR IG/TR data include: minimum length of the raw reads, 150 nucleotides; quality threshold for each nucleotide, 14; minimum accepted mean sequence quality for each read, 20; maximum percentage of low-quality nucleotides, 0.2 (20%); and maximum percentage of accepted unidentified nucleotides, 0.01 (1%).
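As an illustration of how such read-level filters combine, the following Python sketch applies the indicative thresholds listed above to a single read. It is not the filtering tool used by the authors; the class and field names are hypothetical, and the share of unidentified nucleotides (N) is treated here as an upper bound, which is an interpretation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadFilter:
    """Indicative thresholds from the text; all field names are hypothetical."""
    min_length: int = 150
    base_quality_threshold: int = 14      # below this a base counts as low quality
    min_mean_quality: float = 20.0
    max_low_quality_fraction: float = 0.2
    max_unidentified_fraction: float = 0.01

    def passes(self, seq: str, scores: List[int]) -> bool:
        n = len(seq)
        if n == 0 or n < self.min_length or n != len(scores):
            return False
        if sum(scores) / n < self.min_mean_quality:
            return False
        low_quality = sum(1 for q in scores if q < self.base_quality_threshold)
        if low_quality / n > self.max_low_quality_fraction:
            return False
        # Unidentified nucleotides are assumed to be reported as "N".
        return seq.upper().count("N") / n <= self.max_unidentified_fraction


read_filter = ReadFilter()
good = read_filter.passes("ACGT" * 40, [30] * 160)
too_short = read_filter.passes("ACGT" * 25, [30] * 100)
low_quality = read_filter.passes("ACGT" * 40, [10] * 160)
too_many_n = read_filter.passes("NN" + "A" * 158, [30] * 160)
```

A read is kept only when every requirement is met, so stricter or more lenient settings can be explored by changing the dataclass defaults.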

### 2.3 Synthesis of Paired-End Reads

Given the extreme intrinsic variability of BcR IG/TR rearrangement sequences, paired-end sequencing protocols are usually applied. In this scenario, two individual reads, namely R1 and R2, are obtained from each sequence, ensuring the high quality of the sequences and the accuracy of the immunogenetic annotation.


The final synthesized reads that have successfully passed all filters from each sample are deposited in a FASTA file. This file consists of two lines of information per sequence: the first line begins with a ">" symbol followed by the read identifier, and the second line contains the nucleotide sequence.

### 3 Sequence Annotation with IMGT/HighV-QUEST

IMGT (the international ImMunoGeneTics information system) is the worldwide reference in immunogenetics and immunoinformatics [19]. IMGT has incorporated the most extensive and updated reference datasets for human BcR IG/TR genes. IMGT/HighV-QUEST is the web portal for BcR IG/TR data analysis from NGS high-throughput and deep sequencing [20].

On the IMGT/HighV-QUEST home page (http://www.imgt.org/HighV-QUEST/home.action), the user can customize the analysis through a series of options, including a job title, the species, the antigen receptor type (BcR IG or TR), and the specific locus (for instance, BcR IGH or IGL). The data has to be uploaded in FASTA format, and the submission limit is 1,000,000 sequences. Once the analysis is finished, the results can be downloaded from the "Analysis history" tab.

The output for each sample is a folder of text (.txt) files: ten files that each contain a different type of immunogenetic information, together with a parameters file and a README file. More specifically, the output files are the following: "1\_Summary.txt", containing a summary table of basic immunogenetic information, such as the rearranged V(D)J genes, the % of identity with the germline, the presence of indels, etc.; "2\_IMGT-gapped-nt-sequences.txt"; "3\_Nt-sequences.txt"; "4\_IMGT-gapped-AA-sequences.txt"; "5\_AA-sequences.txt"; "6\_Junction.txt"; "7\_V-REGION-mutation-and-AA-change-table.txt"; "8\_V-REGION-nt-mutation-statistics.txt"; "9\_V-REGION-AA-change-statistics.txt"; "10\_V-REGION-mutation-hotspots.txt"; "11\_Parameters", with the set of parameters applied in the analysis; and "README.txt", with technical information about the analysis.

### 4 IMGT/HighV-QUEST Meta-Data Analysis with TRIP

The T-cell receptor/immunoglobulin profiler (TRIP) tool [12] is a web application that provides an in-depth meta-data analysis based on the processing of the IMGT/HighV-QUEST output files, through a number of interoperable modules. The TRIP tool can be downloaded from the following link: https://bio.tools/TRIP\_-\_T-cell\_Receptor\_Immunoglobulin\_Profiler.

1. Since IMGT/HighV-QUEST has a submission threshold of 1,000,000 sequences, a sample that contains a larger number of sequences must be split into batches before being analyzed with IMGT/HighV-QUEST. Thus, multiple output folders will be generated by the tool for the same sample. In this case, the folders should be named using the same identifier with a different suffix, following a numerical order starting from 0, i.e., "-0", "-1", "-2", etc. With this approach, TRIP can trace the origin of these files to the same sample and will combine the respective data for the analysis.
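The batching and naming convention can be sketched as follows. This is a hypothetical helper, not part of TRIP or IMGT/HighV-QUEST: `split_records` cuts a sequence collection into batches that respect the 1,000,000-sequence limit, and `batch_names` generates folder names with the "-0", "-1", "-2" suffixes.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Submission limit of IMGT/HighV-QUEST as stated in the text.
MAX_BATCH = 1_000_000


def split_records(records: Iterable[T], batch_size: int = MAX_BATCH) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `batch_size` records."""
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def batch_names(sample_id: str, n_sequences: int, batch_size: int = MAX_BATCH) -> List[str]:
    """Return folder names such as 'S1-0', 'S1-1', ... for one sample."""
    n_batches = max(1, -(-n_sequences // batch_size))  # ceiling division
    return [f"{sample_id}-{i}" for i in range(n_batches)]


# Small-scale demonstration with a batch size of 2 instead of 1,000,000.
demo_batches = list(split_records(range(5), batch_size=2))
demo_names = batch_names("S1", 2_500_000)
```

In practice each batch would be written to its own FASTA file, submitted separately, and the downloaded result folders renamed with the generated names.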


After loading the files (option "Load Data"), TRIP scans the data and gives a notification in the case of data headers with a different or an unknown value. In that case, data headers should be replaced with the appropriate ones.


A summary of all aforementioned analytical steps is depicted in Fig. 1.

### 5 High-Throughput Data Analysis with TRIP


Fig. 1 A summary of all major steps in the analytical workflow starting from the NGS BcR IG/TR raw data up to the extraction of biologically meaningful results

104 (second-CYS 104) and a tryptophan or a phenylalanine (for BcR IG and TR sequences, respectively) at IMGT position 118 (i.e., J-TRP or J-PHE 118). If necessary, it is possible to add more than one landmark in the analysis by separating them with the "|" symbol.

Filters are applied consecutively and, as soon as one of the criteria is not passed, the sequence is filtered out; it is important to keep in mind that only the first non-passed criterion is reported. Only sequences that were of high quality according to all aforementioned standards will be further analyzed.
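The consecutive application of filters, with only the first failed criterion being reported, can be sketched in Python as below. The two criteria shown are the CDR3 landmarks mentioned above; treating "|" as a separator between alternative landmark residues is an assumption about how such a field could be interpreted, and all function names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Criterion = Tuple[str, Callable[[str], bool]]


def landmark_check(landmarks: str, at_start: bool) -> Callable[[str], bool]:
    """Build a test for one or more landmark residues separated by '|'."""
    options = tuple(landmarks.split("|"))

    def check(junction: str) -> bool:
        return junction.startswith(options) if at_start else junction.endswith(options)

    return check


def first_failed(junction: str, criteria: List[Criterion]) -> Optional[str]:
    """Apply criteria in order; return the name of the first one that fails."""
    for name, test in criteria:
        if not test(junction):
            return name  # later criteria are not evaluated
    return None


criteria: List[Criterion] = [
    ("start landmark (C at position 104)", landmark_check("C", at_start=True)),
    ("end landmark (W or F at position 118)", landmark_check("W|F", at_start=False)),
]

passed = first_failed("CARDYW", criteria)
fails_both = first_failed("AARDYG", criteria)
fails_end = first_failed("CARDYG", criteria)
```

The second example fails both landmarks, yet only the start landmark is reported, which mirrors the behavior described above.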

The results of the preselection process are summarized in four different tables and can be found at the "Pre-selection" tab:


The last data column in the "Clean out" table indicates the criterion that was not passed for each individual sequence. All tables can be downloaded in text (.txt) format.

### 5.2 Selection (Filtering)

Sequences meeting all the pre-selection criteria are further filtered during the Selection step. The range of the V-region identity % should be selected. Sequences with a V-region identity % that does not fall into the selected range are excluded from the analysis. The selected % of identity depends largely on:

1. The type of antigen receptor gene sequence data, e.g., the somatic hypermutation (SHM) mechanism is operational exclusively in B cells.
2. The expected error rate induced by the amplification protocol or the sequencing process.

In more detail, in the case of BcR IG sequences, a typical range of the V-region identity would be 85–100%, whereas in TR, the range would be narrower (95–100%).

The rest of the available filters enable the selection of sequences with specific immunogenetic features, namely, V, D, and J genes, CDR3 length, and the presence of particular CDR3 amino acid sequence motifs. These filters allow for a high level of customization of the analytical procedure.

Again, four output files are produced, which are located at the "Selection" tab:

1. A summary table with all filtered-in and -out sequences for each individual parameter ("Summary").
2. The entire set of sequences that passed all the pre-selection criteria ("All Data table").
3. All sequences that passed through the Selection filters ("Filter in table").
4. The sequences that did not meet the selection criteria and were, thus, excluded from further analysis ("Filter out table").

The last column of the "Filter out" table indicates the criterion that was relevant for the exclusion of each individual sequence. These tables can be downloaded in text (.txt) format.

The Pre-selection and Selection steps were developed in order to ensure that only relevant, high-quality BcR IG/TR sequence data will be included in the downstream Analytical Pipeline of TRIP.
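A minimal sketch of the V-region identity filter, using the typical ranges quoted in this section, is given below. The record layout (a dictionary with a `v_identity` key) and the function name are hypothetical and do not reflect the internal data model of TRIP.

```python
from typing import Dict, List, Tuple

# Typical V-region identity ranges (%) quoted in the text.
IDENTITY_RANGES: Dict[str, Tuple[float, float]] = {
    "IG": (85.0, 100.0),
    "TR": (95.0, 100.0),
}

Row = Dict[str, object]


def select_by_identity(rows: List[Row], receptor: str) -> Tuple[List[Row], List[Row]]:
    """Split rows into (filter-in, filter-out) by V-region identity %."""
    low, high = IDENTITY_RANGES[receptor]
    filter_in: List[Row] = []
    filter_out: List[Row] = []
    for row in rows:
        identity = float(row["v_identity"])
        (filter_in if low <= identity <= high else filter_out).append(row)
    return filter_in, filter_out


# Hypothetical sequences with their V-region identity %.
rows: List[Row] = [
    {"id": "seq1", "v_identity": 100.0},
    {"id": "seq2", "v_identity": 91.3},
    {"id": "seq3", "v_identity": 80.2},
]
ig_in, ig_out = select_by_identity(rows, "IG")
tr_in, tr_out = select_by_identity(rows, "TR")
```

The same sequence set yields a smaller filter-in table under the narrower TR range, which is the practical consequence of the absence of SHM in T cells.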
### 5.3 TRIP Analytical Pipeline

Once the NGS data has been curated and filtered, it is subjected to the TRIP Analytical Pipeline (located at the "Pipeline" tab). The workflow of the analysis can be customized according to the biological context of the project.

### 5.3.1 Clonotype Computation

The first step of the pipeline refers to the clonotype computation. It concerns the grouping of the analyzed sequences in clonotypes, based on a set of shared immunogenetic properties.

The clustering process depends on the definition of the clonotype. TRIP provides ten different options for defining the clonotype, in order to facilitate the selection of the most relevant immunogenetic properties. If, for example, "IGHV gene and CDR3 aa sequence" is chosen as definition, all the reads expressing the same IGHV gene and identical CDR3 at the aa sequence level will be grouped together into a single clonotype.
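The grouping logic for one clonotype definition ("V gene and CDR3 aa sequence") can be sketched as follows. This is an illustration in Python rather than the R code of TRIP; the read layout and the output field names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def compute_clonotypes(reads: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group reads sharing the same V gene and CDR3 aa sequence."""
    counts = Counter((read["v_gene"], read["cdr3_aa"]) for read in reads)
    total = sum(counts.values())
    clonotypes = []
    # most_common() orders clonotypes from the most to the least abundant.
    for cluster_id, ((v_gene, cdr3_aa), n_reads) in enumerate(counts.most_common(), start=1):
        clonotypes.append({
            "cluster_id": cluster_id,
            "v_gene": v_gene,
            "cdr3_aa": cdr3_aa,
            "n_reads": n_reads,
            "freq": 100 * n_reads / total,
        })
    return clonotypes


# Hypothetical reads: three belong to one clonotype, one to another.
reads = [
    {"v_gene": "IGHV1-8", "cdr3_aa": "CARDYW"},
    {"v_gene": "IGHV1-8", "cdr3_aa": "CARDYW"},
    {"v_gene": "IGHV3-23", "cdr3_aa": "CAKGFW"},
    {"v_gene": "IGHV1-8", "cdr3_aa": "CARDYW"},
]
clonotypes = compute_clonotypes(reads)
```

Choosing a different definition amounts to changing the tuple used as the grouping key, for example adding the J gene or using the CDR3 nucleotide sequence.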

There is also the option "Load clonotypes," which allows precomputed clonotypes from previously analyzed datasets to be uploaded directly.

The output is located at the tab "Clonotypes" and can be downloaded in text (.txt) format. The output contains a series of information regarding each individual clonotype:


Each clonotype is also a link leading to a table with the immunogenetic information of all the assigned reads. At this step, each clonotype is given a unique cluster identifier (cluster ID).

Clonotype computation can provide important biological information mostly in regard to the BcR IG/TR clonality levels in a given setting. Some examples of different approaches supported by TRIP are the following:


An example of clonality assessment using the top 10 clonotypes is illustrated in Fig. 2a.
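As a simple numerical illustration of this type of clonality assessment, the sketch below computes the frequency of the dominant clonotype and the cumulative frequency of the top N clonotypes from a list of clonotype frequencies. The function names and the example frequencies are hypothetical and are not taken from the samples shown in Fig. 2.

```python
from typing import List


def dominant_frequency(freqs: List[float]) -> float:
    """Frequency (%) of the most abundant clonotype."""
    return max(freqs) if freqs else 0.0


def top_n_cumulative(freqs: List[float], n: int = 10) -> float:
    """Cumulative frequency (%) of the n most abundant clonotypes."""
    return sum(sorted(freqs, reverse=True)[:n])


# Hypothetical clonotype frequencies (%) for one sample.
freqs = [5, 50, 30, 10, 5]
dominant = dominant_frequency(freqs)
top_two = top_n_cumulative(freqs, n=2)
```

A single dominant value close to 100% points to a monoclonal profile, whereas a low cumulative value for the top clonotypes is compatible with a polyclonal repertoire.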

### 5.3.2 Computation of Highly Similar Clonotypes

Following the previous approach on clonotype definition, namely "V gene and CDR3 aa sequence," two or more clonotypes would be considered highly similar if they display the same CDR3 amino acid length and a low number of amino acid mismatches. TRIP allows for the grouping of highly similar clonotypes obtained at the "Clonotypes computation" step (Subheading 5.3.1). The number of allowed CDR3 aa mismatches can be either chosen manually for each individual CDR3 length or set through the application of a percentage (%) threshold.

Fig. 2 Clonality assessment through the analysis of the top 100 clonotypes for five samples, using either the "Clonotype computation" (a) or the "Highly similar clonotypes computation" option (b). The first three samples (namely Samples A, B, and C) display a monoclonal profile, characterized by predominance of a single clonotype with a frequency of 95%, 87.6%, and 90.9%, respectively. Sample D is oligoclonal, with multiple clonotypes exhibiting high frequency; the dominant clonotype accounts for 34.4% of the repertoire, whereas the cumulative frequency of the top 10 clonotypes is 86.4%. Finally, Sample E is polyclonal with the top 10 clonotypes accounting for a very small fraction of the repertoire (2.9%). The option of merging together the "Highly similar clonotypes" resulted in an increase in the cumulative frequency of the top 10 clonotypes in all samples (range 0.4–2.8%), indicating the presence of minor clonotypes exhibiting strong immunogenetic relations with the top 10 clonotypes

One of the most typical approaches is based on the CDR3 length and allows for a low number of aa mismatches, thus ensuring a strong connection between highly similar clonotypes:


This process is implemented by considering the most frequent clonotype for each given CDR3 length as the reference for all the remaining clonotypes with the same CDR3 length. After merging the highly similar clonotypes, their relative frequencies are calculated accordingly.
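The reference-based merging can be sketched in Python as below. The sketch counts mismatches position by position (Hamming distance) and, as an assumption not stated in the text, lets the next most frequent unmerged clonotype of a given length act as a new reference once the first reference has absorbed its neighbors. The function names and the per-length mismatch table are hypothetical.

```python
from typing import Dict, List, Tuple


def mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length CDR3s."""
    return sum(x != y for x, y in zip(a, b))


def merge_similar(counts: Dict[str, int], max_mismatches: Dict[int, int]) -> Dict[str, int]:
    """Merge clonotypes of equal CDR3 length into their most frequent reference."""
    by_length: Dict[int, List[Tuple[str, int]]] = {}
    for cdr3, n_reads in counts.items():
        by_length.setdefault(len(cdr3), []).append((cdr3, n_reads))

    merged: Dict[str, int] = {}
    for length, group in by_length.items():
        allowed = max_mismatches.get(length, 0)
        # Most frequent first; ties broken alphabetically for reproducibility.
        remaining = sorted(group, key=lambda item: (-item[1], item[0]))
        while remaining:
            reference, ref_reads = remaining.pop(0)
            unmerged = []
            for cdr3, n_reads in remaining:
                if mismatches(reference, cdr3) <= allowed:
                    ref_reads += n_reads  # absorbed by the reference
                else:
                    unmerged.append((cdr3, n_reads))
            merged[reference] = ref_reads
            remaining = unmerged
    return merged


# Hypothetical read counts per CDR3; one mismatch allowed for length 6.
counts = {"CARDYW": 90, "CARDFW": 5, "CAKEFW": 3, "CARW": 2}
merged = merge_similar(counts, max_mismatches={6: 1})
```

Relative frequencies after merging follow directly from the merged read counts divided by the total number of reads.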

Another parameter given by TRIP for the computation of highly similar clonotypes concerns the rearranged V gene. Applying this parameter brings the whole variable domain of the BcR IG/TR into the clonotype grouping process; whether to apply it depends on the context of the given project.

The output files from this process are given as text (.txt) files and contain the following information:


The output files from this step can be found under the tab entitled "Highly Similar Clonotypes."

The effect of the grouping of highly similar clonotypes on clonality assessment is given in Fig. 2b.

### 5.3.3 Repertoire Extraction

The next step of the analysis enables the extraction of the V, D, and J repertoires at either the gene or the gene allele level. The V, D, or J gene repertoires are extracted from the output file of the previous step (Subheading 5.3.2) that includes all the clonotypes of the dataset (Fig. 3). Here, it is important to keep in mind that the relative frequency of each V, D, or J gene is calculated at the clonotype level rather than at the sequence level. The output of this process is provided as a text (.txt) file and contains information on the gene names, and the absolute number and relative frequency of clonotypes utilizing each specific V, D, and J gene.

At the end of this section, TRIP allows the user to choose whether the repertoire extraction will be based on the computation before or after the grouping of highly similar clonotypes. The output of this part of the pipeline can be found under the tab "Repertoires."
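The distinction between clonotype-level and sequence-level frequencies can be made concrete with a short Python sketch. Each clonotype contributes exactly one count to its gene, regardless of how many reads it contains. The field names are hypothetical, and stripping the text after "*" to move from allele to gene level is an assumption based on the usual IMGT allele nomenclature (e.g., IGHV1-8*01).

```python
from collections import Counter
from typing import Dict, List, Tuple


def gene_of(allele: str) -> str:
    """Reduce an allele name such as 'IGHV1-8*01' to its gene name."""
    return allele.split("*")[0]


def gene_repertoire(
    clonotypes: List[Dict[str, object]], key: str = "v_allele", allele_level: bool = False
) -> Dict[str, Tuple[int, float]]:
    """Return {gene: (number of clonotypes, relative frequency %)}."""
    names = [str(c[key]) for c in clonotypes]
    if not allele_level:
        names = [gene_of(name) for name in names]
    counts = Counter(names)  # one count per clonotype, reads are ignored
    total = sum(counts.values())
    return {gene: (n, 100 * n / total) for gene, n in counts.items()}


# Hypothetical clonotypes; the dominant one holds most of the reads.
clonotypes = [
    {"v_allele": "IGHV1-8*01", "n_reads": 900},
    {"v_allele": "IGHV3-23*01", "n_reads": 50},
    {"v_allele": "IGHV3-23*04", "n_reads": 49},
    {"v_allele": "IGHV4-34*01", "n_reads": 1},
]
repertoire = gene_repertoire(clonotypes)
```

In this example IGHV3-23 is the most frequent gene at the clonotype level (two of four clonotypes), although IGHV1-8 dominates at the read level.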

### 5.3.4 CDR3 Length Distribution

The distribution of the CDR3 length is calculated based on the number of clonotypes corresponding to each individual length. In case the user would like to perform the analysis after the grouping of highly similar clonotypes, the results will be modified accordingly. The output is provided in the form of a table and a graph and can be found under the tab "Visualization." Characteristic examples of CDR3 length distribution are given in Fig. 4.

Fig. 3 The TRIP output for "Repertoire extraction." IGHV (a), IGHD (b), and IGHJ (c) gene repertoires at the clonotype level for Sample A. Strong biases were identified in all cases, characterized by predominance of the IGHV1–8, IGHD6–13, and IGHJ4 genes

Fig. 4 The distribution of the CDR3 length in samples A–E. The x axis refers to the CDR3 length, and the y axis to the respective number of clonotypes. Samples A–C exhibited strong restrictions, in line with their monoclonal profiles. In contrast, restrictions were less prevalent in Sample D, in line with its oligoclonal clonotype repertoire. Finally, in the case of sample E, an almost Gaussian distribution of the CDR3 length is evident

### 5.3.5 pI Distribution

Next, the isoelectric point (pI; also denoted pH(I) or IEP) of the CDR3 of each clonotype is extracted from the corresponding IMGT/HighV-QUEST output file. The pI is the pH at which the respective CDR3 carries no net electrical charge, i.e., is electrically neutral. The pI of a given CDR3 is largely dependent on its amino acid composition. TRIP provides the distribution of the pI in a given dataset, based on the selection of either all or the merged clonotypes from the previous steps. A graph referring to the pI distribution can be found at the "Visualization" tab (Fig. 5).

### 5.3.6 Multiple Value Comparison

Different pairs of immunogenetic variables can be selected at this part of the pipeline. TRIP uses the output file from the computation of either all clonotypes or the merged clonotypes and performs comparisons between any given set of variables. The output file contains the values for each of the two selected variables and the number and relative frequency of clonotypes for each possible combination of values.

Eleven different variables that can be selected at this step include:


Figure 6 illustrates two examples of comparisons when using the V-gene and J-gene variables. The output files for the selected comparisons can be found at the "Multiple value comparison" tab.
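The type of pairwise comparison described in this section amounts to a cross-tabulation of two variables over the clonotypes. A minimal Python sketch, with hypothetical field names and example clonotypes, is given below.

```python
from collections import Counter
from typing import Dict, List, Tuple


def compare_values(
    clonotypes: List[Dict[str, str]], var_a: str, var_b: str
) -> List[Tuple[str, str, int, float]]:
    """Return (value_a, value_b, n_clonotypes, frequency %) per observed combination."""
    counts = Counter((c[var_a], c[var_b]) for c in clonotypes)
    total = sum(counts.values())
    return [(a, b, n, 100 * n / total) for (a, b), n in sorted(counts.items())]


# Hypothetical clonotypes described by their V and J genes.
clonotypes = [
    {"v_gene": "IGHV1-8", "j_gene": "IGHJ4"},
    {"v_gene": "IGHV1-8", "j_gene": "IGHJ4"},
    {"v_gene": "IGHV1-8", "j_gene": "IGHJ6"},
    {"v_gene": "IGHV3-23", "j_gene": "IGHJ4"},
]
table = compare_values(clonotypes, "v_gene", "j_gene")
```

The resulting rows are the values that a heatmap such as the one in Fig. 6 would display, one cell per V gene and J gene combination.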

### 5.3.7 Computation of Shared Clonotypes

In this section, TRIP scans different samples for the presence of identical clonotypes. The output file is provided in text (.txt) format with each row corresponding to a unique clonotype and each column to a different sample. Results include the absolute number of reads and the relative frequency of each clonotype in each sample (Column A: Sample id 1\_Reads/Total, Column B: Sample id 1\_Freq, Column C: Sample id 2\_Reads/Total, Column D: Sample id 2\_Freq). This type of analysis is based on the selection of either all clonotypes or just the merged clonotypes.
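A possible construction of such a table is sketched below in Python. Every clonotype observed in any sample receives a row (an assumption; restricting the table to clonotypes present in at least two samples would be an equally plausible reading), and the column names follow the "Reads/Total" and "Freq" pattern quoted above. The function name and sample data are hypothetical.

```python
from typing import Dict, Tuple

Clonotype = Tuple[str, str]  # (V gene, CDR3 aa sequence)


def shared_clonotype_table(
    samples: Dict[str, Dict[Clonotype, int]]
) -> Dict[Clonotype, Dict[str, object]]:
    """Build one row per unique clonotype with two columns per sample."""
    totals = {name: sum(counts.values()) for name, counts in samples.items()}
    all_clonotypes = sorted(set().union(*[counts.keys() for counts in samples.values()]))
    table: Dict[Clonotype, Dict[str, object]] = {}
    for clonotype in all_clonotypes:
        row: Dict[str, object] = {}
        for name, counts in samples.items():
            n_reads = counts.get(clonotype, 0)
            row[f"{name}_Reads/Total"] = f"{n_reads}/{totals[name]}"
            row[f"{name}_Freq"] = 100 * n_reads / totals[name] if totals[name] else 0.0
        table[clonotype] = row
    return table


# Hypothetical read counts per clonotype in two samples.
samples = {
    "S1": {("IGHV1-8", "CARDYW"): 8, ("IGHV3-23", "CAKGFW"): 2},
    "S2": {("IGHV1-8", "CARDYW"): 1, ("IGHV4-34", "CARGYF"): 3},
}
table = shared_clonotype_table(samples)
shared_row = table[("IGHV1-8", "CARDYW")]
```

Clonotypes absent from a sample appear with zero reads, so shared clonotypes are those with non-zero values in more than one sample.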

Fig. 5 The pI distribution for samples A–E individually and all samples together, using a boxplot. Clonotypes from Samples A–D displayed a similar pI distribution, whereas the clonotypes from sample E exhibited lower pI values

Fig. 6 Comparisons between IGHV and IGHJ gene utilization in monoclonal (Sample A) (a) versus polyclonal cases (Sample E) (b) using heatmaps. (a) A strong association between the IGHV1–8 and IGHJ4 genes is evident in Sample A, corresponding to the dominant clonotype. (b) Several associations are evident in Sample E, reflecting the polyclonal profile of this sample

### 5.3.8 Repertoire Comparison

Similar to the comparison of clonotypes, TRIP allows the comparison of gene or gene allele repertoires (see Subheading 5.3.3) between two or more samples/datasets. The output consists of a table where each row represents a unique gene and each column a sample. Results include the absolute number and relative frequency of the clonotypes expressing each gene in every individual sample (Column A: Sample id 1\_N/Total, Column B: Sample id 1\_Freq, Column C: Sample id 2\_N/Total, Column D: Sample id 2\_Freq). Again, this type of analysis can be performed on either all clonotypes or just the merged clonotypes.

### 5.3.9 Clustering of CDR3 Sequences with Maximum Length Difference of One Amino Acid

As in the previous section concerning the merging of highly similar clonotypes (Subheading 5.3.2), at this point, TRIP allows for the merging of clonotypes whose CDR3s differ by one amino acid in length but are otherwise identical. In this case, TRIP adds one amino acid at a specified position of the shorter CDR3, resulting in the formation of two identical CDR3s. The output graph can be found at the "Visualization" tab.
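The matching rule can be illustrated with a short Python sketch: two CDR3s qualify when removing the residue at the specified position of the longer sequence yields the shorter one, which is equivalent to inserting that residue into the shorter CDR3. The function name is hypothetical, and the use of a 0-based position is an assumption of this sketch.

```python
def identical_after_insertion(shorter: str, longer: str, position: int) -> bool:
    """True if inserting longer[position] into `shorter` at `position` gives `longer`."""
    if len(longer) - len(shorter) != 1:
        return False
    if not 0 <= position < len(longer):
        return False
    # Dropping the residue at `position` from the longer CDR3 must give the shorter one.
    return longer[:position] + longer[position + 1:] == shorter


# Hypothetical CDR3 pair differing by one extra residue (G) at index 3.
match_at_3 = identical_after_insertion("CARDYW", "CARGDYW", position=3)
match_at_2 = identical_after_insertion("CARDYW", "CARGDYW", position=2)
```

Only the first call reports a match, since the extra residue sits at index 3 of the longer CDR3.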

### 5.3.10 Alignment

TRIP provides the option to align all clonotypes using the IMGT germline reference of the VDJ or VJ region at both the nucleotide and amino acid levels. An alignment table and a grouped alignment table based on the corresponding region are computed, and they are both available at the "Alignment" tab. Relevant gene alleles or a different reference sequence can be provided by the user.

### 5.3.11 Insert Identity Groups

At this point, TRIP enables the customization of the SHM analysis that can be applied at the next step (see Subheading 5.3.12). In detail, the user can specify the number of clonotype groups and the respective germline identity % thresholds that will be used for the SHM analysis. In certain clinical contexts, especially chronic lymphocytic leukemia (CLL), mutational categories defined by specific identity % thresholds are associated with distinct clinical courses, including responses to different treatments [21]. In that case, TRIP allows defining three distinct groups through the application of the 85–98% (see Subheading 5.2 on Selection for the application of the 85% cutoff), 98–100%, and 100% cutoffs. The first group corresponds to "IG-mutated CLL" (M-CLL), the second to "IG-unmutated CLL" (U-CLL), and the third to "truly IG-unmutated" CLL cases. In terms of clonotype selection, TRIP gives the user the option to perform this part of the analysis on either all or just the merged clonotypes. Figure 7 depicts the application of these identity % thresholds in a series of cases.
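The three identity groups described for CLL can be expressed as a small classification function. In the Python sketch below, the handling of the boundaries is an assumption: sequences with exactly 98% identity are placed in the unmutated group, and exactly 100% in the truly unmutated group. The function names and example values are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def identity_group(identity: float) -> str:
    """Assign a germline identity % to one of the three CLL groups."""
    if identity == 100.0:
        return "truly IG-unmutated"
    if 98.0 <= identity < 100.0:
        return "IG-unmutated (U-CLL)"
    if 85.0 <= identity < 98.0:
        return "IG-mutated (M-CLL)"
    return "outside selected range"


def group_frequencies(identities: List[float]) -> Dict[str, float]:
    """Relative frequency (%) of clonotypes in each identity group."""
    counts = Counter(identity_group(value) for value in identities)
    total = len(identities)
    return {group: 100 * n / total for group, n in counts.items()}


# Hypothetical germline identity % values, one per clonotype.
identities = [92.5, 97.9, 98.0, 99.6, 100.0]
frequencies = group_frequencies(identities)
```

Other projects may call for a different number of groups or different cutoffs, in which case the thresholds in `identity_group` would be adjusted accordingly.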

This part of the analysis ("Insert identity groups"), along with the next one ("Somatic hypermutation"), applies to BcR IG datasets only, since the SHM mechanism does not operate in cells other than B cells.

### 5.3.12 Somatic Hypermutation Analysis

For SHM analysis, TRIP uses as reference the alignment tables produced at the Alignment step. This type of analysis can be applied to the entire dataset or only to clonotypes exhibiting either high frequency or specific immunogenetic properties.

As an output, TRIP offers information on:

1. The type of nucleotide mutations and relevant amino acid changes.

Fig. 7 Relative frequency of the three IG mutational subgroups, in Samples A–E. The mutated subgroup (germline identity, GI 85–98%) was dominant in Samples A, C, and D. Truly unmutated clonotypes (GI 100%) accounted for the largest fraction of the repertoire in Sample B, indicating a different biological context. Finally, polyclonal Sample E was characterized by similar frequency levels for all mutational subgroups, perhaps due to the lack of strong selection mechanisms


### 5.3.14 Visualization

The tab "Visualization" on the TRIP interface includes all different graph types that were produced during the course of the analytical pipeline. The first graph is a bar plot of either all clonotypes or the merged clonotypes, with the option of a frequency threshold. The next graphs are pie charts of the selected V, D, and J gene repertoires. The option of applying a frequency threshold is also given here. The visualization of convergent evolution is next, with different options including a 3-D plot. Next on this tab is a pie chart and corresponding table concerning the selected identity groups for the SHM analysis along with the absolute number and frequency of clonotypes assigned to each group. The last graphs of this section are a candlestick chart for the depiction of the pI distribution and the line graph for the CDR3 distribution, below which the corresponding table is presented.

1. It is necessary to select the option "Clonotype computation" in order to apply the following types of analysis:
   - (a) "Highly similar Clonotypes computation."
   - (b) "Repertoires Extraction." In the case that the "Highly Similar Clonotypes Computation" has been selected, the repertoires will be extracted for both the total clonotypes and the merged clonotypes.
   - (c) "Alignment" using the option "Select top N clonotypes."
   - (d) "Mutations" using the options "Select top N clonotypes" or "Select clonotypes separately."
   - (e) "Logo" using the "Select top N clonotypes" option.
2. The "Somatic hypermutation status" is applied using the groups that have been selected using the "Insert identity groups" option.
3. If both "Alignment" and "Clonotypes computation" have been selected, the cluster ID in the alignment table is the same as the one in the Clonotype table. Otherwise, all elements in the "cluster\_ID" column of the alignment table will be set to 0.
4. To apply "Mutations," "Alignment" should have run previously, using the "AA or Nt" option. The Mutation table is computed based on the grouped alignment table.

### Acknowledgments

This work was supported in part by the Framework of the Hellenic Republic: Siemens Settlement Agreement, through the Hellenic Precision Medicine Network on Oncology project; the ERA-NET on Translational Cancer Research (TRANSCAN-2) acronym NOVEL project code (MIS) 5041673; and the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under grant agreement No 336 (Project CLLon); the project ODYSSEAS (Intelligent and Automated Systems for enabling the Design, Simulation, and Development of Integrated Processes and Products) implemented under the "Action for the Strategic Development on the Research and Technological Sector," funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014–2020) and co-financed by Greece and the European Union, with grant agreement no: MIS 5002462; and the EuroClonality-NGS working group.

Disclosures: Kostas Stamatopoulos has received honoraria and research support from Abbvie, Janssen, Astra-Zeneca, and Gilead.

### References


REBUILD study). Blood 134(Supplement\_1):3167. https://doi.org/10.1182/blood-2019-124655


