
# **Multimodal Panoptic Segmentation of 3D Point Clouds**

by Fabian Dürr

Karlsruher Schriften zur Anthropomatik

Editor: Prof. Dr.-Ing. habil. Jürgen Beyerer

An overview of all volumes previously published in this series can be found at the end of the book.

Band 63

Karlsruher Institut für Technologie, Institut für Anthropomatik und Robotik

Multimodal Panoptic Segmentation of 3D Point Clouds

Dissertation accepted by the KIT Department of Informatics of the Karlsruhe Institute of Technology (KIT) for the academic degree of Doktor der Ingenieurwissenschaften

by Fabian Dürr, M.Sc.

Date of the oral examination: 22 June 2023
First reviewer: Prof. Dr.-Ing. Jürgen Beyerer
Second reviewer: Prof. Dr.-Ing. J. Marius Zöllner

**Imprint**

Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe

KIT Scientific Publishing is a registered trademark of Karlsruhe Institute of Technology. Reprint using the book cover is not allowed.

www.ksp.kit.edu

*This document – excluding parts marked otherwise, the cover, pictures and graphs – is licensed under a Creative Commons Attribution-Share Alike 4.0 International License (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/deed.en*

*The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en*

Print on Demand 2023 – Printed on FSC-certified paper

ISSN 1863-6489
ISBN 978-3-7315-1314-8
DOI 10.5445/KSP/1000161158

# **Abstract**

The human driver is one of the main reasons for traffic accidents. Therefore, autonomous driving or Advanced Driver Assistance Systems (ADAS) have the potential to reduce the number of human-related traffic accidents drastically by supporting or replacing the human driver. Driven by the potential impact of ADAS and autonomous driving, the understanding and interpretation of a vehicle's 3D environment has become increasingly important. Knowledge about drivable space, the position and motion of other traffic participants, such as cars or pedestrians, and the static environment is of utmost importance. Therefore, autonomous vehicles are usually equipped with a variety of sensors, such as camera, lidar, radar, and ultrasonic sensors. Lidar sensors and their recorded point clouds are particularly interesting for the challenge of 3D scene understanding since they provide accurate 3D information about the current environment. An essential task in this context is panoptic segmentation, which enhances every 3D point with semantic and instance information. However, the unstructured and sparse nature of 3D point clouds requires novel approaches and algorithms to achieve high quality and robust results. Established technologies from the 2D image domain, such as Convolutional Neural Networks, cannot be applied directly. Hence, the first key challenge is the representation of point clouds to effectively leverage the power of approaches based on deep learning. Alongside the chosen representation, the sequential nature of the recorded sensor data over time offers great potential as an additional modality to improve the 3D panoptic segmentation. Nevertheless, it is challenging to propagate information through time from a 3D point cloud to its successor since spatial correspondences must be found. Finally, different sensor modalities with distinct measurement principles offer further potential for enhancement. However, sensor fusion requires an efficient combination of diverse information from distinct sensor spaces. An even greater and still unsolved challenge is a joint solution for these aspects and dense 3D tasks, such as semantic or panoptic segmentation.

The objective of this thesis is the design of a multimodal approach based on deep learning for 3D panoptic segmentation. It builds upon and combines three key aspects: a multi view point cloud architecture, temporal feature fusion, and deep sensor fusion. The multi view architecture exploits multiple point cloud representations, also called *views*, to combine their strengths and compensate for individual weaknesses. The complementary 2D range view and bird's eye view are utilized for efficient context aggregation to support a high resolution point view. The point view combines multi view context while preserving fine details. Afterwards, a recurrent temporal fusion approach is introduced to exploit temporal dependencies by aggregating and propagating feature maps through time. It builds upon a temporal memory in range or bird's eye view containing the aggregated past information, which is updated in every step with the current information. A temporal alignment step compensates for the ego motion and ensures spatial consistency between frames. Furthermore, a novel deep sensor fusion approach combines lidar and camera features to enhance 3D panoptic segmentation. Motivated by the promising combination of image and depth information, camera and lidar feature maps are fused in the range view following either an iterative or pyramid-based fusion strategy. Finally, the individual contributions are combined into a novel multimodal multi view architecture that simultaneously exploits the proposed multi view, temporal, and sensor fusion frameworks.

Extensive experiments on the two large scale public datasets nuScenes and SemanticKITTI are conducted to investigate the benefits of the three main contributions and the combined multimodal framework. First, the evaluation shows the superiority of the multi view approach over single view methods. Next, it underlines the value of learning temporal dependencies by revealing significant improvements of the presented temporal framework over single frame baselines. Furthermore, it confirms the value of fusing distinct sensor features. Finally, excellent results are achieved by combining the presented approaches into a multimodal framework, which outperforms state-of-the-art results for various tasks and benchmarks.

# **Kurzfassung**

The human driver is often one of the main causes of traffic accidents. Driver assistance systems and autonomous driving therefore have the potential to reduce the number of traffic accidents drastically by supporting or replacing the human driver. An important prerequisite for this is the reliable perception and understanding of the vehicle's 3D environment. Automated systems require information about the course of the road, the static environment, and the position and velocity of other traffic participants, such as cars or pedestrians. For this purpose, autonomous vehicles are usually equipped with a variety of sensors, such as camera, lidar, radar, and ultrasonic sensors. The 3D point clouds recorded by lidar sensors make a central contribution to 3D environment perception since they provide precise 3D information about the current surroundings. An essential task in this context is panoptic segmentation, which assigns every 3D point of the point cloud to a semantic class and an individual object instance. Due to the unordered and irregular structure of 3D point clouds, however, new approaches and algorithms are required for this task in order to achieve high quality and robust results. Established technologies from 2D image processing, such as Convolutional Neural Networks, are not directly applicable because of this unordered structure. A central challenge is therefore to represent 3D point clouds suitably in order to apply deep learning approaches and exploit their potential. Besides the representation of the point clouds, the temporally sequential nature of the sensor data also offers considerable potential for improving the panoptic segmentation. A major challenge, however, is to transfer the information extracted from one point cloud to the temporally succeeding point cloud, since spatial correspondences in 3D must be found. A further promising opportunity for improvement is the fusion of different sensor modalities with distinct measurement principles. Sensor fusion, however, requires an efficient combination of diverse information from different sensor spaces. An even greater and unsolved challenge is a combined solution that unites all three aspects.

The objective of this thesis is the development of a multimodal approach based on deep learning for the panoptic segmentation of 3D point clouds. Its three central aspects are a multi view architecture for 3D point clouds, the temporal fusion of 3D point clouds, and their fusion with camera information. The multi view architecture unites different representations of 3D point clouds, also called *views*, to combine their strengths and compensate for individual weaknesses. With the range view and bird's eye view, two complementary 2D representations are employed for the efficient aggregation of features and context. Both support the third representation used, the so-called point view. It serves as the connecting element to fuse the features of all three representations for every 3D point. The output of the point view network is an individual feature vector for every 3D point for its semantic classification. Afterwards, a recurrent temporal fusion approach is introduced to learn and exploit temporal dependencies. For this purpose, feature maps in range and bird's eye view are aggregated over time, forming a temporal memory. Due to the recursive aggregation, this memory contains the information of the preceding point clouds and is updated with the current information in every step. A transformation step compensates for the ego motion of the vehicle and ensures the spatial consistency of the memory over time. In the next step, a sensor fusion approach for fusing lidar and camera information is presented to further improve the panoptic segmentation. Motivated by the successful combination of RGB and depth information, camera and lidar feature maps are combined in the range view, using either an iterative or a pyramid-based fusion strategy. Finally, the individual contributions are combined into a multimodal multi view architecture, which is the first approach to unite the advantages of a multi view architecture, temporal fusion, and sensor fusion for 3D panoptic segmentation.

For the evaluation, diverse experiments are conducted on the two extensive public datasets nuScenes and SemanticKITTI. The benefits of the individual proposed approaches as well as of the combined multimodal approach are investigated. In a first step, the experiments show the superiority of the multi view approach over single view methods. In addition, the value of learned temporal dependencies is underlined, as the conducted experiments show significant improvements resulting from the temporal fusion. Furthermore, the experiments confirm substantial improvements of the panoptic segmentation through the proposed fusion of lidar and camera information. As the overall result of this thesis, the combination of the presented approaches into a multimodal approach achieves excellent results and improves the state of the art for various types of segmentation.

# **Contents**




# **Notation**

This chapter introduces the notation and symbols that are used in this thesis.

## **General Notation**


### **Scalars**




### **Vectors**



### **Matrices and Tensors**


### **Functions**




### **Sets**


# **1 Multimodal Scene Understanding with 3D Point Clouds**

## **1.1 Motivation**

Advanced Driver Assistance Systems (ADAS) and autonomous driving are among the most impactful and disruptive technologies in the automotive industry and beyond. The purpose of these systems is to make driving safer and more convenient. According to the National Highway Traffic Safety Administration (NHTSA), the human driver is the critical reason for approximately 94% of traffic accidents in the United States [Sin15]. Therefore, ADAS and autonomous driving have the potential to support or replace the driver and to reduce the number of human-related accidents drastically. The capabilities of these systems are characterized by two key properties. The first one is the level of autonomy, which determines if the system takes over full responsibility or must be supervised by the human driver. The second one is the operational design domain, which specifies the use cases the system can handle. Hence, five levels have been proposed [SAE18] to classify these systems, starting from level one with limited functionality and restricted use cases, while the entire responsibility remains with the driver. On the other hand, systems of the fifth level provide full autonomy for any use case and take over full responsibility. The increasing amount of autonomy, functionality, and variety in the covered use cases imposes tremendous requirements on the autonomous system and especially its environmental perception. Therefore, without a robust and comprehensive understanding of its environment, an autonomous vehicle can neither fulfill its purpose nor drive safely at all.

Consequently, the understanding and interpretation of a vehicle's 3D environment has become increasingly important. The foundation is the sensor set of an autonomous vehicle, which usually comprises a variety of different sensors, such as camera, lidar, radar, and ultrasonic sensors. Based on measurements provided by these sensors, a comprehensive and unified environment model must be predicted to provide a holistic understanding of a vehicle's current 3D environment. Among others, this includes knowledge about drivable space, the static world, or the position and motion of other traffic participants, such as cars or pedestrians. Therefore, the different and complementary sensors are combined by sensor fusion to create the required unified environment model and to compensate for the shortcomings of individual sensor types. In addition, sensors in the context of autonomous driving record their environment sequentially, and previous recordings contain valuable information also for the current time step. Hence, a temporal fusion of current and past information has the potential to improve the environment model.

The individual sensor types provide different data, such as camera images or lidar point clouds. Lidar sensors and their recorded point clouds are particularly interesting for 3D scene understanding or environment models since they provide accurate 3D information. Various tasks can be solved based on these 3D points to provide valuable information for autonomous driving, such as object detection and semantic or panoptic segmentation. The latter is an important and complex task, depicted in Fig. 1.1, that enhances every 3D point with semantic and instance information. Semantic information describes the object class, whereas instance information allows distinguishing between individual instances of a semantic class. Hence, 3D panoptic segmentation provides a valuable combination of geometric, semantic, and instance knowledge.

Motivated by the high value of 3D panoptic segmentation for environment perception, this thesis focuses on solving this task for lidar point clouds. Methods based on deep learning achieve excellent results for scene understanding tasks in the image domain, such as panoptic segmentation. Therefore, the proposed approach builds upon deep learning to leverage its power for 3D point clouds. Furthermore, this thesis focuses on a multimodal approach capable of performing sensor and temporal fusion. It predicts panoptic segmentation for lidar point clouds while additionally exploiting camera and temporal information.

**Figure 1.1:** The task of 3D panoptic segmentation. It is composed of semantic and instance segmentation subtasks, shown in the middle and at the bottom, respectively. Different colors visualize the semantic classes (such as car, road, sidewalk, terrain, vegetation, trunk, and building) and individual instances (car 1 to car 5).

## **1.2 Challenges**

Panoptic segmentation of 3D point clouds is a challenging task due to the irregular and sparse nature of point clouds and the necessity to distinguish simultaneously between a significant number of semantic classes and their instances. Additionally, existing training data is considerably unbalanced with respect to the semantic classes and their instances. Furthermore, approaches that include sensor and temporal fusion face additional challenges related to calibration, ego motion, and temporal synchronization.

The first set of challenges originates from the particular properties of lidar point clouds. These point clouds are unordered, irregular, and sparse, and their point density varies strongly with distance.


These properties complicate the computation and exploitation of relations between individual points and the capturing of local structures. As a result, the hierarchical aggregation of information required to understand objects and structures sparsely represented in a point cloud is challenging. Furthermore, the unordered nature prevents the direct application of established deep learning architectures from the image domain. These require their input data to be organized as a grid and face additional challenges related to the training data:


Independently of the considered panoptic segmentation task and deep learning, more challenges arise when temporal and sensor fusion are considered:


## **1.3 Contributions**

The objective of this thesis is the design of a multimodal architecture based on deep learning for robust and high quality panoptic segmentation of 3D point clouds. The proposed framework builds upon three main concepts: a multi view point cloud architecture, a temporal feature fusion, and a deep sensor fusion. The multi view architecture relies on different point cloud representations, also called *views*, to exploit their strengths and compensate for weaknesses. A recurrent temporal feature fusion considers information from previous time steps to exploit temporal dependencies. Finally, deep sensor fusion exploits cameras as an additional sensor modality to improve the 3D panoptic segmentation. The evaluation is performed on two challenging and large scale outdoor datasets, where the individual contributions and their combination outperform state-of-the-art results for various tasks. The contributions of this thesis are in detail:

• A novel multi view framework [Due22] addresses the shortcomings of single view approaches and individual views. It is based on range view, bird's eye view, and point view and obtains significantly improved features compared to single view approaches. Range and bird's eye view provide efficient context aggregation, while the high resolution point view maintains a unique feature vector for every 3D point. Due to the carefully chosen views, the introduced approach considerably reduces the computational complexity compared to established multi view approaches. The framework also includes an enhanced multi view, multi task strategy. The point view provides the 3D semantic segmentation, whereas the bird's eye view is used for center-based instance recognition, required for instance segmentation.


# **2 Related Work**

Panoptic segmentation [Kir19] is the combined task of semantic and instance segmentation, which provides semantic and object information about the environment. The subtask of semantic segmentation assigns one of the predefined semantic classes to every image pixel or 3D point. Instance segmentation, on the other hand, clusters pixels or points into instances. However, distinguishing between instances is only possible and useful for some semantic classes. Foreground or "thing" classes are countable classes that require instance segmentation, specifically traffic participants, such as car, bicyclist, or pedestrian. On the other hand, background or "stuff" classes are uncountable, such as road and sidewalk, or their instances are irrelevant for the considered scenario, such as buildings or poles. Therefore, instance segmentation is only provided for the subset of thing classes, which are determined by the semantic segmentation. Overall, panoptic segmentation simultaneously requires a high quality semantic segmentation and sophisticated instance recognition for convincing panoptic results. Furthermore, it requires mutually consistent predictions for both subtasks instead of an independent and trivial combination of both.

The overall goal of this thesis is a multimodal, deep learning-based approach for panoptic segmentation of 3D point clouds. It builds upon the foundations of deep neural networks and their success in 2D scene understanding. Related work is investigated in the areas of 3D semantic, instance, and panoptic segmentation, as well as deep learning-based temporal and sensor fusion.

## **2.1 Deep Neural Networks**

While the mathematical foundations and basic unit had already been proposed in the middle of the last century [Ros57], the path from this early and simple linear classifier to a deep neural network that exceeds human-level performance in image classification [Rus15, He15b] took more than half a century. Neural networks in computer vision experienced their renaissance with the growing computational power of GPUs [Kri12], which were able to optimize these networks in reasonable time. In the following years, tremendous progress and improvements have been achieved across various computer vision tasks, and in other areas, such as natural language processing. The following section summarizes the fundamentals of neural networks and specialized network architectures, such as convolutional and recurrent neural networks. Further theoretical and mathematical details can be found in [Dud00, Bis06]. While $x$ and $y$ describe Cartesian coordinates throughout this thesis, they are used as scalar components of input and output vectors $\mathbf{x}$ and $\mathbf{y}$ in this section to follow established conventions [Dud00, Bis06].

### **2.1.1 Multi-Layer Perceptron**

The basic building blocks of neural networks are individual neurons, called perceptrons [Ros57], and their structure is depicted in Fig. 2.1. Perceptrons take $N$ scalar input values and produce a scalar output $y$. For that reason, the input vector $\mathbf{x} \in \mathbb{R}^N$ is multiplied with a weight vector $\boldsymbol{\omega} \in \mathbb{R}^N$ and a scalar bias $b$ is added. Finally, an activation function $\mathcal{A}: \mathbb{R} \to \mathbb{R}$ is applied to produce the output:

$$y = \mathcal{A}(\boldsymbol{\omega}^{\mathsf{T}} \mathbf{x} + b).\tag{2.1}$$

Common choices for the activation function of Convolutional Neural Networks (CNNs) are nowadays Rectified Linear Units (ReLUs) [Jar09, Nai10]

$$\mathcal{A}(x) = \max(x, 0) \tag{2.2}$$

and their leaky counterpart, Leaky Rectified Linear Units (LReLUs) [Maa13]

$$\mathcal{A}(x) = \max(x, 0) + \beta \cdot \min(x, 0), \tag{2.3}$$

where $\beta$ is a small number, such as 0.01. Alternatively, the sigmoid or hyperbolic tangent activation functions are other possible choices.

**Figure 2.1:** Structure of a perceptron.
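As a concrete illustration of Eqs. (2.1) to (2.3), the following minimal NumPy sketch implements a single perceptron with ReLU and Leaky ReLU activations; the function names and the toy input are illustrative choices, not part of the thesis.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit, Eq. (2.2)."""
    return np.maximum(x, 0.0)

def leaky_relu(x, beta=0.01):
    """Leaky ReLU, Eq. (2.3); beta is the small slope for negative inputs."""
    return np.maximum(x, 0.0) + beta * np.minimum(x, 0.0)

def perceptron(x, omega, b, activation=relu):
    """Single perceptron, Eq. (2.1): y = A(omega^T x + b)."""
    return activation(np.dot(omega, x) + b)

# toy example with a three-dimensional input
x = np.array([0.5, -1.0, 2.0])
omega = np.array([0.2, 0.4, -0.1])
print(perceptron(x, omega, b=0.1))                         # ReLU output
print(perceptron(x, omega, b=0.1, activation=leaky_relu))  # Leaky ReLU output
```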

A Multi-Layer Perceptron (MLP) combines multiple perceptrons to solve complex and nonlinear problems. The perceptrons are grouped into layers and only connected to perceptrons of the previous and next layer, illustrated in Fig. 2.2. No connections exist inside a layer, and outputs are only provided to the next layer, which prohibits feedback loops to previous layers. The layer $l$ with $N_l$ perceptrons receives the output or feature vector $\mathbf{f}_{l-1} \in \mathbb{R}^{N_{l-1}}$ of the previous layer, representing the scalar outputs of $N_{l-1}$ perceptrons. A weight matrix $\mathbf{W}_l \in \mathbb{R}^{N_l \times N_{l-1}}$ and bias vector $\mathbf{b}_l \in \mathbb{R}^{N_l}$ contain the weight vector and scalar bias of every perceptron in the $l$-th layer. This leads to the overall equation

$$\mathbf{f}\_l = \mathcal{A}(\mathbf{W}\_l \mathbf{f}\_{l-1} + \mathbf{b}\_l),\tag{2.4}$$

where $\mathbf{f}_0 = \mathbf{x}$ is the input, like an image, and $\mathbf{y} = \mathbf{f}_L \in \mathbb{R}^{N_L}$ is the output of an MLP with $L$ layers.

**Figure 2.2:** Multi-Layer Perceptron (MLP) composed of individual perceptrons.

Weight matrices and bias vectors, in the following summarized as $\mathbf{W}$, are the parameters of an MLP, which are optimized during training. The usual supervised training strategy requires pairs of input and ground truth output data $(\mathbf{x}, \mathbf{y}^{\text{gt}})$ for which the network computes its output $\mathbf{y}^{\mathbf{x}}$. A task-dependent loss function $\mathcal{L}(\mathbf{y}^{\mathbf{x}}, \mathbf{y}^{\text{gt}})$ measures their alignment. Common loss functions for regression are the mean-squared error (MSE) and the mean-absolute error (MAE):

$$\begin{split} \mathcal{L}_{\text{MSE}}\left(\mathbf{y}^{\mathbf{x}}, \mathbf{y}^{\text{gt}}\right) &= \frac{1}{N_L} ||\mathbf{y}^{\mathbf{x}} - \mathbf{y}^{\text{gt}}||_2^2 = L_{\text{MSE}}, \\ \mathcal{L}_{\text{MAE}}\left(\mathbf{y}^{\mathbf{x}}, \mathbf{y}^{\text{gt}}\right) &= \frac{1}{N_L} ||\mathbf{y}^{\mathbf{x}} - \mathbf{y}^{\text{gt}}||_1 = L_{\text{MAE}}. \end{split} \tag{2.5}$$

For classification, the output vector contains the individual class scores, and its dimension matches the number of classes $N_{\text{classes}}$. In this case, the Cross-Entropy (CE) loss is commonly used:

$$\mathcal{L}_{\text{CE}}(\mathbf{y}^{\mathbf{x}}, \mathbf{y}^{\text{gt}}) = -\sum_{\text{cls}=1}^{N_{\text{classes}}} \log\left(\frac{e^{\mathbf{y}_{\text{cls}}}}{\sum_{\text{cls}'=1}^{N_{\text{classes}}} e^{\mathbf{y}_{\text{cls}'}}}\right) \cdot \mathbf{y}_{\text{cls}}^{\text{gt}} = L_{\text{CE}}.\tag{2.6}$$
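A small NumPy sketch of the losses in Eqs. (2.5) and (2.6) is given below; it assumes a single sample, averages over the output dimension, and applies a softmax to raw class scores, with all names chosen for illustration.

```python
import numpy as np

def mse_loss(y_pred, y_gt):
    """Mean-squared error, Eq. (2.5), averaged over the output dimension."""
    return np.mean((y_pred - y_gt) ** 2)

def mae_loss(y_pred, y_gt):
    """Mean-absolute error, Eq. (2.5)."""
    return np.mean(np.abs(y_pred - y_gt))

def cross_entropy_loss(y_scores, y_gt_onehot):
    """Cross-entropy over softmax-normalized class scores, Eq. (2.6)."""
    y_scores = y_scores - np.max(y_scores)                 # numerical stability
    log_softmax = y_scores - np.log(np.sum(np.exp(y_scores)))
    return -np.sum(log_softmax * y_gt_onehot)

# regression example
print(mse_loss(np.array([1.0, 2.0]), np.array([0.5, 2.5])))
# classification example with three classes and ground truth class 1
print(cross_entropy_loss(np.array([2.0, 0.5, -1.0]), np.array([1.0, 0.0, 0.0])))
```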

Backpropagation [Rum86] computes the gradient of the loss with respect to the network parameters and is the foundation for the optimization process. The commonly used stochastic gradient descent computes in every iteration $i$ the gradient over a subset of the training set, called batch $\mathcal{B}_i$, and applies a given learning rate $\eta$:

$$\Delta \mathbf{W}_i = -\eta \cdot \sum_{(\mathbf{x}, \mathbf{y}^{\text{gt}}) \in \mathcal{B}_i} \frac{\partial \mathcal{L}(\mathbf{y}^{\mathbf{x}}, \mathbf{y}^{\text{gt}})}{\partial \mathbf{W}_i}.\tag{2.7}$$

Afterwards, the parameters are updated accordingly:

$$\mathbf{W}_{i+1} = \mathbf{W}_i + \Delta \mathbf{W}_i.\tag{2.8}$$

Momentum extends this update step with a portion $\psi$ of the previous weight update to overcome plateaus in the loss function and converge more quickly:

$$\begin{aligned} \Delta \mathbf{W}'_i &= \Delta \mathbf{W}_i + \psi\, \Delta \mathbf{W}'_{i-1}, \\ \mathbf{W}_{i+1} &= \mathbf{W}_i + \Delta \mathbf{W}'_i. \end{aligned}\tag{2.9}$$
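The update rule of Eqs. (2.7) to (2.9) can be sketched as follows for a toy linear regression problem; the learning rate, momentum factor, and synthetic data are arbitrary example values.

```python
import numpy as np

def sgd_momentum_step(W, grad, prev_update, lr=0.01, psi=0.9):
    """One update following Eqs. (2.7)-(2.9): scaled negative gradient plus
    a fraction psi of the previous weight update (momentum)."""
    update = -lr * grad + psi * prev_update
    return W + update, update

# toy example: fit y = W x with an MSE loss on a single batch
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                     # batch of 32 inputs
y_gt = X @ np.array([1.0, -2.0, 0.5])            # ground truth targets
W = np.zeros(3)
prev_update = np.zeros(3)
for _ in range(200):
    grad = 2.0 / len(X) * X.T @ (X @ W - y_gt)   # MSE gradient over the batch
    W, prev_update = sgd_momentum_step(W, grad, prev_update)
print(W)                                         # approaches [1.0, -2.0, 0.5]
```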

### **2.1.2 Convolutional Neural Networks**

The previously presented layers of an MLP are called fully connected because every perceptron of one layer is connected to every perceptron of the next layer. As a result, the number of connections and associated weights increases quadratically with the number of perceptrons. This is especially challenging for high dimensional inputs, like images, with potentially millions of pixels. Convolutional layers address this challenge for data with a grid-like topology by sparsity and parameter sharing. The sparsity is achieved by restricting connections to nearby perceptrons, illustrated in Fig. 2.3, based on the assumption of strong local structure and correlation [Lec98]. This locally connected region is the receptive field of a convolutional layer. Additionally, the weights are shared among all perceptrons of one layer to regularize the parameters and further reduce their number. Therefore, the weights are independent of a perceptron's position and grant spatial invariance. As a result, the operation of a convolutional layer can be considered as applying filter kernels with the size of the receptive field to its input.

**Figure 2.3:** Comparison of a fully connected and 1D convolutional layer. Illustration based on [Her18].

Convolutional layers are the main building blocks of CNNs, together with pooling and fully connected layers [Lec98]. The first part of CNNs, made of convolutional and pooling layers, is called feature extractor or backbone and is responsible for extracting meaningful features from the network input. Convolutional layers apply distinct and learnable $k \times k$ kernels to their input to compute so-called feature maps, depicted in Fig. 2.4. These feature maps contain the extracted features, such as edges or corners when considering image input. Pooling layers, on the other hand, subsample feature maps by applying a reduction operation to local regions, e.g., of size 2 × 2 or 3 × 3. Most commonly, the maximum is taken from each region, which has shown better results than taking the average [Sch10]. In order to achieve the desired subsampling, pooling operations are usually applied with a stride of two. As a result, the pooling operation is only applied to every other spatial location, which halves the feature map resolution. A common alternative to reduce the feature map resolution is the use of convolutional layers with a stride of two or higher [Spr15]. The downsampling aggregates features and reduces the sensitivity of the output to shifts and distortions [Lec98]. Most importantly, repeated downsampling allows subsequent layers to extract higher-order features with increasing abstraction levels.

The basic setup of a CNN consists of alternating convolutional and pooling layers in the feature extractor, followed by fully-connected layers in the second part, which is often referred to as the head. It computes the task-specific final output, such as a classification vector, based on the feature maps from the feature extractor.

**Figure 2.4:** Operating principle of a convolutional layer. $N_{\text{out}}$ learnable filter kernels are applied to the $N_{\text{in}}$ input feature maps and produce $N_{\text{out}}$ new feature maps.

#### **Deep Convolutional Neural Networks**

Over time, state-of-the-art network architectures have become more complex, which is reflected in an increasing number of layers and has shaped the term deep learning [Sim14]. Starting with just a few layers for the task of document recognition [Lec98] and later image classification [Kri12], the number rose to 16 layers [Sim14] and afterwards up to 100 layers and beyond [Sze15, He16]. This was mainly driven by the challenging classification task of ImageNet [Den09], where images have to be classified as one of 1,000 classes. However, training very deep neural networks comes with several challenges. In order to address the vanishing or exploding gradient problem [Ben94], the activation function and weight initialization must be carefully chosen. As a result, ReLU or LReLU are the established choices, and the initial layer weights are sampled from a Gaussian distribution with zero mean and a variance based on layer size [Glo10] or based on layer size and activation function [He15b]. Furthermore, Batch Normalization (BN) [Iof15] has been proposed to normalize the outputs of convolutional layers before the activation function is applied. Looking at the 1D case, the layer activations $\tilde{\mathbf{f}}_l = \mathbf{W}_l \mathbf{f}_{l-1} + \mathbf{b}_l$ are normalized to zero mean and unit variance $\tilde{\mathbf{f}}_l^{\text{norm}}$ by the mean $\boldsymbol{\mu}_{\mathcal{B}}$ and variance $\boldsymbol{\sigma}_{\mathcal{B}}^2$ computed over the current batch $\mathcal{B}$. The learnable parameters $\boldsymbol{\omega}_{\text{BN}}$ and $\mathbf{b}_{\text{BN}}$ ensure that the normalization layer does not negatively impact the representational capabilities:

$$
\tilde{\mathbf{f}}\_{l}^{\text{BN}} = \boldsymbol{\omega}\_{\text{BN}} \circ \tilde{\mathbf{f}}\_{l}^{\text{norm}} + \mathbf{b}\_{\text{BN}},\tag{2.10}
$$

where ∘ denotes the Hadamard product. Batch Normalization makes the network's weight initialization more robust and allows higher learning rates for faster convergence. Since batches are usually only used for training, running mean and variance are stored during training and applied during inference.
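A minimal NumPy sketch of the 1D Batch Normalization of Eq. (2.10) is shown below; the small epsilon added to the variance is a standard numerical stabilizer and not part of the equation, and all names are illustrative.

```python
import numpy as np

def batch_norm_1d(f_tilde, omega_bn, b_bn, eps=1e-5):
    """Batch Normalization for the 1D case, Eq. (2.10): activations are
    normalized to zero mean and unit variance over the batch, then scaled
    and shifted by the learnable parameters omega_bn and b_bn."""
    mean = f_tilde.mean(axis=0)                  # statistics over the batch axis
    var = f_tilde.var(axis=0)
    f_norm = (f_tilde - mean) / np.sqrt(var + eps)
    return omega_bn * f_norm + b_bn              # element-wise scale and shift

# batch of 8 activation vectors with 4 features each
activations = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=(8, 4))
out = batch_norm_1d(activations, omega_bn=np.ones(4), b_bn=np.zeros(4))
print(out.mean(axis=0), out.var(axis=0))         # ~0 mean, ~1 variance per feature
```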

Based on these techniques, "going deeper" [Sze15] has shown great success and surpassed human-level performance [He15b] on the classification task of ImageNet. However, experiments have shown that the improvement achieved by stacking more layers not only diminishes but turns into a negative impact beyond a certain number [Sri15, He15a]. This effect cannot be explained by overfitting because the training error increases as well [He15a]. Residual networks [He16] are motivated by the consideration that increasing the number of layers should not negatively impact the results since they can be turned into identity functions. Therefore, instead of directly learning a mapping from input to output, a residual mapping is proposed, which is implemented by an identity skip connection and illustrated in Fig. 2.5. This enables deeper networks with improved results and allows the successful training of over a thousand layers. The core elements are residual Basic Blocks (BBs) and residual Bottleneck Blocks (BoBs) with the characteristic skip connection. These are grouped into stages which contain all layers applied to one specific feature map resolution or scale. The stages start with a subsampling layer, such as maximum pooling or strided convolution. As illustrated in Fig. 2.5, residual networks are composed of five stages with a varying number of residual blocks, depending on the specific setup. Commonly used configurations are ResNet-34, ResNet-50, ResNet-101, and ResNet-152, where the number indicates the total number of layers. Their detailed configurations can be found in [He16]. These networks have been, and still are, the predominant feature extractor in state-of-the-art methods across many tasks.

**Figure 2.5:** Building blocks and architecture of residual networks.
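The following PyTorch sketch shows a residual Basic Block with an identity skip connection in the spirit of [He16]; it keeps the channel count and resolution fixed and omits the strided and projection variants, so it is an illustration rather than the exact configuration used in ResNets.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual Basic Block: two 3x3 convolutions with Batch Normalization and
    an identity skip connection added before the final activation."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                               # identity skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)           # residual mapping

# a feature map of size 64x64 with 32 channels keeps its shape
features = torch.randn(1, 32, 64, 64)
print(BasicBlock(32)(features).shape)              # torch.Size([1, 32, 64, 64])
```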

### **2.1.3 Recurrent Neural Networks**

The huge success of neural networks is not limited to computer vision but influences other fields like natural language processing too. In this area, data is often arranged in sequences, like a sequence of words forming a sentence or text. As a result, neural architectures emerged, which are able to extract information along a sequence of data. When considering time, these principles are also fundamental for computer vision, especially in the context of 2D scene understanding, since they allow processing and exploiting video data instead of individual images.

In contrast to the feed forward networks discussed so far, Recurrent Neural Networks (RNNs) have a feedback loop, giving access to information from previous inputs. Figure 2.6 shows this recurrent loop, where the so-called hidden state $\mathbf{h}$ loops back to the network input. In addition, Fig. 2.6 illustrates the unrolled network for a sequence of three inputs $\mathbf{x}_0$, $\mathbf{x}_1$, and $\mathbf{x}_2$ to visualize the repeated application of an RNN to the elements of a sequence and their relations more clearly. The recurrent network receives not only the input vector $\mathbf{x}_l$ but also the hidden state vector $\mathbf{h}_{l-1}$ with features from the processing of the previous input. As a result, the predictions are not only based on the corresponding input but also on the sequence context provided by the hidden state.

**Figure 2.6:** Recurrent Neural Network (RNN) with an unrolled example on the right. The gradient flow is indicated in orange.

As the unrolled architecture indicates, RNNs can also be interpreted as very deep neural networks, especially for long sequences. Therefore, they also suffer from exploding and vanishing gradients since the gradient must be propagated along the recurrent application, depicted as orange path in Fig. 2.6. One way to address this is to use the gating mechanisms proposed by Long Short-Term Memory (LSTM) [Hoc97] and Gated Recurrent Units (GRUs) [Cho14]. Figure 2.7 exemplarily illustrates the latter with a focus on time series data, where the sequence index $l$ equals a discrete point in time. Its output is computed by the following equations:

$$\begin{aligned} \mathbf{r}\_{l} &= \text{sigmoid}\left(\mathbf{W}\_{\mathbf{x},\mathbf{r}}\mathbf{x}\_{l} + \mathbf{W}\_{\mathbf{h},\mathbf{r}}\mathbf{h}\_{l-1} + \mathbf{b}\_{\mathbf{r}}\right), \\ \mathbf{z}\_{l} &= \text{sigmoid}\left(\mathbf{W}\_{\mathbf{x},\mathbf{z}}\mathbf{x}\_{l} + \mathbf{W}\_{\mathbf{h},\mathbf{z}}\mathbf{h}\_{l-1} + \mathbf{b}\_{\mathbf{z}}\right), \\ \widetilde{\mathbf{h}}\_{l} &= \tanh\left(\mathbf{W}\_{\mathbf{x},\mathbf{h}}\mathbf{x}\_{l} + \mathbf{W}\_{\mathbf{h},\mathbf{h}}\left(\mathbf{r}\_{l} \circ \mathbf{h}\_{l-1}\right) + \mathbf{b}\_{\mathbf{h}}\right), \\ \mathbf{y}\_{l} &= \mathbf{h}\_{l} = (\mathbf{1} - \mathbf{z}\_{l}) \circ \mathbf{h}\_{l-1} + \mathbf{z}\_{l} \circ \widetilde{\mathbf{h}}\_{l}. \end{aligned} \tag{2.11}$$

The reset gate computes the reset vector $\mathbf{r}_l$ and decides which information from the previous state $\mathbf{h}_{l-1}$ to forget and which to keep. The update vector $\mathbf{z}_l$ on the other hand controls the element-wise combination of the previous hidden state and the new candidate state $\widetilde{\mathbf{h}}_l$. The latter is computed from the previous hidden state multiplied by the reset vector and the current input. All three gates are implemented based on a single-layer MLP with the respective weight matrices $\mathbf{W}$ and bias vectors $\mathbf{b}$. When propagating the gradient along the processed sequence, it only has to pass the element-wise addition and multiplication, but not an entire MLP. This advantage also holds when the output of an MLP is provided to the GRU instead of the raw input $\mathbf{x}_l$. The gating mechanism counteracts vanishing or exploding gradients by reducing the number of layers the gradient passes when being backpropagated along the sequence. ConvGRU [Sia17] transfers this concept to the image domain, where input and hidden states are 2D feature maps instead of feature vectors, and the gates build upon convolutional layers instead of MLPs.

**Figure 2.7:** Structure of a Gated Recurrent Unit (GRU). It computes its output based on the previous hidden state $\mathbf{h}_{l-1}$ and input $\mathbf{x}_l$ using a reset, update, and candidate gate.
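A direct NumPy transcription of Eq. (2.11) is sketched below; the dictionary of weight matrices and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One GRU step following Eq. (2.11); params holds the weight matrices
    W_xr, W_hr, W_xz, W_hz, W_xh, W_hh and the bias vectors b_r, b_z, b_h."""
    r = sigmoid(params["W_xr"] @ x + params["W_hr"] @ h_prev + params["b_r"])  # reset gate
    z = sigmoid(params["W_xz"] @ x + params["W_hz"] @ h_prev + params["b_z"])  # update gate
    h_cand = np.tanh(params["W_xh"] @ x + params["W_hh"] @ (r * h_prev) + params["b_h"])
    return (1.0 - z) * h_prev + z * h_cand                                     # new hidden state

# toy sequence: 3-dimensional inputs, 4-dimensional hidden state
rng = np.random.default_rng(0)
params = {}
for name in ["W_xr", "W_xz", "W_xh"]:
    params[name] = rng.normal(scale=0.1, size=(4, 3))      # input-to-hidden weights
for name in ["W_hr", "W_hz", "W_hh"]:
    params[name] = rng.normal(scale=0.1, size=(4, 4))      # hidden-to-hidden weights
for name in ["b_r", "b_z", "b_h"]:
    params[name] = np.zeros(4)

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):                        # a sequence of five inputs
    h = gru_cell(x_t, h, params)
print(h)
```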

## **2.2 2D Scene Understanding**

After the overwhelming success of CNNs in image classification [Rus15], they have been quickly deployed for other application areas, such as 2D scene understanding. The latter comprises, among others, the tasks of semantic and instance segmentation, depicted in Fig. 2.8. This was also supported by the release of a constantly increasing number of semantic datasets [Bro09, Sil12, Lin14, Mot14, Cor16, Zho17, Yu20, Gey20] providing pixelwise semantic labels. Some of these datasets [Sil12, Lin14, Cor16, Zho17, Yu20] additionally provide pixelwise instance labels and enable the development of panoptic segmentation approaches. Since scene understanding is one of the key challenges of autonomous driving, many of these datasets belong to the outdoor driving domain [Bro09, Cor16, Yu20, Gey20]. In general, 2D scene understanding is a huge field with countless published work. The following section focuses on pioneer work and approaches that influenced this thesis or are directly used.

**Figure 2.8:** Semantic and instance segmentation, two tasks of 2D scene understanding [Cor16]. (a) Camera image. (b) Semantic segmentation. (c) Instance segmentation.

## **2.2.1 Semantic Segmentation**

The task of assigning a semantic class to every pixel of an image is called semantic segmentation. One fundamental concept for approaches based on CNNs are Fully Convolutional Networks (FCNs) [Lon15], which compute a pixelwise prediction for a given input image in an end-to-end fashion. This requires a new architecture for the network head because a classification vector is required for every individual pixel instead of one for the entire image. Therefore, a 1 × 1-convolutional layer replaces the standard fully-connected classification layer to generate a prediction for every pixel. However, the output of a CNN's feature extractor usually has a considerably smaller resolution than its input due to pooling or convolutions with a stride greater than one. As a result, the low resolution predictions provided by the 1 × 1-convolutional layer must be upsampled again. Different FCN architectures have been proposed, with a single or multiple upsampling steps. Alongside bilinear interpolation, a new layer called deconvolution or transposed convolution learns the upsampling instead of applying a fixed one. The fully convolutional network with three upsampling steps (FCN-8s) is visualized in Fig. 2.9.

**Figure 2.9:** Architecture of Fully Convolutional Networks (FCNs). The low-resolution predictions are upsampled in three steps by transposed convolutions.

U-Net [Ron15] enhances FCNs by improving the upsampling processes. The number of upsampling steps is matched to the number of downsampling steps of the feature extractor. For every upsampling step, the corresponding feature maps with matching resolution are concatenated to improve spatial feature propagation, see Fig. 2.10. In contrast to native FCNs, U-Net upsamples the feature maps and not predictions. Overall, the downsampling and upsampling paths form a U-shaped architecture and are often called encoder and decoder, respectively. U-Net started to address one of the main challenges introduced by the pixelwise semantic segmentation task, which makes it necessary to simultaneously capture the global context of a scene and fine details. The former requires large receptive fields and is usually achieved by iteratively reducing the feature map resolution while dropping spatial information. However, the loss of spatial information negatively affects the capturing of fine details since information about the features' exact location is lost.

**Figure 2.10:** U-Net architecture for 2D semantic segmentation.

PSPNet [Zha17] addresses this challenge by introducing a pyramid pooling module. Different pyramid levels divide the feature maps into different-sized subregions and compute an aggregated representation for each region using pooling, depicted in Fig. 2.11. The aggregated context varies from local to global depending on the subregion size. The outputs of the different pyramid levels are upsampled and concatenated with the original feature maps. As a result, the final classification layer is provided with feature maps containing local and global context at different scales, illustrated in Fig. 2.11, which significantly improves the segmentation results.

**Figure 2.11:** PSPNet with its pyramid pooling approach, illustration based on [Zha17].
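The pyramid pooling idea can be sketched in PyTorch as follows; the pooling sizes (1, 2, 3, 6) follow the original paper, while the module layout and names are simplified assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """PSPNet-style pyramid pooling [Zha17]: the feature map is pooled to several
    grid sizes, compressed by 1x1 convolutions, upsampled back to the input
    resolution, and concatenated with the original features."""

    def __init__(self, in_channels, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = in_channels // len(pool_sizes)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(size),                       # pool to size x size subregions
                nn.Conv2d(in_channels, branch_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            for size in pool_sizes
        ])
        self.out_channels = in_channels + branch_channels * len(pool_sizes)

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramid = [F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
                   for branch in self.branches]
        return torch.cat([x] + pyramid, dim=1)                    # local to global context

ppm = PyramidPoolingModule(in_channels=256)
print(ppm(torch.randn(1, 256, 32, 32)).shape)                     # torch.Size([1, 512, 32, 32])
```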

The DeepLab family [Che18, Che17a] relies on atrous convolutions to increase the receptive fields' size without reducing the feature map resolution or increasing filter sizes. Additionally, Atrous Spatial Pyramid Pooling (ASPP) is implemented by deploying atrous convolutions at different rates in parallel to exploit context at different scales. This pyramid pooling is similar to PSPNet but with atrous convolutions instead of pooling operations. Its goal is again the aggregation of multi scale context.

Deep Layer Aggregation (DLA) [Yu18] replaces simple skip connections with an enhanced aggregation architecture, as depicted in Fig. 2.12. Two neighboring stages of the feature extractor provide their feature maps to an aggregation node, which combines and compresses its inputs. This requires an upfront upsampling of the lower-resolution feature maps. The compression is achieved by ensuring that the output channel size matches the channel size of a single input. The aggregation nodes are stacked in a tree-like fashion, and each iterative deep aggregation path, which replaces a conventional skip connection, aggregates features from shallow to deep.

**Figure 2.12:** Structure of Deep Layer Aggregation (DLA), illustration based on [Yu18].

### **2.2.2 Panoptic Segmentation**

For a long time, semantic and instance segmentation have been approached individually and were considered separate tasks. Kirillov et al. [Kir19] proposed panoptic segmentation, which unifies both tasks and requires a semantic *and* instance label for every pixel. Following the categorization for instance segmentation methods, approaches can generally be grouped into two categories. First, top-down or proposal-based methods rely on parallel semantic and object detection branches, where the latter predicts bounding boxes, which are further refined with the semantic segmentation to instance masks. Second, bottom-up or proposal-free approaches cluster pixels based on pixelwise instance embeddings, such as predicted features, relative positions, or semantic segmentation. This thesis uses the terms top-down and bottom-up because some methods generate instance proposals by clustering instance embeddings. In this case, the term proposal-free would be misleading. Nevertheless, these methods are considered bottom-up since they generate instances by clustering pixelwise embeddings instead of deploying a detection network.

### **Top-Down Approaches**

Current state-of-the-art methods [Por19, Xio19, Moh20] mostly rely on Mask R-CNN [He17] because it predicts not only bounding boxes but also object masks. Additionally, it uses a Feature Pyramid Network (FPN) [Lin17] to better recognize objects at multiple scales. Similar to U-Net, and novel for the object detection task, FPNs upsample feature maps again after the feature extractor with the help of skip connections. However, predictions are not only performed on the last feature maps but are made independently for all feature map scales to improve multi scale object recognition, depicted in Fig. 2.13. UPSNet [Xio19] proposes a panoptic architecture based on Mask R-CNN with an additional semantic segmentation branch relying on deformable convolutions [Dai17b]. A parameter-free panoptic head resolves class conflicts between semantic and instance predictions and introduces a dedicated unknown class for non-resolvable conflicts. Seamless [Por19] also uses an architecture similar to Mask R-CNN and relies on a ResNet backbone enhanced with an FPN. A novel semantic head exploits the multi scale features by applying individual ASPP modules to each scale to improve the aggregated information for semantic segmentation. EfficientPS [Moh20] deploys an EfficientNet [Tan19] followed by a novel 2-way FPN. The latter improves multi scale feature aggregation since aggregation is not only performed from low to high resolution but also vice versa. An enhanced semantic head captures fine details and long-range context more effectively.

**Figure 2.13:** Feature Pyramid Networks (FPNs) provide predictions based on multi scale feature maps to improve object recognition.

#### **Bottom-Up Approaches**

One of the first proposal-free panoptic approaches was DeeperLab [Yan19c]. It relies on a keypoint-based representation of instances based on the four bounding box corners and its center. The instance branch predicts a heatmap for these keypoints and multiple short- to long-range offset maps, which are the foundation of the instance clustering. Single-Shot instance segmentation with Affinity Pyramids (SSAP) [Gao19] predicts a pixel-pair affinity pyramid, determining the probability that two neighboring pixels belong to the same instance. The instance clustering is computed by an efficient graph partitioning module based on affinity and semantics. Panoptic-DeepLab [Che20] proposes a clustering strategy built upon center and offset regression, illustrated in Fig. 2.14. It comprises a shared backbone followed by a dual ASPP and dual decoder setup for independent semantic and instance branches. The latter predicts a center heatmap indicating the position of instances, which are represented by their centers. The second prediction, the offset vectors, points for every pixel to its corresponding center. The class-agnostic clustering is performed based on these two predictions. It extracts the $N_c$ centers with the highest score as instance candidates from the heatmap and assigns thing pixels to the closest offset-indicated center candidate, illustrated in Fig. 2.14. Center candidates without assigned pixels are discarded.

**Figure 2.14:** Panoptic-DeepLab clusters instances based on a center heatmap and offset vectors. The offset vectors are converted to an angle for visualization by a color wheel. The clustering assigns pixels to the center closest to the offset position, and each center represents a unique instance.
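The class-agnostic clustering can be sketched as follows; this minimal NumPy version keeps only the top-scoring heatmap positions as centers and omits non-maximum suppression, score thresholds, and the removal of centers without assigned pixels, so it only illustrates the core assignment step.

```python
import numpy as np

def cluster_instances(center_heatmap, offsets, thing_mask, num_candidates=50):
    """Assign every thing pixel to the instance center closest to its
    offset-shifted position. Inputs: center_heatmap (H, W), offsets (H, W, 2)
    with (dy, dx) pointing towards the center, thing_mask (H, W) boolean."""
    h, w = center_heatmap.shape
    top = np.argsort(center_heatmap.ravel())[::-1][:num_candidates]  # N_c best candidates
    centers = np.stack(np.unravel_index(top, (h, w)), axis=1)        # (N_c, 2) as (y, x)

    ys, xs = np.nonzero(thing_mask)
    shifted = np.stack([ys, xs], axis=1) + offsets[ys, xs]           # pixel position + offset
    # distance of every shifted thing pixel to every center candidate
    dists = np.linalg.norm(shifted[:, None, :] - centers[None, :, :], axis=2)
    assignment = np.argmin(dists, axis=1)

    instance_map = np.zeros((h, w), dtype=np.int32)                  # 0 = no instance
    instance_map[ys, xs] = assignment + 1
    return instance_map, centers

# toy example: one blob whose offsets all point to the pixel (5, 5)
heat = np.zeros((16, 16)); heat[5, 5] = 1.0
mask = np.zeros((16, 16), dtype=bool); mask[3:8, 3:8] = True
ys, xs = np.nonzero(mask)
off = np.zeros((16, 16, 2)); off[ys, xs] = np.stack([5 - ys, 5 - xs], axis=1)
inst, _ = cluster_instances(heat, off, mask, num_candidates=1)
print(np.unique(inst[mask]))                                         # a single instance id: [1]
```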

## **2.3 3D Scene Understanding**

Robots and autonomous vehicles operate in a 3D environment, which makes 3D geometric information provided by 3D point clouds highly valuable. Common sources of point clouds are RGB-D cameras and light detection and ranging (lidar) sensors. The former are usually used in indoor scenarios [Sil12, Son15, Arm16, Dai17a], while lidar sensors are used outdoors [Hac17, Roy18, Beh19, Cae20, Pan20a, Xia21, Kur21]. The basic measurement principle of a lidar sensor is the emission of a laser pulse at time $t_0$, which is reflected when hitting an obstacle. The sensor's detector recognizes this reflection at time $t_1$, and the distance $r$ to the obstacle can be computed based on the time of flight and the speed of light $c_0$:

$$r = \frac{1}{2} \cdot (t\_1 - t\_0) \cdot c\_0. \tag{2.12}$$

Polar and azimuth angles $(\theta, \phi)$ specify the laser direction for these measurements and are an intrinsic property of the sensor, determined by design or sensor rotation. The result is a measured 3D position in spherical coordinates $\widetilde{\mathbf{p}} = (r, \theta, \phi)^{\mathsf{T}}$, which can be transformed into Cartesian coordinates

$$\mathbf{p} = \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \\ \mathbf{z} \end{bmatrix} = \begin{bmatrix} r\cos(\phi)\sin(\theta) \\ r\sin(\phi)\sin(\theta) \\ r\cos(\theta) \end{bmatrix}. \tag{2.13}$$

Repeating this measurement process thousands or even millions of times provides a point cloud of the sensor's surroundings, which can be represented as a set $\mathcal{P} = \{\mathbf{p}_n \mid 1 \le n \le N\}$ or a matrix $\mathbf{P} = [\mathbf{p}_1\ \mathbf{p}_2\ \cdots\ \mathbf{p}_N]^{\mathsf{T}} \in \mathbb{R}^{N \times 3}$. One possible and frequently used sensor setup in current outdoor datasets is a vertical stack of laser emitters and detectors spinning around the vertical axis, see Fig. 2.15. Compared to point clouds recorded with RGB-D cameras in indoor environments, outdoor point clouds from a lidar usually cover a much larger area, are relatively sparser, and have a point density strongly varying with distance. These properties impose additional challenges on approaches for 3D scene understanding.

**Figure 2.15:** Common setup of a lidar sensor with pairwise vertically stacked laser emitters and detectors spinning around the vertical axis.
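Equations (2.12) and (2.13) translate directly into the following NumPy sketch; the example pulse timing is an arbitrary illustrative value.

```python
import numpy as np

C_0 = 299_792_458.0     # speed of light in m/s

def range_from_time_of_flight(t_0, t_1):
    """Distance r of Eq. (2.12) from pulse emission time t_0 and detection time t_1."""
    return 0.5 * (t_1 - t_0) * C_0

def spherical_to_cartesian(r, theta, phi):
    """Eq. (2.13): convert a measurement (r, theta, phi) into a Cartesian point p."""
    return np.array([
        r * np.cos(phi) * np.sin(theta),
        r * np.sin(phi) * np.sin(theta),
        r * np.cos(theta),
    ])

# a reflection detected 200 ns after emission corresponds to roughly 30 m range
r = range_from_time_of_flight(0.0, 200e-9)
print(r)                                                      # ~29.98 m
print(spherical_to_cartesian(r, theta=np.pi / 2, phi=0.0))    # point on the x-axis
```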

Driven by the value and importance of 3D information for scene understanding, approaches based on deep learning for various related tasks emerged, building upon the huge success of CNNs in the image domain. However, one of the main challenges is the unstructured nature of point cloud data, which CNNs cannot directly process. Therefore, a significant research effort is put into developing suitable representations across various tasks to enable the efficient processing of point clouds by CNNs. These are discussed in the following Section 2.3.1. Early approaches [Mat15, Wan15, Qi16] mainly tackled the tasks of 3D object classification, retrieval, or detection. Soon after, pointwise tasks like semantic or instance segmentation followed [Qi17a, Tch17, Wan18b, Mil19]. More recently, the combined task of panoptic segmentation gained more and more attention [Zho21, Hon21, Sir22]. Sections 2.3.2 to 2.3.4 present a detailed overview of the current state-of-the-art for these pointwise 3D tasks.

### **2.3.1 Point Cloud Representations**

Unlike images, point clouds cannot be processed with native 2D or 3D convolutions. Therefore, and independently of the task, a point cloud representation is required, which allows the processing with established CNN architectures. Alternatively, point-based approaches [Qi17a, Tho19] propose adapted convolution operations and architectures directly applicable to point clouds in **point view (PV)**, see Fig. 2.16, without requiring a preliminary transformation. One major advantage is that no transformation-induced loss of information occurs. However, no neighborhood relations are inherently represented in an unordered set of points but instead must be explicitly computed. Also, the usual hierarchical aggregation of local context based on consecutively subsampled grid-shaped feature maps must be explicitly formulated and computed. Both operations are potentially expensive, especially for large scale point clouds. Hence, different representations based on regular grids have been derived to enable the application of conventional CNNs, which are subsequently called *views*. Motivated by the discussed polar nature of lidar measurements, these views can also represent point clouds in spherical or cylindrical coordinates. If required, Cartesian coordinates can be transformed into spherical coordinates $\widetilde{\mathbf{p}}$

$$\mathcal{Q}^{\mathrm{S}}(\mathbf{p}) = \begin{bmatrix} \sqrt{x^2 + y^2 + z^2} \\ \arccos\left(\frac{z}{\sqrt{x^2 + y^2 + z^2}}\right) \\ \operatorname{atan2}\left(y, x\right) \end{bmatrix} = \begin{bmatrix} r \\ \theta \\ \phi \end{bmatrix} = \widetilde{\mathbf{p}},\tag{2.14}$$

or cylindrical coordinates $\widetilde{\mathbf{p}}^{\mathrm{Z}}$:

$$\mathcal{Q}^{\mathrm{Z}}\left(\mathbf{p}\right) = \begin{bmatrix} \sqrt{x^2 + y^2} \\ \operatorname{atan2}\left(y, x\right) \\ z \end{bmatrix} = \begin{bmatrix} r \\ \phi \\ z \end{bmatrix} = \widetilde{\mathbf{p}}^{\mathrm{Z}}.\tag{2.15}$$
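For reference, Eqs. (2.14) and (2.15) can be vectorized over an entire point cloud as in the following NumPy sketch; the array layout and function names are illustrative.

```python
import numpy as np

def cartesian_to_spherical(points):
    """Eq. (2.14) applied to an (N, 3) array of Cartesian points, returning (r, theta, phi)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    theta = np.arccos(z / r)
    phi = np.arctan2(y, x)
    return np.stack([r, theta, phi], axis=1)

def cartesian_to_cylindrical(points):
    """Eq. (2.15): returns (r, phi, z) with the radius measured in the xy-plane."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([np.sqrt(x**2 + y**2), np.arctan2(y, x), z], axis=1)

pts = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]])
print(cartesian_to_spherical(pts))
print(cartesian_to_cylindrical(pts))
```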


**Figure 2.16:** Point clouds in their native point-based representation. Visualized are the distance channel of the point cloud, as well as semantic and instance labels.

Without loss of generality, the following discretizations and projections are considered for a lidar sensor with a vertical and horizontal field of view defined as $\theta_{\text{fov}} = \theta_{\text{down}} - \theta_{\text{up}}$ and $\phi_{\text{fov}} = \phi_{\text{max}} - \phi_{\text{min}}$, whose measurements lie in a distance and height interval of $r_{\text{fov}} = r_{\text{max}} - r_{\text{min}}$ and $z_{\text{fov}} = z_{\text{max}} - z_{\text{min}}$. To transform the predictions from the individual views back to the point cloud, every 3D point receives the prediction of its corresponding grid cell.

#### **Voxel View**

Motivated by the 3D nature of point clouds, a straightforward discretization into a grid uses Cartesian or cylindrical voxels, the building blocks of the voxel view (VX). The discretized voxel coordinates of cylindrical 3D points for a voxel grid of size $H \times W \times D$ can be computed by [Zhu21b]:

$$\mathcal{P}^{\mathrm{VX}}(\widetilde{\mathbf{p}}^{\mathrm{Z}}) = \begin{bmatrix} \left\lfloor (r - r_{\mathrm{min}}) \cdot r_{\mathrm{fov}}^{-1} \cdot H \right\rfloor \\ \left\lfloor (\phi - \phi_{\mathrm{min}}) \cdot \phi_{\mathrm{fov}}^{-1} \cdot W \right\rfloor \\ \left\lfloor (z - z_{\mathrm{min}}) \cdot z_{\mathrm{fov}}^{-1} \cdot D \right\rfloor \end{bmatrix} = \begin{bmatrix} u \\ v \\ w \end{bmatrix} = \mathbf{u}^{\mathrm{VX}}.\tag{2.16}$$

The $x$, $y$, and $z$ coordinates of the Cartesian representation can be discretized accordingly. Point clouds and their input features can be transformed into a voxel grid based on these coordinates. In general, the transformation suffers from the many-to-one problem, meaning that multiple 3D points lie inside one voxel. Therefore, a handcrafted or learned fixed-sized feature vector is required, also called encoding, which represents an arbitrary number of points.
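A minimal NumPy sketch of the discretization in Eq. (2.16) is given below; clipping to the grid bounds is an added practical detail, and the grid size and value ranges are arbitrary example values.

```python
import numpy as np

def voxel_coordinates(points_cyl, grid_size, r_range, phi_range, z_range):
    """Eq. (2.16): discretize cylindrical points (r, phi, z) into voxel indices
    (u, v, w) for a grid of size H x W x D; indices are clipped to the grid."""
    H, W, D = grid_size
    r, phi, z = points_cyl[:, 0], points_cyl[:, 1], points_cyl[:, 2]
    u = np.floor((r - r_range[0]) / (r_range[1] - r_range[0]) * H)
    v = np.floor((phi - phi_range[0]) / (phi_range[1] - phi_range[0]) * W)
    w = np.floor((z - z_range[0]) / (z_range[1] - z_range[0]) * D)
    coords = np.stack([u, v, w], axis=1).astype(np.int64)
    return np.clip(coords, 0, np.array(grid_size) - 1)

# three points discretized into a coarse 10 x 8 x 4 cylindrical grid
pts = np.array([[1.0, 0.1, -1.0], [25.0, 3.0, 0.5], [49.0, -3.0, 1.9]])
print(voxel_coordinates(pts, (10, 8, 4), r_range=(0.0, 50.0),
                        phi_range=(-np.pi, np.pi), z_range=(-2.0, 2.0)))
```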

The voxel view's advantages are that it retains the 3D structure of the data and inherently contains 3D neighborhoods. On the other hand, the major drawback of the dense voxel view, alongside the introduced quantization error, is the explicit representation of empty space, resulting in high memory and computational demands. At the same time, the sparsity of point clouds results in predominantly empty voxels. Sparse convolutions [Gra15] have been proposed to speed up computation and reduce the memory footprint for inputs with predominantly empty voxels. Therefore, only non-empty voxels are represented, and convolutions are only applied to these cells. One drawback of this implementation is the reduction of sparsity after each layer. The resulting voxel of a convolutional or pooling operation is only empty if all its inputs within the according receptive field are empty. Submanifold Sparse Convolutional Networks [Gra18] address this by restricting the output of these layers to the initially non-empty voxels, keeping the sparsity and the full benefit of sparse convolutions. While sparse convolutions significantly reduce memory and runtime overhead and enable the voxel view for large scale point clouds, neighborhoods are no longer implicitly represented. Nevertheless, the gains for omitting empty space outweigh the additionally required computations as long as the grids have less than 10% occupied cells [Gra15].

#### **Range View**

The range view (RV) is a 2D representation resulting from a spherical projection [Mil19]. It is closely connected to lidar sensors since it exploits the spherical representation of 3D points, which is directly provided by many lidar sensors. For a range image of size $H \times W$, the 2D projection coordinates of a spherical point are defined by:

$$\mathcal{P}^{\rm RV}(\widetilde{\mathbf{p}}) = \begin{bmatrix} \left\lfloor (\theta - \theta_{\rm up}) \cdot \theta_{\rm fov}^{-1} \cdot H \right\rfloor \\ \left\lfloor (\phi - \phi_{\rm min}) \cdot \phi_{\rm fov}^{-1} \cdot W \right\rfloor \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix} = \mathbf{u}^{\rm RV}.\tag{2.17}$$

The point cloud and associated features, such as intensity, as well as the ground truth, can then be transformed into the range view based on these 2D coordinates, as depicted in Fig. 2.17. The main advantage of the range view is its dense 2D representation, which allows for very efficient processing. Additionally, it does not depend on the covered area because the range image size is independent of the covered distance range. Its disadvantages are the distortion of physical dimensions due to the spherical projection and the fact that adjacent pixels can correspond to points with a significant difference in distance and 3D position. Furthermore, a combined point cloud from multiple overlapping lidar sensors introduces the many-to-one problem.
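A minimal numpy sketch of the projection in Eq. (2.17) and of a simple range image construction follows. The vertical field of view, the image size, and the closest-point tie-breaking rule are illustrative assumptions and depend on the actual sensor.

```python
import numpy as np

def range_view_projection(points_xyz, H=64, W=2048,
                          theta_up=np.deg2rad(3.0), theta_down=np.deg2rad(-25.0)):
    """Spherical projection of Eq. (2.17); field-of-view values are illustrative."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    dist = np.linalg.norm(points_xyz, axis=1)
    theta = np.arcsin(z / dist)              # elevation angle
    phi = np.arctan2(y, x)                   # azimuth angle
    theta_fov = theta_down - theta_up        # negative, measured downwards
    u = np.floor((theta - theta_up) / theta_fov * H).astype(int)
    v = np.floor((phi - (-np.pi)) / (2 * np.pi) * W).astype(int)
    return np.clip(u, 0, H - 1), np.clip(v, 0, W - 1), dist

def build_range_image(points_xyz, H=64, W=2048):
    """Fill a dense range image; with several points per cell (many-to-one),
    the closest point wins here as a simple tie-breaking rule."""
    u, v, dist = range_view_projection(points_xyz, H, W)
    image = np.full((H, W), np.inf)
    order = np.argsort(-dist)                # write far points first, near points overwrite
    image[u[order], v[order]] = dist[order]
    image[np.isinf(image)] = 0.0             # empty cells
    return image
```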

**(a)** Distance channel. **(b)** Intensity channel. **(c)** Semantic segmentation. **(d)** Center heatmap. **(e)** Offset vectors. **(f)** Instance segmentation.

**Figure 2.17:** Point clouds represented in range view. The upper images show the distance (a) and intensity (b) measurements, followed by the ground truth semantic segmentation (c). Images (d) and (e) depict the ground truth center heatmap and offset vectors required for bottom-up instance clustering to predict an instance segmentation (f).

#### **Bird's Eye View**

The bird's eye view (BEV) projection omits the $z$-axis to project the 3D point clouds onto the $xy$-plane. Alternatively, a projection based on cylindrical coordinates onto the $r\phi$-plane is also possible [Zha20c]. The 2D image coordinates for a polar bird's eye view image of size $H \times W$ are computed by:

$$\mathcal{P}^{\rm BEV}(\widetilde{\mathbf{p}}^{\mathbb{Z}}) = \begin{bmatrix} \left\lfloor (r - r_{\rm min}) \cdot r_{\rm fov}^{-1} \cdot H \right\rfloor \\ \left\lfloor (\phi - \phi_{\rm min}) \cdot \phi_{\rm fov}^{-1} \cdot W \right\rfloor \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix} = \mathbf{u}^{\rm BEV}.\tag{2.18}$$

The results of this projection are depicted in Fig. 2.18. Similar to the voxel view, the bird's eye view has to deal with the many-to-one problem by computing handcrafted or learned encodings based on the points inside each cell. It is similarly efficient to the range view, while the spatial separation of individual instances is superior since there are rarely occlusions along the $z$-axis for common thing classes. This is a valuable property for the clustering of bottom-up panoptic segmentation. However, the bird's eye view is not particularly dense, with more than half of the cells being empty for standard lidar sensors. Additionally, small vertical objects are improperly represented.

### **2.3.2 Semantic Segmentation**

After the first approaches [Mat15, Wan15, Qi16] predominantly tackled object classification and detection, the pioneering PointNet [Qi17a] also addressed semantic segmentation. However, the first 3D semantic segmentation methods rarely scale to large scale outdoor scenarios due to the mentioned challenges of large covered areas, increased sparsity, and varying point density. With the rise of outdoor datasets [Hac17, Roy18, Beh19, Cae20, Pan20a, Xia21, Kur21], a significant amount of research effort shifted towards these scenarios, which originate mainly from the driving domain. Approaches have been proposed based on the different point cloud views discussed in the previous section, which are also combined into multi view approaches.

**(a)** Occupied cells. **(b)** Semantic segmentation. **(c)** Center heatmap. **(d)** Offset vectors. **(e)** Instance segmentation.

**Figure 2.18:** The polar bird's eye view. The first image (a) shows the occupied cells, followed by the semantic segmentation (b). Bottom-up instance clustering requires center heatmap (c) and offset vectors (d) to predict the instance segmentation (e).

#### **Point-based Approaches**

Point-based approaches directly process raw point clouds in the point view without preceding transformation. New network architectures and redefined convolution operations have been proposed for this purpose, which can be roughly assigned to different categories, loosely following the survey of Guo et al. [Guo21]. The most influential categories are architectures based on the **pointwise application of MLPs** and **point convolutions** directly applicable to 3D points. Graph-based methods [Lan18, Lan19a], the third category, do not play a notable role in current state-of-the-art outdoor semantic segmentation. The interested reader is referred to [Guo21].

The pioneering approach of directly processing raw point clouds and **pointwise application of MLPs** was PointNet [Qi17a]. It repeatedly applies shared MLPs to every input point to compute individual feature vectors. Hence, these MLPs are called pointwise Multi-Layer Perceptrons (pMLPs) and are followed by a symmetric aggregation function for global feature aggregation, such as max pooling, illustrated in Fig. 2.19. The symmetric property ensures that the order of points does not influence the results since point clouds are unordered sets of points. For semantic segmentation, the global output feature vector is concatenated with the local feature vectors and further processed by pMLPs to predict the semantic labels based on local and global information. While PointNet has a low computational complexity, a single global feature aggregation strongly limits the capturing of hierarchical spatial relations, which are important for semantic segmentation. Its successor PointNet++ [Qi17b] tackles these shortcomings by applying individual PointNets to local regions in a hierarchical fashion. A sampling layer determines the region centers based on iterative Farthest Point Sampling (FPS), and a grouping layer selects points from the center's neighborhood to form local regions. Subsequent approaches propose new and improved neighborhood aggregation strategies to capture context hierarchically. The proposed strategies are inspired by Scale Invariant Feature Transform (SIFT) [Jia18], build upon concentric spherical shells [Zha19a], are based on densely connected local webs [Zha19b], or combine geometric and feature neighborhoods [Eng19]. Other approaches deploy attention [Yan19b, Hu20b] or RNNs [Eng17, Ye18] to exploit relations between points for context aggregation.

**Figure 2.19:** The PointNet architecture applies layer-wise shared MLPs followed by a symmetric operation, such as channel-wise maximum.
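The pattern of Fig. 2.19 can be summarized in a few lines of PyTorch. The layer widths, input features, and class count below are illustrative and do not correspond to the original PointNet configuration.

```python
import torch
import torch.nn as nn

class TinyPointNetSeg(nn.Module):
    """Pointwise shared MLPs, symmetric max pooling, and concatenation of
    local and global features for per-point semantic logits."""
    def __init__(self, in_dim=4, num_classes=20):
        super().__init__()
        self.local_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(128 + 128, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, points):                   # points: (N, in_dim), order-independent
        local = self.local_mlp(points)            # (N, 128) pointwise features
        global_feat = local.max(dim=0).values     # (128,) symmetric aggregation
        fused = torch.cat([local, global_feat.expand_as(local)], dim=1)
        return self.head(fused)                   # (N, num_classes) per-point logits

logits = TinyPointNetSeg()(torch.randn(1000, 4))  # e.g. x, y, z, intensity per point
```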

Many of the approaches mentioned so far face difficulties in scaling to outdoor scenarios, either in terms of computational complexity or quality of the results. The computational complexity is mainly impacted by FPS and $k$-nearest neighbor (kNN) search, with up to 57% of the overall runtime spent on data structuring [Liu19b]. The mediocre results are caused by the fact that local and global context aggregation is more challenging for outdoor point clouds, as discussed at the beginning of Section 2.3. More recent approaches started to address this challenge to improve the results. PointASNL [Yan20] refines the region centers sampled with FPS based on learned shifts and introduces a local-nonlocal module to improve the capturing of local and long-range context. Another strategy is multi task learning [Una21] with the added prediction of 3D objects. However, none of these approaches solves the high runtime demands. RandLA-Net [Hu20b] addresses both shortcomings by simultaneously improving the quality and efficiency of the neighborhood feature aggregation. It replaces the expensive FPS with random sampling. In addition, a novel attentive local feature aggregation module improves local context aggregation. Another approach is presented by Qiu et al. [Qiu21] and comprises a bilateral context and adaptive fusion module. The former augments pointwise features with explicit geometric information provided by the point cloud at different resolutions. The latter adaptively fuses multi resolution features to provide enhanced features for 3D semantic segmentation. Both approaches significantly reduce runtime and considerably improve the results.
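The runtime gap between the two sampling strategies becomes obvious in a direct comparison of minimal implementations. Both functions below are simplified sketches, not the implementations of the cited works.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iterative FPS: O(n*m) distance updates, which dominates the runtime of
    many point-based backbones on large outdoor clouds."""
    n = points.shape[0]
    selected = [np.random.randint(n)]
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(np.argmax(dist)))    # next center: farthest remaining point
    return np.array(selected)

def random_sampling(points, m):
    """RandLA-Net-style replacement: O(m), independent of the cloud size."""
    return np.random.choice(points.shape[0], size=m, replace=False)
```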

Approaches belonging to the **point convolution** category adapt the convolution operation for point clouds. They are frequently used for 3D shape classification [Guo21] but also for indoor semantic segmentation, such as convolution on $\mathcal{X}$-transformed features [Li18], PointConv [Wu19b], and dilated point convolutions [Eng20b]. However, only a few approaches of this category are designed for large scale point clouds with tens of thousands of points [Wan18a, Tho19, Bou20]. Parametric continuous convolution [Wan18a] is a learnable operator based on parameterized kernel functions which are approximated by an MLP. Kernel point convolutions (KPConv) [Tho19] use a flexible number of continuous and learnable locations in Euclidean space as kernel points. This property considerably increases the flexibility over fixed grid convolutions because these so-called deformable convolutions learn to adapt their kernel to local geometry. In addition, a regular subsampling strategy ensures increased robustness to varying point densities.

Despite the achieved improvements [Tho19, Hu20b, Qiu21] in the outdoor domain, point-based approaches still suffer from mediocre segmentation results [Hu20b] or high computational complexity [Tho19, Qiu21]. Hierarchical context aggregation remains more effective and more efficient in structured grid representations, which is reflected in a lower computational complexity while achieving predominantly better segmentation results.

### **Projection-based Approaches**

Aiming for the application of established and efficient 2D CNNs, projection-based approaches project 3D point clouds onto 2D subspaces. Commonly deployed methods are the spherical or bird's eye view projection presented in Section 2.3.1. However, more complex projections also exist, such as virtual tangent planes [Tat18].

Approaches based on the **range view** predominantly build upon existing 2D architectures from the image domain with novel extensions or adaptations targeted for processing the projected 3D point clouds. SqueezeSegV1 [Wu18] was one of the first methods relying on the range view as input representation for segmenting the road-object classes car, pedestrian, and cyclist. Its backbone is based on SqueezeNet [Ian16] and followed by a Conditional Random Field (CRF) to refine the road-object segmentation. The enhanced version SqueezeSegV2 [Wu19a] has an improved model structure and exploits synthetic data combined with unsupervised domain adaptation to reduce domain shift. The most recent version SqueezeSegV3 [Xu20] predicts semantic segmentation and introduces spatially-adaptive convolutions to counteract the spatially-varying feature distribution in range images. RangeNet++ [Mil19] exploits the DarkNet architecture [Red18] and presents a back-projection of range view labels to the point cloud based on the nearest neighbors. Measurement uncertainties and ego motion during the continuous scan can introduce projection errors with multiple points projected onto the same cell. The most basic strategy simply assigns the same label to all points of one cell. In contrast, RangeNet++ additionally considers the neighborhood of a point to reduce the impact of these projection errors on the segmentation. Instead of an expensive $k$-nearest neighbor (kNN) search, the 5 × 5-neighborhood in range view is used as a proxy. Weighted majority voting based on the differences in radial distance determines the label for every 3D point. SalsaNext [Cor20] is an enhanced version of SalsaNet [Aks20] and introduces a set of improvements, such as the use of the Lovász-Softmax loss [Ber18] and the replacement of transposed convolution layers with pixel-shuffle layers [Shi16]. While primarily designed for 3D object detection, one of LaserNet's [Mey19b] intermediate results is a semantic segmentation of the range view input. Their architecture is based on the previously introduced DLA and also inspired the range view backbone of this thesis. LiteHDSeg [Raz21b] proposes harmonic dense convolutions and an improved global contextual module to capture multi scale context. A multi class Spatial Propagation Network (MCSPN) tackles the refinement of semantic boundaries. A parameter-free full interpolation decoding module is proposed by FIDNet [Zha21c], which is based on bilinear interpolation, as a more efficient upsampling alternative compared to transposed convolutions.
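A simplified sketch of such a neighborhood-based back-projection follows, assuming precomputed range-view labels, per-point projection coordinates, and a Gaussian weighting of the label votes; the exact weighting used by RangeNet++ may differ.

```python
import numpy as np

def backproject_labels(range_image, label_image, point_uv, point_dist, k=5, sigma=1.0):
    """For every 3D point, gather the 5x5 range-view neighborhood of its cell,
    keep the k neighbors closest in radial distance, and vote with Gaussian weights."""
    H, W = range_image.shape
    labels = np.zeros(len(point_uv), dtype=int)
    for i, (u, v) in enumerate(point_uv):
        us = np.clip(np.arange(u - 2, u + 3), 0, H - 1)
        vs = np.clip(np.arange(v - 2, v + 3), 0, W - 1)
        uu, vv = np.meshgrid(us, vs, indexing="ij")
        neigh_range = range_image[uu, vv].ravel()
        neigh_label = label_image[uu, vv].ravel()
        diff = np.abs(neigh_range - point_dist[i])       # difference in radial distance
        nearest = np.argsort(diff)[:k]                   # k most similar neighbors
        weights = np.exp(-0.5 * (diff[nearest] / sigma) ** 2)
        votes = np.bincount(neigh_label[nearest], weights=weights)
        labels[i] = int(np.argmax(votes))                # weighted majority vote
    return labels
```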

The **bird's eye view** was originally proposed and is widely used for the task of 3D object detection. Early approaches [Che17b, Yan18, Sim19] relied on a handcrafted feature encoding for each cell, which was subsequently replaced by learned encodings. PointPillars [Lan19b], influenced by VoxelNet [Zho18], proposes a PointNet-based bird's eye view encoding, which is also beneficial for semantic segmentation. A PointNet is applied to all 3D points inside one cell and maps them to a fixed-size feature vector, illustrated in Fig. 2.20. As a result, the input to the 2D backbone is a learned feature encoding of the projected point cloud.

Looking at semantic segmentation, Zhang et al. [Zha18] use a Cartesian bird's eye view and a handcrafted feature encoding, which concatenates all points inside one cell. PolarNet [Zha20c] builds upon the ideas of PointPillars and relies on a learned polar bird's eye view encoding based on PointNet. They empirically show that a polar grid better matches the point distribution of lidar sensors than a Cartesian grid and leads to fewer empty cells. Both approaches apply a U-Net to compute the required semantic segmentation.

**Figure 2.20:** A PointNet is applied to every bird's eye view cell to learn an embedding vector.
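A compact PyTorch sketch of the per-cell learned encoding from Fig. 2.20 is given below. It assumes a flattened cell index per point and a recent PyTorch version that provides `scatter_reduce`; the MLP widths and grid size are illustrative.

```python
import torch
import torch.nn as nn

def pillar_encoding(points, cell_idx, num_cells, mlp):
    """PointPillars-style learned cell encoding: a shared MLP on every point,
    then a per-cell max over all points that fall into the same BEV cell."""
    feats = mlp(points)                                   # (N, C) pointwise features
    out = torch.full((num_cells, feats.shape[1]), float("-inf"))
    out = out.scatter_reduce(0, cell_idx[:, None].expand_as(feats), feats,
                             reduce="amax", include_self=True)
    out[out == float("-inf")] = 0.0                       # empty cells get zeros
    return out

mlp = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 64))
points = torch.randn(1000, 4)                             # e.g. x, y, z, intensity
cell_idx = torch.randint(0, 512 * 512, (1000,))           # flattened H*W cell index per point
bev_features = pillar_encoding(points, cell_idx, 512 * 512, mlp).view(512, 512, 64)
```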

As a result of the 2D projections, range and bird's eye view are the most efficient representations among the presented ones. In a direct comparison, range view methods achieve predominantly better results. While projection-based approaches outperform point-based methods in terms of segmentation quality and computational complexity, they cannot entirely compete with the current state-of-the-art segmentation results achieved in the voxel view. The projection induces either a significant amount of information loss due to the many-to-one problem or places points next to each other that are far apart in 3D. The latter increases the challenge of separating class boundaries since the implicit spatial separation along the distance or height axis is omitted.

#### **Discretization-based Approaches**

Most of the early semantic segmentation approaches [Dai17a, Tch17, Ret18], which relied on the **dense voxel view** as input representation, focused on indoor scenes, where this view's runtime and memory downsides are less severe. SEGCloud [Tch17] deploys a 3D FCN followed by a trilinear interpolation and 3D CRF. The interpolation converts coarse voxel-level predictions back to the 3D points, and the CRF ensures global consistency and provides refined semantic segmentation. The Fully Convolutional Point Network [Ret18] applies a PointNet to the points in uniformly sampled regions to create a 3D feature map, which is processed by a 3D CNN. Nearest neighbor interpolation converts the voxel features back to the 3D points. PointLabeling [Hua16] was the only early approach for outdoor semantic segmentation. The point cloud is converted into a 3D occupancy grid and fed to a 3D CNN to predict voxel-level semantics. Afterwards, all points inside one cell receive the semantic label of this cell. The reported runtime of several minutes for an area of 100 m × 100 m shows that, despite a huge voxel size of 0.3 m × 0.3 m × 0.3 m, the dense voxel view cannot deal with large scale outdoor point clouds. This is also supported by the observation that representing at least 90% of a point cloud's points in unique cells requires about 82.6GB of memory when training with a batch size of 16 [Liu19b], even for compact indoor scenes.

**Sparse convolutions** enable the usage of the voxel view for outdoor scenarios and are frequently used by recent methods. These approaches propose new learnable modules to improve the segmentation quality and rely on a 3D U-Net architecture. S3Net [Che21c] introduces a sparse intra- and inter-channel attention module. The former addresses the local information loss caused by the discretization and usage of sparse convolutions, and the latter re-weights the channels of a feature map to learn better representations. JS3C-Net [Yan21] uses semantic scene completion as a supervisory signal to learn contextual shape priors from the dense aggregation of multiple point clouds. An improved sparse architecture based on attentive feature fusion and adaptive feature selection is presented by Cheng et al. [Che21d]. Attentive feature fusion deploys small, medium, and large kernels in parallel. The small kernels focus on fine details and small semantic classes, while the large kernels aggregate global context and larger semantic classes. A learnable, weighted combination merges the features of these three branches. The second module, called adaptive feature selection, learns relations between channels across the three multi scale feature maps from the attentive feature fusion module to improve contextual information. A cylindrical [Zhu21b] instead of a Cartesian voxel view better matches the distribution of lidar point clouds, similar to the 2D polar bird's eye view. Zhu et al. [Zhu21b] further introduce an asymmetrical residual block to better match the point distribution and object shapes in point clouds of driving scenes. Additionally, context modeling based on dimension decomposition merges several low-rank feature tensors into the final high-rank tensor.

The sparse voxel view achieves high quality 3D semantic segmentation and outperforms all other views discussed so far. Despite the sparse nature, one downside is still the considerably higher computational complexity compared to 2D representations such as range and bird's eye view.

### **Multi View Approaches**

Considering the point cloud representations or views discussed so far, they all have individual strengths and weaknesses. Multi view approaches build upon the distinct properties of different views and exploit multiple representations to combine their strengths and compensate for individual weaknesses. These methods further improve the state-of-the-art or introduce efficient approaches with high quality results.

One category of multi view approaches builds upon voxel and point view to combine their complementary strengths. The main challenge of point-based approaches, the expensive sampling and forming of neighborhoods, can be omitted in this multi view setup since the voxel view implicitly provides 3D neighborhoods. One drawback of the voxel view, the loss of information during voxelization, is simultaneously counteracted by the point view, which directly processes the original point cloud. Motivated by this, Point-Voxel CNN (PVCNN) [Liu19b] deploys a low resolution voxel branch to extract coarse neighborhood context, combined with a high resolution point-based branch to provide individual point features. Sparse Point-Voxel CNN (SPVCNN) [Tan20] improves this further by replacing the dense voxel view with its sparse counterpart, which allows for higher voxel resolution and considerably improves the results. The combination of sparse voxel and point view can be further extended by the range view as a third branch [Xu21a]. Range and voxel features are repeatedly transformed back to the 3D points and fused across all three views by a gated fusion module. Afterwards, the fused multi view features are transformed back to the individual views and further processed by the individual backbones. FusionNet [Zha20a] proposes a VoxelMLP to exploit the voxel view for a fast neighborhood search to compute pointwise features using PointNet under consideration of a point's neighborhood. Its sparse voxel branch additionally computes voxel features, which are fused with the pointwise features from the VoxelMLP. DRINet [Ye21b] builds upon the alternating application of Sparse Point-Voxel and Sparse Voxel-Point Feature Extraction. The former receives pointwise features as input, aggregates context with multi scale pooling, and transforms the pointwise features into voxel features. Sparse Voxel-Point Feature Extraction processes voxel features with a 3D backbone, and a geometry-aware attentive gathering generates high quality pointwise features. The successor DRINet++ [Ye21a] further improves the architecture by treating voxels as points, motivated by the observation that voxels can be considered an abstraction of the points inside with their 3D position defined by their center. In general, these multi view approaches cannot completely mitigate the computational complexity drawback of the voxel view [Liu19b, Tan20, Zha20a, Xu21a] or have to use very small 3D networks [Ye21b].

Motivated by the distinct underlying projections, another popular combination is range and bird's eye view. Since the point clouds are projected onto complementary 2D planes, both views contribute distinct and valuable features. Additionally, this combination can be very efficient because the computational complexity of both views is relatively low. Early approaches proposed rather simple fusion strategies, e.g., Ali et al. [Ali21], which sum over the predictions from a range and bird's eye view network to get the final 3D semantic segmentation. However, they achieve only mediocre results, especially for a multi view approach. AMVNet [Lio21] proposes another late fusion strategy based on range view and polar bird's eye view predictions. The respective backbones compute the semantic predictions in both views, which are back-projected to the 3D points. If the predictions of both views disagree, a point-based network refines these uncertain points based on the range and bird's eye view features of the considered point and its nearest neighbors. Therefore, AMVNet requires an expensive nearest neighbor search and inherits one of the point view's weaknesses. TornadoNet [Ger21] combines both views consecutively instead of deploying parallel backbones and starts with a bird's eye view network, called pillar projection learning, to compute bird's eye view features. Afterwards, these features are transformed into range view, combined with the range view input data, and processed by the main backbone. The methods considered so far combine multi view features or predictions only once and cannot exploit the full potential of this view combination. Recent approaches propose advanced range and bird's eye view fusion strategies, like CPGNet [Li22c]. Pointwise features computed by a pMLP are projected into range and polar bird's eye view and processed by individual backbones. In the next step, the features are back-projected using bilinear interpolation and fused across both views by another pMLP. Afterwards, the entire process is repeated. GFNet [Qiu22] also deploys individual backbones for range and polar bird's eye view. In order to bidirectionally align and propagate complementary geometric information between both backbones and across views, a geometric flow module is proposed, which is applied after each upsampling step. Except for CPGNet [Li22c], these approaches are outperformed by 2D single view approaches [Ali21, Ger21] or still need expensive 3D operations [Lio21, Qiu22], leading to a computational complexity similar to single or multi view voxel-based approaches while achieving inferior results.

### **2.3.3 Instance Segmentation**

Similar to the 2D image domain, instance segmentation is an essential task for 3D scene understanding. Significant progress has been achieved over the last years, enabled by two datasets, ScanNetV2 [Dai17a] and S3DIS [Arm16], which provide pointwise semantic and instance labels for indoor scenes. As a result, most existing approaches tackle indoor instance segmentation and can be grouped into top-down and bottom-up methods, similar to the image domain. Another consequence of the indoor domain is the predominant focus on the point and voxel view.

Starting with top-down and **point-based** methods, Yi et al. [Yi19] introduce a Mask R-CNN inspired generative shape proposal network, which generates high quality 3D object proposals. Instances are predicted based on these proposals, supported by a semantic segmentation computed by a PointNet++. Based on pMLPs, 3D-BoNet [Yan19a] directly regresses 3D bounding boxes for all instances in a point cloud combined with the prediction of a pointwise mask. The result is a single-stage, anchor-free approach, which is end-to-end trainable.

The bottom-up approaches predominantly deploy a PointNet or PointNet++ as backbone to predict pointwise feature embeddings and focus on improving the clustering step. Similarity Group Proposal Network (SGPN) [Wan18b] computes a similarity matrix for all paired points based on the predicted embeddings. This matrix is used to generate an intermediate clustering, which is further heuristically refined and passed through Non-Maximum Suppression (NMS). Joint Semantic-Instance Segmentation (JSIS3D) [Pha19] applies a multi value CRF and incorporates the predicted semantics and embeddings for joint optimization to generate semantic and instance segmentation. Associatively Segmenting Instances and Semantics (ASIS) [Wan19a] and JSNet [Zha20d] fuse instance and semantic features before the final predictions to mutually improve these features and enhance predicted semantics and instance embeddings. Both approaches use mean shift clustering [Com02] to generate the instances. Zhang et al. [Zha21a] propose probabilistic embeddings and represent each 3D point as a tri-variate normal distribution. The core idea of the AS-Net [Jia20a] is to treat instance segmentation as candidate assignment problem. Candidates are selected based on pointwise features and represent different instances. An assignment module allocates points to the candidates, and a suppression module removes redundant candidates.

The top-down method 3D-SIS [Hou19] builds upon the dense **voxel view** to predict 3D bounding boxes with class labels. It is combined with a 3D mask network to predict a voxel-level instance mask. Bottom-up approaches predominantly rely on sparse U-Nets for feature extraction and, once more, focus on improving the clustering. The sparse backbone of 3D Multi Proposal Aggregation (3D-MPA) [Eng20a] predicts semantic labels and offset vectors, which the authors call object center votes. Proposal locations are then sampled from the predicted centers, and proposal features are learned by grouping and aggregating votes in the neighborhood of sampled centers. A graph neural network refines these features for a final proposal clustering. Liang et al. [Lia20] propose a structure-aware loss function, considering geometrical and embedding information to improve the learned 3D instance embeddings. A graph neural network refines the embeddings, which are finally clustered by mean shift. PointGroup [Jia20b] computes offset vectors pointing to the corresponding instance center instead of conventional embedding features. A dual clustering step based on original and offset positions produces candidate clusters. These are fed to a subnetwork, called ScoreNet, to provide a score for each cluster used by NMS to generate the final instances. Multi Task Metric Learning (MTML) [Lah19] combines feature embeddings with directional information for clustering. The directional information is provided by offset vectors pointing to the corresponding instance center. Mean shift clustering and NMS compute the final instances based on these predictions. Hierarchical Aggregation for 3D Instance Segmentation (HAIS) [Che21a] improves the clustering based on offset vectors using a hierarchical aggregation strategy to generate instance proposals progressively. OccuSeg [Han20] additionally predicts an occupancy output, while its instance clustering follows a graph-based segmentation schema.

Some approaches also combine bottom-up and top-down elements. Semantic Superpoint Tree Networks (SSTNet) [Lia21] over-segment point clouds in the first step to create superpoints. These are geometrically homogeneous neighborhoods, similar to superpixels in the image domain. Afterwards, a semantic superpoint tree is constructed bottom-up and based on semantic features of the superpoints, which are pooled from predicted pointwise semantic features. The tree is traversed top-down and split at intermediate tree nodes to create instance clusters. SoftGroup [Vu22] performs clustering on soft semantic scores to reduce the influence of wrong semantic predictions. The instance proposals originating from the soft grouping step are refined in a top-down manner based on proposal features.

Only Zhang et al. [Zha20b] tackle instance segmentation in the outdoor domain. Point clouds are projected into **bird's eye view** and processed by a 2D CNN to predict 2D offset vectors to instance centers for the clustering, complemented by predicted object heights used as a constraint. The other discussed methods are mostly not applicable to the outdoor domain. As explained in Section 2.3.2, approaches based on PointNet or PointNet++ are unable to aggregate sophisticated context in large scale outdoor scenarios, and runtime increases prohibitively with scenario size. The sparse voxel-based approaches predominantly use a voxel size of 2 cm, which is impractical outdoors. There, the covered area approximated for automotive grade lidar scans is about $\pi \cdot (50\,\text{m})^2 \approx 7854\,\text{m}^2$, compared to a typical indoor area [Dai17a] of $22.6\,\text{m}^2$. State-of-the-art voxel-based approaches for outdoor semantic segmentation build upon larger voxel sizes [Tan20, Xu21a, Ye21b] and need to address the loss of resolution by novel extensions. Consequently, a significant research effort is still required for outdoor instance segmentation. This also applies to the clustering since the predominantly used mean shift clustering already requires at least 100 ms for small point clouds with 4096 points, which is an order of magnitude smaller than common outdoor point clouds.

### **2.3.4 Panoptic Segmentation**

Many instance segmentation methods from the previous section also predict semantic segmentation and could potentially be extended to the task of panoptic segmentation. However, most of them use the semantics solely to support the instance task and neither jointly optimize both nor consider stuff classes, which is required for panoptic segmentation. Hence, they have been considered as instance segmentation approaches. Nevertheless, there are some exceptions [Wan19a, Pha19, Zha20d, Lia20], which already aim for the joint prediction of instance and semantic segmentation and can be considered among the first 3D panoptic segmentation approaches, although their authors never use the term panoptic. Supported by the public release of two large scale datasets, SemanticKITTI [Gei12, Beh19] and nuScenes [Cae20, Fon22], with pointwise semantic and instance labels, most of the recently published panoptic methods tackle outdoor panoptic segmentation of driving scenarios. Consequently, research not only focuses on improving the clustering [Gas21, Hon21, Li22a] but also on improved backbones and feature extraction [Mil20, Zha20c, Sir22, Xu22, Li22b]. Additionally, runtime complexity and real-time capabilities are considered. Again, existing methods follow the bottom-up [Mil20, Gas21, Li21, Hon21, Li22a, Xu22, Li22b] or top-down [Sir22] strategy, while others [Raz21a, Li21] do not follow the established patterns.

Panoster [Gas21] is a proposal-free approach with a novel learnable clustering step instead of a fixed post-processing one. Therefore, Panoster's instance branch directly predicts instance IDs and is trained based on a differentiable confusion matrix over ground truth and predicted clusters. In addition, a post-processing step based on DBSCAN [Est96] merges fragmented instances or splits wrongly fused instances and significantly improves the results. While Panoster can be used in combination with arbitrary backbones, it has been evaluated for the point-based method KPConv and the range view-based approach SalsaNext, where KPConv achieves superior results.

Milioto et al. [Mil20] introduce a proposal-free, **range view**-based approach based on a shared DarkNet53 [Red18] encoder and dual decoder setup to predict semantics and instance centers. Instead of transposed convolutions or bilinear upsampling, which both leverage 2D proximity in the range view, a differentiable trilinear upsampling layer is introduced. It exploits 3D geometric information of the point cloud to upsample the 2D feature maps in both decoders. CPSeg [Li21] is a proposal- and clustering-free method with a task-aware attention module to force both decoders to learn comprehensive task-aware features. Geometric features extracted from the surface normals further assist the instance decoder. A similarity matrix based on the learned embeddings determines the instances. The proposal-based approach EfficientLPS [Sir22] uses a shared backbone followed by a semantic branch and a Mask R-CNN-based instance branch. A proximity convolution module aggregates 3D neighborhoods into the range image prior to the backbone application, followed by a 2-way FPN for multi scale feature aggregation. The latter is supported by a novel range encoder network providing spatial information based on the distance channel. Additionally, a novel panoptic periphery loss is introduced to refine boundaries between instances.

The **bird's eye view** approach Panoptic-PolarNet [Zho21] stands on the shoulders of PolarNet and extends it to the panoptic segmentation task. Motivated by Panoptic-Deeplab from the image domain, their instance branch predicts a center heatmap and offset vectors for the clustering of instances. Panoptic-PolarNet shares not only the encoder among tasks but also the first part of the decoder to reduce the computational effort and simultaneously improve the results.

Based on the sparse cylindrical **voxel view**, DSNet [Hon21] introduces a clustering-based framework. The sparse backbone computes voxel features, which are shared among both tasks and further refined in the semantic and instance branch. The predicted voxel-based semantic classes and offset vectors are then transformed back to the 3D points. A learnable clustering module, called dynamic shifting, and motivated by mean shift clustering, can adapt its kernel functions on-the-fly for different instances. GP-S3Net [Raz21a] builds upon AF²-S3Net [Che21d] to compute a 3D semantic segmentation in the first step. Instead of a parallel instance branch, a downstream instance network performs over-segmentation on the thing classes. A graph is built based on these segmented clusters and processed by a graph neural network to predict the final instance segmentation.

The **multi view** approach SMAC-Seg [Li22a] presents sparse multi directional attention clustering. The predicted offsets are transformed from range to bird's eye view for clustering, and an attention module aggregates instance features for the individual clusters. Finally, a centroid-aware repel loss improves the separation of instances. Sparse Cross-Scale Attention Network (SCAN) [Xu22] uses a sparse 3D backbone to compute multi scale voxel features and derive point features. A cross scale attention module aggregates the multi scale voxel features, followed by a reduction into sparse bird's eye view to compute a sparse 2D centroid distribution. The pointwise features are used to predict offset vectors and semantic classes for the 3D points. SCAN clusters instances based on the centroid predictions and offset vectors. PHNet [Li22b] starts with the computation of voxel features for occupied cells, which are then transformed into polar bird's eye view. Semantic and instance features are computed by a 2D CNN and combined with the voxel features. The proposed kNN-transformer models interactions among instance classes and predicts the offset vectors for every voxel. Based on these, a pseudo-heatmap in bird's eye view for potential instance centers is derived. A final clustering step provides the required instances.

## **2.4 Temporal Point Cloud Fusion**

Sensors mounted on autonomous vehicles provide a constant stream of sequential measurements, such as 3D point clouds. The sequential nature combined with the continuity of the world provides a substantial potential to improve various tasks by considering past frames in order to exploit temporal information and dependencies. A frame comprises all relevant sensor measurements from one point in time. For approaches solely based on lidar, this is only the point cloud. On the other hand, a frame contains the data of multiple sensors for sensor fusion approaches. Early temporal attempts focused on object detection, quickly followed by temporal semantic segmentation approaches. Recently, the first approach started to tackle temporal panoptic segmentation. Five categories of fusion strategies can be identified across these tasks: aggregating the inputs, adding a temporal grid dimension, exploiting neighborhoods across time, employing the popular attention mechanism, or passing information recurrently through time.

#### **Input Aggregation**

The simplest temporal fusion is an early fusion strategy, merging multiple input point clouds and providing the aggregated point cloud to the backbone. In order to compensate for ego motion, the point clouds are transformed to the current ego position beforehand. For the task of object detection, this can be used [Cas18, Hu20a] to aggregate multiple point clouds before transforming them into bird's eye view to increase its density and improve the detection results. Panoptic segmentation also benefits from input aggregation [Wan22b], where instance points of previous time steps are aggregated into the current point cloud. Another possibility to exploit past information at the input level are residual images in range view for the tasks of moving object [Che21b] and semantic segmentation [Wan22a]. Residual images represent the difference between the range channel of an ego motion compensated previous time step and the current range channel. The residual values are close to zero for the static environment and deviate significantly from zero for dynamic objects. While residual images are a valuable temporal input for moving object segmentation, the input aggregation strategy achieves only mediocre improvements for semantic segmentation and cannot convincingly exploit temporal information.
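A minimal sketch of such a residual image is shown below. It assumes the previous range image has already been ego motion compensated and reprojected into the current range view; the normalization by the current range is one common variant, not necessarily the exact formulation of the cited works.

```python
import numpy as np

def residual_image(range_prev_compensated, range_curr, eps=1e-6):
    """Normalized range residual between an ego-motion-compensated previous
    range image and the current one: near zero for static structure,
    clearly non-zero for moving objects."""
    valid = (range_prev_compensated > 0) & (range_curr > 0)   # ignore empty cells
    res = np.zeros_like(range_curr)
    res[valid] = np.abs(range_prev_compensated[valid] - range_curr[valid]) / (range_curr[valid] + eps)
    return res
```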

#### **Temporal Grid Dimension**

Approaches in this category stack feature maps from multiple time steps along a temporal dimension. Afterwards, 3D or 4D convolutions are applied to process and fuse the stacked feature maps. Luo et al. [Luo18] extract feature maps from point clouds in bird's eye view for multiple time steps. These feature maps are stacked along a temporal axis and subsequently fused by a 3D CNN. Alongside improving 3D object detection, the temporal approach also enables the forecasting of object motion. One additional dimension is required by MinkowskiNet [Cho19], which introduces 4D sparse tensors to incorporate previous frames. The temporally stacked sparse input tensor is processed by a 4D U-Net with generalized sparse operations to deal with the memory and runtime requirements. A hybrid kernel replaces the hypercube kernel to reduce the kernel size and the computational requirements, e.g., from $3^4 = 81$ to 29 kernel locations for a kernel of size 3. Building upon these ideas, Mersch et al. [Mer22] propose an approach for moving object segmentation. The spatio-temporal information extracted by the sparse 4D CNN is the foundation for the decision about moving versus non-moving objects. The major drawback of the extra grid dimension is the significant impact on computational requirements, which increases with each considered past frame.

#### **Spatio-Temporal Neighborhood**

Neighborhood relations also exist across time since all point clouds can be transformed to the current pose. In contrast to input aggregation, the time is explicitly modeled as the fourth dimension in this category, and points must be spatially and temporally close to be considered neighbors. Consequently, these spatio-temporal neighborhoods allow the aggregation of point features across time. MeteorNet [Liu19a] deploys a PointNet++ for pointwise feature extraction followed by its main contribution, a temporal aggregation module called meteor module for pointwise spatio-temporal feature aggregation. A shared MLP aggregates the spatio-temporal neighborhood of a point, which is computed by a new chained-flow clustering strategy. The module is used in an early fusion strategy for semantic segmentation and applied to the aggregated point cloud. However, it can also be used in a late fusion setup. Point Spatio-Temporal Network (PSTNet) [Fan21] proposes point spatio-temporal convolutions for temporal semantic segmentation to exploit spatio-temporal neighborhoods. After disentangling space and time for consecutive point clouds, spatial convolutions capture the geometric structure of the point clouds, and temporal convolutions extract the temporal dynamics of the spatial regions. The high computational effort of computing neighborhoods in point clouds is further increased for spatio-temporal neighborhood search since the spatio-temporal point cloud is considerably bigger, or neighborhoods must be computed for multiple point clouds.

#### **Attention**

Another strategy is based on concatenating past and current features followed by channel-wise attention to fuse both. SpSequenceNet [Shi20] relies on the sparse voxel view and proposes cross-frame global attention for temporal fusion. This attention layer uses global information from the previous frame to compute channel-wise attention for the features of the current frame. Additionally, cross-frame local interpolation aggregates local information from a point's neighborhood in the previous frame. The follow-up work [Han22] uses the same attention mechanism but replaces the interpolation step with an improved temporal-variation-aware interpolation module, which considers the feature variation inside the neighborhoods. Channel-wise attention is also used for moving object segmentation based on features from stacked residual images [Sun22] or ego motion compensated range images [Kim22]. The latter method additionally applies spatial attention to the fused feature maps. Another attention-based strategy is introduced in STELA [Kni21] and fuses the feature maps of multiple time steps after each encoder stage in order to aggregate every voxel's sparse neighborhood across time.

#### **Recurrent Neural Network**

RNN-based approaches recursively aggregate feature maps by updating a temporal memory with current information. The memory transports the temporal information from time step to time step. This strategy was first used to improve 3D object detection [Els18, McC20] by recursively aggregating the last feature map prior to the detection head with a ConvLSTM [Shi15]. Yin et al. [Yin20] follow the same strategy but deploy an enhanced ConvGRU called attentive spatio-temporal transformer instead of a ConvLSTM. In contrast to the previous approaches, which rely on the bird's eye view, Huang et al. [Hua20a] build their approach on the sparse voxel view. The extracted features are fed to a sparse ConvLSTM and recurrently aggregated. Instead of compensating the ego motion for the input point clouds, like previous approaches, the sparse 3D locations of the hidden state features are transformed to the current time step to compensate for the ego motion. The only semantic segmentation approach [Sch22b] proposes multiple fusion steps instead of a single temporal fusion step after the backbone. Recurrent temporal connections implemented by ConvGRU aggregate and pass feature maps at multiple locations inside the backbone from time step to time step to improve 3D semantic segmentation. In general, RNNs offer the potential to reuse their temporal memory at time step $t+1$ when the next lidar scan is recorded. Except for Huang et al. [Hua20a], existing approaches compensate for ego motion in the input point clouds, prohibiting the reuse of already computed features and limiting the potential of RNNs. The other presented fusion strategies generally lack this potential, in addition to their already discussed drawbacks.
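The recursive aggregation pattern can be illustrated with a minimal convolutional GRU cell in PyTorch. The channel count, feature map size, and the dummy feature stream are illustrative assumptions and not tied to any of the cited architectures.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU: the hidden state acts as the temporal memory
    that is recursively updated with the current feature map."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)                        # update and reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde                 # new temporal memory

feature_stream = [torch.randn(1, 64, 64, 512) for _ in range(3)]  # one backbone output per frame
memory = torch.zeros(1, 64, 64, 512)                              # range-view sized memory
cell = ConvGRUCell(64)
for features in feature_stream:
    memory = cell(features, memory)                               # only the new frame is processed
```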

## **2.5 Sensor Fusion for Point Clouds**

Autonomous vehicles or robots are often equipped with different sensor types and multiple sensors of the individual types. These offer the potential to exploit additional sensor data to improve various 3D tasks. Independently of the task, different fusion strategies with neural networks emerged [Che17b]. Early fusion combines multi sensor data directly at input level, deep fusion fuses intermediate feature maps potentially multiple times, and late fusion relies on the predictions of the different sensor modalities.

Sensor fusion is well established in the area of 3D object detection, especially for the combination of lidar and camera. Early fusion approaches generate regions of interest in camera images to determine where to look in the point cloud [Qi18, Wan19b]. Other methods [Vor20, Xu21b] enhance point clouds with semantic labels from image segmentation. Late fusion approaches combine bounding box predictions from both modalities [Pan20b, Pan22]. Deep fusion approaches can be further divided into **feature-level** and proposal-level fusion [Che17b, Ku18]. Strategies related to regions of interest, proposal, or bounding box fusion are specific to object detection. On the other hand, deep sensor fusion is task agnostic and can also be exploited for multimodal panoptic segmentation. Therefore, it will be the focus of this section. A comprehensive overview of the other categories is provided in [Mao22].

The first challenge to address for sensor fusion on **feature-level** is spatial alignment since camera and lidar features exist in different spaces. One common strategy is the projection of 3D lidar points [Hua20b, Zhu21a, Wan21b, Wan21a, Zha21b] or voxel centers [Sin19] onto the 2D camera image to extract spatially matching camera features. Zhao et al. [Zha21b] further improve this projection by learning pointwise correction offsets to compensate for deviations in the projection caused by calibration or time synchronization errors. Another strategy is the projection of camera features into bird's eye view based on voxel projection [Yoo20], lift splat shoot [Phi20, Liu22], parametric continuous convolution [Lia18, Lia19], or cross-attention [Che22]. For semantic segmentation, the geometric projection of camera features into range view is another popular strategy [Elm19, Mey19a, Kri20, Zhu21c].

The second challenge after the spatial alignment is the fusion of lidar and camera features. Proposed methods build upon simple addition [Lia18, Lia19], concatenation followed by convolution [Wan21a, Zha21b, Liu22] or residual-based addition [Zhu21c], cross-sensor attention [Che22], and gated fusion. The gating signal for the latter is computed from the concatenated [Yoo20, Wan21b] or added [Hua20b] lidar and camera features.
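A minimal PyTorch sketch of such a gated fusion is given below, assuming lidar and camera feature maps that are already spatially aligned in the same 2D view; the exact gating signal differs between the cited approaches.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated deep fusion of spatially aligned lidar and camera feature maps
    (e.g., camera features projected into range view)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, lidar_feat, cam_feat):
        g = self.gate(torch.cat([lidar_feat, cam_feat], dim=1))  # per-pixel, per-channel gate
        return lidar_feat + g * cam_feat                          # camera contributes only where useful

fused = GatedFusion(128)(torch.randn(1, 128, 64, 512), torch.randn(1, 128, 64, 512))
```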

Only some mentioned approaches [Mey19a, Elm19, Kri20, Zha21b, Zhu21c] tackle semantic segmentation, which requires different architectures than object detection to restore the original resolution. This difference enables additional and distinct fusion strategies and architectures. Some methods [Elm19, Kri20] achieve notable improvements, however only for relatively weak baselines, and only predict the road-object classes car, pedestrian, and cyclist. The fusion strategies of other approaches [Mey19a] achieve only small improvements, indicating that the camera's full potential is not yet exploited. Additionally, none of the existing approaches consider the camera failure case or evaluate its impact on the results.

# **3 Concept**

This thesis aims to design a multimodal network architecture based on deep learning for panoptic segmentation of 3D point clouds, which can exploit various information sources of an autonomous vehicle for high quality and robust results. The point cloud is the main source of information, usually originating from a lidar sensor, which can be supported by multiple other sources. First, time and temporal dependencies provide additional and valuable information when not only the current frame but also previous frames are considered. These offer the potential of temporally more consistent predictions since temporally aggregated features provide a more stable and dense context for every 3D point, compared to the sparse and incomplete context of a single lidar scan. Next, other sensors, such as camera or radar, provide complementary information about the environment based on their respective measurement principles. In particular, cameras with their usually much higher spatial resolution than a lidar contribute additional information. Consequently, different requirements can be derived, which the designed architecture needs to fulfill:


Fulfilling all these requirements provides a multimodal architecture for 3D panoptic segmentation. Therefore, the individual requirements are tackled by the main contributions of this thesis, which are then combined into one unified architecture:


While the multimodal architecture provides the full potential and best results, the contributions of this thesis can be flexibly combined. For example, combining the multi view and temporal architecture provides strong results without using other sensors. Reasons not to use existing other sensors are, for example, a minor or no overlapping field of view or safety and redundancy considerations. Alternatively, if the entire multimodal framework is computationally too complex, combining the temporal and multi sensor fusion provides a sophisticated and highly efficient range view-based approach.

## **3.1 Multi View Architecture**

Over time, various point cloud representations with different strengths and weaknesses have been proposed, with the most relevant ones introduced in Section 2.3.1. One possibility to combine different strengths and counteract weaknesses is the combination of individual views. Promising combinations are the sparse voxel and point view to reduce voxel resolution and omit neighborhood aggregation in unstructured point clouds [Tan20, Ye21b]. The sparse voxel view provides the 3D neighborhood aggregation, and the pointwise information in the point view allows a lower voxel resolution. Another promising combination is the range and bird's eye view [Lio21, Li22c], motivated by the distinct underlying projections which omit different axes. Hence, the 2D neighborhoods of projected points differ and complement each other to reduce the impact of far apart points belonging to the same 2D neighborhood. The proposed multi view architecture [Due22] builds upon the range and bird's eye view, motivated by four reasons:


occupy previously empty voxels. While an increased density is generally favorable, it drastically reduces the computational benefits of the sparse representation. Above a certain density, it eventually becomes inefficient [Gra15].

Motivated by the success of Panoptic-DeepLab's [Che20] bottom-up clustering approach in the image domain and the success of representing objects by center points [Yin21], this thesis relies on a bottom-up clustering for instance segmentation based on a center heatmap and offset vectors, introduced in Section 2.2.2.
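A minimal sketch of this bottom-up clustering step is shown below. It assumes predicted 2D offsets, a list of center candidates extracted from the heatmap, and a thing mask from the semantic prediction; the assignment rule and the distance threshold are illustrative choices.

```python
import numpy as np

def cluster_instances(points_xy, offsets_xy, centers_xy, thing_mask, max_dist=2.0):
    """Shift every thing point by its predicted 2D offset and assign it to the
    nearest center candidate; points too far from every center stay unassigned (id 0)."""
    instance_ids = np.zeros(len(points_xy), dtype=int)
    if len(centers_xy) == 0:
        return instance_ids
    shifted = points_xy[thing_mask] + offsets_xy[thing_mask]          # move points towards centers
    dists = np.linalg.norm(shifted[:, None, :] - centers_xy[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    assigned = dists[np.arange(len(shifted)), nearest] < max_dist
    instance_ids[thing_mask] = np.where(assigned, nearest + 1, 0)     # 1-based instance ids
    return instance_ids
```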

Existing approaches [Zha20a, Lio21, Ger21, Qiu22] that rely on the range and bird's eye view deploy either a simple fusion strategy with only minor improvements or require 3D neighborhoods for refinement and inherit the point view's main drawback. In contrast, the presented approach [Due22] introduces a sophisticated point view backbone as a third parallel network and superior multi view link over late fusion, which simultaneously mitigates this drawback. The proposed multi view architecture illustrated in Fig. 3.1 aggregates neighborhoods and context in the range and bird's eye view. Consequently, no context aggregation, neighborhood relations, and hierarchical point cloud subsampling are required in the point view. Instead, a unique feature vector is maintained and refined for each 3D point based on the aggregated features of range and bird's eye view extracted at different scales. For that purpose, 2D CNNs are employed as backbones for both 2D views, which enable efficient context aggregation by exploiting the implicit neighborhood relations provided by the representations' grid topology. Alongside mitigating the main drawbacks, this architecture provides a superior multi scale fusion strategy over late fusion for range and bird's eye view features.

Furthermore, existing approaches based on range and bird's eye view provide only semantic instead of panoptic segmentation. In contrast, the proposed architecture and its novel panoptic head exploit multi view benefits also for panoptic segmentation and the different requirements of its subtasks. The head relies on the point view for semantic segmentation and bird's eye view for instance recognition. Offset vectors and center heatmap required for bottom-up instance segmentation are predicted based on bird's eye view feature maps. This allows a dense 2D center heatmap and decouples object center positions from the measured 3D points since their center position is usually not directly measured. The head clusters all 3D points belonging to thing classes based on the 2D offset vectors and object center candidates in the bird's eye view to compute the 3D instance segmentation.

**Figure 3.1:** High-level architecture for multi view panoptic segmentation of 3D point clouds, which combines the range view, point view, and bird's eye view.

## **3.2 Temporal Multi View Architecture**

Considering not only the current but also past frames provides the capability to exploit valuable temporal information and dependencies. This is especially beneficial for sensors with a high intra- and inter-class variance but a spatially rather low information density. For the targeted domain of real-time applications, waiting for a fixed number of frames and predicting them all at once is not feasible. Instead, a prediction with low latency is required for every arriving frame, as soon as it is provided by the sensor. This repetitive recording is rarely considered by existing approaches, see Section 2.4. Instead, they predominantly compute a prediction for the current frame by transforming a short sequence or temporal window of previous point clouds (≤ 5) to the current frame and processing them all at once. This strategy has two major drawbacks. First, it limits the temporal information that can be exploited due to the limited temporal window size, which linearly increases computational complexity and memory requirements. Second, all frames in the temporal window have to be processed in every time step. Consequently, individual frames are processed multiple times, as illustrated in Fig. 3.2.

**Figure 3.2:** Processed frames for three arriving frames at $t$, $t+1$, and $t+2$. Existing approaches have to process their temporal window and individual frames repeatedly. In contrast, the proposed approach exploits recursion and processes only the arriving frame.

To address these limitations, the proposed architecture [Due20a] follows a different approach and introduces a novel temporal fusion of features and feature maps. It reuses a significant amount of computations from previous time steps and recursively aggregates features and information through time, without a limiting temporal window size. The general idea, which follows a recursive filtering pattern, is shown in Fig. 3.3. Point clouds are processed in one of the chosen representations by a common backbone to compute deep aggregated feature maps for the current time step, similar to a conventional single frame approach. A temporal memory is passed through time and recursively updated with the features computed by the backbone and provides temporally aggregated features to the head for the final predictions. This recursive strategy allows reusing previous feature computations and only adds the update step as additional computation compared to single frame methods. As a result, the runtime is no longer connected to the temporal window size and the features of a potentially unlimited number of past frames can be aggregated into the memory.

**Figure 3.3:** Recursive temporal update for aggregating features across time.

While this approach significantly enhances single view approaches, it is furthermore integrated into the multi view architecture to combine multi view and temporal benefits. Therefore, the range and bird's eye view branch are extended by the proposed recursive memory aggregation, as depicted in Fig. 3.4, to incorporate feature maps from previous time steps in both views. The motivation for choosing the 2D views over the point view was already briefly discussed in the previous section. Aggregating context across time requires finding spatially close features in the last time steps for all 3D points of the current time step, which is more efficient in a 2D grid than a 3D point cloud. Additionally, the choice of the 2D views allows both subtasks to benefit from the temporal fusion, see Fig. 3.4. Temporally fused feature maps are provided to the point view backbone to improve the point-based features for the 3D semantic segmentation. In parallel, the offset vector and center heatmap predictions required for the instance segmentation benefit from the temporally fused bird's eye view feature maps.

**Figure 3.4:** Temporal multi view concept for panoptic segmentation of 3D point clouds. It employs a recurrent temporal fusion in range and bird's eye view to exploit temporal dependencies.

## **3.3 Multimodal Multi View Architecture**

The presented temporal and multi view frameworks are single modality approaches focusing on the lidar sensor. However, other sensor modalities with distinct measurement principles, such as camera and radar, have great potential to provide additional information when fused with point clouds. In general, sensor fusion with neural networks follows early, deep, or late fusion, introduced in Section 2.5. Deep fusion offers the potential to fuse aggregated feature maps at multiple stages. Consequently, less information is lost when fusing sensors with different resolutions since features are spatially aggregated prior to fusion. Furthermore, deep fusion offers the potential to fuse features at multiple scales. In contrast, early and late fusion lack these potentials. Early fusion of lidar and camera for 3D panoptic segmentation typically discards a significant amount of camera information since the camera resolution is usually much higher than that of a lidar. Late fusion combines both sensor modalities at the latest possible moment and therefore lacks the potential to improve feature extraction at earlier stages of the networks. These considerations motivate the choice of deep sensor fusion as the underlying strategy.

A promising sensor combination consists of camera and lidar because both sensors provide rather complementary information. Camera sensors usually measure RGB intensity values at high resolution, whereas lidar sensors provide valuable 3D geometric information with a relatively low resolution. Additionally, the considerable success of image-based scene understanding indicates the value of camera image information for understanding a vehicle's environment. The successful combination of RGB and depth images for 2D scene understanding [Sil12] motivates the fusion of camera and lidar information in the range view. The main advantage over the point view is, again, the grid topology which allows aggregating context for the fused features with standard 2D convolutions.

The proposed fusion architecture performs multi scale feature fusion of lidar and camera. Improving over existing approaches [Elm19, Mey19a, Kri20, Zhu21c], it ensures that the lidar baseline performance is still achieved in case of missing camera information. Furthermore, two novel multi scale fusion strategies improve the exploited camera information and fused features. The underlying setup is illustrated in Fig. 3.5 and builds upon individual sensor backbones to extract lidar range view and camera features. These are provided at multiple scales to the proposed fusion branch, which geometrically transforms camera feature maps into range view feature maps. Afterwards, one of two proposed fusion strategies combines the multi scale feature maps. The fusion branch decouples the fusion from the backbones and is combined with an adapted training strategy so that each backbone can still provide single modality predictions as a backup in case of sensor failure. This property increases robustness against missing or unexpected camera output and is a considerable advantage over existing approaches.

**Figure 3.5:** Range view-based lidar and camera fusion. The multi scale fusion branch decouples the individual sensor backbones from the fusion.

While this fusion approach considerably improves the panoptic segmentation in the range view, another important step is the integration into the temporal multi view architecture to ultimately combine multi view, temporal, and multi sensor benefits. This multimodal architecture is illustrated in Fig. 3.6, where the range fusion backbone replaces the single sensor range view backbone. Consequently, the temporal range view is applied to the multimodal features and provides temporally enhanced multimodal range view features. In this setup, the range view feature maps propagated to the point view contain the fused features across lidar, camera, and time. On the other hand, the bird's eye view and its predictions lack the camera information. However, the main negative impact on the panoptic results stems from errors in the semantic segmentation, as shown with an oracle test in [Zho21], and the semantic segmentation directly benefits from the sensor fusion. Additionally, the bird's eye view allows the integration of additional sensors in the future, such as radar or even online cloud-based map data.

**Figure 3.6:** The multimodal architecture, which combines multi view, temporal, and sensor fusion benefits for 3D panoptic segmentation.

## **3.4 Multimodal Feature Map Transformation**

The presented architectural concepts require the combination and fusion of features and entire feature maps across different sensors, time steps, and point cloud representations. Therefore, a spatial transformation is required to transform feature maps between different modalities and views. The enabling element is the 3D information provided by the lidar point clouds or the lidar views themselves. These 3D points can be geometrically transformed to other sensors or time steps and can be projected into different views. The transformation requires ego poses and sensor extrinsics defined by homogeneous transformation matrices, which are illustrated for a generic vehicle setup in Fig. 3.7. The transformation from the lidar (li) at time $t$ to sensor $S$ at time $\tau$ is defined by:

$$\mathbf{T}\_{t \to \mathbf{\tau}}^{\rm li \to S} = \left(\mathbf{T}\_{\tau}^{\rm ego} \cdot \mathbf{T}^{S}\right)^{-1} \cdot \mathbf{T}\_{t}^{\rm ego} \cdot \mathbf{T}^{\rm li}, \quad \mathbf{T}\_{t \to \mathbf{\tau}}^{\rm li \to S} \in \mathbb{R}^{4 \times 4}.\tag{3.1}$$

The ego poses can be omitted if the source and destination time steps are identical. This combined transformation matrix transforms a homogeneous point cloud $\mathbf{P}\_t = \left[\mathbf{p}\_1, \cdots, \mathbf{p}\_n, \cdots, \mathbf{p}\_N\right]^{\mathsf{T}} \in \mathbb{R}^{N \times 4}$ to another sensor and time step:

$$\left(\mathbf{P}\_{\tau}^{S}\right)^{\mathsf{T}} = \mathbf{T}\_{t \to \tau}^{\mathrm{li} \to S} \cdot \mathbf{P}\_{t}^{\mathsf{T}}.\tag{3.2}$$

**Figure 3.7:** Vehicle with two sensors $S$ and $S'$ at two distinct points in time. The vehicle poses in the global coordinate system at time $t$ and $\tau$ are given by $\mathbf{T}\_t^{\text{ego}}$ and $\mathbf{T}\_{\tau}^{\text{ego}}$. The sensor poses with regard to the vehicle coordinate system are specified by the extrinsics $\mathbf{T}^{S}$ and $\mathbf{T}^{S'}$.

Afterwards, a projection $\mathcal{P}^{\mathcal{V}}$ projects the original or transformed points into one of the considered views $\mathcal{V}$:

$$\begin{aligned} \mathcal{P}^{\mathcal{V}} & \colon \mathbb{R}^4 \to \{1, 2, \ldots, H\} \times \{1, 2, \ldots, W\} \\ \mathcal{P}^{\mathcal{V}}(\mathbf{p}) &= \begin{bmatrix} u \\ \upsilon \end{bmatrix} = \mathbf{u}^{\mathcal{V}}. \end{aligned} \tag{3.3}$$

Relevant views for this thesis are the lidar range view (RV), the polar bird's eye view (BEV), and the camera image view (IMG). The specific computation of the projection depends on the view and has been introduced in Section 2.3.1. The camera projection follows the pinhole model [For12]. Additionally, Cartesian points must first be converted into spherical or cylindrical coordinates for the range and polar bird's eye view, achieved by the functions $\mathcal{Q}^{\mathcal{S}}$ and $\mathcal{Q}^{z}$, introduced in Eqs. (2.14) and (2.15). Computing the projection or cell index for a point $\mathbf{p}$ and a target view $\mathcal{V}$ at time $\tau$ usually requires more steps than just the projection itself and motivates a combined transformation $\mathcal{T}\_{t \to \tau}^{\mathcal{V}}$:

$$\begin{aligned} \mathcal{T}\_{t \rightarrow \tau}^{\mathcal{V}} &: \mathbb{R}^4 \to \{1, 2, \ldots, H\} \times \{1, 2, \ldots, W\} \\ \mathcal{T}\_{t \rightarrow \tau}^{\mathcal{V}}(\mathbf{p}) &= \left(\mathcal{P}^{\mathcal{V}} \circ \mathcal{Q}^{\mathcal{S}|\mathbf{z}}\right) \left(\mathbf{T}\_{t \rightarrow \tau}^{\text{li} \rightarrow \text{S}} \cdot \mathbf{p}\right) . \end{aligned} \tag{3.4}$$

Equation (3.4) combines the transformation, spherical or cylindrical conversion, and projection, where the first two steps are not always required: the geometric transformation can be omitted if the source and target sensor and time step are identical, and the coordinate conversion is only needed for the range and polar bird's eye view but not for the camera image view.

Computing the projection or cell index for each point of a point cloud provides an index matrix:

$$\mathbf{U}\_{\tau}^{\mathcal{V}} = \begin{bmatrix} U\_{1,1} & U\_{1,2} \\ \vdots & \vdots \\ U\_{N,1} & U\_{N,2} \end{bmatrix} = \begin{bmatrix} U\_{n,d} \end{bmatrix} := \begin{bmatrix} \mathcal{T}\_{t \to \tau}^{\mathcal{V}}(\mathbf{p}\_n)\_d \end{bmatrix} \in \mathbb{N}^{N \times 2}. \tag{3.5}$$

Equations (3.4) and (3.5) are illustrated for camera and range view images in Fig. 3.8. The projection index of $\mathbf{p}\_n$ can be found in the $n$-th row of $\mathbf{U}\_{\tau}^{\mathcal{V}}$.

**Figure 3.8:** Projection of a 3D point cloud into lidar range view and camera image view.
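To make the chain of Eqs. (3.1) to (3.5) concrete, the following Python sketch transforms a homogeneous point cloud between sensors and time steps and computes a range view index matrix. It is a minimal illustration only: the field of view, the resolution, and the 0-based indices are assumptions and do not reproduce the exact projection of Eq. (2.17).

```python
import numpy as np

def transform_points(P_t, T_ego_t, T_li, T_ego_tau, T_S):
    """Eqs. (3.1)/(3.2): transform a homogeneous point cloud (N x 4) from the
    lidar frame at time t into the frame of sensor S at time tau."""
    T = np.linalg.inv(T_ego_tau @ T_S) @ T_ego_t @ T_li
    return (T @ P_t.T).T

def project_range_view(P, H=64, W=2048,
                       fov_up=np.deg2rad(3.0), fov_down=np.deg2rad(-25.0)):
    """Simplified range view projection: spherical conversion followed by a
    discretization into H x W cells (cf. Eq. (2.17)). Returns the index
    matrix of Eq. (3.5) with 0-based indices."""
    x, y, z = P[:, 0], P[:, 1], P[:, 2]
    r = np.linalg.norm(P[:, :3], axis=1)
    phi = np.arctan2(y, x)                                           # azimuth angle
    theta = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))   # elevation angle
    u = (fov_up - theta) / (fov_up - fov_down) * H                   # row index
    v = (1.0 - phi / np.pi) * 0.5 * W                                # column index
    u = np.clip(np.floor(u), 0, H - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, W - 1).astype(np.int64)
    return np.stack([u, v], axis=1)                                  # N x 2 index matrix U
```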

Feature maps $\widetilde{\mathbf{F}} \in \mathbb{R}^{N \times C}$ in point view, which originate from neural networks directly applied to unstructured point clouds, are in the most general sense matrices with individual entries $\widetilde{F}\_{n,c}$. These feature maps contain in the $n$-th row the feature vector of the $n$-th point of a point cloud, with $\mathbf{p}\_n$ residing in the $n$-th row of the point cloud matrix $\mathbf{P}$. On the other hand, feature maps $\mathbf{F} \in \mathbb{R}^{H \times W \times C}$ of 2D CNNs are in the most general sense 3D tensors. Their entries are in the following denoted as $F\_{u,v,c}$, following [Goo16]. Since these feature maps are the result of processing a 2D input, such as camera or range view images, they can also be considered as a 2D grid of feature vectors at grid coordinates $\mathbf{u} = (u, v) \in \mathbb{N}^2$. Consequently, they are called 2D feature maps. Since 3D points can be projected into these grid cells, feature vectors from a 2D feature map can be assigned to every 3D point. Based on this relation, the 1D feature map transformation $\mathcal{S}$ transforms a 2D feature map $\mathbf{F}$ into a point view feature map $\widetilde{\mathbf{F}}$:

$$\mathcal{S}: \mathbb{R}^{H \times W \times C} \times \mathfrak{A} \to \mathbb{R}^{N \times C}$$

$$\mathcal{S}\left(\mathbf{F}, \mathbf{U}\right) \coloneqq \left[F\_{U\_{n,1}, U\_{n,2}, c}\right] = \widetilde{\mathbf{F}} = \left[\widetilde{F}\_{n,c}\right] \in \mathbb{R}^{N \times C} \tag{3.6}$$

for all valid index matrices:

$$\mathfrak{A} = \{ \mathbf{U} \mid \mathbf{U} \in \mathbb{N}^{N \times 2} \land 1 \le U\_{n,1} \le H \land 1 \le U\_{n,2} \le W \}. \tag{3.7}$$

The underlying idea is illustrated in Fig. 3.9. Equations (3.5) and (3.6) allow transforming feature maps from an arbitrary 2D view $\mathcal{V}$ and time step $\tau$ to the point view of the current time step $t$:

$${}^{\mathcal{V}}\mathbf{F}\_{t}^{\text{PV}} := \mathcal{S} \left( \mathbf{F}\_{\tau}^{\mathcal{V}}, \mathbf{U}\_{\tau}^{\mathcal{V}} \right). \tag{3.8}$$

Equation (3.8) is the foundation for the multi view architecture to transform range and bird's eye view feature maps back to the point view.

**Figure 3.9:** Visualization of the point view feature map transformation introduced in Eq. (3.6). The features in this example are RGB values visualized by color.
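In an implementation, the 1D feature map transformation of Eq. (3.6) reduces to a single gather operation. A minimal numpy sketch, assuming 0-based indices:

```python
import numpy as np

def to_point_view(F, U):
    """Eq. (3.6): gather for every 3D point the feature vector of the 2D cell
    it projects into. F: H x W x C feature map, U: N x 2 index matrix (0-based).
    Returns the N x C point view feature map."""
    return F[U[:, 0], U[:, 1], :]

# usage in the sense of Eq. (3.8): F_pv = to_point_view(F_rv, U_rv)
```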

In addition, the proposed temporal and sensor fusion requires the transformation of camera image feature maps to the range view and the transformation of range and bird's eye view feature maps between time steps. Therefore, the ideas of Eq. (3.6) are extended to a general 2D feature map transformation $\hat{\mathcal{S}}$:

$$\hat{\mathcal{S}}: \mathbb{R}^{H' \times W' \times C} \times \mathcal{W} \to \mathbb{R}^{H \times W \times C}$$

$$\hat{\mathcal{S}}(\mathbf{F}', \mathbf{V}) \coloneqq \left[ F'\_{V\_{u,v,1}, V\_{u,v,2}, c} \right] = \mathbf{F} = \left[ F\_{u,v,c} \right] \in \mathbb{R}^{H \times W \times C} \tag{3.9}$$

for all valid index tensors:

$$\mathcal{W} = \{ \mathbf{V} \mid \mathbf{V} \in \mathbb{N}^{H \times W \times 2} \land 1 \le V\_{\mathbf{u}, \upsilon, 1} \le H' \land 1 \le V\_{\mathbf{u}, \upsilon, 2} \le W' \}. \tag{3.10}$$

An example of the 2D feature map transformation is depicted in Fig. 3.10.

**Figure 3.10:** Visualization of the 2D feature map transformation introduced in Eq. (3.9).

In contrast to the index matrix $\mathbf{U}$, which is based on the point cloud itself, the index tensor $\mathbf{V}$ is computed based on the 3D positions $\mathbf{g}\_{u,v}$ of the target view's cells:

$$\begin{aligned} \mathbf{V}\_{\tau}^{\boldsymbol{\nu}} &= \begin{bmatrix} V\_{\boldsymbol{u}, \boldsymbol{\nu}, \boldsymbol{d}} \end{bmatrix} := \begin{bmatrix} \mathcal{T}\_{\boldsymbol{t} \to \tau}^{\boldsymbol{\nu}} \left( \mathbf{g}\_{\boldsymbol{u}, \boldsymbol{\nu}} \right)\_{\boldsymbol{d}} \end{bmatrix} \in \mathbb{N}^{H \times W \times 2}, \\\ \mathbf{g}\_{\boldsymbol{u}, \boldsymbol{\nu}} &= \begin{bmatrix} G\_{\boldsymbol{u}, \boldsymbol{\nu}, \boldsymbol{1}} & G\_{\boldsymbol{u}, \boldsymbol{\nu}, \boldsymbol{2}} & G\_{\boldsymbol{u}, \boldsymbol{\nu}, \boldsymbol{3}} & \mathbf{1} \end{bmatrix}^{\mathsf{T}}. \end{aligned} \tag{3.11}$$

The definition of $\mathbf{G} = \left[ G\_{u,v,d} \right] \in \mathbb{R}^{H \times W \times 3}$ depends on the view. A range view cell's 3D position is defined by the 3D point of the point cloud which is projected into the considered cell, as illustrated in Fig. 3.11. If multiple points are projected into one cell, only one is kept, which is discussed in detail in the next chapter. On the other hand, the center of a bird's eye view cell determines its 3D position independently of the point cloud when making a flat-earth assumption with $z = 0$.

**Figure 3.11:** Examples for computing the index tensors. The left example illustrates the mapping from range view to camera image view and the right example the mapping from the bird's eye view at time $t$ to time $\tau$. The bird's eye view resolution or distance between cells is defined by the radial and angular cell sizes $\Delta r$ and $\Delta \phi$.

# **4 Multi View Panoptic Segmentation of 3D Point Clouds**

The multi view approach presented in this chapter addresses the challenge of predicting an improved 3D panoptic segmentation for unstructured point clouds. It comprises individual backbones for range and bird's eye view, which support a point view backbone. In the first step, range and bird's eye view are considered individually to introduce single view approaches and derive 2D backbones for the multi view architecture. The novel multi view framework [Due22] for 3D panoptic segmentation builds upon and extends these backbones and is presented in the second half of this chapter.

## **4.1 Range View Network**

Range view-based panoptic segmentation relies on point clouds represented as range images. These are processed by a 2D backbone to compute feature maps for semantic and instance segmentation, illustrated in Fig. 4.1. Multiple heads are deployed to provide the parallel predictions required for panoptic segmentation. On the one hand, the semantic head predicts a 2D semantic segmentation based on the semantic feature maps $\mathbf{F}\_{\text{sem}}$. On the other hand, the center and offset head use the instance feature maps $\mathbf{F}\_{\text{ins}}$ to compute a heatmap and offset vectors for 2D bottom-up instance segmentation. The overall range view network is called RVNet, and its final 3D panoptic segmentation is the result of a NN-based back-projection.

**Figure 4.1:** The range view network RVNet. It is composed of a 2D fully convolutional backbone based on feature extractors (FEs) and feature aggregators (FAs) [Mey19b], and three different heads. The edges are labeled with the feature map sizes.

#### **Input Representation**

The underlying idea of the range view is to represent point clouds in spherical coordinates and discretize azimuth and polar angle into cell coordinates following Eq. (2.17). Ideally, all 3D points are projected to different grid cells. However, this requires that a point cloud is recorded from a single point of view and that measurements are performed in equidistant azimuth and polar intervals. Since these requirements are rarely satisfied for point clouds originating from moving lidar sensors in the considered context, collisions occur when multiple points are projected into the same grid cell. One strategy to address this is to consider the intrinsic properties of the deployed lidar and adapt the mapping from angles to cell coordinates accordingly. This can significantly reduce collisions and information loss but sacrifices the underlying regularity of the range view grid, which is an important property for convolution-based processing with CNNs. Hence, further projection strategies are discussed in addition to the regularly-spaced strategy of Eq. (2.17).

As a result of the functional principle of common lidar sensors introduced in Section 2.3, the resulting measurements are usually represented in spherical coordinates. While the distance measurement depends on the environment, both angles are usually determined directly or indirectly by design. Many rotating 360° lidars use a vertical stack of laser heads rotating around their vertical axis. The position in this stack maps to a fixed polar angle and can be used as row index. On the other hand, the azimuth angle usually depends on the angular velocity and time. In this case, the column can be determined by enumerating the measurements for each laser head. The result is a purely sensor-dependent projection without collisions or information loss, which is directly provided by most lidar sensors. However, the cells of the resulting grid are not necessarily equidistant.

For some use cases, such as many public datasets, it is impossible to derive the cell coordinates directly from the sensor since only the Cartesian point clouds are provided. These are often already pre-processed to compensate the ego motion during the continuous scan, which prevents restoring the raw measurements. However, sensor properties can still be taken into account to optimize the projection. Considering lidar sensors with a grid-like measurement pattern of size $(H, W)$ defined by intrinsic measurement angles $(\hat{\theta}\_u, \hat{\phi}\_v)$, spherical point clouds can be projected by [Due20a]:

$$\mathbf{u} = \begin{bmatrix} \arg\min\_{1 \le u \le H} \left( \left| \hat{\theta}\_u - \theta \right| \right) \\\\ \arg\min\_{1 \le v \le W} \left( \left| \hat{\phi}\_v - \phi \right| \right) \end{bmatrix}. \tag{4.1}$$

While Eq. (4.1) works for any arbitrary and irregular angle distribution, it has a higher computational complexity than Eq. (2.17). A combination of the presented projection strategies is also possible and applies distinct strategies to the individual dimensions. Some sensors, such as a Velodyne HDL-64E, have an equidistant azimuth but no entirely equidistant polar angle distribution; however, the upper and lower half each have an equidistant distribution on their own. In order to reduce the number of collisions and information loss, the computation of the column index follows Eq. (2.17), and the computation of the row index is optimized to:

$$u = \begin{cases} \left[ 0.5 \cdot H \cdot \frac{\theta - \theta\_{\text{up}}}{\theta\_{\text{mid}} - \theta\_{\text{up}}} \right] & \text{if } \theta < \theta\_{\text{mid}} \\ \left[ 0.5 \cdot H \cdot \left( 1 + \frac{\theta - \theta\_{\text{mid}}}{\theta\_{\text{down}} - \theta\_{\text{mid}}} \right) \right] & \text{otherwise} \end{cases} \tag{4.2}$$

The decision boundary $\theta\_{\text{mid}}$ is an intrinsic property of the sensor and separates the upper and lower half. Independent of the used projection, the result is a range image of size $H \times W \times 6$ with the channels distance, $x$, $y$, $z$, intensity, and an occupancy flag. The latter indicates if at least one point was projected into the respective cell. In case of collisions, the point with the smallest distance is selected. The resulting tensor is the input to the range view backbone and overall network. The second, third, and fourth channel contain the projected Cartesian points and are used as $\mathbf{P}^{\text{RV}} \in \mathbb{R}^{H \times W \times 3}$ in the following chapters.
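A minimal sketch of assembling the $H \times W \times 6$ range image from projected points, keeping the closest point per cell in case of collisions. The index matrix is assumed to be 0-based, and all variable names are illustrative:

```python
import numpy as np

def build_range_image(P, intensity, U, H, W):
    """Assembles the H x W x 6 input tensor with the channels
    (distance, x, y, z, intensity, occupancy). In case of collisions,
    the point with the smallest distance wins."""
    r = np.linalg.norm(P[:, :3], axis=1)
    order = np.argsort(-r)                      # far to near: nearer points overwrite
    u, v = U[order, 0], U[order, 1]
    img = np.zeros((H, W, 6), dtype=np.float32)
    img[u, v, 0] = r[order]
    img[u, v, 1:4] = P[order, :3]
    img[u, v, 4] = intensity[order]
    img[u, v, 5] = 1.0                          # occupancy flag
    return img
```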

#### **Backbone**

The backbone is a fully convolutional network depicted in Fig. 4.1 whose architecture is motivated by Deep Layer Aggregation (DLA) [Yu18] and related to the backbone of LaserNet [Mey19b]. It consists of two different building blocks, feature extractors (FEs) and feature aggregators (FAs) [Mey19b]. An extractor consists of stacked residual blocks, illustrated in Fig. 4.2. The first convolutional layer of its first block downsamples the input by applying a stride $\mathbf{s} = (s\_u, s\_v)$ and optionally increases the number of channels. All remaining blocks of a feature extractor keep the resolution and channel size constant. Three feature extractors with 4, 5, and 6 residual blocks, respectively, build the encoder of the network, see Fig. 4.1, and consecutively downsample the feature maps. The first and second feature extractor apply a stride of $\mathbf{s} = (1, 2)$ and horizontally downsample by a factor of two, whereas the last one with $\mathbf{s} = (2, 2)$ also downsamples vertically. The asymmetrical downsampling is motivated by the intrinsic properties of the deployed lidar sensors: their horizontal resolution is roughly four times higher than their vertical one. With its 30 layers in total, the encoder is comparable to a ResNet-34, however with only three stages.

**Figure 4.2:** The building blocks of the range view backbone based on [Mey19b], which consist of Basic Blocks (BBs) [He16].

Feature aggregators, as the second building block, receive feature maps of two different stages and resolutions. They upsample their lower-resolution input with a transposed convolution, concatenate both inputs and apply two residual blocks. As a result, the upsampling of the lower-resolution feature maps with more context is guided by feature maps from the previous stage with higher resolution and more spatial information. This strategy improves the combination of aggregated context with fine details. The decoder comprises four feature aggregators, depicted in Fig. 4.1. It uses the multi scale feature maps and context provided by three feature extractors to compute the final feature maps for the heads. As panoptic segmentation requires multiple heads, two feature aggregators are deployed in parallel at the end, one for the semantic head and another for the instance heads. This dual setup showed improved results for existing methods [Che20, Zho21].

#### **Heads and Loss**

The range view predictions for semantics, centers, and offsets are computed based on the backbone's final feature maps and multiple parallel heads. The semantic head is a single 1 × 1-convolution that computes the semantic segmentation. Both instance heads for the center and offset predictions have the same architecture consisting of a 3 × 5-convolution which reduces the feature channels to 32, followed by BN, LReLU, and a final 1 × 1-convolution. All predictions are horizontally upsampled by a factor of two to match the original input resolution. Afterwards, Non-Maximum Suppression (NMS) extracts the $N\_{\text{c}}$ range view cells with the highest scores as center candidates $\hat{\mathbf{u}} = (\hat{u}, \hat{\upsilon}) \in \mathcal{K}$ based on the center heatmap. In addition, the semantic segmentation determines the range image cells belonging to a thing class. Finally, the assignment $\mathcal{C}$ allocates each of these cells $\mathbf{u}$ to one of the center candidates $\hat{\mathbf{u}}$ based on the predicted offset vectors $\mathbf{O} \in \mathbb{R}^{H \times W \times 2}$:

$$\mathcal{C}(\mathbf{u}) = \operatorname\*{arg\,min}\_{\hat{\mathbf{u}} \in \mathcal{K}} \left( \left\| \begin{bmatrix} u \\ \upsilon \end{bmatrix} + \begin{bmatrix} O\_{u,\upsilon,1} \\ O\_{u,\upsilon,2} \end{bmatrix} - \begin{bmatrix} \hat{u} \\ \hat{\upsilon} \end{bmatrix} \right\|\_{2} \right). \tag{4.3}$$
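In code, the assignment of Eq. (4.3) is a nearest-center lookup over the shifted thing cells. A minimal numpy sketch with hypothetical variable names and 0-based indices:

```python
import numpy as np

def assign_to_centers(thing_cells, offsets, centers):
    """Eq. (4.3): shift every thing cell by its predicted offset and assign it
    to the closest center candidate.
    thing_cells: M x 2 cell coordinates, offsets: H x W x 2, centers: K x 2."""
    shifted = thing_cells + offsets[thing_cells[:, 0], thing_cells[:, 1], :]
    dist = np.linalg.norm(shifted[:, None, :] - centers[None, :, :], axis=2)  # M x K
    return np.argmin(dist, axis=1)              # instance id per thing cell
```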

A NN-based back-projection strategy [Mil19], introduced in Section 2.3.2, transforms the predicted 2D semantic and instance segmentation back to the 3D point cloud to provide the required 3D panoptic segmentation.

Since all heads are trained simultaneously, a multi task loss is required. The semantic head is trained with CE and Lovász loss [Ber18], the center head with MSE loss, and the offset head with MAE loss, leading to a weighted loss term with weights $\lambda$:

$$L = \lambda\_{\text{sem}} \cdot (L\_{\text{CE}} + L\_{\text{Lovász}}) + \lambda\_{\text{ctr}} \cdot L\_{\text{MSE}} + \lambda\_{\text{off}} \cdot L\_{\text{MAE}}.\tag{4.4}$$

## **4.2 Bird's Eye View Network**

The bird's eye view is the second view which plays an essential role in the multi view architecture. Similar to the range view, the first step is a transformation of the point clouds into the desired input representation. The bird's eye view input tensor is then fed to the backbone, which follows the same architecture as the range view backbone, with the only difference of a symmetric feature map downsampling, see Fig. 4.3. The final feature maps of the backbone are again provided to three parallel heads. A sparse semantic head and 3D clustering step provide the final 3D panoptic segmentation. Similar to the range view network, the loss function of Eq. (4.4) is used. The overall bird's eye view architecture (BEVNet) is depicted in Fig. 4.3.

#### **Input Representation**

The point clouds are transformed into the polar bird's eye view to create the input tensor. The motivation for a polar instead of a Cartesian representation lies in the underlying sensor properties discussed in Section 2.3. The point density naturally diminishes with distance, and therefore, a polar grid is less sparse and requires a smaller number of cells to cover the required area. Every 3D point is projected to its corresponding bird's eye view cell following Eq. (2.18). However, multiple points will be projected into most of the cells. While these collisions are undesirable for the range view and can ideally be avoided, they are unavoidable for the bird's eye view since there is no one-to-one correspondence between cells and measurements. Hence, a learned function based on PointNet transforms the varying number of points inside each cell into a fixed-size feature vector, following the ideas of PointPillars [Lan19b] and PolarNet [Zha20c]. The lightweight structure proposed in this approach is shown in Fig. 4.4.
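The per-cell feature computation of Fig. 4.4 amounts to a pointwise MLP followed by a maximum over all points falling into the same cell. The following PyTorch sketch illustrates this idea; the layer sizes and the use of `scatter_reduce_` for the per-cell maximum are assumptions, not the exact configuration of the thesis.

```python
import torch
import torch.nn as nn

class CellPointNet(nn.Module):
    """Pointwise MLP followed by a per-cell max pooling (cf. Fig. 4.4).
    Layer sizes are illustrative only."""
    def __init__(self, in_channels=7, out_channels=64):
        super().__init__()
        self.pmlp = nn.Sequential(
            nn.Linear(in_channels, out_channels),
            nn.BatchNorm1d(out_channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, points, cell_ids, num_cells):
        # points: P x in_channels, cell_ids: P (flat cell index, dtype long)
        feats = self.pmlp(points)                                  # P x C
        idx = cell_ids.unsqueeze(1).expand(-1, feats.size(1))      # P x C
        cells = feats.new_full((num_cells, feats.size(1)), float('-inf'))
        cells.scatter_reduce_(0, idx, feats, reduce='amax')        # per-cell maximum
        return torch.where(torch.isinf(cells), torch.zeros_like(cells), cells)
```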

**Figure 4.3:** The bird's eye view network BEVNet. It is composed of a fully convolutional backbone and three different heads. The edges are labeled with the feature map sizes.

**Figure 4.4:** PointNet setup for computing a fixed-size feature vector for every grid cell based on the set of points which are projected into the considered cell. Every pMLP consists of a single layer and uses BN and LReLU.

#### **Sparse Semantic Head**

Current state-of-the-art approaches [Zha20c, Zho21] compute the semantic segmentation in the bird's eye view, which is then transformed back to the point cloud to obtain a 3D semantic segmentation. Since a 2D bird's eye view semantic segmentation leads to mediocre results, the classification layer of existing approaches predicts $D \cdot N\_{\text{classes}}$ semantic class scores for every bird's eye view cell, leading to a predicted segmentation $\mathbf{S} \in \mathbb{R}^{H \times W \times D \cdot N\_{\text{classes}}}$. These can be interpreted as predictions for a vertical stack of $D$ voxels per bird's eye view cell and result overall in a voxel-based prediction $\mathbf{S}^{\text{VX}} \in \mathbb{R}^{H \times W \times D \times N\_{\text{classes}}}$, depicted on the left in Fig. 4.5.

However, this is computationally expensive and inefficient due to the sparsity of the underlying voxel grid. Therefore, a novel sparse bird's eye view head improves this final step by omitting empty cells and only considering cells with at least one point inside:

$$\begin{split} \hat{\mathcal{U}}^{\text{occupied}} &= \{ \mathbf{u} \mid \mathbf{u} \in \{1, 2, \ldots, H\} \times \{1, 2, \ldots, W\} \;\land \\ &\quad \exists \mathbf{p} \in \mathbf{P} \colon \left( \mathcal{P}^{\text{BEV}} \circ \mathcal{Q}^{z} \right) (\mathbf{p}) = \mathbf{u} \} .\end{split} \tag{4.5}$$

The set of occupied cells and an arbitrary but fixed ordering defined by the bijective function

$$\mathcal{E} \, : \, \widehat{\mathcal{U}}^{\text{occupied}} \to \{1, 2, \ldots, N'\}, \quad N' = |\widehat{\mathcal{U}}^{\text{occupied}}| \tag{4.6}$$

can be used to transform the 2D semantic feature maps sem into a sparse representation. The transformation is based on the spatial feature map transformation introduced in Eq. (3.6):

$$\begin{aligned} \mathbf{F}\_{\text{sem}}^{\text{sparse}} &= \mathcal{S}\left(\mathbf{F}\_{\text{sem}}, \mathbf{U}^{\text{occupied}}\right), \\ \mathbf{U}^{\text{occupied}} &= \left[U\_{n',d}\right] = \left[\mathcal{E}^{-1}(n')\_d\right] \in \mathbb{N}^{N' \times 2}. \end{aligned} \tag{4.7}$$

Figure 4.5 visualizes the underlying idea and equations. Every row of the matrix $\mathbf{F}\_{\text{sem}}^{\text{sparse}}$ contains the feature vector of a non-empty bird's eye view cell. The ordering $\mathcal{E}$ associates the corresponding row to the 3D points when transforming the predictions back to the point cloud. It is worth mentioning that omitting the underlying grid structure is possible because the classification layer individually maps the final feature vectors to class scores without using the grid.

**Figure 4.5:** Idea of the sparse semantic head. A Cartesian grid instead of the deployed cylindrical grid ensures a clearer visualization. The additional dimension for the class scores is replaced by class colors for visualization. Empty cells or voxels are depicted in gray.

The classification layer of the deployed sparse head computes $D \cdot N\_{\text{classes}}$ class scores for every occupied cell based on $\mathbf{F}\_{\text{sem}}^{\text{sparse}}$, which provides a vertical stack of predictions $\mathbf{S}^{\text{sparse}} \in \mathbb{R}^{N' \times D \times N\_{\text{classes}}}$. These can be transformed into a 3D point cloud semantic segmentation by:

$$\mathbf{S}^{\text{PV}} = \mathcal{S} \left( \mathbf{S}^{\text{sparse}}, \mathbf{U}^{\text{sparse}} \right) \in \mathbb{R}^{N \times N\_{\text{classes}}}.\tag{4.8}$$

The index matrix $\mathbf{U}^{\text{sparse}}$ for the sparse representation is based on the voxel coordinates $\mathbf{U}^{\text{VX}}$ for the cylindrical voxel view

$$\mathbf{U}^{\rm VX} = \begin{bmatrix} U\_{n,d} \end{bmatrix} = \begin{bmatrix} (\mathcal{P}^{\rm VX} \circ \mathcal{Q}^{\rm z})(\mathbf{p}\_n)\_d \end{bmatrix} \in \mathbb{N}^{N \times 3},\tag{4.9}$$

and is computed as follows:

$$\mathbf{U}^{\text{sparse}} = \begin{bmatrix} \tilde{U}\_{n,d} \end{bmatrix}, \quad \tilde{U}\_{n,1} = \mathcal{E} \left( \begin{bmatrix} U\_{n,1} \\ U\_{n,2} \end{bmatrix} \right), \quad \tilde{U}\_{n,2} = U\_{n,3}. \tag{4.10}$$

The ordering $\mathcal{E}$ identifies for every point the row in $\mathbf{S}^{\text{sparse}}$ based on the first two voxel coordinates, as illustrated in Fig. 4.5. These correspond to the bird's eye view coordinates, which cannot be used directly since the underlying grid structure was abandoned. Simultaneously, the third voxel coordinate $U\_{n,3}$ identifies the column and corresponds to the voxel position in the vertical stack. Overall, the sparse segmentation head significantly reduces inference and training time as well as memory demands due to the sparsity of approximately 86 %.
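The sparse head boils down to two gather operations around the classification layer. A minimal numpy sketch of Eqs. (4.5) to (4.10), assuming 0-based indices and a hypothetical `classify` callable for the classification layer:

```python
import numpy as np

def sparse_semantic_head(F_sem, U_vx, classify):
    """F_sem: H x W x C bird's eye view feature map, U_vx: N x 3 voxel
    coordinates of the points (Eq. (4.9)), classify: maps N' x C features to
    N' x D x n_classes scores. Returns per-point class scores (Eq. (4.8))."""
    # occupied cells and ordering E (Eqs. (4.5) and (4.6))
    occupied, inverse = np.unique(U_vx[:, :2], axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    F_sparse = F_sem[occupied[:, 0], occupied[:, 1], :]        # N' x C   (Eq. (4.7))
    S_sparse = classify(F_sparse)                              # N' x D x n_classes
    # row via the ordering E, column via the vertical voxel index (Eq. (4.10))
    return S_sparse[inverse, U_vx[:, 2], :]                    # N x n_classes
```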

#### **Instance Heads**

Both instance heads consist of a 3 × 3-convolution which reduces the number of feature channels to 32, followed by BN, LReLU, and a final 1 × 1-convolution. In contrast to the semantic head, expanding the predictions to voxel-based predictions is unnecessary since the clustering is based on 2D centers and offsets. Furthermore, the employed instance clustering directly considers and clusters the individual 3D points instead of bird's eye view cells. As a result, the explicit computation of a 2D instance segmentation can be skipped, which avoids processing the empty cells. The first step of the clustering is again NMS to extract the $N\_{\text{c}}$ center candidates $\hat{\mathbf{u}}$ with the highest scores. Afterwards, the assignment $\mathcal{C}$ allocates every 3D point to one of the extracted center candidates $\hat{\mathbf{u}}$ based on the offset vectors $\mathbf{O}$ and the point's projection index $\mathbf{u}\_n$:

$$\mathcal{C}(\mathbf{u}\_n) = \operatorname\*{arg\,min}\_{\hat{\mathbf{u}} \in \mathcal{K}} \left( \left\| \begin{bmatrix} u \\ \upsilon \end{bmatrix} + \begin{bmatrix} O\_{u,\upsilon,1} \\ O\_{u,\upsilon,2} \end{bmatrix} - \begin{bmatrix} \hat{u} \\ \hat{\upsilon} \end{bmatrix} \right\|\_2 \right). \tag{4.11}$$

Consequently, the combined predictions of the sparse semantic and the instance heads provide the final 3D panoptic segmentation.

## **4.3 Multi View Network**

The main drawback of the presented approaches is the focus on an individual view. Depending on the chosen view, these approaches suffer from different weaknesses. Consequently, a multi view framework [Due22] is proposed, which combines distinct views and addresses these drawbacks to improve the predicted 3D panoptic segmentation. It builds upon three distinct representations, the 2D range and bird's eye view combined with the unstructured point view. The backbones of the previously introduced single view networks are two primary components of the multi view network and extract feature maps of the point cloud represented in range and bird's eye view. Their 2D grid structure allows for efficient feature and context aggregation, and as a result of the different projections, both views contribute valuable features based on different 2D neighborhoods. A novel point view backbone, illustrated in Fig. 4.6, is responsible for the vital combination of features across views. These multi view features are fused at feature level inside the point view backbone, instead of a simple late fusion step, to leverage their full potential. The panoptic head also exploits the multi view setup and provides semantic and instance predictions in different views. On the one hand, it predicts a 3D semantic segmentation based on the pointwise multi view features. On the other hand, it builds upon bird's eye view feature maps for bottom-up instance segmentation, which is most suitable for a dense heatmap prediction. Furthermore, it allows the range and point view to focus on semantic features. The overall architecture of the multi view approach MVNet is shown in Fig. 4.6.

**Figure 4.6:** Architecture of the multi view framework MVNet based on range view, bird's eye view, and point view.

#### **Point View Backbone**

The key element and important link between the range and bird's eye view is the point view backbone. Its overall structure mimics the architecture of the 2D backbones. However, it omits some cross-connections since it deploys no feature aggregators, which require two feature map inputs of different scales. Consequently, the architecture is more related to the U-Net family [Ron15] with a single skip connection. The backbone itself consists of two elements, pMLPs and Multi View Aggregation (MVA). The single-layer pMLPs refine the pointwise features, while the MVA is the actual link between the backbones, combining and fusing the features across the three different views. This architecture is flexible and can be deployed in different configurations to decide for each block of the backbone whether a multi view fusion should be applied. Depending on the choice, an MVA module or a pMLP is deployed. The setup shown in Fig. 4.6 deploys multi view aggregation three times: at the beginning to collect low level features with little context but high spatial resolution, after the last feature extractor to gather features with strong context but reduced spatial resolution, and at the end to exploit the final feature maps, which contain the aggregated context and have a high spatial resolution. Another evaluated configuration is the aggregation after every block, which replaces the remaining three core pMLPs with MVA modules. Independently of the chosen configuration, the last MVA module receives two point view inputs which are concatenated and processed by a pMLP prior to the aggregation.

The multi view aggregation follows a two-step process to aggregate features across three views. The first and most important step is the transformation of feature maps from range and bird's eye view to point view. This step is required to provide the aggregated context for every 3D point in the point cloud. Based on Eq. (3.8), the 2D bird's eye view feature maps are transformed following

$$\begin{aligned} \mathbf{^{BEV}F}\_{j}^{\mathrm{PV}} &= \mathcal{S} \left( \mathbf{F}\_{j}^{\mathrm{BEV}}, \mathbf{U}\_{j}^{\mathrm{BEV}} \right) \in \mathbb{R}^{N \times C\_{\mathrm{BEV}}}, \\ \mathbf{U}\_{j}^{\mathrm{BEV}} &= \left[ \left( \mathcal{P}\_{j}^{\mathrm{BEV}} \circ \mathcal{Q}^{\mathrm{Z}} \right) (\mathbf{p}\_{n})\_{\mathrm{d}} \right]. \end{aligned} \tag{4.12}$$

The range view feature maps are similarly transformed:

$$\begin{aligned} \mathbf{^{RV}F}\_{j}^{\mathrm{PV}} &= \mathcal{S} \left( \mathbf{F}\_{j}^{\mathrm{RV}}, \mathbf{U}\_{j}^{\mathrm{RV}} \right) \in \mathbb{R}^{N \times C\_{\mathrm{RV}}}, \\ \mathbf{U}\_{j}^{\mathrm{RV}} &= \left[ \left( \mathcal{P}\_{j}^{\mathrm{RV}} \circ \mathcal{Q}^{\mathcal{S}} \right) (\mathbf{p}\_{n})\_{d} \right]. \end{aligned} \tag{4.13}$$

The projections depend on the level $j$ due to different feature map sizes. After this step, three feature maps are present in the point view while originating from different views. The second step fuses these into a combined feature map for the next level $j+1$:

$$\mathbf{F}\_{j+1}^{\rm PV} = \mathcal{F}\_{\rm MV} \left( \mathbf{F}\_{j}^{\rm PV}, \,^{\rm RV} \mathbf{F}\_{j}^{\rm PV}, \,^{\rm BEV} \mathbf{F}\_{j}^{\rm PV} \right). \tag{4.14}$$

Four different fusion strategies $\mathcal{F}\_{\rm MV}$ with fixed and learnable fusion operations are proposed and investigated. These are addition, elementwise maximum, concatenation followed by a 1 × 1-convolution, and a weighted sum with learnable parameters $\mathbf{W}^{\nu} \in \mathbb{R}^{C\_{\rm PV} \times 3}$ and weight vectors $\mathbf{y}^{\nu}$:

$$\mathbf{F}\_{j+1}^{\rm PV} = \sum\_{\nu} \text{diag} \left( \mathbf{y}^{\nu} \right) \cdot {}^{\nu}\mathbf{F}\_{j}^{\rm PV}, \quad \nu \in \{ \rm PV, RV, BEV \},$$

$$\left[ \mathbf{y}^{\rm PV} \,\mathbf{y}^{\rm RV} \,\mathbf{y}^{\rm BEV} \right] = \sigma\_{\text{softmax}} \left( \sum\_{\nu} \, {}^{\nu}\mathbf{F}\_{j}^{\rm PV} \cdot \mathbf{W}^{\nu} \right) \in \mathbb{R}^{N \times 3}. \tag{4.15}$$

The softmax function is applied row-wise. All strategies apply a final pMLP in the end and, except for the concatenation, require $C\_{\rm PV} = C\_{\rm RV} = C\_{\rm BEV}$ to be applicable. The steps and setup of the MVA module are illustrated in Fig. 4.7.
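A minimal PyTorch sketch of the learnable weighted-sum fusion of Eq. (4.15); the initialization and the final pMLP configuration are assumptions:

```python
import torch
import torch.nn as nn

class WeightedSumFusion(nn.Module):
    """Learnable weighted sum over point, range, and bird's eye view features
    (Eq. (4.15)). All inputs must share the channel size C."""
    def __init__(self, channels):
        super().__init__()
        self.W = nn.ParameterDict({
            v: nn.Parameter(torch.randn(channels, 3) * 0.01)    # W^v: C x 3
            for v in ('PV', 'RV', 'BEV')
        })
        self.pmlp = nn.Sequential(nn.Linear(channels, channels), nn.LeakyReLU(0.1))

    def forward(self, feats):
        # feats: dict of N x C point view feature maps, one entry per view
        logits = sum(feats[v] @ self.W[v] for v in ('PV', 'RV', 'BEV'))      # N x 3
        y = torch.softmax(logits, dim=1)                                     # row-wise
        fused = sum(y[:, i:i + 1] * feats[v] for i, v in enumerate(('PV', 'RV', 'BEV')))
        return self.pmlp(fused)                                              # final pMLP
```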

**Figure 4.7:** Setup and steps of the multi view aggregation module. Range and bird's eye view feature maps are transformed into the point view. Afterwards, the features originating from three different views are fused.

#### **Multi View Panoptic Head and Loss**

The semantic head consists of a pMLP, which maps the final pointwise features $\mathbf{F}\_{\text{sem}}^{\text{PV}}$ to semantic class scores. On the other hand, the offset and center head are applied to the bird's eye view feature maps $\mathbf{F}\_{\text{ins}}^{\text{BEV}}$ and have the same structure as the respective heads of the bird's eye view network BEVNet. In analogy to the range and bird's eye view network, MAE loss is applied to the predicted offsets and MSE loss to the predicted center heatmap. The semantic loss is adapted to the multi view setup. Instead of applying the semantic loss $L\_{\text{sem}} = L\_{\text{CE}} + L\_{\text{Lovász}}$ only to the predicted semantic segmentation of the point backbone, auxiliary semantic heads are added to the range and bird's eye view backbone during training, as depicted in Fig. 4.6. These heads are equal to the semantic heads presented for the range and bird's eye view approaches. As a result, the multi view architecture benefits from the previously presented sparse semantic bird's eye view head too, which speeds up the multi view training. An auxiliary loss is computed based on their predictions to support both backbones in learning meaningful features. Additionally, the auxiliary loss prevents the bird's eye view backbone from focusing too much on the instance segmentation. The overall loss used to train the multi view framework is then defined as:

$$L = \lambda\_{\rm sem}^{\rm RV} \cdot L\_{\rm sem}^{\rm RV} + \lambda\_{\rm sem}^{\rm BEV} \cdot L\_{\rm sem}^{\rm BEV} + \lambda\_{\rm sem}^{\rm PV} \cdot L\_{\rm sem}^{\rm PV} + \lambda\_{\rm ctr} \cdot L\_{\rm MSE}^{\rm BEV} + \lambda\_{\rm off} \cdot L\_{\rm MAE}^{\rm BEV}. \tag{4.16}$$

#### **Data Augmentation**

Pointwise tasks usually suffer from an imbalanced class distribution, since points of frequent classes, such as road or building, occur much more often than points of small, rare classes, such as pedestrian or cyclist. One strategy for 3D panoptic segmentation to mitigate this imbalance is the extraction of instances of these rare classes across the training set to create an instance database. During training, instances from this database are randomly pasted into the 3D point clouds. This data augmentation technique [Zha20c, Xu21a] is called random object augmentation (ROA) in the following and is further improved in this thesis to reduce the domain gap [Due22] as follows:


The second step is approximated by using a fixed bounding box size for each class, which allows pre-computing valid positions for each point cloud and class. To account for the class imbalance, the probability of a class being chosen is inversely proportional to its point frequency.
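The sampling logic can be sketched as follows; the instance database, the pre-computed valid positions, and all parameter names are hypothetical placeholders:

```python
import numpy as np

def random_object_augmentation(cloud, labels, database, valid_positions,
                               class_freq, n_paste=10, rng=None):
    """Pastes instances of rare thing classes into a point cloud.
    database[c]: list of instance point clouds, valid_positions[c]: K x 3 array,
    class_freq[c]: point frequency of class c in the training set."""
    rng = np.random.default_rng() if rng is None else rng
    classes = list(database.keys())
    p = np.array([1.0 / class_freq[c] for c in classes])
    p /= p.sum()                                   # inverse-frequency class probabilities
    for _ in range(n_paste):
        c = int(rng.choice(classes, p=p))
        inst = database[c][rng.integers(len(database[c]))].copy()
        pos = valid_positions[c][rng.integers(len(valid_positions[c]))]
        inst[:, :3] += pos - inst[:, :3].mean(axis=0)   # move instance to a valid position
        cloud = np.concatenate([cloud, inst], axis=0)
        labels = np.concatenate([labels, np.full(len(inst), c)])
    return cloud, labels
```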

# **5 Temporal Panoptic Segmentation of 3D Point Clouds**

The multi view approach presented in the previous chapter focuses on improving 3D panoptic segmentation based on the point cloud information of the current time step. The temporal framework [Due20a] proposed in this chapter goes one step further and exploits temporal information and dependencies in point cloud sequences. These are the result of the consecutive and repeated sensor measurements performed by autonomous vehicles or robots. The proposed recurrent fully convolutional approach aggregates and memorizes information over time to improve the predictions for the latest point cloud based on past information. It builds upon a novel recurrent temporal feature fusion for 2D feature maps, which extends range and bird's eye view approaches with a temporal memory to exploit past feature maps. Subsequently, the multi view architecture is extended by the temporal fusion in range and bird's eye view to combine multi view and temporal benefits.

## **5.1 Temporal Range View**

Temporal approaches use and exploit previous predictions, feature maps, or input data to improve current predictions, which distinguishes them from single frame approaches considering only the latest frame. In order to develop a beneficial temporal fusion strategy and architecture for the range view, it is essential to consider the requirements of the targeted domain of real-time autonomous systems. In this domain, sensors provide new recordings every $\Delta t$ and the autonomous system requires new predictions with low latency for these. The proposed recurrent architecture with recursive feature fusion addresses the drawbacks of overlapping and fixed-size temporal windows discussed in Section 3.2 for recurring recordings. It builds upon RNNs, which compute their output based on the input and previous outputs and resemble infinite impulse response filters. As a result, the underlying idea is based on two elements, a single frame backbone and a hidden state or temporal memory. The single frame backbone was presented in Section 4.1 and computes feature maps $\mathbf{F}\_t^{\text{RV}}$ for the latest input range image to provide the information extracted from the current point cloud. The temporal memory $\mathbf{H}\_t$, on the other hand, contains the temporal feature maps recursively aggregated over time. The recursive update step $\mathcal{H}$ combines both elements and updates the temporal memory with the latest feature maps provided by the single frame backbone. It can generally be formulated as:

$$\mathbf{H}\_{t} = \mathcal{H}(\mathbf{F}\_{t}^{\text{RV}}, \mathbf{H}\_{t-1}) = \mathcal{H}(\mathbf{F}\_{t}^{\text{RV}}, \mathcal{H}(\mathbf{F}\_{t-1}^{\text{RV}}, \mathbf{H}\_{t-2})) = \dots \tag{5.1}$$

One significant advantage of the recursive update is the reuse of all previous feature computations. When a new recording arrives, the backbone computes $\mathbf{F}\_t^{\text{RV}}$, analogous to the single frame approach. The temporal memory $\mathbf{H}\_{t-1}$ has already been computed in the previous time step and is reused. Therefore, the computational effort is only increased by the temporal update $\mathcal{H}$, which is performed once in every time step. This increase is independent of the processed sequence length, and no fixed temporal window size or tradeoff between exploited past frames and computational effort is necessary. The number of considered past frames is potentially unlimited since their information is aggregated in the temporal memory. The network learns to integrate or forget information as part of the training. One remaining challenge is the alignment of feature maps between two time steps. The backbone extracts the feature maps $\mathbf{F}\_t^{\text{RV}}$ in the range view defined relative to the current ego position. However, the latter constantly changes due to ego motion. Consequently, a spatial transformation has to recursively transform the temporal memory $\mathbf{H}\_{t-1}$ from the range view of the last to the range view of the current ego position.

The proposed recurrent fully convolutional architecture T-RVNet is depicted in Fig. 5.1, where the temporal alignment is applied prior to the update step. Different alignment and update strategies are proposed and discussed in the following. Since the single frame backbone returns semantic and instance feature maps, the temporal pipeline and memory are required twice, illustrated in Fig. 5.1. The panoptic head and loss are inherited from the range view network of Section 4.1. The proposed temporal training strategy trains the temporal framework on short data sequences, comprising several tens of frames. This strategy ensures the presence of temporal dependencies and simultaneously retains a significant variation in the data.
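The resulting inference loop can be summarized in a few lines. In the following Python sketch, `backbone`, `align`, `update`, and `head` are placeholders for the single frame backbone, the temporal alignment of Section 5.1.1, the memory update of Section 5.1.2, and the panoptic head; the zero-initialized memory for the first frame is an assumption taken from the training description in Section 5.1.3.

```python
import numpy as np

def process_stream(frames, backbone, align, update, head):
    """Recursive temporal inference: only the arriving frame is processed,
    past information is carried in the temporal memory (cf. Eq. (5.1))."""
    memory = None
    for frame, ego_pose in frames:
        features = backbone(frame)                  # single frame feature maps F_t
        if memory is None:
            memory = np.zeros_like(features)        # zero-filled initial memory
        aligned = align(memory, ego_pose)           # ego motion compensation (Section 5.1.1)
        memory = update(features, aligned)          # recursive memory update (Section 5.1.2)
        yield head(memory)                          # low-latency prediction per frame
```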

**Figure 5.1:** The recurrent temporal architecture T-RVNet with its components, unrolled for two time steps. The temporal memory has the same spatial dimensions and channel size as the backbone feature maps $\mathbf{F}^{\text{RV}}$. The temporal pipeline is required twice and applied to the semantic feature maps $\mathbf{F}\_{\text{sem}}^{\text{RV}}$ and the instance feature maps $\mathbf{F}\_{\text{ins}}^{\text{RV}}$.

### **5.1.1 Temporal Memory Alignment**

The temporal memory alignment enables the recurrent architecture and recursive memory update depicted in Fig. 5.1. It addresses the underlying challenge of transforming the temporal memory $\mathbf{H}\_{t-1}$ into the current range view, defined relative to the current ego pose, prior to the memory update. The transformation into the latest range view is based on Eqs. (3.9) and (3.11):

$$\hat{\mathbf{H}}\_{t-1} = \hat{\mathcal{S}}\left(\mathbf{H}\_{t-1}, \mathbf{V}\_{t-1}\right).\tag{5.2}$$

Afterwards, the transformed memory $\hat{\mathbf{H}}\_{t-1}$ is spatially aligned with $\mathbf{F}\_t^{\text{RV}}$. Two strategies for the computation of the index tensor $\mathbf{V}\_{t-1}$ are introduced, the backward (bwd) strategy

$$\begin{aligned} \mathbf{V}\_{t-1}^{\text{bwd}} &= \left[ V\_{u, \upsilon, d} \right] = \left[ \mathcal{T}\_{t \to (t-1)}^{\text{RV}} (\mathbf{g}\_{u, \upsilon})\_d \right], \quad \mathbf{G} = \mathbf{P}\_t^{\text{RV}}, \\ \mathcal{T}\_{t \to (t-1)}^{\text{RV}} \left( \mathbf{g} \right) &= \left( \mathcal{P}^{\text{RV}} \circ \mathcal{Q}^{\mathcal{S}} \right) \left( \mathbf{T}\_{t \to (t-1)} \cdot \mathbf{g} \right) \end{aligned} \tag{5.3}$$

and the forward (fwd) strategy:

$$\begin{aligned} \mathbf{V}\_{t-1}^{\text{fwd}} &= \text{reverse} \left( \mathbf{V}\_{t}^{\prime} \right), \\ \mathbf{V}\_{t}^{\prime} &= \left[ V\_{u, \upsilon, d}^{\prime} \right] = \left[ \mathcal{T}\_{(t-1) \to t}^{\text{RV}} \left( \mathbf{g}\_{u, \upsilon} \right)\_{d} \right], \quad \mathbf{G} = \mathbf{P}\_{t-1}^{\text{RV}}, \\ \mathcal{T}\_{(t-1) \to t}^{\text{RV}} \left( \mathbf{g} \right) &= \left( \mathcal{P}^{\text{RV}} \circ \mathcal{Q}^{\mathcal{S}} \right) \left( \mathbf{T}\_{(t-1) \to t} \cdot \mathbf{g} \right). \end{aligned} \tag{5.4}$$

The transformation $\hat{\mathcal{S}}$ used with the forward strategy natively transforms from $t$ to $t-1$. However, the temporal memory must be transformed from $t-1$ to $t$. Therefore, the native index tensor $\mathbf{V}\_t^{\prime}$, which contains the index $\mathbf{u}'$ in cell $\mathbf{u}^\*$, must be reversed, see Fig. 5.2. Afterwards, the new index tensor $\mathbf{V}\_{t-1}^{\text{fwd}}$ contains the index $\mathbf{u}^\*$ in cell $\mathbf{u}'$, which allows $\hat{\mathcal{S}}$ to access the previous temporal memory $\mathbf{H}\_{t-1}$ at $\mathbf{u}^\*$ and move its content to cell $\mathbf{u}'$ in the temporally aligned memory $\hat{\mathbf{H}}\_{t-1}$.

At first glance, both strategies look rather similar. However, there is an important difference in the index tensors $\mathbf{V}\_{t-1}^{\text{bwd}}$ and $\mathbf{V}\_{t-1}^{\text{fwd}}$. On the one hand, the backward strategy computes for every cell of the *current* time step where its contained 3D point would have been in the last time step, as illustrated in Fig. 5.2. It transforms the 3D points $\mathbf{P}\_t^{\text{RV}}$ from the current to the previous ego pose, followed by a range view projection. On the other hand, the forward strategy considers for every cell of the *previous* time step where its content would be in the current time step. It transforms the points $\mathbf{P}\_{t-1}^{\text{RV}}$ of the previous time step to the current ego pose, followed by a range view projection. The underlying assumption in both cases is that the cell $\mathbf{u}^\*$ in the previous and the cell $\mathbf{u}$ in the current range view contain spatially close measurements. The notable difference between the backward and forward strategy is that the former uses $\mathbf{P}\_t^{\text{RV}}$ and the latter $\mathbf{P}\_{t-1}^{\text{RV}}$ to associate the cells $\mathbf{u}^\*$ and $\mathbf{u}$. Consequently, the forward and backward association is sometimes asymmetric:

$$\exists \mathbf{u} \; : \; \mathcal{T}\_{t \rightarrow (t-1)}^{\text{RV}} \left( \mathbf{p}\_{\mathbf{u}, t} \right) = \mathbf{u}^\* \land \mathcal{T}\_{(t-1) \rightarrow t}^{\text{RV}} \left( \mathbf{p}\_{\mathbf{u}^\*, t-1} \right) = \mathbf{u}' \neq \mathbf{u}, \tag{5.5}$$

where $\mathbf{p}\_{\mathbf{u},t}$ is the 3D point that was projected into cell $\mathbf{u}$ at time $t$ and $\mathbf{p}\_{\mathbf{u}^\*,t-1}$ was projected into cell $\mathbf{u}^\*$ at time $t-1$. The underlying reason for the asymmetry is that $\mathbf{p}\_{\mathbf{u},t} \neq \mathbf{p}\_{\mathbf{u}^\*,t-1}$. In general, this is expected since a lidar sensor never records exactly the same 3D point in the current and the last time step. Most of the time, these points are spatially close, and $\mathbf{u}'$ equals $\mathbf{u}$ or is an adjacent cell, depending on the quantization. In this case, the spatial deviations are small, and the temporal alignment works well without considerable differences between both strategies.

**Figure 5.2:** Association of range view cells across time, based on the backward (dotted) and forward strategy (dashed). In the first example, both strategies are symmetric, while in the second, they provide different pairs of associated cells.

However, there are additional and more severe reasons for $\mathbf{p}\_{\mathbf{u},t} \neq \mathbf{p}\_{\mathbf{u}^\*,t-1}$: shadowing and moving objects. They have different causes but a similar impact, with an example for shadowing shown in Fig. 5.3. In scenario (a), the point of a pole is measured. This point is projected into cell $\mathbf{u}$ of the latest range view and into cell $\mathbf{u}^\*$ when transformed and projected into the range view of the previous ego pose and time step. However, this pole was invisible in the previous time step since a building was in the line of sight. Hence, the point $\mathbf{p}\_{\mathbf{u}^\*,t-1}$, which was actually recorded and projected into cell $\mathbf{u}^\*$, lies on the corner of the building, with a significant spatial distance to $\mathbf{p}\_{\mathbf{u},t}$. Scenario (b) shows the opposite setup, where the corner of the building recorded in the previous time step is no longer visible because the pole shadows it. While moving objects cause similar effects, the reason there is predominantly the movement of the objects themselves and not the change of sensor viewpoint induced by ego motion.

The forward and backward strategies are prone to different errors in these scenarios. The forward strategy generates wrong associations in scenarios similar to (b), where an obstacle or moving object hides previously recorded areas. Since it transforms the hidden point $\mathbf{p}\_{\mathbf{u}^\*,t-1}$, it wrongly matches $\mathbf{u}$ and $\mathbf{u}^\*$. The backward strategy, on the other hand, handles these scenarios well and only associates cell $\mathbf{u}$ with $\mathbf{u}'$ since it relies on $\mathbf{p}\_{\mathbf{u},t}$. However, it struggles in scenarios where previously hidden areas become visible, such as scenario (a), and incorrectly associates $\mathbf{u}$ and $\mathbf{u}^\*$. As a counterpart, the forward strategy can be applied successfully in these scenarios.

**Figure 5.3:** Two different scenarios illustrate a wrong association for either the backward (a) or forward (b) strategy.

The examples and discussion above show that scenarios exist for both strategies where wrong cells get associated across time. While it is possible to leave it up to the network to learn to handle this kind of error, further investigation is beneficial. Therefore, a detection mechanism is proposed to detect wrong associations based on the spatial distance $\Delta d$ between the expected and the measured 3D point, illustrated in Fig. 5.3. Based on the decision criterion $\Delta d \leq d\_{\text{range}}$, the affected cells of the temporal memory can be discounted with a factor $\gamma \in [0, 1]$ or explicitly deactivated. Based on this mechanism, combining both strategies and choosing the strategy with the smallest $\Delta d$ for each cell is also possible.
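A minimal sketch of this detection mechanism: the spatial deviation per cell is compared against the threshold, and affected memory cells are discounted. The concrete threshold and discount values are hypothetical.

```python
import numpy as np

def discount_wrong_associations(memory, P_expected, P_measured, d_range=1.0, gamma=0.5):
    """memory: H x W x C aligned temporal memory; P_expected / P_measured:
    H x W x 3 positions of the 3D points associated across time. Cells whose
    spatial deviation exceeds d_range are discounted with gamma (gamma=0
    deactivates them explicitly)."""
    delta = np.linalg.norm(P_expected - P_measured, axis=2)      # H x W deviation
    factor = np.where(delta <= d_range, 1.0, gamma)              # keep vs. discount
    return memory * factor[..., None]
```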

### **5.1.2 Temporal Memory Update**

The temporal memory update fuses the latest feature maps with the aligned temporal memory. Initially, RNNs in natural language processing used LSTMs or GRUs to combine the temporal memory, or hidden state, with the current input. Both approaches emerged to address the vanishing or exploding gradient problem and have also been adapted to the 2D image domain by replacing fully connected layers with convolutional layers [Shi15, Sia17]. Although LSTMs have more gates and thus more parameters, GRUs showed similar performance [Chu14], and no clear advantage of either approach could be demonstrated. Hence, the investigated baseline fusion strategy is based on a ConvGRU, introduced in Section 2.1.3.

However, one drawback of the established ConvGRU is that it considers only a small spatial context based on a single convolutional layer. In fact, ConvGRUs aggregate no context at all if a 1 × 1-convolution is used. This section proposes improved strategies to address this drawback. These can generally be grouped into two categories, gated strategies based on extended ConvGRUs and residual non-gated strategies. The gated update strategy is an enhanced ConvGRU, called ContextGRU and illustrated in Fig. 5.4. It aggregates sophisticated context based on the temporal memory and the current feature maps. A small residual network is applied to the concatenation of the temporal memory $\mathbf{H}\_{t-1}$ and the latest feature maps $\mathbf{F}\_t^{\text{RV}}$ to aggregate combined context for a significantly improved candidate memory $\tilde{\mathbf{H}}\_t$. The chosen location also ensures that the gradient flow through time stays untouched to prevent reintroducing exploding or vanishing gradients.

The residual update strategies use no gating mechanisms but rely on a concatenation followed by a residual network. At first glance, these approaches are prone to the vanishing or exploding gradient problem because the number of layers the gradient has to pass increases with every additional time step the gradient is backpropagated. Nevertheless, a residual strategy is promising since residual networks were designed as very deep networks with many stacked layers. Additionally, gradients will usually be backpropagated only a few time steps in the considered context. The reasons are twofold. First, the last few frames contain the most valuable information, which naturally diminishes with temporal distance. Therefore, backpropagating the gradient dozens or even hundreds of time steps adds no significant benefit. Second, computational and especially memory demands allow only a few steps, up to ten in this thesis. Based on this discussion, the maximum depth for the backpropagation can be computed. Considering ten steps in time and a residual network consisting of four BBs for the update, the maximum number of layers from output to input is 118 convolutional layers. Since residual networks with up to 152 layers are frequently used [He16, Zha17, Yu18], the residual update strategy is well-suited and promising.
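A minimal PyTorch sketch of a residual, non-gated update: the aligned memory and the current feature maps are concatenated, reduced, and refined by a small residual block. The exact number and configuration of blocks are assumptions.

```python
import torch
import torch.nn as nn

class ResidualMemoryUpdate(nn.Module):
    """Non-gated residual memory update: the aligned memory and the current
    feature maps are concatenated, reduced, and refined by a residual block."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.LeakyReLU(0.1)

    def forward(self, features, memory_aligned):
        # features, memory_aligned: B x C x H x W
        x = self.reduce(torch.cat([memory_aligned, features], dim=1))
        return self.act(x + self.block(x))           # one residual basic block
```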

**Figure 5.4:** Update strategies for the temporal memory based on a gated ContextGRU and a residual network. The ContextGRU extends the original GRU, illustrated in Fig. 2.7.

### **5.1.3 Temporal Training**

The underlying training strategy for recurrent architectures differs significantly since temporal dependencies only exist for a sequence of consecutive point clouds. On the one hand, this requires datasets that contain sequential data. On the other hand, training with randomly drawn samples from the dataset, which is a common strategy, no longer works. A straightforward strategy provides the data sequentially based on the native sequences in the underlying dataset. However, this strategy has several drawbacks:


The proposed sequence-based training strategy addresses these challenges by training the temporal approaches with subsequences of point clouds. Sequence lengths of 25, 50, and 100 frames are investigated. This strategy significantly improves the variety because after every subsequence a new and, in most cases, distinct subsequence is randomly drawn. Additionally, the variation between epochs is considerably improved since the number of sequences to shuffle is usually two or three orders of magnitude higher with short subsequences than with native sequences. The restriction of the sequence length is a negligible limitation since the most valuable temporal dependencies are short-term. Point clouds recorded minutes or hundreds of meters away contain little relevant information for the current frame.

Alongside the training strategy, the gradient flow also differs from optimizing a single frame network. Predictions depend not only on the current but also on previous inputs and computations. As a result, errors and gradients can be backpropagated through time. Thereby, the single frame backbone learns to compute valuable feature maps for the current *and* future frames. Additionally, the memory update learns to combine current and past information. Truncated Backpropagation Through Time (TBPTT) [Wil90] makes gradient propagation through time computationally manageable by truncating it after $n\_2$ steps. While the training with subsequences already increases the training variation, they still contain similar data. Therefore, the proposed strategy updates the weights only every $n\_1$ steps to reduce the number of weight updates on strongly related data. These sparse updates also enable training on datasets where only every $n\_1$-th point cloud has labels. Since training on a subsequence starts with a zero-filled memory, the first weight update is delayed for $n\_3$ steps. The delay allows the memory to aggregate meaningful temporal information prior to the first weight update. The zero-filled memory is also the initial state for inference. An overview of the presented temporal training configuration is shown in Fig. 5.5.

**Figure 5.5:** The proposed temporal training strategy [Due20a]. After a warm-up phase of $n\_3$ steps to fill the initially empty memory, loss and weight updates are computed every $n\_1$ steps. The gradient is propagated $n\_2$ frames back through time before it is truncated.
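The schedule can be sketched as follows, assuming a recurrent model that returns a prediction and the updated memory, and a list of (input, target) frames; the model interface and the placement of the truncation are illustrative.

```python
def train_subsequence(model, frames, optimizer, loss_fn, n1=5, n2=4, n3=10):
    """Temporal training schedule (sketch). Warm up for n3 frames with an
    initially empty memory, compute a loss and weight update every n1 frames,
    and truncate backpropagation through time n2 frames before each update.
    `model(inputs, memory)` is assumed to return (prediction, new_memory)."""
    memory = None                                     # zero-filled inside the model
    update_steps = range(n3, len(frames), n1)         # frames with a weight update
    detach_steps = {t - n2 for t in update_steps}     # cut the graph n2 frames earlier
    for t, (inputs, targets) in enumerate(frames):
        pred, memory = model(inputs, memory)
        if t in detach_steps:                         # truncated BPTT boundary
            memory = memory.detach()
        if t in update_steps:
            loss = loss_fn(pred, targets)
            optimizer.zero_grad()
            loss.backward()                           # flows at most n2 frames back
            optimizer.step()
```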

## **5.2 Temporal Bird's Eye View**

The presented recurrent temporal fusion with recursive feature map transformation is not restricted to the range view. It can extend arbitrary approaches relying on grid-based views as long as a unique 3D position can be assigned to the cells of the respective view. Therefore, this concept can also be applied to the bird's eye view to create a similar recurrent temporal architecture, depicted in Fig. 5.6 and called T-BEVNet. It extends the bird's eye view network introduced in Section 4.2, which consists of the presented bird's eye view backbone and the sparse panoptic head. The latter comprises the introduced sparse semantic, offset, and center head. The notable difference to the temporal range view approach concerns the temporal alignment step, which needs to transform polar bird's eye view cells across time instead of range view cells. The memory update and temporal training strategies are equal to the strategies presented in the previous section.

**Figure 5.6:** The recurrent temporal bird's eye view T-BEVNet architecture unrolled for two time steps. The temporal pipeline is required twice and applied to the semantic and instance feature maps of the bird's eye view.

#### **Temporal Memory Alignment**

In general, the temporal alignment for the bird's eye view feature maps follows Eq. (5.2). The difference between the range and bird's eye view alignment originates from the definition of the 3D cell positions. Bird's eye view cells usually contain multiple points and simultaneously have a unique 3D position, independently of the contained data. These properties motivate the choice of the cell center as 3D position. On the other hand, a range view cell has no 3D position without considering the data since empty cells do not have an assigned distance. Therefore, the backward strategy

$$\begin{aligned} \mathbf{V}\_{t-1}^{\text{bwd}} &= \left[ V\_{u, \upsilon, d} \right] = \left[ \mathcal{T}\_{t \to (t-1)}^{\text{BEV}} (\mathbf{g}\_{u, \upsilon})\_d \right], \quad \mathbf{G} = \mathbf{G}^{\text{BEV}}, \\ \mathcal{T}\_{t \to (t-1)}^{\text{BEV}} (\mathbf{g}) &= \left( \mathcal{P}^{\text{BEV}} \circ \mathcal{Q}^{\text{z}} \right) \left( \mathbf{T}\_{t \to (t-1)} \cdot \mathbf{g} \right) \end{aligned} \tag{5.6}$$

and forward strategy

$$\begin{aligned} \mathbf{V}\_{t-1}^{\text{fwd}} &= \text{reverse} \left( \mathbf{V}\_{t}^{\prime} \right), \\ \mathbf{V}\_{t}^{\prime} &= \left[ V\_{u, \upsilon, d}^{\prime} \right] = \left[ \mathcal{T}\_{(t-1) \rightarrow t}^{\text{BEV}} \left( \mathbf{g}\_{u, \upsilon} \right)\_{d} \right], \quad \mathbf{G} = \mathbf{G}^{\text{BEV}}, \\ \mathcal{T}\_{(t-1) \rightarrow t}^{\text{BEV}} \left( \mathbf{g} \right) &= \left( \mathcal{P}^{\text{BEV}} \circ \mathcal{Q}^{\text{z}} \right) \left( \mathbf{T}\_{(t-1) \rightarrow t} \cdot \mathbf{g} \right) \end{aligned} \tag{5.7}$$

rely on the 3D positions $\mathbf{G}^{\text{BEV}}$ of the bird's eye view cells. These are defined by their centers:

$$\begin{aligned} G\_{u,\upsilon,1} &= r\_{u,\upsilon} \cdot \cos\left(\phi\_{u,\upsilon}\right), \quad G\_{u,\upsilon,2} = r\_{u,\upsilon} \cdot \sin\left(\phi\_{u,\upsilon}\right), \quad G\_{u,\upsilon,3} = 0, \\ r\_{u,\upsilon} &= r\_{\text{min}} + \frac{u - 0.5}{H} \cdot r\_{\text{fov}}, \quad \phi\_{u,\upsilon} = \phi\_{\text{min}} + \frac{\upsilon - 0.5}{W} \cdot \phi\_{\text{fov}}.\end{aligned} \tag{5.8}$$

The grid is fixed in the ego coordinate system and as such independent of any input data and time. The data independence also eliminates alignment errors induced by shadowing. Therefore, moving objects are the only cause of alignment errors in the bird's eye view. Figure 5.7 depicts both strategies.

**Figure 5.7:** Association of bird's eye view cells across time with the backward (dotted) and forward strategy (dashed).
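As an illustration of the backward strategy in this view, the sketch below computes the polar cell centers of Eq. (5.8), transforms them with the ego motion, and quantizes them back to grid indices used to gather the previous memory. The nearest-cell lookup and all parameter names are simplifications; the thesis' sampling operator may interpolate instead.

```python
import torch

def bev_cell_centers(H, W, r_min, r_fov, phi_min, phi_fov):
    """Homogeneous 3D centers of the polar bird's eye view cells, following Eq. (5.8)."""
    u = torch.arange(1, H + 1, dtype=torch.float32).view(H, 1)
    v = torch.arange(1, W + 1, dtype=torch.float32).view(1, W)
    r = r_min + (u - 0.5) / H * r_fov
    phi = phi_min + (v - 0.5) / W * phi_fov
    x, y = r * torch.cos(phi), r * torch.sin(phi)           # broadcast to (H, W)
    zeros, ones = torch.zeros_like(x), torch.ones_like(x)
    return torch.stack([x, y, zeros, ones], dim=-1)         # (H, W, 4)

def backward_alignment_indices(T_t_to_prev, H, W, r_min, r_fov, phi_min, phi_fov):
    """Backward strategy (sketch): map each current cell center into the previous
    ego frame and quantize it to that grid, so the aligned memory can be gathered
    as memory_prev[:, u_idx, v_idx]."""
    g = bev_cell_centers(H, W, r_min, r_fov, phi_min, phi_fov).reshape(-1, 4)
    g_prev = (T_t_to_prev @ g.T).T[:, :3]                    # ego motion compensation
    r = torch.linalg.norm(g_prev[:, :2], dim=1)
    phi = torch.atan2(g_prev[:, 1], g_prev[:, 0])
    u_idx = ((r - r_min) / r_fov * H).long().clamp(0, H - 1)
    v_idx = ((phi - phi_min) / phi_fov * W).long().clamp(0, W - 1)
    return u_idx.view(H, W), v_idx.view(H, W)
```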

## **5.3 Temporal Multi View Network**

The next important step is the successful combination of the multi view and temporal framework to combine multi view and temporal benefits. However, two challenges arise when attempting to add temporal capabilities to the proposed multi view approach. First, temporal fusion is required for the semantic and instance branch to fully benefit from temporal information. The semantic branch is based on the point backbone's multi view features, and the instance branch relies on bird's eye view feature maps. Second, the multi view features of the point backbone are pointwise features and challenging to associate with aggregated temporal information from the previous time step. Spatio-temporal proximity in the point view requires a nearest neighbor search, which is computationally expensive for large point clouds and even more so for point clouds combined across time. The presented temporal framework for 2D approaches in range and bird's eye view is the foundation to address these challenges. It integrates temporal fusion into the range and bird's eye view branch.

The combined architecture T-MVNet is depicted in Fig. 5.8. It extends the range view branch with the temporal range view fusion presented in Section 5.1. Since the backbone, as part of the multi view architecture, returns only semantic feature maps, a single temporal pipeline is sufficient. On the other hand, the bird's eye view branch is enhanced with the temporal bird's eye view fusion proposed in Section 5.2, which requires the temporal pipeline twice. The architecture of the point backbone is unchanged. However, the last multi view aggregation step no longer uses the final feature maps of the backbones but the aggregated temporal memories of both 2D views. Therefore, it aggregates temporal range and bird's eye view features to compute temporal multi view features, which improve the 3D semantic segmentation. In parallel, the second temporal bird's eye view pipeline computes the temporal memory for the instance feature maps provided to the offset and center head. Hence, the offset vectors and center heatmap are also temporally enhanced and, thereby, the instance segmentation. As a result, both subtasks of panoptic segmentation benefit from the proposed temporal multi view approach.

**Figure 5.8:** Temporal multi view architecture T-MVNet, which combines multi view and temporal benefits. One temporal memory is added for the semantic range view feature maps and two memories for the semantic and instance bird's eye view feature maps, respectively.

# **6 Multimodal Panoptic Segmentation of 3D Point Clouds**

The proposed temporal multi view framework focuses on the lidar sensor as a single sensor modality combined with temporal information. The multimodal multi view framework introduced in this chapter goes one step further and exploits the camera as an additional sensor modality. Since sensor fusion is combined with the multi view architecture, different lidar views are available for fusion. A promising combination is based on camera images and the lidar range view, as motivated in Section 3.3. Consequently, this chapter investigates the fusion of camera and lidar and presents a novel multi scale deep fusion network [Due20b, Due21, Sch22a] to fuse lidar range view feature maps with camera feature maps. Based on the resulting range fusion backbone, the temporal multi view approach is enhanced to a multimodal framework, which combines multi view, temporal, *and* multi sensor benefits.

## **6.1 Multi Sensor Range View**

The sensor fusion range view network SF-RVNet is designed to combine and fuse lidar and camera information in the range view. It addresses two main drawbacks of the existing approaches discussed in Section 2.5. First, the novel multi scale fusion provides considerably improved multi sensor features to enhance both semantic *and* panoptic segmentation. Second, the architecture and training strategy decouple both sensor backbones to keep them independent for increased robustness against sensor failure. A major drawback of many fusion approaches [Mey19a, Kri20, Zhu21c] is their dependency on both sensors. Without the proposed design, a fusion approach degrades considerably when a sensor fails or provides invalid output.

The suggested fusion architecture depicted in Fig. 6.1 relies on lidar and camera backbones, which are connected by the fusion branch. The lidar range view backbone presented in Section 4.1 is used again and computes range view feature maps at multiple stages and scales. The exchangeable camera backbone has the same task and provides multi scale camera feature maps. At least three different scales are required for the proposed and deployed fusion strategies. In general, an arbitrary image network can be used as long as it provides the three different scales of camera feature maps, which is fulfilled by most established architectures. Since ResNets are predominantly used as feature extractors in state-of-the-art image networks, ResNet-50 and ResNet-101 are chosen exemplarily in this approach. The feature maps at the end of each stage are potential candidates for fusion, illustrated in Fig. 6.1, and identified by the respective block based on the official naming convention [He16]. The final block of the fourth stage differs between ResNet-50 and ResNet-101.

**Figure 6.1:** Range fusion network SF-RVNet based on lidar range view and camera backbone. A fusion branch connects the backbones by fusing the feature maps of both sensors.

The fusion branch is the vital link between both backbones and is responsible for the fusion of lidar and camera features. It transforms camera feature maps from the camera image to the range view, followed by their fusion with different strategies introduced in the following. The fused semantic and instance feature maps are finally provided to the standard panoptic head, introduced in Section 4.1. Furthermore, to train the overall range fusion framework successfully, different training strategies are investigated regarding panoptic results and the impact of sensor failure.

### **6.1.1 Sensor Fusion**

The core element of the fusion network is the fusion branch, which combines lidar range view and camera image feature maps to provide improved features containing the information of both sensors. Following the concept discussed in Section 3.3, deep feature fusion is chosen as the overarching fusion strategy, motivated by the discussed advantages over early and late fusion. In contrast to the latter, deep fusion is a generic strategy because the high number of intermediate feature maps offers countless possibilities for their combination and fusion. Hence, specific deep fusion strategies or architectures have to define several steps. First, feature maps from both sensors and different stages must be selected for fusion. In the second step, a common representation is required for both sensor modalities and a spatial feature transformation into the chosen representation. Finally, a multi scale aggregation strategy for the fused feature maps is required to provide the final multi sensor feature maps. Two novel deep fusion strategies [Due21, Sch22a] are proposed in the following, which address the discussed challenges while proposing different aggregation strategies.

In order to achieve a multi scale fusion, both strategies rely on multiple lidar and camera feature maps, which originate from different scales. The lidar backbone computes feature maps at three different scales, depicted in Fig. 6.1. Therefore, the fusion strategies combine these three scales with camera feature maps at the same number of distinct scales to fully exploit the multi scale potential. The ResNet camera backbone offers up to five different scales. Since there is no obvious most promising combination, the most beneficial one is evaluated by experiments. In the next step, the camera feature maps are transformed into the lidar range view since the target is a lidar panoptic segmentation. This transformation is based on Eq. (3.9) and the following equation:

$$\begin{aligned} {}^{\text{IMG}}\mathbf{F}\_{j}^{\text{RV}} &= \mathcal{S} \left( \mathbf{F}\_{j}^{\text{IMG}}, \mathbf{V}\_{j}^{\text{IMG}} \right), \\ \mathbf{V}\_{j}^{\text{IMG}} &= \left[ V\_{u, \upsilon, d} \right] = \left[ \mathcal{T}\_{j}^{\text{IMG}} (\mathbf{g}\_{u, \upsilon})\_{d} \right], \quad \mathbf{G} = \mathbf{P}^{\text{RV}}, \\ \mathcal{T}\_{j}^{\text{IMG}}(\mathbf{g}) &= \mathcal{P}\_{j}^{\text{IMG}} \left( \mathbf{T}^{\text{li} \rightarrow \text{cam}} \cdot \mathbf{g} \right). \end{aligned} \tag{6.1}$$

The projection depends on the stage $j$ due to the different camera feature map sizes. Initially, the projected camera features ${}^{\text{IMG}}\mathbf{F}\_{j}^{\text{RV}}$ of stage $j$ in the range view still have the range view's original resolution. After a bilinear downsampling step to match the size of the lidar feature maps, both sensors can be fused. The last step aggregates the three stages to combine and exploit features at different scales.
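A minimal sketch of this transformation is given below: the range view cell positions are mapped into the camera frame, projected with the intrinsics, and the camera feature map is sampled bilinearly. It assumes the intrinsics `K` have already been rescaled to the resolution of the feature map at stage $j$; the masking of points behind the camera and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def sample_camera_features(cam_features, points_rv, T_li_to_cam, K):
    """Transform camera features into the lidar range view (sketch of Eq. (6.1)).

    cam_features: (1, C, Hc, Wc) feature map of one camera stage
    points_rv:    (Hr, Wr, 3) 3D points assigned to the range view cells
    T_li_to_cam:  (4, 4) lidar-to-camera extrinsics, K: (3, 3) intrinsics
                  (assumed rescaled to the feature map resolution of this stage)
    """
    Hr, Wr, _ = points_rv.shape
    pts = torch.cat([points_rv.reshape(-1, 3),
                     torch.ones(Hr * Wr, 1)], dim=1)             # homogeneous points
    cam = (T_li_to_cam @ pts.T)[:3]                              # (3, N) in camera frame
    uvw = K @ cam
    uv = uvw[:2] / uvw[2].clamp(min=1e-6)                        # pixel coordinates
    _, _, Hc, Wc = cam_features.shape
    grid = torch.stack([uv[0] / (Wc - 1) * 2 - 1,                # normalize to [-1, 1]
                        uv[1] / (Hc - 1) * 2 - 1], dim=-1).view(1, Hr, Wr, 2)
    feats = F.grid_sample(cam_features, grid, align_corners=True)
    valid = (cam[2] > 0).view(1, 1, Hr, Wr)                      # drop points behind camera
    return feats * valid                                         # (1, C, Hr, Wr)
```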

The first strategy follows an iterative or hierarchical pattern [Due21] and is illustrated in Fig. 6.2. The main component is a fusion module, which transforms the camera feature maps into the range view following Eq. (6.1) and reduces their large feature channel size to 64 for the first two stages and 128 for the third stage. This reduction ensures equal channel sizes for lidar and camera feature maps and equal influence of both sensors. Afterwards, they are concatenated and processed by a residual block to provide fused feature maps with 96 channels for the first two stages and 192 for the third. In the second step, the fused features from the previous stage and scale are combined with the current stage by concatenation and two additional residual blocks. By splitting every module into a sensor fusion and a fusion refinement step, the network can focus on a beneficial sensor fusion first, then on combining multi scale multi sensor features. The refinement step is omitted for the first stage since there is no preceding one. By stacking three fusion modules, all three scales of the lidar and camera backbone are iteratively exploited and aggregated for the final multi sensor feature maps. Two parallel modules are deployed in the last step to fuse and provide semantic and instance feature maps to the panoptic head.

**Figure 6.2:** Iterative fusion strategy based on alternating sensor fusion and feature refinement steps.
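The following PyTorch sketch outlines one such fusion module under the assumptions stated above; the `ResidualBlock`, the channel arguments, and the omission of any spatial resizing between stages are simplifications, not the exact architecture of the thesis.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block used as a stand-in for the blocks of the fusion branch."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return nn.functional.leaky_relu(self.body(x) + self.skip(x), 0.1)

class IterativeFusionModule(nn.Module):
    """One stage of the iterative fusion (sketch): reduce the camera channels,
    concatenate with the lidar feature maps, fuse them with a residual block, and
    refine together with the fused features of the previous stage (omitted for
    the first stage, which has no predecessor)."""
    def __init__(self, c_lidar, c_cam, c_cam_red, c_fused, c_prev=None):
        super().__init__()
        self.reduce = nn.Conv2d(c_cam, c_cam_red, 1)             # camera channel reduction
        self.fuse = ResidualBlock(c_lidar + c_cam_red, c_fused)  # sensor fusion step
        self.refine = (nn.Sequential(ResidualBlock(c_fused + c_prev, c_fused),
                                     ResidualBlock(c_fused, c_fused))
                       if c_prev is not None else None)          # fusion refinement step

    def forward(self, f_lidar, f_cam_rv, f_prev=None):
        fused = self.fuse(torch.cat([f_lidar, self.reduce(f_cam_rv)], dim=1))
        if self.refine is not None and f_prev is not None:
            fused = self.refine(torch.cat([fused, f_prev], dim=1))
        return fused
```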

The second strategy [Sch22a] builds upon the idea of FPNs and is depicted in Fig. 6.3. Lidar range view and camera feature maps are combined by the previously presented sensor fusion step. The three resulting multi sensor feature maps are then simultaneously aggregated in a bottom-up and top-down feature pyramid to compute multi scale features. On the one side, the top-down pyramid (Dn) aggregates multi sensor features starting with fine details and incorporates more and more context. On the other side, the bottom-up pyramid (Up) starts with aggregated context and adds more and more details. Both pyramids build upon the feature refinement step depicted in Fig. 6.2. An ablation study investigates the replacement of this module by a simple concatenation and 3 × 3-convolution as a more lightweight alternative. The outputs of both pyramids are combined by a convolutional layer, BN, and optionally LReLU (Py). While the iterative strategy doubles only the last module for parallel semantic and instance features, this is impossible for the pyramid fusion. The reason is that the parallel semantic and instance feature maps from the range view are not fused in the last but in the first step of the bottom-up pyramid. Therefore, two parallel pyramid branches are deployed to provide semantic and instance features.

**Figure 6.3:** Pyramid fusion based on two parallel FPNs. The channel size of the feature maps for all non-labeled edges is 256.

### **6.1.2 Training Strategy**

The training strategy plays an important role when training multi sensor approaches, especially for the presented range fusion approach. Its decoupled camera and lidar backbones offer the possibility to pre-train them individually on lidar and camera data. Depending on the existing data and sensor setup, this has two potential advantages. First, no combined data is necessary for this step, which allows using data where only one sensor modality is present, potentially increasing the amount of training data. Second, the training is not restricted to the overlapping field of view of both sensors. Again, this can significantly increase the amount of training data, e.g., when considering a 360° lidar and a front camera.

During the training of the overall architecture, the pre-trained backbones can be further trained or kept unchanged. While fixed backbones might negatively impact predictions, they offer a major advantage in terms of redundancy. In case of sensor failure or unavailability, the backbones can still compute their single sensor predictions as a fallback. This requires as little overhead as applying the single sensor head to the last feature maps of the respective backbone, which ensures a low latency. The required head is needed for and optimized during pre-training in any case. As a result, the following two-step training strategy is deployed:


## **6.2 Multimodal Multi View Network**

The combination of the previously presented temporal multi view approach of Section 5.3 and the proposed range fusion network is the last step of this thesis and provides the overall multimodal multi view framework. The combined architecture TSF-MVNet is able to simultaneously leverage the potential of a multi view, temporal, and multi sensor architecture.

As a result of the chosen design, the range fusion backbone seamlessly replaces the range view backbone of the temporal multi view architecture, as illustrated in Fig. 6.4. The temporal memory of the range view is now provided with the fused camera and lidar features and aggregates multi sensor features over time. Consequently, the point view backbone receives multi sensor features at the first level and temporally fused multi sensor features at the last level. The intermediate level provides lidar features since no sensor fusion is performed at this stage. In addition, the range fusion can also be combined individually with the multi view and temporal framework, called SF-MVNet and TSF-RVNet, respectively. SF-MVNet replaces the range view backbone of MVNet, and TSF-RVNet the range view backbone of T-RVNet, with the range fusion backbone.

**Figure 6.4:** The multimodal multi view architecture TSF-MVNet, which combines multi view, temporal, and sensor fusion benefits. The bird's eye view branch equals the one depicted in Fig. 5.8.

# **7 Evaluation**

The following chapter thoroughly evaluates the individual contributions of this thesis. The first section introduces the datasets and metrics used for the evaluation. Afterwards, the three main contributions of this thesis are evaluated, starting with the multi view framework presented in Chapter 4, followed by the temporal framework introduced in Chapter 5, and finally, the sensor fusion approach proposed in Chapter 6. In addition, another set of experiments investigates the benefits of combining the individual contributions of this work. The respective temporal and multimodal multi view frameworks have been presented in Sections 5.3 and 6.2. All contributions are first evaluated and analyzed by extensive ablation studies, followed by a comparison to state-of-the-art approaches.

## **7.1 Experimental Setup**

The main elements of the experimental setup for evaluation are the selected datasets and metrics. First, the two chosen public, large scale, and challenging datasets from the driving domain are presented and analyzed in the next section. Afterwards, the metrics for semantic and panoptic segmentation are introduced. The reliance on established and frequently used metrics ensures comparability to other state-of-the-art methods.

### **7.1.1 Datasets**

A considerable number of point cloud datasets have been published over the last years, see Section 2.3.2. However, only a few have the necessary properties for the training and evaluation of the proposed multimodal framework, which requires sequential lidar scans and camera images. Additionally, the considered task of panoptic segmentation needs pointwise semantic and instance labels for supervision. These requirements reduce the set of eligible datasets to SemanticKITTI [Beh19] and nuScenes [Cae20], two large scale and distinct outdoor datasets, which are also the predominant choices of other state-of-the-art methods.

**SemanticKITTI** is a multimodal outdoor dataset recorded in the city of Karlsruhe in Germany and provides pointwise semantic and instance labels based on the KITTI Odometry Benchmark [Gei12]. The 360° lidar scans with up to 64 ⋅ 2,083 = 133,312 points originate from a Velodyne HDL-64E and are recorded at 10 Hz. Two front-facing cameras are triggered by the lidar and provide camera images with a resolution of approximately 1,245 × 375 after rectification. The dataset is divided into 22 individual sequences, which are officially grouped into a training and a test split. Sequences 0–10 with 23,201 frames serve for training and validation, while the remaining sequences 11–21 with 20,351 frames serve for testing. The evaluation on the test set is only possible on the official benchmark server since no labels have been published. To prevent optimizing on the test set, an overall maximum of ten submissions per account is possible. Sequence 8 with 4,071 frames is used for validation throughout the experiments.

The semantic labels of the official benchmark contain 19 distinct classes with eight thing and eleven stuff classes. The classes motorcycle (mcycle), motorcyclist (mcyclist), other-vehicle (vehicle), and other-ground (ground) are abbreviated throughout the evaluation to increase the readability of tables and plots. An overview of the thing classes is shown in Fig. 7.1. It illustrates the frequency of semantic class labels and the number of total and unique instances for the train and validation set. While most classes are self-explanatory, the difference between bicycle and bicyclist is not obvious. The former refers to a bicycle without a rider, whereas the latter describes a bicycle with a rider and includes both the rider and the bicycle. The same applies to motorcycle and motorcyclist. In general, the thing classes represent only a small proportion of the overall point labels except for the more common car class. The total instance count ranges from 750 to approximately 15,000, which are different recordings of 26–193 unique instances. The exception is again the class car with about 2,200 unique and a total of over 200,000 instances. The distribution of the stuff classes with a combined share of nearly 95% is depicted in Fig. 7.2.

**Figure 7.1:** Overview of the thing classes of SemanticKITTI with the absolute number and percentage of pointwise semantic labels. Additionally, the number of total and unique instances is shown. The colors correspond to the class label visualization in figures.

SemanticKITTI offers additional tasks alongside the introduced semantic and panoptic benchmarks. The multi scan or dynamic semantic segmentation task further distinguishes between moving and non-moving for the classes car, truck, vehicle, person, bicyclist, and motorcyclist. Consequently, it contains 25 distinct semantic classes and is predominantly used to evaluate temporal approaches. Furthermore, the binary task moving object segmentation requires approaches to classify each point as moving or non-moving.

**Figure 7.2:** Overview of the stuff classes of SemanticKITTI with the absolute number and percentage of pointwise semantic labels.

**NuScenes** is also a large scale multimodal outdoor dataset with pointwise semantic and instance labels and has been recorded in Boston and Singapore. The 360° lidar scans from a Velodyne HDL-32 were recorded at 20 Hz and contain up to 32 ⋅ 1,084 = 34,688 points per scan. Additionally, six cameras mounted around the car provide camera images of the car's 360° environment with a resolution of 1,600 × 900 each. The dataset is divided into 1,000 individual sequences, each approximately 20 s in length. The official split assigns 700, 150, and 150 sequences or 28,130, 6,019, and 6,008 frames for training, validation, and testing. Similar to SemanticKITTI, the results for the test set can only be evaluated on the official benchmark server, which restricts submissions to three evaluation runs per year. Labels only exist for keyframes sampled across the sensor modalities at 2 Hz, which results in a predominant number of unlabeled intermediate frames. These intermediate frames are required for temporal training, in contrast to the non-temporal case. Since two consecutive frames are very similar due to the high frame rate of 20 Hz, every other frame is omitted. This speeds up training without impacting the results.

The semantic segmentation task of nuScenes contains 16 classes, which can be divided into ten thing and six stuff classes. Similar to SemanticKITTI, motorcycle (mcycle) and construction-vehicle (con-vehicle) are abbreviated in the following. Figure 7.3 illustrates the frequency of thing classes for the train and validation set. These classes represent less than 9% of the overall point labels, and most of them individually represent less than 1%. However, trucks and especially cars are more common. The total instance count for the rarer classes lies between approximately 10,000 and 20,000, which corresponds to 600–1,000 unique instances. More common classes have more than 70,000 total and 4,000 unique instances, up to around 360,000 total and 20,000 unique car instances. The distribution of the stuff classes with a combined share of nearly 92% is depicted in Fig. 7.4.

**Figure 7.3:** Overview of the thing classes of nuScenes with the absolute number and percentage of pointwise semantic labels. Additionally, the number of total and unique instances is shown.

The distinct properties of the datasets motivate the evaluation on both of them. They have been recorded in different countries and use different lidar sensors and cameras. Consequently, the point clouds of nuScenes are much sparser with only one-fourth of the points. SemanticKITTI, on the other hand, has less traffic with only 23,000 moving instances compared to 300,000 of nuScenes. Additionally, the thing classes significantly differ between both datasets, and SemanticKITTI requires a more detailed differentiation of stuff classes.

**Figure 7.4:** Overview of the stuff classes of nuScenes with the absolute number and percentage of pointwise semantic labels.

### **7.1.2 Metrics**

Different metrics have been established to evaluate semantic and panoptic segmentation approaches, which are introduced in the following. Important concepts for classification tasks are true positives, false positives, and false negatives. Applied to semantic segmentation and a semantic class $cls$, the set of true positives $TP\_{cls}$ contains all pixels or points which are correctly classified as class $cls$. The false positives $FP\_{cls}$ are classified as class $cls$ but belong to another class, and the false negatives $FN\_{cls}$ belong to class $cls$ but have been wrongly classified otherwise, see Fig. 7.5. The visualizations of the concepts in this section rely on image pixels instead of 3D points for a clearer illustration, while the concepts equally apply to 3D points.

The first metric for **semantic segmentation** based on these concepts is the accuracy $acc$, which is defined by the number of correctly classified points divided by the total number of points $N\_{\text{total}}$:

$$acc = \frac{\sum\_{cls \in \mathfrak{G}} |TP\_{cls}|}{N\_{\text{total}}}.\tag{7.1}$$

However, accuracy favors dominating classes. For example, an algorithm can achieve an accuracy greater than 99% on SemanticKITTI and still ignore all thing classes except car due to their small proportion. Hence, the prevailing metric to assess semantic segmentation is the mean of the class-wise intersection-over-union, the mIoU:

$$mIoU = \frac{1}{|\mathfrak{G}|} \sum\_{cls \in \mathfrak{G}} IoU\_{cls} = \frac{1}{|\mathfrak{G}|} \sum\_{cls \in \mathfrak{G}} \frac{|TP\_{cls}|}{|TP\_{cls}| + |FP\_{cls}| + |FN\_{cls}|}. \tag{7.2}$$
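For reference, both metrics can be computed from pointwise predictions as in the following sketch; the `ignore_index` handling is an assumption, and classes without any points are excluded from the mean.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Per-class IoU and mIoU from pointwise labels, following Eqs. (7.1)-(7.2)."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = []
    for cls in range(num_classes):
        tp = np.sum((pred == cls) & (target == cls))
        fp = np.sum((pred == cls) & (target != cls))
        fn = np.sum((pred != cls) & (target == cls))
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else np.nan)   # skip absent classes
    return np.nanmean(ious), ious
```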


**Figure 7.5:** Visualization of true positives, false positives, and false negatives on pixel level.

In case of **panoptic segmentation**, the sets of true positives $TPI$, false positives $FPI$, and false negatives $FNI$ are defined on instance level. A ground truth instance $I^{\text{gt}}$ and a predicted instance $I^{\text{pred}}$ establish a match $(I^{\text{gt}}, I^{\text{pred}}) \in TPI$ if their $IoU$ is greater than 0.5. This threshold ensures that ground truth or predicted instances are only matched once. An unmatched ground truth instance is considered a false negative, and an unmatched predicted instance is a false positive. Predicted instances are required to have a uniform class, which is why the standard evaluation procedure splits instances not only based on the predicted instance label but also based on the predicted semantics. Consequently, points with the same instance label but different semantic classes are considered different instances. Alternatively, a uniform semantic class can be explicitly computed upfront. An exemplary matching is shown in Fig. 7.6.

**Figure 7.6:** The matching of predicted and ground truth instances results in the illustrated sets of true positives ($TPI$), false positives ($FPI$), and false negatives ($FNI$).

The instance detection performance is assessed based on these sets by the mean recognition quality mRQ, defined by

$$mRQ = \frac{1}{|\mathfrak{G}|} \sum\_{cls \in \mathfrak{G}} RQ\_{cls} = \frac{1}{|\mathfrak{G}|} \sum\_{cls \in \mathfrak{G}} \frac{|TPI\_{cls}|}{|TPI\_{cls}| + 0.5 \cdot (|FPI\_{cls}| + |FNI\_{cls}|)},\tag{7.3}$$

and equals the well-known F1-score commonly used for object detection. Additionally, the mean segmentation quality mSQ assesses the quality of the instance segmentation based on the IoU of the matched instances:

$$mSQ = \frac{1}{|\mathfrak{G}|} \sum\_{\text{cls} \in \mathfrak{G}} SQ\_{\text{cls}} = \frac{1}{|\mathfrak{G}|} \sum\_{\text{cls} \in \mathfrak{G}} \frac{\sum\_{\text{match} \in TPI\_{\text{cls}}} IoU\_{\text{match}}}{|TPI\_{\text{cls}}|}. \tag{7.4}$$

Finally, both metrics are combined into the unified panoptic quality mPQ [Kir19]:

$$mPQ = \frac{1}{|\mathfrak{G}|} \sum\_{\text{cls} \in \mathfrak{G}} PQ\_{\text{cls}} = \frac{1}{|\mathfrak{G}|} \sum\_{\text{cls} \in \mathfrak{G}} SQ\_{\text{cls}} \cdot RQ\_{\text{cls}}.\tag{7.5}$$

These metrics also include the stuff classes, which are considered one-instance classes. All ground truth and predicted points of a stuff class belong to *one* ground truth and *one* predicted instance, respectively.
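The following sketch computes PQ, SQ, and RQ for a single class from pointwise semantic and instance labels; treating a stuff class as a one-instance class then amounts to passing a constant instance id. All function and variable names are illustrative.

```python
import numpy as np
from collections import defaultdict

def panoptic_quality(pred_sem, pred_inst, gt_sem, gt_inst, cls):
    """PQ, SQ, RQ of one class (sketch). Segments are the unique instance ids
    restricted to points of class `cls`; a predicted and a ground truth segment
    match if their IoU exceeds 0.5, which guarantees a unique matching."""
    pred_mask, gt_mask = pred_sem == cls, gt_sem == cls
    overlaps = defaultdict(int)
    for p, g in zip(pred_inst[pred_mask & gt_mask], gt_inst[pred_mask & gt_mask]):
        overlaps[(p, g)] += 1                                   # intersection sizes
    pred_sizes = {i: np.sum(pred_inst[pred_mask] == i) for i in np.unique(pred_inst[pred_mask])}
    gt_sizes = {i: np.sum(gt_inst[gt_mask] == i) for i in np.unique(gt_inst[gt_mask])}
    tp_iou, matched_pred, matched_gt = [], set(), set()
    for (p, g), inter in overlaps.items():
        iou = inter / (pred_sizes[p] + gt_sizes[g] - inter)
        if iou > 0.5:
            tp_iou.append(iou)
            matched_pred.add(p)
            matched_gt.add(g)
    tp = len(tp_iou)
    fp = len(pred_sizes) - len(matched_pred)                    # unmatched predictions
    fn = len(gt_sizes) - len(matched_gt)                        # unmatched ground truth
    sq = float(np.mean(tp_iou)) if tp else 0.0
    rq = tp / (tp + 0.5 * (fp + fn)) if (tp + fp + fn) else 0.0
    return sq * rq, sq, rq
```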

**Statistical significance** in machine learning is commonly demonstrated with k-fold cross-validation. However, applying this method to approaches based on deep learning is often challenging and rarely seen in literature. The underlying reason is that training a single deep learning approach on one of the folds may take days, and training on all folds can take weeks or even a month. The average training time of the experiments in this thesis was about 2.6 days, which results in 26 days for a 10-fold cross-validation of a single experiment. Since this is infeasible, the statistical significance is shown by a randomization test [Smu07], which tests the likelihood that one approach is truly better than another instead of achieving better results by coincidence. A randomization test checks for paired probes of both compared approaches whether they originate from the same underlying distribution, without making any assumptions about this distribution.

The test computes a metric for two approaches and their predictions or sets of predictions, which results in paired probes. The test's null hypothesis assumes that both approaches are equally good and that their probes therefore originate from the same underlying distribution. Under this assumption, the assignment of the paired probes to their respective approach is irrelevant, and switching them would not influence the measured results. The test creates a large number of these permutations and computes the test metric, such as the mean, for both approaches to verify or reject the null hypothesis. Consequently, the p-value is the fraction of permutations with an equal or higher difference of the test metric than the originally measured difference. The null hypothesis is rejected if the p-value is smaller than a selected significance level α. The following evaluation aims for a confidence of about two standard deviations and selects α = 0.05.

The natural way to apply a randomization test to the setup of this thesis is to consider the results for each frame as paired probes. However, mIoU and mPQ are not computed for each frame individually. After aggregating true positives, false positives, and false negatives over all frames, they are computed over the entire validation set. As a result, the individual frames have a different influence depending on the occurring classes and the frequency of their points or instances. This inequality prevents the application of the test for mIoU and mPQ on frame level. One possibility to address this is to consider each 3D point and its prediction as a probe. However, this requires comparing all point predictions and computing the metrics over the entire validation set with hundreds of millions of points from scratch for each permutation. Consequently, it would take days just to compare two approaches when a meaningful number of 100,000 or more permutations is used. In order to make this test computationally feasible, a subset of five million points is randomly drawn for every permutation. The original and permuted difference of the two approaches is computed on these subsets, and both allow the computation of the p-value. While these computations are still too expensive for the mPQ, they allow showing the statistical significance for the mIoU. An underlined mIoU in the following tables indicates a significant difference to the previous underlined value, or the first line. The row showing a significant difference to all other lines is underlined twice.
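The basic paired permutation test can be sketched as follows; it operates on per-probe scores (e.g., values computed on the randomly drawn point subsets) and does not reproduce the subset-based metric computation described above. The number of permutations and all names are placeholders.

```python
import numpy as np

def randomization_test(scores_a, scores_b, permutations=100_000, seed=0):
    """Paired randomization test (sketch): p-value for the null hypothesis that
    the paired probes of approaches A and B come from the same distribution."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    observed = abs(scores_a.mean() - scores_b.mean())
    count = 0
    for _ in range(permutations):
        swap = rng.random(len(scores_a)) < 0.5        # randomly swap the pair labels
        perm_a = np.where(swap, scores_b, scores_a)
        perm_b = np.where(swap, scores_a, scores_b)
        count += abs(perm_a.mean() - perm_b.mean()) >= observed
    return count / permutations                        # reject H0 if below alpha
```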

### **7.1.3 General Implementation Details**

All experiments are implemented based on PyTorch¹ and use distributed parallel training in mixed precision mode on up to eight NVIDIA V100 GPUs. The Adam optimizer [Kin15] is used across all experiments with a weight decay of 0.0005 and optimizes the networks for up to 100,000 iterations. The initial learning rate $\lambda\_0$ is exponentially reduced during training following:

$$\lambda\_l = \lambda\_0 \cdot e^{-5 \cdot 10^{-5} \cdot l}. \tag{7.6}$$

¹ https://pytorch.org/

All non-temporal experiments are trained with a batch size of 32. Due to the high memory demands of TBPTT, and to ensure a constant batch size across all temporal experiments, the batch size is reduced to 16 for temporal trainings. All non-temporal lidar experiments are pre-trained on the semantic task with an initial learning rate of $\lambda\_0 = 0.001$, followed by the panoptic training with a smaller initial learning rate of $\lambda\_0 = 0.0001$. The same initial learning rate is also used for the temporal and sensor fusion experiments, which use pre-trained single frame or single sensor backbones.
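As a minimal illustration of this schedule, the following PyTorch sketch wires the decay of Eq. (7.6) into a `LambdaLR` scheduler; the model is only a stand-in and the loop body is elided.

```python
import math
import torch

model = torch.nn.Linear(8, 8)                          # stand-in for an actual network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: math.exp(-5e-5 * it))

for iteration in range(100_000):
    # ... forward pass, loss, backward pass, optimizer.step() ...
    scheduler.step()                                   # lr = lr0 * exp(-5e-5 * iteration)
```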

Data augmentation is important to reduce overfitting and improve the results. All trainings randomly flip the point clouds along the x- and y-axis with a probability of 0.5 for each axis. Additionally, the point clouds are rotated by a random angle around the z-axis. Furthermore, random 180° crops of the 360° scans are used for training. Purely range view-based experiments deviate slightly from this crop and use 2D crops of size 64 × 1,024. Finally, up to 10 instances are pasted randomly into the scene. The augmentations are applied temporally consistent for temporal trainings to retain the temporal dependencies and spatial consistency across time. The range images' size is 64 × 2,083 for SemanticKITTI and 32 × 1,084 for nuScenes. Furthermore, a bird's eye view grid of size 480 × 360 is employed in both cases, with $r\_{\text{min}} = 2$ m, $r\_{\text{max}} = 50$ m, and covering the entire 360°. Points outside are mapped to the closest cell.
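The geometric part of this augmentation can be sketched as follows; instance pasting and the temporally consistent application across a subsequence are omitted, and the exact crop handling is illustrative.

```python
import numpy as np

def augment_scan(points, rng):
    """Geometric point cloud augmentation (sketch): random flips along the x- and
    y-axis, a random rotation around the z-axis, and a random 180° sector of the
    360° scan. `points` is an (N, 3) array of x, y, z coordinates."""
    points = points.copy()
    for axis in (0, 1):                                  # flip x / flip y
        if rng.random() < 0.5:
            points[:, axis] *= -1.0
    angle = rng.uniform(0.0, 2.0 * np.pi)                # rotation around z
    c, s = np.cos(angle), np.sin(angle)
    points[:, :2] = points[:, :2] @ np.array([[c, -s], [s, c]]).T
    azimuth = np.arctan2(points[:, 1], points[:, 0])     # keep a random 180° sector
    start = rng.uniform(-np.pi, np.pi)
    keep = ((azimuth - start) % (2.0 * np.pi)) < np.pi
    return points[keep]

# Example usage: augmented = augment_scan(points, np.random.default_rng(0))
```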

To generate the heatmap and offset vector targets for the instance clustering, the instance centers are determined by the mean over all instance points. An unnormalized Gaussian kernel with σ = (2, 8) for the range view and σ = (4, 4) for the bird's eye view is added at this position to the ground truth heatmap. NMS suppresses a 5×5 area in the bird's eye view and a 3×7 area in the range view when the best 50 candidates are extracted from the predicted heatmap. The default loss weights are 1 for the semantic loss, 100 for the center loss, and 0.1 for the offset loss.
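A minimal sketch of the heatmap target generation is shown below, assuming 0-based cell coordinates; combining multiple instances via a per-cell maximum is an assumption of this sketch.

```python
import numpy as np

def add_center_to_heatmap(heatmap, center_uv, sigma=(4.0, 4.0)):
    """Add an unnormalized Gaussian around one instance center to the ground
    truth heatmap (sketch); sigma follows the bird's eye view setting above."""
    H, W = heatmap.shape
    u = np.arange(H, dtype=np.float32)[:, None]
    v = np.arange(W, dtype=np.float32)[None, :]
    g = np.exp(-0.5 * (((u - center_uv[0]) / sigma[0]) ** 2
                       + ((v - center_uv[1]) / sigma[1]) ** 2))
    np.maximum(heatmap, g, out=heatmap)       # keep the per-cell maximum over instances
    return heatmap
```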

Runtime measurements are performed in single precision and are reported as mean and with standard deviation (std) over the validation set. One related challenge is the usage of different GPUs across state-of-the-art approaches. If an official repository exists, the inference time is remeasured on a V100, indicated by (\*). Otherwise, and on condition that the used GPU is specified, the values are estimated (≈) for the V100 based on inference benchmark results published online¹ and shown in Table 7.1.


**Table 7.1:** Relative inference performance of different GPUs.

## **7.2 Multi View Panoptic Segmentation**

The evaluation of this thesis starts with the multi view panoptic segmentation approach as the first contribution. It is thoroughly evaluated by an extensive number of experiments, starting with the evaluation of the range and bird's eye view network and the benefits of the proposed improvements. Follow-up experiments assess the multi view architecture and fusion. If not stated otherwise, the default multi view architecture presented in Fig. 4.6 is used with concatenation as fusion strategy and trained with equal semantic loss weights of 1 for the range, bird's eye, and point view.

### **7.2.1 Range View Experiments**

The network RVNet presented in Section 4.1 is a single view approach, and its backbone is one of the core elements of the multi view framework. It is based on DLA from the image domain and related to LaserNet, which successfully applies this architecture to lidar data. However, the initial results of this architecture for the task of 3D panoptic segmentation are mediocre, as shown in the first line of Table 7.2, despite the usage of an established architecture from the image domain. For that reason, Section 4.1 proposed different improvements for the network, which are investigated in the following. Additionally, an improved data augmentation strategy has been presented in Section 4.3.

¹ https://mtli.github.io/gpubench/

The proposed approach considers the sensor properties by deploying the improved range view projection and the asymmetric stride, motivated in Section 4.1. This sensor-aware architecture (SAA) reduces the amount of information lost during the projection based on the improved range view projection. Its asymmetric stride considers the asymmetric distance between range view cells. Next, and building upon the findings of [Aks20], the Lovász loss (LV) is added to the loss and applied with the same weight as the CE loss. It is robust to imbalanced class distributions and an improved surrogate for optimizing the IoU. Finally, random objects are pasted into the range images (ROA) to increase the occurrence frequency of rare thing classes. All three proposed and discussed enhancements significantly improve the panoptic results and achieve a large overall improvement for both metrics, as illustrated in Table 7.2. Additionally, LV and ROA only influence the training procedure but not the inference time. Only the asymmetric stride increases the latter.

**Table 7.2:** Improvements achieved by sensor-aware architecture (SAA), Lovász loss (LV), and random object augmentation (ROA) on the validation set of SemanticKITTI. Bold metrics indicate the best values of each column, and underlining indicates significance, see Section 7.1.2.


Another experiment evaluates the choice of the general backbone architecture compared to other established architectures. Therefore, two representatives of common 2D architectural families are evaluated with the same set of improvements. DeepLabV3 has been chosen as a representative for ResNet-based networks with a pyramid pooling module, which has already been successfully applied for panoptic segmentation in the image domain [Che20]. The corresponding experiment uses the official architecture of Panoptic-DeepLab with ResNet-101 and additionally applies the proposed improvements. However, the results and runtime depicted in Table 7.3 are significantly worse than those of the DLA backbone. The second architecture, U-Net, has been chosen since it is frequently used for lidar segmentation tasks and was also a milestone in the image segmentation domain. While it achieves better results than the DeepLab backbone and is slightly faster than the DLA backbone, it lacks segmentation quality. Consequently, the chosen architecture is the strongest and most promising backbone choice for the single view and also the multi view approach.

**Table 7.3:** The achieved panoptic results and number of trainable parameters of different backbone architectures on the validation set of SemanticKITTI.


### **7.2.2 Bird's Eye View Experiments**

The bird's eye view network BEVNet presented in Section 4.2 is the second single view approach, and its backbone is the second core element of the multi view approach. While it follows the same backbone architecture as the range view network, its initial results are considerably better, see Table 7.4. Nevertheless, combining CE and Lovász loss significantly improves the results, similar to the range view network. The same holds for adding random object augmentation to insert rare instances into the point clouds. The combination of both enhancements achieves a considerable improvement of the semantic and panoptic segmentation, measured by the mIoU and mPQ, respectively.

The bird's eye view network is further enhanced with the proposed sparse semantic head (SH), which restricts the convolution operations in the semantic head to occupied cells. This restriction avoids the expensive expansion of the 2D bird's eye view feature maps to 3D voxel predictions used by existing approaches [Zha20c, Zho21]. Table 7.4 underlines the benefits regarding inference time. In addition, the training time is also reduced by approximately 30%. Simultaneously, the prediction quality remains unaffected, which makes the sparse head a valuable addition.

**Table 7.4:** Panoptic improvements achieved by Lovász loss (LV) and random object augmentation (ROA) on the validation set of SemanticKITTI. The sparse semantic head (SH) improves the inference time.


The chosen DLA backbone architecture is also compared to the already motivated representatives DeepLab and U-Net for the bird's eye view. While DeepLab achieves better results in the bird's eye view than in the range view, it still achieves the lowest values for the considered metrics and has the highest inference time. The choice of U-Net is additionally motivated by Polar-Net [Zha20c], which successfully uses a U-Net backbone in its bird's eye view network. While it achieves the highest mIoU, it is outperformed for panoptic segmentation by the DLA backbone, which also has a considerably lower inference time. The general differences in the runtime compared to the range view originate from a higher resolution and symmetric strides. The former impacts the inference time negatively, the latter positively. Depending on the architecture, this leads to faster or slower inference times. Overall, the chosen DLA architecture is the most promising choice for panoptic segmentation.

**Table 7.5:** The achieved panoptic results and number of trainable parameters of different backbone architectures on the validation set of SemanticKITTI.


### **7.2.3 Multi View Experiments**

The multi view approach proposed in this thesis combines the range, bird's eye, and point view to exploit the strengths of the individual views and compensate for their weaknesses. An initial set of multi view experiments analyzes the effects of different view combinations, and the corresponding results are depicted in Table 7.6. Both previously evaluated single view approaches are the initial baselines and provide the first important insights. The results of BEVNet are superior to RVNet, especially when considering the panoptic metrics. This superiority supports the decision of this thesis to perform the instance clustering in the bird's eye view.

In the next step, two out of three views are combined to investigate the influence of the individual views and to reveal the combination with the most potential. All these experiments follow the exact same architecture proposed in Section 4.3 to ensure a fair comparison, except that one of the three views is removed. The first experiment RVNet+PV combines range and point view and shows that this setup achieves no improvements. Instance clustering is still performed in the range view, without any influence of the point view, which is challenging because instances occlude each other. As a result, instance centers are close together, and the clustering is prone to errors in the offset predictions. One minor advantage is the direct prediction of a 3D semantic segmentation without the necessity of a back-projection. RVNet+PV achieves the same panoptic results as RVNet with the kNN-based back-projection strategy. The combination of bird's eye and point view BEVNet+PV also does not improve the panoptic segmentation, which is the consequence of the point view only affecting the predicted 3D semantic segmentation. Since the sparse head of the bird's eye view already predicts high resolution 3D voxel semantics, the benefits of pointwise features are negligible.

The following experiments combine range and bird's eye view, initially with simpler fusion strategies and not yet with the proposed point view network. This is the first setup that predicts the center heatmap and offset vectors in the bird's eye view, while the 3D semantic segmentation benefits from range and bird's eye view features. The first and simplest fusion of both views is based on the final predictions, which are fused by computing the geometric mean. Even this simple fusion strategy significantly improves the panoptic and semantic segmentation and confirms the value of combining range and bird's eye view due to their distinct underlying projections. The panoptic results are further improved if the geometric fusion is replaced by a learned late fusion of the final pointwise range and bird's eye view features.


**Table 7.6:** Impact of different view combinations on panoptic and semantic results on the validation set of SemanticKITTI and nuScenes.

While these results are already promising, this simple multi view architecture still does not leverage the full multi view potential. Therefore, the approach presented in Section 4.3 deploys a third backbone for the point view, which repeatedly fuses the features at multiple scales from range and bird's eye view to refine and enhance the pointwise features. As a result, and shown in Table 7.6, the proposed MVNet improves the panoptic segmentation even further and significantly outperforms every combination of two views. Since MVNet aggregates multi view features after selected feature extractor and aggregator blocks, see Fig. 4.6, an additional experiment, MVNetAll, evaluates the potential of fusing after every block. However, the higher number of aggregation steps provides no enhancements. MVNet achieves similar improvements on nuScenes and outperforms both single view approaches by a large margin, depicted in the lower part of Table 7.6. One notable difference is the comparatively worse performance of BEVNet, which results from nuScenes' sparser point clouds. With only a quarter of SemanticKITTI's point cloud size, the bird's eye view gets increasingly sparse, which negatively impacts its results.

To investigate the multi view benefits more closely, the class-wise outcomes for thing and stuff classes are presented in Table 7.7 and Table 7.8, respectively. The overall improvements of the mIoU are directly reflected in the individual class results, with better results for six out of eight thing classes. The results of the other two classes are similar to BEVNet or RVNet. Enhancements of the semantic segmentation for stuff classes are even more pronounced, and all classes but two benefit.

**Table 7.7:** Comparison of the class-wise results for thing classes on the validation set of SemanticKITTI. Due to space limitations, leading zeros are omitted in this and the following class tables.


When comparing the PQ of thing classes, MVNet predominantly outperforms RVNet due to the superior instance predictions in the bird's eye view. Compared to BEVNet, MVNet again enhances all classes but by a smaller margin. This is expected since the instance clustering is provided by the bird's eye view part of MVNet and equals BEVNet. In this case, the main improvements originate from the semantic segmentation, which is confirmed by the fact that almost all thing or stuff classes with improved IoU also have improved PQ. Especially for stuff classes as one-instance classes, improving the semantic segmentation is the only possibility to improve the PQ. Consequently, the stuff classes show similar enhancements for the PQ as for the IoU. Overall, the predominantly improved class metrics emphasize the value of combined range and bird's eye view features and the proposed approach's capability to successfully combine and exploit them.


**Table 7.8:** Comparison of the class-wise results for stuff classes on the validation set of SemanticKITTI.

These findings are illustrated by the selected semantic and instance examples shown in Fig. 7.7. In the semantic example, BEVNet fails to accurately segment the static environment and misses a pole entirely (a). In addition, it classifies parts of the yard as terrain, which is considered sidewalk in the ground truth (b). On the other hand, MVNet correctly segments the pole and the yard area as a result of the multi view architecture. The second example shows instance segmentation results and two major errors of RVNet. It is unable to correctly separate the highlighted parked cars (c) and (d), whereas MVNet provides the correct instance segmentation. This example underlines the advantage of relying on instance segmentation based on the bird's eye view.

**Figure 7.7:** Two selected examples showing the superiority of MVNet over the single view approaches BEVNet and RVNet. The left example shows semantic and the right example instance segmentation. Black points in the ground truth indicate unlabeled points.

#### **Extended Ablation Studies**

After the general multi view setup has been evaluated, the pointwise fusion of range, bird's eye, and point view features is investigated more closely. The different strategies proposed in Section 4.3 are addition, element-wise maximum, concatenation (concat), and learned weighted sum (lws). The corresponding results are depicted in Table 7.9 and show that all fusion strategies achieve a similar panoptic quality. Concatenation and addition achieve slightly better semantic segmentation than the other strategies and are therefore the favorable choices.


**Table 7.9:** Panoptic results and inference times of different multi view fusion strategies on the validation set of SemanticKITTI.

The proposed loss for the multi view framework comprises several weighted components related to the semantic or instance task, see Eq. (4.16). Crucial components are the auxiliary semantic losses for range and bird's eye view, which improve the semantic results significantly, as illustrated in Table 7.10. This first set of experiments evaluates the influence of different semantic weights by training the multi view approach solely for semantic segmentation with different weight combinations. The best semantic segmentation is achieved with equally weighted views, whereas preferring individual views negatively impacts the results. The semantic loss is completed by the center and offset losses for the instance task, and all three are of different magnitudes. Therefore, the primary goal of the respective semantic, offset, and center weights is to equalize these magnitudes. Table 7.10 presents the results of additional experiments, which evaluate permutations of weights used in existing works [Yan19c, Che20, Zho21]. However, no significantly best combination is observable across the evaluated permutations.


**Table 7.10:** Results of different weight combinations for the semantic and instance loss on the validation set of SemanticKITTI.

Overall, the presented results motivate the choice of the architecture of Fig. 4.6 with three MVA modules and concatenation-based fusion as the final multi view approach. Furthermore, the best loss weights are 3 for the semantic loss, with equally weighted views, 0.1 for the offset loss, and 100 for the center loss. This setup is used for all subsequent multi view experiments and for the comparison to the state of the art. Since this comparison equals the one of the final multimodal multi view approach, it is jointly presented later in this chapter. In this case, the panoptic head unifies the semantic class of each instance to the dominating class, improving the mPQ to 0.604.

## **7.3 Temporal Panoptic Segmentation**

As the second contribution, the temporal panoptic segmentation approaches are extensively evaluated in the next step, starting with the evaluation of the temporal range and bird's eye view networks. The conducted experiments analyze the benefits of the proposed recurrent temporal architecture in both views, and several ablation studies investigate the influence of the proposed components. Finally, the presented methods are compared to other state-of-the-art temporal approaches.

If not stated otherwise, the memory update is based on the residual strategy with four BBs. The default temporal training strategy updates the weights every 5 steps, propagates the gradient 4 steps back in time, and performs the first update after 10 warm-up steps. The sequence length defaults to 50 frames, and the pre-trained backbone of RVNet or BEVNet is used as initialization instead of training from scratch.
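A simplified sketch of such a truncated backpropagation loop is given below; the model and memory interfaces are hypothetical placeholders, and unlike the actual strategy, the sketch couples the update frequency and the number of backpropagation steps for brevity.

```python
def train_subsequence(model, frames, targets, optimizer,
                      update_every=5, warm_up=10):
    """Simplified sketch of truncated backpropagation through time (TBPTT).

    The recurrent memory is carried across the whole sub-sequence; the graph
    is truncated at every weight update, so gradients flow back only to the
    previous update.  The first update waits for `warm_up` frames so the
    memory is filled before it influences the weights.  All interfaces
    (model, memory, loss) are hypothetical placeholders.
    """
    memory, window_loss = None, 0.0
    for t, (frame, target) in enumerate(zip(frames, targets), start=1):
        pred, memory = model(frame, memory)
        window_loss = window_loss + model.loss(pred, target)

        if t >= warm_up and t % update_every == 0:
            optimizer.zero_grad()
            window_loss.backward()
            optimizer.step()
            memory = memory.detach()     # truncate the computation graph
            window_loss = 0.0
```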

## **7.3.1 Temporal Range View Experiments**

The temporal range view network T-RVNet introduced in Section 5.1 builds upon a recurrent architecture to recursively aggregate and propagate features through time. It combines several individual components with a temporal training strategy to exploit temporal information. The first experiments investigate the contribution of the individual components to the improved panoptic results, with an overview depicted in Table 7.11. One of these components is the backbone of RVNet as single frame backbone (SFB). Consequently, RVNet is the appropriate single frame baseline shown in the first row of Table 7.11, allowing the evaluation of temporal improvements.

**Table 7.11:** Influence of the individual components of the temporal architecture. Starting with the single frame backbone (SFB) as a baseline and consecutively adding the temporal memory (TM), the temporal memory alignment (TMA), and Truncated Backpropagation Through Time (TBPTT) [Wil90].


The first step towards the proposed temporal architecture is the addition of the temporal memory (TM), which significantly improves the panoptic results. Although neither temporal alignment nor temporal training has been applied yet, the temporal network is already able to benefit from temporal information. Adding TBPTT only improves the panoptic segmentation in conjunction with the temporal alignment. The alignment is crucial to ensure the correct association of past and current features by compensating for ego motion in the range view. Consequently, aligning the memory based on the backward strategy further improves the results and additionally allows the architecture to benefit from TBPTT. With the alignment, the errors are correctly propagated through time, and the semantic segmentation is further enhanced. Overall, T-RVNet achieves major improvements over the single frame network RVNet, with significantly better panoptic and semantic quality. Equally convincing improvements on nuScenes are depicted in Table 7.11. The increased runtime is a consequence of the temporal memory added to the semantic and instance branch of RVNet.

To further investigate the temporal benefits, the individual class results are considered and compared with the non-temporal approach RVNet. The semantic and panoptic segmentation is improved for every thing class, as illustrated in Table 7.12. The strong benefits for these classes seem counterintuitive since the temporal approach disregards their movement. However, a considerable number of the respective instances are stationary, such as parked cars or bicycles, or move slowly like pedestrians. In these cases, alignment errors do not exist or are negligible. At higher velocities, the resulting error depends on the relative movement direction and remains small in many scenarios. Therefore, thing classes strongly benefit from temporal information in most situations. In addition, stuff classes benefit similarly, as illustrated in Table 7.13. The highest absolute improvements of all classes are revealed for the classes truck and vehicle. Both classes are often confused because the vehicle class contains object types with a shape partially similar to trucks, such as buses and trailers. Since individual instances are only partially observed, this leads to shape ambiguities and confusion. Temporal information reduces this confusion based on the aggregated information, which is reflected in a considerable reduction of instances whose classification alternates between truck and vehicle over time. Overall, T-RVNet achieves predominantly enhanced class-wise panoptic and semantic results, as well as significant improvements of the means over thing and stuff classes.


**Table 7.12:** Comparison of the class-wise panoptic results for thing classes on the validation set of SemanticKITTI.

**Table 7.13:** Comparison of the class-wise panoptic results for stuff classes on the validation set of SemanticKITTI.


These findings are illustrated by the selected example shown in Fig. 7.8. Despite the correct segmentation of the parking area in the first frame, RVNet fails to accurately segment the parking spot in the following frames. On the other hand, T-RVNet correctly predicts this area across all frames due to the exploitation of temporal information and provides improved and temporally consistent predictions.

**Figure 7.8:** Example for temporally robust and improved results of T-RVNet.

### **Extended Ablation Studies**

In the next step, the discussed components are evaluated individually. The core element of the temporal architecture is the temporal memory, which recursively fuses temporally aggregated feature maps from the past with the latest feature maps. Table 7.14 illustrates the results of the different fusion strategies proposed in Section 5.1 and provides several key insights. Gating mechanisms commonly used for RNNs provide no advantages over a residual-based fusion. This finding confirms the previous claim that a residual strategy does not suffer from exploding or vanishing gradients. The underlying reasons are that the gradient is backpropagated only for a few steps and that residual networks were designed as very deep networks, see Section 5.1 for a detailed discussion. On the other hand, the native ConvGRU suffers from the discussed limited or missing spatial aggregation after the feature fusion. The spatial aggregation is especially beneficial in the considered setup since it adds the capability to compensate for small errors in the alignment step. Therefore, the native ConvGRU achieves the worst panoptic and semantic results, barely improving over RVNet. The importance of sophisticated context aggregation is additionally confirmed by the results of the proposed ContextGRUs, which integrate a residual network into their candidate branch. The context aggregation provided by two or four BBs significantly improves the panoptic segmentation, with two blocks already being sufficient. However, and despite the established gating mechanism, directly applying the residual networks as residual strategy without gating achieves a better panoptic segmentation. Hence, the best strategy and preferred choice is a residual update based on four BBs.
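A minimal sketch of the preferred residual memory update is shown below; the block definitions and channel handling are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Plain 2D residual basic block (BB), used here purely for illustration."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)

class ResidualMemoryUpdate(nn.Module):
    """Residual fusion of aligned memory and current features (no gating)."""
    def __init__(self, channels, num_blocks=4):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, 1)
        self.blocks = nn.Sequential(*[BasicBlock(channels) for _ in range(num_blocks)])

    def forward(self, memory_aligned, features):
        fused = self.reduce(torch.cat([memory_aligned, features], dim=1))
        return self.blocks(fused)          # spatial aggregation after the fusion
```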


**Table 7.14:** Results of different fusion strategies for the memory update on the validation set of SemanticKITTI.

The second investigated component is the temporal memory alignment. Two distinct alignment strategies and a range gating mechanism have been proposed to compensate for ego motion and to suppress wrong feature alignments caused by moving objects or occlusion. The overview in Table 7.15 reveals that both strategies perform similarly well, with a minor advantage for the backward strategy. Consequently, the backward strategy is preferable since it is faster and provides slightly better results. Independent of the strategy, the temporal method requires no explicit detection of alignment errors by the proposed range threshold because the panoptic results remain unchanged. These errors either have no negative impact on the memory update, or their frequency is too low to influence the considered metrics. On the other hand, if the range gate is too restrictive, e.g., 2 cm, which is the standard deviation of the sensor's measurement error, a considerable amount of valuable information is ignored. Consequently, the results are negatively influenced.
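Assuming the memory and its associated range image have already been warped into the current view, the optional range gate can be sketched as a simple consistency check; the names, tensor shapes, and default threshold below are illustrative placeholders.

```python
import torch

def apply_range_gate(memory_aligned, range_aligned, range_current, threshold=0.5):
    """Sketch of the optional range gate on the aligned temporal memory.

    memory_aligned: (B, C, H, W) aligned feature memory.
    range_aligned, range_current: (B, H, W) range images in metres.
    Where the transported and the currently measured range differ by more
    than `threshold`, the aligned features are treated as invalid (moving
    object or occlusion) and suppressed.  The experiments above suggest the
    gate can usually be disabled.
    """
    valid = (range_aligned - range_current).abs() <= threshold
    return memory_aligned * valid.unsqueeze(1).float()
```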


**Table 7.15:** Results of the two alignment strategies and different range gate thresholds on the validation set of SemanticKITTI.

Another set of experiments investigates the influence of the temporal training parameters sequence length and temporal backpropagation steps. These parameters influence the training procedure proposed in Section 5.1.3 rather than the model architecture. The results in Table 7.16 show that the proposed approach is insensitive to changes in these parameters, and the results are very similar across all parameter combinations. Overall, no clear tendency is visible that a particular sequence length or number of backpropagation steps achieves superior results. Hence, no sequence lengths longer than 100 have been evaluated. The maximum number of backpropagation steps was limited by the memory requirements, which increase nearly linearly with this number. Another key insight provided by Table 7.16 is the importance of training with artificial subsequences. Training with the nine native sequences of SemanticKITTI fails and cannot leverage any temporal potential for the reasons discussed in Section 5.1.3. It achieves no improvements over the non-temporal baseline approach RVNet.


**Table 7.16:** Influence of the temporal training parameters sequence length and backpropagation steps evaluated on the validation set of SemanticKITTI.

The recursive feature aggregation over time is the core idea of the proposed temporal approach, with the important and novel benefit of an unrestricted temporal window that does not influence the runtime. An unrestricted temporal memory raises the question of how many past frames actually contribute to the improvements of the latest frame. To answer this question, experiments with a restricted temporal memory are conducted. While trained with the default temporal setup, the number of past frames aggregated in the temporal memory is restricted during evaluation. The corresponding results are plotted in Fig. 7.9. The first insight is the expected importance of past information, confirmed by the considerable degradation of the results if the temporal memory is disabled during evaluation (memory length of zero). In this case, even the single frame baseline achieves a better panoptic segmentation. When the number of considered past frames increases, the panoptic results considerably improve up to ten frames. Afterwards, the improvements slow down, and with a memory length of 50, the results of the unbounded memory are nearly reached. These findings underline that a double-digit number of past frames is required to fully exploit the potential of temporal information and emphasize the value of the runtime independence.

**Figure 7.9:** Influence of the number of considered past frames, or memory length, on the results.

Furthermore, important factors for temporal fusion are the ego velocity and the time step size. The velocity determines how much the static environment changes between frames, and the step size determines the distance traveled by moving objects. Hence, another experiment investigates the influence of increasing velocity and time step size on the results, which are depicted in Fig. 7.10. For this experiment, the evaluation skips frames to simulate higher velocities and larger time steps. Both factors are coupled for existing datasets and cannot be simulated independently. The first data point for both datasets is the default result at the given mean ego velocity over the evaluation set. The second entry corresponds to aggregating only every other frame into the temporal memory. This setup simulates a velocity or time step size twice as high. Therefore, the evaluation starts with the first frame and skips every other frame, which is repeated starting with the second frame to provide predictions for every frame. The third entry corresponds to skipping two frames, tripling velocity or time step size, and so on. Naturally, the improvements achieved by temporal fusion decrease with increasing velocity since the jointly observed area in both time steps decreases. Consequently, the area where temporal fusion provides no benefits increases. Additionally, the errors induced by disregarding object movement increase with larger time step sizes. Nevertheless, the proposed temporal approach achieves considerable improvements up to very high velocities occurring, e.g., on the highway. The enhancements are higher for SemanticKITTI because its lidar provides point clouds with a higher density, compensating for larger distances between frames to some extent. Furthermore, nuScenes suffers more from the higher time step size due to a higher number of moving objects.
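The frame skipping used to simulate higher velocities can be sketched as follows; the helper is purely illustrative.

```python
def frames_with_skipping(frames, skip):
    """Yield sub-sequences in which `skip` frames are left out between the
    frames that enter the temporal memory.  Repeating the procedure for every
    start offset ensures that a prediction is produced for every frame."""
    step = skip + 1
    for offset in range(step):
        yield frames[offset::step]
```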

**Figure 7.10:** Influence of the ego velocity and time between frames on the results. The time step size is 0.1 s for the first entry and increases by 0.1 s with each entry to the right.

### **7.3.2 Temporal Bird's Eye View Experiments**

The temporal bird's eye view network T-BEVNet builds upon the same concepts as T-RVNet. Therefore, a reduced subset of the previous experiments is conducted to investigate and confirm the temporal benefits in the bird's eye view. The influence of the individual components on the results is shown in Table 7.17. Since T-BEVNet builds upon the backbone of BEVNet as single frame backbone, BEVNet is the respective non-temporal baseline shown in the first row of Table 7.17. The first added component is again the temporal memory, which considerably improves the panoptic results. The improvements are slightly lower compared to the temporal range view, which can be explained by the superior panoptic segmentation of BEVNet compared to RVNet. Again, temporal alignment and training are not strictly required in order to benefit from temporal information. However, this reduced setup falls short of exploiting the full potential. The following rows confirm that the temporal memory alignment is crucial to further improve the panoptic results and to leverage the full potential of TBPTT. Similar to T-RVNet, major overall improvements over the single frame network BEVNet are achieved, with significantly better panoptic and semantic quality. The enhanced results on nuScenes depicted in Table 7.17 confirm the convincing outcomes provided by the temporal approach.

**Table 7.17:** Benefits provided by the individual components. Starting with the single frame backbone (SFB) as baseline and consecutively adding the temporal memory (TM), temporal alignment (TMA) and Truncated Backpropagation Through Time (TBPTT).


The evaluation of the memory update in the bird's eye view focuses on the most promising strategies revealed in the previous section. Table 7.18 shows the results for the two best residual strategies and ContextGRU as the best gated strategy. Again, the gating mechanism provides no benefits and is outperformed by both residual strategies. While the residual network built from two BBs achieves a similar panoptic quality at a lower runtime, four BBs provide noticeably better results overall.


**Table 7.18:** Results of different fusion strategies for the memory update on the validation set of SemanticKITTI.

The investigation of the temporal training parameters, sequence length and backpropagation steps through time, depicted in Table 7.19 shows results similar to those for T-RVNet. However, the configuration with a sequence length of 50 and four backpropagation steps is the best choice for temporal bird's eye view training when considering the semantic results.

**Table 7.19:** Influence of the temporal training parameters sequence length and backpropagation steps evaluated on the validation set of SemanticKITTI.


Overall, the best temporal range and bird's eye view results are achieved with the backward strategy for temporal alignment without range gate and a temporal memory update based on four residual BBs. The best training strategy uses subsequences with a length of 50 and propagates the gradient four steps back in time.

### **7.3.3 Comparison to State-of-the-Art**

A comparison of the proposed temporal framework to other temporal methods follows the detailed ablation studies of the previous sections. Therefore, it is compared to the approaches presented in Section 2.4 across various segmentation tasks. As introduced in Section 7.1.3, the inference time of existing methods that used different GPUs is reproduced (\*) if the code was released, or otherwise converted (≈) based on public benchmarks.

**Table 7.20:** Panoptic and semantic segmentation results of temporal approaches on the test set of SemanticKITTI. Additionally, the achieved temporal improvements over the respective non-temporal baseline on the validation set are reported. The inference times for T-RVNet and T-BEVNet are for the panoptic and semantic setup, respectively.


In the first step, the results and relative improvements for semantic and panoptic segmentation are considered, which only a few existing approaches explicitly address. The overall results are reported on the test set. The improvements, on the other hand, are reported on the validation set because the ablation studies, which reveal the temporal benefits, are performed thereon. The overview in Table 7.20 shows the superiority of the proposed temporal framework, which achieves the best results and the lowest inference time with both representations. The approach of Wang et al. [Wan22b] uses a sophisticated baseline that provides good results. However, their aggregation of input points for thing classes as temporal fusion strategy achieves only minor improvements. SpSequenceNet [Shi20] only considers one previous frame, which considerably limits the leveraged temporal potential. Consequently, it is clearly outperformed by the proposed method in overall results and achieved improvements. Finally, MetaRange [Wan22a] relies on input-level fusion based on residual images. Their semantic results show that this strategy cannot compete with the proposed feature-based strategy.


**Table 7.21:** Results of temporal approaches for the task of dynamic semantic segmentation on the test set of SemanticKITTI.

Next, the results for the dynamic semantic segmentation task are compared in Table 7.21. The presented temporal framework once more achieves the best results with the lowest runtime. Some already discussed approaches also tackle this task but again cannot compete with the proposed approach for the reasons mentioned. The exception is the approach of Wang et al., which provides similar results and indicates that their aggregation of instance points at the input level provides valuable information for identifying moving points. The recurrent approach of TemporalLattice [Sch22b] achieves considerably worse results while having a higher computational complexity than T-RVNet and T-BEVNet. These findings show that a careful design of RNN-based architectures is required to exploit their potential with low computational complexity.

Finally, the temporal framework of this thesis is compared to other approaches for the task of moving object segmentation, and the results are illustrated in Table 7.22. These approaches predominantly rely on residual images and are designed specifically for this task. They exploit the temporal information solely to identify moving points and not to improve learned features in general. Nevertheless, the proposed temporal approach achieves the second-best results while being the fastest.


**Table 7.22:** Moving object segmentation results on the test set of SemanticKITTI.

To summarize, the temporal framework shows excellent performance and improvements across four different tasks and outperforms all other temporal approaches on three of them. These convincing results underline the benefits of the proposed temporal feature fusion and confirm its capability to exploit them. None of the existing approaches achieves convincing results across all these tasks simultaneously. Furthermore, the presented approach achieves the best inference time, which shows the value of the recursive aggregation and reuse of previously computed feature maps. Overall, the presented temporal framework achieves excellent results with low computational complexity.

## **7.4 Multi Sensor Panoptic Segmentation**

The next step of the evaluation investigates the proposed sensor fusion of lidar and camera for panoptic segmentation, which is the third contribution. Various experiments thoroughly evaluate the presented multi sensor fusion architecture and training strategy to analyze the benefits of the proposed approach. Finally, it is compared to other state-of-the-art sensor fusion methods.

In contrast to previous sections, the following experiments are conducted on nuScenes, if not stated otherwise. Unlike SemanticKITTI, it provides camera images for the entire 360° environment and allows using the entire lidar scan for fusion instead of a small overlapping field of view in the front. The ResNet-50 of BEVDepth [Li22d] pre-trained on the nuScenes object detection task is used as the camera backbone for these experiments. On the other hand, a PSPNet with default architecture is used for the experiments on SemanticKITTI, which is pre-trained on the semantic segmentation provided by the KITTI-STEP dataset [Web21]. Furthermore, the pre-trained backbone of RVNet is used as initialization for the lidar backbone. If not stated otherwise, the camera feature maps of the second, fourth, and fifth stage are fused by the iterative architecture, and both sensor backbones are frozen during the fusion training. The inference time reported for sensor fusion approaches excludes the runtime of the camera backbone since it is not the focus of this thesis.

## **7.4.1 Range View Fusion Experiments**

The sensor fusion network SF-RVNet introduced in Section 6.1 relies on a multi scale fusion of lidar and camera feature maps to improve panoptic segmentation based on multi sensor features. The first multi sensor experiments investigate the impact of the individual iterative fusion components on the results, which are depicted in Table 7.23. The baseline is RVNet since its backbone is one of the components of the fusion approach. Instead of adding the entire fusion branch at once, its modules A, B, and C are added individually to investigate the influence of each stage. The first experiment solely uses the last fusion module, which already improves the results by a considerable margin. This finding underlines the effectiveness of the proposed fusion module and the value of camera features in general. The results are further enhanced when the second and first fusion modules are added. This leads to the conclusion that multi scale fusion is beneficial, as is the iterative refinement of the fused features.
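A rough sketch of such an iterative fusion branch is given below; it assumes the camera feature maps have already been transformed geometrically into the range view grid, and the module names, channel sizes (ResNet-50 stages two, four, and five), and layer choices are illustrative assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """One fusion step: project camera features to the lidar channel width and
    fuse them with the current (lidar or already fused) feature map."""
    def __init__(self, lidar_ch, cam_ch):
        super().__init__()
        self.project = nn.Conv2d(cam_ch, lidar_ch, 1)
        self.refine = nn.Sequential(
            nn.Conv2d(2 * lidar_ch, lidar_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(lidar_ch, lidar_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, fused, cam_in_rv):
        cam = self.project(cam_in_rv)
        return self.refine(torch.cat([fused, cam], dim=1))

class IterativeFusionBranch(nn.Module):
    """Chain of fusion modules (A, B, C) over camera stages, e.g. 2, 4, and 5."""
    def __init__(self, lidar_ch, cam_chs=(256, 1024, 2048)):
        super().__init__()
        self.fusion_modules = nn.ModuleList(
            [FusionModule(lidar_ch, c) for c in cam_chs]
        )

    def forward(self, lidar_features, cam_maps_in_rv):
        fused = lidar_features
        for module, cam in zip(self.fusion_modules, cam_maps_in_rv):
            fused = module(fused, cam)     # iterative refinement of fused features
        return fused
```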


**Table 7.23:** Influence of the individual fusion stages on the results.

These findings are illustrated by the selected semantic and instance examples depicted in Fig. 7.11. In the former, RVNet fails to classify the bus correctly and confuses it with a truck, which has a similar shape. In contrast, SF-RVNet exploits camera information to resolve this confusion and correctly detects the bus. The second example shows instance results and the advantage of the high camera resolution. SF-RVNet provides a more accurate instance segmentation at the border of two trailers. This example also reveals errors in the ground truth, one of the challenges mentioned in the introduction.

#### **Extended Ablation Studies**

Additional experiments investigate the revealed benefits of SF-RVNet over the baseline RVNet more closely. One possible and unwanted source of improvement is the increased model capacity due to the added fusion branch. Therefore, SF-RVNet is trained with empty camera features to exclude this possibility. Consequently, it uses the increased model capacity but cannot exploit camera features. The results in Table 7.24 show only marginal improvements in this setting, which eliminates the increased model capacity as the main source of enhancement. The second experiment examines the dependence of the fusion approach on camera features during inference. Hence, SF-RVNet is trained with camera features but receives no camera features during inference. The huge quality drop depicted in Table 7.24 is further strong evidence that the proposed fusion approach intensively exploits camera features. This simulated camera failure can be mitigated by applying the lidar head deployed during pre-training to the lidar backbone feature maps, which recovers the panoptic results of RVNet instead of the degraded ones.
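The described fallback behavior can be sketched as follows; all module names and interfaces are hypothetical placeholders and not the thesis implementation.

```python
def panoptic_inference(lidar_backbone, fusion_branch, fusion_head, lidar_head,
                       point_cloud, camera_features=None):
    """Sketch of the camera-failure fallback discussed above.

    If camera features are unavailable, the frozen lidar backbone and the
    lidar head kept from pre-training produce the single-sensor result
    instead of running the fusion branch on empty inputs.
    """
    lidar_features = lidar_backbone(point_cloud)
    if camera_features is None:             # simulated or real camera failure
        return lidar_head(lidar_features)   # falls back to lidar-only quality
    fused = fusion_branch(lidar_features, camera_features)
    return fusion_head(fused)
```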

**Figure 7.11:** Improved semantic and instance segmentation of SF-RVNet due to the fusion of lidar and camera information.

**Table 7.24:** Relevance of camera information for SF-RVNet during training and inference, evaluated on the validation set of nuScenes.


The next step investigates the impact of the selected camera feature maps and fusion strategy. The ResNet architecture provides feature maps at five different stages. However, the first stage is not further considered since it consists solely of an initial 3 × 3-conv and max pooling, which provides only shallow features. All combinations of the other four stages are evaluated, and all perform equally well, with the results shown in Table 7.25.

**Table 7.25:** Influence of the chosen ResNet stages for extracting intermediate camera feature maps, evaluated on the validation set of nuScenes.


In contrast, the chosen fusion architecture significantly impacts the results. Table 7.26 depicts the outcomes of the strategies proposed in Section 6.1.1. The pyramid strategy achieves the best results at the cost of a high computational complexity. The latter is mainly caused by the requirement to deploy two entire pyramid modules in parallel, doubling the fusion branch's computational complexity. Additionally, the pyramid fusion itself is more complex than the iterative fusion. While iterative fusion achieves worse results, it still provides excellent improvements and has a significantly lower inference time.


**Table 7.26:** Results of the different fusion architectures on the validation set of nuScenes.

The pyramid fusion is evaluated in three different configurations, starting with the default and light pyramid strategies, which achieve similar panoptic results. The light variant replaces the feature refinement modules with a single convolutional layer, see Section 6.1.1. In contrast, the extended pyramid fusion doubles the projected camera and fusion channel sizes in the second and third sensor fusion step to maximize the exploited camera information. The larger channel sizes further increase the panoptic and semantic quality, but also the computational complexity.

One advantage of the proposed architecture is the potential independence of the lidar and camera backbones. Both backbones are pre-trained on their respective data and frozen during fusion training to achieve this independence. This procedure raises the general question of how the training strategy influences the panoptic results. Table 7.27 shows the results of different strategies for freezing or further optimizing the individual backbones during fusion training. If the lidar backbone is further optimized when training the overall fusion approach, SF-RVNet performs worse. It is likely harder for the fusion branch to learn a high quality feature fusion of constantly changing backbone features. Combined with the major advantage of the backbones acting as a fallback for sensor failure, discussed in Section 6.1, the preferable training strategy keeps both backbones unchanged.

Overall, both sensor fusion strategies provide excellent results, with the iterative strategy being faster and the pyramid strategy providing the best results. The stages the camera features are chosen from play no significant role. Freezing the backbones during fusion training provides the best results and ensures that the fusion approach does not degrade below the lidar baseline in case of camera failure, see Table 7.24. Therefore, the final range fusion network uses the extended pyramid fusion strategy with camera stages two, four, and five. However, the iterative strategy is chosen for the deployment in the multimodal multi view architecture, where the fusion branch is only one of many components and computational complexity is an important property.


**Table 7.27:** Impact of optimizing the backbones during training of the fusion approach. The experiments were conducted on the validation set of nuScenes.

## **7.4.2 Comparison to State-of-the-Art**

In addition to the presented extensive ablation studies, the proposed sensor fusion approach is further compared to other fusion approaches. As discussed in Section 2.5, only a few approaches for semantic segmentation have been proposed, and to the best of the author's knowledge, no approach tackles the related task of panoptic segmentation.

Since SemanticKITTI only has front-facing cameras, the evaluation for sensor fusion must be restricted to the overlapping field of view (FoV) of the lidar and camera, which covers approximately 1/6 of the overall point cloud. Consequently, the following results are reported on the validation set because the official test server supports no restriction to the overlapping FoV. The outcomes for different methods are depicted in Table 7.28. The first finding is the superiority of SF-RVNet over the single modality range view approaches. This confirms the capability of the proposed approach to improve semantic segmentation based on sensor fusion. It is worth mentioning that further approaches with better results than the listed methods exist but do not provide their code, preventing the evaluation on the restricted dataset. Another finding is that simple fusion schemes, such as PointPainting [Vor20] and RGBAL [Elm19], do not achieve state-of-the-art results. PointPainting has been proposed for 3D object detection and enhances the input point cloud with semantic labels from camera semantic segmentation. However, it achieves no convincing results for 3D semantic segmentation. RGBAL projects the camera image into the range view at the input level, which discards a great amount of information due to the lower range view resolution. In contrast to these methods, and similar to the proposed approach, LaserNet++ [Mey19a] and PMF [Zhu21c] deploy a deep fusion of camera and lidar features. LaserNet++ performs a single scale fusion without reaching state-of-the-art semantic segmentation, whereas PMF provides competitive results using multi scale fusion. However, their architecture integrates camera features into their lidar backbone. In case of camera failure, their lidar backbone cannot be used as a fallback, and the overall network would severely degrade.
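Such a FoV restriction essentially projects every lidar point into the camera image and keeps only points that land inside it; the following NumPy sketch illustrates the idea with assumed calibration conventions.

```python
import numpy as np

def points_in_camera_fov(points, T_cam_lidar, K, image_size):
    """Mask of lidar points inside the camera field of view (illustrative).

    points: (N, 3) lidar coordinates; T_cam_lidar: (4, 4) extrinsics;
    K: (3, 3) intrinsics; image_size: (width, height)."""
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.0                       # points behind the camera are invisible
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)   # perspective division
    w, h = image_size
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return in_front & in_image
```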


**Table 7.28:** Semantic results of fusion approaches on the validation set of SemanticKITTI restricted to the overlapping FoV of lidar and camera. († values taken from [Zhu21c], ‡ reproduced results based on own implementation.)

The previous findings are confirmed by the results on nuScenes depicted in Table 7.29, which allows using the entire scan and official validation set. The validation instead of the test set is chosen because PMF and LIFSeg [Zha21b] report their results only on the former. With LIFSeg, an additional approach is evaluated on nuScenes and achieves the best results. However, this is mainly caused by their strong voxel-based backbone CylinderNet [Zhu21b]. The fusion strategies of PMF and SF-RVNet achieve larger improvements. When the presented sensor fusion approach is combined with a stronger backbone, such as the proposed multi view approach, it considerably outperforms LIFSeg.


**Table 7.29:** Semantic results of fusion approaches on the validation set of nuScenes with the achieved improvements over their respective lidar baseline.

Overall, the proposed sensor fusion architecture achieves state-of-the-art results on both datasets while having the ability to counteract camera failure with its lidar backbone. Additionally, the proposed fusion method can be combined with stronger networks, such as MVNet, which is investigated more closely in the next section.

# **7.5 Multimodal Multi View Panoptic Segmentation**

The main contributions of this thesis have been evaluated individually so far. Another important goal of this thesis is the combination of these contributions into a multimodal multi view approach. Therefore, the next set of experiments investigates the benefits of combining the multi view, temporal, and multi sensor frameworks. In the second step, a comprehensive comparison of the combined frameworks, as the final result of this thesis, to other state-of-the-art approaches is conducted.

## **7.5.1 Multimodal Experiments**

The three frameworks presented in this thesis are combined step by step, starting with the multi view and temporal framework. Table 7.30 shows the results on SemanticKITTI. Adding the temporal memory to the range view branch of the multi view architecture, as presented in Section 5.3, significantly improves the panoptic and semantic quality, with the latter improvement being more pronounced. This is expected since the temporal range view memory only influences the semantic segmentation and not the center and offset predictions for the clustering. The temporal bird's eye view memory, on the other hand, improves the panoptic quality by a considerably larger margin since it directly influences the clustering predictions. The best results are achieved when both views are temporally enhanced, which significantly outperforms the non-temporal multi view approach. The large improvement of the panoptic segmentation confirms the value of combining the presented multi view and temporal frameworks. Another experiment, T-MVNet-Lite, deploys only two BBs for temporal fusion instead of four. This strategy achieved slightly worse results for the temporal range and bird's eye view approaches while being faster. It is a more efficient alternative in terms of runtime that still considerably improves the panoptic results.


**Table 7.30:** Evaluation of the combined temporal multi view architecture on the validation set of SemanticKITTI. Temporal memories are successively added to both views.

In the next step, the temporal multi view approach is further extended by the sensor fusion approach, see Section 6.2. However, no further improvements are achieved on SemanticKITTI. The main reason is the small overlapping FoV of lidar and camera, which reduces the point cloud size to approximately 1/6. While benefits for the range fusion approach can still be shown, the temporal training is unsuccessful on the strongly reduced training data. Since nuScenes provides full camera coverage, a more comprehensive evaluation of all possible combinations is performed thereon.

nuScenes offers a 360° overlapping camera FoV, which allows using the entire lidar scan for sensor fusion approaches. Table 7.31 shows the results for all possible combinations, starting with the already presented results for the range view improved with temporal or camera information. Both enhancements achieve similar semantic results, while the sensor fusion improves the panoptic quality by a larger margin. Moreover, their combination provides a multimodal single view approach, improving the panoptic segmentation even further and considerably outperforming the individual enhancements.


**Table 7.31:** Combination of the contributions of this thesis, which ultimately provides a multimodal multi view framework, and their results on the validation set of nuScenes.

The lower part of Table 7.31 considers the multi view architecture combined with temporal or sensor fusion, both enhancing the results by a similarly large margin. Consequently, all combinations of two frameworks significantly enhance the semantic and panoptic segmentation, underlining the benefits of the proposed combinations. It is worth mentioning that the range view backbone, as part of the range fusion backbone, is no longer frozen but further optimized in these combined setups. This change is required to successfully train the range fusion backbone as part of the multi view and temporal framework. The final row of Table 7.31 shows the overall framework of this thesis, which combines all three contributions and achieves the best results. It provides significantly improved semantic and panoptic segmentation and successfully combines the benefits of the multi view, temporal, and sensor fusion frameworks.

Next, the individual class results for MVNet, T-MVNet, and TSF-MVNet are compared, representing the incremental combinations of the three contributions. The individual class results predominantly benefit from the temporal fusion in the first step and further from the combined temporal and sensor fusion in the second step, which is reflected in the strongly improved overall semantic quality. Especially the semantic segmentation results of thing classes depicted in Table 7.32 are considerably enhanced. But also the already convincing results of MVNet for stuff classes are further improved by temporal and sensor fusion, as shown in Table 7.33. The findings for the class-wise panoptic quality are similar, and all classes benefit from the temporal fusion. In addition, every class but one additionally benefits from the combined temporal and sensor fusion, explaining the great improvement of the overall panoptic quality. As the combination of all proposed contributions, TSF-MVNet provides a significantly enhanced panoptic segmentation for all classes.


**Table 7.32:** Comparison of the class-wise panoptic results for thing classes on the validation set of nuScenes.


**Table 7.33:** Comparison of the class-wise panoptic results for stuff classes on the validation set of nuScenes.

In the next step, several qualitative examples illustrate the benefits but also different errors of the proposed approaches. The first example is depicted in Fig. 7.12, where instance and semantic segmentation are shown. Two types of errors are observable for MVNet. First, it fails to separate all parked cars into individual instances, tagged with (a) and (b). Since the semantic segmentation is correct for these points, the error has its origin in the center and offset predictions. The temporal information exploited by T-MVNet resolves these errors and accurately splits the cars into individual instances. The second error (c) is the wrong semantic segmentation of the construction vehicle on the right, which is confused with the background classes building and vegetation. Even with temporal information, T-MVNet is unable to separate it from the background. Consequently, the entire instance is missed. However, TSF-MVNet succeeds in detecting the construction vehicle based on the additional camera information. Furthermore, it correctly segments the upper part of the hydraulic lift, which is unlabeled in the ground truth. This example underlines that all contributions are valuable and necessary to reduce instance and semantic errors and provide high quality panoptic results.

**Figure 7.12:** Benefits of combining the individual contributions. Temporal and sensor fusion are required to resolve instance (a), (b) and semantic errors (c). An overview of the semantic class colors is depicted in Figs. 7.3 and 7.4.

Since errors in the panoptic segmentation predominantly originate from semantic errors, the next examples focus thereon. Figure 7.13 illustrates the semantic benefits of T-MVNet and its temporal fusion, which correctly classifies the paved center strip. In contrast, MVNet assumes a vegetated center strip, which is considered terrain. While MVNet correctly classifies the center strip in some preceding and subsequent frames, the temporal fusion provides a more robust semantic segmentation without these occasional errors. The example in Fig. 7.13 also illustrates the benefits of sensor fusion for distant objects. The parked pickup, which belongs to the truck class in nuScenes, is covered by only a single row of very few measured lidar points. As a result, MVNet and T-MVNet fail to predict the correct class and confuse it with the car class. In contrast, the additional information of the higher resolution camera image enables TSF-MVNet to segment the pickup truck successfully, despite the sparse lidar information.

The next scenario, depicted in Fig. 7.14, shows the value of temporal information (a), which is required to successfully distinguish between buildings and barriers. The reason is again the increased temporal robustness, which eliminates occasional errors in individual frames. Overall, T-MVNet considerably reduces the semantic errors in this example. TSF-MVNet further improves the segmentation and is able to segment the sidewalk more accurately (b). The camera information helps to distinguish between paved and vegetated flat ground.

The last example shows a failure case of the overall contribution TSF-MVNet. In the scene shown in Fig. 7.15, MVNet provides convincing results but misses the sidewalk between the vegetated terrain. T-MVNet makes similar errors and fails to improve the semantic segmentation. Even TSF-MVNet fails and, in addition, introduces further errors by confusing the bus with a truck. One potential reason is the rainy weather and water on the camera pane. With the limited dataset size, robustness to different weather conditions is difficult to achieve. Consequently, different weather conditions might negatively impact the fusion results in some examples, see Fig. 7.15, whereas in others, such as Fig. 7.14, TSF-MVNet provides its full benefits.

**Figure 7.13:** Temporal information supports the robust segmentation of the ground classes, and the high resolution camera image helps with distant points. The semantic errors across all approaches are shown on the right.

**Figure 7.14:** Semantic segmentation and semantic errors.

**Figure 7.15:** Failure case of the proposed contributions. T-MVNet is not able to improve the segmentation over MVNet, and TSF-MVNet introduces additional errors.

Overall, the conducted experiments confirm that the main contributions of this thesis not only provide their benefits individually but that their combination provides even greater improvements. Several examples illustrate these benefits. All proposed combinations are capable of simultaneously exploiting the benefits of the multi view, temporal, or sensor fusion framework, which allows choosing the combination most suitable for a given use case. The overall combination TSF-MVNet provides excellent results for 3D panoptic segmentation and achieves outstanding absolute improvements of +0.170 and +0.124 over the single view approach RVNet.

## **7.5.2 Comparison to State-of-the-Art**

The temporal and sensor fusion approaches have already been compared to related state-of-the-art approaches. Therefore, the following comprehensive comparison has a broader scope and considers existing 3D panoptic segmentation methods in general, with a focus on the presented multi view framework and combined frameworks. In the first step, the results on the official test set of SemanticKITTI are investigated, which has the highest number of evaluated approaches. Both temporal single view approaches, the multi view approach, and the temporal multi view approach are considered.

The temporal range and bird's eye view frameworks have already been compared to existing temporal approaches for various tasks in Section 7.3.3. In addition, both approaches are more broadly compared to existing panoptic methods in the following. Both achieve convincing results compared to other approaches of their respective view, depicted in Table 7.34. T-RVNet accomplishes compelling results close to the best range view-based approaches while having a significantly lower computational complexity. T-BEVNet provides the best results based on the bird's eye view. As part of the temporal or multimodal multi view architecture, T-RVNet and T-BEVNet are not designed to outperform state-of-the-art approaches standalone. However, the temporal framework can be combined with stronger backbones to achieve this, such as the proposed multi view network MVNet.


**Table 7.34:** Comparison of multiple contributions to the state-of-the-art panoptic segmentation on the test set of SemanticKITTI. The second best results are highlighted in italic.

The multi view network achieves better results than all single view methods, with a few exceptions. EfficientLPS [Sir22] deploys several extensions and additionally uses pseudo labels to improve the panoptic segmentation and to compensate for some disadvantages of the range view. Additionally, it relies on object detection for instance segmentation. As a result, it outperforms the proposed multi view approach by a small margin in terms of panoptic quality. However, it does not achieve the same quality in semantic segmentation, and its computational complexity is nearly three times higher. With additional extensions [Due22], the proposed multi view approach achieves a panoptic quality of 0.588 and then outperforms EfficientLPS in this metric as well. In contrast, the voxel-based GP-S3Net [Raz21a] and the most recent multi view methods SCAN [Xu22] and Pan-PHNet [Li22b] provide better results than MVNet.

Consequently, the three best approaches GP-S3Net, SCAN, and Pan-PHNet are compared more closely to the temporal multi view network, which outperforms all three in panoptic quality. GP-S3Net deploys a complex sparse voxel backbone and a graph neural network with a very high computational complexity. Based on its state-of-the-art semantic backbone AF²-S3Net [Che21d], it achieves the best semantic segmentation. Nevertheless, its panoptic segmentation is considerably outperformed by T-MVNet, with approximately one-third of the computational complexity. SCAN combines the stronger but more expensive sparse voxel view with the point view. This combination achieves similar results as the proposed T-MVNet, which is based on the range and bird's eye view combined with temporal fusion. However, SCAN does not reach the same quality of results on nuScenes, see Table 7.36. Especially its panoptic segmentation thereon drops by a large margin. Finally, Pan-PHNet combines the voxel and bird's eye view and proposes an improved clustering for bottom-up panoptic segmentation. The latter significantly contributes to the accomplished panoptic results. These are similar to those of T-MVNet, which relies on the established but weaker clustering and on stronger features, as confirmed by its superior semantic segmentation.

The nuScenes panoptic challenge was released only recently, and not all approaches evaluated on SemanticKITTI have also been evaluated on nuScenes. Its authors combined all submitted semantic segmentation and object detection approaches, resulting in 1,470 possible panoptic segmentation approaches. The best three are reported as baselines and are shown in rows four to six of Table 7.35, outperforming all 2D single view methods by a large margin. While these are overall strong baselines, they use two independent networks instead of one native panoptic segmentation network. Only Pan-PHNet and the proposed T-MVNet and TSF-MVNet achieve competitive or better panoptic results. While TSF-MVNet achieves the best semantic segmentation, Pan-PHNet provides a better panoptic segmentation. Interestingly, its panoptic quality on the validation set is considerably lower, see Table 7.36, and is more in line with the findings on SemanticKITTI. Unfortunately, the authors provide no explanation for the outlier result on the test set. The outcomes of other approaches, including T-MVNet and TSF-MVNet, differ only slightly between validation and test set.

**Table 7.35:** Comparison of the combined contributions to other methods on the **test set** of nuScenes. Some approaches have been evaluated on the test set after publication in [Fon22].


**Table 7.36:** Comparison on the **validation set** of nuScenes.


Overall, the proposed contributions achieve the best panoptic segmentation on SemanticKITTI, even without sensor fusion. TSF-MVNet achieves the best semantic segmentation among the panoptic approaches on nuScenes, the best panoptic segmentation on the validation set, and the second-best on the test set. Additionally, it outperforms the best combinations of individual semantic segmentation and object detection networks, combined into panoptic approaches. These findings underline the excellent results provided by the combined contributions of this thesis.

# **8 Discussion and Outlook**

The overall contribution of this thesis is the proposed multimodal multi view approach for panoptic segmentation of 3D point clouds based on deep learning. It combines the benefits of the three individual contributions, multi view architecture, temporal fusion, and sensor fusion, which improve the panoptic segmentation based on different aspects. Existing approaches only exploit one of these and cannot simultaneously leverage their potential.

The proposed multi view architecture focuses on the lidar sensor to provide a superior panoptic segmentation for unstructured 3D point clouds with CNNs. Features and context are efficiently aggregated in the 2D range and bird's eye view, and provided to a point view backbone. The latter combines the multi view context and maintains a unique feature vector for every 3D point. These enhanced features are the foundation for improved results over single view approaches, which suffer from the drawbacks of the individual views.

The presented temporal framework focuses on temporal information in point cloud sequences instead of considering point clouds individually. It is based on a recursive temporal fusion of feature maps to exploit temporal dependencies. A temporal memory in range or bird's eye view aggregates and propagates information through time. In every time step, the temporal memory of the previous time step is temporally aligned to compensate for ego motion. Afterwards, it is updated with the information extracted from the latest point cloud. In the end, the memory contains temporally fused features of the current and all previous time steps, which are the basis for improved 3D panoptic segmentation over single frame approaches.

The introduced multi sensor approach focuses on exploiting the camera as an additional sensor modality. It fuses feature maps provided by lidar range view and camera backbones at multiple scales based on two proposed deep fusion strategies. The first one iteratively fuses and refines the multi scale feature maps, whereas the second strategy follows a pyramid-based fusion pattern. The provided multi sensor feature maps considerably improve 3D panoptic segmentation over methods solely relying on lidar.

## **8.1 Discussion**

The **multi view approach** for processing unstructured 3D point clouds with CNNs is the first contribution of this work. Its main novelty is the multi view architecture with a point view backbone connecting 2D backbones for range and bird's eye view and repeatedly aggregating multi view features. Key properties discussed in the following are the chosen views and the aggregation architecture of the point view backbone.

The first architectural property to discuss is the choice of range and bird's eye view. This combination can be interpreted as a separation of the 3D voxel view into two orthogonal 2D views. Consequently, their combination preserves a strong representation of 3D information, despite the underlying 2D projection. The conducted experiments also confirm this combination as highly beneficial since it outperforms the individual range and bird's eye view baselines by a large margin, see Table 7.6. In addition, state-of-the-art range or bird's eye view approaches are also outperformed [Mil20, Hur20, Ayg21, Zho21]. One existing approach [Sir22] achieves a panoptic quality better by 0.006 but a significantly worse semantic segmentation, and its computational complexity is about three times higher. Therefore, the chosen combination of range and bird's eye view achieves high quality results with low computational complexity and enables efficient temporal and sensor fusion. On the other hand, choosing these views is simultaneously a limitation when focusing on the best possible results. The 3D voxel view still contains more information than the combined range and bird's eye views and additionally preserves 3D neighborhoods. In contrast, range and bird's eye view provide distinct and orthogonal 2D neighborhoods, both including points far apart in 3D. While the combination mitigates this drawback to some extent, the voxel view inherently provides the actual 3D neighborhoods. Consequently, the sparse voxel view is the more powerful representation but has a higher computational complexity. However, most recent multi view approaches [Ye21b, Xu22, Li22b] gradually decreased its high computational complexity with the aid of additional views. Therefore, a promising future strategy to exploit 3D neighborhoods for the presented multi view approach is the deployment of a sparse 3D backbone instead of the proposed point view backbone.

The second architectural property to discuss is the novel point view backbone, which connects the range and bird's eye view backbones. The repeated fusion of multi view features at distinct scales provides better results than the different late fusion baselines shown in Table 7.6, existing late fusion panoptic methods [Li22a], and existing early [Ali21, Ger21] and late fusion semantic segmentation approaches [Lio21]. Moreover, the decision against considering neighborhood relations in 3D point clouds based on expensive nearest neighbor search is supported by the improved results over [Lio21, Qiu22]. They rely on this strategy in their final layer and achieve worse results while having a significantly higher computational complexity. One potential reason is that these approaches exploit 3D neighborhoods only at one scale in their final layer. As previously discussed, a more promising strategy to exploit 3D neighborhoods is the sparse voxel view, enabling multi scale 3D feature aggregation.

The second contribution of this thesis is the proposed **temporal approach** for exploiting temporal dependencies to improve 3D panoptic segmentation. Its main novelty is the recursive 2D feature map fusion composed of temporal alignment and update. The resulting novel and unique property is the independence of considered past frames and computational complexity. This property allows the temporal approach to consider past information of a potentially arbitrary number of past frames. Hence, its effectiveness in exploiting temporal information and its computational complexity are discussed in the following.

The conducted experiments underline the effectiveness of the temporal fusion, which considerably improves semantic and panoptic segmentation compared to single frame baselines, see Tables 7.11 and 7.17. Additionally, existing temporal approaches are outperformed across various tasks, such as semantic segmentation [Shi20, Wan22a], panoptic segmentation [Wan22b], dynamic semantic segmentation [Shi20, Sch22b, Wan22a, Han22, Wan22b], and moving object segmentation [Shi20, Che21b, Mer22, Sun22]. Only one superior approach for the latter task exists [Kim22], explicitly designed and optimized solely for this task. These results underline the value of the generic temporal approach, which is not optimized for one specific task but aims for general feature enhancement. It also confirms the value of exploiting a considerable number of past frames, which is further supported by the dedicated experiments shown in Fig. 7.9. Both properties enable superior results across various tasks and provide excellent improvements over the respective single frame baseline. On the other hand, the lack of explicitly considering the movement of other traffic participants is one of the limitations and offers potential for future improvements. While Table 7.12 reveals that these classes still significantly benefit, it limits the potential in scenarios with dense traffic and high velocities of traffic participants. Considering their motion based on 3D flow is a promising future research direction to address this limitation.

The computational complexity of the temporal fusion strongly benefits from the proposed temporal alignment. It enables the reuse of previously computed feature maps in range or bird's eye view and decouples the number of considered past frames from the computational complexity. As a result, the proposed temporal framework has the lowest complexity among existing approaches while simultaneously achieving the discussed superior results across various tasks. Consequently, the decoupling is a valuable property.

As a third contribution, the **multi sensor approach** fuses lidar and camera feature maps to exploit camera information for panoptic segmentation of 3D point clouds. Its main novelties are the iterative and pyramid-based fusion architectures for lidar range view and camera feature maps. The following discussion motivates the choice of the range view as the fusion view and the proposed fusion strategies.

Choosing the range view as fusion view is motivated by the successful combination of RGB and depth information in the image domain [Sil12]. The excellent improvements over the single sensor range view network verify this choice. Other approaches [Zhu21c, Zha21b] also successfully fuse camera and lidar information in this view. On the other hand, several methods for 3D object detection [Phi20, Liu22] successfully transform and fuse feature maps of multiple cameras into bird's eye view. Exploiting this potential for panoptic segmentation would be promising future work, especially considering the multimodal multi view architecture discussed later. Nevertheless, the range view is a convincing choice, confirmed by the achieved enhancements and the low computational complexity of the geometric transformation from camera to lidar range view.
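
The mentioned geometric transformation can be sketched as follows, assuming a pinhole camera model, known intrinsics and extrinsics, and precomputed range view coordinates for every lidar point; the helper name and all details are hypothetical and only illustrate why the transformation is computationally cheap.

```python
import torch
import torch.nn.functional as F


def camera_to_range_view(cam_feat, points, K, T_cam_lidar, rv_uv, rv_shape):
    """Projects lidar points into the camera, samples camera features there and
    scatters them into a range-view-shaped feature map (illustrative sketch).
    cam_feat: (C, H, W), points: (N, 3) lidar points, K: (3, 3) intrinsics,
    T_cam_lidar: (4, 4) extrinsics, rv_uv: (N, 2) long range view pixel of
    every point, rv_shape: (H_rv, W_rv)."""
    C, H, W = cam_feat.shape
    # Transform the points into the camera frame and apply the pinhole model.
    pts_h = torch.cat([points, torch.ones(len(points), 1)], dim=1)   # (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.t())[:3]                          # (3, N)
    uv = K @ pts_cam
    uv = uv[:2] / uv[2:].clamp(min=1e-6)                             # (2, N)
    # Keep only points in front of the camera and inside the image.
    valid = (pts_cam[2] > 0) & (uv[0] >= 0) & (uv[0] < W) & \
            (uv[1] >= 0) & (uv[1] < H)
    # Bilinearly sample camera features at the projected pixel locations.
    grid = torch.stack([uv[0] / (W - 1) * 2 - 1,
                        uv[1] / (H - 1) * 2 - 1], dim=-1)[valid]     # (N_v, 2)
    sampled = F.grid_sample(cam_feat[None], grid.view(1, 1, -1, 2),
                            align_corners=True)[0, :, 0]             # (C, N_v)
    # Scatter the sampled features to the range view pixels of the valid points.
    out = torch.zeros(C, *rv_shape)
    out[:, rv_uv[valid, 1], rv_uv[valid, 0]] = sampled
    return out
```

Since this only involves one matrix multiplication per point and one bilinear sampling, the overhead compared to the lidar-only network remains small.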

The presented multi scale fusion architectures significantly improve panoptic segmentation compared to the single scale fusion baseline (see Table 7.23) and to the early fusion [Elm19, Vor20] or single scale deep fusion [Mey19a] strategies deployed by existing semantic segmentation approaches. Consequently, multi scale fusion is important to leverage the full potential of sensor fusion. Employing these strategies inside an independent fusion branch, combined with the proposed and evaluated training strategy, additionally provides robustness against sensor failure. This crucial property is another advantage over existing approaches. The superior results of the extended pyramid strategy indicate the full potential of sensor fusion, but come at the cost of an over-proportionate computational complexity. Hence, it was not employed in the multimodal multi view architecture. However, these results indicate additional potential, which requires further research for efficient exploitation.
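
As an illustration, the sketch below shows an independent multi scale fusion branch in which lidar and camera range view features are fused at every scale, and where the camera input can be dropped to emulate a sensor failure. The dropout flag stands in for one plausible form of the robustness training strategy and, like all names and layer choices, is an assumption rather than the exact thesis design.

```python
import torch
import torch.nn as nn


class MultiScaleFusionBranch(nn.Module):
    """Independent fusion branch: fuses lidar and camera range view features at
    several scales while leaving the lidar-only branch untouched."""

    def __init__(self, channels_per_scale):
        super().__init__()
        self.fuse = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(2 * c, c, kernel_size=3, padding=1),
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True),
            )
            for c in channels_per_scale
        ])

    def forward(self, lidar_feats, cam_feats, drop_camera=False):
        # lidar_feats / cam_feats: lists of (B, C_s, H_s, W_s) maps, one per scale.
        # drop_camera emulates a failed camera; randomly enabling it during
        # training is one way to make the fusion robust against missing input.
        fused = []
        for fuse, f_lidar, f_cam in zip(self.fuse, lidar_feats, cam_feats):
            if drop_camera:
                f_cam = torch.zeros_like(f_cam)
            fused.append(fuse(torch.cat([f_lidar, f_cam], dim=1)))
        return fused
```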

The presented **multimodal multi view** architecture is the overall contribution of this thesis and combines all individual contributions into a unified architecture. A careful and aligned design of the individual approaches and contributions is the foundation of their successful combination. The temporal approach is applicable to 2D lidar views and is thereby an excellent extension to the multi view framework with its 2D views and backbones. On the other hand, the sensor fusion approach seamlessly replaces the lidar range view backbone. The presented architecture can *simultaneously* exploit all three individual contributions, as shown by the experiments in Table 7.31. Furthermore, it outperforms all existing panoptic approaches on SemanticKITTI and all approaches on nuScenes when considering semantic segmentation, see Tables 7.34 and 7.35. To the best of the author's knowledge, no approaches exist that combine these three key technologies for dense prediction tasks, such as 3D panoptic segmentation. The only other methods combining temporal and sensor fusion, 4D-Net [Pie21] and LIFT [Zen22], tackle the task of 3D object detection. However, both are single view approaches due to the targeted task of object detection, where multi view approaches are less beneficial and rather uncommon. Additional significant differences are that 4D-Net performs early fusion for its temporal fusion, and LIFT relies on attention for both sensor and temporal fusion.

One additional advantage of the proposed approach is its low computational complexity. It requires approximately 100 ms to predict a 3D panoptic segmentation for the entire 360° environment with temporal fusion and sensor fusion of *six* cameras. This inference time is considered real-time for a common lidar recording frequency of 10 Hz. Admittedly, no server-grade GPU will be available when the approach is deployed in an autonomous vehicle or robot. However, the runtime was measured in single precision using float32. Switching to half precision based on float16 halves the inference time without notably affecting the results. A further significant reduction can be achieved with int8 quantization and quantization-aware training. Although this depends on the available hardware, these optimizations allow embedded real-time deployment.
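
The float16 comparison can be reproduced with a simple timing routine like the following sketch, which assumes a CUDA GPU and a generic `model` with tensor inputs (all placeholders, not the thesis code).

```python
import time
import torch


def measure_inference_time(model, example_inputs, half=False, runs=50):
    """Measures the average forward pass time in milliseconds on a CUDA GPU."""
    model = model.cuda().eval()
    inputs = [x.cuda() for x in example_inputs]
    if half:  # switch weights and floating point inputs to float16
        model = model.half()
        inputs = [x.half() if x.is_floating_point() else x for x in inputs]
    with torch.no_grad():
        for _ in range(10):                    # warm-up iterations
            model(*inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(*inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```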

While the combined architecture provides excellent results with low computational complexity, it also comes with some challenges and limitations. The multi view approach with its three views complicates temporal and sensor fusion since it is challenging to let all views benefit. This is convincingly solved for temporal fusion, but requires the memory in both 2D views to cover range and bird's eye view feature maps. Sensor fusion, on the other hand, is only integrated into the range view branch, where it achieves great improvements. However, it does not affect the instance feature maps of the bird's eye view branch and thus potentially misses the full capabilities of sensor fusion. Therefore, fusing camera and lidar in bird's eye view, as a replacement for or in addition to the range view fusion, might provide a further enhanced 3D panoptic segmentation. At first glance, performing temporal and sensor fusion once in the point view resolves these drawbacks. However, this strategy would again not improve the bird's eye view feature maps for the instance predictions. Furthermore, the temporal experiments with GRUs revealed that context aggregation after temporal fusion is crucial, which is missing in the proposed point view backbone. One possibility to address these limitations is a bidirectional information flow, which propagates not only range and bird's eye view features to the point view, but also the fused multi view features back to the 2D views. As a result, the bird's eye view and instance feature maps would benefit from sensor fusion in range view and vice versa.
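
The "backward" direction of such a bidirectional flow could, for instance, scatter the fused per-point features back into a 2D view by averaging all points that fall into the same cell, as in the following sketch (hypothetical helper, assuming integer view coordinates per point).

```python
import torch


def points_to_view(point_feat, uv, view_shape):
    """Scatters per-point features back into a 2D view; cells that receive
    several points are averaged, empty cells stay zero (illustrative sketch).
    point_feat: (N, C), uv: (N, 2) long pixel coordinates, view_shape: (H, W)."""
    n_points, channels = point_feat.shape
    height, width = view_shape
    flat_idx = uv[:, 1] * width + uv[:, 0]                         # (N,)
    summed = torch.zeros(height * width, channels).index_add_(
        0, flat_idx, point_feat)
    counts = torch.zeros(height * width).index_add_(
        0, flat_idx, torch.ones(n_points))
    mean = summed / counts.clamp(min=1).unsqueeze(1)
    return mean.t().reshape(channels, height, width)
```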

When considering the **overall results** achieved by the multimodal multi view approach, there is still a considerable gap to a PQ or mIoU of 1.0, despite exploiting temporal and sensor fusion. Likewise, other state-of-the-art methods are unable to close this gap. Two underlying reasons can be identified that considerably impact the panoptic results.

The first reason is the imperfection of the datasets, which suffer from label errors and considerable class imbalance. Label errors negatively impact the learning of the underlying concepts and lead to errors in the evaluation and metrics. Class imbalance complicates the learning of the underlying concepts for rare classes since only a few examples are present in the training set. Furthermore, vaguely defined semantic classes with high intra-class variance, such as other-vehicle or other-ground, are very hard to learn. These factors have a larger impact on large and meaningful test sets with many semantic classes, such as SemanticKITTI. For smaller test sets with fewer semantic classes, such as nuScenes, higher values for PQ and mIoU are achieved. Since these factors are independent of the chosen approach, they cannot be significantly mitigated by temporal or sensor fusion.

The second reason is the lidar sensor itself with its varying sparsity. Lidar point clouds become increasingly sparse with growing distance to the sensor. Consequently, distant points have only a few or no other points in their immediate neighborhood. However, classifying an individual point without context is hardly possible. In addition, distant instances of thing classes are often covered by very few points and must be recognized based on these. Temporal and sensor fusion can mitigate these effects to some extent. However, temporal fusion cannot provide benefits at far distances in the driving direction since no past measurements exist in this area. While cameras with their high resolution are very valuable, the information they provide about the environment also decreases with increasing distance.

Overall, large and diverse datasets containing many instances and points for all classes as well as high quality labels are the most promising step towards considerably higher PQ and mIoU. Furthermore, higher lidar resolution and improved approaches can further contribute towards metrics closer to the maximum.

## **8.2 Outlook**

Although the proposed multimodal multi view approach achieves excellent results, future research has the potential to further enhance the proposed architecture and the achieved panoptic results. Some of the proposed research directions have already been identified in the previous discussion and comprise all three contributions as well as the instance clustering:


One of these directions is to replace the point view with the sparse voxel view, which is currently prevented by the computational complexity drawback of the sparse voxel view. With future research to further reduce this complexity, the replacement has the potential to provide an enhanced panoptic segmentation while maintaining a low computational complexity.


When aiming for an improved panoptic segmentation, there is often a trade-off between improved results and computational complexity. If a low computational complexity is important, research in the mentioned directions can focus on finding efficient extensions. Otherwise, research can focus on fully exploiting the respective potential to achieve greater improvements for 3D panoptic segmentation.

# **Bibliography**




Object Detection". In: *2022 International Joint Conference on Artificial Intelligence (IJCAI)* (2022), pp. 827–833.


*Conference on Computer Vision and Pattern Recognition (CVPR)*. 2020, pp. 9028–9037.




[Sia17] SIAM, Mennatullah; VALIPOUR, Sepehr; JAGERSAND, Martin and RAY, Nilanjan: "Convolutional Gated Recurrent Networks for Video Segmentation". In: *2017 IEEE International Conference on Image Processing (ICIP)*. 2017, pp. 3090–3094.

[Sil12] SILBERMAN, Nathan; HOIEM, Derek; KOHLI, Pushmeet and FERGUS, Rob: "Indoor Segmentation and Support Inference from RGBD Images". In: *2012 European Conference on Computer Vision (ECCV)*. 2012, pp. 746–760.

[Sim14] SIMONYAN, Karen and ZISSERMAN, Andrew: "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: *2014 International Conference on Learning Representations (ICLR)*. 2014.

[Sim19] SIMON, Martin; MILZ, Stefan; AMENDE, Karl and GROSS, Horst-Michael: "Complex-YOLO: An Euler-Region-Proposal for Real-Time 3D Object Detection on Point Clouds". In: *2019 European Conference on Computer Vision Workshops (ECCVW)*. 2019, pp. 197–209.

[Sin15] SINGH, Santokh: Critical Reasons for Crashes Investigated in the National Motor Vehicle Crash Causation Survey. Tech. rep. 2015.

[Sin19] SINDAGI, Vishwanath A.; ZHOU, Yin and TUZEL, Oncel: "MVX-Net: Multimodal VoxelNet for 3D Object Detection". In: *2019 IEEE International Conference on Robotics and Automation (ICRA)*. 2019, pp. 7276–7282.

[Sir22] SIROHI, Kshitij; MOHAN, Rohit; BÜSCHER, Daniel; BURGARD, Wolfram and VALADA, Abhinav: "EfficientLPS: Efficient LiDAR Panoptic Segmentation". In: *IEEE Transactions on Robotics* 38.3 (2022), pp. 1894–1914.

[Smu07] SMUCKER, Mark D.; ALLAN, James and CARTERETTE, Ben: "A Comparison of Statistical Significance Tests for Information Retrieval Evaluation". In: *2007 ACM International Conference on Information and Knowledge Management (CIKM)*. 2007, pp. 623–632.




# **Publications**


# **Acronyms**



**VX** voxel view

## **Karlsruher Schriftenreihe zur Anthropomatik (ISSN 1863-6489)**



**Band 63** Fabian Dürr **Multimodal Panoptic Segmentation of 3D Point Clouds.** ISBN 978-3-7315-1314-8

Lehrstuhl für Interaktive Echtzeitsysteme Karlsruher Institut für Technologie

Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung IOSB Karlsruhe

The understanding and interpretation of complex 3D environments is one of the main challenges of autonomous driving. Lidar sensors and their recorded point clouds are particularly interesting for this challenge of 3D scene understanding since they provide accurate 3D information about the current environment. An essential task in this context is panoptic segmentation, which enhances every 3D point with semantic and instance information. However, the unstructured and sparse nature of 3D point clouds requires novel approaches and algorithms to achieve high quality and robust results. The objective of this work is a multimodal approach based on deep learning for 3D panoptic segmentation. It builds upon and combines three key aspects: a multi view point cloud architecture, temporal feature fusion, and deep sensor fusion. Extensive experiments on two large scale datasets show the benefits of the multimodal framework, which outperforms the state of the art on various benchmarks.

ISSN 1863-6489 ISBN 978-3-7315-1314-8
