Benchmarking the Robustness of Panoptic Segmentation for Automated Driving

Yiting Wang, Haonan Zhao, Daniel Gummadi, Mehrdad Dianati, Kurt Debattista and Valentina Donzella

Funded by the European Union (grant no. 101069576). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Climate, Infrastructure and Environment Executive Agency (CINEA). Neither the European Union nor the granting authority can be held responsible for them. UK and Swiss participants in this project are supported by Innovate UK (contract no. 10045139) and the Swiss State Secretariat for Education, Research and Innovation (contract no. 22.00123) respectively. The work was partially supported by High Value Manufacturing CATAPULT. This research is partially sponsored by the Centre for Doctoral Training to Advance the Deployment of Future Mobility Technologies (CDT FMT) at the University of Warwick.

Yiting Wang, Haonan Zhao, Daniel Gummadi, Mehrdad Dianati, Kurt Debattista, and Valentina Donzella are with WMG, University of Warwick, Coventry CV4 7AL, UK. Corresponding author: yiting.wang.1@warwick.ac.uk

Abstract

Precise situational awareness is required for the safe decision-making of assisted and automated driving (AAD) functions. Panoptic segmentation is a promising perception technique to identify and categorise objects, impending hazards, and driveable space at a pixel level. While segmentation quality is generally associated with the quality of the camera data, a comprehensive understanding and modelling of this relationship are paramount for AAD system designers. Motivated by such a need, this work proposes a unifying pipeline to assess the robustness of panoptic segmentation models for AAD, correlating it with traditional image quality. The first step of the proposed pipeline involves generating degraded camera data that reflects real-world noise factors. To this end, 19 noise factors have been identified and implemented with 3 severity levels. Of these factors, this work proposes novel models for unfavourable light and snow. After applying the degradation models, three state-of-the-art CNN- and vision transformer (ViT)-based panoptic segmentation networks are used to analyse their robustness. The variations of the segmentation performance are then correlated to 8 selected image quality metrics. This research reveals that: 1) certain noise factors, i.e. droplets on lens and Gaussian noise, produce the highest impact on panoptic segmentation; 2) the ViT-based panoptic segmentation backbones show better robustness to the considered noise factors; 3) some image quality metrics (i.e. LPIPS and CW-SSIM) correlate strongly with panoptic segmentation performance and can therefore be used as predictive metrics for network performance. The benchmark and code will be made available at http://

Index Terms:

Automated driving, panoptic segmentation robustness, automotive image quality, noise factors.

I Introduction

Automotive camera data is commonly used in assisted and automated driving (AAD) systems to sense and interpret the vehicle’s surroundings through the use of perception algorithms. The quality of the data and its relationship with perception algorithms’ performance can have strong implications on the safety-critical decision-making processes used to navigate the environment.

Figure 1: Visual examples of the newly proposed Degraded-Cityscapes plus (D-Cityscapes+) with 19 types of degradation. From top to bottom, they are categorised as unfavourable light, adverse weather, internal sensor noises, motion blur and distortion artefacts.

Amongst various perception tasks, panoptic segmentation helps the automated vehicle identify countable objects (e.g. cars and pedestrians) in addition to background pixel categories (e.g. sky and grass) by segmenting pixels at the instance and semantic levels [1]. This granularity of information is essential for making well-informed and precise decisions and for responding to potential hazards in AAD functions. Neither object detection, semantic segmentation, nor instance segmentation alone can provide such detailed scene information [1]. Panoptic segmentation produces accurate information on the unique shape and boundaries of each object via the classification of different instances of cars, pedestrians, bikes, etc. Despite the advantages of this technique, various data degradation factors in the real world might degrade its performance (see Fig. 1). Therefore, for the safety of AAD functions, it is essential to investigate panoptic segmentation robustness under degraded camera data.

Many researchers have investigated the robustness of perception tasks, such as object detection and semantic segmentation, using degraded sensor data. However, there is limited work that thoroughly considers camera degradation factors and their implications for AAD panoptic segmentation [2, 3, 4, 5]. In addition, there is a lack of paired datasets covering a diversity of degradation factors [6, 7, 8]. For instance, previous robustness research disregarded dark scenarios, which are thought to be one of the primary causes of accidents [9, 10, 11, 12, 8, 3]. Due to the difficulty of capturing degraded data in the real world, mainstream robustness research uses synthetic data, and these generated datasets might have qualitatively unconvincing results (see Fig. 3) [9]. There is a lack of systematic research considering the robustness of panoptic segmentation specifically for automated driving systems.

To address the above-mentioned challenges, this work proposes a unified degradation impact pipeline, including 19 camera noise models, perception via panoptic segmentation, and result evaluation correlating image quality with perception quality (see Fig. 2). Firstly, we propose an enhanced version of the previously degraded dataset from [13]. Then, a realistic synthetic degraded driving dataset called D-Cityscapes+ is introduced, which includes more noise factors, with varying models of fog and rain synthesis, and improved snow and light modelling. We also address the uneven distribution of degraded frames, i.e. our dataset contains the same number of degraded images under all degradation types. This work also proposes three types of generation models for unfavourable dark light conditions: low light, night light and extreme light (see Tab. III). Three state-of-the-art panoptic segmentation models are selected in this work, using six different architectures based on convolutional neural network (CNN) and ViT backbones, to analyse their robustness to degraded data. Furthermore, the degraded data are evaluated in terms of eight chosen image quality indexes, panoptic quality, and the correlation between them.

TABLE I: Comparison of related methods in terms of perception task, dataset used, diversity of the degradation factors (sensor noises (Noise), image signal processing (ISP), compression (Jpeg), low light, mud drop (Mud)), the consideration of severity levels (Severity), the correlation study between synthetic image quality and perception performance (Correlate), and whether the work targets AAD (AAD).

Paper	Perception Task	Dataset	Noise	ISP	Jpeg	lowlight	Mud	Severity	Correlate	AAD
[10] CVPR’2020	image classification	synthetic (paired)	✓	x	✓	x	x	x	x	x
[11] IJCV’2021	semantic segmentation	synthetic (paired)	✓	x	✓	x	x	x	x	x
[12] T-ITS’2021	image restoration	synthetic (paired)	✓	x	✓	x	x	x	x	✓
[8] T-DSC’2022	object detection	synthetic (paired)	✓	✓	x	x	x	✓	x	✓
[14] CVPR’2022	panoptic segmentation	real-world (unpaired)	x	x	x	✓	x	✓	x	✓
[15] T-IV’2022	3D object detection	real-world (unpaired)	x	x	x	✓	x	x	x	✓
[3] CVPR’2023	3D object detection	synthetic (paired)	✓	x	x	x	x	✓	x	✓
[16] IJCV’2024	depth estimation	real-world (unpaired)	✓	x	✓	x	x	✓	x	x
Ours	panoptic segmentation	synthetic (paired)	✓	✓	✓	✓	✓	✓	✓	✓

This research is the first work to qualify and quantify the effect of 19 camera data degradation factors on panoptic segmentation for AAD systems. The main contributions of this work are: (I) a systematic benchmarking of the robustness of the CNN- and ViT-based panoptic segmentation architectures; (II) a new augmented dataset (D-Cityscapes+) with 19 types of degradation (47 types considering the severity levels) to boost future robustness research for AAD; (III) new noise models for unfavourable light and snow; (IV) a correlation of panoptic segmentation robustness with image quality metrics.

II Related Work

This section introduces relevant work regarding the impact of automotive camera noise models, driving datasets embedding degradation factors, and panoptic segmentation.

II-A Impact of Camera Data Degradation

Automotive camera noise factors are a widely investigated topic [17]. For example, Ceccarelli et al. discuss the common camera failures during the imaging process (i.e. lens, camera body, Bayer filter, image sensor, ISP) using the FMEA method and giving quantitative analysis via object detection [8]. Dong et al. simulated common corruptions in cameras and Lidar for 3D object detection in autonomous driving [3]. These researchers simulate the degraded paired datasets via simple picture editing that is not designed specifically for automotive cameras, leading to unsatisfactory fidelity for automotive applications, see Fig. 3.

A summary of recent research on degraded camera data and the relationship with different perception tasks is given in Tab. I. As indicated, there are a few works related to perception tasks specifically in automotive, such as image classification, object detection, and segmentation [10, 3, 15, 4, 5, 16]. Most of these works focus on natural-looking images; for example, 15 types of corruption are synthesised in the ImageNet-C dataset [2]. As the robustness of perception tasks is crucial for the safety of real-world applications, research is also emerging on the robustness of automotive camera data, especially under adverse weather [22, 7, 6]. For instance, Wang et al. generated 11 types of image degradation and tested their impact on one panoptic segmentation model [13]. Differently from related previous work, this paper: 1) considers a wider range of noise factors, including unfavourable light conditions and pollutant particles (mud and stain) on the camera lens; 2) uses state-of-the-art noise models that are designed specifically for automotive applications to reduce the simulation to real-world (Sim2Real) gap for fairer robustness validation; 3) analyses qualitatively and quantitatively the effect of the noise models at a pixel level, including semantic and instance meaning.

II-B Driving Datasets embedding Degradation Factors

Various degradation-driven benchmarking datasets are available to facilitate robustness research and to improve algorithm performance under degradation conditions; they can be divided into real-world captured datasets and synthetic augmented datasets. Many real-world datasets are captured with labels for specific tasks. For example, the WildDash1 and WildDash2 datasets are used for benchmarking hazardous conditions for semantic segmentation and panoptic segmentation, respectively [23, 14]. The ACDC dataset is captured under night, foggy, rainy and snowy conditions [24]. The BDD100K dataset contains additional conditions under dusk, overcast and cloudy weather [25].

Limitations of existing real-world driving datasets. 1) Lack of labels and paired data for panoptic segmentation under various noise factors. 2) Uneven distributions of noise factors, resulting in reduced generality and reliability, e.g. in BDD100K only 0.2% of the images are collected in foggy conditions [25]. 3) Scarcity of extreme degradation levels [14, 26]. As an alternative, image synthesis methods have been proposed that use physical models [27, 28, 19, 29, 21], deep learning-based methods [20, 30, 26, 21] and virtual environments [31, 32, 33] to generate camera datasets including augmented degraded camera images. Except for some off-the-shelf generation models, challenges exist in generating datasets with adequate quality for adverse weather and unfavourable light conditions. Platforms such as CARLA may support generating various environmental conditions using embedded functions, but their image fidelity for automotive camera degradation is very limited compared with real-world captured data. Traditional techniques, on the other hand, involve complicated models and hand-crafted parameter formulations. Deep learning (DL)-based algorithms, which offer more flexibility, are dataset-dependent and lack simple ways to regulate the severity of degradation. Additionally, the performance of DL-based approaches, particularly GAN-based methods [34], is inconsistent in some cases (e.g. failing to preserve structural integrity). Therefore, for the degraded data generation in this research, a combination of both traditional and DL-based approaches is selected, with the aim of generating a degraded dataset with better quality.

II-C Panoptic Segmentation

State-of-the-art DL-based panoptic segmentation models can be divided into top-down, bottom-up, single-path, and other techniques [35, 36, 37, 38, 39, 40, 41, 42, 43]. Well-established methods fuse instance and semantic segmentation information with existing popular object detectors. For example, feature pyramid networks are used in Panoptic FPN and UPSNet to improve multiscale feature learning ability [35, 36]. The lightweight EfficientNet is used in EfficientPS to reduce the number of parameters [37]. Atrous (dilated) convolutions are used in Panoptic Deeplab and DeeperLab to capture multiscale context information [38, 39]. In addition to the convolutional neural network (CNN)-based architectures, ViT-based architectures have recently been proposed with improved performance, benefiting from self-attention mechanisms and long-range correlation learning ability [41, 44, 43, 42, 45, 46]. For example, OneFormer proposes a universal image segmentation approach that achieves better performance on multiple tasks (e.g. semantic, instance and panoptic segmentation) by training a single model on a single dataset, whereas existing methods require separate resources to train each task [42]. In this research, we evaluate the impact of different degradation factors on panoptic segmentation using both CNN-based and ViT-based architectures with different backbones to compare their robustness.

III Analysis of Selected Noise Factors

In the context of this paper, a degradation factor is defined as any external (e.g. adverse environmental conditions) or internal (e.g. electronic noise) factor that may influence the quality of the generated sensor data, with a specific focus on the effects of such factors on the perception algorithms’ performance for automated driving. Reliable evaluation of image quality and sensor perception robustness necessitates a comprehensive assessment of noise models. Using the P-diagram as a guiding tool, four categories of noise factors from the existing five are selected, i.e. piece-to-piece, change over time, usage, and environment [17]. The fifth, system interactions, is considered out of the scope of this paper. Within these 4 groups, a list of seven sub-categories is identified, as detailed in the following section (see Fig. 2), and a total of 19 noise factors are selected for this research (see Tab. II for the implementation details). The selected sub-categories (cause factor IDs) cover the noise factors that are currently most commonly seen to impact perception in AAD; from these, the 19 specific noise factors that can be synthetically rendered are selected.

Off-the-shelf Models. Many common image noise models have been recently implemented based on relatively mature theoretical physical models; therefore, some well-established off-the-shelf models have been used to generate some of the identified noise factors (cause factor ID = 1, 4, 5, 6): strong light [47], droplets on lens [8], JPEG compression [18], over-sharpening [18], no Bayer filter [8], internal sensor noises [18], motion blur [18] and unfocus blur [18] (see supp. materials for details). Apart from these, noise factors that are designed specifically for automotive applications, namely mud obstructions on the camera lens and No Demosaic (synthetic Bayer data) (cause factor ID = 3, 4), are also chosen in this application [17, 48]. As for the weather conditions (cause factor ID = 2), since there are several augmented multi-weather Cityscapes datasets available, we choose the most suitable ones for the rainy and foggy conditions based on the reasons listed below [26, 29, 20, 21, 27, 19]. 1) Rain Model. Rain Cityscapes follows the physical rules considering the depth information with a rain layer and a fog layer [27]. However, the method fails to consider the photometry of the rain streaks or the fact that the droplets on the lens will need a larger field of view from cameras. A more recent rain rendering method [19] has the advantage of pre-defining the required rain rate (mm/hr) and producing more vivid rain streaks for generating realistic rainy images; it is therefore chosen as the rain model. 2) Fog Model. Foggy Cityscapes [20] uses the scattering model to augment the foggy images. As the original depth map provided with the dataset is incomplete and discontinuous (with random “holes” that are missing depth values), depth denoising, completion, and a guided filter are used to obtain the final transmission map [20]. However, the simple filter method fails to capture the boundaries between different semantic objects, which leads to invalid depth guidance in the simulated fog physical model. An improved version, Foggy Cityscapes-DBF, was therefore proposed [29] with a dual-reference cross-bilateral filter for better adherence to semantic boundaries in the scene, and it is hence chosen as the fog model in this research.
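The scattering model behind the fog synthesis can be sketched as follows. This is a minimal NumPy illustration of the standard atmospheric scattering formulation (image attenuated by a transmission map exp(-beta d) and blended with airlight), not the Foggy Cityscapes-DBF implementation, which additionally refines the transmission map with the dual-reference cross-bilateral filter. The function names, the constant `airlight` value and the 5% contrast threshold used for the visibility helper are assumptions made for illustration.

```python
import numpy as np

def add_fog(clean, depth_m, beta, airlight=0.9):
    """Blend a clean image with a constant airlight using t = exp(-beta * d).

    clean    : float array in [0, 1], shape (H, W, 3)
    depth_m  : per-pixel distance to the camera in metres, shape (H, W)
    beta     : attenuation coefficient in 1/m
    airlight : assumed constant atmospheric light (hypothetical value)
    """
    transmission = np.exp(-beta * depth_m)[..., None]  # shape (H, W, 1)
    return clean * transmission + airlight * (1.0 - transmission)

def visibility_m(beta):
    """Visibility range for an assumed 5% contrast threshold: ln(20) / beta."""
    return np.log(20.0) / beta
```

Under this assumed relation, the three attenuation coefficients listed in Tab. II (0.005, 0.01, 0.02) map to roughly 599 m, 300 m and 150 m, close to the visibility ranges given there.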

The implementation details are shown in Tab. II. The following sections explain the implementation of each of the 19 considered noise factors, including a more detailed description of the newly introduced noise models related to unfavourable light (cause factor ID = 1) and the improved snow model (cause factor ID = 2).

TABLE II: Details in terms of the category ID, degradation factor names, synthetic implementation steps and configurations for the selected noise factors.

ID	Degradation	Implementation Details	Configs.
1	Low light	Apply the inverse of the DL-based curve estimation image enhancement method (EC-Zero-DCE [49])	The pre-trained model from EC-Zero-DCE [49]
1	Night light	Retrain the CycleGAN network [34] with the BDD100K dataset [25]	PyTorch with 2k epochs for training
1	Extreme light	Apply the combination of the model of CycleGAN [34] + EC-Zero-DCE [49]	Apply low-light model effect into the nightlight model
1	Strong light	Apply the Python package from Imagecorruptions [47]	Pre-defined brightness corruptions severity levels = {1,3,5}
2	Rainy	Apply rain rendering method [19] with a physical simulator and accurate rain photometric modelling	Three pre-defined severity levels with rain rate (mm/hr) = 50,100,200
2	Foggy	Apply Foggy-DBF datasets with atmospheric scattering model from [20]	The values of the attenuation coefficient = 0.005, 0.01, 0.02 with the visibility ranges(m) = {600,300,150}
2	Snowy	Improve the Snow Cityscapes [21] by 1) synthesising different snow masks on all 500 cityscapes validation images 2) adding the veiling effect [50]	Snow mask originally generated using Photoshop from [21] and divided into small, medium, large
3	Mud obstruction	Apply the mud model from [17] with generated random mud masks and cv2 modules to add the mud and stain into the images	Three levels of severity with kernel sizes {12, 24, 36} and intensity 0.7
3	Droplets on lens	Apply the PIL library to blend the raindrops into the image to simulate droplets on the camera lens	The raindrop masks from [8] with severity level = {2,3,4}
4	Compression	Apply imgaug.augmenters.arithmetic.JpegCompression(strength) in the imgaug library [18]	Three different compression levels = {20, 50, 80} corresponding to three JPEG compression rates {74.94%, 58.80%, 42.20%}
4	Over-sharpening	Apply imgaug.augmenters.Sharpen(alpha, lightness) in the imgaug library [18]	Three levels of severity with alpha ={0.25,0.5,0.75}
4	No Demosaic	Map RGB into RGGB pattern with Bayer colour-filled [8]	Three channel same size (h, w)
4	No Bayer filter	Apply RGB-to-grayscale conversion with NumPy arrays	The scaling factors = {0.2989, 0.5870, 0.1140}
5	Gaussian noise	Apply imgaug.augmenters.imgcorruptlike.GaussianNoise() in the imgaug library [18]	Three pre-defined levels of severity ={1,3,5}
5	Uniform noise	Apply a mathematical model with cv2 and NumPy to generate randomly distributed noise following [18]	Three severity levels with mean value = 0 and standard deviation = {25, 50, 75}
5	Impulse noise	Apply imgaug.augmenters.arithmetic.ImpulseNoise(severity) in [18]	Three pre-defined levels of severity ={1,3,5}
5	Poisson noise	Apply imgaug.augmenters.arithmetic.AdditivePoissonNoise(lambda) in [18]	Three levels of severity with lambda ={5,10,15}
6	Unfocus Blur	Apply imgaug.augmenters.imgcorruptlike.DefocusBlur() from [18]	Three pre-defined levels of severity ={1,3,5}
6	Motion blur	imgaug.augmenters.imgcorruptlike.apply_motion_blur() [18]	Three pre-defined levels of severity ={1,3,5}
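
Two of the simpler entries in Tab. II can be illustrated directly in NumPy. The sketch below is not the authors' code: the function names are hypothetical, replicating the grey channel over three channels is an assumption, and the uniform noise is drawn from a symmetric interval whose half-width is chosen so that the noise has zero mean and the requested standard deviation.

```python
import numpy as np

# R, G, B scaling factors listed for "No Bayer filter" in Tab. II
GRAY_WEIGHTS = np.array([0.2989, 0.5870, 0.1140])

def remove_bayer_filter(rgb):
    """Collapse a uint8 RGB image to luminance (replicated over 3 channels, an assumption)."""
    gray = rgb.astype(np.float64) @ GRAY_WEIGHTS            # shape (H, W)
    gray = np.clip(np.rint(gray), 0, 255).astype(np.uint8)
    return np.repeat(gray[..., None], 3, axis=2)

def add_uniform_noise(rgb, std, rng=None):
    """Add zero-mean uniform noise with the requested standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    half_width = std * np.sqrt(3.0)                          # U(-a, a) has std a / sqrt(3)
    noise = rng.uniform(-half_width, half_width, size=rgb.shape)
    noisy = np.rint(rgb.astype(np.float64) + noise)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

With the standard deviations of Tab. II, `add_uniform_noise(image, 25)`, `add_uniform_noise(image, 50)` and `add_uniform_noise(image, 75)` would correspond to the three severity levels; clipping to [0, 255] slightly reduces the effective spread for very dark or very bright pixels.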

III-A Identified Degradation Factors

1. Unfavourable Light Condition. There is a wide range of factors that could cause brightness variations in automotive camera images, such as the different times of the day, the inconsistency of the exposure, the dynamic range of the camera sensors and imperfection of the lens [51, 49, 8]. Moreover, there exist complex light conditions for real-world AAD, especially at nighttime with multiple light sources such as headlights, streetlights, etc. In the literature, there is no unified definition of unfavourable light conditions; therefore, this work proposes an intuitive definition of four types of unfavourable light conditions: extreme light, night light, low light and strong light (from darker to brighter), as shown in Tab. III. Each of these lighting conditions is treated separately and with three levels of severity in our study.

TABLE III: Proposed definition of unfavourable light, including low-light, night light and extreme light, classified by illumination level and glare intensity.

Low Light	Night Light	Extreme Light
Uniform darkening of images (without glare and flare)	Nighttime urban driving area with vibrant street lights (with glare and flare)	Darker illumination compared to low light and night light

Novel unfavourable light models. In the context of diverse light levels encountered in real-world AAD scenarios, especially during the nighttime with the presence of multiple light sources, we introduce a unifying definition of unfavourable light (see Tab. III) and a novel approach tailored to each specific lighting condition. This distinct categorization ensures a nuanced and accurate representation of various real-world lighting conditions (see Fig. 1). The complex real-world conditions not only result in unfavourable illumination, limiting the camera’s ability to capture detailed information, but also critically impact perception accuracy, as underscored in prior research [52]. This paper abandons the commonly used gamma correction [53], which is effective in generating different brightness for indoor static objects but inadequate for simulating real-world dark conditions characterized by uneven light distribution and pixel saturation. Therefore, the low-light images are generated via $I^{1}_{low}=EZ_{dce}(I,\theta)$ to keep the saturated pixels while darkening the other areas. $EZ_{dce}$ is the reversed curve estimation-based image darkening method named EC-Zero-DCE [49], and $\theta$ denotes the parameters of the model. For nighttime images, the darkening process alone cannot produce the flare and glare features that are invariably present in nighttime photographs. Thus, we generate the night light images $N$ using CycleGAN ($I^{1}_{night}=GAN_{cyc}(I)$) with the cycle consistency loss function $L_{cyc}(G,F)=E_{I\sim p_{data}(I)}\|F(G(I))-I\|_{1}+E_{N\sim p_{data}(N)}\|G(F(N))-N\|_{1}$ [34]. $G()$ and $F()$ are the generators for clean-to-night and night-to-clean, respectively. Furthermore, the night light model is insufficient for extremely dark images, since a small percentage of the generated images retain lighting that is almost unchanged from daylight. Therefore, extreme light images are generated by compounding the above two models with the equation:

I^{1}_{extreme}=EZ_{dce}(GAN_{cyc}(I),\theta)
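
The structure of the cycle consistency loss and of the compound extreme light model can be sketched in a few lines. This is only an illustration under strong assumptions: `G`, `F`, `to_night` and `darken` are hypothetical callables standing in for the trained CycleGAN generators and the reversed EC-Zero-DCE model, and the expectations are approximated by batch averages.

```python
import numpy as np

def _mean_l1(a, b):
    """L1 norm per sample (sum over all non-batch axes), averaged over the batch."""
    per_sample = np.abs(a - b).reshape(len(a), -1).sum(axis=1)
    return per_sample.mean()

def cycle_consistency_loss(G, F, clean_batch, night_batch):
    """E||F(G(I)) - I||_1 + E||G(F(N)) - N||_1, estimated on two batches."""
    forward = _mean_l1(F(G(clean_batch)), clean_batch)   # clean -> night -> clean
    backward = _mean_l1(G(F(night_batch)), night_batch)  # night -> clean -> night
    return forward + backward

def compose_extreme_light(image, to_night, darken):
    """I_extreme = EZ_dce(GAN_cyc(I), theta): night translation first, then darkening."""
    return darken(to_night(image))
```

The order of composition matters in this sketch: the night translation adds glare and flare first, and the darkening step then lowers the illumination of the translated image.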

2. Adverse Weather. The effect of adverse weather on automated driving and cameras has been studied broadly [7, 54, 6]. These natural phenomena can reduce the quality of the captured images, hence causing potential safety risks for AAD systems. The impact on the image quality worsens with the intensity of the phenomena, which raises questions regarding the ability of vision-based automated vehicles to cope with non-ideal environmental situations. This work considers adverse weather (fog, rain) together with a newly introduced snow model described below.

Novel Proposed Snow Model.

Due to the varied shape of snowflakes, most of the existing snow simulation methods simply use Photoshop to blend a snow layer into the clean image layer, as in the Snow Cityscapes dataset [21] and the Snow 100K dataset [56]. However, the existing models neglect the veiling effect (a haze or mist-like effect) described by Koschmieder’s theory of image degradation caused by light scattering and absorption [57] (see Fig. 4). Although the veiling effect has been considered in some synthetic snow datasets (e.g. Jstars [50], CSD [58]), these are not specific to driving scenes. Therefore, in this research, we improve the snow model with the added veiling effect to synthesise the snow dataset $I^{2^{\prime}}_{snow}$ with the following equation:

	$\displaystyle I^{2}_{snow}=z(x)\odot A(x)+I(x)\odot(1-z(x))$
	$\displaystyle I^{2^{\prime}}_{snow}=I^{2}_{snow}*e^{-\beta d(x)}+A(x)(1-e^{-\beta d(x)}),$		(1)

where $I^{2}_{snow}$ is the snowy image without the veiling effect, $\odot$ is the element-wise multiplication, $A(x)$ represents the chromatic aberration map, and $z(x)$ is the independent snow mask. $e^{-\beta d(x)}$ is the medium transmission map; $\beta$ and $d(x)$ are the scattering coefficient and the distance of the object to the camera, respectively.
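
A minimal NumPy sketch of the two steps of Eq. 1 is given below, assuming that all products are applied per pixel and that images are floating-point arrays in [0, 1]. The function and argument names are hypothetical, and how the snow mask, aberration map and depth map are obtained is outside the scope of the sketch.

```python
import numpy as np

def add_snow_with_veil(clean, snow_mask, aberration, depth_m, beta):
    """Apply the snow layer, then the depth-dependent veiling effect.

    clean      : I(x), float array in [0, 1], shape (H, W, 3)
    snow_mask  : z(x), float array in [0, 1], shape (H, W)
    aberration : A(x), float array in [0, 1], shape (H, W, 3)
    depth_m    : d(x), distance to the camera, shape (H, W)
    beta       : scattering coefficient
    """
    z = snow_mask[..., None]
    snow = z * aberration + clean * (1.0 - z)         # snowy image without veiling
    t = np.exp(-beta * depth_m)[..., None]            # transmission map exp(-beta * d)
    return snow * t + aberration * (1.0 - t)          # veiled snowy image
```

In this form, pixels close to the camera (small $d(x)$) keep the plain snow composite, while distant pixels are pulled towards $A(x)$, which produces the mist-like background.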

3. Optic Obstruction. Optic obstructions are likely to occur under certain circumstances with lens occlusions, such as droplets on the lens and mud obstructions. These occlusions will result in information loss or distortion of some portion of the frames, causing, for example, a performance drop in object detection [17].

4. ISP Failure. Raw images contain rich and unprocessed information, whereas some deep learning techniques have been created and optimised to handle ISP-processed RGB images [48]. Errors in the ISP imaging process, for example in demosaicing, compression or sharpening, can lead to inaccurate information. Moreover, using different Bayer filters can introduce variations in colour representation, subsequently impacting the outcomes of deep learning algorithms. Therefore, this research examines the impact of compression, over-sharpening, varied Bayer filters and demosaicing in detail.

5. Internal Sensor Noises. Sensor noise factors are a fundamental topic in image corruption analysis and have been extensively examined in robustness research studies [3, 11]. These noise factors have multiple origins, ranging from fluctuations in external light conditions to internal camera lens intricacies, variations in ISP processes, and changes over time.

6. Vehicle Motion. In AAD, vehicle motion and mounting vibrations can result in motion blur and unfocused blur impacting the quality of captured images, which introduces a significant challenge [59]. Mounting vibrations, arising from various sources such as road irregularities, vehicle movements, or external factors, have a profound effect on the focus stability of onboard cameras [60]. This translates to the loss of critical high-frequency information, such as the detailed boundary delineation of vehicles and pedestrians. This information is indispensable for AAD systems, enabling accurate object recognition and tracking.
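
For bookkeeping, the grouping of noise factors by cause factor ID can be written down as a small lookup table. The sketch below simply transcribes the group names of this section and the factor names of Tab. II; the dictionary name and layout are hypothetical and not part of the released benchmark.

```python
# cause factor ID -> (group name, noise factors), transcribed from Sec. III-A and Tab. II
NOISE_FACTORS = {
    1: ("Unfavourable light", ["Low light", "Night light", "Extreme light", "Strong light"]),
    2: ("Adverse weather", ["Rainy", "Foggy", "Snowy"]),
    3: ("Optic obstruction", ["Mud obstruction", "Droplets on lens"]),
    4: ("ISP failure", ["Compression", "Over-sharpening", "No Demosaic", "No Bayer filter"]),
    5: ("Internal sensor noises", ["Gaussian noise", "Uniform noise", "Impulse noise", "Poisson noise"]),
    6: ("Vehicle motion", ["Unfocus blur", "Motion blur"]),
}

def count_factors(registry):
    """Total number of noise factors over all cause factor IDs."""
    return sum(len(names) for _, names in registry.values())
```

Counting the entries with `count_factors(NOISE_FACTORS)` recovers the 19 noise factors considered in this work.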

IV Proposed Pipeline

The proposed unifying degradation impact pipeline, shown in Fig. 2, consists of three main steps: synthesis of noisy images (see Section III) using the selected dataset, panoptic segmentation, and impact evaluation. This section gives details about the dataset, the selected panoptic models and the evaluation metrics.

IV-A Dataset

The Cityscapes dataset [61] is used as the clean dataset for our synthetic data. The Cityscapes dataset comprises high-resolution daytime images sourced from 50 European cities, meticulously annotated with 19 classes at the pixel level and 30 classes at the instance level. The dataset is chosen to generate D-Cityscapes+ for the following reasons: 1) it is one of the most commonly used driving datasets; 2) it is captured with high quality during the day, with little noise that may influence the synthetic process; 3) the dataset contains labels for both instance segmentation and semantic segmentation, and existing panoptic models are trained on it. To maintain consistency and expedite comparisons, all 500 images of the Cityscapes validation set are used for evaluation. For streamlined analysis, the images were resized to a standard resolution of 1024 $\times$ 512 using bicubic interpolation, ensuring both efficiency and uniformity throughout the evaluation process.

IV-B Selected Panoptic Segmentation Models

Based on the size of the network and the panoptic quality (PQ) on the clean Cityscapes dataset, three state-of-the-art panoptic segmentation models are selected: Panoptic Deeplab [38], EfficientPS [37] and OneFormer [42] (see Fig. 5). Amongst the chosen models, EfficientPS, benefiting from the EfficientNet [62] backbone, has the lowest number of parameters (40.9M, compared with up to 372M for the largest model) with relatively good performance, i.e. PQ=62.8. Panoptic Deeplab [38] uses Atrous Spatial Pyramid Pooling (ASPP), which is one of the most commonly used architectures in segmentation tasks [11]. This gives the model multiscale feature learning ability, a larger receptive field, relatively few parameters (46.7M) and good performance (PQ=63). OneFormer [42] leverages the long-range relationship learning ability of the Swin transformer [63] and ConvNeXt [64] architectures, and achieves the best performance in terms of PQ (70.1) with the largest number of parameters (220-372M). Six different backbones were used during the experiment, three for Panoptic Deeplab (i.e. ResNet, Xception Net, and HR48 Net) and three for OneFormer (i.e. Swin-L, ConvNeXt-L, ConvNeXt-XL), to compare their robustness.

IV-C Evaluation Metrics

This research evaluates the degradation impact from three perspectives: synthetically generated noisy image quality evaluation, image evaluation using panoptic segmentation-based perception, and the correlation between them.

IV-C1 Image Quality Analysis

To analyse the image quality reduction due to the selected noise factors, 8 image quality metrics are selected with a wide coverage of the image features from both the spatial domain (i.e. local mean, local contrast, edge gradients, chrominance) and the frequency domain (i.e. Fourier- and Wavelet-based) [65]. They encompass both full-reference metrics (i.e. PSNR [66], SSIM [67], FID [68], LPIPS [69], CW-SSIM [70], FSIM [71]) and no-reference metrics (i.e. BRISQUE [72], NIQE [73]). The image signal-to-noise ratio and the structural information degradation are quantified by PSNR and SSIM, respectively. Lower PSNR and SSIM scores indicate a higher impact of the noise factors, causing loss of information. CW-SSIM and FSIM evaluate the degradation from a frequency perspective; higher values indicate less degradation in the frequency domain. FID calculates the feature distance between the ground truth images and the generated ones. In addition, the NIQE and LPIPS scores evaluate perceptual differences between the images; the smaller these values, the better the naturalness of the synthetic images. BRISQUE uses a trained support vector machine (SVM) to compute a quality score; a lower score indicates better image quality.
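
As a concrete example of a full-reference metric, PSNR can be computed from the mean squared error between the clean and the degraded image. The sketch below uses the standard definition for 8-bit images; the function name and the `max_value` default are illustrative choices, and the other metrics (e.g. LPIPS, FID) require trained networks and are not reproduced here.

```python
import numpy as np

def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    ref = np.asarray(reference, dtype=np.float64)
    deg = np.asarray(degraded, dtype=np.float64)
    mse = np.mean((ref - deg) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Because the clean Cityscapes frame is available for every degraded frame in a paired dataset, such full-reference scores can be computed per image and later related to the per-image panoptic quality.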

IV-C2 Panoptic Robustness Evaluation Metrics

To assess how panoptic models perform against the various noise models, the widely used panoptic quality (PQ) measure is adopted. PQ is a suitable indicator for this research since it contains both object-level information and fine-grained pixel-level information. Specifically, PQ is the product of segmentation quality ($SQ$) and recognition quality ($RQ$). A predicted segment and a ground truth segment are considered a match, i.e. a true positive (TP), only when their intersection over union ($IOU$) is above 0.5; $SQ$ is the average $IOU$ over the matched pairs. $RQ$ combines the numbers of true positives (TP), false positives (FP), and false negatives (FN), accounting for both precision and recall. The larger the PQ value, the better the quality. The PQ can be formulated as in Eq. 2.

$$PQ = \frac{\sum_{(s,g)\in TP} IOU(s,g)}{\left|TP\right|} \times \frac{\left|TP\right|}{\left|TP\right| + \frac{1}{2}\left|FP\right| + \frac{1}{2}\left|FN\right|} \qquad (2)$$

where $s$ represents the segmentation results, $g$ represents the ground truth, and $(s,g)\in TP$ . For a set of $N$ images, the average PQ is calculated as: $aPQ={\textstyle\sum_{i=1}^{N}}PQ_{i}/N$ .
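The PQ definition and the average over $N$ images can be sketched as follows. This is a minimal illustration that assumes the matching step has already been done, so that the IoU values of the TP pairs and the FP and FN counts are given; it treats a single set of segments, whereas evaluation toolkits usually compute PQ per class before averaging. The function names and the example numbers are hypothetical.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) given the IoUs of matched (TP) pairs and the FP/FN counts."""
    if any(iou <= 0.5 or iou > 1.0 for iou in tp_ious):
        raise ValueError("a TP pair must have an IoU in (0.5, 1.0]")
    num_tp = len(tp_ious)
    denominator = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denominator == 0:
        return 0.0, 0.0, 0.0  # no segments at all
    # Segmentation quality: mean IoU over the matched pairs.
    sq = sum(tp_ious) / num_tp if num_tp else 0.0
    # Recognition quality: |TP| / (|TP| + |FP|/2 + |FN|/2).
    rq = num_tp / denominator
    return sq * rq, sq, rq


def average_pq(pq_values):
    """aPQ: arithmetic mean of the per-image PQ values."""
    return sum(pq_values) / len(pq_values)


# Hypothetical frame: three matched segments, one spurious and one missed segment.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
print(f"SQ={sq:.3f} RQ={rq:.3f} PQ={pq:.3f}")
print(f"aPQ={average_pq([pq, 0.4]):.3f}")
```

Here $SQ$ is 0.8 and $RQ$ is 0.75, so PQ is 0.6; removing matched segments or adding spurious ones lowers $RQ$ even when the remaining masks are accurate.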

IV-C3 Correlation Metrics

The correlation between the 8 image quality indices and the panoptic quality is calculated using Pearson’s linear correlation coefficient (PLCC) and Spearman’s rank correlation coefficient (SRCC) [65]. The main difference is that PLCC is computed on the values themselves and therefore measures linear association, while SRCC is computed on the rank of each value and therefore measures monotonic association.
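The difference between the two coefficients can be illustrated with a short sketch in plain Python. The helper names and the paired values are hypothetical and chosen only for illustration; they are not measurements from this study. SRCC is obtained by applying the Pearson formula to the ranks, with ties receiving their average rank.

```python
import math


def pearson(x, y):
    """Pearson's linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical image quality scores and PQ values for four degraded conditions.
quality = [0.9, 0.8, 0.6, 0.5]
pq = [60.0, 58.0, 30.0, 29.0]

plcc = pearson(quality, pq)
srcc = spearman(quality, pq)
print(f"PLCC={plcc:.3f} SRCC={srcc:.3f}")
```

Because the toy PQ values decrease monotonically but not linearly with the quality score, SRCC equals 1 while PLCC stays slightly below 1.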

V Experimental Results and Analysis

This section gives the qualitative and quantitative experimental results and analysis in terms of the image quality of the generated noisy images, the impact of the noise factors on panoptic segmentation, and the robustness of different panoptic segmentation models under different degradation levels.

V-A Generated Noisy Images (D-Cityscapes+)

Visual quality. The visual results of the proposed D-Cityscapes+ are shown in Fig. 1, where 19 types of noise factors are considered. Fig. 3 compares the synthesized multi-weather conditions (i.e. fog, snow, and rain) with state-of-the-art robustness benchmarking research [3] using the Cityscapes dataset. Fig. 6 displays the visual results of the multi-weather conditions at different severity levels (s=1, 2, and 3, where 3 corresponds to the most severe conditions). From these visual results, it can be inferred that, compared to real-world captured datasets, the newly created dataset offers better coverage of extreme weather conditions, which are rarely seen and difficult to capture in the real world. In addition, (b) and (c) in Fig. 4 indicate that the veiling effect (i.e. mist-like background and added blur at further distances) can be observed in the real world, whereas Snow Cityscapes (see (a) in Fig. 4) does not reproduce it (i.e. the background in the distance remains clear) [21] and therefore does not match the characteristics of images captured by automotive cameras in AAD. Therefore, in this work, the snow veiling effect is modelled and added with a severity that follows the severity level of the precipitation, Fig. 6. Especially under medium to heavy snow conditions, objects at further distances show a mist-like and blurred appearance due to the clustering of snowflakes.

TABLE IV: The panoptic segmentation quantitative results using Oneformer on D-Cityscapes+ at severity levels up to $s=3$; note that $PQ_{s}$ means the average PQ value at severity $s$, and $PQ_{s}^{v}$ means the variance of PQ at severity $s$.

Metrics	Clean	Light level: Dark	Light level: Strong	Weather: Rainy	Weather: Foggy	Weather: Snowy	Lens: Mud obstructions	Lens: Rain drops	ISP: Compression
Swin-L† ( $PQ_{1}$ )	67.2	63.4	66.1	58.5	64.0	48.8	65.4	34.9	58.2
ConvNeXt-L† ( $PQ_{1}$ )	68.5	64.2	67.0	60.9	65.0	50.1	66.7	38.5	59.8
Swin-L† ( $PQ_{2}$ )	67.2	60.2	64.5	50.2	61.6	37.5	63.0	47.7	54.3
ConvNeXt-L† ( $PQ_{2}$ )	68.5	61.5	66.4	54.4	63.1	38.4	65.5	42.0	53.6
Swin-L† ( $PQ_{3}$ )	67.2	52.5	62.6	36.1	56.9	21.5	60.0	36.6	46.6
ConvNeXt-L† ( $PQ_{3}$ )	68.5	53.5	63.9	36.6	59.3	20.5	64.7	31.6	43.8
Metrics	Sensor noise: Gaussian	Sensor noise: Uniform	Sensor noise: Impulse	Sensor noise: Poisson	Blur: Unfocus	Blur: Motion	ISP: O-Sharp	ISP: N-Dem	ISP: Colour-D
Swin-L†( $PQ_{1}$ )	44.8	54.7	59.4	59.7	51.7	54.4	65.5	61.4	61.3
ConvNeXt-L† ( $PQ_{1}$ )	48.4	57.5	58.0	62.5	52.6	55.7	65.5	61.0	62.9
Swin-L†( $PQ_{2}$ )	28.2	50.2	47.3	54.1	37.5	38.6	63.1	61.4	61.3
ConvNeXt-L† ( $PQ_{2}$ )	32.4	54.0	44.6	57.0	38.6	39.4	63.6	61.0	62.9
Swin-L†( $PQ_{3}$ )	7.4	47.4	16.7	49.0	25.2	27.8	61.7	61.4	61.3
ConvNeXt-L† ( $PQ_{3}$ )	11.5	50.7	22.7	52.2	25.6	28.6	62.2	61.0	62.9

Impact of the noise models and image quality. The quantitative results of the eight image quality metrics applied to data under all the noise factors are shown in Tab. 1 from the suppl. material. As can be observed in the table, the synthetic noise models create frames whose quality degrades according to the noise severity level. Unfavourable light conditions, however, deviate from this pattern except for the FID score, with a worse image quality for low light than for night light. This discrepancy can be attributed to the fact that night light images tend to retain more local light from different light sources, whereas low-light images have an even distribution of low illumination. Another exception exists for the mud obstructions, with a slightly smaller PSNR value when s=1 compared with when s=3. This result might be due to the generation process of the mud obstructions, where a bigger kernel size indicates more severe conditions, while the number of mud obstructions is dynamically reduced to adapt to the image size. As a result, the PSNR is more sensitive to the number of mud obstructions than to their size.

As for the quality indicated by NIQE and BRISQUE, it is found that they do not show the same trend when increasing the severity levels; however, they are sensitive to noise and artefacts. This may be because NIQE compares the feature distribution of the given image with a pre-computed natural image distribution, which can differ considerably from that of randomly generated noise. BRISQUE instead uses regression-based scoring on learnt features, and this process sometimes results in a different trend from NIQE. As for the frequency-based image quality, CW-SSIM and FSIM generally align with the trend of SSIM, potentially because both the frequency and the spatial values reflect the structural information of the image. Overall, FID consistently aligns with the trend of increasing degradation severity across all factors, making it the most versatile metric to use. However, specific metrics, like BRISQUE and PSNR, can offer valuable insights and context for understanding the impact of particular noise factors on image quality, as they are sensitive to slight changes.

V-B Panoptic Segmentation Results

Impact of noise models on panoptic segmentation. The quantitative and visual results using Oneformer [42] under 19 types of noise models are illustrated in Tab. IV and Fig. 7 (see suppl. material for more details). Fig. 7 shows the same frame with noise models (at the highest severity level) and the panoptic segmentation results. With increasing severity levels, all degradation factors (except the raindrop occlusion at s=2,3) show decreasing PQ values. Amongst the noise factors, the intensity, distortion, and scale of the artefacts generated in the frames directly influence the panoptic segmentation performance. For example, Gaussian noise and raindrop occlusion are the most influential factors, while strong light, mud obstructions, and over-sharpening show the smallest impact. Overall, under most degradation factors, ConvNeXt-L and ConvNeXt-XL perform better than the Swin-L backbone, indicating increased noise robustness in these architecture designs. Specifically, ConvNeXt-L and ConvNeXt-XL show similar PQ, with ConvNeXt-L taking less processing time. ConvNeXt-XL is better at processing the compression and greyscale data, while ConvNeXt-L largely surpasses ConvNeXt-XL under Gaussian noise and impulse noise conditions.
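The impact of each noise factor can be summarised as the drop in PQ with respect to the clean data. The sketch below ranks the factors using the Swin-L $PQ_{3}$ row of Tab. IV; the values are copied from the table, while the variable and function names are illustrative assumptions. It covers only one backbone at one severity level, so it is not a substitute for the full analysis.

```python
# Swin-L results at severity s=3, copied from Tab. IV.
CLEAN_PQ = 67.2
PQ_S3 = {
    "Dark": 52.5, "Strong": 62.6, "Rainy": 36.1, "Foggy": 56.9, "Snowy": 21.5,
    "Mud obstructions": 60.0, "Rain drops": 36.6, "Compression": 46.6,
    "Gaussian": 7.4, "Uniform": 47.4, "Impulse": 16.7, "Poisson": 49.0,
    "Unfocus": 25.2, "Motion": 27.8, "O-Sharp": 61.7, "N-Dem": 61.4,
    "Colour-D": 61.3,
}


def relative_drop(clean_pq, degraded_pq):
    """Fraction of the clean PQ that is lost under a noise factor."""
    return (clean_pq - degraded_pq) / clean_pq


# Sort the noise factors from the largest to the smallest PQ drop.
ranked = sorted(PQ_S3, key=lambda name: CLEAN_PQ - PQ_S3[name], reverse=True)

for name in ranked:
    drop = CLEAN_PQ - PQ_S3[name]
    print(f"{name:<18} drop={drop:5.1f}  relative={relative_drop(CLEAN_PQ, PQ_S3[name]):.1%}")
```

For this row, Gaussian noise produces the largest drop and strong light the smallest.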

To make the comparison more meaningful, we analyse the panoptic performance within each category of noise factors. 1) Light level. As can be seen from the dark light results, a uniform darkening of the illumination causes the least degradation, while an increasing dark level and dynamic exposure (e.g. glare and flare) can reduce the panoptic segmentation performance. The strong light results show that too much illumination can also result in slightly reduced performance. 2) Adverse weather. With decreasing visibility, increasing intensity and bigger particles occurring during adverse weather conditions, the performance decreases, with PQ ordered as Snow < Rain < Fog under all severity levels. 3) Lens. The mud obstructions in the used noise models lead to better panoptic segmentation performance than droplets on the lens at the same severity levels, which might be caused by the distortion around the droplets. 4) Sensor noise. Gaussian noise is one of the most impactful sensor noise factors in terms of the drop in PQ. 5) Blur. Unfocus blur can result in slightly worse performance than motion blur at each severity level. 6) ISP. For the simulation of the ISP failures, compression shows the worst performance, while over-sharpening shows the smallest decrease in panoptic quality. Furthermore, the variance of the PQ ($vPQ$) can be seen in Tab. 3 from the suppl. material; the higher $vPQ$ values for low light or night light indicate worse stability of the deep-learning-based image simulation methods compared with the physical model-based ones. The investigation conducted in this study substantiates that, within the framework of synthetic data, Gaussian noise significantly influences panoptic segmentation quality. Yet, the challenge lies in linking this specific noise distribution to real-world automotive scenarios.

V-C Robustness of different panoptic segmentation models

In the comparative study, three state-of-the-art panoptic segmentation models and various backbones have been evaluated, Fig. 5, to gauge their ability to handle noise factors [37, 38, 42]. Figs. 8-9 visually represent the comparison among these models in terms of the panoptic quality and speed (time per frame) under three different severity levels of the noise factors. The PQ values obtained across diverse degradation factors and backbones can be seen in Tab. 3-5 from the suppl. material. Notably, the presented analysis juxtaposes the robustness of CNN-based methods against ViT-based methods, revealing intriguing patterns in architectural efficiencies.

Surpassing its counterparts, Oneformer (with pre-trained ConvNeXt-L and ConvNeXt-XL backbones) emerges as the best performer with an average PQ of 55.9, whereas Panoptic Deeplab with ResNet exhibits lower performance with an average PQ of 32.96. However, it is essential to note that higher PQ often corresponds to increased processing time and computational demand. Models with lower PQ values process frames more quickly, reflecting the trade-off between performance and computational efficiency that current mainstream methods aim to balance. Furthermore, the potential for enhanced robustness and generality of larger models becomes evident, especially with respect to internal sensor noise factors. For instance, the average PQ over the 4 internal sensor noise factors is much higher for Oneformer (i.e. $aPQ_{1}$=55.7) than for both Panoptic Deeplab (i.e. $aPQ_{1}$=14.6) and EfficientPS (i.e. $aPQ_{1}$=28.1).

Regarding backbone architectures, ConvNeXt-L and ConvNeXt-XL, CNN architectures that adopt transformer-inspired design choices, exhibit marginal superiority over conventional transformer backbones like Swin-L, and far outperform traditional CNN backbones (ResNet, Xception, HRNet, and EfficientNet), albeit with more model parameters. Notably, this upgraded convolutional network, which incorporates Swin Transformer design choices into the classic ResNet, even surpasses the pure Swin Transformer-based method, hinting at the efficacy of employing larger kernels, depthwise convolutions, inverted bottlenecks, and training techniques such as pre-training and adaptive activation functions to bolster performance and robustness in current CNN models. Additionally, EfficientPS with EfficientNet shows commendable performance in both PQ and speed, outperforming the HRNet-based Panoptic Deeplab, which underscores the potential of EfficientNet backbones for light-weight robust network architecture design. These findings point to critical directions for future network architecture design, emphasizing the need for a balance between performance, computational cost, and the strategic integration of new architectural components to strengthen robustness in automated driving applications.

TABLE V: The overall correlation (average PLCC and SRCC) between the 8 image quality indices and the PQ of EfficientPS and Oneformer.

Model	Index	PSNR↑	SSIM↑	FID↓	LPIPS↓	NIQE↓	CW-SSIM↑	FSIM↑	BRISQUE↓
EfficientPS	PLCC	0.807	0.948	-0.923	-0.948	-0.513	0.954	0.921	-0.519
EfficientPS	SRCC	0.800	0.933	-0.967	-0.867	-0.533	0.967	0.933	-0.6
Oneformer (ConvNeXt-L)	PLCC	0.813	0.961	-0.947	-0.979	-0.569	0.975	0.966	-0.661
Oneformer (ConvNeXt-L)	SRCC	0.800	0.933	-0.967	-0.933	-0.533	0.967	0.933	-0.6

V-D Correlation between the Image Quality index and Panoptic Quality index

For the analysed 19 noise factors, the average correlation indices (PLCC and SRCC) between panoptic segmentation performance (PQ) and the selected image quality metrics are reported in Tab. V. The individual correlation indices (for each type of noise) can be seen in Tab. 6-7 from the suppl. material. From the table, the panoptic models show a similar trend regarding the indices. Specifically, the most correlated positive and negative indices are CW-SSIM and LPIPS for PLCC, and CW-SSIM and FID for SRCC, respectively. These high correlation values indicate that the image quality indices can potentially be used for predicting perception degradation. For example, for snowy conditions, PQ degrades from 51.2 to 18.3 when CW-SSIM degrades from 0.738 to 0.529. The degradation factors with the worst LPIPS and CW-SSIM scores (e.g. the Gaussian noise) also show the worst panoptic quality performance. In addition, the metrics capturing structural information of the image (e.g. SSIM, CW-SSIM) show a better average correlation with PQ. This correlation might be due to structural information or features being important in the panoptic segmentation process, as the edges of the objects for each instance should be predicted correctly to obtain higher PQ scores.
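The selection of the most correlated indices can be reproduced directly from the values in Tab. V, as in the minimal sketch below. The coefficients are copied from the table; the dictionary layout and the helper name `most_correlated` are illustrative assumptions.

```python
# Average correlation coefficients copied from Tab. V.
TABLE_V = {
    ("EfficientPS", "PLCC"): {
        "PSNR": 0.807, "SSIM": 0.948, "FID": -0.923, "LPIPS": -0.948,
        "NIQE": -0.513, "CW-SSIM": 0.954, "FSIM": 0.921, "BRISQUE": -0.519,
    },
    ("EfficientPS", "SRCC"): {
        "PSNR": 0.800, "SSIM": 0.933, "FID": -0.967, "LPIPS": -0.867,
        "NIQE": -0.533, "CW-SSIM": 0.967, "FSIM": 0.933, "BRISQUE": -0.6,
    },
    ("Oneformer (ConvNeXt-L)", "PLCC"): {
        "PSNR": 0.813, "SSIM": 0.961, "FID": -0.947, "LPIPS": -0.979,
        "NIQE": -0.569, "CW-SSIM": 0.975, "FSIM": 0.966, "BRISQUE": -0.661,
    },
    ("Oneformer (ConvNeXt-L)", "SRCC"): {
        "PSNR": 0.800, "SSIM": 0.933, "FID": -0.967, "LPIPS": -0.933,
        "NIQE": -0.533, "CW-SSIM": 0.967, "FSIM": 0.933, "BRISQUE": -0.6,
    },
}


def most_correlated(row):
    """Return the metric names with the largest positive and the largest negative coefficient."""
    return max(row, key=row.get), min(row, key=row.get)


summary = {key: most_correlated(row) for key, row in TABLE_V.items()}
for (model, index), (positive, negative) in summary.items():
    print(f"{model:<24} {index}: positive={positive}, negative={negative}")
```

For both models this returns CW-SSIM and LPIPS for PLCC, and CW-SSIM and FID for SRCC.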

However, the no-reference metrics (i.e. NIQE and BRISQUE) show low correlation values (i.e. around 0.5 in magnitude) with PQ. This low correlation may be because the features learnt from the training images used for these metrics are not specific to driving scenarios. As in previous studies, PSNR shows the worst correlation scores among the full-reference metrics, meaning that PSNR cannot be used as a prediction factor for the performance decrease of deep learning-based AAD perception [65]. For example, the noise factor with the lowest PSNR values is strong light when s=2 or 3, while the PQ values for strong light show the best panoptic segmentation quality. This correlation analysis is valuable for the development of AAD because it relates standard image quality measurements to panoptic segmentation performance.

VI Conclusion

This study proposes a novel holistic evaluation framework to assess the robustness of perception in combination with the quality evaluation of sensor data, specifically camera data. The framework includes: (i) the injection of different types of noise factors with different severity levels (19 noise factors injected, each with 3 severity levels); (ii) the generation of an augmented and balanced noisy dataset, hereby named D-Cityscapes+, that might be used for further robustness studies; (iii) the assessment of the variation in perception performance due to the noisy frames and different panoptic models; (iv) the correlation between image quality and perception quality, to provide a set of guidelines and metrics that can be predictive of machine learning performance. Moreover, this work proposes two new improved noise models: (1) a snow model including reduced visibility; (2) extreme light models.

This comprehensive evaluation aims to unify diverse degradation factors impacting automotive cameras within automated driving systems. The results on the most influential factors indicate that certain noise factors (i.e. Gaussian noise and droplets on the lens) should be given priority among the different corner cases, since they pose the greatest hazard to perception in AAD. The better overall robustness of the ViT-based backbone architectures provides critical insights for future architectural selections in the presence of noisy data. Through the proposed evaluation encompassing image and panoptic quality metrics, this work offers a detailed understanding of noise factors, supporting stakeholders and designers of driving applications. The findings presented here aim to further explore the robustness of perception in general, and of panoptic segmentation specifically, for automated driving. Our benchmarking framework serves as a catalyst for advancing research endeavours, fostering the realization of a higher level of driving automation.

References

[1] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9404–9413.
[2] D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” arXiv preprint arXiv:1903.12261, 2019.
[3] Y. Dong, C. Kang, J. Zhang, Z. Zhu, Y. Wang, X. Yang, H. Su, X. Wei, and J. Zhu, “Benchmarking robustness of 3d object detection to common corruptions in autonomous driving,” arXiv preprint arXiv:2303.11040, 2023.
[4] C. Kamann and C. Rother, “Benchmarking the robustness of semantic segmentation models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8828–8838.
[5] Y. Wang, P. H. Chan, and V. Donzella, “Semantic-aware video compression for automotive cameras,” IEEE Transactions on Intelligent Vehicles, 2023.
[6] T. Brophy, D. Mullins, A. Parsi, J. Horgan, E. Ward, P. Denny, C. Eising, B. Deegan, M. Glavin, and E. Jones, “A review of the impact of rain on camera-based perception in automated driving systems,” IEEE Access, 2023.
[7] S. Zang, M. Ding, D. Smith, P. Tyler, T. Rakotoarivelo, and M. A. Kaafar, “The impact of adverse weather conditions on autonomous vehicles: How rain, snow, fog, and hail affect the performance of a self-driving car,” IEEE Vehicular Technology Magazine, vol. 14, no. 2, pp. 103–111, 2019.
[8] A. Ceccarelli and F. Secci, “Rgb cameras failures and their effects in autonomous driving applications,” IEEE Transactions on Dependable and Secure Computing, 2022.
[9] K. N. R. Chebrolu and P. Kumar, “Deep learning based pedestrian detection at all light conditions,” in 2019 International Conference on Communication and Signal Processing (ICCSP). IEEE, 2019, pp. 0838–0842.
[10] Y. Dong, Q.-A. Fu, X. Yang, T. Pang, H. Su, Z. Xiao, and J. Zhu, “Benchmarking adversarial robustness on image classification,” in proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 321–331.
[11] C. Kamann and C. Rother, “Benchmarking the robustness of semantic segmentation models with respect to common corruptions,” International journal of computer vision, vol. 129, pp. 462–483, 2021.
[12] F. Ding, K. Yu, Z. Gu, X. Li, and Y. Shi, “Perceptual enhancement for autonomous vehicles: restoring visually degraded images for context prediction via adversarial training,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 9430–9441, 2021.
[13] Y. Wang, H. Zhao, K. Debattista, and V. Donzella, “The effect of camera data degradation factors on panoptic segmentation for automated driving,” in 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2023, pp. 2351–2356.
[14] O. Zendel, M. Schörghuber, B. Rainer, M. Murschitz, and C. Beleznai, “Unifying panoptic segmentation for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 351–21 360.
[15] K. Wang, T. Zhou, X. Li, and F. Ren, “Performance and challenges of 3d object detection methods in complex scenes for autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1699–1716, 2022.
[16] K. Xian, Z. Cao, C. Shen, and G. Lin, “Towards robust monocular depth estimation: A new baseline and benchmark,” International Journal of Computer Vision, pp. 1–19, 2024.
[17] B. Li, P. H. Chan, G. Baris, M. D. Higgins, and V. Donzella, “Analysis of automotive camera sensor noise factors and impact on object detection,” IEEE Sensors Journal, vol. 22, no. 22, pp. 22 210–22 219, 2022.
[18] A. B. Jung, K. Wada, J. Crall, S. Tanaka, J. Graving, C. Reinders, S. Yadav, J. Banerjee, G. Vecsei, A. Kraft, Z. Rui, J. Borovec, C. Vallentin, S. Zhydenko, K. Pfeiffer, B. Cook, I. Fernández, F.-M. De Rainville, C.-H. Weng, A. Ayala-Acevedo, R. Meudec, M. Laporte et al., “imgaug,” https://github.com/aleju/imgaug, 2020, online; accessed 01-Feb-2020.
[19] M. Tremblay, S. S. Halder, R. De Charette, and J.-F. Lalonde, “Rain rendering for evaluating and improving robustness to bad weather,” International Journal of Computer Vision, vol. 129, pp. 341–360, 2021.
[20] C. Sakaridis, D. Dai, S. Hecker, and L. Van Gool, “Model adaptation with synthetic and real data for semantic dense foggy scene understanding,” in Proceedings of the european conference on computer vision (ECCV), 2018, pp. 687–704.
[21] K. Zhang, R. Li, Y. Yu, W. Luo, and C. Li, “Deep dense multi-scale network for snow removal using semantic and depth priors,” IEEE Transactions on Image Processing, vol. 30, pp. 7419–7431, 2021.
[22] M. Bijelic, T. Gruber, and W. Ritter, “Benchmarking image sensors under adverse weather conditions for autonomous driving,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1773–1779.
[23] O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. F. Dominguez, “Wilddash-creating hazard-aware benchmarks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 402–416.
[24] C. Sakaridis, D. Dai, and L. Van Gool, “Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 765–10 775.
[25] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2636–2645.
[26] V. Mușat, I. Fursa, P. Newman, F. Cuzzolin, and A. Bradley, “Multi-weather city: Adverse weather stacking for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2906–2915.
[27] X. Hu, C.-W. Fu, L. Zhu, and P.-A. Heng, “Depth-attentional features for single-image rain removal,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2019, pp. 8022–8031.
[28] K. Garg and S. K. Nayar, “Vision and rain,” International Journal of Computer Vision, vol. 75, pp. 3–27, 2007.
[29] C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy scene understanding with synthetic data,” International Journal of Computer Vision, vol. 126, pp. 973–992, 2018.
[30] C. Sakaridis, D. Dai, and L. V. Gool, “Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7374–7383.
[31] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning. PMLR, 2017, pp. 1–16.
[32] T. Sun, M. Segu, J. Postels, Y. Wang, L. Van Gool, B. Schiele, F. Tombari, and F. Yu, “Shift: a synthetic driving dataset for continuous multi-task domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 371–21 382.
[33] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object tracking analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4340–4349.
[34] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
[35] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6399–6408.
[36] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun, “Upsnet: A unified panoptic segmentation network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8818–8826.
[37] R. Mohan and A. Valada, “Efficientps: Efficient panoptic segmentation,” International Journal of Computer Vision, vol. 129, no. 5, pp. 1551–1579, 2021.
[38] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen, “Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 475–12 485.
[39] T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, and L.-C. Chen, “Deeperlab: Single-shot image parser,” arXiv preprint arXiv:1902.05093, 2019.
[40] Y. Li, H. Zhao, X. Qi, L. Wang, Z. Li, J. Sun, and J. Jia, “Fully convolutional networks for panoptic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 214–223.
[41] Z. Li, W. Wang, E. Xie, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, and T. Lu, “Panoptic segformer: Delving deeper into panoptic segmentation with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1280–1289.
[42] J. Jain, J. Li, M. T. Chiu, A. Hassani, N. Orlov, and H. Shi, “Oneformer: One transformer to rule universal image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2989–2998.
[43] F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, and H.-Y. Shum, “Mask dino: Towards a unified transformer-based framework for object detection and segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3041–3050.
[44] B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” Advances in Neural Information Processing Systems, vol. 34, pp. 17 864–17 875, 2021.
[45] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299.
[46] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “Max-deeplab: End-to-end panoptic segmentation with mask transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5463–5474.
[47] C. Michaelis, B. Mitzkus, R. Geirhos, E. Rusak, O. Bringmann, A. S. Ecker, M. Bethge, and W. Brendel, “Benchmarking robustness in object detection: Autonomous driving when winter is coming,” arXiv preprint arXiv:1907.07484, 2019.
[48] P. H. Chan, C. Wei, A. Huggett, and V. Donzella, “Raw camera data object detectors: an optimisation for automotive processing and transmission,” Authorea Preprints, 2023.
[49] S. Zhou, C. Li, and C. C. Loy, “Lednet: Joint low-light enhancement and deblurring in the dark,” in ECCV, 2022.
[50] W.-T. Chen, H.-Y. Fang, J.-J. Ding, C.-C. Tsai, and S.-Y. Kuo, “Jstasr: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer, 2020, pp. 754–770.
[51] X. Tan, K. Xu, Y. Cao, Y. Zhang, L. Ma, and R. W. Lau, “Night-time scene parsing with a large real dataset,” IEEE Transactions on Image Processing, vol. 30, pp. 9085–9098, 2021.
[52] C.-T. Lin, S.-W. Huang, Y.-Y. Wu, and S.-H. Lai, “Gan-based day-to-night image style transfer for nighttime vehicle detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 2, pp. 951–963, 2020.
[53] F. Lv, F. Lu, J. Wu, and C. Lim, “Mbllen: Low-light image/video enhancement using cnns.” in BMVC, vol. 220, no. 1, 2018, p. 4.
[54] C. Michaelis, B. Mitzkus, R. Geirhos, E. Rusak, O. Bringmann, A. S. Ecker, M. Bethge, and W. Brendel, “Benchmarking robustness in object detection: Autonomous driving when winter is coming,” arXiv preprint arXiv:1907.07484, 2019.
[55] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631.
[56] Y.-F. Liu, D.-W. Jaw, S.-C. Huang, and J.-N. Hwang, “Desnownet: Context-aware deep network for snow removal,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 3064–3073, 2018.
[57] H. Koschmieder, “Theorie der horizontalen sichtweite,” Beitrage zur Physik der freien Atmosphare, pp. 33–53, 1924.
[58] W.-T. Chen, H.-Y. Fang, C.-L. Hsieh, C.-C. Tsai, I. Chen, J.-J. Ding, S.-Y. Kuo et al., “All snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4196–4205.
[59] J. Li, S. Dong, Z. Yu, Y. Tian, and T. Huang, “Event-based vision enhanced: A joint detection framework in autonomous driving,” in 2019 ieee international conference on multimedia and expo (icme). IEEE, 2019, pp. 1396–1401.
[60] R. Wang, C. Zhang, X. Zheng, Y. Lv, and Y. Zhao, “Joint defocus deblurring and superresolution learning network for autonomous driving,” IEEE Intelligent Transportation Systems Magazine, 2023.
[61] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
[62] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 6105–6114.
[63] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
[64] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 976–11 986.
[65] D. Gummadi, P. H. Chan, H. Wang, and V. Donzella, “Correlating traditional image quality metrics and dnn-based object detection: a case study with compressed camera data,” Authorea Preprints, 2023.
[66] “Peak signal-to-noise ratio,” Wikipedia, Mar. 2023. [Online]. Available: https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio
[67] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[68] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” 2018.
[69] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
[70] M. P. Sampat, Z. Wang, S. Gupta, A. C. Bovik, and M. K. Markey, “Complex wavelet structural similarity: A new image similarity index,” IEEE Transactions on Image Processing, vol. 18, no. 11, pp. 2385–2401, 2009.
[71] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “Fsim: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011.
[72] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
[73] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2013.