-
On the Rate-Distortion-Complexity Trade-offs of Neural Video Coding
Authors:
Yi-Hsin Chen,
Kuan-Wei Ho,
Martin Benjak,
Jörn Ostermann,
Wen-Hsiao Peng
Abstract:
This paper aims to delve into the rate-distortion-complexity trade-offs of modern neural video coding. Recent years have witnessed much research effort being focused on exploring the full potential of neural video coding. Conditional autoencoders have emerged as the mainstream approach to efficient neural video coding. The central theme of conditional autoencoders is to leverage both spatial and temporal information for better conditional coding. However, a recent study indicates that conditional coding may suffer from information bottlenecks, potentially performing worse than traditional residual coding. To address this issue, recent conditional coding methods incorporate a large number of high-resolution features as the condition signal, leading to a considerable increase in the number of multiply-accumulate operations, memory footprint, and model size. Taking DCVC as the common code base, we investigate how the newly proposed conditional residual coding, an emerging new school of thought, and its variants may strike a better balance among rate, distortion, and complexity.
Submitted 4 October, 2024;
originally announced October 2024.
-
Quantized Inverse Design for Photonic Integrated Circuits
Authors:
Frederik Schubert,
Konrad Bethmann,
Yannik Mahlau,
Fabian Hartmann,
Reinhard Caspary,
Marco Munderloh,
Jörn Ostermann,
Bodo Rosenhahn
Abstract:
The inverse design of photonic integrated circuits (PICs) presents distinctive computational challenges, including their large memory requirements. Advancements in the two-photon polymerization (2PP) fabrication process introduce additional complexity, necessitating the development of more flexible optimization algorithms to enable the creation of multi-material 3D structures with unique properties. This paper presents an efficient reverse-mode automatic differentiation framework for finite-difference time-domain (FDTD) simulations that can handle several constraints arising from novel fabrication methods. Our method is based on straight-through gradient estimation, which enables non-differentiable shape parametrizations. We demonstrate the effectiveness of our approach by creating increasingly complex structures to solve the coupling problem in integrated photonic circuits. The results highlight the potential of our method for future PIC design and practical applications.
Submitted 14 July, 2024;
originally announced July 2024.
-
MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression
Authors:
Yi-Hsin Chen,
Hong-Sheng Xie,
Cheng-Wei Chen,
Zong-Lin Gao,
Martin Benjak,
Wen-Hsiao Peng,
Jörn Ostermann
Abstract:
Conditional coding has lately emerged as the mainstream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when the information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than that of the intra frame. Recognizing that this assumption is not always true due to dis-occlusion phenomena or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel adaptive manner. We introduce a Transformer-based conditional autoencoder. Several strategies are investigated with regard to how to condition a Transformer-based autoencoder for inter-frame coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, with the aim of using the simple hyperprior to approach similar compression performance to the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.
Submitted 10 July, 2024; v1 submitted 25 December, 2023;
originally announced December 2023.
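One plausible way to read the soft-mask hybrid described in the MaskCRT abstract is sketched below. The symbols are assumptions introduced here for illustration and are not taken from the listing: $x_t$ is the current frame, $x_c$ is the motion-compensated prediction, $m_t$ is the learned per-pixel soft mask, and $\odot$ is element-wise multiplication.

```latex
% Assumed notation (not from the abstract): x_t, x_c, m_t as described above.
\begin{aligned}
  r_t &= x_t - m_t \odot x_c, \qquad m_t \in [0,1]^{H \times W},\\
  m_t = \mathbf{1} &\;\Rightarrow\; r_t = x_t - x_c \quad \text{(conditional residual coding)},\\
  m_t = \mathbf{0} &\;\Rightarrow\; r_t = x_t \quad \text{(conditional coding)}.
\end{aligned}
```

Under this reading, $r_t$ is what the conditional autoencoder codes given $x_c$, and intermediate mask values let each pixel sit between the two schemes, for example in dis-occluded regions where the prediction is unreliable.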
-
SegForestNet: Spatial-Partitioning-Based Aerial Image Segmentation
Authors:
Daniel Gritzner,
Jörn Ostermann
Abstract:
Aerial image segmentation is the basis for applications such as automatically creating maps or tracking deforestation. In true orthophotos, which are often used in these applications, many objects and regions can be approximated well by polygons. However, this fact is rarely exploited by state-of-the-art semantic segmentation models. Instead, most models allow unnecessary degrees of freedom in their predictions by allowing arbitrary region shapes. We therefore present a refinement of our deep learning model which predicts binary space partitioning (BSP) trees, an efficient polygon representation. The refinements include a new feature decoder architecture and a new differentiable BSP tree renderer, both of which avoid vanishing gradients. Additionally, we introduce a novel loss function designed specifically to improve the spatial partitioning defined by the predicted trees. Furthermore, our expanded model can predict multiple trees at once and thus can predict class-specific segmentations. As an additional contribution, we investigate the impact of a non-optimal training process in comparison to an optimized training process. While model architectures optimized for aerial images, such as PFNet or our own model, show an advantage under non-optimal conditions, this advantage disappears under optimal training conditions. Despite this observation, our model still makes better predictions for small rectangular objects, e.g., cars.
Submitted 8 April, 2024; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Two-Stream Aural-Visual Affect Analysis in the Wild
Authors:
Felix Kuhnke,
Lars Rumberg,
Jörn Ostermann
Abstract:
Human affect recognition is an essential part of natural human-computer interaction. However, current methods are still in their infancy, especially for in-the-wild data. In this work, we introduce our submission to the Affective Behavior Analysis in-the-wild (ABAW) 2020 competition. We propose a two-stream aural-visual analysis model to recognize affective behavior from videos. Audio and image streams are first processed separately and fed into a convolutional neural network. Instead of applying recurrent architectures for temporal analysis we only use temporal convolutions. Furthermore, the model is given access to additional features extracted during face-alignment. At training time, we exploit correlations between different emotion representations to improve performance. Our model achieves promising results on the challenging Aff-Wild2 database.
Submitted 3 March, 2020; v1 submitted 9 February, 2020;
originally announced February 2020.
-
HEVC Inter Coding Using Deep Recurrent Neural Networks and Artificial Reference Pictures
Authors:
Felix Haub,
Thorsten Laude,
Jörn Ostermann
Abstract:
The efficiency of motion compensated prediction in modern video codecs highly depends on the available reference pictures. Occlusions and non-linear motion pose challenges for the motion compensation and often result in high bit rates for the prediction error. We propose the generation of artificial reference pictures using deep recurrent neural networks. Conceptually, a reference picture at the time instance of the currently coded picture is generated from previously reconstructed conventional reference pictures. Based on these artificial reference pictures, we propose a complete coding pipeline based on HEVC. By using the artificial reference pictures for motion compensated prediction, average BD-rate gains of 1.5% over HEVC are achieved.
Submitted 5 December, 2018;
originally announced December 2018.
-
Neural Network Compression using Transform Coding and Clustering
Authors:
Thorsten Laude,
Yannick Richter,
Jörn Ostermann
Abstract:
With the deployment of neural networks on mobile devices and the necessity of transmitting neural networks over limited or expensive channels, the file size of the trained model has been identified as a bottleneck. In this paper, we propose a codec for the compression of neural networks which is based on transform coding for convolutional and dense layers and on clustering for biases and normalizations. By using this codec, we achieve average compression factors between 7.9 and 9.3, while the accuracy of the compressed networks for image classification decreases by only 1%-2%, respectively.
Submitted 18 May, 2018;
originally announced May 2018.
-
Unsupervised Features for Facial Expression Intensity Estimation over Time
Authors:
Maren Awiszus,
Stella Graßhof,
Felix Kuhnke,
Jörn Ostermann
Abstract:
The diversity of facial shapes and motions among persons is one of the greatest challenges for automatic analysis of facial expressions. In this paper, we propose a feature describing expression intensity over time, while being invariant to person and the type of performed expression. Our feature is a weighted combination of the dynamics of multiple points adapted to the overall expression trajectory. We evaluate our method on several tasks all related to temporal analysis of facial expression. The proposed feature is compared to a state-of-the-art method for expression intensity estimation, which it outperforms. We use our proposed feature to temporally align multiple sequences of recorded 3D facial expressions. Furthermore, we show how our feature can be used to reveal person-specific differences in performances of facial expressions. Additionally, we apply our feature to identify the local changes in face video sequences based on action unit labels. For all the experiments our feature proves to be robust against noise and outliers, making it applicable to a variety of applications for analysis of facial movements.
Submitted 3 May, 2018; v1 submitted 2 May, 2018;
originally announced May 2018.
-
Region of Interest (ROI) Coding for Aerial Surveillance Video using AVC & HEVC
Authors:
Holger Meuel,
Florian Kluger,
Jörn Ostermann
Abstract:
Aerial surveillance from Unmanned Aerial Vehicles (UAVs), i.e., with moving cameras, is of growing interest for police work as well as disaster area monitoring. Camera resolutions are steadily increasing to provide more detailed ground images, and the amount of video data to transmit is increasing significantly as a result. To reduce the amount of data, Region of Interest (ROI) coding systems were introduced, which encode some regions in higher quality at the cost of the remaining image regions. We employ an existing ROI coding system relying on global motion compensation to retain full image resolution over the entire image. Different ROI detectors are used on board the UAV to automatically classify each video image into ROI and non-ROI regions. We propose to replace the modified Advanced Video Coding (AVC) video encoder with a modified High Efficiency Video Coding (HEVC) encoder. Without any change to the detection system itself, but by replacing the video coding back-end, we are able to improve the coding efficiency by 32% on average, although regular HEVC provides coding gains of only 12-30% over regular AVC coding for the same test sequences at similar PSNR. Since the employed ROI coding mainly relies on intra mode coding of newly emerging image areas, the gains of HEVC-ROI coding over AVC-ROI coding, compared to regular coding of the entire frames including predictive (inter) modes, depend on sequence characteristics. We present a detailed analysis of the bit distribution within the frames to explain the gains. In total, we can provide coding data rates of 0.7-1.0 Mbit/s for full HDTV video sequences at 30 fps at a reasonable quality of more than 37 dB.
Submitted 19 January, 2018;
originally announced January 2018.
-
Evolutionary optimization of an experimental apparatus
Authors:
I. Geisel,
K. Cordes,
J. Mahnke,
S. Jöllenbeck,
J. Ostermann,
J. Arlt,
W. Ertmer,
C. Klempt
Abstract:
In recent decades, cold atom experiments have become increasingly complex. While computers control most parameters, optimization is mostly done manually. This is a time-consuming task for a high-dimensional parameter space with unknown correlations. Here we automate this process using a genetic algorithm based on Differential Evolution. We demonstrate that this algorithm optimizes 21 correlated parameters and that it is robust against local maxima and experimental noise. The algorithm is flexible and easy to implement. Thus, the presented scheme can be applied to a wide range of experimental optimization tasks.
Submitted 4 June, 2013; v1 submitted 17 May, 2013;
originally announced May 2013.
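The abstract above names Differential Evolution but gives no details, so the following is only a minimal sketch of the textbook DE/rand/1/bin variant, not the authors' implementation. The function name, the default settings (`pop_size`, `f`, `cr`, `generations`), and the toy objective are all assumptions chosen for illustration; in an experiment, `objective` would stand in for a measured figure of merit.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.7, cr=0.9,
                           generations=200, seed=0):
    """Maximize `objective` inside the box `bounds` using DE/rand/1/bin.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial population inside the parameter bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]

    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # ensures one mutated coordinate
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    # Mutation: base vector plus scaled difference vector.
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to bounds
                else:
                    value = pop[i][d]  # inherit from the parent
                trial.append(value)
            # Greedy selection: keep the trial if it is at least as good.
            trial_score = objective(trial)
            if trial_score >= scores[i]:
                pop[i], scores[i] = trial, trial_score

    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_objective(x):
    # Hypothetical smooth objective with its maximum at (1, 1, 1).
    return -sum((xi - 1.0) ** 2 for xi in x)


if __name__ == "__main__":
    best_x, best_score = differential_evolution(toy_objective, [(-5.0, 5.0)] * 3)
    print(best_x, best_score)
```

Because selection only compares trial and parent scores, the scheme needs no gradients, which is part of why this family of methods suits noisy, correlated experimental parameters.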