Graphics (cs.GR)

  • PDF
    We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more details: https://haian-jin.github.io/projects/LVSM/ .
  • PDF
    Vision science imposes rigorous requirements for the design and execution of psychophysical studies and experiments. These requirements ensure precise control over variables, accurate measurement of perceptual responses, and reproducibility of results, which are essential for investigating visual perception and its underlying mechanisms. Since different experiments have different requirements, not all aspects of a display system are critical for a given setting. Therefore, some display systems may be suitable for certain types of experiments but unsuitable for others. An additional challenge is that the performance of consumer systems is often highly dependent on specific monitor settings and firmware behavior. Here, we evaluate the performance of four display systems: a consumer LCD gaming monitor, a consumer OLED gaming monitor, a consumer OLED TV, and a VPixx PROPixx projector system. To allow the reader to assess the suitability of these systems for different experiments, we present a range of different metrics: luminance behavior, luminance uniformity across display surface, estimated gamma values and linearity, channel additivity, channel dependency, color gamut, pixel response time, and pixel waveform. In addition, we exhaustively report the monitor firmware settings used. Our analyses show that current consumer-level OLED display systems are promising, and adequate to fulfill the requirements of some critical vision science experiments, allowing laboratories to run their experiments even without investing in high-quality professional display systems. For example, the tested Asus OLED gaming monitor shows excellent response time, a sharp square waveform even at 240 Hz, a color gamut that covers 94% of DCI-P3 color space, and the best luminance uniformity among all four tested systems, making it a favorable option on price-to-performance ratio.
  • PDF
    CG (Computer Graphics) is a popular field of CS (Computer Science), but many students find this topic difficult due to it requiring a large number of skills, such as mathematics, programming, geometric reasoning, and creativity. Over the past few years, researchers have investigated ways to harness the power of GenAI (Generative Artificial Intelligence) to improve teaching. In CS, much of the research has focused on introductory computing. A recent study evaluating the performance of an LLM (Large Language Model), GPT-4 (text-only), on CG questions, indicated poor performance and reliance on detailed descriptions of image content, which often required considerable insight from the user to return reasonable results. So far, no studies have investigated the abilities of LMMs (Large Multimodal Models), or multimodal LLMs, to solve CG questions and how these abilities can be used to improve teaching. In this study, we construct two datasets of CG questions requiring varying degrees of visual perception skills and geometric reasoning skills, and evaluate the current state-of-the-art LMM, GPT-4o, on the two datasets. We find that although GPT-4o exhibits great potential in solving questions with visual information independently, major limitations still exist to the accuracy and quality of the generated results. We propose several novel approaches for CG educators to incorporate GenAI into CG teaching despite these limitations. We hope that our guidelines further encourage learning and engagement in CG classrooms.
  • PDF
    In medical image visualization, path tracing of volumetric medical data like CT scans produces lifelike three-dimensional visualizations. Immersive VR displays can further enhance the understanding of complex anatomies. Going beyond the diagnostic quality of traditional 2D slices, they enable interactive 3D evaluation of anatomies, supporting medical education and planning. Rendering high-quality visualizations in real-time, however, is computationally intensive and impractical for compute-constrained devices like mobile headsets. We propose a novel approach utilizing GS to create an efficient but static intermediate representation of CT scans. We introduce a layered GS representation, incrementally including different anatomical structures while minimizing overlap and extending the GS training to remove inactive Gaussians. We further compress the created model with clustering across layers. Our approach achieves interactive frame rates while preserving anatomical structures, with quality adjustable to the target hardware. Compared to standard GS, our representation retains some of the explorative qualities initially enabled by immersive path tracing. Selective activation and clipping of layers are possible at rendering time, adding a degree of interactivity to otherwise static GS models. This could enable scenarios where high computational demands would otherwise prohibit using path-traced medical volumes.
  • PDF
    We present a complete characterization of polycubes of any genus based on their dual structure: a collection of oriented loops which run in each of the axis directions and capture polycubes via their intersection patterns. A polycube loop structure uniquely corresponds to a polycube. We also describe all combinatorially different ways to add a loop to a loop structure while maintaining its validity; similarly, we show how to identify loops that can be removed from a polycube loop structure without invalidating it. Our characterization gives rise to an iterative algorithm to construct provably valid polycube-maps for a given input surface; polycube-maps are frequently used to solve texture mapping, spline fitting, and hexahedral meshing. We showcase some results of a proof-of-concept implementation of this iterative algorithm.
  • PDF
    While multi-axis 3D printing can align continuous fibers along principal stresses in continuous fiber-reinforced thermoplastic (CFRTP) composites to enhance mechanical strength, existing methods have difficulty generating toolpaths with high fiber coverage. This is mainly due to the orientation consistency constraints imposed by vector-field-based methods and the turbulent stress fields around stress concentration regions. This paper addresses these challenges by introducing a 2-RoSy representation for computing the direction field, which is then converted into a periodic scalar field to generate partial iso-curves for fiber toolpaths with nearly equal hatching distance. To improve fiber coverage in stress-concentrated regions, such as around holes, we extend the quaternion-based method for curved slicing by incorporating winding compatibility considerations. Our proposed method can achieve toolpaths coverage between 87.5% and 90.6% by continuous fibers with 1.1mm width. Models fabricated using our toolpaths show up to 84.6% improvement in failure load and 54.4% increase in stiffness when compared to the results obtained from multi-axis 3D printing with sparser fibers.
  • PDF
    Density-equalizing map is a shape deformation technique originally developed for cartogram creation and sociological data visualization on planar geographical maps. In recent years, there has been an increasing interest in developing density-equalizing mapping methods for surface and volumetric domains and applying them to various problems in geometry processing and imaging science. However, the existing surface density-equalizing mapping methods are only applicable to surfaces with relatively simple topologies but not surfaces with topological holes. In this work, we develop a novel algorithm for computing density-equalizing maps for toroidal surfaces. In particular, different shape deformation effects can be easily achieved by prescribing different population functions on the torus and performing diffusion-based deformations on a planar domain with periodic boundary conditions. Furthermore, the proposed toroidal density-equalizing mapping method naturally leads to an effective method for computing toroidal parameterizations of genus-one surfaces with controllable shape changes, with the toroidal area-preserving parameterization being a prime example. Experimental results are presented to demonstrate the effectiveness of our proposed methods.
  • PDF
    We introduce Joker, a new method for the conditional synthesis of 3D human heads with extreme expressions. Given a single reference image of a person, we synthesize a volumetric human head with the reference identity and a new expression. We offer control over the expression via a 3D morphable model (3DMM) and textual inputs. This multi-modal conditioning signal is essential since 3DMMs alone fail to define subtle emotional changes and extreme expressions, including those involving the mouth cavity and tongue articulation. Our method is built upon a 2D diffusion-based prior that generalizes well to out-of-domain samples, such as sculptures, heavy makeup, and paintings while achieving high levels of expressiveness. To improve view consistency, we propose a new 3D distillation technique that converts predictions of our 2D prior into a neural radiance field (NeRF). Both the 2D prior and our distillation technique produce state-of-the-art results, which are confirmed by our extensive evaluations. Also, to the best of our knowledge, our method is the first to achieve view-consistent extreme tongue articulation.
  • PDF
    We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Different from prior works that rely on marker-based tracking and multiview cameras, ATS learns natural behaviors of animal and human agents non-invasively through video observations recorded over a long time-span (e.g., a month) in a single environment. Modeling 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired data of perception and motion of an agent queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator. We demonstrate results on pets (e.g., cat, dog, bunny) and human given monocular RGBD videos captured by a smartphone.
  • PDF
    The advancement of concrete 3D printing (C3DP) technology has revolutionized the construction industry, offering unique opportunities for innovation and efficiency. At the heart of this process lies a comprehensive digital chain that integrates various stages, from initial design to post-processing. This article provides an overview of this digital chain, explaining each crucial step. The chain begins with design, utilizing Design for Additive Manufacturing (DFAM) concept and parametric modeling to create optimized structures. Path generation follows, determining the precise toolpath for extruding concrete layers. Simulations, both numerical and analytical, ensure the design's integrity and feasibility. Several articles have addressed parametric modeling, process and numerical simulation, and the post-processing phase. However, none has proposed an updated methodology for the workflow. This study aims to propose a robust digital chain for C3DP technology, using one platform (3Dexperience) and seamless data transfer between applications. These steps provide insights into the structural performance of printed components, enabling necessary adjustments and optimizations. In essence, the digital chain coordinates a seamless workflow that transforms digital designs into concrete structures, unlocking the full potential of C3DP and paving the way for innovative and efficient construction.
  • PDF
    In this paper, we present TexPro, a novel method for high-fidelity material generation for input 3D meshes given text prompts. Unlike existing text-conditioned texture generation methods that typically generate RGB textures with baked lighting, TexPro is able to produce diverse texture maps via procedural material modeling, which enables physical-based rendering, relighting, and additional benefits inherent to procedural materials. Specifically, we first generate multi-view reference images given the input textual prompt by employing the latest text-to-image model. We then derive texture maps through a rendering-based optimization with recent differentiable procedural materials. To this end, we design several techniques to handle the misalignment between the generated multi-view images and 3D meshes, and introduce a novel material agent that enhances material classification and matching by exploring both part-level understanding and object-aware material reasoning. Experiments demonstrate the superiority of the proposed method over existing SOTAs and its capability of relighting.
  • PDF
    Creating and understanding art has long been a hallmark of human ability. When presented with finished digital artwork, professional graphic artists can intuitively deconstruct and replicate it using various drawing tools, such as the line tool, paint bucket, and layer features, including opacity and blending modes. While most recent research in this field has focused on art generation, proposing a range of methods, these often rely on the concept of artwork being represented as a final image. To bridge the gap between pixel-level results and the actual drawing process, we present an approach that treats a set of drawing tools as executable programs. This method predicts a sequence of steps to achieve the final image, allowing for understandable and resolution-independent reproductions under the usage of a set of drawing commands. Our experiments demonstrate that our program synthesizer, Art2Prog, can comprehensively understand complex input images and reproduce them using high-quality executable programs. The experimental results evidence the potential of machines to grasp higher-level information from images and generate compact program-level descriptions.
  • PDF
    3D Gaussian Splatting has shown fast and high-quality rendering results in static scenes by leveraging dense 3D prior and explicit representations. Unfortunately, the benefits of the prior and representation do not involve novel view synthesis for dynamic motions. Ironically, this is because the main barrier is the reliance on them, which requires increasing training and rendering times to account for dynamic motions. In this paper, we design a Explicit 4D Gaussian Splatting(Ex4DGS). Our key idea is to firstly separate static and dynamic Gaussians during training, and to explicitly sample positions and rotations of the dynamic Gaussians at sparse timestamps. The sampled positions and rotations are then interpolated to represent both spatially and temporally continuous motions of objects in dynamic scenes as well as reducing computational cost. Additionally, we introduce a progressive training scheme and a point-backtracking technique that improves Ex4DGS's convergence. We initially train Ex4DGS using short timestamps and progressively extend timestamps, which makes it work well with a few point clouds. The point-backtracking is used to quantify the cumulative error of each Gaussian over time, enabling the detection and removal of erroneous Gaussians in dynamic scenes. Comprehensive experiments on various scenes demonstrate the state-of-the-art rendering quality from our method, achieving fast rendering of 62 fps on a single 2080Ti GPU.
  • PDF
    This book offers an in-depth exploration of object detection and semantic segmentation, combining theoretical foundations with practical applications. It covers state-of-the-art advancements in machine learning and deep learning, with a focus on convolutional neural networks (CNNs), YOLO architectures, and transformer-based approaches like DETR. The book also delves into the integration of artificial intelligence (AI) techniques and large language models for enhanced object detection in complex environments. A thorough discussion of big data analysis is presented, highlighting the importance of data processing, model optimization, and performance evaluation metrics. By bridging the gap between traditional methods and modern deep learning frameworks, this book serves as a comprehensive guide for researchers, data scientists, and engineers aiming to leverage AI-driven methodologies in large-scale object detection tasks.
  • PDF
    We propose a zero-shot text-driven 3D shape deformation system that deforms an input 3D mesh of a manufactured object to fit an input text description. To do this, our system optimizes the parameters of a deformation model to maximize an objective function based on the widely used pre-trained vision language model CLIP. We find that CLIP-based objective functions exhibit many spurious local optima; to circumvent them, we parameterize deformations using a novel deformation model called BoxDefGraph which our system automatically computes from an input mesh, the BoxDefGraph is designed to capture the object aligned rectangular/circular geometry features of most manufactured objects. We then use the CMA-ES global optimization algorithm to maximize our objective, which we find to work better than popular gradient-based optimizers. We demonstrate that our approach produces appealing results and outperforms several baselines.
  • PDF
    We study invariant sets and measures generated by iterated function systems defined on countable discrete spaces that are uniform grids of a finite dimension. The discrete spaces of this type can be considered as models of spaces in which actual numerical computation takes place. In this context, we investigate the possibility of the application of the random iteration algorithm to approximate these discrete IFS invariant sets and measures. The problems concerning a discretization of hyperbolic IFSs are considered as special cases of this more general setting.
  • PDF
    Low Dynamic Range (LDR) to High Dynamic Range (HDR) image translation is an important computer vision problem. There is a significant amount of research utilizing both conventional non-learning methods and modern data-driven approaches, focusing on using both single-exposed and multi-exposed LDR for HDR image reconstruction. However, most current state-of-the-art methods require high-quality paired LDR,HDR datasets for model training. In addition, there is limited literature on using unpaired datasets for this task where the model learns a mapping between domains, i.e., LDR to HDR. To address limitations of current methods, such as the paired data constraint , as well as unwanted blurring and visual artifacts in the reconstructed HDR, we propose a method that uses a modified cycle-consistent adversarial architecture and utilizes unpaired LDR,HDR datasets for training. The method introduces novel generators to address visual artifact removal and an encoder and loss to address semantic consistency, another under-explored topic. The method achieves state-of-the-art results across several benchmark datasets and reconstructs high-quality HDR images.
  • PDF
    The use of machine learning (ML) methods for development of robust and flexible visual inspection system has shown promising. However their performance is highly dependent on the amount and diversity of training data. This is often restricted not only due to costs but also due to a wide variety of defects and product surfaces which occur with varying frequency. As such, one can not guarantee that the acquired dataset contains enough defect and product surface occurrences which are needed to develop a robust model. Using parametric synthetic dataset generation, it is possible to avoid these issues. In this work, we introduce a complete pipeline which describes in detail how to approach image synthesis for surface inspection - from first acquisition, to texture and defect modeling, data generation, comparison to real data and finally use of the synthetic data to train a defect segmentation model. The pipeline is in detail evaluated for milled and sandblasted aluminum surfaces. In addition to providing an in-depth view into each step, discussion of chosen methods, and presentation of ML results, we provide a comprehensive dual dataset containing both real and synthetic images.
  • PDF
    Reconstructing a complete object from its parts is a fundamental problem in many scientific domains. The purpose of this article is to provide a systematic survey on this topic. The reassembly problem requires understanding the attributes of individual pieces and establishing matches between different pieces. Many approaches also model priors of the underlying complete object. Existing approaches are tightly connected problems of shape segmentation, shape matching, and learning shape priors. We provide existing algorithms in this context and emphasize their similarities and differences to general-purpose approaches. We also survey the trends from early non-deep learning approaches to more recent deep learning approaches. In addition to algorithms, this survey will also describe existing datasets, open-source software packages, and applications. To the best of our knowledge, this is the first comprehensive survey on this topic in computer graphics.
  • PDF
    Our goal is to generate realistic human motion from natural language. Modern methods often face a trade-off between model expressiveness and text-to-motion alignment. Some align text and motion latent spaces but sacrifice expressiveness; others rely on diffusion models producing impressive motions, but lacking semantic meaning in their latent space. This may compromise realism, diversity, and applicability. Here, we address this by combining latent diffusion with a realignment mechanism, producing a novel, semantically structured space that encodes the semantics of language. Leveraging this capability, we introduce the task of textual motion inversion to capture novel motion concepts from a few examples. For motion synthesis, we evaluate LEAD on HumanML3D and KIT-ML and show comparable performance to the state-of-the-art in terms of realism, diversity, and text-motion consistency. Our qualitative analysis and user study reveal that our synthesized motions are sharper, more human-like and comply better with the text compared to modern methods. For motion textual inversion, our method demonstrates improved capacity in capturing out-of-distribution characteristics in comparison to traditional VAEs.
  • PDF
    Currently, there are no learning-free or neural techniques for real-time recalibration of infrared multi-camera systems. In this paper, we address the challenge of real-time, highly-accurate calibration of multi-camera infrared systems, a critical task for time-sensitive applications. Unlike traditional calibration techniques that lack adaptability and struggle with on-the-fly recalibrations, we propose a neural network-based method capable of dynamic real-time calibration. The proposed method integrates a differentiable projection model that directly correlates 3D geometries with their 2D image projections and facilitates the direct optimization of both intrinsic and extrinsic camera parameters. Key to our approach is the dynamic camera pose synthesis with perturbations in camera parameters, emulating realistic operational challenges to enhance model robustness. We introduce two model variants: one designed for multi-camera systems with onboard processing of 2D points, utilizing the direct 2D projections of 3D fiducials, and another for image-based systems, employing color-coded projected points for implicitly establishing correspondence. Through rigorous experimentation, we demonstrate our method is more accurate than traditional calibration techniques with or without perturbations while also being real-time, marking a significant leap in the field of real-time multi-camera system calibration. The source code can be found at https://github.com/theICTlab/neural-recalibration
  • PDF
    Recent advances in diffusion models have significantly enhanced image generation capabilities. However, customizing these models with new classes often leads to unintended consequences that compromise their reliability. We introduce the concept of open-world forgetting to emphasize the vast scope of these unintended alterations, contrasting it with the well-studied closed-world forgetting, which is measurable by evaluating performance on a limited set of classes or skills. Our research presents the first comprehensive investigation into open-world forgetting in diffusion models, focusing on semantic and appearance drift of representations. We utilize zero-shot classification to analyze semantic drift, revealing that even minor model adaptations lead to unpredictable shifts affecting areas far beyond newly introduced concepts, with dramatic drops in zero-shot classification of up to 60%. Additionally, we observe significant changes in texture and color of generated content when analyzing appearance drift. To address these issues, we propose a mitigation strategy based on functional regularization, designed to preserve original capabilities while accommodating new concepts. Our study aims to raise awareness of unintended changes due to model customization and advocates for the analysis of open-world forgetting in future research on model customization and finetuning methods. Furthermore, we provide insights for developing more robust adaptation methodologies.
  • PDF
    Voxels are a geometric representation used for rendering volumes, multi-resolution models, and indirect lighting effects. Since the memory consumption of uncompressed voxel volumes scales cubically with resolution, past works have introduced data structures for exploiting spatial sparsity and homogeneity to compress volumes and accelerate ray tracing. However, these works don't systematically evaluate the trade-off between compression and ray intersection performance for a variety of storage formats. We show that a hierarchical combination of voxel formats can achieve Pareto optimal trade-offs between memory consumption and rendering speed. We present a formulation of "hybrid" voxel formats, where each level of a hierarchical format can have a different structure. For evaluation, we implement a metaprogramming system to automatically generate construction and ray intersection code for arbitrary hybrid formats. We also identify transformations on these formats that can improve compression and rendering performance. We evaluate this system with several models and hybrid formats, demonstrating that compared to standalone base formats, hybrid formats achieve a new Pareto frontier in ray intersection performance and storage cost.
  • PDF
    Vision foundation models trained on massive amounts of visual data have shown unprecedented reasoning and planning skills in open-world settings. A key challenge in applying them to robotic tasks is the modality gap between visual data and action data. We introduce differentiable robot rendering, a method allowing the visual appearance of a robot body to be directly differentiable with respect to its control parameters. Our model integrates a kinematics-aware deformable model and Gaussians Splatting and is compatible with any robot form factors and degrees of freedom. We demonstrate its capability and usage in applications including reconstruction of robot poses from images and controlling robots through vision language models. Quantitative and qualitative results show that our differentiable rendering model provides effective gradients for robotic control directly from pixels, setting the foundation for the future applications of vision foundation models in robotics.
  • PDF
    Panoramic image stitching provides a unified, wide-angle view of a scene that extends beyond the camera's field of view. Stitching frames of a panning video into a panoramic photograph is a well-understood problem for stationary scenes, but when objects are moving, a still panorama cannot capture the scene. We present a method for synthesizing a panoramic video from a casually-captured panning video, as if the original video were captured with a wide-angle camera. We pose panorama synthesis as a space-time outpainting problem, where we aim to create a full panoramic video of the same length as the input video. Consistent completion of the space-time volume requires a powerful, realistic prior over video content and motion, for which we adapt generative video models. Existing generative models do not, however, immediately extend to panorama completion, as we show. We instead apply video generation as a component of our panorama synthesis system, and demonstrate how to exploit the strengths of the models while minimizing their limitations. Our system can create video panoramas for a range of in-the-wild scenes including people, vehicles, and flowing water, as well as stationary background features.
  • PDF
    Eyelid shape is integral to identity and likeness in human facial modeling. Human eyelids are diverse in appearance with varied skin fold and epicanthal fold morphology between individuals. Existing parametric face models express eyelid shape variation to an extent, but do not preserve sufficient likeness across a diverse range of individuals. We propose a new definition of eyelid fold consistency and implement geometric processing techniques to model diverse eyelid shapes in a unified topology. Using this method we reprocess data used to train a parametric face model and demonstrate significant improvements in face-related machine learning tasks.
  • PDF
    4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leading to substantial memory and storage cost. This paper introduces a memory-efficient framework for 4DGS. We streamline the color attribute by decomposing it into a per-Gaussian direct color component with only 3 parameters and a shared lightweight alternating current color predictor. This approach eliminates the need for spherical harmonics coefficients, which typically involve up to 144 parameters in classic 4DGS, thereby creating a memory-efficient 4D Gaussian representation. Furthermore, we introduce an entropy-constrained Gaussian deformation technique that uses a deformation field to expand the action range of each Gaussian and integrates an opacity-based entropy loss to limit the number of Gaussians, thus forcing our model to use as few Gaussians as possible to fit a dynamic scene well. With simple half-precision storage and zip compression, our framework achieves a storage reduction by approximately 190$\times$ and 125$\times$ on the Technicolor and Neural 3D Video datasets, respectively, compared to the original 4DGS. Meanwhile, it maintains comparable rendering speeds and scene representation quality, setting a new standard in the field.
  • PDF
    Recent illumination estimation methods have focused on enhancing the resolution and improving the quality and diversity of the generated textures. However, few have explored tailoring the neural network architecture to the Equirectangular Panorama (ERP) format utilised in image-based lighting. Consequently, high dynamic range images (HDRI) results usually exhibit a seam at the side borders and textures or objects that are warped at the poles. To address this shortcoming we propose a novel architecture, 360U-Former, based on a U-Net style Vision-Transformer which leverages the work of PanoSWIN, an adapted shifted window attention tailored to the ERP format. To the best of our knowledge, this is the first purely Vision-Transformer model used in the field of illumination estimation. We train 360U-Former as a GAN to generate HDRI from a limited field of view low dynamic range image (LDRI). We evaluate our method using current illumination estimation evaluation protocols and datasets, demonstrating that our approach outperforms existing and state-of-the-art methods without the artefacts typically associated with the use of the ERP format.
  • PDF
    We propose L3DG, the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation. This enables effective generative 3D modeling, scaling to generation of entire room-scale scenes which can be very efficiently rendered. To enable effective synthesis of 3D Gaussians, we propose a latent diffusion formulation, operating in a compressed latent space of 3D Gaussians. This compressed latent space is learned by a vector-quantized variational autoencoder (VQ-VAE), for which we employ a sparse convolutional architecture to efficiently operate on room-scale scenes. This way, the complexity of the costly generation process via diffusion is substantially reduced, allowing higher detail on object-level generation, as well as scalability to large scenes. By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time. We demonstrate that our approach significantly improves visual quality over prior work on unconditional object-level radiance field synthesis and showcase its applicability to room-scale scene generation.
  • PDF
    Due to the increasing use of virtual avatars, the animation of head-hand interactions has recently gained attention. To this end, we present a novel volumetric and physics-based interaction simulation. In contrast to previous work, our simulation incorporates temporal effects such as collision paths, respects anatomical constraints, and can detect and simulate skin pulling. As a result, we can achieve more natural-looking interaction animations and take a step towards greater realism. However, like most complex and computationally expensive simulations, ours is not real-time capable even on high-end machines. Therefore, we train small and efficient neural networks as accurate approximations that achieve about 200 FPS on consumer GPUs, about 50 FPS on CPUs, and are learned in less than four hours for one person. In general, our focus is not to generalize the approximation networks to low-resolution head models but to adapt them to more detailed personalized avatars. Nevertheless, we show that these networks can learn to approximate our head-hand interaction model for multiple identities while maintaining computational efficiency. Since the quality of the simulations can only be judged subjectively, we conducted a comprehensive user study which confirms the improved realism of our approach. In addition, we provide extensive visual results and inspect the neural approximations quantitatively. All data used in this work has been recorded with a multi--view camera rig and will be made available upon publication. We will also publish relevant implementations.
  • PDF
    Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming. However, recent approaches to language-guided 3D scene editing either require manual interventions or focus only on appearance modifications without supporting comprehensive scene layout changes. In response, we propose Edit-Room, a unified framework capable of executing a variety of layout edits through natural language commands, without requiring manual intervention. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes using a diffusion-based method, enabling six types of edits: rotate, translate, scale, replace, add, and remove. To address the lack of data for language-guided 3D scene editing, we have developed an automatic pipeline to augment existing 3D scene synthesis datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs, for training and evaluation. Our experiments demonstrate that our approach consistently outperforms other baselines across all metrics, indicating higher accuracy and coherence in language-guided scene layout editing.
  • PDF
    Motion Planning (MP) is a critical challenge in robotics, especially pertinent with the burgeoning interest in embodied artificial intelligence. Traditional MP methods often struggle with high-dimensional complexities. Recently neural motion planners, particularly physics-informed neural planners based on the Eikonal equation, have been proposed to overcome the curse of dimensionality. However, these methods perform poorly in complex scenarios with shaped robots due to multiple solutions inherent in the Eikonal equation. To address these issues, this paper presents PC-Planner, a novel physics-constrained self-supervised learning framework for robot motion planning with various shapes in complex environments. To this end, we propose several physical constraints, including monotonic and optimal constraints, to stabilize the training process of the neural network with the Eikonal equation. Additionally, we introduce a novel shape-aware distance field that considers the robot's shape for efficient collision checking and Ground Truth (GT) speed computation. This field reduces the computational intensity, and facilitates adaptive motion planning at test time. Experiments in diverse scenarios with different robots demonstrate the superiority of the proposed method in efficiency and robustness for robot motion planning, particularly in complex environments.
  • PDF
    Implicit neural representations have emerged as a powerful tool in learning 3D geometry, offering unparalleled advantages over conventional representations like mesh-based methods. A common type of INR implicitly encodes a shape's boundary as the zero-level set of the learned continuous function and learns a mapping from a low-dimensional latent space to the space of all possible shapes represented by its signed distance function. However, most INRs struggle to retain high-frequency details, which are crucial for accurate geometric depiction, and they are computationally expensive. To address these limitations, we present a novel approach that both reduces computational expenses and enhances the capture of fine details. Our method integrates periodic activation functions, positional encodings, and normals into the neural network architecture. This integration significantly enhances the model's ability to learn the entire space of 3D shapes while preserving intricate details and sharp features, areas where conventional representations often fall short.
  • PDF
    Recent advancements in Radiance Fields have significantly improved novel-view synthesis. However, in many real-world applications, the more advanced challenge lies in inverse rendering, which seeks to derive the physical properties of a scene, including light, geometry, textures, and materials. Meshes, as a traditional representation adopted by many simulation pipeline, however, still show limited influence in radiance field for inverse rendering. This paper introduces a novel framework called Triangle Patchlet (abbr. Triplet), a mesh-based representation, to comprehensively approximate these scene parameters. We begin by assembling Triplets with either randomly generated points or sparse points obtained from camera calibration where all faces are treated as an independent element. Next, we simulate the physical interaction of light and optimize the scene parameters using traditional graphics rendering techniques like rasterization and ray tracing, accompanying with density control and propagation. An iterative mesh extracting process is also suggested, where we continue to optimize on geometry and materials with graph-based operation. We also introduce several regulation terms to enable better generalization of materials property. Our framework could precisely estimate the light, materials and geometry with mesh without prior of light, materials and geometry in a unified framework. Experiments demonstrate that our approach can achieve state-of-the-art visual quality while reconstructing high-quality geometry and accurate material properties.
  • PDF
    Surface parameterization is a fundamental task in geometry processing and plays an important role in many science and engineering applications. In recent years, the density-equalizing map, a shape deformation technique based on the physical principle of density diffusion, has been utilized for the parameterization of simply connected and multiply connected open surfaces. More recently, a spherical density-equalizing mapping method has been developed for the parameterization of genus-0 closed surfaces. However, for genus-0 closed surfaces with extreme geometry, using a spherical domain for the parameterization may induce large geometric distortion. In this work, we develop a novel method for computing density-equalizing maps of genus-0 closed surfaces onto an ellipsoidal domain. This allows us to achieve ellipsoidal area-preserving parameterizations and ellipsoidal parameterizations with controlled area change. We further propose an energy minimization approach that combines density-equalizing maps and quasi-conformal maps, which allows us to produce ellipsoidal density-equalizing quasi-conformal maps for achieving a balance between density-equalization and quasi-conformality. Using our proposed methods, we can significantly improve the performance of surface remeshing for genus-0 closed surfaces. Experimental results on a large variety of genus-0 closed surfaces are presented to demonstrate the effectiveness of our proposed methods.
  • PDF
    Generalizable neural radiance field (NeRF) enables neural-based digital human rendering without per-scene retraining. When combined with human prior knowledge, high-quality human rendering can be achieved even with sparse input views. However, the inference of these methods is still slow, as a large number of neural network queries on each ray are required to ensure the rendering quality. Moreover, occluded regions often suffer from artifacts, especially when the input views are sparse. To address these issues, we propose a generalizable human NeRF framework that achieves high-quality and real-time rendering with sparse input views by extensively leveraging human prior knowledge. We accelerate the rendering with a two-stage sampling reduction strategy: first constructing boundary meshes around the human geometry to reduce the number of ray samples for sampling guidance regression, and then volume rendering using fewer guided samples. To improve rendering quality, especially in occluded regions, we propose an occlusion-aware attention mechanism to extract occlusion information from the human priors, followed by an image space refinement network to improve rendering quality. Furthermore, for volume rendering, we adopt a signed ray distance function (SRDF) formulation, which allows us to propose an SRDF loss at every sample position to improve the rendering quality further. Our experiments demonstrate that our method outperforms the state-of-the-art methods in rendering quality and has a competitive rendering speed compared with speed-prioritized novel view synthesis methods.