Showing 1–50 of 181 results for author: Yao, A

  1. arXiv:2409.14319  [pdf, other]

    cs.CV cs.MM

    Scene-Text Grounding for Text-Based Video Question Answering

    Authors: Sheng Zhou, Junbin Xiao, Xun Yang, Peipei Song, Dan Guo, Angela Yao, Meng Wang, Tat-Seng Chua

    Abstract: Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scene-text recognition and promoting research towards in…

    Submitted 22 September, 2024; originally announced September 2024.

  2. arXiv:2409.10038  [pdf, other]

    cs.CL cs.AI cs.LG

    On the Diagram of Thought

    Authors: Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: We introduce Diagram of Thought (DoT), a framework that models iterative reasoning in large language models (LLMs) as the construction of a directed acyclic graph (DAG) within a single model. Unlike traditional approaches that represent reasoning as linear chains or trees, DoT organizes propositions, critiques, refinements, and verifications into a cohesive DAG structure, allowing the model to exp…

    Submitted 16 September, 2024; originally announced September 2024.

  3. arXiv:2409.04388  [pdf, other]

    cs.CV cs.AI cs.MM

    Question-Answering Dense Video Events

    Authors: Hangyu Qin, Junbin Xiao, Angela Yao

    Abstract: Multimodal Large Language Models (MLLMs) have shown excellent performance in question-answering of single-event videos. In this paper, we present question-answering dense video events, a novel task that requires answering and grounding the dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events occurring over extended time periods. To…

    Submitted 10 September, 2024; v1 submitted 6 September, 2024; originally announced September 2024.

  4. arXiv:2408.09919  [pdf, other]

    cs.CV

    Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment

    Authors: Zhanzhong Pang, Fadime Sener, Shrinivas Ramasubramanian, Angela Yao

    Abstract: Procedural activity videos often exhibit a long-tailed action distribution due to varying action frequencies and durations. However, state-of-the-art temporal action segmentation methods overlook the long tail and fail to recognize tail actions. Existing long-tail methods make class-independent assumptions and struggle to identify tail classes when applied to temporal segmentation frameworks. This…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  5. arXiv:2408.04223  [pdf, other]

    cs.CV cs.AI

    VideoQA in the Era of LLMs: An Empirical Study

    Authors: Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, Angela Yao

    Abstract: Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA, aiming to elucidate their success and failure modes, and provide insights towards more human-like video underst…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Preprint. Under Review

  6. arXiv:2407.13987  [pdf, other]

    cs.CV

    RealViformer: Investigating Attention for Real-World Video Super-Resolution

    Authors: Yuehan Zhang, Angela Yao

    Abstract: In real-world video super-resolution (VSR), videos suffer from in-the-wild degradations and artifacts. VSR methods, especially recurrent ones, tend to propagate artifacts over time in the real-world setting and are more vulnerable than image super-resolution. This paper investigates the influence of artifacts on commonly used covariance-based attention mechanisms in VSR. Comparing the widely-used…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  7. arXiv:2407.12727  [pdf, other]

    cs.CV

    NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model

    Authors: Zhongqun Zhang, Hengfei Wang, Ziwei Yu, Yihua Cheng, Angela Yao, Hyung Jin Chang

    Abstract: Modeling the physical contacts between the hand and object is standard for refining inaccurate hand poses and generating novel human grasps in 3D hand-object reconstruction. However, existing methods rely on geometric constraints that cannot be specified or controlled. This paper introduces a novel task of controllable 3D hand-object contact modeling with natural language descriptions. Challenges i…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  8. arXiv:2407.07302  [pdf, other]

    eess.IV cs.CV

    Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution

    Authors: Yuehan Zhang, Seungjun Lee, Angela Yao

    Abstract: Standard single-image super-resolution creates paired training data from high-resolution images through fixed downsampling kernels. However, real-world super-resolution (RWSR) faces unknown degradations in the low-resolution inputs, all the while lacking paired training data. Existing methods approach this problem by learning blind general models through complex synthetic augmentations on training…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  9. arXiv:2407.00574  [pdf, other]

    cs.CV

    OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration

    Authors: Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Angela Yao

    Abstract: Accurate camera motion estimation is critical to estimate human motion in the global space. A standard and widely used method for estimating camera motion is Simultaneous Localization and Mapping (SLAM). However, SLAM only provides a trajectory up to an unknown scale factor. Different from previous attempts that optimize the scale factor, this paper presents Optimization-free Camera Motion Scale C…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 12 pages, 7 figures, 4 tables

  10. arXiv:2406.07879  [pdf, other]

    cs.CV cs.AI cs.LG

    KernelWarehouse: Rethinking the Design of Dynamic Convolution

    Authors: Chao Li, Anbang Yao

    Abstract: Dynamic convolution learns a linear mixture of n static kernels weighted with their input-dependent attentions, demonstrating superior performance to normal convolution. However, it increases the number of convolutional parameters by n times, and thus is not parameter efficient. This leads to no research progress that can allow researchers to explore the setting n>100 (an order of magnitude larg…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: This work is accepted to ICML 2024. The project page: https://github.com/OSVAI/KernelWarehouse. arXiv admin note: substantial text overlap with arXiv:2308.08361

  11. arXiv:2406.07876  [pdf, other]

    cs.CV cs.AI cs.LG

    Small Scale Data-Free Knowledge Distillation

    Authors: He Liu, Yikai Wang, Huaping Liu, Fuchun Sun, Anbang Yao

    Abstract: Data-free knowledge distillation is able to utilize the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data, avoiding privacy, security, and proprietary risks in real applications. In this line of research, existing methods typically follow an inversion-and-distillation paradigm in which a generative adversa…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: This work is accepted to CVPR 2024. The project page: https://github.com/OSVAI/SSD-KD

  12. arXiv:2405.19833  [pdf, other]

    cs.CV

    KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation

    Authors: Fengyuan Yang, Kerui Gu, Angela Yao

    Abstract: 2D keypoints are commonly used as an additional cue to refine estimated 3D human meshes. Current methods optimize the pose and shape parameters with a reprojection loss on the provided 2D keypoints. Such an approach, while simple and intuitive, has limited effectiveness because the optimal solution is hard to find in ambiguous parameter space and may sacrifice depth. Additionally, divergent gradie…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR24

  13. arXiv:2404.13904  [pdf, other]

    cs.LG cs.CV

    Deep Regression Representation Learning with Topology

    Authors: Shihao Zhang, Kenji Kawaguchi, Angela Yao

    Abstract: Most works studying representation learning focus only on classification and neglect regression. Yet, the learning objectives and, therefore, the representation topologies of the two tasks are fundamentally different: classification targets class separation, leading to disconnected representations, whereas regression requires ordinality with respect to the target, leading to continuous representat…

    Submitted 16 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: ICML 2024

  14. Toward industrial use of continual learning : new metrics proposal for class incremental learning

    Authors: Konaté Mohamed Abbas, Anne-Françoise Yao, Thierry Chateau, Pierre Bouges

    Abstract: In this paper, we investigate continual learning performance metrics used in class incremental learning strategies for continual learning (CL) using some high performing methods. We investigate especially mean task accuracy. First, we show that it lacks expressiveness through some simple experiments to capture performance. We show that monitoring average task performance is overoptimistic and…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: 7 pages, Accepted at IJCNN 2023

  15. arXiv:2404.04037  [pdf, other]

    cs.CV cs.MM

    InstructHumans: Editing Animated 3D Human Textures with Instructions

    Authors: Jiayin Zhu, Linlin Yang, Angela Yao

    Abstract: We present InstructHumans, a novel framework for instruction-driven 3D human texture editing. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models. This work shows that naively using such scores is harmful to editing as they destroy consistency with the source avatar. Instead, we propose an alternate SDS for Editing (SDS-E) that selec…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: Project Page: https://jyzhu.top/instruct-humans

  16. arXiv:2404.02981  [pdf]

    physics.app-ph cond-mat.mtrl-sci

    Remote-contact catalysis for target-diameter semiconducting carbon nanotube array

    Authors: Jiangtao Wang, Xudong Zheng, Gregory Pitner, Xiang Ji, Tianyi Zhang, Aijia Yao, Jiadi Zhu, Tomás Palacios, Lain-Jong Li, Han Wang, Jing Kong

    Abstract: Electrostatic catalysis has been an exciting development in chemical synthesis (beyond enzyme catalysis) in recent years, boosting reaction rates and selectively producing certain reaction products. Most of the studies to date have been focused on using an external electric field (EEF) to rearrange the charge distribution in small molecule reactions such as Diels-Alder addition, carbene reaction, et…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: 4 figures, 23 pages

  17. arXiv:2403.17924  [pdf, other]

    cs.CV cs.AI

    AID: Attention Interpolation of Text-to-Image Diffusion

    Authors: Qiyuan He, Jinghao Wang, Ziwei Liu, Angela Yao

    Abstract: Conditional diffusion models can create unseen images in various settings, aiding image interpolation. Interpolation in latent spaces is well-studied, but interpolation with specific conditions like text or poses is less understood. Simple approaches, such as linear interpolation in the space of conditions, often result in images that lack consistency, smoothness, and fidelity. To that end, we int…

    Submitted 4 October, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: NeurIPS 2024 Conference Paper

  18. arXiv:2403.16428  [pdf, other]

    cs.CV

    Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

    Authors: Zicong Fan, Takehiko Ohkawa, Linlin Yang, Nie Lin, Zhishan Zhou, Shihao Zhou, Jiajun Liang, Zhong Gao, Xuanyang Zhang, Xue Zhang, Fei Li, Zheng Liu, Feng Lu, Karim Abou Zeid, Bastian Leibe, Jeongwan On, Seungryul Baek, Aditya Prakash, Saurabh Gupta, Kun He, Yoichi Sato, Otmar Hilliges, Hyung Jin Chang, Angela Yao

    Abstract: We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the h…

    Submitted 5 August, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted to ECCV 2024

  19. arXiv:2403.14023  [pdf]

    cs.CR

    A system capable of verifiably and privately screening global DNA synthesis

    Authors: Carsten Baum, Jens Berlips, Walther Chen, Hongrui Cui, Ivan Damgard, Jiangbin Dong, Kevin M. Esvelt, Leonard Foner, Mingyu Gao, Dana Gretton, Martin Kysel, Juanru Li, Xiang Li, Omer Paneth, Ronald L. Rivest, Francesca Sage-Ling, Adi Shamir, Yue Shen, Meicen Sun, Vinod Vaikuntanathan, Lynn Van Hauwe, Theia Vogel, Benjamin Weinstein-Raun, Yun Wang, Daniel Wichs, et al. (5 additional authors not shown)

    Abstract: Printing custom DNA sequences is essential to scientific and biomedical research, but the technology can be used to manufacture plagues as well as cures. Just as ink printers recognize and reject attempts to counterfeit money, DNA synthesizers and assemblers should deny unauthorized requests to make viral DNA that could be used to ignite a pandemic. There are three complications. First, we don't n…

    Submitted 10 September, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: Main text 10 pages, 4 figures. 5 supplementary figures. Total 21 pages. Direct correspondence to: Ivan B. Damgard (ivan@cs.au.dk), Andrew C. Yao (andrewcyao@mail.tsinghua.edu.cn), Kevin M. Esvelt (esvelt@mit.edu)

  20. arXiv:2403.13759  [pdf]

    physics.chem-ph cond-mat.mtrl-sci

    How quickly can sodium-ion learn? Assessing scenarios for techno-economic competitiveness against lithium-ion batteries

    Authors: Adrian Yao, Sally M. Benson, William C. Chueh

    Abstract: Sodium-ion batteries have garnered significant attention as a potentially low-cost alternative to lithium-ion batteries, which have experienced supply shortages and pricing volatility of key minerals. Here we assess their techno-economic competitiveness against incumbent lithium-ion batteries using a modeling framework incorporating componential learning curves constrained by minerals prices and e…

    Submitted 13 September, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: 20 pages, 6 figures, 1 table

  21. arXiv:2403.09805  [pdf, other]

    cs.CV cs.LG

    On the Utility of 3D Hand Poses for Action Recognition

    Authors: Md Salman Shamil, Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao

    Abstract: 3D hand pose is an underexplored modality for action recognition. Poses are compact yet informative and can greatly benefit applications with limited compute budgets. However, poses alone offer an incomplete understanding of actions, as they cannot fully capture objects and environments with which humans interact. We propose HandFormer, a novel multimodal transformer, to efficiently model hand-obj…

    Submitted 14 August, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

    Comments: ECCV 2024; https://s-shamil.github.io/HandFormer/

  22. arXiv:2403.06102  [pdf, other]

    cs.CV

    Coherent Temporal Synthesis for Incremental Action Segmentation

    Authors: Guodong Ding, Hans Golong, Angela Yao

    Abstract: Data replay is a successful incremental learning technique for images. It prevents catastrophic forgetting by keeping a reservoir of previous data, original or synthesized, to ensure the model retains past knowledge while adapting to novel concepts. However, its application in the video domain is rudimentary, as it simply stores frame exemplars for action recognition. This paper presents the first…

    Submitted 10 March, 2024; originally announced March 2024.

    Comments: 10 pages, 6 figures, 5 tables, accepted to CVPR 2024

  23. arXiv:2402.07625  [pdf, other]

    cs.CL cs.AI cs.LG

    Autonomous Data Selection with Language Models for Mathematical Texts

    Authors: Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or trained classifiers with human-annotated data, our approach Autonomous Data Selection (AutoDS) utilizes meta-prompted language models as zero-shot verifiers…

    Submitted 2 April, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

  24. arXiv:2402.02377  [pdf, other]

    cs.CV cs.LG

    NOAH: Learning Pairwise Object Category Attentions for Image Classification

    Authors: Chao Li, Aojun Zhou, Anbang Yao

    Abstract: A modern deep neural network (DNN) for image classification tasks typically consists of two parts: a backbone for feature extraction, and a head for feature encoding and class prediction. We observe that the head structures of mainstream DNNs adopt a similar feature encoding pipeline, exploiting global feature dependencies while disregarding local ones. In this paper, we revisit the feature encod…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: This research work was completed in 2023. Code and pre-trained models are available at https://github.com/OSVAI/NOAH

  25. arXiv:2401.09003  [pdf, other]

    cs.CL cs.AI cs.LG

    Augmenting Math Word Problems via Iterative Question Composing

    Authors: Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao

    Abstract: Despite the advancements in large language models (LLMs) for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base langu…

    Submitted 10 February, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

  26. arXiv:2312.15297  [pdf, other]

    cs.LG cs.CV stat.ML

    Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models

    Authors: Gianni Franchi, Olivier Laurent, Maxence Leguéry, Andrei Bursuc, Andrea Pilzer, Angela Yao

    Abstract: Deep Neural Networks (DNNs) are powerful tools for various computer vision tasks, yet they often struggle with reliable uncertainty quantification - a critical requirement for real-world applications. Bayesian Neural Networks (BNN) are equipped for uncertainty estimation but cannot scale to large DNNs that are highly unstable to train. To address this challenge, we introduce the Adaptable Bayesian…

    Submitted 23 December, 2023; originally announced December 2023.

  27. arXiv:2312.04168  [pdf, other]

    cs.CV cs.AI cs.LG

    Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation

    Authors: Jiawei Fan, Chao Li, Xiaolong Liu, Meina Song, Anbang Yao

    Abstract: In recent years, knowledge distillation methods based on contrastive learning have achieved promising results on image classification and object detection tasks. However, in this line of research, we note that less attention is paid to semantic segmentation. Existing methods heavily rely on data augmentation and memory buffer, which entail high computational resource demands when applying them to…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: The paper of Af-DCD is accepted to NeurIPS 2023. Code and models are available at https://github.com/OSVAI/Af-DCD

  28. arXiv:2312.00462  [pdf, other]

    cs.CV

    Learning Unorthogonalized Matrices for Rotation Estimation

    Authors: Kerui Gu, Zhihao Li, Shiyong Liu, Jianzhuang Liu, Songcen Xu, Youliang Yan, Michael Bi Mi, Kenji Kawaguchi, Angela Yao

    Abstract: Estimating 3D rotations is a common procedure for 3D computer vision. The accuracy depends heavily on the rotation representation. One form of representation -- rotation matrices -- is popular due to its continuity, especially for pose estimation tasks. The learning process usually incorporates orthogonalization to ensure orthonormal matrices. Our work reveals, through gradient analysis, that comm…

    Submitted 1 December, 2023; originally announced December 2023.

  29. arXiv:2311.17105  [pdf, other]

    cs.CV

    On the Calibration of Human Pose Estimation

    Authors: Kerui Gu, Rongyu Chen, Angela Yao

    Abstract: Most 2D human pose estimation frameworks estimate keypoint confidence in an ad-hoc manner, using heuristics such as the maximum value of heatmaps. The confidence is part of the evaluation scheme, e.g., AP for the MSCOCO dataset, yet has been largely overlooked in the development of state-of-the-art methods. This paper takes the first steps in addressing miscalibration in pose estimation. From a ca…

    Submitted 28 November, 2023; originally announced November 2023.

  30. arXiv:2311.11482  [pdf, other]

    cs.AI cs.CL

    Meta Prompting for AI Systems

    Authors: Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: In this work, we present a comprehensive study of Meta Prompting (MP), an innovative technique reshaping the utilization of language models (LMs) and AI systems in problem-solving and data interaction. Grounded in type theory and category theory, Meta Prompting emphasizes the structure and syntax of information over traditional content-centric methods. The paper explores the formal definitions of…

    Submitted 15 June, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

  31. arXiv:2310.17688  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Managing extreme AI risks amid rapid progress

    Authors: Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

    Abstract: Artificial Intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI's impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although rese…

    Submitted 22 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: Published in Science: https://www.science.org/doi/10.1126/science.adn0117

  32. arXiv:2310.17154  [pdf, other]

    cs.CV

    Deep Imbalanced Regression via Hierarchical Classification Adjustment

    Authors: Haipeng Xiong, Angela Yao

    Abstract: Regression tasks in computer vision, such as age estimation or counting, are often formulated into classification by quantizing the target space into classes. Yet real-world data is often imbalanced -- the majority of training samples lie in a head range of target values, while a minority of samples span a usually larger tail range. By selecting the class quantization, one can adjust imbalanced re…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: 14 pages, 5 figures

  33. arXiv:2310.00227  [pdf, other]

    cs.CV

    Scaling for Training Time and Post-hoc Out-of-distribution Detection Enhancement

    Authors: Kai Xu, Rongyu Chen, Gianni Franchi, Angela Yao

    Abstract: The capacity of a modern deep learning system to determine if a sample falls within its realm of knowledge is fundamental and important. In this paper, we offer insights and analyses of recent state-of-the-art out-of-distribution (OOD) detection methods - extremely simple activation shaping (ASH). We demonstrate that activation pruning has a detrimental effect on OOD detection, while activation sc…

    Submitted 29 September, 2023; originally announced October 2023.

  34. arXiv:2309.15478  [pdf, other]

    cs.CV cs.LG

    The Robust Semantic Segmentation UNCV2023 Challenge Results

    Authors: Xuanlong Yu, Yi Zuo, Zitao Wang, Xiaowen Zhang, Jiaxuan Zhao, Yuting Yang, Licheng Jiao, Rui Peng, Xinyi Wang, Junpei Zhang, Kexin Zhang, Fang Liu, Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Hanlin Tian, Kenta Matsui, Tianhao Wang, Fahmy Adan, Zhitong Gao, Xuming He, Quentin Bouniot, Hossein Moghaddam, Shyam Nandan Rai, Fabio Cermelli, et al. (12 additional authors not shown)

    Abstract: This paper outlines the winning solutions employed in addressing the MUAD uncertainty quantification challenge held at ICCV 2023. The challenge was centered around semantic segmentation in urban environments, with a particular focus on natural adversarial scenarios. The report presents the results of 19 submitted entries, with numerous techniques drawing inspiration from cutting-edge uncertainty q…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: 11 pages, 4 figures, accepted at ICCV 2023 UNCV workshop

  35. arXiv:2309.01327  [pdf, other]

    cs.CV cs.AI cs.MM

    Can I Trust Your Answer? Visually Grounded Video Question Answering

    Authors: Junbin Xiao, Angela Yao, Yicong Li, Tat Seng Chua

    Abstract: We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content, versus spurious…

    Submitted 30 March, 2024; v1 submitted 3 September, 2023; originally announced September 2023.

    Comments: Accepted to CVPR'24. (Compared with preprint version, we mainly improve the presentation, discuss more related works, and extend experiments in Appendix.)

  36. arXiv:2308.13628  [pdf, other]

    cs.CV cs.AI

    HiFiHR: Enhancing 3D Hand Reconstruction from a Single Image via High-Fidelity Texture

    Authors: Jiayin Zhu, Zhuoran Zhao, Linlin Yang, Angela Yao

    Abstract: We present HiFiHR, a high-fidelity hand reconstruction approach that utilizes render-and-compare in the learning-based framework from a single image, capable of generating visually plausible and accurate 3D hand meshes while recovering realistic textures. Our method achieves superior texture reconstruction by employing a parametric hand model with predefined texture assets, and by establishing a t…

    Submitted 25 August, 2023; originally announced August 2023.

    Comments: Accepted to DAGM German Conference on Pattern Recognition 2023

  37. arXiv:2308.11488  [pdf, other]

    cs.CV

    Opening the Vocabulary of Egocentric Actions

    Authors: Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao

    Abstract: Human actions in egocentric videos are often hand-object interactions composed from a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations - sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects obs…

    Submitted 12 December, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: NeurIPS 2023 camera ready; https://dibschat.github.io/openvocab-egoAR/

  38. arXiv:2308.08361  [pdf, other]

    cs.CV cs.LG

    KernelWarehouse: Towards Parameter-Efficient Dynamic Convolution

    Authors: Chao Li, Anbang Yao

    Abstract: Dynamic convolution learns a linear mixture of $n$ static kernels weighted with their sample-dependent attentions, demonstrating superior performance compared to normal convolution. However, existing designs are parameter-inefficient: they increase the number of convolutional parameters by $n$ times. This and the optimization difficulty lead to no research progress in dynamic convolution that can…

    Submitted 16 August, 2023; originally announced August 2023.

    Comments: This research work was completed and submitted in early May 2023. Code and pre-trained models are available at https://github.com/OSVAI/KernelWarehouse

  39. arXiv:2308.07571  [pdf, other]

    cs.CV cs.LG

    Ske2Grid: Skeleton-to-Grid Representation Learning for Action Recognition

    Authors: Dongqi Cai, Yangyuxuan Kang, Anbang Yao, Yurong Chen

    Abstract: This paper presents Ske2Grid, a new representation learning framework for improved skeleton-based action recognition. In Ske2Grid, we define a regular convolution operation upon a novel grid representation of human skeleton, which is a compact image-like grid patch constructed and learned through three novel designs. Specifically, we propose a graph-node index transform (GIT) to construct a regula…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: The paper of Ske2Grid is published at ICML 2023. Code and models are available at https://github.com/OSVAI/Ske2Grid

  40. arXiv:2308.04371  [pdf, other]

    cs.AI

    Cumulative Reasoning with Large Language Models

    Authors: Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: Despite the recent advancements in language models (LMs), their ability to solve complex problems remains limited. This paper introduces Cumulative Reasoning (CR), a novel approach that utilizes LMs cumulatively and iteratively, mirroring human thought processes for problem-solving. CR decomposes tasks into smaller, manageable components and leverages previous propositions for effective compositio…

    Submitted 1 April, 2024; v1 submitted 8 August, 2023; originally announced August 2023.

  41. arXiv:2308.02535  [pdf, other]

    cs.CV cs.LG

    Learning to Generate Training Datasets for Robust Semantic Segmentation

    Authors: Marwane Hariat, Olivier Laurent, Rémi Kazmierczak, Shihao Zhang, Andrei Bursuc, Angela Yao, Gianni Franchi

    Abstract: Semantic segmentation methods have advanced significantly. Still, their robustness to real-world perturbations and object types not seen during training remains a challenge, particularly in safety-critical applications. We propose a novel approach to improve the robustness of semantic segmentation techniques by leveraging the synergy between label-to-image generators and image-to-label segmentatio…

    Submitted 12 March, 2024; v1 submitted 1 August, 2023; originally announced August 2023.

    Comments: Published as a conference paper at WACV 2024

    Journal ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024. p. 3894-3905

  42. arXiv:2307.16453  [pdf, other]

    cs.AI cs.LO

    Every Mistake Counts in Assembly

    Authors: Guodong Ding, Fadime Sener, Shugao Ma, Angela Yao

    Abstract: One promising use case of AI assistants is to help with complex procedures like cooking, home repair, and assembly tasks. Can we teach the assistant to interject after the user makes a mistake? This paper targets the problem of identifying ordering mistakes in assembly procedures. We propose a system that can detect ordering mistakes by utilizing a learned knowledge base. Our framework constructs…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 10 pages, 5 figures

  43. arXiv:2305.13803  [pdf, other]

    cs.CV cs.AI cs.LG

    NORM: Knowledge Distillation via N-to-One Representation Matching

    Authors: Xiaolong Liu, Lujun Li, Chao Li, Anbang Yao

    Abstract: Existing feature distillation methods commonly adopt the One-to-one Representation Matching between any pre-selected teacher-student layer pair. In this paper, we present N-to-One Representation (NORM), a new two-stage knowledge distillation method, which relies on a simple Feature Transform (FT) module consisting of two linear layers. In view of preserving the intact information learnt by the tea…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: The paper of NORM is published at ICLR 2023. Code and models are available at https://github.com/OSVAI/NORM

  44. arXiv:2305.06925  [pdf, other]

    cond-mat.mtrl-sci cs.LG physics.chem-ph physics.comp-ph

    Accurate Surface and Finite Temperature Bulk Properties of Lithium Metal at Large Scales using Machine Learning Interaction Potentials

    Authors: Mgcini Keith Phuthi, Archie Mingze Yao, Simon Batzner, Albert Musaelian, Boris Kozinsky, Ekin Dogus Cubuk, Venkatasubramanian Viswanathan

    Abstract: The properties of lithium metal are key parameters in the design of lithium ion and lithium metal batteries. They are difficult to probe experimentally due to the high reactivity and low melting point of lithium as well as the microscopic scales at which lithium exists in batteries where it is found to have enhanced strength, with implications for dendrite suppression strategies. Computationally,…

    Submitted 22 May, 2023; v1 submitted 24 April, 2023; originally announced May 2023.

    Comments: 9 pages, 4 figures, 3 pages of Supporting Information

  45. arXiv:2305.00646  [pdf, other]

    cs.CV

    Overcoming the Trade-off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction

    Authors: Ziwei Yu, Chen Li, Linlin Yang, Xiaoxu Zheng, Michael Bi Mi, Gim Hee Lee, Angela Yao

    Abstract: Direct mesh fitting for 3D hand shape reconstruction is highly accurate. However, the reconstructed meshes are prone to artifacts and do not appear as plausible hand shapes. Conversely, parametric models like MANO ensure plausible hand shapes but are not as accurate as the non-parametric methods. In this work, we introduce a novel weakly-supervised hand shape estimation framework that integrates n…

    Submitted 30 April, 2023; originally announced May 2023.

    Comments: CVPR 2023

  46. arXiv:2305.00163  [pdf, other]

    cs.CV

    Enhancing Video Super-Resolution via Implicit Resampling-based Alignment

    Authors: Kai Xu, Ziwei Yu, Xin Wang, Michael Bi Mi, Angela Yao

    Abstract: In video super-resolution, it is common to use a frame-wise alignment to support the propagation of information over time. The role of alignment is well-studied for low-level enhancement in video, but existing works overlook a critical step -- resampling. We show through extensive experiments that for alignment to be effective, the resampling should preserve the reference frequency spectrum while…

    Submitted 17 January, 2024; v1 submitted 28 April, 2023; originally announced May 2023.

  47. arXiv:2303.15462  [pdf, other]

    physics.optics physics.chem-ph

    Strong chiral optical force for small chiral molecules based on electric-dipole interactions, inspired by the asymmetrical hydrozoan $\textit{Velella velella}$

    Authors: Robert P. Cameron, Duncan McArthur, Alison M. Yao

    Abstract: Drawing inspiration from a remarkable chiral force found in nature, we show that a static electric field combined with an optical lin$\perp$lin polarization standing wave can exert a chiral optical force on a small chiral molecule that is several orders of magnitude stronger than other chiral optical forces proposed to date, being based on leading electric-dipole interactions rather than relying o…

    Submitted 14 July, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

  48. arXiv:2303.14470  [pdf, other]

    cs.CV

    Compacting Binary Neural Networks by Sparse Kernel Selection

    Authors: Yikai Wang, Wenbing Huang, Yinpeng Dong, Fuchun Sun, Anbang Yao

    Abstract: Binary Neural Network (BNN) represents convolution weights with 1-bit values, which enhances the efficiency of storage and computation. This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed: their values are mostly clustered into a small number of codewords. This phenomenon encourages us to compact typical BNNs and o…

    Submitted 25 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  49. Intrinsic and extrinsic anomalous transport properties of Heusler ferromagnets Fe$_2$CoAl and Fe$_2$NiAl from first principles

    Authors: Xiuxian Yang, Wanxiang Feng, Xiao-Ping Li, Gui-Bin Liu, Yuriy Mokrousov, Yugui Yao

    Abstract: Recently, Heusler ferromagnets have been found to exhibit unconventional anomalous electric, thermal, and thermoelectric transport properties. In this study, we employed first-principles density functional theory calculations to systematically investigate both intrinsic and extrinsic contributions to the anomalous Hall effect (AHE), anomalous Nernst effect (ANE), and anomalous thermal Hall effect…

    Submitted 14 July, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

    Comments: 9 pages, 5 figures

    Journal ref: Physical Review B 107, 224405 (2023)

  50. arXiv:2302.13668  [pdf, other]

    cs.CV cs.MM

    Contrastive Video Question Answering via Video Graph Transformer

    Authors: Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng Yan, Tat-Seng Chua

    Abstract: We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text tr…

    Submitted 11 July, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted by IEEE T-PAMI'23