Showing 1–50 of 1,494 results for author: Huang, H

  1. arXiv:2410.16166  [pdf, other]

    cs.CV cs.CL

    Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining

    Authors: Han Huang, Yuqi Huo, Zijia Zhao, Haoyu Lu, Shu Wu, Bingning Wang, Qiang Liu, Weipeng Chen, Liang Wang

    Abstract: Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets. However, $\textit {de facto}$ filter-based data quality enhancement paradigms often discard a substantial portion of high-quality image data due to inadequate semantic alig…

    Submitted 21 October, 2024; originally announced October 2024.

  2. arXiv:2410.15501  [pdf, other]

    quant-ph cs.LG

    Predicting adaptively chosen observables in quantum systems

    Authors: Jerry Huang, Laura Lewis, Hsin-Yuan Huang, John Preskill

    Abstract: Recent advances have demonstrated that $\mathcal{O}(\log M)$ measurements suffice to predict $M$ properties of arbitrarily large quantum many-body systems. However, these remarkable findings assume that the properties to be predicted are chosen independently of the data. This assumption can be violated in practice, where scientists adaptively select properties after looking at previous predictions…

    Submitted 20 October, 2024; originally announced October 2024.

    Comments: 10 pages, 4 figures + 39-page appendix

  3. arXiv:2410.15385  [pdf, other]

    cs.CV

    LoRA-IR: Taming Low-Rank Experts for Efficient All-in-One Image Restoration

    Authors: Yuang Ai, Huaibo Huang, Ran He

    Abstract: Prompt-based all-in-one image restoration (IR) frameworks have achieved remarkable performance by incorporating degradation-specific information into prompt modules. Nevertheless, handling the complex and diverse degradations encountered in real-world scenarios remains a significant challenge. To address this challenge, we propose LoRA-IR, a flexible framework that dynamically leverages compact lo…

    Submitted 20 October, 2024; originally announced October 2024.

  4. arXiv:2410.15326  [pdf, other]

    cs.CL

    A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice

    Authors: Hsiu-Yuan Huang, Yutong Yang, Zhaoxi Zhang, Sanwoo Lee, Yunfang Wu

    Abstract: As large language models (LLMs) continue to evolve, understanding and quantifying the uncertainty in their predictions is critical for enhancing application credibility. However, the existing literature relevant to LLM uncertainty estimation often relies on heuristic approaches, lacking systematic classification of the methods. In this survey, we clarify the definitions of uncertainty and confiden…

    Submitted 20 October, 2024; originally announced October 2024.

    Comments: 9 pages

  5. arXiv:2410.15287  [pdf, other]

    cs.CL

    Training Language Models to Critique With Multi-agent Feedback

    Authors: Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, Kai Chen

    Abstract: Critique ability, a meta-cognitive capability of humans, presents significant challenges for LLMs to improve. Recent works primarily rely on supervised fine-tuning (SFT) using critiques generated by a single LLM like GPT-4. However, these model-generated critiques often exhibit flaws due to the inherent complexity of the critique. Consequently, fine-tuning LLMs on such flawed critiques typically l…

    Submitted 20 October, 2024; originally announced October 2024.

  6. arXiv:2410.14919  [pdf, other]

    cs.CV cs.LG

    Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step

    Authors: Mingyuan Zhou, Huangjie Zheng, Yi Gu, Zhendong Wang, Hai Huang

    Abstract: Score identity Distillation (SiD) is a data-free method that has achieved state-of-the-art performance in image generation by leveraging only a pretrained diffusion model, without requiring any training data. However, the ultimate performance of SiD is constrained by the accuracy with which the pretrained model captures the true data scores at different stages of the diffusion process. In this pap…

    Submitted 18 October, 2024; originally announced October 2024.

  7. arXiv:2410.13808  [pdf, other]

    cs.CL

    De-mark: Watermark Removal in Large Language Models

    Authors: Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang

    Abstract: Watermarking techniques offer a promising way to identify machine-generated content via embedding covert information into the contents generated from language models (LMs). However, the robustness of the watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel queryi…

    Submitted 17 October, 2024; originally announced October 2024.

  8. arXiv:2410.13805  [pdf, other]

    cs.CL

    A Watermark for Order-Agnostic Language Models

    Authors: Ruibo Chen, Yihan Wu, Yanshuo Chen, Chenxi Liu, Junfeng Guo, Heng Huang

    Abstract: Statistical watermarking techniques are well-established for sequentially decoded language models (LMs). However, these techniques cannot be directly applied to order-agnostic LMs, as the tokens in order-agnostic LMs are not generated sequentially. In this work, we introduce Pattern-mark, a pattern-based watermarking framework specifically designed for order-agnostic LMs. We develop a Markov-chain…

    Submitted 17 October, 2024; originally announced October 2024.

  9. arXiv:2410.13694  [pdf, other]

    cs.CV cs.CL

    Exploring the Design Space of Visual Context Representation in Video MLLMs

    Authors: Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Wayne Xin Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

    Abstract: Video Multimodal Large Language Models (MLLMs) have shown remarkable capability of understanding the video semantics on various downstream tasks. Despite the advancements, there is still a lack of systematic research on visual context representation, which refers to the scheme to select frames from a video and further select the tokens from a frame. In this paper, we explore the design space for v…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: Long Video MLLM; work in progress

  10. arXiv:2410.12219  [pdf, other]

    cs.AI cs.CL cs.MM

    OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

    Authors: Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, Boqing Gong

    Abstract: We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models, such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. Particularly, the user message might often consist of multiple modalities, such that OLMs have to establish holistic understanding and reasoning across modaliti…

    Submitted 16 October, 2024; originally announced October 2024.

    Comments: 19 pages, 6 figures, 12 tables

  11. arXiv:2410.11824  [pdf, other]

    cs.CV

    KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities

    Authors: Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C. K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang

    Abstract: Recent advancements in text-to-image generation have significantly enhanced the quality of synthesized images. Despite this progress, evaluations predominantly focus on aesthetic appeal or alignment with text prompts. Consequently, there is limited understanding of whether these models can accurately represent a wide variety of realistic visual entities - a task requiring real-world knowledge. To…

    Submitted 15 October, 2024; originally announced October 2024.

    Comments: Project page: https://kitten-project.github.io/

  12. arXiv:2410.11588  [pdf, other]

    cs.CL

    Causal Reasoning in Large Language Models: A Knowledge Graph Approach

    Authors: Yejin Kim, Eojin Kang, Juae Kim, H. Howie Huang

    Abstract: Large language models (LLMs) typically improve performance by either retrieving semantically similar information, or enhancing reasoning abilities through structured prompts like chain-of-thought. While both strategies are considered crucial, it remains unclear which has a greater impact on model performance or whether a combination of both is necessary. This paper answers this question by proposi…

    Submitted 15 October, 2024; originally announced October 2024.

    Comments: Accepted at NeurIPS 2024 Workshop on Causality and Large Models (CaLM)

  13. arXiv:2410.11207  [pdf]

    cs.LG physics.optics

    Cross-Dataset Generalization in Deep Learning

    Authors: Xuyu Zhang, Haofan Huang, Dawei Zhang, Songlin Zhuang, Shensheng Han, Puxiang Lai, Honglin Liu

    Abstract: Deep learning has been extensively used in various fields, such as phase imaging, 3D imaging reconstruction, phase unwrapping, and laser speckle reduction, particularly for complex problems that lack analytic models. Its data-driven nature allows for implicit construction of mathematical relationships within the network through training with abundant data. However, a critical challenge in practica…

    Submitted 14 October, 2024; originally announced October 2024.

  14. arXiv:2410.11182  [pdf, other]

    cs.LG cs.AI cs.CR

    Archilles' Heel in Semi-open LLMs: Hiding Bottom against Recovery Attacks

    Authors: Hanbo Huang, Yihan Li, Bowen Jiang, Lin Liu, Ruoyu Sun, Zhuotao Liu, Shiyu Liang

    Abstract: Closed-source large language models deliver strong performance but have limited downstream customizability. Semi-open models, combining both closed-source and public layers, were introduced to improve customizability. However, parameters in the closed-source layers are found vulnerable to recovery attacks. In this paper, we explore the design of semi-open models with fewer closed-source layers, ai…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 10 pages for main content of the paper

  15. arXiv:2410.10835  [pdf, other]

    cs.IR cs.LG

    DIIT: A Domain-Invariant Information Transfer Method for Industrial Cross-Domain Recommendation

    Authors: Heyuan Huang, Xingyu Lou, Chaochao Chen, Pengxiang Cheng, Yue Xin, Chengwei He, Xiang Liu, Jun Wang

    Abstract: Cross-Domain Recommendation (CDR) methods have received widespread attention due to their ability to utilize rich information across domains. However, most existing CDR methods assume an ideal static condition that is not practical in industrial recommendation systems (RS). Therefore, simply applying existing CDR methods in the industrial RS environment may lead to low effectiveness and efficiency. To fil…

    Submitted 29 September, 2024; originally announced October 2024.

    Comments: Accepted at CIKM 2024

  16. arXiv:2410.10663  [pdf, other]

    cs.CV cs.LG

    Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

    Authors: Zhengwei Yang, Yuke Li, Qiang Sun, Basura Fernando, Heng Huang, Zheng Wang

    Abstract: Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize on unseen data using only a small number of labeled examples from the same modality. However, real-world data are inherently multi-modal, and unimodal approaches limit the practical applications of few-shot learning. To address this gap, this paper introduces the Cross-modal Few-Shot Learn…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 19 pages, 7 figures

  17. arXiv:2410.10414  [pdf, other]

    cs.CR cs.CL cs.LG

    On Calibration of LLM-based Guard Models for Reliable Content Moderation

    Authors: Hongfu Liu, Hengguan Huang, Hao Wang, Xiangming Gu, Ye Wang

    Abstract: Large language models (LLMs) pose significant risks due to the potential for generating harmful content or users attempting to evade guardrails. Existing studies have developed LLM-based guard models designed to moderate the input and output of threat LLMs, ensuring adherence to safety policies by blocking content that violates these protocols upon deployment. However, limited attention has been g…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 19 pages, 9 figures

  18. arXiv:2410.10116  [pdf, other]

    quant-ph cs.CC cs.CL math-ph

    How to Construct Random Unitaries

    Authors: Fermi Ma, Hsin-Yuan Huang

    Abstract: The existence of pseudorandom unitaries (PRUs) -- efficient quantum circuits that are computationally indistinguishable from Haar-random unitaries -- has been a central open question, with significant implications for cryptography, complexity theory, and fundamental physics. In this work, we close this question by proving that PRUs exist, assuming that any quantum-secure one-way function exists. W…

    Submitted 13 October, 2024; originally announced October 2024.

    Comments: 76 pages

  19. arXiv:2410.09692  [pdf, other]

    cs.LG cs.AI

    ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws

    Authors: Hai Huang, Randall Balestriero

    Abstract: Low-Rank Adaptation (LoRA) is the bread and butter of Large Language Model (LLM) finetuning. LoRA learns an additive low-rank perturbation, $AB$, of a pretrained matrix parameter $W$ to align the model to a new task or dataset with $W+AB$. We identify three core limitations to LoRA for finetuning--a setting that employs a limited amount of data and training steps. First, LoRA employs Dropout to prev…

    Submitted 12 October, 2024; originally announced October 2024.

  20. arXiv:2410.09418  [pdf, other]

    cs.CL

    Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models

    Authors: Yi-Fan Lu, Xian-Ling Mao, Tian Lan, Chen Xu, Heyan Huang

    Abstract: Event extraction has gained extensive research attention due to its broad range of applications. However, the current mainstream evaluation method for event extraction relies on token-level exact match, which misjudges numerous semantic-level correct cases. This reliance leads to a significant discrepancy between the evaluated performance of models under exact match criteria and their real perform…

    Submitted 12 October, 2024; originally announced October 2024.

  21. arXiv:2410.09036  [pdf]

    cs.RO

    Design and Performance Evaluation of an Elbow-Based Biomechanical Energy Harvester

    Authors: Hubert Huang, Jeffrey Huang

    Abstract: Carbon emissions have long been attributed to the increase in climate change. With the effects of climate change escalating in the past few years, there has been an increased effort to find green alternatives to power generation, which has been a major contributor to carbon emissions. One prominent way that has arisen is biomechanical energy, or harvesting energy based on natural human movement. T…

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: 8 pages, 9 figures

    ACM Class: I.2.9

  22. arXiv:2410.08989  [pdf, other]

    cs.LG cs.AI

    SubZero: Random Subspace Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning

    Authors: Ziming Yu, Pan Zhou, Sike Wang, Jia Li, Hua Huang

    Abstract: Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands for backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative by using forward passes to estimate gradients, but the variance of gradient estimates typically scales linearly with the mod…

    Submitted 11 October, 2024; originally announced October 2024.

  23. arXiv:2410.08564  [pdf, other]

    cs.CL cs.LG

    Similar Phrases for Cause of Actions of Civil Cases

    Authors: Ho-Chien Huang, Chao-Lin Liu

    Abstract: In the Taiwanese judicial system, Cause of Actions (COAs) are essential for identifying relevant legal judgments. However, the lack of standardized COA labeling creates challenges in filtering cases using basic methods. This research addresses this issue by leveraging embedding and clustering techniques to analyze the similarity between COAs based on cited legal articles. The study implements vari…

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: 10 pages, 4 figures, 3 tables (including appendix)

  24. arXiv:2410.08282  [pdf, other]

    cs.RO cs.AI cs.CV cs.GR

    FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction

    Authors: Irving Fang, Kairui Shi, Xujin He, Siqi Tan, Yifan Wang, Hanwen Zhao, Hung-Jui Huang, Wenzhen Yuan, Chen Feng, Jing Zhang

    Abstract: Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings. Emulating this capability, we introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. FusionSense addresses three key challenges: (i) How can robo…

    Submitted 10 October, 2024; originally announced October 2024.

    ACM Class: I.4.5; I.4.8

  25. arXiv:2410.08181  [pdf, other]

    cs.CV

    RGM: Reconstructing High-fidelity 3D Car Assets with Relightable 3D-GS Generative Model from a Single Image

    Authors: Xiaoxue Chen, Jv Zheng, Hao Huang, Haoran Xu, Weihao Gu, Kangliang Chen, He xiang, Huan-ang Gao, Hao Zhao, Guyue Zhou, Yaqin Zhang

    Abstract: The generation of high-quality 3D car assets is essential for various applications, including video games, autonomous driving, and virtual reality. Current 3D generation methods utilizing NeRF or 3D-GS as representations for 3D objects generate a Lambertian object under fixed lighting and lack separate modeling of material and global illumination. As a result, the generated assets are unsuitab…

    Submitted 10 October, 2024; originally announced October 2024.

  26. arXiv:2410.06511  [pdf, other]

    cs.CL cs.AI cs.DC cs.LG

    TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

    Authors: Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, Stratos Idreos

    Abstract: The development of large language models (LLMs) has been instrumental in advancing state-of-the-art natural language processing applications. Training LLMs with billions of parameters and trillions of tokens requires sophisticated distributed systems that enable composing and comparing several state-of-the-art techniques in order to efficiently scale across thousands of accelerators. However, exist…

    Submitted 8 October, 2024; originally announced October 2024.

  27. arXiv:2410.05772  [pdf]

    cs.CV

    Comparative Analysis of Novel View Synthesis and Photogrammetry for 3D Forest Stand Reconstruction and extraction of individual tree parameters

    Authors: Guoji Tian, Chongcheng Chen, Hongyu Huang

    Abstract: Accurate and efficient 3D reconstruction of trees is crucial for forest resource assessments and management. Close-Range Photogrammetry (CRP) is commonly used for reconstructing forest scenes but faces challenges like low efficiency and poor quality. Recently, Novel View Synthesis (NVS) technologies, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have shown promise for 3…

    Submitted 8 October, 2024; originally announced October 2024.

    Comments: 31 pages, 15 figures

  28. arXiv:2410.04555  [pdf, other]

    cs.LG cs.CY

    $\texttt{dattri}$: A Library for Efficient Data Attribution

    Authors: Junwei Deng, Ting-Wei Li, Shiyuan Zhang, Shixuan Liu, Yijun Pan, Hao Huang, Xinhe Wang, Pingbang Hu, Xingjian Zhang, Jiaqi W. Ma

    Abstract: Data attribution methods aim to quantify the influence of individual training samples on the prediction of artificial intelligence (AI) models. As training data plays an increasingly crucial role in the modern development of large-scale AI models, data attribution has found broad applications in improving AI performance and safety. However, despite a surge of new data attribution methods being dev…

    Submitted 6 October, 2024; originally announced October 2024.

  29. arXiv:2410.04208  [pdf]

    cs.CY

    Assessing the Impact of Disorganized Background Noise on Timed Stress Task Performance Through Attention Using Machine-Learning Based Eye-Tracking Techniques

    Authors: Hubert Huang, Jeffrey Huang

    Abstract: Noise pollution has been rising alongside urbanization. Literature shows that disorganized background noise decreases attention. Timed testing, an attention-demanding stress task, has become increasingly important in assessing students' academic performance. However, there is insufficient research on how background noise affects performance in timed stress tasks by impacting attention, which this…

    Submitted 5 October, 2024; originally announced October 2024.

    Comments: 18 pages, 13 figures

    ACM Class: K.4.0

  30. arXiv:2410.03989  [pdf, other]

    cs.LG

    Symmetry From Scratch: Group Equivariance as a Supervised Learning Task

    Authors: Haozhe Huang, Leo Kaixuan Cheng, Kaiwen Chen, Alán Aspuru-Guzik

    Abstract: In machine learning datasets with symmetries, the paradigm for backward compatibility with symmetry-breaking has been to relax equivariant architectural constraints, engineering extra weights to differentiate symmetries of interest. However, this process becomes increasingly over-engineered as models are geared towards specific symmetries/asymmetries hardwired of a particular set of equivariant ba…

    Submitted 4 October, 2024; originally announced October 2024.

  31. arXiv:2410.03769  [pdf, other]

    cs.CL cs.AI cs.CR

    SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks

    Authors: Tianhao Li, Jingyu Lu, Chuangxin Chu, Tianyu Zeng, Yujia Zheng, Mei Li, Haotian Huang, Bin Wu, Zuoxian Liu, Kai Ma, Xuejing Yuan, Xingkai Wang, Keyan Ding, Huajun Chen, Qiang Zhang

    Abstract: Large language models (LLMs) have had a transformative impact on a variety of scientific tasks across disciplines such as biology, chemistry, medicine, and physics. However, ensuring the safety alignment of these models in scientific research remains an underexplored area, with existing benchmarks primarily focusing on textual content and overlooking key scientific representations such as molecular,…

    Submitted 2 October, 2024; originally announced October 2024.

  32. arXiv:2410.02768  [pdf, other]

    cs.CV cs.AI

    BoViLA: Bootstrapping Video-Language Alignment via LLM-Based Self-Questioning and Answering

    Authors: Jin Chen, Kaijing Ma, Haojian Huang, Jiayu Shen, Han Fang, Xianghao Zang, Chao Ban, Zhongjiang He, Hao Sun, Yanmei Kang

    Abstract: The development of multi-modal models has been rapidly advancing, with some demonstrating remarkable capabilities. However, annotating video-text pairs remains expensive and insufficient. Taking video question answering (VideoQA) tasks as an example, human-annotated questions and answers often cover only part of the video, and similar semantics can also be expressed through different text forms, lea…

    Submitted 17 September, 2024; originally announced October 2024.

  33. arXiv:2410.02720  [pdf, other]

    cs.CV cs.AI

    Curvature Diversity-Driven Deformation and Domain Alignment for Point Cloud

    Authors: Mengxi Wu, Hao Huang, Yi Fang, Mohammad Rostami

    Abstract: Unsupervised Domain Adaptation (UDA) is crucial for reducing the need for extensive manual data annotation when training deep networks on point cloud data. A significant challenge of UDA lies in effectively bridging the domain gap. To tackle this challenge, we propose Curvature Diversity-Driven Nuclear-Norm Wasserstein Domain Alignment (CDND). Our approach first…

    Submitted 4 October, 2024; v1 submitted 3 October, 2024; originally announced October 2024.

  34. arXiv:2410.02712  [pdf, other]

    cs.CV cs.CL

    LLaVA-Critic: Learning to Evaluate Multimodal Models

    Authors: Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, Chunyuan Li

    Abstract: We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-a…

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: Project Page: https://llava-vl.github.io/blog/2024-10-03-llava-critic

  35. Obtaining Lower Query Complexities through Lightweight Zeroth-Order Proximal Gradient Algorithms

    Authors: Bin Gu, Xiyuan Wei, Hualin Zhang, Yi Chang, Heng Huang

    Abstract: Zeroth-order (ZO) optimization is one key technique for machine learning problems where gradient calculation is expensive or impossible. Several variance reduced ZO proximal algorithms have been proposed to speed up ZO optimization for non-smooth problems, and all of them opted for the coordinated ZO estimator against the random ZO estimator when approximating the true gradient, since the former i…

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: Neural Computation 36 (5), 897-935

    Journal ref: Neural Computation, 2024, 36(5): 897-935

  36. arXiv:2410.02098  [pdf, other]

    cs.CV cs.LG

    EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

    Authors: Haotian Sun, Tao Lei, Bowen Zhang, Yanghao Li, Haoshuo Huang, Ruoming Pang, Bo Dai, Nan Du

    Abstract: Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion tr…

    Submitted 4 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

  37. arXiv:2410.00255  [pdf, other]

    cs.AI cs.CL cs.CV

    Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

    Authors: Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan

    Abstract: Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-follo…

    Submitted 30 September, 2024; originally announced October 2024.

    Comments: 10 pages

  38. arXiv:2409.19606  [pdf, other]

    cs.LG cs.CL cs.CV cs.NE

    Hyper-Connections

    Authors: Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, Xun Zhou

    Abstract: We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between feature…

    Submitted 29 September, 2024; originally announced September 2024.

  39. arXiv:2409.19342  [pdf, other]

    cs.CV

    X-Prompt: Multi-modal Visual Prompt for Video Object Segmentation

    Authors: Pinxue Guo, Wanyun Li, Hao Huang, Lingyi Hong, Xinyu Zhou, Zhaoyu Chen, Jinglun Li, Kaixun Jiang, Wei Zhang, Wenqiang Zhang

    Abstract: Multi-modal Video Object Segmentation (VOS), including RGB-Thermal, RGB-Depth, and RGB-Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full-parameter fine-tuning f…

    Submitted 28 September, 2024; originally announced September 2024.

    Comments: ACMMM'2024

  40. arXiv:2409.18679  [pdf, other]

    cs.CL

    "Why" Has the Least Side Effect on Model Editing

    Authors: Tsung-Hsuan Pan, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen

    Abstract: Training large language models (LLMs) from scratch is an expensive endeavor, particularly as world knowledge continually evolves. To maintain relevance and accuracy of LLMs, model editing has emerged as a pivotal research area. While these methods hold promise, they can also produce unintended side effects. Their underlying factors and causes remain largely unexplored. This paper delves into a cri…

    Submitted 27 September, 2024; originally announced September 2024.

  41. arXiv:2409.18677  [pdf, other]

    cs.CL

    Co-Trained Retriever-Generator Framework for Question Generation in Earnings Calls

    Authors: Yining Juan, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen

    Abstract: In diverse professional environments, ranging from academic conferences to corporate earnings calls, the ability to anticipate audience questions stands paramount. Traditional methods, which rely on manual assessment of an audience's background, interests, and subject knowledge, often fall short - particularly when facing large or heterogeneous groups, leading to imprecision and inefficiency. Whil…

    Submitted 27 September, 2024; originally announced September 2024.

  42. arXiv:2409.18541  [pdf, other]

    cs.AI

    Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

    Authors: Hongzhe Huang, Zhewen Yu, Jiang Liu, Li Cai, Dian Jiao, Wenqiao Zhang, Siliang Tang, Juncheng Li, Hao Jiang, Haoyuan Li, Yueting Zhuang

    Abstract: Recent advances in Multi-modal Large Language Models (MLLMs), such as LLaVA-series models, are driven by massive machine-generated instruction-following data tuning. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and L…

    Submitted 27 September, 2024; originally announced September 2024.

  43. arXiv:2409.18168  [pdf, other]

    cs.LG

    Jump Diffusion-Informed Neural Networks with Transfer Learning for Accurate American Option Pricing under Data Scarcity

    Authors: Qiguo Sun, Hanyue Huang, XiBei Yang, Yuwei Zhang

    Abstract: Option pricing models, essential in financial mathematics and risk management, have been extensively studied and recently advanced by AI methodologies. However, American option pricing remains challenging due to the complexity of determining optimal exercise times and modeling non-linear payoffs resulting from stochastic paths. Moreover, the prevalent use of the Black-Scholes formula in hybrid mod…

    Submitted 26 September, 2024; originally announced September 2024.

  44. arXiv:2409.17791  [pdf, other]

    cs.CL cs.AI

    Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

    Authors: Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

    Abstract: Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These approaches commonly use a binary cross-entropy mechanism on pairwise samples, i.e., minimizing and maximizing the loss based on preferred or dis-preferred respo…

    Submitted 26 September, 2024; originally announced September 2024.

    Comments: Accepted at EMNLP 2024 Findings

  45. arXiv:2409.17566  [pdf, other]

    cs.CV

    Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule

    Authors: Hongtao Huang, Xiaojun Chang, Lina Yao

    Abstract: Diffusion models are cutting-edge generative models adept at producing diverse, high-quality images. Despite their effectiveness, these models often require significant computational resources owing to their numerous sequential denoising steps and the significant inference cost of each step. Recently, Neural Architecture Search (NAS) techniques have been employed to automatically search for faster…

    Submitted 26 September, 2024; originally announced September 2024.

  46. arXiv:2409.17419  [pdf, other]

    cs.CL

    Pre-Finetuning with Impact Duration Awareness for Stock Movement Prediction

    Authors: Chr-Jr Chiu, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen

    Abstract: Understanding the duration of news events' impact on the stock market is crucial for effective time-series forecasting, yet this facet is largely overlooked in current research. This paper addresses this research gap by introducing a novel dataset, the Impact Duration Estimation Dataset (IDED), specifically designed to estimate impact duration based on investor opinions. Our research establishes t…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: NTCIR-18 FinArg-2 Dataset

  47. arXiv:2409.17417  [pdf, other]

    cs.CL

    Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis

    Authors: Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao

    Abstract: In the era of rapid Internet and social media platform development, individuals readily share their viewpoints online. The overwhelming quantity of these posts renders comprehensive analysis impractical. This necessitates an efficient recommendation system to filter and present significant, relevant opinions. Our research introduces a dual-pronged argument mining technique to improve recommendatio…

    Submitted 25 September, 2024; originally announced September 2024.

  48. arXiv:2409.17294  [pdf, other]

    stat.ML cs.LG

    Schrödinger bridge based deep conditional generative learning

    Authors: Hanwen Huang

    Abstract: Conditional generative models represent a significant advancement in the field of machine learning, allowing for the controlled synthesis of data by incorporating additional information into the generation process. In this work we introduce a novel Schrödinger bridge based deep generative method for learning conditional distributions. We start from a unit-time diffusion process governed by a stoch…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: 22 pages, 4 figures

  49. arXiv:2409.17126  [pdf, other]

    cs.RO cs.AI cs.LG

    Blox-Net: Generative Design-for-Robot-Assembly Using VLM Supervision, Physics Simulation, and a Robot with Reset

    Authors: Andrew Goldberg, Kavish Kondap, Tianshuang Qiu, Zehan Ma, Letian Fu, Justin Kerr, Huang Huang, Kaiyuan Chen, Kuan Fang, Ken Goldberg

    Abstract: Generative AI systems have shown impressive capabilities in creating text, code, and images. Inspired by the rich history of research in industrial "Design for Assembly", we introduce a novel problem: Generative Design-for-Robot-Assembly (GDfRA). The task is to generate an assembly based on a natural language prompt (e.g., "giraffe") and an image of available physical components, such as 3D-pr…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: 8 pages, 7 Figures

  50. arXiv:2409.16863  [pdf, other]

    cs.CV

    Towards Unified 3D Hair Reconstruction from Single-View Portraits

    Authors: Yujian Zheng, Yuda Qiu, Leyang Jin, Chongyang Ma, Haibin Huang, Di Zhang, Pengfei Wan, Xiaoguang Han

    Abstract: Single-view 3D hair reconstruction is challenging, due to the wide range of shape variations among diverse hairstyles. Current state-of-the-art methods are specialized in recovering un-braided 3D hairs and often take braided styles as their failure cases, because of the inherent difficulty to define priors for complex hairstyles, whether rule-based or data-based. We propose a novel strategy to ena…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: SIGGRAPH Asia 2024, project page: https://unihair24.github.io