subscribe to arXiv mailings

ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts

Authors: Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, Qi Tian

Abstract: Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classificati… ▽ More Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification. However, we observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder experts from effectively acquiring helpful knowledge. To address this, we introduce a shared expert to learn and capture common information, serving as an effective way to construct stable ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior, revealing which MoE layers are capable of specializing in handling specific information and which are not. This provides guidance for retaining the critical layers while removing redundancies, thereby advancing ViMoE to be more efficient without sacrificing accuracy. We aspire for this work to offer new insights into the design of vision MoE models and provide valuable empirical guidance for future research. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15020 [pdf, other]

Iterative Methods via Locally Evolving Set Process

Authors: Baojian Zhou, Yifan Sun, Reza Babanezhad Harikandeh, Xingzhi Guo, Deqing Yang, Yanghua Xiao

Abstract: Given the damping factor $α$ and precision tolerance $ε$, \citet{andersen2006local} introduced Approximate Personalized PageRank (APPR), the \textit{de facto local method} for approximating the PPR vector, with runtime bounded by $Θ(1/(αε))$ independent of the graph size. Recently, \citet{fountoulakis2022open} asked whether faster local algorithms could be developed using $\tilde{O}(1/(\sqrtαε))$… ▽ More Given the damping factor $α$ and precision tolerance $ε$, \citet{andersen2006local} introduced Approximate Personalized PageRank (APPR), the \textit{de facto local method} for approximating the PPR vector, with runtime bounded by $Θ(1/(αε))$ independent of the graph size. Recently, \citet{fountoulakis2022open} asked whether faster local algorithms could be developed using $\tilde{O}(1/(\sqrtαε))$ operations. By noticing that APPR is a local variant of Gauss-Seidel, this paper explores the question of \textit{whether standard iterative solvers can be effectively localized}. We propose to use the \textit{locally evolving set process}, a novel framework to characterize the algorithm locality, and demonstrate that many standard solvers can be effectively localized. Let $\overline{\operatorname{vol}}{ (S_t)}$ and $\overlineγ_{t}$ be the running average of volume and the residual ratio of active nodes $\textstyle S_{t}$ during the process. We show $\overline{\operatorname{vol}}{ (S_t)}/\overlineγ_{t} \leq 1/ε$ and prove APPR admits a new runtime bound $\tilde{O}(\overline{\operatorname{vol}}(S_t)/(α\overlineγ_{t}))$ mirroring the actual performance. Furthermore, when the geometric mean of residual reduction is $Θ(\sqrtα)$, then there exists $c \in (0,2)$ such that the local Chebyshev method has runtime $\tilde{O}(\overline{\operatorname{vol}}(S_{t})/(\sqrtα(2-c)))$ without the monotonicity assumption. Numerical results confirm the efficiency of this novel framework and show up to a hundredfold speedup over corresponding standard solvers on real-world graphs. △ Less

Submitted 19 October, 2024; originally announced October 2024.

Comments: 58 pages, 15 figures, NeurIPS 2024

arXiv:2410.14932 [pdf, other]

Can AI weather models predict out-of-distribution gray swan tropical cyclones?

Authors: Y. Qiang Sun, Pedram Hassanzadeh, Mohsen Zand, Ashesh Chattopadhyay, Jonathan Weare, Dorian S. Abbot

Abstract: Predicting gray swan weather extremes, which are possible but so rare that they are absent from the training dataset, is a major concern for AI weather/climate models. An important open question is whether AI models can extrapolate from weaker weather events present in the training set to stronger, unseen weather extremes. To test this, we train independent versions of the AI model FourCastNet on… ▽ More Predicting gray swan weather extremes, which are possible but so rare that they are absent from the training dataset, is a major concern for AI weather/climate models. An important open question is whether AI models can extrapolate from weaker weather events present in the training set to stronger, unseen weather extremes. To test this, we train independent versions of the AI model FourCastNet on the 1979-2015 ERA5 dataset with all data, or with Category 3-5 tropical cyclones (TCs) removed, either globally or only over the North Atlantic or Western Pacific basin. We then test these versions of FourCastNet on 2018-2023 Category 5 TCs (gray swans). All versions yield similar accuracy for global weather, but the one trained without Category 3-5 TCs cannot accurately forecast Category 5 TCs, indicating that these models cannot extrapolate from weaker storms. The versions trained without Category 3-5 TCs in one basin show some skill forecasting Category 5 TCs in that basin, suggesting that FourCastNet can generalize across tropical basins. This is encouraging and surprising because regional information is implicitly encoded in inputs. No version satisfies gradient-wind balance, implying that enforcing such physical constraints may not improve generalizability to gray swans. Given that current state-of-the-art AI weather/climate models have similar learning strategies, we expect our findings to apply to other models and extreme events. Our work demonstrates that novel learning strategies are needed for AI weather/climate models to provide early warning or estimated statistics for the rarest, most impactful weather extremes. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14900 [pdf, other]

DRACO: Differentiable Reconstruction for Arbitrary CBCT Orbits

Authors: Chengze Ye, Linda-Sophie Schneider, Yipeng Sun, Mareike Thies, Siyuan Mei, Andreas Maier

Abstract: This paper introduces a novel method for reconstructing cone beam computed tomography (CBCT) images for arbitrary orbits using a differentiable shift-variant filtered backprojection (FBP) neural network. Traditional CBCT reconstruction methods for arbitrary orbits, like iterative reconstruction algorithms, are computationally expensive and memory-intensive. The proposed method addresses these chal… ▽ More This paper introduces a novel method for reconstructing cone beam computed tomography (CBCT) images for arbitrary orbits using a differentiable shift-variant filtered backprojection (FBP) neural network. Traditional CBCT reconstruction methods for arbitrary orbits, like iterative reconstruction algorithms, are computationally expensive and memory-intensive. The proposed method addresses these challenges by employing a shift-variant FBP algorithm optimized for arbitrary trajectories through a deep learning approach that adapts to a specific orbit geometry. This approach overcomes the limitations of existing techniques by integrating known operators into the learning model, minimizing the number of parameters, and improving the interpretability of the model. The proposed method is a significant advancement in interventional medical imaging, particularly for robotic C-arm CT systems, enabling faster and more accurate CBCT reconstructions with customized orbits. Especially this method can also be used for the analytical reconstruction of non-continuous orbits like circular plus arc. The experimental results demonstrate that the proposed method significantly accelerates the reconstruction process compared to conventional iterative algorithms. It achieves comparable or superior image quality, as evidenced by metrics such as the mean squared error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index measure (SSIM). The validation experiments show that the method can handle data from different trajectories, demonstrating its flexibility and robustness across different scan geometries. Our method demonstrates a significant improvement, particularly for the sinusoidal trajectory, achieving a 38.6% reduction in MSE, a 7.7% increase in PSNR, and a 5.0% improvement in SSIM. Furthermore, the computation time for reconstruction was reduced by more than 97%. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14804 [pdf, other]

SMILES: Discovery of Higher Ionizing Photon Production Efficiency in Overdense Regions

Authors: Yongda Zhu, Stacey Alberts, Jianwei Lyu, Jane Morrison, George H. Rieke, Yang Sun, Jakob M. Helton, Zhiyuan Ji, Rachana Bhatawdekar, Nina Bonaventura, Andrew J. Bunker, Xiaojing Lin, Marcia J. Rieke, Pierluigi Rinaldi, Irene Shivaei, Christopher N. A. Willmer, Junyu Zhang

Abstract: The topology of reionization and the environments where galaxies efficiently produce ionizing photons are key open questions. For the first time, we investigate the correlation between ionizing photon production efficiency, $ξ_{\rm ion}$, and galaxy overdensity, $\log(1+δ)$. We analyze the ionizing properties of 93 galaxies between $0.7 < z < 6.9$ using JWST NIRSpec medium-resolution spectra from… ▽ More The topology of reionization and the environments where galaxies efficiently produce ionizing photons are key open questions. For the first time, we investigate the correlation between ionizing photon production efficiency, $ξ_{\rm ion}$, and galaxy overdensity, $\log(1+δ)$. We analyze the ionizing properties of 93 galaxies between $0.7 < z < 6.9$ using JWST NIRSpec medium-resolution spectra from the Systematic Mid-infrared Instrument (MIRI) Legacy Extragalactic Survey (SMILES) program. Among these, 67 galaxies have H$α$ coverage, spanning $0.7 < z < 3.7$. The galaxy overdensity, $\log(1+δ)$, is measured using the JADES photometric catalog, which covers the SMILES footprint. For the subset with H$α$ coverage, we find that $\logξ_{\rm ion}$ is positively correlated with $\log(1+δ)$, with a slope of $0.94_{-0.46}^{+0.46}$. Additionally, the mean $ξ_{\rm ion}$ for galaxies in overdense regions ($\log(1+δ) > 0.1$) is 2.43 times that of galaxies in lower density regions ($\log(1+δ) < 0.1$). This strong correlation is found to be independent of redshift evolution. Furthermore, our results confirm the robust correlations between $ξ_{\rm ion}$ and the rest-frame equivalent widths of the [O III] or H$α$ emission lines. Our results suggest that galaxies in high-density regions are efficient producers of ionizing photons. △ Less

Submitted 18 October, 2024; originally announced October 2024.

Comments: 14 pages, 7 figures, 1 table. Submitted to AAS journals. The machine-readable table will be made available upon acceptance

arXiv:2410.14054 [pdf, other]

Independently-Normalized SGD for Generalized-Smooth Nonconvex Optimization

Authors: Yufeng Yang, Erin Tripp, Yifan Sun, Shaofeng Zou, Yi Zhou

Abstract: Recent studies have shown that many nonconvex machine learning problems meet a so-called generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, the existing algorithms designed for generalized-smooth nonconvex optimization encounter significant limitations in both their design and convergence analysis. In this work, we first study deterministic general… ▽ More Recent studies have shown that many nonconvex machine learning problems meet a so-called generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, the existing algorithms designed for generalized-smooth nonconvex optimization encounter significant limitations in both their design and convergence analysis. In this work, we first study deterministic generalized-smooth nonconvex optimization and analyze the convergence of normalized gradient descent under the generalized Polyak-Lojasiewicz condition. Our results provide a comprehensive understanding of the interplay between gradient normalization and function geometry. Then, for stochastic generalized-smooth nonconvex optimization, we propose an independently-normalized stochastic gradient descent algorithm, which leverages independent sampling, gradient normalization and clipping to achieve an $\mathcal{O}(ε^{-4})$ sample complexity under relaxed assumptions. Experiments demonstrate the fast convergence of our algorithm. △ Less