subscribe to arXiv mailings

A Primal-Dual Framework for Transformers and Neural Networks

Authors: Tan M. Nguyen, Tam Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher

Abstract: Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresp… ▽ More Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted to ICLR 2023, 26 pages, 4 figures, 14 tables

arXiv:2402.17745 [pdf, other]

Low-light phase retrieval with implicit generative priors

Authors: Raunak Manekar, Elisa Negrini, Minh Pham, Daniel Jacobs, Jaideep Srivastava, Stanley J. Osher, Jianwei Miao

Abstract: Phase retrieval (PR) is fundamentally important in scientific imaging and is crucial for nanoscale techniques like coherent diffractive imaging (CDI). Low radiation dose imaging is essential for applications involving radiation-sensitive samples. However, most PR methods struggle in low-dose scenarios due to high shot noise. Recent advancements in optical data acquisition setups, such as in-situ C… ▽ More Phase retrieval (PR) is fundamentally important in scientific imaging and is crucial for nanoscale techniques like coherent diffractive imaging (CDI). Low radiation dose imaging is essential for applications involving radiation-sensitive samples. However, most PR methods struggle in low-dose scenarios due to high shot noise. Recent advancements in optical data acquisition setups, such as in-situ CDI, have shown promise for low-dose imaging, but they rely on a time series of measurements, making them unsuitable for single-image applications. Similarly, data-driven phase retrieval techniques are not easily adaptable to data-scarce situations. Zero-shot deep learning methods based on pre-trained and implicit generative priors have been effective in various imaging tasks but have shown limited success in PR. In this work, we propose low-dose deep image prior (LoDIP), which combines in-situ CDI with the power of implicit generative priors to address single-image low-dose phase retrieval. Quantitative evaluations demonstrate LoDIP's superior performance in this task and its applicability to real experimental scenarios. △ Less

Submitted 23 August, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

MSC Class: 68T10 68T07 78A46

arXiv:2402.06162 [pdf, other]

Wasserstein proximal operators describe score-based generative models and resolve memorization

Authors: Benjamin J. Zhang, Siting Liu, Wuchen Li, Markos A. Katsoulakis, Stanley J. Osher

Abstract: We focus on the fundamental mathematical structure of score-based generative models (SGMs). We first formulate SGMs in terms of the Wasserstein proximal operator (WPO) and demonstrate that, via mean-field games (MFGs), the WPO formulation reveals mathematical structure that describes the inductive bias of diffusion and score-based models. In particular, MFGs yield optimality conditions in the form… ▽ More We focus on the fundamental mathematical structure of score-based generative models (SGMs). We first formulate SGMs in terms of the Wasserstein proximal operator (WPO) and demonstrate that, via mean-field games (MFGs), the WPO formulation reveals mathematical structure that describes the inductive bias of diffusion and score-based models. In particular, MFGs yield optimality conditions in the form of a pair of coupled partial differential equations: a forward-controlled Fokker-Planck (FP) equation, and a backward Hamilton-Jacobi-Bellman (HJB) equation. Via a Cole-Hopf transformation and taking advantage of the fact that the cross-entropy can be related to a linear functional of the density, we show that the HJB equation is an uncontrolled FP equation. Second, with the mathematical structure at hand, we present an interpretable kernel-based model for the score function which dramatically improves the performance of SGMs in terms of training samples and training time. In addition, the WPO-informed kernel model is explicitly constructed to avoid the recently studied memorization effects of score-based generative models. The mathematical form of the new kernel-based models in combination with the use of the terminal condition of the MFG reveals new explanations for the manifold learning and generalization properties of SGMs, and provides a resolution to their memorization effects. Finally, our mathematically informed, interpretable kernel-based model suggests new scalable bespoke neural network architectures for high-dimensional applications. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2401.07364 [pdf, other]

PDE Generalization of In-Context Operator Networks: A Study on 1D Scalar Nonlinear Conservation Laws

Authors: Liu Yang, Stanley J. Osher

Abstract: Can we build a single large model for a wide range of PDE-related scientific learning tasks? Can this model generalize to new PDEs, even of new forms, without any fine-tuning? In-context operator learning and the corresponding model In-Context Operator Networks (ICON) represent an initial exploration of these questions. The capability of ICON regarding the first question has been demonstrated prev… ▽ More Can we build a single large model for a wide range of PDE-related scientific learning tasks? Can this model generalize to new PDEs, even of new forms, without any fine-tuning? In-context operator learning and the corresponding model In-Context Operator Networks (ICON) represent an initial exploration of these questions. The capability of ICON regarding the first question has been demonstrated previously. In this paper, we present a detailed methodology for solving PDE problems with ICON, and show how a single ICON model can make forward and reverse predictions for different equations with different strides, provided with appropriately designed data prompts. We show the positive evidence to the second question, i.e., ICON can generalize well to some PDEs with new forms without any fine-tuning. This is exemplified through a study on 1D scalar nonlinear conservation laws, a family of PDEs with temporal evolution. We also show how to broaden the range of problems that an ICON model can address, by transforming functions and equations to ICON's capability scope. We believe that the progress in this paper is a significant step towards the goal of training a foundation model for PDE-related tasks under the in-context operator learning framework. △ Less

Submitted 21 January, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

arXiv:2310.01605 [pdf, other]

Primal-dual hybrid gradient algorithms for computing time-implicit Hamilton-Jacobi equations

Authors: Tingwei Meng, Wenbo Hao, Siting Liu, Stanley J. Osher, Wuchen Li

Abstract: Hamilton-Jacobi (HJ) partial differential equations (PDEs) have diverse applications spanning physics, optimal control, game theory, and imaging sciences. This research introduces a first-order optimization-based technique for HJ PDEs, which formulates the time-implicit update of HJ PDEs as saddle point problems. We remark that the saddle point formulation for HJ equations is aligned with the prim… ▽ More Hamilton-Jacobi (HJ) partial differential equations (PDEs) have diverse applications spanning physics, optimal control, game theory, and imaging sciences. This research introduces a first-order optimization-based technique for HJ PDEs, which formulates the time-implicit update of HJ PDEs as saddle point problems. We remark that the saddle point formulation for HJ equations is aligned with the primal-dual formulation of optimal transport and potential mean-field games (MFGs). This connection enables us to extend MFG techniques and design numerical schemes for solving HJ PDEs. We employ the primal-dual hybrid gradient (PDHG) method to solve the saddle point problems, benefiting from the simple structures that enable fast computations in updates. Remarkably, the method caters to a broader range of Hamiltonians, encompassing non-smooth and spatiotemporally dependent cases. The approach's effectiveness is verified through various numerical examples in both one-dimensional and two-dimensional examples, such as quadratic and $L^1$ Hamiltonians with spatial and time dependence. △ Less

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2308.05061 [pdf, other]

Fine-Tune Language Models as Multi-Modal Differential Equation Solvers

Authors: Liu Yang, Siting Liu, Stanley J. Osher

Abstract: In the growing domain of scientific machine learning, in-context operator learning has shown notable potential in building foundation models, as in this framework the model is trained to learn operators and solve differential equations using prompted data, during the inference stage without weight updates. However, the current model's overdependence on function data overlooks the invaluable human… ▽ More In the growing domain of scientific machine learning, in-context operator learning has shown notable potential in building foundation models, as in this framework the model is trained to learn operators and solve differential equations using prompted data, during the inference stage without weight updates. However, the current model's overdependence on function data overlooks the invaluable human insight into the operator. To address this, we present a transformation of in-context operator learning into a multi-modal paradigm. In particular, we take inspiration from the recent success of large language models, and propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. Also, we introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks, and also demonstrated the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly enhanced the development of the in-context operator learning paradigm, but also created a new path for the application of language models. △ Less

Submitted 1 February, 2024; v1 submitted 9 August, 2023; originally announced August 2023.

arXiv:2306.11283 [pdf]

Computational Microscopy beyond Perfect Lenses

Authors: Xingyuan Lu, Minh Pham, Elisa Negrini, Damek Davis, Stanley J. Osher, Jianwei Miao

Abstract: We demonstrate that in situ coherent diffractive imaging (CDI), which harnesses the coherent interference between a strong and a weak beam illuminating a static and dynamic structure, can be a very dose-efficient imaging method. At low doses, in situ CDI can achieve higher resolution than perfect lenses with the point spread function as a delta function. Both our numerical simulation and experimen… ▽ More We demonstrate that in situ coherent diffractive imaging (CDI), which harnesses the coherent interference between a strong and a weak beam illuminating a static and dynamic structure, can be a very dose-efficient imaging method. At low doses, in situ CDI can achieve higher resolution than perfect lenses with the point spread function as a delta function. Both our numerical simulation and experimental results show that the combination of in situ CDI and ptychography can reduce the dose by two orders of magnitude over ptychography. We expect that computational microscopy based on in situ CDI can be implemented in different imaging modalities with photons and electrons for low-dose imaging of radiation-sensitive materials and biological samples. △ Less

Submitted 3 May, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

arXiv:2304.07993 [pdf, other]

doi 10.1073/pnas.2310142120

In-Context Operator Learning with Data Prompts for Differential Equation Problems

Authors: Liu Yang, Siting Liu, Tingwei Meng, Stanley J. Osher

Abstract: This paper introduces a new neural-network-based approach, namely In-Context Operator Networks (ICON), to simultaneously learn operators from the prompted data and apply it to new questions during the inference stage, without any weight update. Existing methods are limited to using a neural network to approximate a specific equation solution or a specific operator, requiring retraining when switch… ▽ More This paper introduces a new neural-network-based approach, namely In-Context Operator Networks (ICON), to simultaneously learn operators from the prompted data and apply it to new questions during the inference stage, without any weight update. Existing methods are limited to using a neural network to approximate a specific equation solution or a specific operator, requiring retraining when switching to a new problem with different equations. By training a single neural network as an operator learner, we can not only get rid of retraining (even fine-tuning) the neural network for new problems, but also leverage the commonalities shared across operators so that only a few demos in the prompt are needed when learning a new operator. Our numerical results show the neural network's capability as a few-shot operator learner for a diversified type of differential equation problems, including forward and inverse problems of ordinary differential equations (ODEs), partial differential equations (PDEs), and mean-field control (MFC) problems, and also show that it can generalize its learning capability to operators beyond the training distribution. △ Less

Submitted 19 September, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: The second and third authors contributed equally. This is an outdated preprint. Please refer to the updated version published in PNAS: www.pnas.org/doi/10.1073/pnas.2310142120 See code in https://github.com/LiuYangMage/in-context-operator-networks

arXiv:2208.00579 [pdf, other]

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Authors: Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Abstract: Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear attention and hashing tricks; efficient transformers have been proposed to reduce the quadratic complexity of transformers but significantly degrade the accurac… ▽ More Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear attention and hashing tricks; efficient transformers have been proposed to reduce the quadratic complexity of transformers but significantly degrade the accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the \emph{momentum transformer}, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrate that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy. △ Less

Submitted 31 July, 2022; originally announced August 2022.

Comments: 22 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:2110.07034

MSC Class: 65Pxx

arXiv:2207.00129 [pdf, other]

Multi-Agent Shape Control with Optimal Transport

Authors: Alex Tong Lin, Stanley J. Osher

Abstract: We introduce a method called MASCOT (Multi-Agent Shape Control with Optimal Transport) to compute optimal control solutions of agents with shape/formation/density constraints. For example, we might want to apply shape constraints on the agents -- perhaps we desire the agents to hold a particular shape along the path, or we want agents to spread out in order to minimize collisions. We might also wa… ▽ More We introduce a method called MASCOT (Multi-Agent Shape Control with Optimal Transport) to compute optimal control solutions of agents with shape/formation/density constraints. For example, we might want to apply shape constraints on the agents -- perhaps we desire the agents to hold a particular shape along the path, or we want agents to spread out in order to minimize collisions. We might also want a proportion of agents to move to one destination, while the other agents move to another, and to do this in the optimal way, i.e. the source-destination assignments should be optimal. In order to achieve this, we utilize the Earth Mover's Distance from Optimal Transport to distribute the agents into their proper positions so that certain shapes can be satisfied. This cost is both introduced in the terminal cost and in the running cost of the optimal control problem. △ Less

Submitted 3 February, 2023; v1 submitted 30 June, 2022; originally announced July 2022.

Comments: Fixed expressions for g_shape and L_shape in section 4.1, 4.2, 5.2, and 5.3

arXiv:2206.00206 [pdf, ps, other]

Transformer with Fourier Integral Attentions

Authors: Tan Nguyen, Minh Pham, Tam Nguyen, Khai Nguyen, Stanley J. Osher, Nhat Ho

Abstract: Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distribution. Ther… ▽ More Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distribution. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels can automatically capture such dependency and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification. △ Less

Submitted 31 May, 2022; originally announced June 2022.

Comments: 35 pages, 5 tables. Tan Nguyen and Minh Pham contributed equally to this work

arXiv:2204.08621 [pdf, other]

Proximal Implicit ODE Solvers for Accelerating Learning Neural ODEs

Authors: Justin Baker, Hedi Xia, Yiwei Wang, Elena Cherkaev, Akil Narayan, Long Chen, Jack Xin, Andrea L. Bertozzi, Stanley J. Osher, Bao Wang

Abstract: Learning neural ODEs often requires solving very stiff ODE systems, primarily using explicit adaptive step size ODE solvers. These solvers are computationally expensive, requiring the use of tiny step sizes for numerical stability and accuracy guarantees. This paper considers learning neural ODEs using implicit ODE solvers of different orders leveraging proximal operators. The proximal implicit so… ▽ More Learning neural ODEs often requires solving very stiff ODE systems, primarily using explicit adaptive step size ODE solvers. These solvers are computationally expensive, requiring the use of tiny step sizes for numerical stability and accuracy guarantees. This paper considers learning neural ODEs using implicit ODE solvers of different orders leveraging proximal operators. The proximal implicit solver consists of inner-outer iterations: the inner iterations approximate each implicit update step using a fast optimization algorithm, and the outer iterations solve the ODE system over time. The proximal implicit ODE solver guarantees superiority over explicit solvers in numerical stability and computational efficiency. We validate the advantages of proximal implicit solvers over existing popular neural ODE solvers on various challenging benchmark tasks, including learning continuous-depth graph neural networks and continuous normalizing flows. △ Less

Submitted 18 April, 2022; originally announced April 2022.

Comments: 20 pages, 7 figures

MSC Class: 68T07; 65L04 ACM Class: I.2

arXiv:2203.06269 [pdf, other]

Parameter Inference of Time Series by Delay Embeddings and Learning Differentiable Operators

Authors: Alex Tong Lin, Adrian S. Wong, Robert Martin, Stanley J. Osher, Daniel Eckhardt

Abstract: We provide a method to identify system parameters of dynamical systems, called ID-ODE -- Inference by Differentiation and Observing Delay Embeddings. In this setting, we are given a dataset of trajectories from a dynamical system with system parameter labels. Our goal is to identify system parameters of new trajectories. The given trajectories may or may not encompass the full state of the system,… ▽ More We provide a method to identify system parameters of dynamical systems, called ID-ODE -- Inference by Differentiation and Observing Delay Embeddings. In this setting, we are given a dataset of trajectories from a dynamical system with system parameter labels. Our goal is to identify system parameters of new trajectories. The given trajectories may or may not encompass the full state of the system, and we may only observe a one-dimensional time series. In the latter case, we reconstruct the full state by using delay embeddings, and under sufficient conditions, Taken's Embedding Theorem assures us the reconstruction is diffeomorphic to the original. This allows our method to work on time series. Our method works by first learning the velocity operator (as given or reconstructed) with a neural network having both state and system parameters as variable inputs. Then on new trajectories we backpropagate prediction errors to the system parameter inputs giving us a gradient. We then use gradient descent to infer the correct system parameter. We demonstrate the efficacy of our approach on many numerical examples: the Lorenz system, Lorenz96, Lotka-Volterra Predator-Prey, and the Compound Double Pendulum. We also apply our algorithm on a real-world dataset: propulsion of the Hall-effect Thruster (HET). △ Less

Submitted 16 November, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

arXiv:2110.08678 [pdf, other]

Improving Transformers with Probabilistic Attention Keys

Authors: Tam Nguyen, Tan M. Nguyen, Dung D. Le, Duy Khuong Nguyen, Viet-Anh Tran, Richard G. Baraniuk, Nhat Ho, Stanley J. Osher

Abstract: Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observati… ▽ More Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads. △ Less

Submitted 12 June, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

Comments: 27 pages, 16 figures, 10 tables

Journal ref: Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022

arXiv:2110.04840 [pdf, other]

Heavy Ball Neural Ordinary Differential Equations

Authors: Hedi Xia, Vai Suliafu, Hangjie Ji, Tan M. Nguyen, Andrea L. Bertozzi, Stanley J. Osher, Bao Wang

Abstract: We propose heavy ball neural ordinary differential equations (HBNODEs), leveraging the continuous limit of the classical momentum accelerated gradient descent, to improve neural ODEs (NODEs) training and inference. HBNODEs have two properties that imply practical advantages over NODEs: (i) The adjoint state of an HBNODE also satisfies an HBNODE, accelerating both forward and backward ODE solvers,… ▽ More We propose heavy ball neural ordinary differential equations (HBNODEs), leveraging the continuous limit of the classical momentum accelerated gradient descent, to improve neural ODEs (NODEs) training and inference. HBNODEs have two properties that imply practical advantages over NODEs: (i) The adjoint state of an HBNODE also satisfies an HBNODE, accelerating both forward and backward ODE solvers, thus significantly reducing the number of function evaluations (NFEs) and improving the utility of the trained models. (ii) The spectrum of HBNODEs is well structured, enabling effective learning of long-term dependencies from complex sequential data. We verify the advantages of HBNODEs over NODEs on benchmark tasks, including image classification, learning complex dynamics, and sequential modeling. Our method requires remarkably fewer forward and backward NFEs, is more accurate, and learns long-term dependencies more effectively than the other ODE-based neural network models. Code is available at \url{https://github.com/hedixia/HeavyBallNODE}. △ Less

Submitted 10 October, 2021; originally announced October 2021.

Comments: 23 pages, 9 figures, Accepted for publication at Advances in Neural Information Processing Systems (NeurIPS) 2021

MSC Class: 68T07 ACM Class: I.2

arXiv:2108.02347 [pdf, other]

FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention

Authors: Tan M. Nguyen, Vai Suliafu, Stanley J. Osher, Long Chen, Bao Wang

Abstract: We propose FMMformers, a class of efficient and flexible transformers inspired by the celebrated fast multipole method (FMM) for accelerating interacting particle simulation. FMM decomposes particle-particle interaction into near-field and far-field components and then performs direct and coarse-grained computation, respectively. Similarly, FMMformers decompose the attention into near-field and fa… ▽ More We propose FMMformers, a class of efficient and flexible transformers inspired by the celebrated fast multipole method (FMM) for accelerating interacting particle simulation. FMM decomposes particle-particle interaction into near-field and far-field components and then performs direct and coarse-grained computation, respectively. Similarly, FMMformers decompose the attention into near-field and far-field attention, modeling the near-field attention by a banded matrix and the far-field attention by a low-rank matrix. Computing the attention matrix for FMMformers requires linear complexity in computational time and memory footprint with respect to the sequence length. In contrast, standard transformers suffer from quadratic complexity. We analyze and validate the advantage of FMMformers over the standard transformer on the Long Range Arena and language modeling benchmarks. FMMformers can even outperform the standard transformer in terms of accuracy by a significant margin. For instance, FMMformers achieve an average classification accuracy of $60.74\%$ over the five Long Range Arena tasks, which is significantly better than the standard transformer's average accuracy of $58.70\%$. △ Less

Submitted 4 August, 2021; originally announced August 2021.

Comments: 18 pages, 8 figures

MSC Class: 68T07 ACM Class: I.2

arXiv:2104.12933 [pdf]

Direct observation of 3D topological spin textures and their interactions using soft x-ray vector ptychography

Authors: Arjun Rana, Chen-Ting Liao, Ezio Iacocca, Ji Zou, Minh Pham, Emma-Elizabeth Cating Subramanian, Yuan Hung Lo, Sinéad A. Ryan, Xingyuan Lu, Charles S. Bevis, Robert M. Karl Jr, Andrew J. Glaid, Young-Sang Yu, Pratibha Mahale, David A. Shapiro, Sadegh Yazdi, Thomas E. Mallouk, Stanley J. Osher, Henry C. Kapteyn, Vincent H. Crespi, John V. Badding, Yaroslav Tserkovnyak, Margaret M. Murnane, Jianwei Miao

Abstract: Magnetic topological defects are energetically stable spin configurations characterized by symmetry breaking. Vortices and skyrmions are two well-known examples of 2D spin textures that have been actively studied for both fundamental interest and practical applications. However, experimental evidence of the 3D spin textures has been largely indirect or qualitative to date, due to the difficulty of… ▽ More Magnetic topological defects are energetically stable spin configurations characterized by symmetry breaking. Vortices and skyrmions are two well-known examples of 2D spin textures that have been actively studied for both fundamental interest and practical applications. However, experimental evidence of the 3D spin textures has been largely indirect or qualitative to date, due to the difficulty of quantitively characterizing them within nanoscale volumes. Here, we develop soft x-ray vector ptychography to quantitatively image the 3D magnetization vector field in a frustrated superlattice with 10 nm spatial resolution. By applying homotopy theory to the experimental data, we quantify the topological charge of hedgehogs and anti-hedgehogs as emergent magnetic monopoles and probe their interactions inside the frustrated superlattice. We also directly observe virtual hedgehogs and anti-hedgehogs created by magnetically inert voids. We expect that this new quantitative imaging method will open the door to study 3D topological spin textures in a broad class of magnetic materials. Our work also demonstrates that magnetically frustrated superlattices could be used as a new platform to investigate hedgehog interactions and dynamics and to exploit optimized geometries for information storage and transport applications. △ Less

Submitted 26 April, 2021; originally announced April 2021.

arXiv:2007.03881 [pdf]

Direct observation of 3D atomic packing in monatomic amorphous materials

Authors: Yakun Yuan, Dennis S. Kim, Jihan Zhou, Dillan J. Chang, Fan Zhu, Yasutaka Nagaoka, Yao Yang, Minh Pham, Stanley J. Osher, Ou Chen, Peter Ercius, Andreas K. Schmid, Jianwei Miao

Abstract: Liquids and solids are two fundamental states of matter. However, due to the lack of direct experimental determination, our understanding of the 3D atomic structure of liquids and amorphous solids remained speculative. Here we advance atomic electron tomography to determine for the first time the 3D atomic positions in monatomic amorphous materials, including a Ta thin film and two Pd nanoparticle… ▽ More Liquids and solids are two fundamental states of matter. However, due to the lack of direct experimental determination, our understanding of the 3D atomic structure of liquids and amorphous solids remained speculative. Here we advance atomic electron tomography to determine for the first time the 3D atomic positions in monatomic amorphous materials, including a Ta thin film and two Pd nanoparticles. We observe that pentagonal bipyramids are the most abundant atomic motifs in these amorphous materials. Instead of forming icosahedra, the majority of pentagonal bipyramids arrange into networks that extend to medium-range scale. Molecular dynamic simulations further reveal that pentagonal bipyramid networks are prevalent in monatomic amorphous liquids, which rapidly grow in size and form icosahedra during the quench from the liquid state to glass state. The experimental method and results are expected to advance the study of the amorphous-crystalline phase transition and glass transition at the single-atom level. △ Less

Submitted 2 December, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

arXiv:2006.06919 [pdf, other]

MomentumRNN: Integrating Momentum into Recurrent Neural Networks

Authors: Tan M. Nguyen, Richard G. Baraniuk, Andrea L. Bertozzi, Stanley J. Osher, Bao Wang

Abstract: Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called {\em MomentumRNNs}. We theoretically prove and numeri… ▽ More Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called {\em MomentumRNNs}. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. We study the momentum long-short term memory (MomentumLSTM) and verify its advantages in convergence speed and accuracy over its LSTM counterpart across a variety of benchmarks. We also demonstrate that MomentumRNN is applicable to many types of recurrent cells, including those in the state-of-the-art orthogonal RNNs. Finally, we show that other advanced momentum-based optimization methods, such as Adam and Nesterov accelerated gradients with a restart, can be easily incorporated into the MomentumRNN framework for designing new recurrent cells with even better performance. The code is available at https://github.com/minhtannguyen/MomentumRNN. △ Less

Submitted 11 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: 21 pages, 11 figures, Accepted for publication at Advances in Neural Information Processing Systems (NeurIPS) 2020

MSC Class: 68T07 ACM Class: I.2

Journal ref: Advances in Neural Information Processing Systems (NeurIPS) 2020

arXiv:2004.12210 [pdf, ps, other]

Computational methods for nonlocal mean field games with applications

Authors: Siting Liu, Matthew Jacobs, Wuchen Li, Levon Nurbekyan, Stanley J. Osher

Abstract: We introduce a novel framework to model and solve mean-field game systems with nonlocal interactions. Our approach relies on kernel-based representations of mean-field interactions and feature-space expansions in the spirit of kernel methods in machine learning. We demonstrate the flexibility of our approach by modeling various interaction scenarios between agents. Additionally, our method yields… ▽ More We introduce a novel framework to model and solve mean-field game systems with nonlocal interactions. Our approach relies on kernel-based representations of mean-field interactions and feature-space expansions in the spirit of kernel methods in machine learning. We demonstrate the flexibility of our approach by modeling various interaction scenarios between agents. Additionally, our method yields a computationally efficient saddle-point reformulation of the original problem that is amenable to state-of-the-art convex optimization methods such as the primal-dual hybrid gradient method (PDHG). We also discuss potential applications of our methods to multi-agent trajectory planning problems. △ Less

Submitted 28 April, 2020; v1 submitted 25 April, 2020; originally announced April 2020.

MSC Class: Primary: 35Q89; 49N80; 35A15; 65M70; 93A16 Secondary: 35Q91; 35Q93; 93A15

arXiv:2003.00631 [pdf, other]

Sparsity Meets Robustness: Channel Pruning for the Feynman-Kac Formalism Principled Robust Deep Neural Nets

Authors: Thu Dinh, Bao Wang, Andrea L. Bertozzi, Stanley J. Osher

Abstract: Deep neural nets (DNNs) compression is crucial for adaptation to mobile devices. Though many successful algorithms exist to compress naturally trained DNNs, developing efficient and stable compression algorithms for robustly trained DNNs remains widely open. In this paper, we focus on a co-design of efficient DNN compression algorithms and sparse neural architectures for robust and accurate deep l… ▽ More Deep neural nets (DNNs) compression is crucial for adaptation to mobile devices. Though many successful algorithms exist to compress naturally trained DNNs, developing efficient and stable compression algorithms for robustly trained DNNs remains widely open. In this paper, we focus on a co-design of efficient DNN compression algorithms and sparse neural architectures for robust and accurate deep learning. Such a co-design enables us to advance the goal of accommodating both sparsity and robustness. With this objective in mind, we leverage the relaxed augmented Lagrangian based algorithms to prune the weights of adversarially trained DNNs, at both structured and unstructured levels. Using a Feynman-Kac formalism principled robust and sparse DNNs, we can at least double the channel sparsity of the adversarially trained ResNet20 for CIFAR10 classification, meanwhile, improve the natural accuracy by $8.69$\% and the robust accuracy under the benchmark $20$ iterations of IFGSM attack by $5.42$\%. The code is available at \url{https://github.com/BaoWangMath/rvsm-rgsm-admm}. △ Less

Submitted 1 March, 2020; originally announced March 2020.

Comments: 16 pages, 7 figures

MSC Class: 68T01

arXiv:2002.10583 [pdf, other]

Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent

Authors: Bao Wang, Tan M. Nguyen, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher

Abstract: Stochastic gradient descent (SGD) with constant momentum and its variants such as Adam are the optimization algorithms of choice for training deep neural networks (DNNs). Since DNN training is incredibly computationally expensive, there is great interest in speeding up the convergence. Nesterov accelerated gradient (NAG) improves the convergence rate of gradient descent (GD) for convex optimizatio… ▽ More Stochastic gradient descent (SGD) with constant momentum and its variants such as Adam are the optimization algorithms of choice for training deep neural networks (DNNs). Since DNN training is incredibly computationally expensive, there is great interest in speeding up the convergence. Nesterov accelerated gradient (NAG) improves the convergence rate of gradient descent (GD) for convex optimization using a specially designed momentum; however, it accumulates error when an inexact gradient is used (such as in SGD), slowing convergence at best and diverging at worst. In this paper, we propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD by the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image classification, we demonstrate that, in training DNNs, SRSGD significantly improves convergence and generalization; for instance in training ResNet200 for ImageNet classification, SRSGD achieves an error rate of 20.93% vs. the benchmark of 22.13%. These improvements become more significant as the network grows deeper. Furthermore, on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline. △ Less

Submitted 26 April, 2020; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: 35 pages, 16 figures, 18 tables

arXiv:2002.10113 [pdf, other]

doi 10.1073/pnas.2024713118

Alternating the Population and Control Neural Networks to Solve High-Dimensional Stochastic Mean-Field Games

Authors: Alex Tong Lin, Samy Wu Fung, Wuchen Li, Levon Nurbekyan, Stanley J. Osher

Abstract: We present APAC-Net, an alternating population and agent control neural network for solving stochastic mean field games (MFGs). Our algorithm is geared toward high-dimensional instances of MFGs that are beyond reach with existing solution methods. We achieve this in two steps. First, we take advantage of the underlying variational primal-dual structure that MFGs exhibit and phrase it as a convex-c… ▽ More We present APAC-Net, an alternating population and agent control neural network for solving stochastic mean field games (MFGs). Our algorithm is geared toward high-dimensional instances of MFGs that are beyond reach with existing solution methods. We achieve this in two steps. First, we take advantage of the underlying variational primal-dual structure that MFGs exhibit and phrase it as a convex-concave saddle point problem. Second, we parameterize the value and density functions by two neural networks, respectively. By phrasing the problem in this manner, solving the MFG can be interpreted as a special case of training a generative adversarial network (GAN). We show the potential of our method on up to 100-dimensional MFG problems. △ Less

Submitted 14 July, 2023; v1 submitted 24 February, 2020; originally announced February 2020.

arXiv:1907.06800 [pdf, other]

Graph Interpolating Activation Improves Both Natural and Robust Accuracies in Data-Efficient Deep Learning

Authors: Bao Wang, Stanley J. Osher

Abstract: Improving the accuracy and robustness of deep neural nets (DNNs) and adapting them to small training data are primary tasks in deep learning research. In this paper, we replace the output activation function of DNNs, typically the data-agnostic softmax function, with a graph Laplacian-based high dimensional interpolating function which, in the continuum limit, converges to the solution of a Laplac… ▽ More Improving the accuracy and robustness of deep neural nets (DNNs) and adapting them to small training data are primary tasks in deep learning research. In this paper, we replace the output activation function of DNNs, typically the data-agnostic softmax function, with a graph Laplacian-based high dimensional interpolating function which, in the continuum limit, converges to the solution of a Laplace-Beltrami equation on a high dimensional manifold. Furthermore, we propose end-to-end training and testing algorithms for this new architecture. The proposed DNN with graph interpolating activation integrates the advantages of both deep learning and manifold learning. Compared to the conventional DNNs with the softmax function as output activation, the new framework demonstrates the following major advantages: First, it is better applicable to data-efficient learning in which we train high capacity DNNs without using a large number of training data. Second, it remarkably improves both natural accuracy on the clean images and robust accuracy on the adversarial images crafted by both white-box and black-box adversarial attacks. Third, it is a natural choice for semi-supervised learning. For reproducibility, the code is available at \url{https://github.com/BaoWangMath/DNN-DataDependentActivation}. △ Less

Submitted 15 July, 2019; originally announced July 2019.

Comments: 34 pages, 10 figures

MSC Class: 68T01; 68T45

arXiv:1906.12056 [pdf, other]

DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM

Authors: Bao Wang, Quanquan Gu, March Boedihardjo, Farzin Barekat, Stanley J. Osher

Abstract: Machine learning (ML) models trained by differentially private stochastic gradient descent (DP-SGD) have much lower utility than the non-private ones. To mitigate this degradation, we propose a DP Laplacian smoothing SGD (DP-LSSGD) to train ML models with differential privacy (DP) guarantees. At the core of DP-LSSGD is the Laplacian smoothing, which smooths out the Gaussian noise used in the Gauss… ▽ More Machine learning (ML) models trained by differentially private stochastic gradient descent (DP-SGD) have much lower utility than the non-private ones. To mitigate this degradation, we propose a DP Laplacian smoothing SGD (DP-LSSGD) to train ML models with differential privacy (DP) guarantees. At the core of DP-LSSGD is the Laplacian smoothing, which smooths out the Gaussian noise used in the Gaussian mechanism. Under the same amount of noise used in the Gaussian mechanism, DP-LSSGD attains the same DP guarantee, but in practice, DP-LSSGD makes training both convex and nonconvex ML models more stable and enables the trained models to generalize better. The proposed algorithm is simple to implement and the extra computational complexity and memory overhead compared with DP-SGD are negligible. DP-LSSGD is applicable to train a large variety of ML models, including DNNs. The code is available at \url{https://github.com/BaoWangMath/DP-LSSGD}. △ Less

Submitted 7 December, 2019; v1 submitted 28 June, 2019; originally announced June 2019.

Comments: 21 pages, 7 figures

MSC Class: 68T05

arXiv:1901.06827 [pdf, other]

A Deterministic Gradient-Based Approach to Avoid Saddle Points

Authors: Lisa Maria Kreusser, Stanley J. Osher, Bao Wang

Abstract: Loss functions with a large number of saddle points are one of the major obstacles for training modern machine learning models efficiently. First-order methods such as gradient descent are usually the methods of choice for training machine learning models. However, these methods converge to saddle points for certain choices of initial guesses. In this paper, we propose a modification of the recent… ▽ More Loss functions with a large number of saddle points are one of the major obstacles for training modern machine learning models efficiently. First-order methods such as gradient descent are usually the methods of choice for training machine learning models. However, these methods converge to saddle points for certain choices of initial guesses. In this paper, we propose a modification of the recently proposed Laplacian smoothing gradient descent [Osher et al., arXiv:1806.06317], called modified Laplacian smoothing gradient descent (mLSGD), and demonstrate its potential to avoid saddle points without sacrificing the convergence rate. Our analysis is based on the attraction region, formed by all starting points for which the considered numerical scheme converges to a saddle point. We investigate the attraction region's dimension both analytically and numerically. For a canonical class of quadratic functions, we show that the dimension of the attraction region for mLSGD is floor((n-1)/2), and hence it is significantly smaller than that of the gradient descent whose dimension is n-1. △ Less

Submitted 28 September, 2020; v1 submitted 21 January, 2019; originally announced January 2019.

arXiv:1811.10745 [pdf, other]

ResNets Ensemble via the Feynman-Kac Formalism to Improve Natural and Robust Accuracies

Authors: Bao Wang, Binjie Yuan, Zuoqiang Shi, Stanley J. Osher

Abstract: Empirical adversarial risk minimization (EARM) is a widely used mathematical framework to robustly train deep neural nets (DNNs) that are resistant to adversarial attacks. However, both natural and robust accuracies, in classifying clean and adversarial images, respectively, of the trained robust models are far from satisfactory. In this work, we unify the theory of optimal control of transport eq… ▽ More Empirical adversarial risk minimization (EARM) is a widely used mathematical framework to robustly train deep neural nets (DNNs) that are resistant to adversarial attacks. However, both natural and robust accuracies, in classifying clean and adversarial images, respectively, of the trained robust models are far from satisfactory. In this work, we unify the theory of optimal control of transport equations with the practice of training and testing of ResNets. Based on this unified viewpoint, we propose a simple yet effective ResNets ensemble algorithm to boost the accuracy of the robustly trained model on both clean and adversarial images. The proposed algorithm consists of two components: First, we modify the base ResNets by injecting a variance specified Gaussian noise to the output of each residual mapping. Second, we average over the production of multiple jointly trained modified ResNets to get the final prediction. These two steps give an approximation to the Feynman-Kac formula for representing the solution of a transport equation with viscosity, or a convection-diffusion equation. For the CIFAR10 benchmark, this simple algorithm leads to a robust model with a natural accuracy of {\bf 85.62}\% on clean images and a robust accuracy of ${\bf 57.94 \%}$ under the 20 iterations of the IFGSM attack, which outperforms the current state-of-the-art in defending against IFGSM attack on the CIFAR10. Both natural and robust accuracies of the proposed ResNets ensemble can be improved dynamically as the building block ResNet advances. The code is available at: \url{https://github.com/BaoWangMath/EnResNet}. △ Less

Submitted 10 June, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

Comments: 18 pages, 6 figures

MSC Class: 68Txx

arXiv:1811.06492 [pdf, other]

Mathematical Analysis of Adversarial Attacks

Authors: Zehao Dou, Stanley J. Osher, Bao Wang

Abstract: In this paper, we analyze efficacy of the fast gradient sign method (FGSM) and the Carlini-Wagner's L2 (CW-L2) attack. We prove that, within a certain regime, the untargeted FGSM can fool any convolutional neural nets (CNNs) with ReLU activation; the targeted FGSM can mislead any CNNs with ReLU activation to classify any given image into any prescribed class. For a special two-layer neural network… ▽ More In this paper, we analyze efficacy of the fast gradient sign method (FGSM) and the Carlini-Wagner's L2 (CW-L2) attack. We prove that, within a certain regime, the untargeted FGSM can fool any convolutional neural nets (CNNs) with ReLU activation; the targeted FGSM can mislead any CNNs with ReLU activation to classify any given image into any prescribed class. For a special two-layer neural network: a linear layer followed by the softmax output activation, we show that the CW-L2 attack increases the ratio of the classification probability between the target and ground truth classes. Moreover, we provide numerical results to verify all our theoretical results. △ Less

Submitted 25 November, 2018; v1 submitted 15 November, 2018; originally announced November 2018.

Comments: 21 pages

arXiv:1810.12993 [pdf, other]

Diagnosing Forward Operator Error Using Optimal Transport

Authors: Michael A. Puthawala, Cory D. Hauck, Stanley J. Osher

Abstract: We investigate overdetermined linear inverse problems for which the forward operator may not be given accurately. We introduce a new tool called the structure, based on the Wasserstein distance, and propose the use of this to diagnose and remedy forward operator error. Computing the structure turns out to use an easy calculation for a Euclidean homogeneous degree one distance, the Earth Mover's Di… ▽ More We investigate overdetermined linear inverse problems for which the forward operator may not be given accurately. We introduce a new tool called the structure, based on the Wasserstein distance, and propose the use of this to diagnose and remedy forward operator error. Computing the structure turns out to use an easy calculation for a Euclidean homogeneous degree one distance, the Earth Mover's Distance, based on recently developed algorithms. The structure is proven to distinguish between noise and signals in the residual and gives a plan to help recover the true direct operator in some interesting cases. We expect to use this technique not only to diagnose the error, but also to correct it, which we do in some simple cases presented below. △ Less

Submitted 30 October, 2018; originally announced October 2018.

arXiv:1809.08622 [pdf, ps, other]

Error estimation of weighted nonlocal Laplacian on random point cloud

Authors: Zuoqiang Shi, Bao Wang, Stanley J. Osher

Abstract: We analyze the convergence of the weighted nonlocal Laplacian (WNLL) on high dimensional randomly distributed data. The analysis reveals the importance of the scaling weight $μ\sim P|/|S|$ with $|P|$ and $|S|$ be the number of entire and labeled data, respectively. The result gives a theoretical foundation of WNLL for high dimensional data interpolation. We analyze the convergence of the weighted nonlocal Laplacian (WNLL) on high dimensional randomly distributed data. The analysis reveals the importance of the scaling weight $μ\sim P|/|S|$ with $|P|$ and $|S|$ be the number of entire and labeled data, respectively. The result gives a theoretical foundation of WNLL for high dimensional data interpolation. △ Less

Submitted 23 September, 2018; originally announced September 2018.

Comments: 15 pages; 2 figures

MSC Class: 65-00

arXiv:1809.08516 [pdf, other]

Adversarial Defense via Data Dependent Activation Function and Total Variation Minimization

Authors: Bao Wang, Alex T. Lin, Wei Zhu, Penghang Yin, Andrea L. Bertozzi, Stanley J. Osher

Abstract: We improve the robustness of Deep Neural Net (DNN) to adversarial attacks by using an interpolating function as the output activation. This data-dependent activation remarkably improves both the generalization and robustness of DNN. In the CIFAR10 benchmark, we raise the robust accuracy of the adversarially trained ResNet20 from $\sim 46\%$ to $\sim 69\%$ under the state-of-the-art Iterative Fast… ▽ More We improve the robustness of Deep Neural Net (DNN) to adversarial attacks by using an interpolating function as the output activation. This data-dependent activation remarkably improves both the generalization and robustness of DNN. In the CIFAR10 benchmark, we raise the robust accuracy of the adversarially trained ResNet20 from $\sim 46\%$ to $\sim 69\%$ under the state-of-the-art Iterative Fast Gradient Sign Method (IFGSM) based adversarial attack. When we combine this data-dependent activation with total variation minimization on adversarial images and training data augmentation, we achieve an improvement in robust accuracy by 38.9$\%$ for ResNet56 under the strongest IFGSM attack. Furthermore, We provide an intuitive explanation of our defense by analyzing the geometry of the feature space. △ Less

Submitted 29 April, 2020; v1 submitted 22 September, 2018; originally announced September 2018.

Comments: 17 pages, 6 figures

MSC Class: 68Pxx

Journal ref: Inverse Problems and Imaging, 2020

arXiv:1808.03228 [pdf, other]

Modeling Environmental Crime in Protected Areas Using the Level Set Method

Authors: David J. Arnold, Dayne Fernandez, Ruizhe Jia, Christian Parkinson, Deborah Tonne, Yotam Yaniv, Andrea L. Bertozzi, Stanley J. Osher

Abstract: National parks often serve as hotspots for environmental crime such as illegal deforestation and animal poaching. Previous attempts to model environmental crime were either discrete and network-based or required very restrictive assumptions on the geometry of the protected region and made heavy use of radial symmetry. We formulate a level set method to track criminals inside a protected region whi… ▽ More National parks often serve as hotspots for environmental crime such as illegal deforestation and animal poaching. Previous attempts to model environmental crime were either discrete and network-based or required very restrictive assumptions on the geometry of the protected region and made heavy use of radial symmetry. We formulate a level set method to track criminals inside a protected region which uses real elevation data to determine speed of travel, does not require any assumptions of symmetry, and can be applied to regions of arbitrary shape. In doing so, we design a Hamilton-Jacobi equation to describe movement of criminals while also incorporating the effects of patrollers who attempt to deter the crime. We discuss the numerical schemes that we use to solve this Hamilton-Jacobi equation. Finally, we apply our method to Yosemite National Park and Kangaroo Island, Australia and design practical patrol strategies with the goal of minimizing the area that is affected by criminal activity. △ Less

Submitted 6 August, 2018; originally announced August 2018.

Comments: 18 pages, 12 figures

arXiv:1802.00168 [pdf, other]

Deep Neural Nets with Interpolating Function as Output Activation

Authors: Bao Wang, Xiyang Luo, Zhen Li, Wei Zhu, Zuoqiang Shi, Stanley J. Osher

Abstract: We replace the output layer of deep neural nets, typically the softmax function, by a novel interpolating function. And we propose end-to-end training and testing algorithms for this new architecture. Compared to classical neural nets with softmax function as output activation, the surrogate with interpolating function as output activation combines advantages of both deep and manifold learning. Th… ▽ More We replace the output layer of deep neural nets, typically the softmax function, by a novel interpolating function. And we propose end-to-end training and testing algorithms for this new architecture. Compared to classical neural nets with softmax function as output activation, the surrogate with interpolating function as output activation combines advantages of both deep and manifold learning. The new framework demonstrates the following major advantages: First, it is better applicable to the case with insufficient training data. Second, it significantly improves the generalization accuracy on a wide variety of networks. The algorithm is implemented in PyTorch, and code will be made publicly available. △ Less

Submitted 16 June, 2018; v1 submitted 1 February, 2018; originally announced February 2018.

Comments: 11 pages, 4 figures

MSC Class: 68Txx

arXiv:1711.08833 [pdf, other]

Deep Learning for Real-Time Crime Forecasting and its Ternarization

Authors: Bao Wang, Penghang Yin, Andrea L. Bertozzi, P. Jeffrey Brantingham, Stanley J. Osher, Jack Xin

Abstract: Real-time crime forecasting is important. However, accurate prediction of when and where the next crime will happen is difficult. No known physical model provides a reasonable approximation to such a complex system. Historical crime data are sparse in both space and time and the signal of interests is weak. In this work, we first present a proper representation of crime data. We then adapt the spa… ▽ More Real-time crime forecasting is important. However, accurate prediction of when and where the next crime will happen is difficult. No known physical model provides a reasonable approximation to such a complex system. Historical crime data are sparse in both space and time and the signal of interests is weak. In this work, we first present a proper representation of crime data. We then adapt the spatial temporal residual network on the well represented data to predict the distribution of crime in Los Angeles at the scale of hours in neighborhood-sized parcels. These experiments as well as comparisons with several existing approaches to prediction demonstrate the superiority of the proposed model in terms of accuracy. Finally, we present a ternarization technique to address the resource consumption issue for its deployment in real world. This work is an extension of our short conference proceeding paper [Wang et al, Arxiv 1707.03340]. △ Less

Submitted 23 November, 2017; originally announced November 2017.

Comments: 14 pages, 7 figures

MSC Class: 62-07

arXiv:1404.1370 [pdf, ps, other]

An L1 Penalty Method for General Obstacle Problems

Authors: Giang Tran, Hayden Schaeffer, William M. Feldman, Stanley J. Osher

Abstract: We construct an efficient numerical scheme for solving obstacle problems in divergence form. The numerical method is based on a reformulation of the obstacle in terms of an L1-like penalty on the variational problem. The reformulation is an exact regularizer in the sense that for large (but finite) penalty parameter, we recover the exact solution. Our formulation is applied to classical elliptic o… ▽ More We construct an efficient numerical scheme for solving obstacle problems in divergence form. The numerical method is based on a reformulation of the obstacle in terms of an L1-like penalty on the variational problem. The reformulation is an exact regularizer in the sense that for large (but finite) penalty parameter, we recover the exact solution. Our formulation is applied to classical elliptic obstacle problems as well as some related free boundary problems, for example the two-phase membrane problem and the Hele-Shaw model. One advantage of the proposed method is that the free boundary inherent in the obstacle problem arises naturally in our energy minimization without any need for problem specific or complicated discretization. In addition, our scheme also works for nonlinear variational inequalities arising from convex minimization problems. △ Less

Submitted 4 April, 2014; originally announced April 2014.

Comments: 20 pages, 18 figures

arXiv:1403.6883 [pdf, other]

Compressed Wannier modes found from an $L_1$ regularized energy functional

Authors: Farzin Barekat, Ke Yin, Russel E. Caflisch, Stanley J. Osher, Rongjie Lai, Vidvuds Ozolins

Abstract: We propose a method for calculating Wannier functions of periodic solids directly from a modified variational principle for the energy, subject to the requirement that the Wannier functions are orthogonal to all their translations ("shift-orthogonality"). Localization is achieved by adding an $L_1$ regularization term to the energy functional. This approach results in "compressed" Wannier modes wi… ▽ More We propose a method for calculating Wannier functions of periodic solids directly from a modified variational principle for the energy, subject to the requirement that the Wannier functions are orthogonal to all their translations ("shift-orthogonality"). Localization is achieved by adding an $L_1$ regularization term to the energy functional. This approach results in "compressed" Wannier modes with compact support, where one parameter $μ$ controls the trade-off between the accuracy of the total energy and the size of the support of the Wannier modes. Efficient algorithms for shift-orthogonalization and solution of the variational minimization problem are demonstrated. △ Less

Submitted 26 March, 2014; originally announced March 2014.

Comments: 5 pages, 3 figures

arXiv:1311.5850 [pdf, ps, other]

PDEs with Compressed Solutions

Authors: Russel E. Caflisch, Stanley J. Osher, Hayden Schaeffer, Giang Tran

Abstract: Sparsity plays a central role in recent developments in signal processing, linear algebra, statistics, optimization, and other fields. In these developments, sparsity is promoted through the addition of an $L^1$ norm (or related quantity) as a constraint or penalty in a variational principle. We apply this approach to partial differential equations that come from a variational quantity, either by… ▽ More Sparsity plays a central role in recent developments in signal processing, linear algebra, statistics, optimization, and other fields. In these developments, sparsity is promoted through the addition of an $L^1$ norm (or related quantity) as a constraint or penalty in a variational principle. We apply this approach to partial differential equations that come from a variational quantity, either by minimization (to obtain an elliptic PDE) or by gradient flow (to obtain a parabolic PDE). Also, we show that some PDEs can be rewritten in an $L^1$ form, such as the divisible sandpile problem and signum-Gordon. Addition of an $L^1$ term in the variational principle leads to a modified PDE where a subgradient term appears. It is known that modified PDEs of this form will often have solutions with compact support, which corresponds to the discrete solution being sparse. We show that this is advantageous numerically through the use of efficient algorithms for solving $L^1$ based problems. △ Less

Submitted 1 August, 2014; v1 submitted 22 November, 2013; originally announced November 2013.

Comments: 21 pages, 15 figures

arXiv:1207.6430 [pdf, other]

Optimal Data Collection For Informative Rankings Expose Well-Connected Graphs

Authors: Braxton Osting, Christoph Brune, Stanley J. Osher

Abstract: Given a graph where vertices represent alternatives and arcs represent pairwise comparison data, the statistical ranking problem is to find a potential function, defined on the vertices, such that the gradient of the potential function agrees with the pairwise comparisons. Our goal in this paper is to develop a method for collecting data for which the least squares estimator for the ranking proble… ▽ More Given a graph where vertices represent alternatives and arcs represent pairwise comparison data, the statistical ranking problem is to find a potential function, defined on the vertices, such that the gradient of the potential function agrees with the pairwise comparisons. Our goal in this paper is to develop a method for collecting data for which the least squares estimator for the ranking problem has maximal Fisher information. Our approach, based on experimental design, is to view data collection as a bi-level optimization problem where the inner problem is the ranking problem and the outer problem is to identify data which maximizes the informativeness of the ranking. Under certain assumptions, the data collection problem decouples, reducing to a problem of finding multigraphs with large algebraic connectivity. This reduction of the data collection problem to graph-theoretic questions is one of the primary contributions of this work. As an application, we study the Yahoo! Movie user rating dataset and demonstrate that the addition of a small number of well-chosen pairwise comparisons can significantly increase the Fisher informativeness of the ranking. As another application, we study the 2011-12 NCAA football schedule and propose schedules with the same number of games which are significantly more informative. Using spectral clustering methods to identify highly-connected communities within the division, we argue that the NCAA could improve its notoriously poor rankings by simply scheduling more out-of-conference games. △ Less

Submitted 4 June, 2014; v1 submitted 26 July, 2012; originally announced July 2012.

Comments: 31 pages, 10 figures, 3 tables

Report number: UCLA CAM report 12-32 MSC Class: 62F07; 05C40; 49N45;

Showing 1–38 of 38 results for author: Osher, S J