-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Authors:
Marah Abdin,
Jyoti Aneja,
Hany Awadalla,
Ahmed Awadallah,
Ammar Ahmad Awan,
Nguyen Bach,
Amit Bahree,
Arash Bakhtiari,
Jianmin Bao,
Harkirat Behl,
Alon Benhaim,
Misha Bilenko,
Johan Bjorck,
Sébastien Bubeck,
Martin Cai,
Qin Cai,
Vishrav Chaudhary,
Dong Chen,
Dongdong Chen,
Weizhu Chen,
Yen-Chun Chen,
Yi-Ling Chen,
Hao Cheng,
Parul Chopra,
Xiyang Dai, et al. (104 additional authors not shown)
Abstract:
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.
Submitted 30 August, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
SciAI4Industry -- Solving PDEs for industry-scale problems with deep learning
Authors:
Philipp A. Witte,
Russell J. Hewett,
Kumar Saurabh,
AmirHossein Sojoodi,
Ranveer Chandra
Abstract:
Solving partial differential equations with deep learning makes it possible to reduce simulation times by multiple orders of magnitude and unlock scientific methods that typically rely on large numbers of sequential simulations, such as optimization and uncertainty quantification. Two of the largest challenges of adopting scientific AI for industrial problem settings are that training datasets must be simulated in advance and that neural networks for solving large-scale PDEs exceed the memory capabilities of current GPUs. We introduce a distributed programming API in the Julia language for simulating training data in parallel on the cloud and without requiring users to manage the underlying HPC infrastructure. In addition, we show that model-parallel deep learning based on domain decomposition allows us to scale neural networks for solving PDEs to commercial-scale problem settings and achieve above 90% parallel efficiency. Combining our cloud API for training data generation and model-parallel deep learning, we train large-scale neural networks for solving the 3D Navier-Stokes equation and simulating 3D CO2 flow in porous media. For the CO2 example, we simulate a training dataset based on a commercial carbon capture and storage (CCS) project and train a neural network for CO2 flow simulation on a 3D grid with over 2 million cells that is 5 orders of magnitude faster than a conventional numerical simulator and 3,200 times cheaper.
Submitted 23 November, 2022;
originally announced November 2022.
-
Model-Parallel Fourier Neural Operators as Learned Surrogates for Large-Scale Parametric PDEs
Authors:
Thomas J. Grady II,
Rishi Khan,
Mathias Louboutin,
Ziyi Yin,
Philipp A. Witte,
Ranveer Chandra,
Russell J. Hewett,
Felix J. Herrmann
Abstract:
Fourier neural operators (FNOs) are a recently introduced neural network architecture for learning solution operators of partial differential equations (PDEs), which have been shown to perform significantly better than comparable deep learning approaches. Once trained, FNOs can achieve speed-ups of multiple orders of magnitude over conventional numerical PDE solvers. However, due to the high dimensionality of their input data and network weights, FNOs have so far only been applied to two-dimensional or small three-dimensional problems. To remove this limited problem-size barrier, we propose a model-parallel version of FNOs based on domain-decomposition of both the input data and network weights. We demonstrate that our model-parallel FNO is able to predict time-varying PDE solutions of over 2.6 billion variables on Perlmutter using up to 512 A100 GPUs and show an example of training a distributed FNO on the Azure cloud for simulating multiphase CO$_2$ dynamics in the Earth's subsurface.
Submitted 1 February, 2023; v1 submitted 3 April, 2022;
originally announced April 2022.
-
A Linear Algebraic Approach to Model Parallelism in Deep Learning
Authors:
Russell J. Hewett,
Thomas J. Grady II
Abstract:
Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity. Local memory and processing limitations require robust data and model parallelism for crossing compute node boundaries. We propose a linear-algebraic approach to model parallelism in deep learning, which allows parallel distribution of any tensor in the DNN. Rather than rely on automatic differentiation tools, which do not universally support distributed memory parallelism models, we show that parallel data movement operations, e.g., broadcast, sum-reduce, and halo exchange, are linear operators, and by defining the relevant spaces and inner products, we manually develop the adjoint, or backward, operators required for gradient-based training of DNNs. We build distributed DNN layers using these parallel primitives, composed with sequential layer implementations, and demonstrate their application by building and training a distributed DNN using DistDL, a PyTorch and MPI-based distributed deep learning toolkit.
Submitted 4 June, 2020;
originally announced June 2020.
-
A Survey of Computational Tools in Solar Physics
Authors:
Monica G. Bobra,
Stuart J. Mumford,
Russell J. Hewett,
Steven D. Christe,
Kevin Reardon,
Sabrina Savage,
Jack Ireland,
Tiago M. D. Pereira,
Bin Chen,
David Pérez-Suárez
Abstract:
The SunPy Project developed a 13-question survey to understand the software and hardware usage of the solar physics community. 364 members of the solar physics community, across 35 countries, responded to our survey. We found that 99$\pm$0.5% of respondents use software in their research and 66% use the Python scientific software stack. Students are twice as likely as faculty, staff scientists, and researchers to use Python rather than Interactive Data Language (IDL). In this respect, the astrophysics and solar physics communities differ widely: 78% of solar physics faculty, staff scientists, and researchers in our sample use IDL, compared with 44% of astrophysics faculty and scientists sampled by Momcheva and Tollerud (2015). 63$\pm$4% of respondents have not taken any computer-science courses at an undergraduate or graduate level. We also found that most respondents utilize consumer hardware to run software for solar-physics research. Although 82% of respondents work with data from space-based or ground-based missions, some of which (e.g. the Solar Dynamics Observatory and Daniel K. Inouye Solar Telescope) produce terabytes of data a day, 14% use a regional or national cluster, 5% use a commercial cloud provider, and 29% use exclusively a laptop or desktop. Finally, we found that 73$\pm$4% of respondents cite scientific software in their research, although only 42$\pm$3% do so routinely.
Submitted 27 March, 2020;
originally announced March 2020.
-
L-Sweeps: A scalable, parallel preconditioner for the high-frequency Helmholtz equation
Authors:
Matthias Taus,
Leonardo Zepeda-Núñez,
Russell J Hewett,
Laurent Demanet
Abstract:
We present the first fast solver for the high-frequency Helmholtz equation that scales optimally in parallel, for a single right-hand side. The L-sweeps approach achieves this scalability by departing from the usual propagation pattern, in which information flows in a 180 degree cone from interfaces in a layered decomposition. Instead, with L-sweeps, information propagates in 90 degree cones induced by a checkerboard domain decomposition (CDD). We extend the notion of accurate transmission conditions to CDDs and introduce a new sweeping strategy to efficiently track the wave fronts as they propagate through the CDD. The new approach decouples the subdomains at each wave front, so that they can be processed in parallel, resulting in better parallel scalability than previously demonstrated in the literature. The method has an overall O((N/p) log w) empirical run-time for N=n^d total degrees-of-freedom in a d-dimensional problem, frequency w, and p=O(n) processors. We introduce the algorithm and provide a complexity analysis for our parallel implementation of the solver. We corroborate all claims in several two- and three-dimensional numerical examples involving constant, smooth, and discontinuous wave speeds.
Submitted 15 October, 2019; v1 submitted 3 September, 2019;
originally announced September 2019.
-
A parallel shared-memory implementation of a high-order accurate solution technique for variable coefficient Helmholtz problems
Authors:
Natalie Beams,
Adrianna Gillman,
Russell J. Hewett
Abstract:
The recently developed Hierarchical Poincaré-Steklov (HPS) method is a high-order discretization technique that comes with a direct solver. Results from previous papers demonstrate the method's ability to solve Helmholtz problems to high accuracy without the so-called pollution effect. While the asymptotic scaling of the direct solver's computational cost is the same as the nested dissection method, serial implementations of the solution technique are not practical for large scale numerical simulations. This manuscript presents the first parallel implementation of the HPS method. Specifically, we introduce an approach for a shared memory implementation of the solution technique utilizing parallel linear algebra. This approach is the foundation for future large scale simulations on supercomputers and clusters with large memory nodes. Performance results on a desktop computer (resembling a large memory node) are presented.
Submitted 25 April, 2019; v1 submitted 17 December, 2018;
originally announced December 2018.
-
The method of polarized traces for the 3D Helmholtz equation
Authors:
Leonardo Zepeda-Núñez,
Adrien Scheuer,
Russell J. Hewett,
Laurent Demanet
Abstract:
We present a fast solver for the 3D high-frequency Helmholtz equation in heterogeneous, constant density, acoustic media. The solver is based on the method of polarized traces, coupled with distributed linear algebra libraries and pipelining to obtain an empirical online runtime $\mathcal{O}(\max(1,R/n) N \log N)$ where $N = n^3$ is the total number of degrees of freedom and $R$ is the number of right-hand sides. Such a favorable scaling is a prerequisite for large-scale implementations of full waveform inversion (FWI) in the frequency domain.
Submitted 25 January, 2018;
originally announced January 2018.
-
Weight-adjusted discontinuous Galerkin methods: curvilinear meshes
Authors:
Jesse Chan,
Russell J. Hewett,
T. Warburton
Abstract:
Traditional time-domain discontinuous Galerkin (DG) methods result in large storage costs at high orders of approximation due to the storage of dense elemental matrices. In this work, we propose weight-adjusted DG (WADG) methods for curvilinear meshes which reduce storage costs while retaining energy stability. A priori error estimates show that high order accuracy is preserved under sufficient conditions on the mesh, which are illustrated through convergence tests with different sequences of meshes. Numerical and computational experiments verify the accuracy and performance of WADG for a model problem on curved domains.
Submitted 12 August, 2016;
originally announced August 2016.
-
Weight-adjusted discontinuous Galerkin methods: wave propagation in heterogeneous media
Authors:
Jesse Chan,
Russell J. Hewett,
T. Warburton
Abstract:
Time-domain discontinuous Galerkin (DG) methods for wave propagation require accounting for the inversion of dense elemental mass matrices, where each mass matrix is computed with respect to a parameter-weighted L2 inner product. In applications where the wavespeed varies spatially at a sub-element scale, these matrices are distinct over each element, necessitating additional storage. In this work, we propose a weight-adjusted DG (WADG) method which reduces storage costs by replacing the weighted L2 inner product with a weight-adjusted inner product. This equivalent inner product results in an energy stable method, but does not increase storage costs for locally varying weights. A-priori error estimates are derived, and numerical examples are given illustrating the application of this method to the acoustic wave equation with heterogeneous wavespeed.
Submitted 1 January, 2017; v1 submitted 5 August, 2016;
originally announced August 2016.
-
Reduced storage nodal discontinuous Galerkin methods on semi-structured prismatic meshes
Authors:
Jesse Chan,
Zheng Wang,
Russell J. Hewett,
T. Warburton
Abstract:
We present a high order time-domain nodal discontinuous Galerkin method for wave problems on hybrid meshes consisting of both wedge and tetrahedral elements. We allow for vertically mapped wedges which can be deformed along the extruded coordinate, and present a simple method for producing quasi-uniform wedge meshes for layered domains. We show that standard mass lumping techniques result in a loss of energy stability on meshes of vertically mapped wedges, and propose an alternative which is both energy stable and efficient. High order convergence is demonstrated, and comparisons are made with existing low-storage methods on wedges. Finally, the computational performance of the method on Graphics Processing Units is evaluated.
Submitted 31 October, 2016; v1 submitted 12 July, 2016;
originally announced July 2016.
-
SunPy - Python for Solar Physics
Authors:
The SunPy Community,
Stuart J Mumford,
Steven Christe,
David Pérez-Suárez,
Jack Ireland,
Albert Y Shih,
Andrew R Inglis,
Simon Liedtke,
Russell J Hewett,
Florian Mayer,
Keith Hughitt,
Nabil Freij,
Tomas Meszaros,
Samuel M Bennett,
Michael Malocha,
John Evans,
Ankit Agrawal,
Andrew J Leonard,
Thomas P Robitaille,
Benjamin Mampaey,
Jose Iván Campos-Rozo,
Michael S Kirk
Abstract:
This paper presents SunPy (version 0.5), a community-developed Python package for solar physics. Python, a free, cross-platform, general-purpose, high-level programming language, has seen widespread adoption among the scientific community, resulting in the availability of a large number of software packages, from numerical computation (NumPy, SciPy) and machine learning (scikit-learn) to visualisation and plotting (matplotlib). SunPy is a data-analysis environment specialising in providing the software necessary to analyse solar and heliospheric data in Python. SunPy is open-source software (BSD licence) and has an open and transparent development workflow that anyone can contribute to. SunPy provides access to solar data through integration with the Virtual Solar Observatory (VSO), the Heliophysics Event Knowledgebase (HEK), and the HELiophysics Integrated Observatory (HELIO) webservices. It currently supports image data from major solar missions (e.g., SDO, SOHO, STEREO, and IRIS), time-series data from missions such as GOES, SDO/EVE, and PROBA2/LYRA, and radio spectra from e-Callisto and STEREO/SWAVES. We describe SunPy's functionality, provide examples of solar data analysis in SunPy, and show how Python-based solar data-analysis can leverage the many existing tools already available in Python. We discuss the future goals of the project and encourage interested users to become involved in the planning and development of SunPy.
Submitted 11 May, 2015;
originally announced May 2015.
-
Multiresolution analysis of active region magnetic structure and its correlation with the Mt. Wilson classification and flaring activity
Authors:
J. Ireland,
C. A. Young,
R. T. J. McAteer,
C. Whelan,
R. J. Hewett,
P. T. Gallagher
Abstract:
Two different multi-resolution analyses are used to decompose the structure of active region magnetic flux into concentrations of different size scales. Lines separating these opposite polarity regions of flux at each size scale are found. These lines are used as a mask on a map of the magnetic field gradient to sample the local gradient between opposite polarity regions of given scale sizes. It is shown that the maximum, average and standard deviation of the magnetic flux gradient for alpha, beta, beta-gamma and beta-gamma-delta active regions increase in the order listed, and that the order is maintained over all length-scales. This study demonstrates that, on average, the Mt. Wilson classification encodes the notion of activity over all length-scales in the active region, and not just those length-scales at which the strongest flux gradients are found. Further, it is also shown that the average gradients in the field, and the average length-scale at which they occur, also increase in the same order. Finally, there are significant differences in the gradient distribution, between flaring and non-flaring active regions, which are maintained over all length-scales. It is also shown that the average gradient content of active regions that have large flares (GOES class 'M' and above) is larger than that for active regions containing flares of all flare sizes; this difference is also maintained at all length-scales.
Submitted 1 May, 2008;
originally announced May 2008.