Data Analysis, Statistics and Probability (physics.data-an)

  • PDF
    This paper introduces the Parsimonious Dynamic Mode Decomposition (parsDMD), a novel algorithm designed to automatically select an optimally sparse subset of dynamic modes for both spatiotemporal and purely temporal data. By incorporating time-delay embedding and leveraging Orthogonal Matching Pursuit (OMP), parsDMD ensures robustness against noise and effectively handles complex, nonlinear dynamics. The algorithm is validated on a diverse range of datasets, including standing wave signals, hidden-dynamics identification problems, fluid dynamics simulations (flow past a cylinder and transonic buffet), and atmospheric sea-surface temperature (SST) data. ParsDMD addresses a significant limitation of the traditional sparsity-promoting DMD (spDMD), which requires manual tuning of sparsity parameters through a rigorous trial-and-error process to balance between single-mode and all-mode solutions. In contrast, parsDMD autonomously determines the optimally sparse subset of modes without user intervention, while maintaining minimal computational complexity. Comparative analyses demonstrate that parsDMD consistently outperforms spDMD by providing more accurate mode identification and effective reconstruction in noisy environments. These advantages render parsDMD an effective tool for real-time diagnostics, forecasting, and reduced-order model construction across various disciplines.
  • PDF
    The Jaccard similarity index has often been employed in science and technology as a means to quantify the similarity between two sets. When modified to operate on real values, the Jaccard similarity index can be applied to compare vectors, an operation which plays a central role in visualization, classification, and modeling. The present work aims to develop an analytical approach for estimating the probability density of the Jaccard similarity values implied by a set of data elements characterized by specific statistical densities, with emphasis on the uniform and normal cases. Several theoretical and practical situations can benefit directly from such an approach, as it allows several of the properties of the similarity comparisons among a given dataset to be better understood and anticipated. Situations in which the described approach can be applied include the estimation and visualization of data interrelationships in terms of similarity networks, as well as diverse problems in data analysis, pattern recognition and scientific modeling. In addition to presenting the analytical developments and results, examples are also provided in order to illustrate the potential of the approach. The work also includes an extension of the reported developments to modifications of the Jaccard index intended for regularization and control of the sharpness of the implemented comparisons.
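A minimal sketch of a real-valued Jaccard index, assuming the common sum-of-minima over sum-of-maxima form for non-negative vectors; the paper's exact definition (for instance its handling of signed values) may differ. The names `jaccard_real` and `sample_similarities`, and the convention for two all-zero vectors, are illustrative assumptions rather than anything taken from the paper. The sampler draws pairs of uniform random vectors so that the empirical distribution of similarity values can be inspected.

```python
import random


def jaccard_real(x, y):
    """Real-valued Jaccard similarity: sum(min) / sum(max).

    Assumes non-negative entries; this is an illustrative definition.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("this sketch assumes non-negative entries")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    if denominator == 0:
        # Both vectors are all zeros; treating them as identical is an
        # assumed convention, not something stated in the abstract.
        return 1.0
    return numerator / denominator


def sample_similarities(n_pairs, dim, seed=0):
    """Similarity values for pairs of vectors with uniform(0, 1) entries."""
    rng = random.Random(seed)  # seeded so repeated runs give the same values
    values = []
    for _ in range(n_pairs):
        x = [rng.random() for _ in range(dim)]
        y = [rng.random() for _ in range(dim)]
        values.append(jaccard_real(x, y))
    return values


if __name__ == "__main__":
    print(jaccard_real([1, 2], [2, 1]))  # (1 + 1) / (2 + 2) = 0.5
    sims = sample_similarities(n_pairs=1000, dim=5)
    print(min(sims), max(sims))
```

A histogram of the sampled values gives an empirical counterpart to the analytical densities the abstract describes for the uniform case.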
  • PDF
    We propose a simple method of measuring the autocorrelation function of spin noise based on multiplying and averaging two digitized signal traces, one of which is a time-reversed copy of the other. This procedure allows one to obtain, at lower computational cost, all the information usually derived in Fourier-transform spin-noise spectroscopy, retaining all the merits of the latter. We successfully applied this method to measurements of spin noise in cesium vapors, using a digital oscilloscope as the analog-to-digital converter. Specific opportunities of this experimental approach as applied to the more general problem of studying the nature of light-intensity noise are discussed.
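The abstract does not spell out the procedure, so the following is only a sketch of the identity it appears to rest on: convolving a trace with its own time-reversed copy yields the (unnormalized) autocorrelation at every lag. The function names and the per-lag normalization are assumptions made for illustration.

```python
def autocorr_direct(x, lag):
    """Average of x[i] * x[i + lag] over all valid i."""
    n = len(x)
    return sum(x[i] * x[i + lag] for i in range(n - lag)) / (n - lag)


def convolve_full(a, b):
    """Full discrete convolution of two sequences (length len(a)+len(b)-1)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def autocorr_via_reversed(x):
    """Autocorrelation at lags 0..n-1 from x convolved with reversed x."""
    n = len(x)
    c = convolve_full(x, x[::-1])
    # Entry n-1-lag of the convolution collects all products x[i] * x[i + lag].
    return [c[n - 1 - lag] / (n - lag) for lag in range(n)]


if __name__ == "__main__":
    trace = [1.0, 2.0, 3.0, 4.0]
    print(autocorr_via_reversed(trace))
    print([autocorr_direct(trace, lag) for lag in range(len(trace))])
```

Both routes give the same values; in practice the convolution would be done with a fast routine rather than the double loop used here for clarity.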
  • PDF
    While deep learning has been successfully applied to the data-driven classification of anomalous diffusion mechanisms, how the algorithm achieves this feat remains a mystery. In this study, we use a well-known technique aimed at achieving explainable AI, namely the Gradient-weighted Class Activation Map (Grad-CAM), to investigate how deep learning (implemented by ResNets) recognizes the distinctive features of a particular anomalous diffusion model from the raw trajectory data. Our results show that Grad-CAM reveals the portions of the trajectory that hold crucial information about the underlying mechanism of anomalous diffusion, which can be utilized to enhance the robustness of the trained classifier against measurement noise. Moreover, we observe that deep learning distills unique statistical characteristics of different diffusion mechanisms at various spatiotemporal scales, with larger-scale (smaller-scale) features identified at higher (lower) layers.
  • PDF
    Recently, contrastive learning (CL), a technique most prominently used in natural language and computer vision, has been used to train informative representation spaces for galaxy spectra and images in a self-supervised manner. Following this idea, we implement CL for stars in the Milky Way, for which recent astronomical surveys have produced a huge amount of heterogeneous data. Specifically, we investigate Gaia XP coefficients and RVS spectra. Thus, the methods presented in this work lay the foundation for aggregating the knowledge implicitly contained in the multimodal data to enable downstream tasks like cross-modal generation or fused stellar parameter estimation. We find that CL results in a highly structured representation space that exhibits explicit physical meaning. Using this representation space to perform cross-modal generation and stellar label regression results in excellent performance, with high-quality generated samples as well as accurate and precise label predictions.
  • PDF
    Background: In Kreuz et al., J Neurosci Methods 381, 109703 (2022), two methods were proposed that perform latency correction, i.e., optimize the spike time alignment of sparse neuronal spike trains with well-defined global spiking events. The first one, based on direct shifts, is fast but uses only partial latency information, while the other one makes use of the full information but relies on the computationally costly simulated annealing. Both methods reach their limits and can become unreliable when successive global events are not sufficiently separated or even overlap. New Method: Here we propose an iterative scheme that combines the advantages of the two original methods by using in each step as much of the latency information as possible and by employing a very fast extrapolation direct shift method instead of the much slower simulated annealing. Results: We illustrate the effectiveness and the improved performance, measured in terms of the relative shift error, of the new iterative scheme not only on simulated data with known ground truths but also on single-unit recordings from two medial superior olive neurons of a gerbil. Comparison with Existing Method(s): The iterative scheme outperforms the existing approaches on both the simulated and the experimental data. Due to its low computational demands, and in contrast to simulated annealing, it can also be applied to very large datasets. Conclusions: The new method generalizes and improves on the original method in terms of both accuracy and speed. Importantly, it is the only method that allows one to disentangle global events with overlap.
  • PDF
    In the previous paper we showed analytically that, if the drift function of the d-dimensional Langevin equation is the Langevin function with a properly chosen scale factor, then the evolution of the drift function is a martingale associated with the histories generated by that same Langevin equation. Moreover, we demonstrated numerically that the histories generated from common initial data become asymptotically ballistic, with orientations that obey the classical canonical spin statistics under the external field corresponding to the initial data. In the present paper we provide an analytical explanation of the latter numerical finding by introducing a martingale in the spin functional space. In a specific context, the present result elucidates a new physical aspect of martingale theory.
  • PDF
    The precise measurement of parity-violating asymmetries in parity-violating electron scattering experiments is a powerful tool for probing new physics beyond the Standard Model. Achieving the expected precision requires both experimental and post-processing signal corrections. This includes using auxiliary detectors to distinguish the main signal from background signals and implementing post-measurement corrections, such as the Bayesian statistics method, to address uncontrolled factors during the experiments. Asymmetry values in the scattering of electrons off proton targets in QWeak and P2 and off electron targets in MOLLER are influenced by detector array configurations, beam polarization angles, and beam spin variations. The Bayesian framework refines full probabilistic models to account for all necessary factors, thereby extracting asymmetry values and the underlying physics under specified conditions. For the QWeak experiment, a reanalysis of the inelastic asymmetry measurement using the Bayesian method has yielded a closer fit to measured asymmetries, with uncertainties reduced by 40% compared to the Monte Carlo minimization method. This approach was successfully applied to simulated data for the MOLLER experiment and is predicted to be similarly effective in P2.
  • PDF
    Neutrinoless double-beta decay ($0\nu\beta\beta$) is a rare nuclear process that, if observed, will provide insight into the nature of neutrinos and help explain the matter-antimatter asymmetry in the universe. The Large Enriched Germanium Experiment for Neutrinoless Double-Beta Decay (LEGEND) will operate in two phases to search for $0\nu\beta\beta$. The first (second) phase will employ 200 (1000) kg of High-Purity Germanium (HPGe) enriched in $^{76}$Ge to achieve a half-life sensitivity of 10$^{27}$ (10$^{28}$) years. In this study, we present a semi-supervised, data-driven approach, powered by a novel artificial intelligence model, to remove non-physical events captured by HPGe detectors. We utilize Affinity Propagation to cluster waveform signals based on their shape and a Support Vector Machine to classify them into different categories. We train, optimize, and test our model on data taken from a natural abundance HPGe detector installed in the Full Chain Test experimental stand at the University of North Carolina at Chapel Hill. We demonstrate that our model yields a maximum physics event sacrifice of $0.024 ^{+0.004}_{-0.003} \%$ when performing data cleaning cuts. Our model is being used to accelerate data cleaning development for LEGEND-200.
  • PDF
    The EDR and eRASS1 data have already revealed a remarkable number of undiscovered X-ray sources. Using Bayesian inference and generative modeling techniques for X-ray imaging, we aim to increase the sensitivity and scientific value of these observations by denoising, deconvolving, and decomposing the X-ray sky. Leveraging information field theory, we can exploit the spatial and spectral correlation structures of the different physical components of the sky with non-parametric priors to enhance the image reconstruction. By incorporating instrumental effects into the forward model, we develop a comprehensive Bayesian imaging algorithm for eROSITA pointing observations. Finally, we apply the developed algorithm to EDR data of the LMC SN1987A, fusing data sets from observations made by five different telescope modules. The final result is a denoised, deconvolved, and decomposed view of the LMC, which enables the analysis of its fine-scale structures, the creation of point source catalogues of this region, and enhanced calibration for future work.
  • PDF
    In this work we present the wavScalogram R package, which contains methods based on wavelet scalograms for time series analysis. These methods are related to two main wavelet tools: the windowed scalogram difference and the scale index. The windowed scalogram difference compares two time series, identifying if their scalograms follow similar patterns at different scales and times, and it is thus a useful complement to other comparison tools such as the squared wavelet coherence. On the other hand, the scale index provides a numerical estimation of the degree of non-periodicity of a time series and it is widely used in many scientific areas.
  • PDF
    Dynamical systems (DS) theory is fundamental for many areas of science and engineering. It can provide deep insights into the behavior of systems evolving in time, as typically described by differential or recursive equations. A common approach to facilitate mathematical tractability and interpretability of DS models involves decomposing nonlinear DS into multiple linear DS separated by switching manifolds, i.e. piecewise linear (PWL) systems. PWL models are popular in engineering and a frequent choice in mathematics for analyzing the topological properties of DS. However, hand-crafting such models is tedious and only possible for very low-dimensional scenarios, while inferring them from data usually gives rise to unnecessarily complex representations with very many linear subregions. Here we introduce Almost-Linear Recurrent Neural Networks (AL-RNNs) which automatically and robustly produce most parsimonious PWL representations of DS from time series data, using as few PWL nonlinearities as possible. AL-RNNs can be efficiently trained with any SOTA algorithm for dynamical systems reconstruction (DSR), and naturally give rise to a symbolic encoding of the underlying DS that provably preserves important topological properties. We show that for the Lorenz and Rössler systems, AL-RNNs discover, in a purely data-driven way, the known topologically minimal PWL representations of the corresponding chaotic attractors. We further illustrate on two challenging empirical datasets that interpretable symbolic encodings of the dynamics can be achieved, tremendously facilitating mathematical and computational analysis of the underlying systems.
  • PDF
    In recent years, Full-Waveform Inversion (FWI) has been extensively used to derive high-resolution subsurface velocity models from seismic data. However, due to the nonlinearity and ill-posed nature of the problem, FWI requires a good starting model to avoid producing non-physical solutions. Moreover, conventional optimization methods fail to quantify the uncertainty associated with the recovered solution, which is critical for decision-making processes. Bayesian inference offers an alternative approach as it directly or indirectly evaluates the posterior probability density function. For example, Markov Chain Monte Carlo (MCMC) methods generate multiple sample chains to characterize the solution's uncertainty. Despite their ability to theoretically handle any form of distribution, MCMC methods require many sampling steps; this limits their usage in high-dimensional problems with computationally intensive forward modeling, as is the FWI case. Variational Inference (VI), on the other hand, provides an approximate solution to the posterior distribution in the form of a parametric or non-parametric proposal distribution. Among the various algorithms used in VI, Stein Variational Gradient Descent (SVGD) is recognized for its ability to iteratively refine a set of samples to approximate the target distribution. However, mode and variance-collapse issues affect SVGD in high-dimensional inverse problems. This study aims to improve the performance of SVGD within the context of FWI by utilizing, for the first time, an annealed variant of SVGD and combining it with a multi-scale strategy. Additionally, we demonstrate that Principal Component Analysis (PCA) can be used to evaluate the performance of the optimization process. Clustering techniques are also employed to provide more rigorous and meaningful statistical analysis of the particles in the presence of multi-modal distributions.
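To make the SVGD update mentioned above concrete, here is a toy one-dimensional sketch: particles are moved by a kernel-weighted gradient of the log target (attraction) plus a kernel-gradient term (repulsion). It is not the paper's FWI implementation. The Gaussian target, the fixed RBF bandwidth, the step size, and in particular the linear annealing schedule `gamma` that scales the attraction term are all assumptions chosen for illustration.

```python
import math


def svgd_step(particles, grad_log_p, step, bandwidth, gamma=1.0):
    """One SVGD update in 1D with an RBF kernel.

    gamma scales the attraction term; gamma = 1 is plain SVGD, and
    gamma < 1 is the (assumed) annealing used in this sketch.
    """
    n = len(particles)
    h2 = bandwidth ** 2
    grads = [grad_log_p(x) for x in particles]
    updated = []
    for xi in particles:
        phi = 0.0
        for xj, gj in zip(particles, grads):
            k = math.exp(-((xj - xi) ** 2) / (2.0 * h2))
            attraction = gamma * k * gj
            repulsion = k * (xi - xj) / h2  # derivative of k(xj, xi) w.r.t. xj
            phi += attraction + repulsion
        updated.append(xi + step * phi / n)
    return updated


def run_svgd(n_particles=20, n_iters=1000, step=0.1, bandwidth=1.0,
             mu=2.0, sigma=1.0):
    """Approximate a Gaussian N(mu, sigma^2) target with SVGD particles."""

    def grad_log_p(x):
        return -(x - mu) / sigma ** 2

    # Deterministic start: evenly spaced particles on [-1, 1].
    particles = [-1.0 + 2.0 * i / (n_particles - 1) for i in range(n_particles)]
    for t in range(n_iters):
        # Hypothetical schedule: ramp gamma linearly to 1 over the first half.
        gamma = min(1.0, (t + 1) / (0.5 * n_iters))
        particles = svgd_step(particles, grad_log_p, step, bandwidth, gamma)
    return particles


if __name__ == "__main__":
    final = run_svgd()
    mean = sum(final) / len(final)
    print("particle mean:", mean)
```

With small `gamma` the repulsion dominates and the particles spread out before the attraction pulls them toward the target, which is the intuition usually given for annealing as a remedy for mode collapse.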
  • PDF
    The cost of writing, transferring, and storing large data from unsteady simulations limits access to the entire solution, often leaving much of the flow under-sampled or unanalyzed. For example, modeling transient behavior of rare dynamic events requires 3D snapshots at high sampling rates over long periods, generating significant amounts of data and creating challenges for practical CFD workflows, especially with limited memory resources and costly GPU writing penalties. In this work, multiple sparse flow reconstruction (SFR) methods are developed to approximate a full unsteady solution using far fewer sparse measurements, thus reducing writing costs and data storage and enabling higher sampling rates. SFR is motivated by a large-eddy simulation of rare inlet distortion events, demonstrating that down-sampling full snapshots and supplementing them with high-frequency sparse measurements can drastically cut writing time for GPU solvers and nearly eliminate this cost for CPU solvers. The simplest single-equation "snapshot" SFR method can be compressed further using Proper Orthogonal Decomposition (POD-SFR) or a more efficient double POD-SFR variant. A streaming SFR modification improves reconstruction efficiency when local memory is limited. A sensitivity study evaluates trade-offs between sparse sampling rates and reconstruction accuracy, offering best practices. To offset the error introduced by random sparse measurements, SFR exactly preserves dynamics in key regions by prescribing sparse measurement locations, used here to capture distortion events. Distortion events are evaluated using the conditional space-time proper orthogonal decomposition (CST-POD) to pursue physical insights that characterize the upstream causality at full resolution. A validation study of CST-POD modes confirms SFR effectiveness at retaining the event dynamics with substantial computational and memory savings.
  • PDF
    As emerging marine technologies lead to the development of new infrastructure across the ocean, they enter an environment that existing ecosystems and industries already rely on. Although this infrastructure is necessary to provide sustainable sources of energy and food, careful planning will be important to make informed decisions and avoid conflicts. This paper examines several techniques used for marine spatial planning, an approach for analyzing and planning the use of marine resources. Using open source software including QGIS and Python, the potential for developing wave-powered offshore aquaculture farms using the RM3 wave energy converter along the Northeast coast of the United States is assessed and several feasible sites are identified. The optimal site, located at 43.7°N, 68.9°W along the coast of Maine, has a total cost for a 5-pen farm of $56.8M, annual fish yield of 676 tonnes, and a levelized cost of fish of $9.23 per kilogram. Overall trends indicate that the cost greatly decreases with distance to shore due to the greater availability of wave energy and that conflicts and environmental constraints significantly limit the number of feasible sites in this region.
  • PDF
    Recently, there has been a growing literature exploring the generalization of quantum algorithms, such that different quantum algorithms are special examples of a more fundamental structure. In this short paper, we provide a general approach to describe quantum algorithms as a quantum state with amplitudes that are constructed from the expected value and standard deviation of each quantum gate or a sub-sequence of gates in the algorithm. The proposed statistical-based description relies on the celebrated Aharonov-Vaidman identity. We present a more fundamental identity that, unlike the previous one, allows us to switch the basis of the states into a desired form.
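For reference, the Aharonov-Vaidman identity mentioned in the abstract above is usually stated as follows for a Hermitian operator $A$ and a normalized state $|\psi\rangle$. This is the standard form only; the generalized identity that the abstract proposes is not reproduced here.

$$
A\,|\psi\rangle = \langle A\rangle\,|\psi\rangle + \Delta A\,|\psi_\perp\rangle,
\qquad
\langle A\rangle = \langle\psi|A|\psi\rangle,
\quad
\Delta A = \sqrt{\langle A^2\rangle - \langle A\rangle^2},
\quad
\langle\psi|\psi_\perp\rangle = 0 .
$$

Here $|\psi_\perp\rangle$ is a normalized state orthogonal to $|\psi\rangle$, so the action of $A$ on a state is fully characterized by its expected value and standard deviation in that state, which is the sense in which the abstract speaks of a statistical description.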

Recent comments

Wojciech Kryszak Aug 01 2018 12:53 UTC

Your SciNet for the current Solar System problem settles nicely in the mode of operation that is equivalent to the "standard" Heliocentric model with positions encoded by Sun-angles relative to the fixed-stars background.

It would be very interesting to see what your SciNet would do when:

1. t

Noon van der Silk Jan 27 2016 03:39 UTC

Great institute name ...

Chris Granade Sep 22 2015 19:15 UTC

Thank you for the kind comments, I'm glad that our paper, source code, and tutorial are useful!

Travis Scholten Sep 21 2015 17:05 UTC

This was a really well-written paper! Am very glad to see this kind of work being done.

In addition, the openness about source code is refreshing. By explicitly relating the work to [QInfer](https://github.com/csferrie/python-qinfer), this paper makes it easier to check the authors' work. Furthe

Chris Granade Sep 15 2015 02:40 UTC

As a quick addendum, please note that the [supplementary video](https://www.youtube.com/watch?v=22ejRV0Kx2g) for this work is available [on YouTube](https://www.youtube.com/watch?v=22ejRV0Kx2g). Thank you!