
Unsupervised anomaly detection in spatio-temporal stream network sensor data

Authors: Edgar Santos-Fernandez, Jay M. Ver Hoef, Erin E. Peterson, James McGree, Cesar A. Villa, Catherine Leigh, Ryan Turner, Cameron Roberts, Kerrie Mengersen

Abstract: The use of in-situ digital sensors for water quality monitoring is becoming increasingly common worldwide. While these sensors provide near real-time data for science, the data are prone to technical anomalies that can undermine the trustworthiness of the data and the accuracy of statistical inferences, particularly in spatial and temporal analyses. Here we propose a framework for detecting anomalies in sensor data recorded in stream networks, which takes advantage of spatial and temporal autocorrelation to improve detection rates. The proposed framework involves the implementation of effective data imputation to handle missing data, alignment of time-series to address temporal disparities, and the identification of water quality events. We explore the effectiveness of a suite of state-of-the-art statistical methods including posterior predictive distributions, finite mixtures, and Hidden Markov Models (HMM). We showcase the practical implementation of automated anomaly detection in near-real time by employing a Bayesian recursive approach. This demonstration is conducted through a comprehensive simulation study and a practical application to a substantive case study situated in the Herbert River, located in Queensland, Australia, which flows into the Great Barrier Reef. We found that methods such as posterior predictive distributions and HMM produce the best performance in detecting multiple types of anomalies.
Utilizing data from multiple sensors deployed relatively near one another enhances the ability to distinguish between water quality events and technical anomalies, thereby significantly improving the accuracy of anomaly detection. Thus, uncertainty and biases in water quality reporting, interpretation, and modelling are reduced, and the effectiveness of subsequent management actions improved.

Submitted 11 September, 2024; originally announced September 2024.

arXiv:2305.07811

Indexing and Partitioning the Spatial Linear Model for Large Data Sets

Authors: Jay M. Ver Hoef, Michael Dumelle, Matt Higham, Erin E. Peterson, Daniel J. Isaak

Abstract: We consider four main goals when fitting spatial linear models: 1) estimating covariance parameters, 2) estimating fixed effects, 3) kriging (making point predictions), and 4) block-kriging (predicting the average value over a region). Each of these goals can present different challenges when analyzing large spatial data sets. Current research uses a variety of methods, including spatial basis functions (reduced rank), covariance tapering, etc., to achieve these goals. However, spatial indexing, which is very similar to composite likelihood, offers some advantages. We develop a simple framework for all four goals listed above by using indexing to create a block covariance structure and nearest-neighbor predictions while maintaining a coherent linear model. We show exact inference for fixed effects under this block covariance construction. Spatial indexing is very fast, and simulations are used to validate methods and compare to another popular method. We study various sample designs for indexing and our simulations showed that indexing leading to spatially compact partitions is best over a range of sample sizes, autocorrelation values, and generating processes. Partitions can be kept small, on the order of 50 samples per partition. We use nearest-neighbors for kriging and block kriging, finding that 50 nearest-neighbors is sufficient. In all cases, confidence intervals for fixed effects, and prediction intervals for (block) kriging, have appropriate coverage.
Some advantages of spatial indexing are that it is available for any valid covariance matrix, can take advantage of parallel computing, and easily extends to non-Euclidean topologies, such as stream networks. We use stream networks to show how spatial indexing can achieve all four goals, listed above, for very large data sets, in a matter of minutes, rather than days, for an example data set.

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2305.02978

doi 10.1002/env.2872

Marginal Inference for Hierarchical Generalized Linear Mixed Models with Patterned Covariance Matrices Using the Laplace Approximation

Authors: Jay M. Ver Hoef, Eryn Blagg, Michael Dumelle, Philip M. Dixon, Dale L. Zimmerman, Paul Conn

Abstract: Using a hierarchical construction, we develop methods for a wide and flexible class of models by taking a fully parametric approach to generalized linear mixed models with complex covariance dependence. The Laplace approximation is used to marginally estimate covariance parameters while integrating out all fixed and latent random effects. The Laplace approximation relies on Newton-Raphson updates, which also leads to predictions for the latent random effects. We develop methodology for complete marginal inference, from estimating covariance parameters and fixed effects to making predictions for unobserved data, for any patterned covariance matrix in the hierarchical generalized linear mixed models framework. The marginal likelihood is developed for six distributions that are often used for binary, count, and positive continuous data, and our framework is easily extended to other distributions. The methods are illustrated with simulations from stochastic processes with known parameters, and their efficacy in terms of bias and interval coverage is shown through simulation experiments. Examples with binary and proportional data on election results, count data for marine mammals, and positive-continuous data on heavy metal concentration in the environment are used to illustrate all six distributions with a variety of patterned covariance structures that include spatial models (e.g., geostatistical and areal models), time series models (e.g., first-order autoregressive models), and mixtures with typical random intercepts based on grouping.

Submitted 4 May, 2023; originally announced May 2023.

Journal ref: Environmetrics, 2024, e2872

arXiv:2202.07166

SSNbayes: An R package for Bayesian spatio-temporal modelling on stream networks

Authors: Edgar Santos-Fernandez, Jay M. Ver Hoef, James M. McGree, Daniel J. Isaak, Kerrie Mengersen, Erin E. Peterson

Abstract: Spatio-temporal models are widely used in many research areas from ecology to epidemiology. However, most covariance functions describe spatial relationships based on Euclidean distance only. In this paper, we introduce the R package SSNbayes for fitting Bayesian spatio-temporal models and making predictions on branching stream networks. SSNbayes provides a linear regression framework with multiple options for incorporating spatial and temporal autocorrelation. Spatial dependence is captured using stream distance and flow connectivity while temporal autocorrelation is modelled using vector autoregression approaches. SSNbayes provides the functionality to make predictions across the whole network, compute exceedance probabilities and other probabilistic estimates such as the proportion of suitable habitat. We illustrate the functionality of the package using a stream temperature dataset collected in Idaho, USA.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2103.03538

doi 10.1016/j.csda.2022.107446

Bayesian spatio-temporal models for stream networks

Authors: Edgar Santos-Fernandez, Jay M. Ver Hoef, Erin E. Peterson, James McGree, Daniel Isaak, Kerrie Mengersen

Abstract: Spatio-temporal models are widely used in many research areas including ecology. The recent proliferation of the use of in-situ sensors in streams and rivers supports space-time water quality modelling and monitoring in near real-time. A new family of spatio-temporal models is introduced. These models incorporate spatial dependence using stream distance while temporal autocorrelation is captured using vector autoregression approaches. Several variations of these novel models are proposed using a Bayesian framework. The results show that our proposed models perform well using spatio-temporal data collected from real stream networks, particularly in terms of out-of-sample RMSPE. This is illustrated considering a case study of water temperature data in the northwestern United States.

Submitted 14 February, 2022; v1 submitted 5 March, 2021; originally announced March 2021.

Comments: 30 pages, 10 figs

arXiv:2005.00952

doi 10.1016/j.spasta.2021.100510

A Linear Mixed Model Formulation for Spatio-Temporal Random Processes with Computational Advances for the Separable and Product-Sum Covariances

Authors: Michael Dumelle, Jay M. Ver Hoef, Claudio Fuentes, Alix Gitelman

Abstract: We describe spatio-temporal random processes using linear mixed models. We show how many commonly used models can be viewed as special cases of this general framework and pay close attention to models with separable or product-sum covariances. The proposed linear mixed model formulation facilitates the implementation of a novel algorithm using Stegle eigendecompositions, a recursive application of the Sherman-Morrison-Woodbury formula, and Helmert-Wolf blocking to efficiently invert separable and product-sum covariance matrices, even when every spatial location is not observed at every time point. We show our algorithm provides noticeable improvements over the standard Cholesky decomposition approach. Via simulations, we assess the performance of the separable and product-sum covariances and identify scenarios where separable covariances are noticeably inferior to product-sum covariances. We also compare likelihood-based and semivariogram-based estimation and discuss benefits and drawbacks of both. We use the proposed approach to analyze daily maximum temperature data in Oregon, USA, during the 2019 summer. We end by offering guidelines for choosing among these covariances and estimation methods based on properties of observed data.
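
Illustrative note (not part of the listing): the abstract above refers to inverting separable covariance matrices efficiently. The sketch below is not the paper's algorithm. It only demonstrates the standard Kronecker identity that makes the fully observed separable case cheap: the inverse of a Kronecker product is the Kronecker product of the inverses, so only the small factors need inverting. The 2x2 matrices `cov_time` and `cov_space` and their correlation values are made up for illustration.

```python
# Illustrative sketch only: fully observed separable covariance,
# small 2x2 factors with made-up correlation values.

def inv2(m):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]

# Hypothetical temporal and spatial correlation matrices.
cov_time = [[1.0, 0.6], [0.6, 1.0]]
cov_space = [[1.0, 0.3], [0.3, 1.0]]

# Full 4x4 separable covariance, and its inverse built from the factors.
full_cov = kron(cov_time, cov_space)
fast_inverse = kron(inv2(cov_time), inv2(cov_space))

# Check that fast_inverse really inverts full_cov (product is the identity).
product = matmul(fast_inverse, full_cov)
max_err = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(4) for j in range(4))
print(max_err < 1e-12)
```

With missing space-time combinations the covariance is no longer an exact Kronecker product, which is the situation the abstract says the authors' method is designed to handle.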

Submitted 2 May, 2020; originally announced May 2020.

Comments: 43 pages (including an Appendix) and 8 figures

Journal ref: Spatial Statistics, Volume 43, 2021

arXiv:1912.00540

SSNdesign -- an R package for pseudo-Bayesian optimal and adaptive sampling designs on stream networks

Authors: Alan R. Pearse, James M. McGree, Nicholas A. Som, Catherine Leigh, Jay M. Ver Hoef, Paul Maxwell, Erin E. Peterson

Abstract: Streams and rivers are biodiverse and provide valuable ecosystem services. Maintaining these ecosystems is an important task, so organisations often monitor the status and trends in stream condition and biodiversity using field sampling and, more recently, autonomous in-situ sensors. However, data collection is often costly and so effective and efficient survey designs are crucial to maximise information while minimising costs. Geostatistics and optimal and adaptive design theory can be used to optimise the placement of sampling sites in freshwater studies and aquatic monitoring programs. Geostatistical modelling and experimental design on stream networks pose statistical challenges due to the branching structure of the network, flow connectivity and directionality, and differences in flow volume. Thus, the unique challenges of geostatistics and experimental design on stream networks necessitate the development of new open-source software for implementing the theory. We present SSNdesign, an R package for solving optimal and adaptive design problems on stream networks that integrates with existing open-source software. We demonstrate the mathematical foundations of our approach, and illustrate the functionality of SSNdesign using two case studies involving real data from Queensland, Australia. In both case studies we demonstrate that the optimal or adaptive designs outperform random and spatially balanced survey designs.
The SSNdesign package has the potential to boost the efficiency of freshwater monitoring efforts and provide much-needed information for freshwater conservation and management.

Submitted 1 December, 2019; originally announced December 2019.

Comments: Main document: 18 pages, 7 figures; Supp Info A: 11 pages, 0 figures; Supp Info B: 24 pages, 6 figures; Supp Info C: 3 pages, 0 figures

arXiv:1812.10236

Comparing Spatial Regression to Random Forests for Large Environmental Data Sets

Authors: Eric W. Fox, Jay M. Ver Hoef, Anthony R. Olsen

Abstract: Environmental data may be "large" due to number of records, number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates with nonlinear relationships, whereas spatial regression, when using reduced rank methods, has a reputation for good predictive performance when using many records that are spatially autocorrelated. In this study, we compare these two techniques using a data set containing the macroinvertebrate multimetric index (MMI) at 1859 stream sites with over 200 landscape covariates. A primary application is mapping MMI predictions and prediction errors at 1.1 million perennial stream reaches across the conterminous United States. For the spatial regression model, we develop a novel transformation procedure that estimates Box-Cox transformations to linearize covariate relationships and handles possibly zero-inflated covariates. We find that the spatial regression model with transformations, and a subsequent selection of significant covariates, has cross-validation performance slightly better than random forests. We also find that prediction interval coverage is close to nominal for each method, but that spatial regression prediction intervals tend to be narrower and have less variability than quantile regression forest prediction intervals. A simulation study is used to generalize results and clarify advantages of each modeling approach.

Submitted 26 December, 2018; originally announced December 2018.

arXiv:1710.07000

On the Relationship between Conditional (CAR) and Simultaneous (SAR) Autoregressive Models

Authors: Jay M. Ver Hoef, Ephraim M. Hanks, Mevin B. Hooten

Abstract: We clarify relationships between conditional (CAR) and simultaneous (SAR) autoregressive models. We review the literature on this topic and find that it is mostly incomplete. Our main result is that a SAR model can be written as a unique CAR model, and while a CAR model can be written as a SAR model, it is not unique. In fact, we show how any multivariate Gaussian distribution on a finite set of points with a positive-definite covariance matrix can be written as either a CAR or a SAR model. We illustrate how to obtain any number of SAR covariance matrices from a single CAR covariance matrix by using Givens rotation matrices on a simulated example. We also discuss sparseness in the original CAR construction, and for the resulting SAR weights matrix. For a real example, we use crime data in 49 neighborhoods from Columbus, Ohio, and show that a geostatistical model optimizes the likelihood much better than typical first-order CAR models. We then use the implied weights from the geostatistical model to estimate CAR model parameters that provide the best overall optimization.
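
Illustrative note (not part of the listing): for readers unfamiliar with the two model classes named above, the standard zero-mean textbook forms can be written as follows. The symbols $\mathbf{B}$, $\boldsymbol{\Omega}$, $\mathbf{C}$, and $\mathbf{M}$ are conventional notation assumed here; the abstract itself does not define them.

```latex
% SAR: simultaneous specification with weights matrix B and diagonal Omega
\mathbf{Z} = \mathbf{B}\mathbf{Z} + \boldsymbol{\nu}, \qquad
\boldsymbol{\nu} \sim \mathrm{N}(\mathbf{0}, \boldsymbol{\Omega})
\;\;\Longrightarrow\;\;
\boldsymbol{\Sigma}_{\mathrm{SAR}}
  = (\mathbf{I} - \mathbf{B})^{-1}\,\boldsymbol{\Omega}\,
    (\mathbf{I} - \mathbf{B}^{\top})^{-1}

% CAR: conditional specification with weights c_{ij} and conditional variances m_{ii}
Z_i \mid \mathbf{z}_{-i} \sim
  \mathrm{N}\Big(\textstyle\sum_{j \neq i} c_{ij} z_j,\; m_{ii}\Big)
\;\;\Longrightarrow\;\;
\boldsymbol{\Sigma}_{\mathrm{CAR}}
  = (\mathbf{I} - \mathbf{C})^{-1}\,\mathbf{M}, \qquad
\mathbf{M} = \operatorname{diag}(m_{11}, \dots, m_{nn})
```

Both forms require the resulting matrix to be a valid (symmetric, positive-definite) covariance. The SAR covariance is built from a factor and its transpose, which is consistent with the abstract's point that more than one SAR specification can lead to the same covariance matrix.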

Submitted 19 October, 2017; originally announced October 2017.

Comments: 18 pages, 4 figures

arXiv:1410.3163

Estimating Abundance from Counts in Large Data Sets of Irregularly-Spaced Plots using Spatial Basis Functions

Authors: Jay M. Ver Hoef, John K. Jansen

Abstract: Monitoring plant and animal populations is an important goal for both academic research and management of natural resources. Successful management of populations often depends on obtaining estimates of their mean or total over a region. The basic problem considered in this paper is the estimation of a total from a sample of plots containing count data, but the plot placements are spatially irregular and non-randomized. Our application had counts from thousands of irregularly-spaced aerial photo images. We used change-of-support methods to model counts in images as a realization of an inhomogeneous Poisson process that used spatial basis functions to model the spatial intensity surface. The method was very fast and took only a few seconds for thousands of images. The fitted intensity surface was integrated to provide an estimate from all unsampled areas, which is added to the observed counts. The proposed method also provides a finite area correction factor to variance estimation. The intensity surface from an inhomogeneous Poisson process tends to be too smooth for locally clustered points, typical of animal distributions, so we introduce several new overdispersion estimators due to poor performance of the classic one. We used simulated data to examine estimation bias and to investigate several variance estimators with overdispersion. A real example is given of harbor seal counts from aerial surveys in an Alaskan glacial fjord.

Submitted 12 October, 2014; originally announced October 2014.

Comments: 37 pages, 7 figures, 4 tables, keywords: sampling, change-of-support, spatial point processes, intensity function, random effects, Poisson process, overdispersion

Showing 1–10 of 10 results for author: Hoef, J M V