Search | arXiv e-print repository

arXiv:2406.19591 [pdf, other]

Mathematical modelling and uncertainty quantification for analysis of biphasic coral reef recovery patterns

Authors: David J. Warne, Kerryn Crossman, Grace E. M. Heron, Jesse A. Sharp, Wang Jin, Paul Pao-Yen Wu, Matthew J. Simpson, Kerrie Mengersen, Juan-Carlos Ortiz

Abstract: Coral reefs are increasingly subjected to major disturbances threatening the health of marine ecosystems. Substantial research is underway to develop intervention strategies that assist reefs in recovery from, and resistance to, inevitable future climate and weather extremes. To assess potential benefits of interventions, mechanistic understanding of coral reef recovery and resistance patterns is essential. Recent evidence suggests that more than half of the reefs surveyed across the Great Barrier Reef (GBR) exhibit deviations from standard recovery modelling assumptions when the initial coral cover is low ($\leq 10$\%). New modelling is necessary to account for these observed patterns to better inform management strategies. We consider a new model for reef recovery at the coral cover scale that accounts for biphasic recovery patterns. The model is based on a multispecies Richards' growth model that includes a change point in the recovery patterns. Bayesian inference is applied for uncertainty quantification of key parameters for assessing reef health and recovery patterns. This analysis is applied to benthic survey data from the Australian Institute of Marine Science (AIMS). We demonstrate agreement between model predictions and data across every recorded recovery trajectory with at least two years of observations following disturbance events occurring between 1992--2020. This new approach will enable new insights into the biological, ecological and environmental factors that contribute to the duration and severity of biphasic coral recovery patterns across the GBR. These new insights will help to inform management and monitoring practice to mitigate the impacts of climate change on coral reefs.

Submitted 27 June, 2024; originally announced June 2024.

MSC Class: 62P12 (Primary)
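
The biphasic idea in the abstract above can be illustrated with a minimal single-species sketch. Everything here is an assumption made for illustration rather than the authors' model: the paper describes a multispecies Richards' model fitted with Bayesian inference, whereas this sketch uses one cover variable, the common parameterisation dC/dt = r C (1 - (C/K)^nu), and a hypothetical `slow_factor` that scales the growth rate before a hypothetical `change_point`.

```python
import math


def richards(c0, t, r, k, nu):
    """Closed-form solution of dC/dt = r*C*(1 - (C/K)**nu) with C(0) = c0."""
    q = (k / c0) ** nu - 1.0
    return k / (1.0 + q * math.exp(-r * nu * t)) ** (1.0 / nu)


def biphasic_cover(t, c0, r, k, nu, change_point, slow_factor):
    """Coral cover at time t when growth is slowed before the change point.

    Before `change_point` the rate is slow_factor * r (assumed form);
    afterwards the full rate r applies, starting from the cover reached
    at the change point so the trajectory stays continuous.
    """
    if t <= change_point:
        return richards(c0, t, slow_factor * r, k, nu)
    c_switch = richards(c0, change_point, slow_factor * r, k, nu)
    return richards(c_switch, t - change_point, r, k, nu)


# Illustrative values only: 5% initial cover, 80% carrying capacity.
params = dict(c0=5.0, r=0.5, k=80.0, nu=1.0, change_point=3.0, slow_factor=0.2)
trajectory = [biphasic_cover(t, **params) for t in range(0, 16)]
print([round(c, 1) for c in trajectory])
```

Setting `slow_factor` to 1 recovers the ordinary single-phase Richards curve, which makes it easy to compare the two recovery shapes side by side.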

arXiv:2405.16055 [pdf, other]

Federated Learning for Non-factorizable Models using Deep Generative Prior Approximations

Authors: Conor Hassan, Joshua J Bon, Elizaveta Semenova, Antonietta Mira, Kerrie Mengersen

Abstract: Federated learning (FL) allows for collaborative model training across decentralized clients while preserving privacy by avoiding data sharing. However, current FL methods assume conditional independence between client models, limiting the use of priors that capture dependence, such as Gaussian processes (GPs). We introduce the Structured Independence via deep Generative Model Approximation (SIGMA) prior which enables FL for non-factorizable models across clients, expanding the applicability of FL to fields such as spatial statistics, epidemiology, environmental science, and other domains where modeling dependencies is crucial. The SIGMA prior is a pre-trained deep generative model that approximates the desired prior and induces a specified conditional independence structure in the latent variables, creating an approximate model suitable for FL settings. We demonstrate the SIGMA prior's effectiveness on synthetic data and showcase its utility in a real-world example of FL for spatial data, using a conditional autoregressive prior to model spatial dependence across Australia. Our work enables new FL applications in domains where modeling dependent data is essential for accurate predictions and decision-making.

Submitted 25 May, 2024; originally announced May 2024.

Comments: 25 pages, 7 figures, 2 tables

arXiv:2405.04043 [pdf, other]

Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference

Authors: Conor Hassan, Matthew Sutton, Antonietta Mira, Kerrie Mengersen

Abstract: Vertical federated learning (VFL) has emerged as a paradigm for collaborative model estimation across multiple clients, each holding a distinct set of covariates. This paper introduces the first comprehensive framework for fitting Bayesian models in the VFL setting. We propose a novel approach that leverages data augmentation techniques to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms. We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods. To mitigate the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, we develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. We showcase the efficacy of our framework through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in various domains.

Submitted 7 May, 2024; originally announced May 2024.

Comments: 30 pages, 5 figures, 3 tables

arXiv:2404.12657 [pdf, other]

Proposer selection in EIP-7251

Authors: Sandra Johnson, Kerrie Mengersen, Patrick O'Callaghan, Anders L. Madsen

Abstract: Immediate settlement, or single-slot finality (SSF), is a long-term goal for Ethereum. The growing active validator set size is placing an increasing computational burden on the network, making SSF more challenging. EIP-7251 aims to reduce the number of validators by giving stakers the option to merge existing validators. Key to the success of this proposal therefore is whether stakers choose to merge their validators once EIP-7251 is implemented. It is natural to assume stakers participate only if they anticipate greater expected utility (risk-adjusted returns) as a single large validator. In this paper, we focus on one of the duties that a validator performs, viz. being the proposer for the next block. This duty can be quite lucrative, but happens infrequently. Based on previous analysis, we may assume that EIP-7251 implies no change to the security of the protocol. We confirm that the probability of a validator being selected as block proposer is equivalent under each consolidation regime. This result ensures that the decision of one staker to merge has no impact on the opportunity of another to propose the next block, in turn ensuring there is no major systemic change to the economics of the protocol with respect to proposer selection.

Submitted 19 April, 2024; originally announced April 2024.

Comments: 15 pages

MSC Class: 62-06 ACM Class: G.3
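
The equivalence claim in the abstract above can be made concrete with a small worked example. This is a simplified sketch, not the paper's analysis: it assumes the chance of proposing the next block is proportional to effective balance, and the network size and balances are hypothetical figures chosen for illustration.

```python
from fractions import Fraction


def proposer_probability(staker_balances, other_balances):
    """Chance that the next proposer belongs to the staker.

    Assumes selection probability is proportional to effective balance
    (a simplification adopted for this sketch).
    """
    staker_total = sum(staker_balances)
    total = staker_total + sum(other_balances)
    return Fraction(staker_total, total)


# Hypothetical rest of the network: 1,000 validators of 32 ETH each.
others = [32] * 1000

# The same 2,048 ETH stake, held two ways.
unmerged = [32] * 64   # 64 separate validators
merged = [2048]        # one consolidated validator

p_unmerged = proposer_probability(unmerged, others)
p_merged = proposer_probability(merged, others)
print(p_unmerged == p_merged)
```

Under this assumption the staker's share depends only on the total balance staked, so merging changes neither the staker's own chance nor the chance of any other validator.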

arXiv:2403.14954 [pdf, other]

Creating a Spatial Vulnerability Index for Environmental Health

Authors: Aiden Price, Kerrie Mengersen, Michael Rigby, Paula Fiévez

Abstract: Extreme natural hazards are increasing in frequency and intensity. These natural changes in our environment, combined with man-made pollution, have substantial economic, social and health impacts globally. The impact of the environment on human health (environmental health) is becoming well understood in international research literature. However, there are significant barriers to understanding key characteristics of this impact, related to substantial data volumes, data access rights and the time required to compile and compare data over regions and time. This study aims to reduce these barriers in Australia by creating an open data repository of national environmental health data and presenting a methodology for the production of health outcome-weighted population vulnerability indices related to extreme heat, extreme cold and air pollution at various temporal and geographical resolutions. Current state-of-the-art methods for the calculation of vulnerability indices include equal weight percentile ranking and the use of principal component analysis (PCA). The weighted vulnerability index methodology proposed in this study offers an advantage over others in the literature by considering health outcomes in the calculation process. The resulting vulnerability percentiles more clearly align population sensitivity and adaptive capacity with health risks. The temporal and spatial resolutions of the indices enable national monitoring on a scale never before seen across Australia. Additionally, we show that a weekly temporal resolution can be used to identify spikes in vulnerability due to changes in relative national environmental exposure.

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2403.13076 [pdf, other]

Spatial Autoregressive Model on a Dirichlet Distribution

Authors: Teo Nguyen, Sarat Moka, Kerrie Mengersen, Benoit Liquet

Abstract: Compositional data find broad application across diverse fields due to their efficacy in representing proportions or percentages of various components within a whole. Spatial dependencies often exist in compositional data, particularly when the data represents different land uses or ecological variables. Ignoring the spatial autocorrelations in modelling of compositional data may lead to incorrect estimates of parameters. Hence, it is essential to incorporate spatial information into the statistical analysis of compositional data to obtain accurate and reliable results. However, traditional statistical methods are not directly applicable to compositional data due to the correlation between its observations, which are constrained to lie on a simplex. To address this challenge, the Dirichlet distribution is commonly employed, as its support aligns with the nature of compositional vectors. Specifically, the R package DirichletReg provides a regression model, termed Dirichlet regression, tailored for compositional data. However, this model fails to account for spatial dependencies, thereby restricting its utility in spatial contexts. In this study, we introduce a novel spatial autoregressive Dirichlet regression model for compositional data, adeptly integrating spatial dependencies among observations. We construct a maximum likelihood estimator for a Dirichlet density function augmented with a spatial lag term. We compare this spatial autoregressive model with the same model without spatial lag, where we test both models on synthetic data as well as two real datasets, using different metrics. By considering the spatial relationships among observations, our model provides more accurate and reliable results for the analysis of compositional data. The model is further evaluated against a spatial multinomial regression model for compositional data, and their relative effectiveness is discussed.

Submitted 19 March, 2024; originally announced March 2024.

Comments: 33 pages, 2 figures, submitted to "Computational Statistics & Data Analysis"

arXiv:2403.10791 [pdf, other]

Bayesian Design for Sampling Anomalous Spatio-Temporal Data

Authors: Katie Buchhorn, Kerrie Mengersen, Edgar Santos-Fernandez, James McGree

Abstract: Data collected from arrays of sensors are essential for informed decision-making in various systems. However, the presence of anomalies can compromise the accuracy and reliability of insights drawn from the collected data or information obtained via statistical analysis. This study aims to develop a robust Bayesian optimal experimental design (BOED) framework with anomaly detection methods for high-quality data collection. We introduce a general framework that involves anomaly generation, detection and error scoring when searching for an optimal design. This method is demonstrated using two comprehensive simulated case studies: the first study uses a spatial dataset, and the second uses a spatio-temporal river network dataset. As a baseline approach, we employed a commonly used prediction-based utility function based on minimising errors. Results illustrate the trade-off between predictive accuracy and anomaly detection performance for our method under various design scenarios. An optimal design robust to anomalies ensures the collection and analysis of more trustworthy data, playing a crucial role in understanding the dynamics of complex systems such as the environment, therefore enabling informed decisions in monitoring, management, and response.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.08127 [pdf]

Guidelines for the Creation of Analysis Ready Data

Authors: Harriette Phillips, Aiden Price, Owen Forbes, Claire Boulange, Kerrie Mengersen, Marketa Reeves, Rebecca Glauert

Abstract: Globally, there is an increased need for guidelines to produce high-quality data outputs for analysis. No framework currently exists that provides guidelines for a comprehensive approach to producing analysis ready data (ARD). Through critically reviewing and summarising current literature, this paper proposes such guidelines for the creation of ARD. The guidelines proposed in this paper inform ten steps in the generation of ARD: ethics, project documentation, data governance, data management, data storage, data discovery and collection, data cleaning, quality assurance, metadata, and data dictionary. These steps are illustrated through a substantive case study that aimed to create ARD for a digital spatial platform: the Australian Child and Youth Wellbeing Atlas (ACYWA).

Submitted 29 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

Comments: 49 pages, 3 figures, 3 tables, and 5 appendices

arXiv:2403.00319 [pdf, other]

Creating area level indices of behaviours impacting cancer in Australia with a Bayesian generalised shared component model

Authors: James Hogg, Susanna Cramb, Jessica Cameron, Peter Baade, Kerrie Mengersen

Abstract: This study develops a model-based index creation approach called the Generalized Shared Component Model (GSCM) by drawing on the large field of factor models. The proposed fully Bayesian approach accommodates heteroscedastic model error, multiple shared factors and flexible spatial priors. Moreover, our model, unlike previous index approaches, provides indices with uncertainty. Focusing on Australian risk factor data, the proposed GSCM is used to develop the Area Indices of Behaviors Impacting Cancer product - representing the first area level cancer risk factor index in Australia. This advancement aids in identifying communities with elevated cancer risk, facilitating targeted health interventions.

Submitted 1 March, 2024; originally announced March 2024.

Comments: Submitted to Health and Place

arXiv:2311.12349 [pdf, other]

Spatial Non-parametric Bayesian Clustered Coefficients

Authors: Wala Draidi Areed, Aiden Price, Helen Thompson, Reid Malseed, Kerrie Mengersen

Abstract: In the field of population health research, understanding the similarities between geographical areas and quantifying their shared effects on health outcomes is crucial. In this paper, we synthesise a number of existing methods to create a new approach that specifically addresses this goal. The approach is called a Bayesian spatial Dirichlet process clustered heterogeneous regression model. This non-parametric framework allows for inference on the number of clusters and the clustering configurations, while simultaneously estimating the parameters for each cluster. We demonstrate the efficacy of the proposed algorithm using simulated data and further apply it to analyse influential factors affecting children's health development domains in Queensland. The study provides valuable insights into the contributions of regional similarities in education and demographics to health outcomes, aiding targeted interventions and policy design.

Submitted 22 November, 2023; v1 submitted 20 November, 2023; originally announced November 2023.

arXiv:2311.12347 [pdf, other]

Bayesian Cluster Geographically Weighted Regression for Spatial Heterogeneous Data

Authors: Wala Draidi Areed, Aiden Price, Helen Thompson, Conor Hassan, Reid Malseed, Kerrie Mengersen

Abstract: Spatial statistical models are commonly used in geographical scenarios to ensure spatial variation is captured effectively. However, spatial models and cluster algorithms can be complicated and expensive. This paper pursues three main objectives. First, it introduces covariate effect clustering by integrating a Bayesian Geographically Weighted Regression (BGWR) with a Gaussian mixture model and the Dirichlet process mixture model. Second, this paper examines situations in which a particular covariate holds significant importance in one region but not in another in the Bayesian framework. Lastly, it addresses computational challenges present in existing BGWR, leading to notable enhancements in Markov chain Monte Carlo estimation suitable for large spatial datasets. The efficacy of the proposed method is demonstrated using simulated data and is further validated in a case study examining children's development domains in Queensland, Australia, using data provided by Children's Health Queensland and Australia's Early Development Census.

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2308.15773 [pdf, other]

doi 10.1186/s12942-023-00352-5

Mapping the prevalence of cancer risk factors at the small area level in Australia

Authors: James Hogg, Jessica Cameron, Susanna Cramb, Peter Baade, Kerrie Mengersen

Abstract: Cancer is a significant health issue globally and it is well known that cancer risk varies geographically. However, in many countries there are no small area level data on cancer risk factors with high resolution and complete reach, which hinders the development of targeted prevention strategies. Using Australia as a case study, the 2017-2018 National Health Survey was used to generate prevalence estimates for 2221 small areas across Australia for eight cancer risk factor measures covering smoking, alcohol, physical activity, diet and weight. Utilising a recently developed Bayesian two-stage small area estimation methodology, the model incorporated survey-only covariates, spatial smoothing and hierarchical modelling techniques, along with a vast array of small area level auxiliary data, including census, remoteness, and socioeconomic data. The models borrowed strength from previously published cancer risk estimates provided by the Social Health Atlases of Australia. Estimates were internally and externally validated. We illustrated that in 2017-18 health behaviours across Australia exhibited more spatial disparities than previously realised by improving the reach and resolution of formerly published cancer risk factors. The derived estimates reveal higher prevalence of unhealthy behaviours in more remote areas, and areas of lower socioeconomic status; a trend that aligns well with previous work. Our study addresses the gaps in small area level cancer risk factor estimates in Australia. The new estimates provide improved spatial resolution and reach and will enable more targeted cancer prevention strategies at the small area level, supporting policy makers, researchers, and the general public in understanding the spatial distribution of cancer risk factors in Australia. To help disseminate the results of this work, they will be made available in the Australian Cancer Atlas 2.0.

Submitted 22 October, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

Comments: Submitted to the International Journal of Health Geographics

arXiv:2308.03970 [pdf, other]

Dependent Cluster Mapping (DCMAP): Optimal clustering of directed acyclic graphs for statistical inference

Authors: Paul Pao-Yen Wu, Fabrizio Ruggeri, Kerrie Mengersen

Abstract: A Directed Acyclic Graph (DAG) can be partitioned or mapped into clusters to support and make inference more computationally efficient in Bayesian Network (BN), Markov process and other models. However, optimal partitioning with an arbitrary cost function is challenging, especially in statistical inference as the local cluster cost is dependent on both nodes within a cluster, and the mapping of clusters connected via parent and/or child nodes, which we call dependent clusters. We propose a novel algorithm called DCMAP for optimal cluster mapping with dependent clusters. Given an arbitrarily defined, positive cost function based on the DAG, we show that DCMAP converges to find all optimal clusters, and returns near-optimal solutions along the way. Empirically, we find that the algorithm is time-efficient for a Dynamic BN (DBN) model of a seagrass complex system using a computation cost function. For a 25 and 50-node DBN, the search space size was $9.91\times 10^9$ and $1.51\times10^{21}$ possible cluster mappings, and the first optimal solution was found at iteration 934 $(\text{95\% CI } 926,971)$, and 2256 $(2150,2271)$ with a cost that was 4\% and 0.2\% of the naive heuristic cost, respectively.

Submitted 7 February, 2024; v1 submitted 7 August, 2023; originally announced August 2023.

arXiv:2307.15424 [pdf, ps, other]

Deep Generative Models, Synthetic Tabular Data, and Differential Privacy: An Overview and Synthesis

Authors: Conor Hassan, Robert Salomone, Kerrie Mengersen

Abstract: This article provides a comprehensive synthesis of the recent developments in synthetic data generation via deep generative models, focusing on tabular datasets. We specifically outline the importance of synthetic data generation in the context of privacy-sensitive data. Additionally, we highlight the advantages of using deep generative models over other methods and provide a detailed explanation of the underlying concepts, including unsupervised learning, neural networks, and generative models. The paper covers the challenges and considerations involved in using deep generative models for tabular datasets, such as data normalization, privacy concerns, and model evaluation. This review provides a valuable resource for researchers and practitioners interested in synthetic data generation and its applications.

Submitted 27 August, 2023; v1 submitted 28 July, 2023; originally announced July 2023.

arXiv:2306.11302 [pdf, other]

A Two-Stage Bayesian Small Area Estimation Approach for Proportions

Authors: James Hogg, Jessica Cameron, Susanna Cramb, Peter Baade, Kerrie Mengersen

Abstract: With the rise in popularity of digital Atlases to communicate spatial variation, there is an increasing need for robust small-area estimates. However, current small-area estimation methods suffer from various modeling problems when data are very sparse or when estimates are required for areas with very small populations. These issues are particularly heightened when modeling proportions. Additionally, recent work has shown significant benefits in modeling at both the individual and area levels. We propose a two-stage Bayesian hierarchical small area estimation approach for proportions that can: account for survey design; reduce direct estimate instability; and generate prevalence estimates for small areas with no survey data. Using a simulation study we show that, compared with existing Bayesian small area estimation methods, our approach can provide optimal predictive performance (Bayesian mean relative root mean squared error, mean absolute relative bias and coverage) of proportions under a variety of data conditions, including very sparse and unstable data. To assess the model in practice, we compare modeled estimates of current smoking prevalence for 1,630 small areas in Australia using the 2017-2018 National Health Survey data combined with 2016 census data.

Submitted 4 December, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: Currently in second round of review at the International Statistical Review

arXiv:2306.01278 [pdf, ps, other]

The Fisher Geometry and Geodesics of the Multivariate Normals, without Differential Geometry

Authors: Brodie A. J. Lawson, Kevin Burrage, Kerrie Mengersen, Rodrigo Weber dos Santos

Abstract: Choosing the Fisher information as the metric tensor for a Riemannian manifold provides a powerful yet fundamental way to understand statistical distribution families. Distances along this manifold become a compelling measure of statistical distance, and paths of shorter distance improve sampling techniques that leverage a sequence of distributions in their operation. Unfortunately, even for a distribution as generally tractable as the multivariate normal distribution, this information geometry proves unwieldy enough that closed-form solutions for shortest-distance paths or their lengths remain unavailable outside of limited special cases. In this review we present for general statisticians the most practical aspects of the Fisher geometry for this fundamental distribution family. Rather than a differential geometric treatment, we use an intuitive understanding of the covariance-induced curvature of this manifold to unify the special cases with known closed-form solution and review approximate solutions for the general case. We also use the multivariate normal information geometry to better understand the paths or distances commonly used in statistics (annealing, Wasserstein). Given the unavailability of a general solution, we also discuss the methods used for numerically obtaining geodesics in the space of multivariate normals, identifying remaining challenges and suggesting methodological improvements.

Submitted 2 June, 2023; originally announced June 2023.

Comments: 22 pages, 8 figures, further figures and algorithms in supplement

MSC Class: 62B11 (primary); 62-02; 62-08 (secondary) ACM Class: G.3

arXiv:2305.15746 [pdf, other]

Assessing the Spatial Structure of the Association between Attendance at Preschool and Childrens Developmental Vulnerabilities in Queensland Australia

Authors: Wala Draidi Areed, Aiden Price, Kathryn Arnett, Helen Thompson, Reid Malseed, Kerrie Mengersen

Abstract: The research explores the influence of preschool attendance (one year before full-time school) on the development of children during their first year of school. Using data collected by the Australian Early Development Census, the findings show that areas with high proportions of preschool attendance tended to have lower proportions of children with at least one developmental vulnerability. Developmental vulnerabilities include not being able to cope with the school day (tired, hungry, low energy), being unable to get along with others or showing aggressive behaviour, and having trouble with reading, writing or numbers. These findings, of course, vary by region. Using Data Analysis and Machine Learning, the researchers were able to identify three distinct clusters within Queensland, each characterised by different socio-demographic variables influencing the relationship between preschool attendance and developmental vulnerability. These analyses contribute to understanding regions with high vulnerability and the potential need for tailored policies or investments.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.12651 [pdf, other]

Conditional normalization in time series analysis

Authors: Puwasala Gamakumara, Edgar Santos-Fernandez, Priyanga Dilini Talagala, Rob J. Hyndman, Kerrie Mengersen, Catherine Leigh

Abstract: Time series often reflect variation associated with other related variables. Controlling for the effect of these variables is useful when modeling or analysing the time series. We introduce a novel approach to normalize time series data conditional on a set of covariates. We do this by modeling the conditional mean and the conditional variance of the time series with generalized additive models using a set of covariates. The conditional mean and variance are then used to normalize the time series. We illustrate the use of conditionally normalized series using two applications involving river network data. First, we show how these normalized time series can be used to impute missing values in the data. Second, we show how the normalized series can be used to estimate the conditional autocorrelation function and conditional cross-correlation functions via additive models. Finally we use the conditional cross-correlations to estimate the time it takes water to flow between two locations in a river network.

Submitted 21 May, 2023; originally announced May 2023.

Comments: 36 pages, 26 Figures, Journal Article

arXiv:2305.01144 [pdf, other]

Increasing trust in new data sources: crowdsourcing image classification for ecology

Authors: Edgar Santos-Fernandez, Julie Vercelloni, Aiden Price, Grace Heron, Bryce Christensen, Erin E. Peterson, Kerrie Mengersen

Abstract: Crowdsourcing methods facilitate the production of scientific information by non-experts. This form of citizen science (CS) is becoming a key source of complementary data in many fields to inform data-driven decisions and study challenging problems. However, concerns about the validity of these data often constrain their utility. In this paper, we focus on the use of citizen science data in addressing complex challenges in environmental conservation. We consider this issue from three perspectives. First, we present a literature scan of papers that have employed Bayesian models with citizen science in ecology. Second, we compare several popular majority vote algorithms and introduce a Bayesian item response model that estimates and accounts for participants' abilities after adjusting for the difficulty of the images they have classified. The model also enables participants to be clustered into groups based on ability. Third, we apply the model in a case study involving the classification of corals from underwater images from the Great Barrier Reef, Australia. We show that the model achieved superior results in general and, for difficult tasks, a weighted consensus method that uses only groups of experts and experienced participants produced better performance measures. Moreover, we found that participants learn as they have more classification opportunities, which substantially increases their abilities over time.
Overall, the paper demonstrates the feasibility of CS for answering complex and challenging ecological questions when these data are appropriately analysed. This serves as motivation for future work to increase the efficacy and trustworthiness of this emerging source of data.

Submitted 1 May, 2023; originally announced May 2023.

Comments: 25 pages, 10 figures

arXiv:2304.09367 [pdf, other]

Graph Neural Network-Based Anomaly Detection for River Network Systems

Authors: Katie Buchhorn, Edgar Santos-Fernandez, Kerrie Mengersen, Robert Salomone

Abstract: Water is the lifeblood of river networks, and its quality plays a crucial role in sustaining both aquatic ecosystems and human societies. Real-time monitoring of water quality is increasingly reliant on in-situ sensor technology. Anomaly detection is crucial for identifying erroneous patterns in sensor data, but can be a challenging task due to the complexity and variability of the data, even under normal conditions. This paper presents a solution to the challenging task of anomaly detection for river network sensor data, which is essential for accurate and continuous monitoring. We use a graph neural network model, the recently proposed Graph Deviation Network (GDN), which employs graph attention-based forecasting to capture the complex spatio-temporal relationships between sensors. We propose an alternate anomaly scoring method, GDN+, based on the learned graph. To evaluate the model's efficacy, we introduce new benchmarking simulation experiments with highly-sophisticated dependency structures and subsequence anomalies of various types. We further examine the strengths and weaknesses of this baseline approach, GDN, in comparison to other benchmarking methods on complex real-world river network data. Findings suggest that GDN+ outperforms the baseline approach in high-dimensional data, while also providing improved interpretability. We also introduce software called gnnad.

Submitted 31 May, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

arXiv:2302.08724 [pdf, other]

Piecewise Deterministic Markov Processes for Bayesian Neural Networks

Authors: Ethan Goan, Dimitri Perrin, Kerrie Mengersen, Clinton Fookes

Abstract: Inference on modern Bayesian Neural Networks (BNNs) often relies on a variational inference treatment, imposing violated assumptions of independence and the form of the posterior. Traditional MCMC approaches avoid these assumptions at the cost of increased computation due to their incompatibility with subsampling of the likelihood. New Piecewise Deterministic Markov Process (PDMP) samplers permit subsampling, though they introduce model-specific inhomogeneous Poisson processes (IPPs) that are difficult to sample from. This work introduces a new generic and adaptive thinning scheme for sampling from these IPPs, and demonstrates how this approach can accelerate the application of PDMPs for inference in BNNs. Experimentation illustrates how inference with these methods is computationally feasible, can improve predictive accuracy and MCMC mixing performance, and can provide informative uncertainty measurements when compared against other approximate inference schemes.

Submitted 19 October, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

Comments: Includes correction to software and corrigendum note

arXiv:2302.03314 [pdf, other]

Federated Variational Inference Methods for Structured Latent Variable Models

Authors: Conor Hassan, Robert Salomone, Kerrie Mengersen

Abstract: Federated learning methods enable model training across distributed data sources without data leaving their original locations and have gained increasing interest in various fields. However, existing approaches are limited, excluding many structured probabilistic models. We present a general and elegant solution based on structured variational inference, widely used in Bayesian machine learning, adapted for the federated setting. Additionally, we provide a communication-efficient variant analogous to the canonical FedAvg algorithm. The proposed algorithms' effectiveness is demonstrated, and their performance is compared with hierarchical Bayesian neural networks and topic models.

Submitted 7 July, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

arXiv:2212.06944 [pdf, other]

doi 10.21203/rs.3.rs-1868132/v1

Where are the vulnerable children? Identification and comparison of clusters of young children with health and developmental vulnerabilities across Queensland

Authors: Wala Draidi Areed, Aiden Price, Kathryn Arnett, Kerrie Mengersen, Helen Thompson

Abstract: This study aimed to better understand the vulnerability of 5 to 6 year old children in their first year of school, based on five health and development domains. Identification of subgroups of children within these domains can lead to more targeted research and policies to reduce these vulnerabilities. The study focused on finding clusters of geographical regions with high and low proportions of vulnerable children in Queensland, Australia. K-means analysis was conducted on data from the Australian Early Development Census and the Australian Bureau of Statistics. The clusters were then compared with respect to their geographic locations and risk factor profiles. The results are made publicly available via an interactive dashboard application in R Shiny.

Submitted 13 December, 2022; originally announced December 2022.

arXiv:2211.10029 [pdf, other]

Being Bayesian in the 2020s: opportunities and challenges in the practice of modern applied Bayesian statistics

Authors: Joshua J. Bon, Adam Bretherton, Katie Buchhorn, Susanna Cramb, Christopher Drovandi, Conor Hassan, Adrianne L. Jenner, Helen J. Mayfield, James M. McGree, Kerrie Mengersen, Aiden Price, Robert Salomone, Edgar Santos-Fernandez, Julie Vercelloni, Xiaoyu Wang

Abstract: Building on a strong foundation of philosophy, theory, methods and computation over the past three decades, Bayesian approaches are now an integral part of the toolkit for most statisticians and data scientists. Whether they are dedicated Bayesians or opportunistic users, applied professionals can now reap many of the benefits afforded by the Bayesian paradigm. In this paper, we touch on six modern opportunities and challenges in applied Bayesian statistics: intelligent data collection, new data sources, federated analysis, inference for implicit models, model transfer and purposeful software products.

Submitted 17 January, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: 27 pages, 8 figures

arXiv:2209.04117 [pdf, other]

clusterBMA: Bayesian model averaging for clustering

Authors: Owen Forbes, Edgar Santos-Fernandez, Paul Pao-Yen Wu, Hong-Bo Xie, Paul E. Schwenn, Jim Lagopoulos, Lia Mills, Dashiell D. Sacks, Daniel F. Hermens, Kerrie Mengersen

Abstract: Various methods have been developed to combine inference across multiple sets of results for unsupervised clustering, within the ensemble clustering literature. The approach of reporting results from one 'best' model out of several candidate clustering models generally ignores the uncertainty that arises from model selection, and results in inferences that are sensitive to the particular model and parameters chosen. Bayesian model averaging (BMA) is a popular approach for combining results across multiple models that offers some attractive benefits in this setting, including probabilistic interpretation of the combined cluster structure and quantification of model-based uncertainty. In this work we introduce clusterBMA, a method that enables weighted model averaging across results from multiple unsupervised clustering algorithms. We use clustering internal validation criteria to develop an approximation of the posterior model probability, used for weighting the results from each model. From a consensus matrix representing a weighted average of the clustering solutions across models, we apply symmetric simplex matrix factorisation to calculate final probabilistic cluster allocations. In addition to outperforming other ensemble clustering methods on simulated data, clusterBMA offers unique features including probabilistic allocation to averaged clusters, combining allocation probabilities from 'hard' and 'soft' clustering algorithms, and measuring model-based uncertainty in averaged cluster allocation.
This method is implemented in an accompanying R package of the same name.

Submitted 25 March, 2023; v1 submitted 9 September, 2022; originally announced September 2022.

arXiv:2208.02921 [pdf, other]

A flexible, random histogram kernel for discrete-time Hawkes processes

Authors: Raiha Browning, Judith Rousseau, Kerrie Mengersen

Abstract: Hawkes processes are self-exciting stochastic processes used to describe phenomena whereby past events increase the probability of the occurrence of future events. This work presents a flexible approach for modelling a variant of these, namely discrete-time Hawkes processes. Most standard models of Hawkes processes rely on a parametric form for the function describing the influence of past events, referred to as the triggering kernel. This is likely to be insufficient to capture the true excitation pattern, particularly for complex data. By utilising trans-dimensional Markov chain Monte Carlo inference techniques, our proposed model for the triggering kernel can take the form of any step function, affording significantly more flexibility than a parametric form. We first demonstrate the utility of the proposed model through a comprehensive simulation study. This includes univariate scenarios, and multivariate scenarios whereby there are multiple interacting Hawkes processes. We then apply the proposed model to several case studies: the interaction between two countries during the early to middle stages of the COVID-19 pandemic, taking Italy and France as an example, and the interaction of terrorist activity between two countries in close spatial proximity, Indonesia and the Philippines, and then within three regions of the Philippines.

Submitted 4 August, 2022; originally announced August 2022.

MSC Class: 60G55; 62F15

arXiv:2206.05369 [pdf, other]

Bayesian Design with Sampling Windows for Complex Spatial Processes

Authors: Katie Buchhorn, Kerrie Mengersen, Edgar Santos-Fernandez, Erin E. Peterson, James M. McGree

Abstract: Optimal design facilitates intelligent data collection. In this paper, we introduce a fully Bayesian design approach for spatial processes with complex covariance structures, like those typically exhibited in natural ecosystems. Coordinate Exchange algorithms are commonly used to find optimal design points. However, collecting data at specific points is often infeasible in practice. Currently, there is no provision to allow for flexibility in the choice of design. We also propose an approach to find Bayesian sampling windows, rather than points, via Gaussian process emulation to identify regions of high design efficiency across a multi-dimensional space. These developments are motivated by two ecological case studies: monitoring water temperature in a river network system in the northwestern United States and monitoring submerged coral reefs off the north-west coast of Australia.

Submitted 10 June, 2022; originally announced June 2022.

arXiv:2203.15184 [pdf, other]

doi 10.1126/sciadv.abm5952

Analysis of sloppiness in model simulations: unveiling parameter uncertainty when mathematical models are fitted to data

Authors: Gloria M. Monsalve-Bravo, Brodie A. J. Lawson, Christopher Drovandi, Kevin Burrage, Kevin S. Brown, Christopher M. Baker, Sarah A. Vollert, Kerrie Mengersen, Eve McDonald-Madden, Matthew P. Adams

Abstract: This work introduces a comprehensive approach to assess the sensitivity of model outputs to changes in parameter values, constrained by the combination of prior beliefs and data. This novel approach identifies stiff parameter combinations strongly affecting the quality of the model-data fit while simultaneously revealing which of these key parameter combinations are informed primarily by the data or are also substantively influenced by the priors. We focus on the very common context in complex systems where the amount and quality of data are low compared to the number of model parameters to be collectively estimated, and showcase the benefits of this technique for applications in biochemistry, ecology, and cardiac electrophysiology. We also show how stiff parameter combinations, once identified, uncover controlling mechanisms underlying the system being modeled and inform which of the model parameters need to be prioritized in future experiments for improved parameter inference from collective model-data fitting.

Submitted 21 September, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

Journal ref: Sci.Adv. 8(38) eabm5952 (2022)

arXiv:2203.12435 [pdf, other]

doi 10.4204/EPTCS.355.3

Stateful to Stateless: Modelling Stateless Ethereum

Authors: Sandra Johnson, David Hyland-Wood, Anders L Madsen, Kerrie Mengersen

Abstract: The concept of 'Stateless Ethereum' was conceived with the primary aim of mitigating Ethereum's unbounded state growth. The key facilitator of Stateless Ethereum is through the introduction of 'witnesses' into the ecosystem. The changes and potential consequences that these additional data packets pose on the network need to be identified and analysed to ensure that the Ethereum ecosystem can continue operating securely and efficiently. In this paper we propose a Bayesian Network model, a probabilistic graphical modelling approach, to capture the key factors and their interactions in Ethereum mainnet, the public Ethereum blockchain, focussing on the changes being introduced by Stateless Ethereum to estimate the health of the resulting Ethereum ecosystem. We use a mixture of empirical data and expert knowledge, where data are unavailable, to quantify the model. Based on the data and expert knowledge available to use at the time of modelling, the Ethereum ecosystem is expected to remain healthy following the introduction of Stateless Ethereum.

Submitted 23 March, 2022; originally announced March 2022.

Comments: In Proceedings MARS 2022, arXiv:2203.09299

Journal ref: EPTCS 355, 2022, pp. 27-39

arXiv:2203.04165 [pdf, other]

On the intrinsic dimensionality of Covid-19 data: a global perspective

Authors: Abhishek Varghese, Edgar Santos-Fernandez, Francesco Denti, Antonietta Mira, Kerrie Mengersen

Abstract: This paper aims to develop a global perspective of the complexity of the relationship between the standardised per-capita growth rate of Covid-19 cases, deaths, and the OxCGRT Covid-19 Stringency Index, a measure describing a country's stringency of lockdown policies. To achieve our goal, we use a heterogeneous intrinsic dimension estimator implemented as a Bayesian mixture model, called Hidalgo. We identify that the Covid-19 dataset may project onto two low-dimensional manifolds without significant information loss. The low dimensionality suggests strong dependency among the standardised growth rates of cases and deaths per capita and the OxCGRT Covid-19 Stringency Index for a country over 2020-2021. Given the low dimensional structure, it may be feasible to model observable Covid-19 dynamics with few parameters. Importantly, we identify spatial autocorrelation in the intrinsic dimension distribution worldwide. Moreover, we highlight that high-income countries are more likely to lie on low-dimensional manifolds, likely arising from aging populations, comorbidities, and increased per capita mortality burden from Covid-19. Finally, we temporally stratify the dataset to examine the intrinsic dimension at a more granular level throughout the Covid-19 pandemic.

Submitted 8 March, 2022; originally announced March 2022.

MSC Class: 62P10

arXiv:2202.07166 [pdf, other]

SSNbayes: An R package for Bayesian spatio-temporal modelling on stream networks

Authors: Edgar Santos-Fernandez, Jay M. Ver Hoef, James M. McGree, Daniel J. Isaak, Kerrie Mengersen, Erin E. Peterson

Abstract: Spatio-temporal models are widely used in many research areas from ecology to epidemiology. However, most covariance functions describe spatial relationships based on Euclidean distance only. In this paper, we introduce the R package SSNbayes for fitting Bayesian spatio-temporal models and making predictions on branching stream networks. SSNbayes provides a linear regression framework with multiple options for incorporating spatial and temporal autocorrelation. Spatial dependence is captured using stream distance and flow connectivity while temporal autocorrelation is modelled using vector autoregression approaches. SSNbayes provides the functionality to make predictions across the whole network, compute exceedance probabilities and other probabilistic estimates such as the proportion of suitable habitat. We illustrate the functionality of the package using a stream temperature dataset collected in Idaho, USA.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2107.12592 [pdf, other]

Detection of cybersecurity attacks through analysis of web browsing activities using principal component analysis

Authors: Insha Ullah, Kerrie Mengersen, Rob J Hyndman, James McGree

Abstract: Organizations such as government departments and financial institutions provide online service facilities accessible via an increasing number of internet connected devices which make their operational environment vulnerable to cyber attacks. Consequently, there is a need to have mechanisms in place to detect cyber security attacks in a timely manner. A variety of Network Intrusion Detection Systems (NIDS) have been proposed and can be categorized into signature-based NIDS and anomaly-based NIDS. The signature-based NIDS, which identify the misuse through scanning the activity signature against the list of known attack activities, are criticized for their inability to identify new attacks (never-before-seen attacks). Among anomaly-based NIDS, which declare a connection anomalous if it expresses deviation from a trained model, the unsupervised learning algorithms circumvent this issue since they have the ability to identify new attacks. In this study, we use an unsupervised learning algorithm based on principal component analysis to detect cyber attacks. In the training phase, our approach has the advantage of also identifying outliers in the training dataset. In the monitoring phase, our approach first identifies the affected dimensions and then calculates an anomaly score by aggregating across only those components that are affected by the anomalies.
We explore the performance of the algorithm via simulations and through two applications, namely to the UNSW-NB15 dataset recently released by the Australian Centre for Cyber Security and to the well-known KDD'99 dataset. The algorithm is scalable to large datasets in both training and monitoring phases, and the results from both the simulated and real datasets show that the method has promise in detecting suspicious network activities.

Submitted 27 July, 2021; originally announced July 2021.

arXiv:2106.01719 [pdf]

doi 10.1371/journal.pone.0287640

Understanding links between water-quality variables and nitrate concentration in freshwater streams using high-frequency sensor data

Authors: Claire Kermorvant, Benoit Liquet, Guy Litt, Kerrie Mengersen, Erin Peterson, Rob Hyndman, Jeremy B. Jones Jr., Catherine Leigh

Abstract: Real time monitoring using in situ sensors is becoming a common approach for measuring water quality within watersheds. High frequency measurements produce big data sets that present opportunities to conduct new analyses for improved understanding of water quality dynamics and more effective management of rivers and streams. Of primary importance is enhancing knowledge of the relationships between nitrate, one of the most reactive forms of inorganic nitrogen in the aquatic environment, and other water quality variables. We analysed high frequency water quality data from in situ sensors deployed in three sites from different watersheds and climate zones within the National Ecological Observatory Network, USA. We used generalised additive mixed models to explain the nonlinear relationships at each site between nitrate concentration and conductivity, turbidity, dissolved oxygen, water temperature, and elevation. Temporal autocorrelation was modelled with an autoregressive moving average model and we examined the relative importance of the explanatory variables. Total deviance explained by the models was high for all sites. Although variable importance and the smooth regression parameters differed among sites, the models explaining the most variation in nitrate contained the same explanatory variables. This study demonstrates that building a model for nitrate using the same set of explanatory water quality variables is achievable, even for sites with vastly different environmental and climatic characteristics.
Applying such models will assist managers to select cost effective water quality variables to monitor when the goals are to gain a spatially and temporally in depth understanding of nitrate dynamics and adapt management plans accordingly.

Submitted 3 June, 2021; originally announced June 2021.

Comments: 4 figures, 17 pages

MSC Class: I.2.7 ACM Class: F.2.2

arXiv:2105.02140 [pdf, other]

A Bayesian latent allocation model for clustering compositional data with application to the Great Barrier Reef

Authors: Luiza Piancastelli, Nial Friel, Julie Vercelloni, Kerrie Mengersen, Antonietta Mira

Abstract: Relative abundance is a common metric to estimate the composition of species in ecological surveys reflecting patterns of commonness and rarity of biological assemblages. Measurements of coral reef compositions formed by four communities along Australia's Great Barrier Reef (GBR) gathered between 2012 and 2017 are the focus of this paper. We undertake the task of finding clusters of transect locations with similar community composition and investigate changes in clustering dynamics over time. During these years, an unprecedented sequence of extreme weather events (cyclones and coral bleaching) impacted the 58 surveyed locations. The dependence between constituent parts of a composition presents a challenge for existing multivariate clustering approaches. In this paper, we introduce a finite mixture of Dirichlet distributions with group-specific parameters, where cluster memberships are dictated by unobserved latent variables. The inference is carried out in a Bayesian framework, where MCMC strategies are outlined to sample from the posterior model. Simulation studies are presented to illustrate the performance of the model in a controlled setting. The application of the model to the 2012 coral reef data reveals that clusters were spatially distributed in similar ways across reefs which indicates a potential influence of wave exposure at the origin of coral reef community composition. The number of clusters estimated by the model decreased from four in 2012 to two from 2014 until 2017. Posterior probabilities of transect allocations to the same cluster substantially increase through time showing a potential homogenization of community composition across the whole GBR. The Bayesian model highlights the diversity of coral reef community composition within a coral reef and rapid changes across large spatial scales that may contribute to undermining the future of the GBR's biodiversity.

Submitted 5 May, 2021; originally announced May 2021.

Comments: Paper submitted for publication

arXiv:2103.05791 [pdf, other]

doi 10.1371/journal.pone.0271457

Spatio-temporal quantile regression analysis revealing more nuanced patterns of climate change: a study of long-term daily temperature in Australia

Authors: Qibin Duan, Clare A. McGrory, Glenn Brown, Kerrie Mengersen, You-Gan Wang

Abstract: Climate change is commonly associated with an overall increase in mean temperature in a defined past time period. Many studies consider temperature trends at the global scale, but the literature is lacking in in-depth analysis of the temperature trends across Australia in recent decades. In addition to heterogeneity in mean and median values, daily Australia temperature data suffers from quasi-periodic heterogeneity in variance. However, this issue has largely been overlooked in climate research. A contribution of this article is that we propose a joint model of quantile regression and variability. By accounting appropriately for the heterogeneity in these types of data, our analysis reveals that daily maximum temperature is warming by 0.21 Celsius per decade and daily minimum temperature by 0.13 Celsius per decade. However, our modeling also shows that nuanced patterns of climate change depend on location, season, and the percentiles of the temperature series over Australia.

Submitted 9 March, 2021; originally announced March 2021.

Comments: 30 pages, 10 figures, and 3 tables

arXiv:2103.03538 [pdf, other]

doi 10.1016/j.csda.2022.107446

Bayesian spatio-temporal models for stream networks

Authors: Edgar Santos-Fernandez, Jay M. Ver Hoef, Erin E. Peterson, James McGree, Daniel Isaak, Kerrie Mengersen

Abstract: Spatio-temporal models are widely used in many research areas including ecology. The recent proliferation of the use of in-situ sensors in streams and rivers supports space-time water quality modelling and monitoring in near real-time. A new family of spatio-temporal models is introduced. These models incorporate spatial dependence using stream distance while temporal autocorrelation is captured using vector autoregression approaches. Several variations of these novel models are proposed using a Bayesian framework. The results show that our proposed models perform well using spatio-temporal data collected from real stream networks, particularly in terms of out-of-sample RMSPE. This is illustrated considering a case study of water temperature data in the northwestern United States.

Submitted 14 February, 2022; v1 submitted 5 March, 2021; originally announced March 2021.

Comments: 30 pages, 10 figs

arXiv:2011.08407 [pdf, other]

A statistical machine learning approach for benchmarking in the presence of complex contextual factors and peer groups

Authors: Daniel W. Kennedy, Jessica Cameron, Paul P. -Y. Wu, Kerrie Mengersen

Abstract: The ability to compare between individuals or organisations fairly is important for the development of robust and meaningful quantitative benchmarks. To make fair comparisons, contextual factors must be taken into account, and comparisons should only be made between similar organisations such as peer groups. Previous benchmarking methods have used linear regression to adjust for contextual factors; however, linear regression is known to be sub-optimal when nonlinear relationships exist between the comparative measure and covariates. In this paper we propose a random forest model for benchmarking that can adjust for these potential nonlinear relationships, and validate the approach in a case-study of high noise data. We provide new visualisations and numerical summaries of the fitted models and comparative measures to facilitate interpretation by both analysts and non-technical audiences. Comparisons can be made across the cohort or within peer groups, and bootstrapping provides a means of estimating uncertainty in both adjusted measures and rankings. We conclude that random forest models can facilitate fair comparisons between organisations for quantitative measures, including in cases of complex contextual factor relationships, and that the models and outputs are readily interpreted by stakeholders.

Submitted 16 November, 2020; originally announced November 2020.

Comments: 18 pages, 8 figures

arXiv:2011.08405 [pdf, other]

doi 10.1371/journal.pone.0251723

Peer groups for organisational learning: clustering with practical constraints

Authors: Daniel William Kennedy, Jessica Cameron, Paul Pao-Yen Wu, Kerrie Mengersen

Abstract: Peer-grouping is used in many sectors for organisational learning, policy implementation, and benchmarking. Clustering provides a statistical, data-driven method for constructing meaningful peer groups, but peer groups must be compatible with business constraints such as size and stability considerations. Additionally, statistical peer groups are constructed from many different variables, and can be difficult to understand, especially for non-statistical audiences. We developed methodology to apply business constraints to clustering solutions and allow the decision-maker to choose the balance between statistical goodness-of-fit and conformity to business constraints. Several tools were utilised to identify complex distinguishing features in peer groups, and a number of visualisations are developed to explain high-dimensional clusters for non-statistical audiences. In a case study where peer group size was required to be small ($\leq 100$ members), we applied constrained clustering to a noisy high-dimensional data-set over two subsequent years, ensuring that the clusters were sufficiently stable between years. Our approach not only satisfied clustering constraints on the test data, but maintained an almost monotonic negative relationship between goodness-of-fit and stability between subsequent years. We demonstrated in the context of the case study how distinguishing features between clusters can be communicated clearly to different stakeholders with substantial and limited statistical knowledge.

Submitted 16 November, 2020; originally announced November 2020.

Comments: 22 pages, 4 figures

arXiv:2006.04565 [pdf, other]

doi 10.1007/978-3-030-42553-1

A Survey of Bayesian Statistical Approaches for Big Data

Authors: Farzana Jahan, Insha Ullah, Kerrie L Mengersen

Abstract: The modern era is characterised as an era of information or Big Data. This has motivated a huge literature on new methods for extracting information and insights from these data. A natural question is how these approaches differ from those that were available prior to the advent of Big Data. We present a review of published studies that present Bayesian statistical approaches specifically for Big Data and discuss the reported and perceived benefits of these approaches. We conclude by addressing the question of whether focusing only on improving computational algorithms and infrastructure will be enough to face the challenges of Big Data.

Submitted 8 June, 2020; originally announced June 2020.

MSC Class: 62-08; 97K70; 97K80 ACM Class: G.3

Journal ref: In Mengersen K., Pudlo P., Robert C. (2020) Case Studies in Applied Bayesian Data Science. Lecture Notes in Mathematics, vol 2259. (pp. 17-44) Springer, Cham

arXiv:2006.00741 [pdf, other]

Correcting misclassification errors in crowdsourced ecological data: A Bayesian perspective

Authors: Edgar Santos-Fernandez, Erin E. Peterson, Julie Vercelloni, Em Rushworth, Kerrie Mengersen

Abstract: Many research domains use data elicited from "citizen scientists" when a direct measure of a process is expensive or infeasible. However, participants may report incorrect estimates or classifications due to their lack of skill. We demonstrate how Bayesian hierarchical models can be used to learn about latent variables of interest, while accounting for the participants' abilities. The model is described in the context of an ecological application that involves crowdsourced classifications of georeferenced coral-reef images from the Great Barrier Reef, Australia. The latent variable of interest is the proportion of coral cover, which is a common indicator of coral reef health. The participants' abilities are expressed in terms of sensitivity and specificity of a correctly classified set of points on the images. The model also incorporates a spatial component, which allows prediction of the latent variable in locations that have not been surveyed. We show that the model outperforms traditional weighted-regression approaches used to account for uncertainty in citizen science data. Our approach produces more accurate regression coefficients and provides a better characterization of the latent process of interest. This new method is implemented in the probabilistic programming language Stan and can be applied to a wide number of problems that rely on uncertain citizen science data.

Submitted 1 June, 2020; originally announced June 2020.

Comments: 18 figures, 5 tables

arXiv:2004.04620 [pdf, ps, other]

Bayesian Computation with Intractable Likelihoods

Authors: Matthew T. Moores, Anthony N. Pettitt, Kerrie Mengersen

Abstract: This article surveys computational methods for posterior inference with intractable likelihoods, that is where the likelihood function is unavailable in closed form, or where evaluation of the likelihood is infeasible. We review recent developments in pseudo-marginal methods, approximate Bayesian computation (ABC), the exchange algorithm, thermodynamic integration, and composite likelihood, paying particular attention to advancements in scalability for large datasets. We also mention R and MATLAB source code for implementations of these algorithms, where they are available.

Submitted 7 April, 2020; originally announced April 2020.

Comments: arXiv admin note: text overlap with arXiv:1503.08066

MSC Class: 62F15; 62M40

arXiv:2003.06966 [pdf, other]

Bayesian item response models for citizen science ecological data

Authors: Edgar Santos-Fernandez, Kerrie Mengersen

Abstract: So-called 'citizen science' data elicited from crowds has become increasingly popular in many fields including ecology. However, the quality of this information is being frequently debated by many within the scientific community. Therefore, modern citizen science implementations require measures of the users' proficiency that account for the difficulty of the tasks. We introduce a new methodological framework of item response and linear logistic test models with application to citizen science data used in ecology research. This approach accommodates spatial autocorrelation within the item difficulties and produces relevant ecological measures of species and site-related difficulties, discriminatory power and guessing behavior. These, along with estimates of the subject abilities allow better management of these programs and provide deeper insights. This paper also highlights the fit of item response models to big data via divide-and-conquer. We found that the suggested methods outperform the traditional item response models in terms of RMSE, accuracy, and WAIC based on leave-one-out cross-validation on simulated and empirical data. We present a comprehensive implementation using a case study of species identification in the Serengeti, Tanzania. The R and Stan codes are provided for full reproducibility. Multiple statistical illustrations and visualizations are given which allow practitioners the extrapolation to a wide range of citizen science ecological problems.

Submitted 25 May, 2020; v1 submitted 15 March, 2020; originally announced March 2020.

Comments: under review, 24 pages, 10 figures

arXiv:2003.06291 [pdf]

Improved assessment of the accuracy of record linkage via an extended MaCSim approach

Authors: Shovanur Haque, Kerrie Mengersen

Abstract: Record linkage is the process of bringing together the same entity from overlapping data sources while removing duplicates. Huge amounts of data are now being collected by public or private organizations as well as by researchers and individuals. Linking and analysing relevant information from this massive data reservoir can provide new insights into society. However, this increase in the amount of data may also increase the likelihood of incorrectly linked records among databases. It has become increasingly important to have effective and efficient methods for linking data from different sources. Therefore, it becomes necessary to assess the ability of a linking method to achieve high accuracy or to compare between methods with respect to accuracy. In this paper, we improve on a Markov Chain based Monte Carlo simulation approach (MaCSim) for assessing a linking method. MaCSim utilizes two linked files that have been previously linked on similar types of data to create an agreement matrix and then simulates the matrix using a proposed algorithm developed to generate re-sampled versions of the agreement matrix. A defined linking method is used in each simulation to link the files and the accuracy of the linking method is assessed. The improvement proposed here involves calculation of a similarity weight for every linking variable value for each record pair, which allows partial agreement of the linking variable values. A threshold is calculated for every linking variable based on adjustable parameter "tolerance" for that variable. To assess the accuracy of linking method, correctly linked proportions are investigated for each record. The extended MaCSim approach is illustrated using a synthetic dataset provided by the Australian Bureau of Statistics (ABS) based on realistic data settings. Test results show higher accuracy of the assessment of linkages.

Submitted 12 October, 2020; v1 submitted 12 March, 2020; originally announced March 2020.

Comments: 32 pages, 4 figures. arXiv admin note: text overlap with arXiv:1901.04779

arXiv:2003.05686 [pdf]

Assessing the accuracy of individual link with varying block sizes and cut-off values using MaCSim approach

Authors: Shovanur Haque, Kerrie Mengersen

Abstract: Record linkage is the process of matching together records from different data sources that belong to the same entity. Record linkage is increasingly being used by many organizations including statistical, health, government etc. to link administrative, survey, and other files to create a robust file for more comprehensive analysis. Therefore, it becomes necessary to assess the ability of a linking method to achieve high accuracy or compare between methods with respect to accuracy. In this paper, we evaluate the accuracy of individual link using varying block sizes and different cut-off values by utilizing a Markov Chain based Monte Carlo simulation approach (MaCSim). MaCSim utilizes two linked files to create an agreement matrix. The agreement matrix is simulated to generate re-sampled versions of the agreement matrix. A defined linking method is used in each simulation to link the files and the accuracy of the linking method is assessed. The aim of this paper is to facilitate optimal choice of block size and cut-off value to achieve high accuracy in terms of minimizing average False Discovery Rate and False Negative Rate. The analyses have been performed using a synthetic dataset provided by the Australian Bureau of Statistics (ABS) and indicated promising results.

Submitted 23 November, 2020; v1 submitted 12 March, 2020; originally announced March 2020.

Comments: 24 pages, 6 figures. arXiv admin note: text overlap with arXiv:1901.04779

arXiv:2002.04148 [pdf, other]

The role of intrinsic dimension in high-resolution player tracking data -- Insights in basketball

Authors: Edgar Santos-Fernandez, Francesco Denti, Kerrie Mengersen, Antonietta Mira

Abstract: A new range of statistical analysis has emerged in sports after the introduction of the high-resolution player tracking technology, specifically in basketball. However, this high dimensional data is often challenging for statistical inference and decision making. In this article, we employ Hidalgo, a state-of-the-art Bayesian mixture model that allows the estimation of heterogeneous intrinsic dimensions (ID) within a dataset and propose some theoretical enhancements. ID results can be interpreted as indicators of variability and complexity of basketball plays and games. This technique allows classification and clustering of NBA basketball player's movement and shot charts data. Analyzing movement data, Hidalgo identifies key stages of offensive actions such as creating space for passing, preparation/shooting and following through. We found that the ID value spikes reaching a peak between 4 and 8 seconds in the offensive part of the court after which it declines. In shot charts, we obtained groups of shots that produce substantially higher and lower successes. Overall, game-winners tend to have a larger intrinsic dimension which is an indication of more unpredictability and unique shot placements. Similarly, we found higher ID values in plays when the score margin is small compared to large margin ones. These outcomes could be exploited by coaches to obtain better offensive/defensive results.

Submitted 10 February, 2020; originally announced February 2020.

Comments: 21 pages, 16 figures, Codes + data + results can be found in https://github.com/EdgarSantos-Fernandez/id_basketball, Submitted

arXiv:1910.14227 [pdf, other]

Combined parameter and state inference with automatically calibrated ABC

Authors: Anthony Ebert, Pierre Pudlo, Kerrie Mengersen, Paul Wu, Christopher Drovandi

Abstract: State space models contain time-indexed parameters, termed states, as well as static parameters, simply termed parameters. The problem of inferring both static parameters as well as states simultaneously, based on time-indexed observations, is the subject of much recent literature. This problem is compounded once we consider models with intractable likelihoods. In these situations, some emerging approaches have incorporated existing likelihood-free techniques for static parameters, such as approximate Bayesian computation (ABC) into likelihood-based algorithms for combined inference of parameters and states. These emerging approaches currently require extensive manual calibration of a time-indexed tuning parameter: the acceptance threshold. We design an SMC$^2$ algorithm (Chopin et al., 2013, JRSS B) for likelihood-free approximation with automatically tuned thresholds. We prove consistency of the algorithm and discuss the proposed calibration. We demonstrate this algorithm's performance with three examples. We begin with two examples of state space models. The first example is a toy example, with an emission distribution that is a skew normal distribution. The second example is a stochastic volatility model involving an intractable stable distribution. The last example is the most challenging; it deals with an inhomogeneous Hawkes process.

Submitted 26 May, 2021; v1 submitted 30 October, 2019; originally announced October 2019.

arXiv:1910.02379 [pdf, other]

Factors associated with injurious from falls in people with early stage Parkinson's disease

Authors: Sarini Abdullah, James McGree, Nicole White, Kerrie Mengersen, Graham Kerr

Abstract: Falls are common in people with Parkinson's disease (PD) and have detrimental effects which can lower the quality of life. While studies have been conducted to learn about falling in general, factors distinguishing injurious from non-injurious falls are less clear. A two-stage Bayesian logistic regression model was used to model the association of falls and injurious falls with data measured on patients. The forward stepwise selection procedure was used to determine which patient measures were associated with falls and injurious falls, and Bayesian model averaging (BMA) was used to account for uncertainty in this variable selection procedure. Data on 99 patients for a 12-month time period were considered in this analysis. Fifty five percent of the patients experienced at least one fall, with a total of 335 falls cases; 25% of which were injurious falls. Fearful, Tinetti gait, and previous falls were the risk factors for fall/non-fall, with 77% accuracy, 76% sensitivity, and 76% specificity. Fall time, body mass index, anxiety, balance, gait, and gender were the risk factors associated with injurious falls. Thus, attaining normal body mass index, improving balance and gait could be seen as preventive efforts for injurious falls. There was no significant difference in the risk of falls between males and females, yet if falls occurred, females were more likely to get injured than males.

Submitted 6 October, 2019; originally announced October 2019.

Comments: 18 pages, 3 figures, 4 tables

MSC Class: 62P10; 62-07; 62J12 ACM Class: J.3.2; G.3.2; G.3.6

arXiv:1910.01864 [pdf, other]

Profile regression for subgrouping patients with early stage Parkinson's disease

Authors: Sarini Abdullah, James McGree, Nicole White, Kerrie Mengersen, Graham Kerr

Abstract: Falls are detrimental to people with Parkinson's Disease (PD) because of the potentially severe consequences to the patients' quality of life. While many studies have attempted to predict falls/non-falls, this study aimed to determine factors related to falls frequency in people with early PD. Ninety nine participants with early stage PD were assessed based on two types of tests. The first type of tests is disease-specific tests, comprised of the Unified Parkinson's Disease Rating Scale (UPDRS) and the Schwab and England activities of daily living scale (SEADL). A measure of postural instability and gait disorder (PIGD) and subtotal scores for subscales I, II, and III were derived from the UPDRS. The second type of tests is functional tests, including Tinetti gait and balance, Berg Balance Scale (BBS), Timed-Up and Go (TUG), Functional Reach (FR), Freezing of Gait (FOG), Mini Mental State Examination (MMSE), and Melbourne Edge Test (MET). Falls were recorded each month for 6 months. Clustering of patients via Finite Mixture Model (FMM) was conducted. Three clusters of patients were found: non- or single-fallers, low frequency fallers, and high frequency fallers. Several factors that are important to clustering PD patients were identified: UPDRS subscales II and III subtotals, PIGD and SEADL. However, these factors could not differentiate low frequency fallers from high frequency fallers. In contrast, Tinetti, TUG, and BBS turned out to be important factors in clustering PD patients, and could differentiate the three clusters. FMM is able to cluster people with PD into three groups. We obtain several factors important to explaining the clusters and also found different roles of disease specific measures and functional tests in clustering PD patients. Upon examining these measures, it might be possible to develop new disease treatment to prevent, or to delay, the occurrence of falls.

Submitted 4 October, 2019; originally announced October 2019.

Comments: 30 pages, 11 figures, 4 tables

MSC Class: 62-07; 62P10; 62H30 ACM Class: G.3.6; G.3.14; J.3.2

arXiv:1910.01313 [pdf, other]

Assessing the predictive ability of the UPDRS for falls classification in early stage Parkinson's disease

Authors: Sarini Abdullah, Nicole White, James McGree, Kerrie Mengersen, Graham Kerr

Abstract: Identification of risk factors associated with falls in people with Parkinson's Disease (PD) is important due to their high risk of falling. In this study, various ways of utilizing the Unified Parkinson's Disease Rating Scale (UPDRS) were assessed for the identification of risk factors and for the prediction of falls. Three statistical methods for classification were considered: decision trees, random forests, and logistic regression. UPDRS measurements on 51 participants with early stage PD, who completed monthly falls diaries for 12 months of follow-up, were analyzed. All classification methods applied produced similar results with regard to classification accuracy and the selected important variables. The highest classification rates were obtained from the model with individual items of the UPDRS, with 80% accuracy (85% sensitivity and 77% specificity), higher than in any previous study. A comparison of the independent performance of the four parts of the UPDRS revealed comparably high classification rates for Parts II and III of the UPDRS. Similar patterns with slightly different classification rates were observed for the 6- and 12-month follow-up times. Consistent predictors for falls selected by all classification methods at the two follow-up times are: thought disorder for UPDRS I, dressing and falling for UPDRS II, hand pronate/supinate for UPDRS III, and sleep disturbance and symptomatic orthostasis for UPDRS IV. For the aggregate measures, subtotal 2 (sum of UPDRS II items) and bradykinesia showed high association with fall/non-fall.
Fall/non-fall occurrences were more associated with individual items of the UPDRS than with the aggregate measures. UPDRS parts II and III produced comparably high classification rates for fall/non-fall prediction. Similar results were obtained for modelling data at 6-month and 12-month follow-up times.

Submitted 3 October, 2019; originally announced October 2019.

Comments: 29 pages, 7 figures, 5 tables

MSC Class: 62P10; 62-07; 62H30 ACM Class: G.3.6; G.3.7; J.3.2

arXiv:1909.02169 [pdf]

doi 10.1371/journal.pcbi.1007878

Estimating a novel stochastic model for within-field disease dynamics of banana bunchy top virus via approximate Bayesian computation

Authors: Abhishek Varghese, Christopher Drovandi, Kerrie Mengersen, Antonietta Mira

Abstract: The Banana Bunchy Top Virus (BBTV) is one of the most economically important vector-borne banana diseases throughout the Asia-Pacific Basin and presents a significant challenge to the agricultural sector. Current models of BBTV are largely deterministic, limited by an incomplete understanding of interactions in complex natural systems, and the appropriate identification of parameters. A stochastic network-based Susceptible-Infected model has been created which simulates the spread of BBTV across the subsections of a banana plantation, parameterising nodal recovery, neighbouring and distant infectivity across summer and winter. Findings from posterior results achieved through a Markov chain Monte Carlo approach to approximate Bayesian computation suggest seasonality in all parameters, which are influenced by correlated changes in inspection accuracy, temperatures and aphid activity. This paper demonstrates how the model may be used for monitoring and forecasting of various disease management strategies to support policy-level decision making.

Submitted 16 March, 2020; v1 submitted 4 September, 2019; originally announced September 2019.

Comments: 40 pages, 16 figures

MSC Class: 62P12

Showing 1–50 of 81 results for author: Mengersen, K