Statistics
Showing new listings for Wednesday, 2 October 2024
- [1] arXiv:2410.00078 [pdf, html, other]
Title: Shuffled Linear Regression via Spectral Matching
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Spectral Theory (math.SP); Machine Learning (stat.ML)
Shuffled linear regression (SLR) seeks to estimate latent features through a linear transformation, complicated by unknown permutations in the measurement dimensions. This problem extends traditional least-squares (LS) and Least Absolute Shrinkage and Selection Operator (LASSO) approaches by jointly estimating the permutation, resulting in shuffled LS and shuffled LASSO formulations. Existing methods, constrained by the combinatorial complexity of permutation recovery, often address small-scale cases with limited measurements. In contrast, we focus on large-scale SLR, particularly suited for environments with abundant measurement samples. We propose a spectral matching method that efficiently resolves permutations by aligning spectral components of the measurement and feature covariances. Rigorous theoretical analyses demonstrate that our method achieves accurate estimates in both shuffled LS and shuffled LASSO settings, given a sufficient number of samples. Furthermore, we extend our approach to address simultaneous pose and correspondence estimation in image registration tasks. Experiments on synthetic datasets and real-world image registration scenarios show that our method outperforms existing algorithms in both estimation accuracy and registration performance.
- [2] arXiv:2410.00116 [pdf, html, other]
Title: Bayesian Calibration in a multi-output transposition context
Authors: Gilles Defaux, Cédric Durantin, Josselin Garnier, Baptiste Kerleguer, Guillaume Perrin, Charlie Sire
Comments: Submitted to International Journal for Uncertainty Quantification
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Bayesian calibration is an effective approach for ensuring that numerical simulations accurately reflect the behavior of physical systems. However, because numerical models are never perfect, a discrepancy known as model error exists between the model outputs and the observed data, and must be quantified. Conventional methods cannot be implemented in transposition situations, such as when a model has multiple outputs but only one is experimentally observed. To account for the model error in this context, we propose augmenting the calibration process by introducing additional input numerical parameters through a hierarchical Bayesian model, which includes hyperparameters for the prior distribution of the calibration variables. Importance sampling estimators are used to avoid increasing computational costs. Performance metrics are introduced to assess the proposed probabilistic model and the accuracy of its predictions. The method is applied to a computer code with three outputs that models the Taylor cylinder impact test. The outputs are considered as the observed variables one at a time, to work with three different transposition situations. The proposed method is compared with other approaches that embed model errors to demonstrate the significance of the hierarchical formulation.
- [3] arXiv:2410.00125 [pdf, html, other]
Title: Relative Cumulative Residual Information Measure
Subjects: Methodology (stat.ME)
In this paper, we develop a relative cumulative residual information (RCRI) measure that quantifies the divergence between two survival functions. The dynamic relative cumulative residual information (DRCRI) measure is also introduced. We establish some characterization results under the proportional hazards model assumption. Additionally, we obtain non-parametric estimators of the RCRI and DRCRI measures based on a kernel density type estimator for the survival function. The effectiveness of the estimators is assessed through an extensive Monte Carlo simulation study. We use data from the third Gaia data release (Gaia DR3) to demonstrate the use of the proposed measure. For this study, we collected epoch photometry data for the objects Gaia DR3 4111834567779557376 and Gaia DR3 5090605830056251776.
- [4] arXiv:2410.00142 [pdf, html, other]
Title: On the posterior property of the Rician distribution
Subjects: Methodology (stat.ME)
The Rician distribution, a well-known statistical distribution frequently encountered in fields like magnetic resonance imaging and wireless communications, is particularly useful for describing many real phenomena such as signal processing data. In this paper, we introduce objective Bayesian inference for the Rician distribution parameters; specifically, the Jeffreys rule and Jeffreys priors are derived. We prove that the first of these priors leads to an improper posterior, while the Jeffreys prior leads to a proper one. To evaluate the effectiveness of our proposed Bayesian estimation method, we perform extensive numerical simulations and compare the results with those obtained from traditional moment-based and maximum likelihood estimators. Our simulations illustrate that the Bayesian estimators derived from the Jeffreys prior provide nearly unbiased estimates, showcasing the advantages of our approach over classical techniques.
- [5] arXiv:2410.00183 [pdf, html, other]
Title: Generalised mixed effects models for changepoint analysis of biomedical time series data
Subjects: Methodology (stat.ME)
Motivated by two distinct types of biomedical time series data, digital health monitoring and neuroimaging, we develop a novel approach for changepoint analysis that uses a generalised linear mixed model framework. The generalised linear mixed model framework lets us incorporate structure that is usually present in biomedical time series data. We embed the mixed model in a dynamic programming algorithm for detecting multiple changepoints in the fMRI data. We evaluate the performance of our proposed method across several scenarios using simulations. Finally, we show the utility of our proposed method on our two distinct motivating applications.
- [6] arXiv:2410.00219 [pdf, html, other]
Title: Improved performance guarantees for Tukey's median
Subjects: Statistics Theory (math.ST); Probability (math.PR)
Is there a natural way to order data in dimension greater than one? The approach based on the notion of data depth, often associated with the name of John Tukey, is among the most popular. Tukey's depth has found applications in robust statistics, graph theory, and the study of elections and social choice. We present improved performance guarantees for empirical Tukey's median, a deepest point associated with the given sample, when the data-generating distribution is elliptically symmetric and possibly anisotropic. Some of our results remain valid in the class of affine equivariant estimators. As a corollary of our bounds, we show that the diameter of the set of all empirical Tukey's medians scales like $o(n^{-1/2})$ where $n$ is the sample size.
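To make the notion of data depth concrete, the following is a minimal sketch of Tukey (halfspace) depth in two dimensions. It is not taken from the paper: the function name `tukey_depth_2d` and the direction grid are illustrative assumptions, and scanning a finite grid of directions only gives an upper-bound approximation of the exact depth.

```python
import math


def tukey_depth_2d(point, sample, n_directions=360):
    """Approximate halfspace depth of `point` with respect to `sample`.

    For each direction u on a grid, count the sample points lying in the
    closed halfplane {z : <u, z - point> >= 0}; the depth is the smallest
    such fraction. A finite grid can only overestimate the true minimum.
    """
    px, py = point
    smallest = len(sample)
    for k in range(n_directions):
        angle = 2 * math.pi * k / n_directions
        ux, uy = math.cos(angle), math.sin(angle)
        count = sum(
            1 for x, y in sample if (x - px) * ux + (y - py) * uy >= 0
        )
        smallest = min(smallest, count)
    return smallest / len(sample)


# Hypothetical toy sample: the four corners of a square.
square = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
center_depth = tukey_depth_2d((0, 0), square)
outside_depth = tukey_depth_2d((10, 10), square)
```

A Tukey median is any point maximizing this depth; in the toy sample the center of the square is deeper than a point far outside it.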
- [7] arXiv:2410.00229 [pdf, html, other]
Title: Stochastic Inverse Problem: stability, regularization and Wasserstein gradient flow
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
Inverse problems in physical or biological sciences often involve recovering an unknown parameter that is random. The sought-after quantity is a probability distribution of the unknown parameter that produces data aligning with the measurements. Consequently, these problems are naturally framed as stochastic inverse problems. In this paper, we explore three aspects of this problem: direct inversion, variational formulation with regularization, and optimization via gradient flows, drawing parallels with deterministic inverse problems. A key difference from the deterministic case is the space in which we operate. Here, we work within probability space rather than Euclidean or Sobolev spaces, making tools from measure transport theory necessary for the study. Our findings reveal that the choice of metric -- both in the design of the loss function and in the optimization process -- significantly impacts the stability and properties of the optimizer.
- [8] arXiv:2410.00259 [pdf, html, other]
Title: Robust Emax Model Fitting: Addressing Nonignorable Missing Binary Outcome in Dose-Response Analysis
Subjects: Methodology (stat.ME); Applications (stat.AP)
The Binary Emax model is widely employed in dose-response analysis during drug development, where missing data often pose significant challenges. Addressing nonignorable missing binary responses, where the likelihood of missing data is related to unobserved outcomes, is particularly important, yet existing methods often lead to biased estimates. This issue is compounded when using the regulatory-recommended "imputing as treatment failure" approach, known as non-responder imputation (NRI). Moreover, the problem of separation, where a predictor perfectly distinguishes between outcome classes, can further complicate likelihood maximization. In this paper, we introduce a penalized likelihood-based method that integrates a modified Expectation-Maximization algorithm in the spirit of Ibrahim and Lipsitz to effectively manage both nonignorable missing data and separation issues. Our approach applies a noninformative Jeffreys prior to the likelihood, reducing bias in parameter estimation. Simulation studies demonstrate that our method outperforms existing methods, such as NRI, and the superiority is further supported by its application to data from a Phase II clinical trial. Additionally, we have developed an R package, ememax, to facilitate the implementation of the proposed method.
- [9] arXiv:2410.00300 [pdf, html, other]
Title: Visualization for departures from symmetry with the power-divergence-type measure in two-way contingency tables
Subjects: Methodology (stat.ME)
When the row and column variables of a two-way contingency table consist of the same categories, the table is called a square contingency table. Since square contingency tables clearly have an association structure, a primary objective is to examine symmetric relationships and transitions between variables. While various models and measures have been proposed to analyze these structures and to understand changes in behavior between two variables at two time points or cohorts, a detailed investigation of individual categories and their interrelationships, such as shifts in brand preferences, is also necessary. This paper proposes a novel approach to correspondence analysis (CA) for evaluating departures from symmetry in square contingency tables with nominal categories, using a power-divergence-type measure. The approach ensures that well-known divergences can also be visualized and, regardless of the divergence used, the CA plot consists of two principal axes with equal contribution rates. Additionally, the scaling is independent of sample size, making it well-suited for comparing departures from symmetry across multiple contingency tables. Confidence regions are also constructed to enhance the accuracy of the CA plot.
- [10] arXiv:2410.00338 [pdf, html, other]
Title: The generalized Nelson--Aalen estimator by inverse probability of treatment weighting
Subjects: Methodology (stat.ME)
Inverse probability of treatment weighting (IPTW) has been widely applied in causal inference. For time-to-event outcomes, IPTW is performed by weighting the event counting process and at-risk process, resulting in a generalized Nelson--Aalen estimator for population-level hazards. In the presence of competing events, we adopt the counterfactual cumulative incidence of a primary event as the estimand. When the propensity score is estimated, we derive the influence function of the hazard estimator, and then establish the asymptotic property of the incidence estimator. We show that the uncertainty in the estimated propensity score contributes to an additional variation in the IPTW estimator of the cumulative incidence. However, through simulation and real-data application, we find that the additional variation is usually small.
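The weighting of the counting and at-risk processes described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `weighted_nelson_aalen` is hypothetical, the weights are taken as given (for example, inverse estimated propensity scores within one treatment arm), and competing events are ignored.

```python
def weighted_nelson_aalen(times, events, weights):
    """Return [(t, cumulative_hazard)] at each distinct event time.

    times   : observed follow-up times
    events  : 1 if the event was observed at that time, 0 if censored
    weights : per-subject weights (assumed given, e.g. IPTW weights)
    """
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    cumulative = 0.0
    curve = []
    for t in event_times:
        # Weighted count of events occurring exactly at time t.
        weighted_events = sum(
            w for ti, d, w in zip(times, events, weights) if d == 1 and ti == t
        )
        # Weighted count of subjects still at risk just before time t.
        weighted_at_risk = sum(w for ti, w in zip(times, weights) if ti >= t)
        cumulative += weighted_events / weighted_at_risk
        curve.append((t, cumulative))
    return curve


# Toy data (made up for illustration): three subjects, one censored at t=2.
unweighted = weighted_nelson_aalen([1, 2, 3], [1, 0, 1], [1.0, 1.0, 1.0])
weighted = weighted_nelson_aalen([1, 2, 3], [1, 0, 1], [2.0, 1.0, 1.0])
```

With unit weights the function reduces to the ordinary Nelson--Aalen estimator; changing the weights shifts each hazard increment toward the more heavily weighted subjects.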
- [11] arXiv:2410.00370 [pdf, html, other]
Title: Covariate Adjusted Functional Mixed Membership Models
Authors: Nicholas Marco, Damla Şentürk, Shafali Jeste, Charlotte DiStefano, Abigail Dickinson, Donatello Telesca
Comments: 71 pages including supplement
Subjects: Methodology (stat.ME)
Mixed membership models are a flexible class of probabilistic data representations used for unsupervised and semi-supervised learning, allowing each observation to partially belong to multiple clusters or features. In this manuscript, we extend the framework of functional mixed membership models to allow for covariate-dependent adjustments. The proposed model utilizes a multivariate Karhunen-Loève decomposition, which allows for a scalable and flexible model. Within this framework, we establish a set of sufficient conditions ensuring the identifiability of the mean, covariance, and allocation structure up to a permutation of the labels. This manuscript is primarily motivated by studies on functional brain imaging through electroencephalography (EEG) of children with autism spectrum disorder (ASD). Specifically, we are interested in characterizing the heterogeneity of alpha oscillations for typically developing (TD) children and children with ASD. Since alpha oscillations are known to change as children develop, we aim to characterize the heterogeneity of alpha oscillations conditionally on the age of the child. Using the proposed framework, we were able to gain novel information on the developmental trajectories of alpha oscillations for children with ASD and how the developmental trajectories differ between TD children and children with ASD.
- [12] arXiv:2410.00397 [pdf, html, other]
Title: A Generalized Mean Approach for Distributed-PCA
Comments: 17 pages, 1 table, 1 figure
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Principal component analysis (PCA) is a widely used technique for dimension reduction. As datasets continue to grow in size, distributed-PCA (DPCA) has become an active research area. A key challenge in DPCA lies in efficiently aggregating results across multiple machines or computing nodes due to computational overhead. Fan et al. (2019) introduced a pioneering DPCA method to estimate the leading rank-$r$ eigenspace, aggregating local rank-$r$ projection matrices by averaging. However, their method does not utilize eigenvalue information. In this article, we propose a novel DPCA method that incorporates eigenvalue information to aggregate local results via the matrix $\beta$-mean, which we call $\beta$-DPCA. The matrix $\beta$-mean offers a flexible and robust aggregation method through the adjustable choice of $\beta$ values. Notably, for $\beta=1$, it corresponds to the arithmetic mean; for $\beta=-1$, the harmonic mean; and as $\beta \to 0$, the geometric mean. Moreover, the matrix $\beta$-mean is shown to associate with the matrix $\beta$-divergence, a subclass of the Bregman matrix divergence, to support the robustness of $\beta$-DPCA. We also study the stability of eigenvector ordering under eigenvalue perturbation for $\beta$-DPCA. The performance of our proposal is evaluated through numerical studies.
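The three special cases listed above (arithmetic, harmonic, geometric) are the same ones satisfied by the classical power mean of positive numbers. The sketch below shows only that scalar analogue; it assumes the scalar form $(\frac{1}{n}\sum_i x_i^\beta)^{1/\beta}$, and the matrix $\beta$-mean defined in the paper may differ in its details. The name `power_mean` is illustrative.

```python
import math


def power_mean(values, beta):
    """Scalar power mean of positive numbers with exponent `beta`.

    beta = 1 gives the arithmetic mean, beta = -1 the harmonic mean,
    and beta = 0 is defined by its limit, the geometric mean.
    """
    n = len(values)
    if beta == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** beta for v in values) / n) ** (1.0 / beta)


data = [1.0, 4.0]
arithmetic = power_mean(data, 1)    # (1 + 4) / 2
harmonic = power_mean(data, -1)     # 2 / (1/1 + 1/4)
geometric = power_mean(data, 0)     # sqrt(1 * 4)
near_zero = power_mean(data, 1e-6)  # close to the geometric mean
```

Smaller values of `beta` pull the mean toward the smaller inputs, which is one informal way to read the adjustable robustness mentioned in the abstract.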
- [13] arXiv:2410.00415 [pdf, html, other]
Title: Singularities in bivariate normal mixtures
Comments: 12 pages, 5 figures
Subjects: Statistics Theory (math.ST); Geometric Topology (math.GT)
We investigate mappings $F = (f_1, f_2) \colon \mathbb{R}^2 \to \mathbb{R}^2 $ where $ f_1, f_2 $ are bivariate normal densities from the perspective of singularity theory of mappings, motivated by the need to understand properties of two-component bivariate normal mixtures. We show a classification of mappings $ F = (f_1, f_2) $ via $\mathcal{A}$-equivalence and characterize them using statistical notions. Our analysis reveals three distinct types, each with specific geometric properties. Furthermore, we determine the upper bounds for the number of modes in the mixture for each type.
- [14] arXiv:2410.00429 [pdf, html, other]
Title: Optimal Designs for Regression on Lie Groups
Subjects: Statistics Theory (math.ST)
We consider a linear regression model with complex-valued response and predictors from a compact and connected Lie group. The regression model is formulated in terms of eigenfunctions of the Laplace-Beltrami operator on the Lie group. We show that the normalized Haar measure is an approximate optimal design with respect to all Kiefer's $\Phi_p$-criteria. Inspired by the concept of $t$-designs in the field of algebraic combinatorics, we then consider so-called $\lambda$-designs in order to construct exact $\Phi_p$-optimal designs for fixed sample sizes in the considered regression problem. In particular, we explicitly construct $\Phi_p$-optimal designs for regression models with predictors in the Lie groups $\mathrm{SU}(2)$ and $\mathrm{SO}(3)$, the groups of $2\times 2$ unitary matrices and $3\times 3$ orthogonal matrices with determinant equal to $1$, respectively. We also discuss the advantages of the derived theoretical results in a concrete biological application.
- [15] arXiv:2410.00496 [pdf, html, other]
Title: Grand Challenges in Bayesian Computation
Subjects: Computation (stat.CO)
This article appeared in the September 2024 issue (Vol. 31, No. 3) of the Bulletin of the International Society for Bayesian Analysis (ISBA).
- [16] arXiv:2410.00546 [pdf, html, other]
Title: Some notes on the $k$-means clustering for missing data
Comments: 16 pages, 4 figures
Subjects: Statistics Theory (math.ST)
The classical $k$-means clustering requires a complete data matrix without missing entries. As a natural extension of the $k$-means clustering for missing data, the $k$-POD clustering has been proposed, which ignores the missing entries in the $k$-means clustering. This paper shows the inconsistency of the $k$-POD clustering even under the missing completely at random mechanism. More specifically, the expected loss of the $k$-POD clustering can be represented as the weighted sum of the expected $k$-means losses with parts of variables. Thus, the $k$-POD clustering converges to a clustering different from that of the $k$-means clustering as the sample size goes to infinity. This result indicates that although the $k$-means clustering works well, the $k$-POD clustering may fail to capture the hidden cluster structure. On the other hand, for high-dimensional data, the $k$-POD clustering could be a suitable choice when the missing rate in each variable is low.
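As a rough illustration of a loss that "ignores the missing entries", the sketch below evaluates a $k$-means-style objective over observed coordinates only. It is an assumption-laden toy, not the paper's formulation: missing values are encoded as `None`, the name `observed_entry_loss` is hypothetical, and no optimization over centers is performed.

```python
def observed_entry_loss(rows, centers):
    """Sum over rows of the smallest squared distance to any center,
    computed only on the coordinates that are observed (not None)."""
    total = 0.0
    for row in rows:
        best = min(
            sum((x - c) ** 2 for x, c in zip(row, center) if x is not None)
            for center in centers
        )
        total += best
    return total


# Toy data: the second coordinate of the first row is missing.
rows = [[0.0, None], [1.0, 1.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]
loss_two_centers = observed_entry_loss(rows, centers)
loss_one_center = observed_entry_loss([[0.0, None]], [[1.0, 5.0]])
```

In the second call the missing coordinate contributes nothing, so only the first coordinate's squared error is counted; each row is therefore compared to the centers on a different subset of variables, which is the structure behind the weighted-sum representation mentioned in the abstract.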
- [17] arXiv:2410.00566 [pdf, html, other]
Title: Research Frontiers in Ambit Stochastics: In memory of Ole E. Barndorff-Nielsen
Subjects: Methodology (stat.ME); Probability (math.PR); Statistics Theory (math.ST); Applications (stat.AP)
This article surveys key aspects of ambit stochastics and remembers Ole E. Barndorff-Nielsen's important contributions to the foundation and advancement of this new research field over the last two decades. It also highlights some of the emerging trends in ambit stochastics.
- [18] arXiv:2410.00574 [pdf, html, other]
Title: Asymmetric GARCH modelling without moment conditions
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
There is a serious and long-standing restriction in the literature on heavy-tailed phenomena in that moment conditions, which are unrealistic, are almost always assumed in modelling such phenomena. Further, the issue of stability is often insufficiently addressed. To this end, we develop a comprehensive statistical inference for an asymmetric generalized autoregressive conditional heteroskedasticity model with standardized non-Gaussian symmetric stable innovation (sAGARCH) in a unified framework, covering both the stationary case and the explosive case. We consider first the maximum likelihood estimation of the model including the asymptotic properties of the estimator of the stable exponent parameter among others. We then propose a modified Kolmogorov-type test statistic for diagnostic checking, as well as those for strict stationarity and asymmetry testing. We conduct Monte Carlo simulation studies to examine the finite-sample performance of our entire statistical inference procedure. We include empirical examples of stock returns to highlight the usefulness and merits of our sAGARCH model.
- [19] arXiv:2410.00620 [pdf, html, other]
Title: Differentiable Interacting Multiple Model Particle Filtering
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
We propose a sequential Monte Carlo algorithm for parameter learning when the studied model exhibits random discontinuous jumps in behaviour. To facilitate the learning of high dimensional parameter sets, such as those associated to neural networks, we adopt the emerging framework of differentiable particle filtering, wherein parameters are trained by gradient descent. We design a new differentiable interacting multiple model particle filter to be capable of learning the individual behavioural regimes and the model which controls the jumping simultaneously. In contrast to previous approaches, our algorithm allows control of the computational effort assigned per regime whilst using the probability of being in a given regime to guide sampling. Furthermore, we develop a new gradient estimator that has a lower variance than established approaches and remains fast to compute, for which we prove consistency. We establish new theoretical results of the presented algorithms and demonstrate superior numerical performance compared to the previous state-of-the-art algorithms.
- [20] arXiv:2410.00627 [pdf, html, other]
Title: Parallel state estimation for systems with integrated measurements
Subjects: Computation (stat.CO); Distributed, Parallel, and Cluster Computing (cs.DC)
This paper presents parallel-in-time state estimation methods for systems with Slow-Rate inTegrated Measurements (SRTM). Integrated measurements are common in various applications, and they appear in analysis of data resulting from processes that require material collection or integration over the sampling period. Current state estimation methods for SRTM are inherently sequential, preventing temporal parallelization in their standard form. This paper proposes parallel Bayesian filters and smoothers for linear Gaussian SRTM models. For that purpose, we develop a novel smoother for SRTM models and develop parallel-in-time filters and smoothers for them using an associative scan-based parallel formulation. Empirical experiments run on a GPU demonstrate the superior time complexity of the proposed methods over traditional sequential approaches.
- [21] arXiv:2410.00662 [pdf, html, other]
Title: Bias in mixed models when analysing longitudinal data subject to irregular observation: when should we worry about it and how can recommended visit intervals help in specifying joint models when needed?
Comments: 43 pages, 4 tables, 6 figures
Subjects: Methodology (stat.ME)
In longitudinal studies using routinely collected data, such as electronic health records (EHRs), patients tend to have more measurements when they are unwell; this informative observation pattern may lead to bias. While semi-parametric approaches to modelling longitudinal data subject to irregular observation are known to be sensitive to misspecification of the visit process, parametric models may provide a more robust alternative. Robustness of parametric models on the outcome alone has been assessed under the assumption that the visit intensity is independent of the time since the last visit, given the covariates and random effects. However, this assumption of a memoryless visit process may not be realistic in the context of EHR data. In a special case which includes memory embedded into the visit process, we derive an expression for the bias in parametric models for the outcome alone and use this to identify factors that lead to increasing bias. Using simulation studies, we show that this bias is often small in practice. We suggest diagnostics for identifying the specific cases when the outcome model may be susceptible to meaningful bias, and propose a novel joint model of the outcome and visit processes that can eliminate or reduce the bias. We apply these diagnostics and the joint model to a study of juvenile dermatomyositis. We recommend that future studies using EHR data avoid relying only on the outcome model and instead first evaluate its appropriateness with our proposed diagnostics, applying our proposed joint model if necessary.
- [22] arXiv:2410.00677 [pdf, other]
Title: Nonparametric Diffusivity Estimation for the Stochastic Heat Equation from Noisy Observations
Subjects: Statistics Theory (math.ST)
We nonparametrically estimate the spatially varying diffusivity of a stochastic heat equation from observations perturbed by additional noise. To that end, we employ a two-step localization procedure; more precisely, we combine local state estimates into a locally linear regression approach. Our analysis relies on quantitative Trotter--Kato type approximation results for the heat semigroup that are of independent interest. The presence of observational noise leads to non-standard scaling behaviour of the model. Numerical simulations illustrate the results.
- [23] arXiv:2410.00781 [pdf, html, other]
Title: Modeling Neural Switching via Drift-Diffusion Models
Subjects: Methodology (stat.ME)
Neural encoding, or neural representation, is a field in neuroscience that focuses on characterizing how information is encoded in the spiking activity of neurons. Currently, little is known about how sensory neurons can preserve information from multiple stimuli given their broad receptive fields. Multiplexing is a neural encoding theory that posits that neurons temporally switch between encoding various stimuli in their receptive field. Here, we construct a statistically falsifiable single-neuron model for multiplexing using a competition-based framework. The spike train models are constructed using drift-diffusion models, implying an integrate-and-fire framework to model the temporal dynamics of the membrane potential of the neuron. In addition to a multiplexing-specific model, we develop alternative models that represent alternative encoding theories (normalization, winner-take-all, subadditivity, etc.) with some level of abstraction. Using information criteria, we perform model comparison to determine whether the data favor multiplexing over alternative theories of neural encoding. Analysis of spike trains from the inferior colliculus of two macaque monkeys provides tenable evidence of multiplexing and offers new insight into the timescales at which switching occurs.
- [24] arXiv:2410.00845 [pdf, other]
Title: Control Variate-based Stochastic Sampling from the Probability Simplex
Subjects: Methodology (stat.ME)
This paper presents a control variate-based Markov chain Monte Carlo algorithm for efficient sampling from the probability simplex, with a focus on applications in large-scale Bayesian models such as latent Dirichlet allocation. Standard Markov chain Monte Carlo methods, particularly those based on Langevin diffusions, suffer from significant discretization errors near the boundaries of the simplex, which are exacerbated in sparse data settings. To address this issue, we propose an improved approach based on the stochastic Cox--Ingersoll--Ross process, which eliminates discretization errors and enables exact transition densities. Our key contribution is the integration of control variates, which significantly reduces the variance of the stochastic gradient estimator in the Cox--Ingersoll--Ross process, thereby enhancing the accuracy and computational efficiency of the algorithm. We provide a theoretical analysis showing the variance reduction achieved by the control variates approach and demonstrate the practical advantages of our method in data subsampling settings. Empirical results on large datasets show that the proposed method outperforms existing approaches in both accuracy and scalability.
- [25] arXiv:2410.00848 [pdf, html, other]
Title: An EM Gradient Algorithm for Mixture Models with Components Derived from the Manly Transformation
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Zhu and Melnykov (2018) develop a model to fit mixture models when the components are derived from the Manly transformation. Their EM algorithm utilizes Nelder-Mead optimization in the M-step to update the skew parameter, $\boldsymbol{\lambda}_g$. An alternative EM gradient algorithm is proposed, using one step of Newton's method, when initial estimates for the model parameters are good.
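For readers unfamiliar with the transformation that the skew parameter controls, here is a small sketch of the univariate Manly (exponential) transformation as it is commonly stated: $(e^{\lambda x} - 1)/\lambda$ for $\lambda \neq 0$ and $x$ for $\lambda = 0$. The abstract itself does not give the formula, so this form and the function name `manly` are assumptions; the EM gradient update is not shown.

```python
import math


def manly(x, lam):
    """Univariate Manly transformation with skew parameter `lam`.

    Uses expm1 so that values of `lam` near zero stay numerically stable;
    lam == 0 is the identity, matching the limit of the general formula.
    """
    if lam == 0:
        return x
    return math.expm1(lam * x) / lam


identity_case = manly(2.0, 0.0)    # no transformation applied
near_identity = manly(1.0, 1e-9)   # approaches x as lam -> 0
skewed = manly(1.0, 1.0)           # equals e - 1
```

Because the transformation is smooth in `lam`, its derivatives with respect to the skew parameter are available in closed form, which is what makes a Newton-type step a plausible alternative to a derivative-free search.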
- [26] arXiv:2410.00865 [pdf, html, other]
Title: How should we aggregate ratings? Accounting for personal rating scales via Wasserstein barycenters
Comments: 40 pages, 4 figures
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)
A common method of making quantitative conclusions in qualitative situations is to collect numerical ratings on a linear scale. We investigate the problem of calculating aggregate numerical ratings from individual numerical ratings and propose a new, non-parametric model for the problem. We show that, with minimal modeling assumptions, the equal-weights average is inconsistent for estimating the quality of items. Analyzing the problem from the perspective of optimal transport, we derive an alternative rating estimator, which we show is asymptotically consistent almost surely and in $L^p$ for estimating quality, with an optimal rate of convergence. Further, we generalize Kendall's W, a non-parametric coefficient of preference concordance between raters, from the special case of rankings to the more general case of arbitrary numerical ratings. Along the way, we prove Glivenko--Cantelli-type theorems for uniform convergence of the cumulative distribution functions and quantile functions for Wasserstein-2 Fréchet means on [0,1].
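One standard fact behind Wasserstein-2 Fréchet means on an interval is that, in one dimension, the barycenter's quantile function is the average of the individual quantile functions. The sketch below applies that fact to empirical distributions with equally many ratings each, where it reduces to averaging order statistics. It is not the paper's rating estimator; the name `quantile_average` and the equal-sample-size restriction are simplifying assumptions.

```python
def quantile_average(samples):
    """Wasserstein-2 barycenter of 1D empirical distributions of equal size,
    returned as a sorted list of support points (averaged order statistics)."""
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("this sketch assumes equally many ratings per rater")
    ordered = [sorted(s) for s in samples]
    return [sum(column) / len(ordered) for column in zip(*ordered)]


# Two hypothetical raters using different parts of the [0, 1] scale.
harsh_rater = [0.5, 0.0]
generous_rater = [1.0, 0.5]
barycenter = quantile_average([harsh_rater, generous_rater])
```

Averaging quantiles rather than raw scores keeps each rater's ordering of items while placing the result on a common scale, which is the intuition for accounting for personal rating scales.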
- [27] arXiv:2410.00903 [pdf, html, other]
Title: Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments
Subjects: Applications (stat.AP); Computation and Language (cs.CL); Machine Learning (cs.LG)
In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence. Specifically, we propose to use a deep generative model such as a large language model (LLM) to efficiently generate treatments and use its internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps separate the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike the existing methods, our proposed approach eliminates the need to learn causal representation from the data and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to settings in which the treatment feature is based on human perception rather than assumed to be fixed given the treatment object. We conduct simulation studies using the generated text data with an open-source LLM, Llama3, to illustrate the advantages of our estimator over the state-of-the-art causal representation learning algorithms.
- [28] arXiv:2410.00912 [pdf, html, other]
-
Title: Glucodensity Functional Profiles Outperform Traditional Continuous Glucose Monitoring Metrics
Authors: Marcos Matabuena, Rahul Ghosal, Javier Enrique Aguilar, Robert Wagner, Carmen Fernández Merino, Juan Sánchez Castro, Vadim Zipunnikov, Jukka-Pekka Onnela, Francisco Gude
Subjects: Applications (stat.AP)
Continuous glucose monitoring (CGM) data has revolutionized the management of type 1 diabetes, particularly when integrated with insulin pumps to mitigate clinical events such as hypoglycemia. Recently, there has been growing interest in utilizing CGM devices in clinical studies involving healthy and diabetes populations. However, efficiently exploiting the high temporal resolution of CGM profiles remains a significant challenge. Numerous indices -- such as time-in-range metrics and glucose variability measures -- have been proposed, but evidence suggests these metrics overlook critical aspects of glucose dynamic homeostasis. As an alternative method, this paper explores the clinical value of glucodensity metrics in capturing glucose dynamics -- specifically the speed and acceleration of CGM time series -- as new biomarkers for predicting long-term glucose outcomes. Our results demonstrate significant information gains, exceeding 20\% in terms of adjusted $R^2$, in forecasting glycosylated hemoglobin (HbA1c) and fasting plasma glucose (FPG) at five and eight years from baseline AEGIS data, compared to traditional non-CGM and CGM glucose biomarkers. These findings underscore the importance of incorporating more complex CGM functional metrics, such as the glucodensity approach, to fully capture continuous glucose fluctuations across different time-scale resolutions.
New submissions (showing 28 of 28 entries)
- [29] arXiv:2410.00068 (cross-list from eess.IV) [pdf, other]
-
Title: Denoising Variational Autoencoder as a Feature Reduction Pipeline for the diagnosis of Autism based on Resting-state fMRI
Authors: Xinyuan Zheng, Orren Ravid, Robert A.J. Barry, Yoojean Kim, Qian Wang, Young-geun Kim, Xi Zhu, Xiaofu He
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Applications (stat.AP)
Autism spectrum disorders (ASDs) are developmental conditions characterized by restricted interests and difficulties in communication. The complexity of ASD has resulted in a deficiency of objective diagnostic biomarkers. Deep learning methods have gained recognition for addressing these challenges in neuroimaging analysis, but finding and interpreting such diagnostic biomarkers are still challenging computationally. We propose an ASD feature reduction pipeline using resting-state fMRI (rs-fMRI). We used Ncuts parcellations and Power atlas to extract functional connectivity data, resulting in over 30 thousand features. Then the pipeline further compresses the connectivities into 5 latent Gaussian distributions, providing a low-dimensional representation of the data, using a denoising variational autoencoder (DVAE). To test the method, we employed the extracted latent features from the DVAE to classify ASD using traditional classifiers such as support vector machine (SVM) on a large multi-site dataset. The 95% confidence interval for the prediction accuracy of the SVM is [0.63, 0.76] after site harmonization using the extracted latent distributions. Without using DVAE, the prediction accuracy is 0.70, which falls within the interval. This implies that the model successfully encodes the diagnostic information in rs-fMRI data to 5 Gaussian distributions (10 features) without sacrificing prediction performance. The runtime for training the DVAE and obtaining classification results from its extracted latent features (37 minutes) was 7 times shorter compared to training classifiers directly on the raw connectivity matrices (5-6 hours). Our findings also suggest that the Power atlas provides more effective brain connectivity insights for diagnosing ASD than Ncuts parcellations. The encoded features can be used to aid diagnosis and interpretation of the disease.
- [30] arXiv:2410.00075 (cross-list from cs.SI) [pdf, html, other]
-
Title: Optimizing Treatment Allocation in the Presence of Interference
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Machine Learning (stat.ML)
In Influence Maximization (IM), the objective is to -- given a budget -- select the optimal set of entities in a network to target with a treatment so as to maximize the total effect. For instance, in marketing, the objective is to target the set of customers that maximizes the total response rate, resulting from both direct treatment effects on targeted customers and indirect, spillover, effects that follow from targeting these customers. Recently, new methods to estimate treatment effects in the presence of network interference have been proposed. However, the issue of how to leverage these models to make better treatment allocation decisions has been largely overlooked. Traditionally, in Uplift Modeling (UM), entities are ranked according to estimated treatment effect, and the top entities are allocated treatment. Since, in a network context, entities influence each other, the UM ranking approach will be suboptimal. The problem of finding the optimal treatment allocation in a network setting is combinatorial and generally has to be solved heuristically. To fill the gap between IM and UM, we propose OTAPI: Optimizing Treatment Allocation in the Presence of Interference to find solutions to the IM problem using treatment effect estimates. OTAPI consists of two steps. First, a causal estimator is trained to predict treatment effects in a network setting. Second, this estimator is leveraged to identify an optimal treatment allocation by integrating it into classic IM algorithms. We demonstrate that this novel method outperforms classic IM and UM approaches on both synthetic and semi-synthetic datasets.
- [31] arXiv:2410.00169 (cross-list from cs.LG) [pdf, html, other]
-
Title: (Almost) Smooth Sailing: Towards Numerical Stability of Neural Networks Through Differentiable Regularization of the Condition Number
Comments: Accepted at ICML24 Workshop: Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and Simulators
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Maintaining numerical stability in machine learning models is crucial for their reliability and performance. One approach to maintain stability of a network layer is to integrate the condition number of the weight matrix as a regularizing term into the optimization algorithm. However, due to its discontinuous nature and lack of differentiability the condition number is not suitable for a gradient descent approach. This paper introduces a novel regularizer that is provably differentiable almost everywhere and promotes matrices with low condition numbers. In particular, we derive a formula for the gradient of this regularizer which can be easily implemented and integrated into existing optimization algorithms. We show the advantages of this approach for noisy classification and denoising of MNIST images.
- [32] arXiv:2410.00225 (cross-list from cs.LG) [pdf, html, other]
-
Title: Probabilistic Classification of Near-Surface Shallow-Water Sediments using A Portable Free-Fall Penetrometer
Authors: Md Rejwanur Rahman, Adrian Rodriguez-Marek, Nina Stark, Grace Massey, Carl Friedrichs, Kelly M. Dorgan
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
The geotechnical evaluation of seabed sediments is important for engineering projects and naval applications, offering valuable insights into sediment properties, behavior, and strength. Obtaining high-quality seabed samples can be a challenging task, making in-situ testing an essential part of site characterization. Free Fall Penetrometers (FFP) have emerged as robust tools for rapidly profiling seabed surface sediments, even in energetic nearshore or estuarine conditions and shallow as well as deep depths. While methods for interpretation of traditional offshore Cone Penetration Testing (CPT) data are well-established, their adaptation to FFP data is still an area of research. In this study, we introduce an innovative approach that utilizes machine learning algorithms to create a sediment behavior classification system based on portable free fall penetrometer (PFFP) data. The proposed model leverages PFFP measurements obtained from locations such as Sequim Bay (Washington), the Potomac River, and the York River (Virginia). The result shows 91.1\% accuracy in the class prediction, with the classes representing cohesionless sediment with little to no plasticity, cohesionless sediment with some plasticity, cohesive sediment with low plasticity, and cohesive sediment with high plasticity. The model prediction not only provides the predicted class but also yields an estimate of inherent uncertainty associated with the prediction, which can provide valuable insight about different sediment behaviors. These uncertainties typically range from very low to very high, with lower uncertainties being more common, but they can increase significantly depending on variations in sediment composition, environmental conditions, and operational techniques. By quantifying uncertainty, the model offers a more comprehensive and informed approach to sediment classification.
- [33] arXiv:2410.00232 (cross-list from cs.LG) [pdf, html, other]
-
Title: Preconditioning for Accelerated Gradient Descent Optimization and Regularization
Comments: 7 pages
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Accelerated training algorithms, such as adaptive learning rates and various normalization methods, are widely used but not fully understood. When regularization is introduced, standard optimizers like adaptive learning rates may not perform effectively. This raises the need for alternative regularization approaches and the question of how to properly combine regularization with preconditioning. In this paper, we address these challenges using the theory of preconditioning as follows: (1) We explain how preconditioning with AdaGrad, RMSProp, and Adam accelerates training; (2) We explore the interaction between regularization and preconditioning, outlining different options for selecting the variables for regularization, and in particular we discuss how to implement that for the gradient regularization; and (3) We demonstrate how normalization methods accelerate training by improving Hessian conditioning, and discuss how this perspective can lead to new preconditioning training algorithms. Our findings offer a unified mathematical framework for understanding various acceleration techniques and deriving appropriate regularization schemes.
- [34] arXiv:2410.00286 (cross-list from astro-ph.HE) [pdf, html, other]
-
Title: Fermi-GBM Team Analysis on The Ravasio Line
Authors: Eric Burns, Stephen Lesage, Adam Goldstein, Michael S. Briggs, Peter Veres, Suman Bala, Cuan de Barra, Elisabetta Bissaldi, William H Cleveland, Misty M Giles, Matthew Godwin, Boyan A. Hristov, C. Michelle Hui, Daniel Kocevski, Bagrat Mailyan, Christian Malacaria, Sheila McBreen, Robert Preece, Oliver J. Roberts, Lorenzo Scotton, A. von Kienlin, Colleen A. Wilson-Hodge, Joshua Wood
Subjects: High Energy Astrophysical Phenomena (astro-ph.HE); Applications (stat.AP)
The prompt spectra of gamma-ray bursts are known to follow broadband continuum behavior over decades in energy. GRB 221009A, given the moniker the brightest of all time (BOAT), is the brightest gamma-ray burst identified in half a century of observations, and was first identified by the Fermi Gamma-ray Burst Monitor (GBM). On behalf of the Fermi-GBM Team, Lesage et al. (2023) described the initial GBM analysis. Ravasio et al. (2024) report the identification of a spectral line in part of the prompt emission of this burst, which they describe as evolving over 80 s from $\sim$12 MeV to 6 MeV. We report a GBM Team analysis on the Ravasio Line: 1) We cannot identify an instrumental effect that could have produced this signal, and 2) our method of calculating the statistical significance of the line shows it easily exceeds the 5$\sigma$ discovery threshold. We additionally comment on the claim of the line beginning at earlier time intervals, up to 37 MeV, as reported in Zhang et al. (2024). We find that it is reasonable to utilize these measurements for characterization of the line evolution, with caution. We encourage theoretical studies exploring this newly discovered gamma-ray burst spectral feature, unless any rigorous alternative explanation unrelated to the emission from GRB 221009A is identified.
- [35] arXiv:2410.00301 (cross-list from cs.SI) [pdf, html, other]
-
Title: Network Science in Psychology
Comments: 8 figures, 2 tables
Subjects: Social and Information Networks (cs.SI); Applications (stat.AP); Methodology (stat.ME)
Social network analysis can answer research questions such as why or how individuals interact or form relationships and how those relationships impact other outcomes. Despite the breadth of methods available to address psychological research questions, social network analysis is not yet a standard practice in psychological research. To promote the use of social network analysis in psychological research, we present an overview of network methods, situating each method within the context of research studies and questions in psychology.
- [36] arXiv:2410.00345 (cross-list from cs.LG) [pdf, other]
-
Title: A Taxonomy of Loss Functions for Stochastic Optimal Control
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Stochastic optimal control (SOC) aims to direct the behavior of noisy systems and has widespread applications in science, engineering, and artificial intelligence. In particular, reward fine-tuning of diffusion and flow matching models and sampling from unnormalized methods can be recast as SOC problems. A recent work has introduced Adjoint Matching (Domingo-Enrich et al., 2024), a loss function for SOC problems that vastly outperforms existing loss functions in the reward fine-tuning setup. The goal of this work is to clarify the connections between all the existing (and some new) SOC loss functions. Namely, we show that SOC loss functions can be grouped into classes that share the same gradient in expectation, which means that their optimization landscape is the same; they only differ in their gradient variance. We perform simple SOC experiments to understand the strengths and weaknesses of different loss functions.
- [37] arXiv:2410.00357 (cross-list from cs.LG) [pdf, html, other]
-
Title: Neural Scaling Laws of Deep ReLU and Deep Operator Network: A Theoretical Study
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural scaling laws play a pivotal role in the performance of deep neural networks and have been observed in a wide range of tasks. However, a complete theoretical framework for understanding these scaling laws remains underdeveloped. In this paper, we explore the neural scaling laws for deep operator networks, which involve learning mappings between function spaces, with a focus on the Chen and Chen style architecture. These approaches, which include the popular Deep Operator Network (DeepONet), approximate the output functions using a linear combination of learnable basis functions and coefficients that depend on the input functions. We establish a theoretical framework to quantify the neural scaling laws by analyzing its approximation and generalization errors. We articulate the relationship between the approximation and generalization errors of deep operator networks and key factors such as network model size and training data size. Moreover, we address cases where input functions exhibit low-dimensional structures, allowing us to derive tighter error bounds. These results also hold for deep ReLU networks and other similar structures. Our results offer a partial explanation of the neural scaling laws in operator learning and provide a theoretical foundation for their applications.
- [38] arXiv:2410.00373 (cross-list from cs.LG) [pdf, html, other]
-
Title: Robust Traffic Forecasting against Spatial Shift over Years
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (stat.ML)
Recent advancements in Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have demonstrated promising potential for traffic forecasting by effectively capturing both temporal and spatial correlations. The generalization ability of spatiotemporal models has received considerable attention in recent scholarly discourse. However, no substantive datasets specifically addressing traffic out-of-distribution (OOD) scenarios have been proposed. Existing ST-OOD methods are either constrained to testing on extant data or necessitate manual modifications to the dataset. Consequently, the generalization capacity of current spatiotemporal models in OOD scenarios remains largely underexplored. In this paper, we investigate state-of-the-art models using newly proposed traffic OOD benchmarks and, surprisingly, find that these models experience a significant decline in performance. Through meticulous analysis, we attribute this decline to the models' inability to adapt to previously unobserved spatial relationships. To address this challenge, we propose a novel Mixture of Experts (MoE) framework, which learns a set of graph generators (i.e., graphons) during training and adaptively combines them to generate new graphs based on novel environmental conditions to handle spatial distribution shifts during testing. We further extend this concept to the Transformer architecture, achieving substantial improvements. Our method is both parsimonious and efficacious, and can be seamlessly integrated into any spatiotemporal model, outperforming current state-of-the-art approaches in addressing spatial dynamics.
- [39] arXiv:2410.00535 (cross-list from cs.LG) [pdf, html, other]
-
Title: Optimal Causal Representations and the Causal Information Bottleneck
Comments: Submitted to ICLR 2025. Code available at this http URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
To effectively study complex causal systems, it is often useful to construct representations that simplify parts of the system by discarding irrelevant details while preserving key features. The Information Bottleneck (IB) method is a widely used approach in representation learning that compresses random variables while retaining information about a target variable. Traditional methods like IB are purely statistical and ignore underlying causal structures, making them ill-suited for causal tasks. We propose the Causal Information Bottleneck (CIB), a causal extension of the IB, which compresses a set of chosen variables while maintaining causal control over a target variable. This method produces representations which are causally interpretable, and which can be used when reasoning about interventions. We present experimental results demonstrating that the learned representations accurately capture causality as intended.
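For orientation, the classical IB trade-off that the CIB extends is commonly written as below. The symbols $X$ (variable to compress), $Y$ (target), $T$ (representation) and the trade-off parameter $\beta$ are standard textbook notation, not taken from this abstract, and the CIB objective itself is not reproduced here.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y), \qquad \beta > 0
```

The first term rewards compression of $X$, while the second rewards retaining information about $Y$.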
- [40] arXiv:2410.00571 (cross-list from math.PR) [pdf, html, other]
-
Title: Distribution of a Unified $(k_1,k_2,\ldots,k_m)$-run
Comments: 17 pages
Subjects: Probability (math.PR); Statistics Theory (math.ST)
We explore a unified $(k_1,k_2,\ldots,k_m)$-run in multi-state trials, examining its distributional properties and waiting time distribution. Our study reveals that this particular run serves as a generalization encompassing various patterns. Additionally, we discuss various results pertaining to existing patterns as special cases. To illustrate our findings, we provide an application related to DNA frequent patterns.
- [41] arXiv:2410.00660 (cross-list from cs.LG) [pdf, html, other]
-
Title: Stabilizing the Kumaraswamy Distribution
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Large-scale latent variable models require expressive continuous distributions that support efficient sampling and low-variance differentiation, achievable through the reparameterization trick. The Kumaraswamy (KS) distribution is both expressive and supports the reparameterization trick with a simple closed-form inverse CDF. Yet, its adoption remains limited. We identify and resolve numerical instabilities in the inverse CDF and log-pdf, exposing issues in libraries like PyTorch and TensorFlow. We then introduce simple and scalable latent variable models based on the KS, improving exploration-exploitation trade-offs in contextual multi-armed bandits and enhancing uncertainty quantification for link prediction with graph neural networks. Our results support the stabilized KS distribution as a core component in scalable variational models for bounded latent variables.
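To illustrate the kind of instability at issue, here is a minimal sketch of the closed-form inverse CDF, $x = (1 - (1-u)^{1/b})^{1/a}$, next to a generic `log1p`/`expm1` rewrite. The rewrite is a standard floating-point technique and is only assumed to be in the spirit of the paper's fix, which the abstract does not specify; both function names are hypothetical.

```python
import numpy as np

def ks_icdf_naive(u, a, b):
    """Direct evaluation of the Kumaraswamy inverse CDF."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def ks_icdf_log1p(u, a, b):
    """Same formula, evaluated with log1p/expm1 to avoid cancellation near u = 0."""
    t = np.log1p(-u) / b                      # log of (1 - u)^(1/b), always <= 0
    return np.exp(np.log(-np.expm1(t)) / a)   # (1 - exp(t))^(1/a)

# For a very small u the naive form rounds 1 - u to 1 and collapses to 0.
x_naive = ks_icdf_naive(1e-20, 2.0, 1.0)
x_stable = ks_icdf_log1p(1e-20, 2.0, 1.0)

# Away from the boundary, both forms agree and invert the CDF 1 - (1 - x^a)^b.
x_mid = ks_icdf_log1p(0.3, 2.0, 3.0)
u_back = 1.0 - (1.0 - x_mid ** 2.0) ** 3.0
```

In a reparameterized sampler, `u` would be a uniform draw, so a sample that collapses to exactly 0 can produce infinite log-densities downstream.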
- [42] arXiv:2410.00680 (cross-list from eess.AS) [pdf, html, other]
-
Title: The Conformer Encoder May Reverse the Time Dimension
Comments: Submitted to ICASSP 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Machine Learning (stat.ML)
We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder internally reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose several methods and ideas of how this flipping can be avoided. Additionally, we investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.
- [43] arXiv:2410.00699 (cross-list from cs.LG) [pdf, html, other]
-
Title: Investigating the Impact of Model Complexity in Large Language Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Large Language Models (LLMs) based on the pre-trained fine-tuning paradigm have become pivotal in solving natural language processing tasks, consistently achieving state-of-the-art performance. Nevertheless, the theoretical understanding of how model complexity influences fine-tuning performance remains challenging and has not been well explored yet. In this paper, we focus on autoregressive LLMs and propose to employ Hidden Markov Models (HMMs) to model them. Based on the HMM modeling, we investigate the relationship between model complexity and the generalization capability in downstream tasks. Specifically, we consider a popular tuning paradigm for downstream tasks, head tuning, where all pre-trained parameters are frozen and only individual heads are trained atop pre-trained LLMs. Our theoretical analysis reveals that the risk initially increases and then decreases with rising model complexity, showcasing a "double descent" phenomenon. In this case, the initial "descent" is degenerate, signifying that the "sweet spot" where bias and variance are balanced occurs when the model size is zero. Reaching the conclusion presented in this study involves several challenges, primarily revolving around effectively modeling autoregressive LLMs and downstream tasks, as well as conducting a comprehensive risk analysis for multivariate regression. Our research is substantiated by experiments conducted on data generated from HMMs, which provided empirical support and alignment with our theoretical insights.
- [44] arXiv:2410.00709 (cross-list from q-bio.QM) [pdf, html, other]
-
Title: Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches
Authors: Xuefeng Liu, Songhao Jiang, Xiaotian Duan, Archit Vasan, Chong Liu, Chih-chan Tien, Heng Ma, Thomas Brettin, Fangfang Xia, Ian T. Foster, Rick L. Stevens
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Protein-ligand binding is the process by which a small molecule (drug or inhibitor) attaches to a target protein. The binding affinity, which refers to the strength of this interaction, is central to many important problems in bioinformatics such as drug design. An extensive amount of work has been devoted to predicting binding affinity over the past decades due to its significance. In this paper, we review all significant recent works, focusing on the methods, features, and benchmark datasets. We have observed a rising trend in the use of traditional machine learning and deep learning models for predicting binding affinity, accompanied by an increasing amount of data on proteins and small drug-like molecules. While prediction results are constantly improving, we also identify several open questions and potential directions that remain unexplored in the field. This paper could serve as an excellent starting point for machine learning researchers who wish to engage in the study of binding affinity, or for anyone with general interests in machine learning, drug discovery, and bioinformatics.
- [45] arXiv:2410.00759 (cross-list from cs.LG) [pdf, html, other]
-
Title: Targeted synthetic data generation for tabular data via hardness characterization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Synthetic data generation has been proven successful in improving model performance and robustness in the context of scarce or low-quality data. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a novel augmentation pipeline that generates only high-value training points based on hardness characterization. We first demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks, while offering significant theoretical and computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on simulated data and on a large-scale credit default prediction task. In particular, our approach improves the quality of out-of-sample predictions and is computationally more efficient than non-targeted methods.
- [46] arXiv:2410.00858 (cross-list from math.PR) [pdf, html, other]
-
Title: Entropy contraction of the Gibbs sampler under log-concavity
Subjects: Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO); Machine Learning (stat.ML)
The Gibbs sampler (a.k.a. Glauber dynamics and heat-bath algorithm) is a popular Markov Chain Monte Carlo algorithm which iteratively samples from the conditional distributions of a probability measure $\pi$ of interest. Under the assumption that $\pi$ is strongly log-concave, we show that the random scan Gibbs sampler contracts in relative entropy and provide a sharp characterization of the associated contraction rate. Assuming that evaluating conditionals is cheap compared to evaluating the joint density, our results imply that the number of full evaluations of $\pi$ needed for the Gibbs sampler to mix grows linearly with the condition number and is independent of the dimension. If $\pi$ is non-strongly log-concave, the convergence rate in entropy degrades from exponential to polynomial. Our techniques are versatile and extend to Metropolis-within-Gibbs schemes and the Hit-and-Run algorithm. A comparison with gradient-based schemes and the connection with the optimization literature are also discussed.
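As a concrete picture of the algorithm being analyzed, the following is a minimal sketch of a random scan Gibbs sampler on a toy target: a bivariate standard Gaussian with correlation `rho`, which is strongly log-concave. The target, function name, and parameter values are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def random_scan_gibbs(rho, n_steps, seed=0):
    """Random scan Gibbs sampler for a bivariate standard Gaussian with correlation rho."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    samples = np.empty((n_steps, 2))
    cond_sd = np.sqrt(1.0 - rho ** 2)  # std. dev. of x_i given the other coordinate
    for t in range(n_steps):
        i = rng.integers(2)            # choose a coordinate uniformly at random
        # Exact conditional: x_i | x_j ~ N(rho * x_j, 1 - rho^2)
        x[i] = rho * x[1 - i] + cond_sd * rng.standard_normal()
        samples[t] = x
    return samples

samples = random_scan_gibbs(rho=0.5, n_steps=20000)
emp_corr = np.corrcoef(samples[2000:].T)[0, 1]  # discard an initial burn-in
```

Each step touches only one conditional distribution, which is the setting in which the abstract counts cost in terms of equivalent full evaluations of $\pi$.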
- [47] arXiv:2410.00862 (cross-list from cs.LG) [pdf, html, other]
-
Title: Timber! Poisoning Decision Trees
Comments: 18 pages, 7 figures, 5 tables
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
We present Timber, the first white-box poisoning attack targeting decision trees. Timber is based on a greedy attack strategy leveraging sub-tree retraining to efficiently estimate the damage performed by poisoning a given training instance. The attack relies on a tree annotation procedure which enables sorting training instances so that they are processed in increasing order of computational cost of sub-tree retraining. This sorting yields a variant of Timber supporting an early stopping criterion designed to make poisoning attacks more efficient and feasible on larger datasets. We also discuss an extension of Timber to traditional random forest models, which is useful because decision trees are normally combined into ensembles to improve their predictive power. Our experimental evaluation on public datasets shows that our attacks outperform existing baselines in terms of effectiveness, efficiency or both. Moreover, we show that two representative defenses can mitigate the effect of our attacks, but fail at effectively thwarting them.
Cross submissions (showing 19 of 19 entries)
- [48] arXiv:2204.10508 (replaced) [pdf, html, other]
-
Title: Identification enhanced generalised linear model estimation with nonignorable missing outcomes
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Missing data often result in undesirable bias and loss of efficiency. These become substantial problems when the response mechanism is nonignorable, such that the response model depends on unobserved variables. It is necessary to estimate the joint distribution of unobserved variables and response indicators to manage nonignorable nonresponse. However, model misspecification and identification issues prevent robust estimates despite careful estimation of the target joint distribution. In this study, we modelled the distribution of the observed parts and derived sufficient conditions for model identifiability, assuming a logistic regression model as the response mechanism and generalised linear models as the main outcome model of interest. More importantly, the derived sufficient conditions are testable with the observed data and do not require any instrumental variables, which are often assumed to guarantee model identifiability but cannot be practically determined beforehand. To analyse missing data, we propose a new imputation method which incorporates verifiable identifiability using only observed data. Furthermore, we present the performance of the proposed estimators in numerical studies and apply the proposed method to two sets of real data: exit polls for the 19th South Korean election data and public data collected from the Korean Survey of Household Finances and Living Conditions.
- [49] arXiv:2212.08766 (replaced) [pdf, html, other]
-
Title: Asymptotically Optimal Knockoff Statistics via the Masked Likelihood Ratio
Comments: 68 pages, 21 figures; clearer notation throughout
Subjects: Methodology (stat.ME)
In feature selection problems, knockoffs are synthetic controls for the original features. Employing knockoffs allows analysts to use nearly any variable importance measure or "feature statistic" to select features while rigorously controlling false positives. However, it is not clear which statistic maximizes power. In this paper, we argue that state-of-the-art lasso-based feature statistics often prioritize features that are unlikely to be discovered, leading to low power in real applications. Instead, we introduce masked likelihood ratio (MLR) statistics, which prioritize features according to one's ability to distinguish each feature from its knockoff. Although no single feature statistic is uniformly most powerful in all situations, we show that MLR statistics asymptotically maximize the number of discoveries under a user-specified Bayesian model of the data. (Like all feature statistics, MLR statistics always provide frequentist error control.) This result places no restrictions on the problem dimensions and makes no parametric assumptions; instead, we require a "local dependence" condition that depends only on known quantities. In simulations and three real applications, MLR statistics outperform state-of-the-art feature statistics, including in settings where the Bayesian model is misspecified. We implement MLR statistics in the python package knockpy; our implementation is often faster than computing a cross-validated lasso.
- [50] arXiv:2306.15629 (replaced) [pdf, html, other]
-
Title: A Non-Parametric Approach to Detect Patterns in Binary Sequences
Subjects: Methodology (stat.ME)
In many circumstances, given an ordered sequence of one or more types of elements or symbols, the objective is to determine the existence of any randomness in the occurrence of one specific element, say type 1. This method can help detect non-random patterns, such as wins or losses in a series of games. Existing tests based on the total number of runs or on the length of the longest run (Mosteller (1941)) can be used for testing the null hypothesis of randomness in the entire sequence, but not for a specific type of element. Moreover, the Runs Test often yields results that contradict the patterns visualized in graphs showing, for instance, win proportions over time. This paper develops a test approach to address this problem: it computes the gaps between two consecutive type 1 elements, identifies patterns in occurrence and directional trends (increasing, decreasing, or constant), and applies the exact Binomial test, Kendall's Tau, and the Siegel-Tukey test for scale problems. Further modifications suggested by Jan Vegelius (1982) have been applied in the Siegel-Tukey test to adjust for tied ranks and achieve more accurate results. This approach is distribution-free and suitable for small sample sizes. Comparisons with the conventional runs test also demonstrate the superiority of the proposed method under the null hypothesis of randomness in the occurrence of type 1 elements.
- [51] arXiv:2308.12108 (replaced) [pdf, html, other]
-
Title: The Local Learning Coefficient: A Singularity-Aware Complexity Measure
Comments: This version contains new empirical results and merged content from a related paper (arXiv:2402.03698) to provide a more comprehensive study
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The Local Learning Coefficient (LLC) is introduced as a novel complexity measure for deep neural networks (DNNs). Recognizing the limitations of traditional complexity measures, the LLC leverages Singular Learning Theory (SLT), which has long recognized the significance of singularities in the loss landscape geometry. This paper provides an extensive exploration of the LLC's theoretical underpinnings, offering both a clear definition and intuitive insights into its application. Moreover, we propose a new scalable estimator for the LLC, which is then effectively applied across diverse architectures including deep linear networks up to 100M parameters, ResNet image models, and transformer language models. Empirical evidence suggests that the LLC provides valuable insights into how training heuristics might influence the effective complexity of DNNs. Ultimately, the LLC emerges as a crucial tool for reconciling the apparent contradiction between deep learning's complexity and the principle of parsimony.
- [52] arXiv:2309.10068 (replaced) [pdf, other]
-
Title: A Unifying Perspective on Non-Stationary Kernels for Deeper Gaussian Processes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
The Gaussian process (GP) is a popular statistical technique for stochastic function approximation and uncertainty quantification from data. GPs have been adopted into the realm of machine learning in the last two decades because of their superior prediction abilities, especially in data-sparse scenarios, and their inherent ability to provide robust uncertainty estimates. Even so, their performance highly depends on intricate customizations of the core methodology, which often leads to dissatisfaction among practitioners when standard setups and off-the-shelf software tools are being deployed. Arguably the most important building block of a GP is the kernel function which assumes the role of a covariance operator. Stationary kernels of the Matérn class are used in the vast majority of applied studies; poor prediction performance and unrealistic uncertainty quantification are often the consequences. Non-stationary kernels show improved performance but are rarely used due to their more complicated functional form and the associated effort and expertise needed to define and tune them optimally. In this perspective, we want to help ML practitioners make sense of some of the most common forms of non-stationarity for Gaussian processes. We show a variety of kernels in action using representative datasets, carefully study their properties, and compare their performances. Based on our findings, we propose a new kernel that combines some of the identified advantages of existing kernels.
- [53] arXiv:2310.05288 (replaced) [pdf, html, other]
-
Title: Clustering Three-Way Data with Outliers
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Matrix-variate distributions are a recent addition to the model-based clustering field, thereby making it possible to analyze data in matrix form with complex structure such as images and time series. Due to their recent appearance, there is limited literature on matrix-variate data, with even less on dealing with outliers in these models. An approach for clustering matrix-variate normal data with outliers is discussed. The approach, which uses the distribution of subset log-likelihoods, extends the OCLUST algorithm to matrix-variate normal data and uses an iterative approach to detect and trim outliers.
- [54] arXiv:2310.07850 (replaced) [pdf, html, other]
-
Title: Conformal prediction with local weights: randomization enables local guarantees
Comments: 45 pages, 13 figures
Subjects: Methodology (stat.ME)
In this work, we consider the problem of building distribution-free prediction intervals with finite-sample conditional coverage guarantees. Conformal prediction (CP) is an increasingly popular framework for building prediction intervals with distribution-free guarantees, but these guarantees only ensure marginal coverage: the probability of coverage is averaged over a random draw of both the training and test data, meaning that there might be substantial undercoverage within certain subpopulations. Instead, ideally, we would want to have local coverage guarantees that hold for each possible value of the test point's features. While the impossibility of achieving pointwise local coverage is well established in the literature, many variants of the conformal prediction algorithm show favorable local coverage properties empirically. Relaxing the definition of local coverage can allow for a theoretical understanding of this empirical phenomenon. We aim to bridge this gap between theoretical validation and empirical performance by proving achievable and interpretable guarantees for a relaxed notion of local coverage. Building on the localized CP method of Guan (2023) and the weighted CP framework of Tibshirani et al. (2019), we propose a new method, randomly-localized conformal prediction (RLCP), which returns prediction intervals that are not only marginally valid but also achieve a relaxed local coverage guarantee and guarantees under covariate shift. Through a series of simulations and real data experiments, we validate these coverage guarantees of RLCP while comparing it with the other local conformal prediction methods.
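For orientation, the marginal coverage guarantee that this abstract contrasts with local coverage can be illustrated with plain split conformal prediction. The sketch below is not RLCP and is not taken from the paper; it is a minimal baseline that assumes absolute-residual scores and a held-out calibration set, and the function and argument names are hypothetical.

```python
import math

def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Symmetric interval around test_pred with marginal coverage >= 1 - alpha.

    Assumes calibration and test points are exchangeable (illustrative baseline only).
    """
    # Conformity scores: absolute residuals on the held-out calibration set.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank of the score used as the interval half-width.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (test_pred - q, test_pred + q)
```

The half-width is the same for every test point, which is why the guarantee holds only on average over the feature distribution rather than locally.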
- [55] arXiv:2310.08208 (replaced) [pdf, html, other]
-
Title: DsubCox: A Fast Subsampling Algorithm for Cox Model with Distributed and Massive Survival Data
Subjects: Computation (stat.CO)
To ensure privacy protection and alleviate computational burden, we propose a fast subsampling procedure for the Cox model with massive survival datasets from multi-centered, decentralized sources. The proposed estimator is computed based on optimal subsampling probabilities that we derived and enables transmission of subsample-based summary level statistics between different storage sites with only one round of communication. For inference, the asymptotic properties of the proposed estimator were rigorously established. An extensive simulation study demonstrated that the proposed approach is effective. The methodology was applied to analyze a large dataset from U.S. airlines.
- [56] arXiv:2311.00202 (replaced) [pdf, html, other]
-
Title: On the Gaussian product inequality conjecture for disjoint principal minors of Wishart random matrices
Comments: 26 pages, 0 figures
Subjects: Statistics Theory (math.ST); Functional Analysis (math.FA); Probability (math.PR)
This paper extends various results related to the Gaussian product inequality (GPI) conjecture to the setting of disjoint principal minors of Wishart random matrices. This includes product-type inequalities for matrix-variate analogs of completely monotone functions and Bernstein functions of Wishart disjoint principal minors, respectively. In particular, the product-type inequalities apply to inverse determinant powers. Quantitative versions of the inequalities are also obtained when there is a mix of positive and negative exponents. Furthermore, an extended form of the GPI is shown to hold for the eigenvalues of Wishart random matrices by virtue of their law being multivariate totally positive of order 2 (MTP$_2$). A new, unexplored avenue of research is presented to study the GPI from the point of view of elliptical distributions.
- [57] arXiv:2312.02404 (replaced) [pdf, html, other]
-
Title: Nonparametric Bayesian Adjustment of Unmeasured Confounders in Cox Proportional Hazards Models
Comments: general Bayes, invalid instrumental variable, Mendelian randomization, UK Biobank, weak instrumental variable
Subjects: Methodology (stat.ME)
In observational studies, unmeasured confounders present a crucial challenge in accurately estimating desired causal effects. To calculate the hazard ratio (HR) in Cox proportional hazard models for time-to-event outcomes, two-stage residual inclusion and limited information maximum likelihood are typically employed. However, these methods are known to entail difficulty in terms of potential bias of HR estimates and parameter identification. This study introduces a novel nonparametric Bayesian method designed to estimate an unbiased HR, addressing concerns that previous research methods have had. Our proposed method consists of two phases: 1) detecting clusters based on the likelihood of the exposure and outcome variables, and 2) estimating the hazard ratio within each cluster. Although it is implicitly assumed that unmeasured confounders affect outcomes through cluster effects, our algorithm is well-suited for such data structures. The proposed Bayesian estimator has good performance compared with some competitors.
- [58] arXiv:2312.09586 (replaced) [pdf, html, other]
-
Title: Matching prior pairs connecting Maximum A Posteriori estimation and posterior expectation
Subjects: Statistics Theory (math.ST)
Bayesian statistics has two common measures of central tendency of a posterior distribution: posterior means and Maximum A Posteriori (MAP) estimates. In this paper, we discuss a connection between MAP estimates and posterior means. We derive an asymptotic condition for a pair of prior densities under which the posterior mean based on one prior coincides with the MAP estimate based on the other prior. A sufficient condition for the existence of this prior pair relates to $\alpha$-flatness of the statistical model in information geometry. We also construct a matching prior pair using $\alpha$-parallel priors. Our result elucidates an interesting connection between regularization in generalized linear regression models and posterior expectation.
- [59] arXiv:2404.16209 (replaced) [pdf, other]
-
Title: Exploring Spatial Context: A Comprehensive Bibliography of GWR and MGWR
A. Stewart Fotheringham, Chen-Lun Kao, Hanchen Yu, Sarah Bardin, Taylor Oshan, Ziqi Li, Mehak Sachdeva, Wei Luo
Comments: 423 pages
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
Local spatial models such as Geographically Weighted Regression (GWR) and Multiscale Geographically Weighted Regression (MGWR) serve as instrumental tools to capture intrinsic contextual effects through the estimates of the local intercepts and behavioral contextual effects through estimates of the local slope parameters. GWR and MGWR provide simple-to-implement yet powerful frameworks that could be extended to various disciplines that handle spatial data. This bibliography aims to serve as a comprehensive compilation of peer-reviewed papers that have utilized GWR or MGWR as a primary analytical method to conduct spatial analyses and acts as a useful guide to anyone searching the literature for previous examples of local statistical modeling in a wide variety of application fields.
- [60] arXiv:2405.10247 (replaced) [pdf, html, other]
-
Title: Alternative ranking measures to predict international football results
Subjects: Applications (stat.AP)
Over the last few years, there has been a growing interest in the prediction and modelling of competitive sports outcomes, with particular emphasis placed on this area by the Bayesian statistics and machine learning communities. In this paper, we have carried out a comparative evaluation of statistical and machine learning models to assess their predictive performance for the 2022 FIFA World Cup and for the 2023 CAF Africa Cup of Nations by evaluating alternative summaries of past performances related to the involved teams. More specifically, we consider the Bayesian Bradley-Terry-Davidson model, which is a widely used statistical framework for ranking items based on paired comparisons that has been applied successfully in various domains, including football. The analysis was performed by including, in some canonical goal-based models, both the Bradley-Terry-Davidson derived ranking and the widely recognized Coca-Cola FIFA ranking commonly adopted by football fans and amateurs.
- [61] arXiv:2405.18597 (replaced) [pdf, html, other]
-
Title: Nonparametric causal inference for optogenetics: sequential excursion effects for dynamic regimes
Comments: 52 pages, 15 figures
Subjects: Methodology (stat.ME); Applications (stat.AP)
Optogenetics is a powerful neuroscience technique for studying how neural circuit manipulation affects behavior. Standard analysis conventions discard information and severely limit the scope of the causal questions that can be probed. To address this gap, we 1) draw connections to the causal inference literature on sequentially randomized experiments, 2) propose a non-parametric framework for analyzing "open-loop" (static regime) optogenetics behavioral experiments, 3) derive extensions of history-restricted marginal structural models for dynamic treatment regimes with positivity violations for "closed-loop" designs, and 4) propose a taxonomy of identifiable causal effects that encompass a far richer collection of scientific questions compared to standard methods. From another view, our work extends "excursion effect" methods, popularized recently in the mobile health literature, to enable estimation of causal contrasts for treatment sequences in the presence of positivity violations. We describe sufficient conditions for identifiability of the proposed causal estimands, and provide asymptotic statistical guarantees for a proposed inverse probability-weighted estimator, a multiply-robust estimator (for two intervention timepoints), a framework for hypothesis testing, and a computationally scalable implementation. Finally, we apply our framework to data from a recent neuroscience study and show how it provides insight into causal effects of optogenetics on behavior that are obscured by standard analyses.
- [62] arXiv:2407.03616 (replaced) [pdf, other]
-
Title: When can weak latent factors be statistically inferred?
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Statistics Theory (math.ST); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
This article establishes a new and comprehensive estimation and inference theory for principal component analysis (PCA) under the weak factor model that allows for cross-sectional dependent idiosyncratic components under the nearly minimal factor strength relative to the noise level or signal-to-noise ratio. Our theory is applicable regardless of the relative growth rate between the cross-sectional dimension $N$ and temporal dimension $T$. This more realistic assumption and noticeable result require a completely new technical device, as the commonly-used leave-one-out trick is no longer applicable to the case with cross-sectional dependence. Another notable advancement of our theory is on PCA inference: for example, under the regime where $N\asymp T$, we show that the asymptotic normality for the PCA-based estimator holds as long as the signal-to-noise ratio (SNR) grows faster than a polynomial rate of $\log N$. This finding significantly surpasses prior work that required a polynomial rate of $N$. Our theory is entirely non-asymptotic, offering finite-sample characterizations for both the estimation error and the uncertainty level of statistical inference. A notable technical innovation is our closed-form first-order approximation of the PCA-based estimator, which paves the way for various statistical tests. Furthermore, we apply our theories to design easy-to-implement statistics for validating whether given factors fall in the linear spans of unknown latent factors, testing structural breaks in the factor loadings for an individual unit, checking whether two units have the same risk exposures, and constructing confidence intervals for systematic risks. Our empirical studies uncover insightful correlations between our test results and economic cycles.
- [63] arXiv:2407.06875 (replaced) [pdf, html, other]
-
Title: Extending the blended generalized extreme value distribution
Subjects: Applications (stat.AP); Atmospheric and Oceanic Physics (physics.ao-ph); Methodology (stat.ME)
The generalized extreme value (GEV) distribution is commonly employed to help estimate the likelihood of extreme events in many geophysical and other application areas. The recently proposed blended generalized extreme value (bGEV) distribution modifies the GEV with positive shape parameter to avoid a hard lower bound that complicates fitting and inference. Here, the bGEV is extended to the GEV with negative shape parameter, avoiding a hard upper bound that is unrealistic in many applications. This extended bGEV is shown to improve on the GEV for forecasting heat and sea level extremes based on past data. Software implementing this bGEV and applying it to the example temperature and sea level data is provided.
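As background for the hard bounds mentioned in this abstract, the standard GEV distribution function and its support can be written out explicitly. This is the textbook parameterization with location $\mu$, scale $\sigma > 0$, and shape $\xi \neq 0$; the bGEV paper may use a different parameterization, so the symbols here are an assumption.

```latex
F(x;\mu,\sigma,\xi) = \exp\!\left\{-\left[1+\xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right\},
\qquad 1+\xi\,\frac{x-\mu}{\sigma} > 0 .

% Support implied by the positivity constraint:
\xi > 0:\quad x > \mu - \frac{\sigma}{\xi} \quad \text{(hard lower bound)},
\qquad
\xi < 0:\quad x < \mu - \frac{\sigma}{\xi} \quad \text{(hard upper bound)} .
```

Since $-\sigma/\xi > 0$ when $\xi < 0$, the upper bound lies above $\mu$. The limit $\xi \to 0$ gives the Gumbel distribution, whose support is unbounded in both directions.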
- [64] arXiv:2409.07391 (replaced) [pdf, html, other]
-
Title: Improve Sensitivity Analysis Synthesizing Randomized Clinical Trials With Limited Overlap
Subjects: Methodology (stat.ME)
To estimate the average treatment effect in real-world populations, observational studies are typically designed around real-world cohorts. However, even when study samples from these designs represent the population, unmeasured confounders can introduce bias. Sensitivity analysis is often used to estimate bounds for the average treatment effect without relying on the strict mathematical assumptions of other existing methods. This article introduces a new approach that improves sensitivity analysis in observational studies by incorporating randomized clinical trial data, even with limited overlap due to inclusion/exclusion criteria. Theoretical proof and simulations show that this method provides a tighter bound width than existing approaches. We also apply this method to both a trial dataset and a real-world drug effectiveness comparison dataset for practical analysis.
- [65] arXiv:2409.09003 (replaced) [pdf, html, other]
-
Title: Model-independent variable selection via the rule-based variable priority
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
While achieving high prediction accuracy is a fundamental goal in machine learning, an equally important task is finding a small number of features with high explanatory power. One popular selection technique is permutation importance, which assesses a variable's impact by measuring the change in prediction error after permuting the variable. However, this can be problematic due to the need to create artificial data, a problem shared by other methods as well. Another problem is that variable selection methods can be limited by being model-specific. We introduce a new model-independent approach, Variable Priority (VarPro), which works by utilizing rules without the need to generate artificial data or evaluate prediction error. The method is relatively easy to use, requiring only the calculation of sample averages of simple statistics, and can be applied to many data settings, including regression, classification, and survival. We investigate the asymptotic properties of VarPro and show, among other things, that VarPro has a consistent filtering property for noise variables. Empirical studies using synthetic and real-world data show the method achieves a balanced performance and compares favorably to many state-of-the-art procedures currently used for variable selection.
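The permutation importance baseline described in this abstract can be sketched in a few lines. The code below is an illustrative sketch of that generic technique, not of VarPro; it assumes a model given as a plain callable on tuple-valued rows and squared-error loss, and all names are hypothetical.

```python
import random

def mean_squared_error(model, rows, targets):
    """Average squared prediction error of model over the rows."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling one feature column.

    rows must be tuples; feature is the index of the column to permute.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rows with only the chosen feature permuted: these are the artificial data points.
        permuted = [r[:feature] + (v,) + r[feature + 1:] for r, v in zip(rows, column)]
        increases.append(mean_squared_error(model, permuted, targets) - baseline)
    return sum(increases) / n_repeats
```

The permuted rows generally do not correspond to any observed data point, which is the artificial-data concern the abstract raises.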
- [66] arXiv:2409.17544 (replaced) [pdf, html, other]
-
Title: Optimizing the Induced Correlation in Omnibus Joint Graph Embeddings
Comments: 34 pages, 8 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Theoretical and empirical evidence suggests that joint graph embedding algorithms induce correlation across the networks in the embedding space. In the Omnibus joint graph embedding framework, previous results explicitly delineated the dual effects of the algorithm-induced and model-inherent correlations on the correlation across the embedded networks. Accounting for and mitigating the algorithm-induced correlation is key to subsequent inference, as sub-optimal Omnibus matrix constructions have been demonstrated to lead to loss in inference fidelity. This work presents the first efforts to automate the Omnibus construction in order to address two key questions in this joint embedding framework: the correlation-to-OMNI problem and the flat correlation problem. In the flat correlation problem, we seek to understand the minimum algorithm-induced flat correlation (i.e., the same across all graph pairs) produced by a generalized Omnibus embedding. Working in a subspace of the fully general Omnibus matrices, we prove both a lower bound for this flat correlation and that the classical Omnibus construction induces the maximal flat correlation. In the correlation-to-OMNI problem, we present an algorithm -- named corr2Omni -- that, from a given matrix of estimated pairwise graph correlations, estimates the matrix of generalized Omnibus weights that induces optimal correlation in the embedding space. Moreover, in both simulated and real data settings, we demonstrate the increased effectiveness of our corr2Omni algorithm versus the classical Omnibus construction.
- [67] arXiv:1112.1768 (replaced) [pdf, html, other]
-
Title: The Extended UCB Policies for Frequentist Multi-armed Bandit Problems
Comments: 25 pages, 3 figures
Subjects: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
The multi-armed bandit (MAB) problem is a widely studied model in the field of operations research for sequential decision making and reinforcement learning. This paper mainly considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, which is an extension of the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [21]. The previous UCB policies require knowledge of an upper bound on specific moments of the reward distributions, or require a particular moment to exist, which can be hard to acquire or guarantee in practical scenarios. Our extended robust UCB generalizes Lattimore's seminal work (for moments of orders $p=4$ and $q=2$) to arbitrarily chosen $p$ and $q$ as long as the two moments have a known controlled relationship, while still achieving the optimal regret growth order $O(\log T)$, thus providing a broadened application area of the UCB policies for heavy-tailed reward distributions.
- [68] arXiv:2207.09660 (replaced) [pdf, other]
-
Title: Alternating minimization for generalized rank one matrix sensing: Sharp predictions from a random initialization
Comments: v2 is consistent with version to appear in Information and Inference: A Journal of the IMA
Subjects: Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
We consider the problem of estimating the factors of a rank-$1$ matrix with i.i.d. Gaussian, rank-$1$ measurements that are nonlinearly transformed and corrupted by noise. Considering two prototypical choices for the nonlinearity, we study the convergence properties of a natural alternating update rule for this nonconvex optimization problem starting from a random initialization. We show sharp convergence guarantees for a sample-split version of the algorithm by deriving a deterministic recursion that is accurate even in high-dimensional problems. Notably, while the infinite-sample population update is uninformative and suggests exact recovery in a single step, the algorithm -- and our deterministic prediction -- converges geometrically fast from a random initialization. Our sharp, non-asymptotic analysis also exposes several other fine-grained properties of this problem, including how the nonlinearity and noise level affect convergence behavior.
On a technical level, our results are enabled by showing that the empirical error recursion can be predicted by our deterministic sequence within fluctuations of the order $n^{-1/2}$ when each iteration is run with $n$ observations. Our technique leverages leave-one-out tools originating in the literature on high-dimensional $M$-estimation and provides an avenue for sharply analyzing higher-order iterative algorithms from a random initialization in other high-dimensional optimization problems with random data.
- [69] arXiv:2210.05222 (replaced) [pdf, other]
-
Title: Stochastic Direct Search Method for Blind Resource Allocation
Juliette Achddou (PSL, DI-ENS), Olivier Cappe (CNRS, DI-ENS, PSL), Aurélien Garivier (UMPA-ENSL, CNRS)
Journal-ref: Transactions on Machine Learning Research Journal, 2024
Subjects: Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
Motivated by programmatic advertising optimization, we consider the task of sequentially allocating budget across a set of resources. At every time step, a feasible allocation is chosen and only a corresponding random return is observed. The goal is to maximize the cumulative expected sum of returns. This is a realistic model for budget allocation across subdivisions of marketing campaigns, with the objective of maximizing the number of conversions. We study direct search (also known as pattern search) methods for linearly constrained and derivative-free optimization in the presence of noise, which apply in particular to sequential budget allocation. These algorithms, which do not rely on hierarchical partitioning of the resource space, are easy to implement; they respect the operational constraints of resource allocation by avoiding evaluation outside of the feasible domain; and they are also compatible with warm start by being (approximate) descent algorithms. However, they have not yet been analyzed from the perspective of cumulative regret. We show that direct search methods achieve finite regret in the deterministic and unconstrained case. In the presence of evaluation noise and linear constraints, we propose a simple extension of direct search that achieves a regret upper-bound of the order of $T^{2/3}$. We also propose an accelerated version of the algorithm, relying on repeated sequential testing, that significantly improves the practical behavior of the approach.
- [70] arXiv:2211.02032 (replaced) [pdf, html, other]
-
Title: To spike or not to spike: the whims of the Wonham filter in the strong noise regime
Comments: v1, v2: Preliminary versions. v3: Submitted version
Subjects: Probability (math.PR); Information Theory (cs.IT); Optimization and Control (math.OC); Statistics Theory (math.ST)
We study the celebrated Shiryaev-Wonham filter (1964) in its historical setup where the hidden Markov jump process has two states. We are interested in the weak noise regime for the observation equation. Interestingly, this becomes a strong noise regime for the filtering equations.
Earlier results of the authors show the appearance of spikes in the filtered process, akin to a metastability phenomenon. This paper is aimed at understanding the smoothed optimal filter, which is relevant for any system with feedback. In particular, we exhibit a sharp phase transition between a spiking regime and a regime with perfect smoothing.
- [71] arXiv:2311.10270 (replaced) [pdf, html, other]
-
Title: Multiscale Hodge Scattering Networks for Data Analysis
Comments: 20 Pages, Comments Welcome
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Signal Processing (eess.SP); Numerical Analysis (math.NA); Machine Learning (stat.ML)
We propose new scattering networks for signals measured on simplicial complexes, which we call Multiscale Hodge Scattering Networks (MHSNs). Our construction is based on multiscale basis dictionaries on simplicial complexes, i.e., the $\kappa$-GHWT and $\kappa$-HGLET, which we recently developed for simplices of dimension $\kappa \in \mathbb{N}$ in a given simplicial complex by generalizing the node-based Generalized Haar-Walsh Transform (GHWT) and Hierarchical Graph Laplacian Eigen Transform (HGLET). The $\kappa$-GHWT and the $\kappa$-HGLET both form redundant sets (i.e., dictionaries) of multiscale basis vectors and the corresponding expansion coefficients of a given signal. Our MHSNs use a layered structure analogous to a convolutional neural network (CNN) to cascade the moments of the modulus of the dictionary coefficients. The resulting features are invariant to reordering of the simplices (i.e., node permutation of the underlying graphs). Importantly, the use of multiscale basis dictionaries in our MHSNs admits a natural pooling operation that is akin to local pooling in CNNs, and which may be performed either locally or per-scale. These pooling operations are harder to define in both traditional scattering networks based on Morlet wavelets, and geometric scattering networks based on Diffusion Wavelets. As a result, we are able to extract a rich set of descriptive yet robust features that can be used along with very simple machine learning methods (e.g., logistic regression or support vector machines) to achieve high-accuracy classification systems with far fewer parameters to train than most modern graph neural networks. Finally, we demonstrate the usefulness of our MHSNs in three distinct types of problems: signal classification, domain (i.e., graph/simplex) classification, and molecular dynamics prediction.
- [72] arXiv:2312.07151 (replaced) [pdf, other]
-
Title: The Gaussian-Linear Hidden Markov model: a Python package
Diego Vidaurre, Laura Masaracchia, Nick Y. Larsen, Lenno R.P.T Ruijters, Sonsoles Alonso, Christine Ahrends, Mark W. Woolrich
Comments: 24 pages, 8 figures, 1 table
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
We propose the Gaussian-Linear Hidden Markov model (GLHMM), a generalisation of different types of HMMs commonly used in neuroscience. In short, the GLHMM is a general framework where linear regression is used to flexibly parameterise the Gaussian state distribution, thereby accommodating a wide range of uses -- including unsupervised, encoding and decoding models. GLHMM is implemented as a Python toolbox with an emphasis on statistical testing and out-of-sample prediction -- i.e. aimed at finding and characterising brain-behaviour associations. The toolbox uses a stochastic variational inference approach, enabling it to handle large data sets at reasonable computational time. The approach can be applied to several data modalities, including animal recordings or non-brain data, and applied over a broad range of experimental paradigms. For demonstration, we show examples with fMRI, electrocorticography, magnetoencephalography and pupillometry.
- [73] arXiv:2401.08626 (replaced) [pdf, html, other]
-
Title: Validation and Comparison of Non-Stationary Cognitive Models: A Diffusion Model Application
Subjects: Neurons and Cognition (q-bio.NC); Methodology (stat.ME)
Cognitive processes undergo various fluctuations and transient states across different temporal scales. Superstatistics are emerging as a flexible framework for incorporating such non-stationary dynamics into existing cognitive model classes. In this work, we provide the first experimental validation of superstatistics and formal comparison of four non-stationary diffusion decision models in a specifically designed perceptual decision-making task. Task difficulty and speed-accuracy trade-off were systematically manipulated to induce expected changes in model parameters. To validate our models, we assess whether the inferred parameter trajectories align with the patterns and sequences of the experimental manipulations. To address computational challenges, we present novel deep learning techniques for amortized Bayesian estimation and comparison of models with time-varying parameters. Our findings indicate that transition models incorporating both gradual and abrupt parameter shifts provide the best fit to the empirical data. Moreover, we find that the inferred parameter trajectories closely mirror the sequence of experimental manipulations. Posterior re-simulations further underscore the ability of the models to faithfully reproduce critical data patterns. Accordingly, our results suggest that the inferred non-stationary dynamics may reflect actual changes in the targeted psychological constructs. We argue that our initial experimental validation paves the way for the widespread application of superstatistics in cognitive modeling and beyond.
- [74] arXiv:2404.12968 (replaced) [pdf, html, other]
-
Title: Scalable Data Assimilation with Message Passing
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Applications (stat.AP)
Data assimilation is a core component of numerical weather prediction systems. The large quantity of data processed during assimilation requires the computation to be distributed across increasingly many compute nodes, yet existing approaches suffer from synchronisation overhead in this setting. In this paper, we exploit the formulation of data assimilation as a Bayesian inference problem and apply a message-passing algorithm to solve the spatial inference problem. Since message passing is inherently based on local computations, this approach lends itself to parallel and distributed computation. In combination with a GPU-accelerated implementation, we can scale the algorithm to very large grid sizes while retaining good accuracy and reasonable compute and memory requirements.
- [75] arXiv:2407.01656 (replaced) [pdf, other]
-
Title: Statistical signatures of abstraction in deep neural networks
Comments: The estimate of the Kullback-Leibler distance used in the paper is affected by strong sampling errors. Additional statistical analysis is needed
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
We study how abstract representations emerge in a Deep Belief Network (DBN) trained on benchmark datasets. Our analysis targets the principles of learning in the early stages of information processing, starting from the "primordial soup" of the under-sampling regime. As the data is processed by deeper and deeper layers, features are detected and removed, transferring more and more "context-invariant" information to deeper layers. We show that the representation approaches a universal model -- the Hierarchical Feature Model (HFM) -- determined by the principle of maximal relevance. Relevance quantifies the uncertainty on the model of the data, thus suggesting that "meaning" -- i.e. syntactic information -- is that part of the data which is not yet captured by a model. Our analysis shows that shallow layers are well described by pairwise Ising models, which provide a representation of the data in terms of generic, low order features. We also show that plasticity increases with depth, in a similar way as it does in the brain. These findings suggest that DBNs are capable of extracting a hierarchy of features from the data which is consistent with the principle of maximal relevance.
- [76] arXiv:2407.12897 (replaced) [pdf, html, other]
-
Title: Generative models of MRI-derived neuroimaging features and associated dataset of 18,000 samples
Sai Spandana Chintapalli, Rongguang Wang, Zhijian Yang, Vasiliki Tassopoulou, Fanyang Yu, Vishnu Bashyam, Guray Erus, Pratik Chaudhari, Haochang Shou, Christos Davatzikos
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Availability of large and diverse medical datasets is often challenged by privacy and data sharing restrictions. For successful application of machine learning techniques for disease diagnosis, prognosis, and precision medicine, large amounts of data are necessary for model building and optimization. To help overcome such limitations in the context of brain MRI, we present GenMIND: a collection of generative models of normative regional volumetric features derived from structural brain imaging. GenMIND models are trained on real brain imaging regional volumetric measures from the iSTAGING consortium, which encompasses over 40,000 MRI scans across 13 studies, incorporating covariates such as age, sex, and race. Leveraging GenMIND, we produce and offer 18,000 synthetic samples spanning the adult lifespan (ages 22-90 years), alongside the model's capability to generate unlimited data. Experimental results indicate that samples generated from GenMIND agree with the distributions obtained from real data. Most importantly, the generated normative data significantly enhance the accuracy of downstream machine learning models on tasks such as disease classification. Data and models are available at: this https URL.
- [77] arXiv:2407.14021 (replaced) [pdf, html, other]
-
Title: GE2E-AC: Generalized End-to-End Loss Training for Accent Classification
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Machine Learning (stat.ML)
Accent classification (AC) is the task of predicting the accent type of an input utterance, and it can be used as a preliminary step toward accented speech recognition and accent conversion. Existing studies have often achieved such classification by training a neural network model to minimize the classification error of the predicted accent label, which can be obtained as a model output. Since this approach optimizes the entire model only from the perspective of classification loss during training, the model might learn to predict the accent type from irrelevant features, such as individual speaker identity, which are not informative at test time. To address this problem, we propose GE2E-AC, in which we train a model to extract an accent embedding (AE) of an input utterance such that the AEs of the same accent class get closer, instead of directly minimizing the classification loss. We experimentally show the effectiveness of the proposed GE2E-AC, compared to the baseline model trained with the conventional cross-entropy-based loss.
- [78] arXiv:2408.05622 (replaced) [pdf, html, other]
-
Title: More Skin, More Likes! Measuring Child Exposure and User Engagement on TikTok
Subjects: Computers and Society (cs.CY); Applications (stat.AP)
Sharenting, the practice of parents sharing content about their children on social media platforms, has become increasingly common, raising concerns about children's privacy and safety online. This study investigates children's exposure on TikTok, offering a detailed examination of the platform's content and associated comments. Analyzing 432,178 comments across 5,896 videos from 115 user accounts featuring children, we categorize content into Family, Fashion, and Sports. Our analysis highlights potential risks, such as inappropriate comments or contact offers, with a focus on appearance-based comments. Notably, 21% of comments relate to visual appearance. Additionally, 19.57% of videos depict children in revealing clothing, such as swimwear or bare midriffs, attracting significantly more appearance-based comments and likes than videos featuring fully clothed children, although this trend does not extend to downloads. These findings underscore the need for heightened awareness and protective measures to safeguard children's privacy and well-being in the digital age.
- [79] arXiv:2409.10773 (replaced) [pdf, html, other]
-
Title: Tight Lower Bounds under Asymmetric High-Order Hölder Smoothness and Uniform Convexity
Subjects: Optimization and Control (math.OC); Machine Learning (stat.ML)
In this paper, we provide tight lower bounds for the oracle complexity of minimizing high-order Hölder smooth and uniformly convex functions. Specifically, for a function whose $p^{th}$-order derivatives are Hölder continuous with degree $\nu$ and parameter $H$, and that is uniformly convex with degree $q$ and parameter $\sigma$, we focus on two asymmetric cases: (1) $q > p + \nu$, and (2) $q < p+\nu$. Given up to $p^{th}$-order oracle access, we establish worst-case oracle complexities of $\Omega\left( \left( \frac{H}{\sigma}\right)^\frac{2}{3(p+\nu)-2}\left( \frac{\sigma}{\epsilon}\right)^\frac{2(q-p-\nu)}{q(3(p+\nu)-2)}\right)$ in the first case with an $\ell_\infty$-ball-truncated-Gaussian smoothed hard function and $\Omega\left(\left(\frac{H}{\sigma}\right)^\frac{2}{3(p+\nu)-2}+ \log^2\left(\frac{\sigma^{p+\nu}}{H^q}\right)^\frac{1}{p+\nu-q}\right)$ in the second case, for reaching an $\epsilon$-approximate solution in terms of the optimality gap. Our analysis generalizes previous lower bounds for functions under first- and second-order smoothness as well as those for uniformly convex functions, and furthermore our results match the corresponding upper bounds in the general setting.
- [80] arXiv:2409.19422 (replaced) [pdf, html, other]
-
Title: Identifiable Shared Component Analysis of Unpaired Multimodal Mixtures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
A core task in multi-modal learning is to integrate information from multiple feature spaces (e.g., text and audio), offering modality-invariant essential representations of data. Recent research showed that classical tools such as canonical correlation analysis (CCA) provably identify the shared components up to minor ambiguities, when samples in each modality are generated from a linear mixture of shared and private components. Such identifiability results were obtained under the condition that the cross-modality samples are aligned/paired according to their shared information. This work takes a step further, investigating shared component identifiability from multi-modal linear mixtures where cross-modality samples are unaligned. A distribution divergence minimization-based loss is proposed, under which a suite of sufficient conditions ensuring identifiability of the shared components is derived. Our conditions are based on cross-modality distribution discrepancy characterization and density-preserving transform removal, and are much milder than those of existing studies relying on independent component analysis. More relaxed conditions are also provided via adding reasonable structural constraints, motivated by available side information in various applications. The identifiability claims are thoroughly validated using synthetic and real-world data.
- [81] arXiv:2409.19546 (replaced) [pdf, html, other]
-
Title: Almost Sure Convergence of Average Reward Temporal Difference Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Tabular average reward Temporal Difference (TD) learning is perhaps the simplest and the most fundamental policy evaluation algorithm in average reward reinforcement learning. At least 25 years after its discovery, we are finally able to provide a long-awaited almost sure convergence analysis. Namely, we are the first to prove that, under very mild conditions, tabular average reward TD converges almost surely to a sample path dependent fixed point. Key to this success is a new general stochastic approximation result concerning nonexpansive mappings with Markovian and additive noise, built on recent advances in stochastic Krasnoselskii-Mann iterations.
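For readers unfamiliar with the algorithm, the following is a minimal sketch of tabular average reward TD for policy evaluation on a small Markov reward process. It follows the commonly stated update form (a running average-reward estimate plus a differential value table); the function name `average_reward_td`, the step-size schedules and the toy chain are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def average_reward_td(P, r, n_steps=50_000, seed=0):
    """Tabular average reward TD on a Markov reward process (sketch).

    P: (n, n) row-stochastic transition matrix under the evaluated policy
    r: (n,) reward received when leaving each state
    Returns the average-reward estimate J and differential values v.
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    cum = np.cumsum(P, axis=1)        # cumulative rows for fast sampling
    v = np.zeros(n)                   # differential value estimates
    J = 0.0                           # average-reward estimate
    visits = np.zeros(n)
    s = 0
    for t in range(n_steps):
        # Sample the next state; clamp guards against round-off in cum.
        s_next = min(int(np.searchsorted(cum[s], rng.random())), n - 1)
        reward = r[s]
        visits[s] += 1
        alpha = 1.0 / visits[s]       # assumed per-state step size
        beta = 1.0 / (t + 1)          # assumed step size for J
        td_error = reward - J + v[s_next] - v[s]
        v[s] += alpha * td_error      # differential TD update
        J += beta * (reward - J)      # running average of rewards
        s = s_next
    return J, v
```

On a two-state chain with `P = [[0.9, 0.1], [0.5, 0.5]]` and `r = [1, 0]`, the stationary distribution puts mass 5/6 on the first state, so the estimate `J` should settle near 5/6. Differential values solve the evaluation equation only up to an additive constant, so individual entries of `v` should not be read as unique.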