Journal Description
Stats
Stats is an international, peer-reviewed, open access journal on statistical science, published quarterly online by MDPI. The journal focuses on methodological and theoretical papers in statistics, probability, and stochastic processes, as well as innovative applications of statistics across all scientific disciplines, including the biological and biomedical sciences, medicine, business, economics, the social sciences, physics, data science, and engineering.
- Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within ESCI (Web of Science), Scopus, RePEc, and other databases.
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 19 days after submission; acceptance to publication takes 2.2 days (median values for papers published in this journal in the first half of 2024).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Impact Factor: 0.9 (2023); 5-Year Impact Factor: 1.0 (2023)
Latest Articles
Estimating Time-to-Death and Determining Risk Predictors for Heart Failure Patients: Bayesian AFT Shared Frailty Models with the INLA Method
Stats 2024, 7(3), 1066-1083; https://doi.org/10.3390/stats7030063 - 23 Sep 2024
Abstract
Heart failure is a major global health concern, especially in Ethiopia. Numerous studies have analyzed heart failure data to inform decision-making, but these often struggle to accurately capture death dynamics and to account for within-cluster dependence and heterogeneity. Addressing these limitations, this study aims to incorporate dependence and analyze heart failure data to estimate survival time and identify risk factors affecting patient survival. The data, obtained from 497 patients at Jimma University Medical Center in Ethiopia, were collected between July 2015 and January 2019. Residence was considered as the clustering factor in the analysis. We employed Bayesian accelerated failure time (AFT) and Bayesian AFT shared gamma frailty models, comparing their performance using the Deviance Information Criterion (DIC) and the Watanabe–Akaike Information Criterion (WAIC). The Bayesian log-normal AFT shared gamma frailty model had the lowest DIC and WAIC, capturing well the cluster dependency attributed to unobserved heterogeneity between patient residences. Unlike other methods that use Markov chain Monte Carlo (MCMC), we applied the Integrated Nested Laplace Approximation (INLA) to reduce the computational load. The study found that 39.44% of patients died, while 60.56% were censored, with a median survival time of 34 months. Another notable finding is that adding frailty to the Bayesian AFT models improved their fit to the heart failure dataset. Significant factors reducing survival time included age, chronic kidney disease, heart failure history, diabetes, heart failure etiology, hypertension, anemia, smoking, and heart failure stage.
Full article
(This article belongs to the Section Survival Analysis)
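As a rough illustration of the shared-frailty idea above, the sketch below simulates log-normal AFT survival times in which patients from the same residence cluster share a common log-frailty term; the clustering then shows up as between-cluster variance in mean log survival times. All parameter values and the single uniform covariate are invented for illustration and are not taken from the paper.

```python
import math
import random

random.seed(42)

def simulate_aft(n_clusters=50, per_cluster=10, beta0=3.0, beta1=-0.5,
                 sigma_frailty=0.8, sigma_noise=0.5):
    """Simulate log-normal AFT survival times with a shared (cluster-level)
    log-frailty b_j ~ N(0, sigma_frailty^2), one frailty per residence."""
    times, clusters = [], []
    for j in range(n_clusters):
        b_j = random.gauss(0.0, sigma_frailty)   # shared frailty for cluster j
        for _ in range(per_cluster):
            x = random.random()                  # a generic covariate in [0, 1)
            log_t = beta0 + beta1 * x + b_j + random.gauss(0.0, sigma_noise)
            times.append(math.exp(log_t))
            clusters.append(j)
    return times, clusters

times, clusters = simulate_aft()

# Between-cluster variance of mean log-times reflects the frailty heterogeneity
log_times = [math.log(t) for t in times]
by_cluster = {}
for lt, j in zip(log_times, clusters):
    by_cluster.setdefault(j, []).append(lt)
cluster_means = [sum(v) / len(v) for v in by_cluster.values()]
grand = sum(cluster_means) / len(cluster_means)
var_between = sum((m - grand) ** 2 for m in cluster_means) / (len(cluster_means) - 1)
print(round(var_between, 3))
```

If the frailty standard deviation is set to zero, the between-cluster variance collapses toward the small value expected from within-cluster noise alone, which is the dependence the frailty term is there to capture.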
Open Access Article
Direct and Indirect Effects of Environmental and Socio-Economic Factors on COVID-19 in Africa Using Structural Equation Modeling
by
Bissilimou Rachidatou Orounla, Ayédèguè Eustache Alaye, Kolawolé Valère Salako, Codjo Emile Agbangba, Justice Moses K. Aheto and Romain Glèlè Kakaï
Stats 2024, 7(3), 1051-1065; https://doi.org/10.3390/stats7030062 - 19 Sep 2024
Abstract
Understanding the direct and indirect relationships between environmental, socio-economic, and climate variables and the dynamics of epidemics is key to guiding targeted public health policy and interventions. This study investigates the direct and indirect effects of environmental and socio-economic factors on COVID-19 dynamics in Africa (54 African countries from 2019 to 2021) using a structural equation modeling (SEM) approach. Specifically, the study aimed to (i) assess the performance of two SEM estimation methods (LISREL and PLS-SEM) in relation to sample size (100, 200, 500, and 1000) and level of model complexity (none, two, and four indirect effects) and (ii) use the better-performing SEM estimation method to examine the direct and indirect effects of factors influencing the number of COVID-19 cases and deaths in Africa. The results highlight a positive spatial correlation between factors such as temperature, humidity, age, and the proportion of people aged over 65 and COVID-19 incidence. Controlling for confounding factors, LISREL proved the better-performing method, identifying climate, demographic, and economic factors as the main determinants of COVID-19 dynamics. These factors have a direct and significant impact on COVID-19 incidence. An indirect relationship was also observed between economic factors and COVID-19 incidence through air pollutants. The results highlight the importance of considering these factors in understanding the spread of the virus to avoid further disasters.
Full article
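The direct/indirect decomposition at the heart of SEM can be illustrated with a minimal single-mediator path model, estimated here by ordinary least squares rather than by LISREL or PLS-SEM; the indirect effect is the product of the two path coefficients. The coefficients and sample size below are invented for the sketch.

```python
import random
random.seed(1)

# Simulate a mediation structure X -> M -> Y with an additional direct path X -> Y
n = 2000
a_true, b_true, c_true = 0.6, 0.8, 0.3
X = [random.gauss(0, 1) for _ in range(n)]
M = [a_true * x + random.gauss(0, 1) for x in X]
Y = [c_true * x + b_true * m + random.gauss(0, 1) for x, m in zip(X, M)]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / (len(u) - 1)

# Path a: simple regression of M on X
a_hat = cov(X, M) / cov(X, X)

# Paths b (M -> Y) and c' (direct X -> Y): two-predictor OLS via normal equations
sxx, smm, sxm = cov(X, X), cov(M, M), cov(X, M)
sxy, smy = cov(X, Y), cov(M, Y)
det = sxx * smm - sxm ** 2
c_hat = (smm * sxy - sxm * smy) / det   # direct effect of X on Y
b_hat = (sxx * smy - sxm * sxy) / det   # effect of M on Y

indirect = a_hat * b_hat                # product-of-coefficients indirect effect
print(round(indirect, 3), round(c_hat, 3))
```

The estimated indirect effect should land near a_true * b_true = 0.48 and the direct effect near 0.3, which is the same decomposition SEM software reports for more elaborate path diagrams.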
Open Access Article
Copula Approximate Bayesian Computation Using Distribution Random Forests
by
George Karabatsos
Stats 2024, 7(3), 1002-1050; https://doi.org/10.3390/stats7030061 - 17 Sep 2024
Abstract
Ongoing modern computational advancements continue to make it easier to collect increasingly large and complex datasets, which can often only be realistically analyzed using models defined by intractable likelihood functions. This Stats invited feature article introduces and provides an extensive simulation study of a new approximate Bayesian computation (ABC) framework for estimating the posterior distribution and the maximum likelihood estimate (MLE) of the parameters of models defined by intractable likelihoods, which unifies and extends previously proposed ABC methods. This framework, copulaABCdrf, aims to accurately estimate and describe the possibly skewed and high-dimensional posterior distribution by a novel multivariate copula-based meta-t distribution based on univariate marginal posterior distributions that can be accurately estimated by distribution random forests (drf), while performing automatic summary statistics (covariates) selection, based on robustly estimated copula dependence parameters. The copulaABCdrf framework also provides a novel multivariate mode estimator to perform MLE and posterior mode estimation and an optional step to perform model selection from a given set of models using posterior probabilities estimated by drf. The posterior distribution estimation accuracy of the ABC framework is illustrated and compared with previous standard ABC methods through several simulation studies involving low- and high-dimensional models with computable posterior distributions, which are either unimodal, skewed, or multimodal; and exponential random graph and mechanistic network models, each defined by an intractable likelihood from which it is costly to simulate large network datasets.
This paper also proposes and studies a new solution to the simulation cost problem in ABC involving the posterior estimation of parameters from datasets simulated from the given model that are smaller compared to the potentially large size of the dataset being analyzed. This proposal is motivated by the fact that, for many models defined by intractable likelihoods, such as the network models when they are applied to analyze massive networks, the repeated simulation of large datasets (networks) for posterior-based parameter estimation can be too computationally costly and vastly slow down or prohibit the use of standard ABC methods. The copulaABCdrf framework and standard ABC methods are further illustrated through analyses of large real-life networks of sizes ranging between 28,000 and 65.6 million nodes (between 3 million and 1.8 billion edges), including a large multilayer network with weighted directed edges. The results of the simulation studies show that, in settings where the true posterior distribution is not highly multimodal, copulaABCdrf usually produced similar point estimates from the posterior distribution for low-dimensional parametric models as previous ABC methods, but the copula-based method can produce more accurate estimates from the posterior distribution for high-dimensional models, and, in both dimensionality cases, usually produced more accurate estimates of univariate marginal posterior distributions of parameters. Also, posterior estimation accuracy was usually improved when pre-selecting the important summary statistics using drf compared to ABC employing no pre-selection of the subset of important summaries. For all ABC methods studied, accurate estimation of a highly multimodal posterior distribution was challenging. In light of the results of all the simulation studies, this article concludes by discussing how the copulaABCdrf framework can be improved for future research.
Full article
(This article belongs to the Section Bayesian Methods)
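For readers unfamiliar with ABC, the basic rejection sampler that frameworks such as copulaABCdrf build on can be sketched in a few lines. This toy example uses a normal model with a tractable likelihood (so the answer can be checked), a uniform prior, and the sample mean as the lone summary statistic; none of this reflects the paper's actual drf or copula machinery.

```python
import random
random.seed(0)

# "Observed" data from a normal model with unknown mean theta (sd known = 1)
theta_true = 2.0
obs = [random.gauss(theta_true, 1.0) for _ in range(100)]
s_obs = sum(obs) / len(obs)   # summary statistic: the sample mean

def simulate(theta, n=100):
    """Forward-simulate a dataset from the model; ABC never evaluates the likelihood."""
    return [random.gauss(theta, 1.0) for _ in range(n)]

accepted = []
eps = 0.05
for _ in range(10000):
    theta = random.uniform(-5, 5)            # draw a candidate from the prior
    s_sim = sum(simulate(theta)) / 100       # summary of the simulated dataset
    if abs(s_sim - s_obs) < eps:             # keep draws whose summary is close
        accepted.append(theta)

post_mean = sum(accepted) / len(accepted)
print(len(accepted), round(post_mean, 2))
```

The accepted draws approximate the posterior, and their mean should sit near the true value of 2; the paper's contribution is, among other things, replacing this crude accept/reject step with drf-estimated marginals glued together by a copula.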
Open Access Article
Factor Analysis of Ordinal Items: Old Questions, Modern Solutions?
by
João Marôco
Stats 2024, 7(3), 984-1001; https://doi.org/10.3390/stats7030060 - 16 Sep 2024
Abstract
Factor analysis, a staple of correlational psychology, faces challenges with ordinal variables like Likert scales. The validity of traditional methods, particularly maximum likelihood (ML), is debated. Newer approaches, like using polychoric correlation matrices with weighted least squares estimators (WLS), offer solutions. This paper compares maximum likelihood estimation (MLE) with WLS for ordinal variables. While WLS on polychoric correlations generally outperforms MLE on Pearson correlations, especially with non-bell-shaped distributions, it may yield artefactual estimates with severely skewed data. MLE tends to underestimate true loadings, while WLS may overestimate them. Simulations and case studies highlight the importance of item psychometric distributions. Despite advancements, MLE remains robust, underscoring the complexity of analyzing ordinal data in factor analysis. There is no one-size-fits-all approach, emphasizing the need for distributional analyses and careful consideration of data characteristics.
Full article
(This article belongs to the Section Computational Statistics)
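The attenuation problem that motivates polychoric correlations is easy to demonstrate: discretizing bivariate-normal latent variables into 5-point Likert items shrinks the Pearson correlation below the latent correlation. The thresholds and latent correlation below are arbitrary illustrative choices, not values from the paper.

```python
import random
random.seed(7)

def to_likert(z, cuts=(-1.5, -0.5, 0.5, 1.5)):
    """Discretize a latent normal value into a 5-point Likert category."""
    return sum(z > c for c in cuts) + 1

rho, n = 0.7, 5000
lat_x, lat_y, obs_x, obs_y = [], [], [], []
for _ in range(n):
    x = random.gauss(0, 1)
    y = rho * x + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)  # corr(x, y) = rho
    lat_x.append(x); lat_y.append(y)
    obs_x.append(to_likert(x)); obs_y.append(to_likert(y))

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

r_latent = pearson(lat_x, lat_y)     # close to the true rho = 0.7
r_ordinal = pearson(obs_x, obs_y)    # attenuated by the coarse categories
print(round(r_latent, 2), round(r_ordinal, 2))
```

A polychoric estimator would aim to recover the latent 0.7 from the ordinal data; the plain Pearson correlation of the categorized items systematically undershoots it, which is one root of the underestimated loadings the abstract describes.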
Open Access Article
A Statistical Methodology for Evaluating Asymmetry after Normalization with Application to Genomic Data
by
Víctor Leiva, Jimmy Corzo, Myrian E. Vergara, Raydonal Ospina and Cecilia Castro
Stats 2024, 7(3), 967-983; https://doi.org/10.3390/stats7030059 - 9 Sep 2024
Abstract
This study evaluates the symmetry of data distributions after normalization, focusing on various statistical tests, including a little-explored test named Rp. We apply normalization techniques, such as variance stabilizing transformations, to ribonucleic acid sequencing data with varying sample sizes to assess their effectiveness in achieving symmetric data distributions. Our findings reveal that while normalization generally induces symmetry, some samples retain asymmetric distributions, challenging the conventional assumption of post-normalization symmetry. The Rp test, in particular, shows superior performance when there are variations in sample size and data distribution, making it a preferred tool for assessing symmetry when applied to genomic data. This finding underscores the importance of validating symmetry assumptions during data normalization, especially in genomic data, as overlooked asymmetries can lead to potential inaccuracies in downstream analyses. We analyze postmortem lateral temporal lobe samples to explore normal aging and Alzheimer’s disease, highlighting the critical role of symmetry testing in the accurate interpretation of genomic data.
Full article
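A crude symmetry check based on sample skewness illustrates the paper's central point that normalization does not guarantee symmetry: a log transform of exponential data trades right skew for left skew rather than removing it. (This is only the standardized third moment, not the Rp test studied in the paper.)

```python
import math
import random
random.seed(3)

def skewness(v):
    """Sample skewness: standardized third central moment."""
    n = len(v)
    m = sum(v) / n
    s2 = sum((x - m) ** 2 for x in v) / n
    return sum((x - m) ** 3 for x in v) / (n * s2 ** 1.5)

n = 2000
symmetric = [random.gauss(0, 1) for _ in range(n)]
skewed = [random.expovariate(1.0) for _ in range(n)]   # exponential: skewness 2
log_skewed = [math.log(x) for x in skewed]             # log "normalization"

print(round(skewness(symmetric), 2), round(skewness(skewed), 2),
      round(skewness(log_skewed), 2))
```

The normal sample hovers near zero skewness, the exponential sample is strongly right-skewed, and the log-transformed sample comes out left-skewed — a post-normalization asymmetry that a symmetry test would flag.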
Open Access Case Report
The Integrated Violin-Box-Scatter (VBS) Plot to Visualize the Distribution of a Continuous Variable
by
David W. Gerbing
Stats 2024, 7(3), 955-966; https://doi.org/10.3390/stats7030058 - 4 Sep 2024
Abstract
The histogram remains a widely used tool for visualization of the distribution of a continuous variable, despite the disruption of binning the underlying continuity into somewhat arbitrarily sized discrete intervals imposed by the simplicity of its pre-computer origins. Alternatives include three visualizations, namely a smoothed density distribution such as a violin plot, a box plot, and the direct visualization of the individual data values as a one-dimensional scatter plot. To promote ease of use, the plotting function discussed in this work, Plot(x), automatically integrates these three visualizations of a continuous variable x into what is called a VBS plot here, tuning the resulting plot to the sample size and discreteness of the data. This integration complements the information derived from the histogram well and more easily generalizes to a multi-panel presentation at each level of a second categorical variable.
Full article
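An integrated violin-box-scatter display can be approximated with standard matplotlib primitives, assuming matplotlib is available; the automatic tuning to sample size and discreteness described for the paper's Plot(x) function is omitted from this sketch.

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import matplotlib.pyplot as plt
import random

random.seed(11)
x = [random.gauss(50, 10) for _ in range(200)]

fig, ax = plt.subplots(figsize=(6, 3))
# Layer 1: smoothed density (violin), drawn horizontally
ax.violinplot(x, positions=[0], vert=False, showextrema=False, widths=0.9)
# Layer 2: box plot overlaid on the violin
ax.boxplot(x, positions=[0], vert=False, widths=0.25, showfliers=False)
# Layer 3: one-dimensional jittered scatter of the raw values, below the box
jitter = [random.uniform(-0.35, -0.15) for _ in x]
ax.scatter(x, jitter, s=8, alpha=0.4)
ax.set_yticks([])
ax.set_xlabel("x")
fig.savefig("vbs.png", dpi=100)
print(len(ax.collections))
```

Each of the three layers answers a different question (shape, summary statistics, individual values), which is the integration the VBS plot argues for.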
Open Access Article
Weighted Empirical Likelihood for Accelerated Life Model with Various Types of Censored Data
by
Jian-Jian Ren and Yiming Lyu
Stats 2024, 7(3), 944-954; https://doi.org/10.3390/stats7030057 - 3 Sep 2024
Abstract
In the analysis of survival data, the Accelerated Life Model (ALM) is one of the most widely used semiparametric models, and we often encounter various types of censored survival data, such as right-censored data, doubly censored data, interval-censored data, and partly interval-censored data. For complicated types of censored data, statistical inference for the ALM is technically and mathematically challenging, and thus little work has been done to date. In this article, we extend the concept of weighted empirical likelihood (WEL) from the univariate case to the multivariate case and apply it to the ALM, which leads to an estimation approach, called the weighted maximum likelihood estimator, as well as a WEL-based confidence interval for the regression parameter. Our proposed procedures are applicable to various types of censored data under a unified framework, and some simulation results are presented.
Full article
(This article belongs to the Section Survival Analysis)
Open Access Article
Doubly Robust Estimation and Semiparametric Efficiency in Generalized Partially Linear Models with Missing Outcomes
by
Lu Wang, Zhongzhe Ouyang and Xihong Lin
Stats 2024, 7(3), 924-943; https://doi.org/10.3390/stats7030056 - 31 Aug 2024
Abstract
We investigate a semiparametric generalized partially linear regression model that accommodates missing outcomes, with some covariates modeled parametrically and others nonparametrically. We propose a class of augmented inverse probability weighted (AIPW) kernel–profile estimating equations. The nonparametric component is estimated using AIPW kernel estimating equations, while parametric regression coefficients are estimated using AIPW profile estimating equations. We demonstrate the doubly robust nature of the AIPW estimators for both nonparametric and parametric components. Specifically, these estimators remain consistent if either the assumed model for the probability of missing data or that for the conditional mean of the outcome, given covariates and auxiliary variables, is correctly specified, though not necessarily both simultaneously. Additionally, the AIPW profile estimator for parametric regression coefficients is consistent and asymptotically normal under the semiparametric model defined by the generalized partially linear model on complete data, assuming that the missing data mechanism is missing at random. When both working models are correctly specified, this estimator achieves semiparametric efficiency, with its asymptotic variance reaching the efficiency bound. We validate our approach through simulations to assess the finite sample performance of the proposed estimators and apply the method to a study that investigates risk factors associated with myocardial ischemia.
Full article
(This article belongs to the Special Issue Novel Semiparametric Methods)
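Double robustness is easiest to see in the simplest setting of estimating a mean with outcomes missing at random: the AIPW estimator below combines the true propensity with a deliberately wrong outcome model and still recovers the truth, while the complete-case mean is biased. All data-generating values are invented for illustration; the paper's kernel-profile machinery is far richer than this scalar sketch.

```python
import math
import random
random.seed(5)

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

n = 20000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [2.0 + 1.5 * x + random.gauss(0, 1) for x in X]   # true E[Y] = 2.0
pi = [expit(0.5 + x) for x in X]                      # P(outcome observed | X)
R = [1 if random.random() < p else 0 for p in pi]     # missingness indicator

# Complete-case mean is biased upward: large X means both observed AND large Y
cc_mean = sum(y for y, r in zip(Y, R) if r) / sum(R)

def aipw_mean(outcome_model):
    """AIPW estimate of E[Y]; consistent if EITHER the propensity pi
    or the outcome model m(X) is correctly specified (here pi is correct)."""
    total = 0.0
    for x, y, p, r in zip(X, Y, pi, R):
        total += r * y / p + (1 - r / p) * outcome_model(x)
    return total / n

est = aipw_mean(lambda x: 0.0)   # deliberately wrong outcome model m(X) = 0
print(round(cc_mean, 2), round(est, 2))
```

Swapping in a correct outcome model but a misspecified propensity would again give a consistent estimate — that "either/or" protection is the double robustness the abstract refers to.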
Open Access Article
A Dynamic Reliability Analysis for the Conditional Number of Working Components within a Structure
by
Ioannis S. Triantafyllou
Stats 2024, 7(3), 906-923; https://doi.org/10.3390/stats7030055 - 28 Aug 2024
Abstract
In the present work, we study the number of working units of a consecutive-type structure at a specific time point under the condition that the system’s failure has not been observed yet. The main results of this paper offer some closed formulae for determining the distribution of the number of working components under the aforementioned condition. Several alternatives are considered for identifying the structure of the underlying system. The numerical investigation which is carried out takes into account different distributional assumptions for the lifetime of the components of the reliability system. Some concluding remarks and comments are provided for the performance of the resulting consecutive-type design.
Full article
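For a concrete instance of the kind of closed formulae the paper works with, the reliability of a linear consecutive-k-out-of-n:F system with i.i.d. components admits a short recursion, checked here against brute-force enumeration over all component states. This is a standard textbook recursion, not necessarily the conditional-distribution formulae derived in the paper.

```python
from itertools import product

def consec_rel(n, k, p):
    """Reliability of a linear consecutive-k-out-of-n:F system with i.i.d.
    components of reliability p (system fails iff >= k consecutive failures)."""
    q = 1.0 - p
    R = {j: 1.0 for j in range(k)}   # fewer than k components: cannot fail
    R[k] = 1.0 - q ** k
    for j in range(k + 1, n + 1):
        R[j] = R[j - 1] - p * q ** k * R[j - k - 1]
    return R[n]

def brute(n, k, p):
    """Exact reliability by enumerating all 2^n component states (1 = working)."""
    q = 1.0 - p
    total = 0.0
    for states in product([0, 1], repeat=n):
        run, failed = 0, False
        for s in states:
            run = run + 1 if s == 0 else 0
            if run >= k:
                failed = True
                break
        if not failed:
            prob = 1.0
            for s in states:
                prob *= p if s else q
            total += prob
    return total

print(round(consec_rel(10, 2, 0.9), 6), round(brute(10, 2, 0.9), 6))
```

For n = 3, k = 2, p = 0.9 the system fails only if components (1,2) or (2,3) both fail, giving reliability 1 − (0.01 + 0.01 − 0.001) = 0.981, which the recursion reproduces.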
Open Access Article
Scoring Individual Moral Inclination for the CNI Test
by
Yi Chen, Benjamin Lugu, Wenchao Ma and Hyemin Han
Stats 2024, 7(3), 894-905; https://doi.org/10.3390/stats7030054 - 23 Aug 2024
Abstract
Item response theory (IRT) is a modern psychometric framework for estimating respondents’ latent traits (e.g., ability, attitude, and personality) based on their responses to a set of questions in psychological tests. The current study adopted an item response tree (IRTree) method, which combines the tree model with IRT models for handling the sequential process of responding to a test item, to score individual moral inclination for the CNI test—a broadly adopted model for examining humans’ moral decision-making with three parameters generated: sensitivity to moral norms, sensitivity to consequences, and inaction preference. Compared to previous models for the CNI test, the resulting EIRTree-CNI Model is able to generate individual scores without increasing the number of items (thus, less subject fatigue or compromised response quality) or employing a post hoc approach that is deemed statistically suboptimal. The model fits the data well, and the subsequent test also supported the concurrent validity and the predictive validity of the model. Limitations are discussed further.
Full article
Open Access Case Report
Integrating Proteomic Analysis and Machine Learning to Predict Prostate Cancer Aggressiveness
by
Sheila M. Valle Cortés, Jaileene Pérez Morales, Mariely Nieves Plaza, Darielys Maldonado, Swizel M. Tevenal Baez, Marc A. Negrón Blas, Cayetana Lazcano Etchebarne, José Feliciano, Gilberto Ruiz Deyá, Juan C. Santa Rosario and Pedro Santiago Cardona
Stats 2024, 7(3), 875-893; https://doi.org/10.3390/stats7030053 - 21 Aug 2024
Abstract
Prostate cancer (PCa) poses a significant challenge because of the difficulty in identifying aggressive tumors, leading to overtreatment and missed personalized therapies. Although only 8% of cases progress beyond the prostate, the accurate prediction of aggressiveness remains crucial. Thus, this study focused on retinoblastoma phosphorylated at Serine 249 (Phospho-Rb S249), N-cadherin, β-catenin, and E-cadherin as biomarkers for identifying aggressive PCa using a logistic regression model and a classification and regression tree (CART). Using immunohistochemistry (IHC), we targeted the expression of these biomarkers in PCa tissues and correlated their expression with clinicopathological data of the tumor. The results showed a negative correlation between E-cadherin and β-catenin with aggressive tumor behavior, whereas Phospho-Rb S249 and N-cadherin positively correlated with increased tumor aggressiveness. Furthermore, patients were stratified based on Gleason scores and E-cadherin staining patterns to evaluate their capability for early identification of aggressive PCa. Our findings suggest that the classification tree is the most effective method for measuring the utility of these biomarkers in clinical practice, incorporating β-catenin, tumor grade, and Gleason grade as relevant determinants for identifying patients with Gleason scores ≥ 4 + 3. This study could potentially benefit patients with aggressive PCa by enabling early disease detection and closer monitoring.
Full article
(This article belongs to the Section Regression Models)
Open Access Article
An Analysis of the Impact of Injury Severity on Incident Clearance Time on Urban Interstates Using a Bivariate Random-Parameter Probit Model
by
M. Ashifur Rahman, Milhan Moomen, Waseem Akhtar Khan and Julius Codjoe
Stats 2024, 7(3), 863-874; https://doi.org/10.3390/stats7030052 - 9 Aug 2024
Abstract
Incident clearance time (ICT) is impacted by several factors, including crash injury severity. The strategy of most transportation agencies is to allocate more resources and respond promptly when injuries are reported. Such a strategy should result in faster clearance of incidents, given the resources used. However, injury crashes by nature require extra time to attend to and move crash victims while restoring the highway to its capacity. This usually leads to longer incident clearance duration, despite the higher amount of resources used. This finding has been confirmed by previous studies. The implication is that the relationship between ICT and injury severity is complex as well as correlated with the possible presence of unobserved heterogeneity. This study investigated the impact of injury severity on ICT on Louisiana’s urban interstates by adopting a random-parameter bivariate modeling framework that accounts for potential correlation between injury severity and ICT, while also investigating unobserved heterogeneity in the data. The results suggest that there is a correlation between injury severity and ICT. Importantly, it was found that injury severity does not impact ICT in only one way, as suggested by most previous studies. Also, some shared factors were found to impact both injury severity and ICT. These are young drivers, truck and bus crashes, and crashes that occur during daylight. The findings from this study can contribute to an improvement in safety on Louisiana’s interstates while furthering the state’s mobility goals.
Full article
Open Access Article
Comparison of Methods for Addressing Outliers in Exploratory Factor Analysis and Impact on Accuracy of Determining the Number of Factors
by
W. Holmes Finch
Stats 2024, 7(3), 842-862; https://doi.org/10.3390/stats7030051 - 5 Aug 2024
Abstract
Exploratory factor analysis (EFA) is a very common tool used in the social sciences to identify the underlying latent structure for a set of observed measurements. A primary component of EFA practice is determining the number of factors to retain, given the sample data. A variety of methods are available for this purpose, including parallel analysis, minimum average partial, and the Chi-square difference test. Research has shown that the presence of outliers among the indicator variables can have a deleterious impact on the performance of these methods for determining the number of factors to retain. The purpose of the current simulation study was to compare the performance of several methods for dealing with outliers combined with multiple techniques for determining the number of factors to retain. Results showed that using correlation matrices produced by either the percentage bend or the heavy-tailed Student’s t-distribution, coupled with either parallel analysis or the minimum average partial, yielded the most accurate identification of the number of factors to retain. Implications of these findings for practice are discussed.
Full article
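Parallel analysis, one of the retention methods compared in the study, keeps each factor whose observed eigenvalue exceeds the corresponding mean eigenvalue from random data of the same dimensions. A minimal sketch, assuming NumPy is available and with no outlier handling (the loadings and noise level are invented):

```python
import numpy as np
rng = np.random.default_rng(0)

# Simulate 6 indicators driven by 2 latent factors plus independent noise
n, p = 500, 6
F = rng.normal(size=(n, 2))
load = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                 [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
X = F @ load.T + rng.normal(scale=0.6, size=(n, p))

# Eigenvalues of the observed correlation matrix, largest first
obs_eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X.T)))[::-1]

# Reference eigenvalues: average over correlation matrices of pure-noise data
sims = np.array([
    np.sort(np.linalg.eigvalsh(np.corrcoef(rng.normal(size=(n, p)).T)))[::-1]
    for _ in range(200)])
ref = sims.mean(axis=0)

n_retain = int(np.sum(obs_eig > ref))   # factors beating the random benchmark
print(n_retain)
```

With two genuine factors built into the simulation, the procedure retains two; the study's question is how this decision degrades when outliers contaminate the indicators and which robust correlation matrix best repairs it.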
Open Access Article
Patent Keyword Analysis Using Bayesian Zero-Inflated Model and Text Mining
by
Sunghae Jun
Stats 2024, 7(3), 827-841; https://doi.org/10.3390/stats7030050 - 3 Aug 2024
Cited by 1
Abstract
Patent keyword analysis is used to analyze the technology keywords extracted from collected patent documents for specific technological fields. Thus, various methods related to this type of analysis have been researched in the industrial engineering fields, such as technology management and new product development. To analyze the patent document data, we have to search for patents related to the target technology and preprocess them to construct the patent–keyword matrix for statistical and machine learning algorithms. In general, a patent–keyword matrix has an extreme zero-inflated problem. This is because each keyword occupies one column even if it is included in only one document among all patent documents. General zero-inflated models have a limit at which the performance of the model deteriorates when the proportion of zeros becomes extremely large. To solve this problem, we applied a Bayesian inference to a general zero-inflated model. In this paper, we propose a patent keyword analysis using a Bayesian zero-inflated model to overcome the extreme zero-inflated problem in the patent–keyword matrix. In our experiments, we collected practical patents related to digital therapeutics technology and used the patent–keyword matrix preprocessed from them. We compared the performance of our proposed method with other comparative methods. Finally, we showed the validity and improved performance of our patent keyword analysis. We expect that our research can contribute to solving the extreme zero-inflated problem that occurs not only in patent keyword analysis, but also in various text big data analyses.
Full article
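To get a sense of why zero inflation matters, the sketch below simulates a zero-inflated Poisson with 70% structural zeros and recovers both parameters by simple method-of-moments formulas; a plain Poisson with the same mean badly underpredicts the observed zero fraction. This uses moment estimation purely for illustration, not the Bayesian model proposed in the paper, and the parameter values are invented.

```python
import math
import random
random.seed(9)

# Zero-inflated Poisson: with probability pi the count is a structural zero,
# otherwise it is drawn from Poisson(lam)
pi_true, lam_true, n = 0.7, 2.0, 10000

def poisson(lam):
    """Knuth's multiplicative algorithm for a Poisson draw."""
    L, k, prod = math.exp(-lam), 0, random.random()
    while prod > L:
        k += 1
        prod *= random.random()
    return k

counts = [0 if random.random() < pi_true else poisson(lam_true) for _ in range(n)]

zero_frac = counts.count(0) / n
mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / (n - 1)

# Method of moments for ZIP: E[Y] = (1-pi)*lam, Var[Y] = (1-pi)*lam*(1+pi*lam)
lam_hat = var / mean + mean - 1
pi_hat = 1 - mean / lam_hat

# A plain Poisson with the same mean predicts far fewer zeros than observed
pois_zero = math.exp(-mean)
print(round(zero_frac, 2), round(pi_hat, 2), round(lam_hat, 2), round(pois_zero, 2))
```

The gap between the observed zero fraction (about 0.74 here) and the plain-Poisson prediction is exactly the mismatch that grows extreme in a patent–keyword matrix, where nearly every cell is zero.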
Open Access Article
Mass Conservative Time-Series GAN for Synthetic Extreme Flood-Event Generation: Impact on Probabilistic Forecasting Models
by
Divas Karimanzira
Stats 2024, 7(3), 808-826; https://doi.org/10.3390/stats7030049 - 3 Aug 2024
Abstract
The lack of data on flood events poses challenges in flood management. In this paper, we propose a novel approach to enhance flood-forecasting models by utilizing the capabilities of Generative Adversarial Networks (GANs) to generate synthetic flood events. We modified a time-series GAN by incorporating constraints related to mass conservation, energy balance, and hydraulic principles into the GAN model through appropriate regularization terms in the loss function and by using mass conservative LSTM in the generator and discriminator models. In this way, we can improve the realism and physical consistency of the generated extreme flood-event data. These constraints ensure that the synthetic flood-event data generated by the GAN adhere to fundamental hydrological principles and characteristics, enhancing the accuracy and reliability of flood-forecasting and risk-assessment applications. PCA and t-SNE are applied to provide valuable insights into the structure and distribution of the synthetic flood data, highlighting patterns, clusters, and relationships within the data. We aimed to use the generated synthetic data to supplement the original data and train probabilistic neural runoff model for forecasting multi-step ahead flood events. t-statistic was performed to compare the means of synthetic data generated by TimeGAN with the original data, and the results showed that the means of the two datasets were statistically significant at 95% level. The integration of time-series GAN-generated synthetic flood events with real data improved the robustness and accuracy of the autoencoder model, enabling more reliable predictions of extreme flood events. 
In the pilot study, the model trained on the dataset augmented with synthetic data from the time-series GAN achieved higher sixth-hour-ahead scores (NSE = 0.838, KGE = 0.908) than the model trained on the original data alone (NSE = 0.829, KGE = 0.90), indicating a 9.8% improvement in NSE for multistep-ahead predictions of extreme flood events. The integration of synthetic training datasets in probabilistic forecasting improves the model's ability to achieve a reduced Prediction Interval Normalized Average Width (PINAW) for interval forecasting, yet this enhancement comes with a trade-off in the Prediction Interval Coverage Probability (PICP).
Full article
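The NSE and KGE skill scores quoted in the abstract above are standard hydrological metrics. A minimal pure-Python sketch of both, using a made-up observed/simulated pair rather than the paper's flood data:

```python
from math import sqrt

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 minus residual variance over total variance."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def kge(obs, sim):
    """Kling-Gupta Efficiency: combines correlation (r), variability ratio
    (alpha) and bias ratio (beta) into a single distance from the ideal point."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    so = sqrt(sum((o - mo) ** 2 for o in obs) / n)
    ss = sqrt(sum((s - ms) ** 2 for s in sim) / n)
    r = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / (n * so * ss)
    alpha, beta = ss / so, ms / mo
    return 1.0 - sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = [1.0, 2.0, 3.0, 4.0, 5.0]   # illustrative "observed" discharges
sim = [1.1, 1.9, 3.2, 3.8, 5.1]   # illustrative "simulated" discharges
print(round(nse(obs, sim), 3), round(kge(obs, sim), 3))
```

Both scores equal 1 for a perfect simulation, so comparing models trained with and without synthetic data reduces to comparing their distance from 1.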
Open Access Article
The Negative Binomial INAR(1) Process under Different Thinning Processes: Can We Separate between the Different Models?
by
Dimitris Karlis, Naushad Mamode Khan and Yuvraj Sunecher
Stats 2024, 7(3), 793-807; https://doi.org/10.3390/stats7030048 - 27 Jul 2024
Abstract
The literature on discrete-valued time series is expanding very fast. Very often we see new models with properties very similar to those of existing ones. A natural question that arises is whether this multitude of similar models serves a practical purpose or is mostly of theoretical interest. In the present paper, we consider four models that have negative binomial marginal distributions and order-1 autoregressive behavior but very different generating mechanisms. We then ask whether we can distinguish between them with real data. Extensive simulations show that, while the differences are small, we can still discriminate between the models with relatively moderate sample sizes. However, the mean forecasts are expected to be almost identical for all models.
Full article
(This article belongs to the Special Issue Modern Time Series Analysis II)
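The binomial-thinning mechanism underlying INAR(1) models such as those compared above can be sketched in a few lines. The geometric innovation distribution (a negative binomial with r = 1) and all parameter values below are illustrative choices, not the paper's specification:

```python
import random

def binomial_thin(x, alpha, rng):
    """alpha ∘ x: each of the x counts survives independently with prob. alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)

def geom_innov(rng, p=0.5):
    """Number of failures before the first success: a NegBin(r = 1, p) draw."""
    k = 0
    while rng.random() >= p:
        k += 1
    return k

def simulate_inar1(n, alpha, innov, seed=1):
    """Simulate X_t = alpha ∘ X_{t-1} + eps_t with innovations drawn by innov."""
    rng = random.Random(seed)
    x = [innov(rng)]
    for _ in range(n - 1):
        x.append(binomial_thin(x[-1], alpha, rng) + innov(rng))
    return x

series = simulate_inar1(500, 0.4, geom_innov)
# Stationary mean is mu_eps / (1 - alpha); here roughly 1 / 0.6.
print(sum(series) / len(series))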
Open Access Article
Seismic Evaluation Based on Poisson Hidden Markov Models—The Case of Central and South America
by
Evangelia Georgakopoulou, Theodoros M. Tsapanos, Andreas Makrides, Emmanuel Scordilis, Alex Karagrigoriou, Alexandra Papadopoulou and Vassilios Karastathis
Stats 2024, 7(3), 777-792; https://doi.org/10.3390/stats7030047 - 23 Jul 2024
Abstract
A study of earthquake seismicity is undertaken over the areas of Central and South America, the tectonics of which are of great interest. The whole territory is divided into 10 seismic zones based on seismotectonic characteristics, as in previously published studies. The earthquakes used in the present study are extracted from the catalogs of the International Seismological Centre, cover the period 1900–2021, and are restricted to shallow depths (≤60 km) and to magnitudes above a minimum threshold. Fore- and aftershocks are removed according to Reasenberg's technique. The paper confines itself to the evaluation of earthquake occurrence probabilities in the seismic zones covering parts of Central and South America; we implement a Poisson hidden Markov model (HMM) fitted via the EM algorithm.
Full article
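The likelihood that the EM algorithm maximizes for a Poisson HMM is evaluated with the scaled forward recursion. A minimal sketch (state count, transition matrix, and Poisson rates in any usage are hypothetical, not estimates from the earthquake catalogs):

```python
from math import exp, factorial, log

def pois_pmf(k, lam):
    """Poisson probability mass function (fine for small counts)."""
    return lam ** k * exp(-lam) / factorial(k)

def hmm_loglik(counts, init, trans, lams):
    """Log-likelihood of a count sequence under a Poisson HMM,
    via the forward algorithm with per-step normalization for stability."""
    S = len(init)
    alpha = [init[s] * pois_pmf(counts[0], lams[s]) for s in range(S)]
    c = sum(alpha)
    ll, alpha = log(c), [a / c for a in alpha]
    for k in counts[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(S)) * pois_pmf(k, lams[s])
                 for s in range(S)]
        c = sum(alpha)
        ll, alpha = ll + log(c), [a / c for a in alpha]
    return ll

# Hypothetical two-state example: a quiet regime and an active regime.
counts = [2, 0, 3, 1]
print(hmm_loglik(counts, [0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]], [0.5, 3.0]))
```

With one state the recursion collapses to an ordinary Poisson log-likelihood, which is a convenient sanity check when implementing the EM fit.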
Open Access Article
Time-Varying Correlations between JSE.JO Stock Market and Its Partners Using Symmetric and Asymmetric Dynamic Conditional Correlation Models
by
Anas Eisa Abdelkreem Mohammed, Henry Mwambi and Bernard Omolo
Stats 2024, 7(3), 761-776; https://doi.org/10.3390/stats7030046 - 22 Jul 2024
Abstract
The extent of correlation or co-movement among the returns of developed and emerging stock markets remains pivotal for efficiently diversifying global portfolios. This correlation is prone to variation over time as a consequence of escalating economic interdependence fostered by international trade and financial markets. In this study, the time-varying correlation and co-movement between the JSE.JO stock market of South Africa and its developed and developing stock market partners are analyzed. The dynamic conditional correlation–exponential generalized autoregressive conditional heteroscedasticity (DCC-EGARCH) methodology is employed with different multivariate distributions to explore the time-varying correlations and volatilities between the JSE.JO stock market and its partners. Based on the conditional correlation results, the JSE.JO stock market is integrated and co-moves with its partners, and the conditional correlation for all markets exhibits time-variant behavior. The conditional volatility results show that the JSE.JO stock market behaves differently from the other markets, especially after 2015, indicating a positive sign for investors to diversify between the JSE.JO and its partners. The highest conditional volatility for all markets occurred in 2020, during the COVID-19 pandemic, representing the riskiest period, which investors should avoid due to the lack of diversification opportunities during crises.
Full article
(This article belongs to the Section Time Series Analysis)
Open Access Case Report
Parametric Estimation in Fractional Stochastic Differential Equation
by
Paramahansa Pramanik, Edward L. Boone and Ryad A. Ghanam
Stats 2024, 7(3), 745-760; https://doi.org/10.3390/stats7030045 - 20 Jul 2024
Abstract
Fractional stochastic differential equations are becoming more popular in the literature, as they can model phenomena in financial data that typical stochastic differential equation models cannot. In the formulation considered here, the Hurst parameter, H, controls the fractional order of differentiation and needs to be estimated from the data. Fortunately, the covariance structure among observations in time is easily expressed in terms of the Hurst parameter, which means that a likelihood is easily defined. This work derives the maximum likelihood estimator for H and shows that it is biased and not consistent. Simulated data are used to characterize the bias of the estimator and to build an empirical bias-correction function; a bias-corrected estimator is then proposed and studied. Via simulation, the bias-corrected estimator is shown to be minimally biased, and a simulation-based standard error is obtained and used to construct a 95% confidence interval for H. A simulation study shows that the 95% confidence intervals have decent coverage probabilities for large n. The method is then applied to S&P 500 and VIX data before and after the 2008 financial crisis.
Full article
(This article belongs to the Special Issue Novel Semiparametric Methods)
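The covariance structure that makes the likelihood in H tractable is that of fractional Gaussian noise, γ(k) = ½σ²(|k+1|^{2H} − 2|k|^{2H} + |k−1|^{2H}). A small sketch of that autocovariance and the resulting covariance matrix (the lag and H values are illustrative):

```python
def fgn_autocov(k, H, sigma2=1.0):
    """Autocovariance of fractional Gaussian noise at lag k, Hurst parameter H."""
    k = abs(k)
    return 0.5 * sigma2 * (abs(k + 1) ** (2 * H)
                           - 2 * abs(k) ** (2 * H)
                           + abs(k - 1) ** (2 * H))

def fgn_cov_matrix(n, H, sigma2=1.0):
    """n x n covariance matrix of n consecutive fGn increments; this is the
    matrix that enters the Gaussian likelihood to be maximized over H."""
    return [[fgn_autocov(i - j, H, sigma2) for j in range(n)] for i in range(n)]

# H = 0.5 recovers independent increments (ordinary Brownian motion);
# H > 0.5 gives persistent, positively correlated increments.
print(fgn_autocov(1, 0.5), fgn_autocov(1, 0.7))
```

Profiling the Gaussian log-likelihood built from this matrix over a grid of H values is one simple way to reproduce the kind of estimator whose bias the paper studies.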
Open Access Case Report
Bayesian Model Averaging and Regularized Regression as Methods for Data-Driven Model Exploration, with Practical Considerations
by
Hyemin Han
Stats 2024, 7(3), 732-744; https://doi.org/10.3390/stats7030044 - 18 Jul 2024
Abstract
Methodological experts suggest that psychological and educational researchers should employ appropriate methods for data-driven model exploration, such as Bayesian Model Averaging and regularized regression, instead of conventional hypothesis-driven testing, if they want to explore the best prediction model. I intend to discuss practical considerations regarding data-driven methods for end-user researchers without sufficient expertise in quantitative methods. I tested three data-driven methods, i.e., Bayesian Model Averaging, LASSO as a form of regularized regression, and stepwise regression, with datasets in psychology and education. I compared their performance in terms of cross-validity, indicating robustness against overfitting, across different conditions. I employed functionalities widely available via R with default settings to provide information relevant to end users without advanced statistical knowledge. The results demonstrated that LASSO showed the best performance and that Bayesian Model Averaging outperformed stepwise regression when there were many candidate predictors to explore. Based on these findings, I discuss how to use data-driven model exploration methods appropriately across different situations, from the perspective of non-expert users.
Full article
(This article belongs to the Section Data Science)
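The shrinkage behind LASSO's variable selection is the soft-thresholding operator. A minimal sketch (the coefficient values are made up, and the orthonormal-design shortcut shown in the comment is a textbook special case, not this paper's procedure):

```python
def soft_threshold(z, lam):
    """LASSO shrinkage: pull z toward zero by lam, clipping small values to 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# With an orthonormal design matrix, the LASSO solution is simply the
# soft-thresholded OLS coefficients: small ones become exactly zero,
# which is what drops weak predictors from the model.
ols = [2.5, -0.3, 0.9, -1.7]       # hypothetical OLS estimates
lasso = [soft_threshold(b, 0.5) for b in ols]
print(lasso)  # the -0.3 coefficient is zeroed out
```

Stepwise regression, by contrast, adds or drops whole predictors discretely, which is one source of the instability the comparison above highlights.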
Topics
Topic in
Entropy, Mathematics, Modelling, Stats
Interfacing Statistics, Machine Learning and Data Science from a Probabilistic Modelling Viewpoint
Topic Editors: Jürgen Pilz, Noelle I. Samia, Dirk Husmeier
Deadline: 31 December 2024
Special Issues
Special Issue in
Stats
Feature Paper Special Issue: Reinforcement Learning
Guest Editors: Wei Zhu, Sourav Sen, Keli Xiao
Deadline: 30 September 2024
Special Issue in
Stats
Statistical Learning for High-Dimensional Data
Guest Editor: Paulo Canas Rodrigues
Deadline: 30 September 2024
Special Issue in
Stats
Statistics, Analytics, and Inferences for Discrete Data
Guest Editor: Dungang Liu
Deadline: 30 November 2024
Special Issue in
Stats
Integrative Approaches in Statistical Modeling and Machine Learning for Data Analytics and Data Mining
Guest Editors: Victor Leiva, Cecília Castro
Deadline: 31 January 2025