The problem of approximating an interval null or imprecise hypothesis test by a point null or precise hypothesis test under a Bayesian framework is considered. In the literature, some of the methods for solving this problem have used the Bayes factor for testing a point null and justified it as an approximation to the interval null. However, many authors recommend evaluating tests through the posterior odds, a Bayesian measure of evidence against the null hypothesis. It is of interest then to determine whether similar results hold when using the posterior odds as the primary measure of evidence. For the prior distributions under which the approximation holds with respect to the Bayes factor, it is shown that the posterior odds for testing the point null hypothesis do not approximate the posterior odds for testing the interval null hypothesis. In fact, in order to obtain convergence of the posterior odds, a number of restrictive conditions need to be placed on the prior structure. ...
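As a point of reference, the two quantities being compared can be written as follows (our notation, not necessarily the paper's):

```latex
% Point null H_0: \theta = \theta_0, with prior mass \pi_0 on \theta_0 and
% prior density g(\theta) under the alternative:
\mathrm{BF}_{01} = \frac{f(x \mid \theta_0)}{\int f(x \mid \theta)\, g(\theta)\, d\theta},
\qquad
\frac{P(H_0 \mid x)}{P(H_1 \mid x)} = \frac{\pi_0}{1 - \pi_0}\,\mathrm{BF}_{01}.

% Interval null H_0: \theta \in [\theta_0 - \varepsilon,\, \theta_0 + \varepsilon],
% with posterior density \pi(\theta \mid x):
\frac{P(H_0 \mid x)}{P(H_1 \mid x)}
  = \frac{\int_{\theta_0 - \varepsilon}^{\theta_0 + \varepsilon} \pi(\theta \mid x)\, d\theta}
         {\,1 - \int_{\theta_0 - \varepsilon}^{\theta_0 + \varepsilon} \pi(\theta \mid x)\, d\theta\,}.
```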
While genome-wide association studies (GWAS) have discovered thousands of risk loci for heritable disorders, so far even very large meta-analyses have recovered only a fraction of the heritability of most complex traits. Recent work utilizing variance components models has demonstrated that a larger fraction of the heritability of complex phenotypes is captured by the additive effects of SNPs than is evident only in loci surpassing genome-wide significance thresholds, typically set at a Bonferroni-inspired p ≤ 5 × 10⁻⁸. Procedures that control false discovery rate can be more powerful, yet these are still underpowered to detect the majority of non-null effects from GWAS. The current work proposes a novel Bayesian semi-parametric two-group mixture model and develops a Markov Chain Monte Carlo (MCMC) algorithm for a covariate-modulated local false discovery rate (cmfdr). The probability of being non-null depends on a set of covariates via a logistic function, and the non-null distrib...
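A generic form of such a covariate-modulated two-group mixture looks like the following (our notation; in the paper the non-null component is modeled semi-parametrically):

```latex
% z_i: test statistic for SNP i;  x_i: its covariates;
% f_0: null density;  f_1: non-null density.
f(z_i \mid x_i) = \pi_0(x_i)\, f_0(z_i) + \bigl(1 - \pi_0(x_i)\bigr)\, f_1(z_i),
\qquad
\operatorname{logit}\bigl(1 - \pi_0(x_i)\bigr) = x_i^{\top}\beta,

% covariate-modulated local false discovery rate:
\mathrm{cmfdr}(z_i, x_i) = \frac{\pi_0(x_i)\, f_0(z_i)}{f(z_i \mid x_i)}.
```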
Propensity score methods account for selection bias in observational studies. However, the consistency of the propensity score estimators strongly depends on a correct specification of the propensity score model. Logistic regression and, with increasing popularity, machine learning tools are used to estimate propensity scores. We introduce a stacked generalization ensemble learning approach to improve propensity score estimation by fitting a meta learner on the predictions of a suitable set of diverse base learners. We perform a comprehensive Monte Carlo simulation study, implementing a broad range of scenarios that mimic characteristics of typical data sets in educational studies. The population average treatment effect is estimated using the propensity score in Inverse Probability of Treatment Weighting. Our proposed stacked ensembles, especially using gradient boosting machines as a meta learner trained on a set of 12 base learner predictions, led to superior reduction of bias co...
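A minimal sketch of the general recipe follows; the two base learners, their tuning, and the weight trimming are illustrative stand-ins, not the paper's 12-learner library:

```python
# Sketch: stacked propensity-score estimation + IPTW ATE.
# Illustrative stand-ins only -- not the paper's exact learner set or tuning.
import numpy as np
from sklearn.ensemble import (StackingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression

def stacked_iptw_ate(X, treatment, y):
    """Estimate the population ATE by inverse probability of treatment
    weighting, with propensity scores from a stacked ensemble.
    treatment is a 0/1 indicator array; y is the outcome."""
    base_learners = [
        ("logit", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
    ]
    # Gradient boosting as the meta learner, trained (via cross-validation)
    # on the base learners' predicted probabilities.
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=GradientBoostingClassifier(random_state=0),
        stack_method="predict_proba",
        cv=5,
    )
    e = stack.fit(X, treatment).predict_proba(X)[:, 1]  # propensity scores
    e = np.clip(e, 0.01, 0.99)                          # trim extreme weights
    w = treatment / e + (1 - treatment) / (1 - e)       # IPTW weights
    ate = (np.sum(w * treatment * y) / np.sum(w * treatment)
           - np.sum(w * (1 - treatment) * y) / np.sum(w * (1 - treatment)))
    return ate
```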
How much does the far future matter? This question lies at the heart of many important environmental policy issues such as global climate change, biodiversity loss, and the disposal of radioactive waste. While philosophers, experts, and others offer their viewpoints on this deep question, the solution to many environmental problems lies in the willingness of the public to bear significant costs now in order to make the far future a better place. Short of national plebiscites, the only way to assess the public's willingness to mitigate impacts in the far future is to ask them. Using a unique set of survey data in which respondents were provided with sets of scenarios describing different amounts of forest loss due to climate change, along with associated mitigation methods and costs, we can infer their willingness to bear additional costs to mitigate future ecological impacts of climate change. The survey also varied the timing of the impacts, which allows us to assess how the willing...
We congratulate the authors on a review of convergence rates for Gibbs sampling routines. Their combined work on studying convergence rates via orthogonal polynomials in the present paper under ...
Random forests are presented as an analytics foundation for educational data mining tasks. The focus is on course- and program-level analytics, including evaluating pedagogical approaches and interventions and identifying and characterizing at-risk students. As part of this development, the concept of individualized treatment effects (ITE) is introduced as a method to provide personalized feedback to students. The ITE quantifies the effectiveness of intervention and/or instructional regimes for a particular student based on institutional student information and performance data. The proposed random forest framework and methods are illustrated on a study of the efficacy of a supplemental, weekly, one-unit problem-solving session in a large enrollment, bottleneck introductory statistics course. The analytics tools are used to identify factors for student success, characterize students benefitting from the supplemental instruction section, develop an objective criterion to, at the...
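One common way to estimate an ITE with random forests is a two-model "T-learner"; the sketch below uses that construction, which may differ from the paper's exact approach:

```python
# Sketch: individualized treatment effect via two random forests ("T-learner").
# Illustrative only -- the paper's exact ITE construction may differ.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ite_t_learner(X, treated, y, X_new):
    """Fit separate forests to treated and control students and return the
    predicted outcome difference (the estimated ITE) for each row of X_new."""
    rf_t = RandomForestRegressor(n_estimators=500, random_state=0)
    rf_c = RandomForestRegressor(n_estimators=500, random_state=0)
    rf_t.fit(X[treated == 1], y[treated == 1])   # e.g., attended the session
    rf_c.fit(X[treated == 0], y[treated == 0])   # did not attend
    return rf_t.predict(X_new) - rf_c.predict(X_new)
```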
We propose an Iterative Nonlinear Gaussianization Algorithm (INGA) which seeks a nonlinear map from a set of dependent random variables to independent Gaussian random variables. A direct motivation of INGA is to extend principal component analysis (PCA), which transforms a set of correlated random variables into uncorrelated (independent up to second order) random variables, and Independent Component Analysis (ICA), which linearly transforms random variables into variates that are "as independent as possible." A modified INGA is then proposed to nonlinearly transform ICA coefficients into statistically independent components. To quantify the performance of each algorithm: PCA, ICA, INGA, and modified INGA, we study the Edgeworth Kullback-Leibler Distance (EKLD) which serves to measure the "distance" between two distributions in multi-dimensions. Several examples are presented to demonstrate the superior performance of INGA (and its modified version) in situations where PCA and ...
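For intuition, one generic iterative-Gaussianization step (marginal Gaussianization through empirical CDFs, followed by a linear rotation) can be sketched as below; this is a simplification for illustration, not the authors' exact INGA:

```python
# Sketch: one generic iterative-Gaussianization step. A simplification for
# intuition, not the authors' exact INGA.
import numpy as np
from scipy.stats import norm

def gaussianize_step(X):
    """X: (n_samples, n_dims) array. Returns data after one iteration:
    each margin mapped to standard normal, then a PCA rotation."""
    n, d = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(d):
        # Empirical CDF -> uniform -> standard normal, one margin at a time.
        ranks = np.argsort(np.argsort(X[:, j]))
        u = (ranks + 0.5) / n
        Z[:, j] = norm.ppf(u)
    # Linear decorrelation (PCA rotation) so the next iteration can attack
    # the remaining higher-order dependence.
    Z -= Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt.T

# Iterating gaussianize_step drives the joint distribution toward an
# independent multivariate Gaussian.
```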
Higher education institutions often examine performance discrepancies of specific subgroups, such as students from underrepresented minority and first-generation backgrounds. An increase in educational technology and computational power has promoted research interest in using data mining tools to help identify groups of students who are academically at-risk. Institutions can then implement data-informed decisions to help promote student access, increase retention and graduation rates, and guide intervention programs. We introduce a latent class forest, a latent class analysis and random forest ensemble that will recursively partition observations into groups to help identify at-risk students. The procedure is a form of model-based hierarchical clustering that relies on latent class trees to optimally identify subgroups. We motivate and apply our latent class forest method to identify key demographic and academic characteristics of at-risk students in a large enrollment, bottleneck...
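A heavily simplified view of the recursive partitioning idea follows, with a two-component Gaussian mixture standing in for the latent class model at each split; the authors' latent class trees operate on categorical indicators and differ in detail:

```python
# Sketch: recursive model-based partitioning. A 2-component Gaussian mixture
# stands in for the latent class model; heavily simplified relative to the
# authors' latent class trees.
import numpy as np
from sklearn.mixture import GaussianMixture

def recursive_partition(X, depth=0, max_depth=3, min_size=50):
    """Recursively split rows of X into latent subgroups; returns leaf labels."""
    labels = np.zeros(len(X), dtype=object)
    if depth >= max_depth or len(X) < min_size:
        labels[:] = "leaf"
        return labels
    gm = GaussianMixture(n_components=2, random_state=0).fit(X)
    z = gm.predict(X)  # most likely latent class per observation
    for k in (0, 1):
        idx = z == k
        sub = recursive_partition(X[idx], depth + 1, max_depth, min_size)
        labels[idx] = [f"{k}/{s}" for s in sub]  # path through the tree
    return labels
```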
The passer rating system (PRS) in the National Football League is designed to quantify a quarterback's efficiency across seasons and careers. While many fans are aware of the measure, few truly know how the formula was designed or is calculated. This paper analyzes the existing PRS and proposes a similar efficiency rating system for running backs. The PRS and rusher rating system (RRS) are then analyzed with player data by both season and career. Finally, adjusted ratings for seasons and careers are examined, based on differences in yearly averages for both rating systems.
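For reference, the NFL passer rating is computed from four per-attempt components, each capped between 0 and 2.375:

```python
# The NFL passer rating formula: four per-attempt components, each clamped
# to [0, 2.375], averaged, and scaled to a 0-158.3 range.
def passer_rating(completions, attempts, yards, touchdowns, interceptions):
    def clamp(v):
        return max(0.0, min(v, 2.375))
    a = clamp((completions / attempts - 0.3) * 5)      # completion percentage
    b = clamp((yards / attempts - 3) * 0.25)           # yards per attempt
    c = clamp(touchdowns / attempts * 20)              # touchdown rate
    d = clamp(2.375 - interceptions / attempts * 25)   # interception rate
    return (a + b + c + d) / 6 * 100

# Example: passer_rating(25, 30, 375, 4, 0) returns 158.3, the maximum
# ("perfect") rating.
```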
Evaluation of syntheses or simulated data is often done subjectively through visual comparisons with the original samples. This subjective evaluation is particularly dominant in the area of texture modeling and simulation. In order to objectively evaluate the similarity (or difference) between original samples and syntheses, we propose an approximation for the Kullback-Leibler distance based on Edgeworth expansions (EKLD). We use this approximation to study the sampling distribution of the original and synthesized images. As part of our development, we present numerical examples to study the behavior of EKLD for sample mean distributions and illustrate the advantages of our approach for evaluating the differential entropy and choosing the least statistically dependent basis from wavelet packet dictionaries. Finally, we introduce how to use EKLD in statistical image processing to validate synthetic representations of images.
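The quantity being approximated is the standard Kullback-Leibler distance; the EKLD replaces the unknown densities with cumulant-based Edgeworth expansions around a Gaussian, whose exact terms we omit here:

```latex
% Kullback-Leibler distance between densities f (original) and g (synthesis):
D(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx,
\qquad
D(f \,\|\, g) \ge 0, \quad D(f \,\|\, g) = 0 \iff f = g \ \text{a.e.}
```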