The U.S. Census Bureau (USCB) assists the federal government in distributing approximately $400 billion of aid by providing a complete ranking of the states according to certain criteria, such as average poverty level. It is imperative that this ranking be as accurate as possible in order to ensure the fairness of the allocation of funds. Currently, the USCB ranks states based on point estimates of their true poverty level. Dr. Klein and Dr. Wright of the USCB have compared the performance of this method against more sophisticated procedures in simulation trials, but have found that they do not consistently outperform the existing method. We investigate this phenomenon by revisiting some of these procedures, and we expand on this work to produce new ranking algorithms. We utilize parallel programming to expedite Dr. Klein's procedures. In addition, we specify two new prior distributions on the population means, using previous years' census data as well as regression. We dis...
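The existing method amounts to ordering states by their point estimates. The following R sketch contrasts that with a ranking based on posterior means under a normal prior; all state labels, estimates, standard errors, and prior values are invented for illustration, and this is not the USCB's actual procedure.

    # Hypothetical point estimates and standard errors for five states
    est <- c(A = 12.1, B = 14.3, C = 11.8, D = 14.9, E = 13.0)  # poverty rates (%)
    se  <- c(A = 0.9,  B = 0.4,  C = 1.1,  D = 0.6,  E = 0.5)

    # Current method: rank directly on the point estimates (rank 1 = highest poverty)
    rank_point <- rank(-est)

    # Alternative: shrink toward an assumed prior mean from a previous year's data,
    # using the standard normal-normal conjugate posterior mean
    prior_mean <- c(A = 12.5, B = 13.8, C = 12.0, D = 14.0, E = 13.2)
    prior_sd   <- 1.0  # assumed common prior standard deviation
    post_mean  <- (est / se^2 + prior_mean / prior_sd^2) / (1 / se^2 + 1 / prior_sd^2)
    rank_bayes <- rank(-post_mean)

    cbind(est, rank_point, post_mean = round(post_mean, 2), rank_bayes)

States with noisier estimates are pulled more strongly toward the prior, so the two rankings can disagree even when the point estimates are close.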
These results were obtained as part of the REU Site: Interdisciplinary Program in High Performance Computing (hpcreu.umbc.edu) in the Department of Mathematics and Statistics at the University of Maryland, Baltimore County (UMBC) in Summer 2016. This program is funded by the National Science Foundation (NSF), the National Security Agency (NSA), and the Department of Defense (DOD), with additional support from UMBC, the Department of Mathematics and Statistics, the Center for Interdisciplinary Research and Consulting (CIRC), and the UMBC High Performance Computing Facility (HPCF). HPCF is supported by the U.S. National Science Foundation through the MRI program (grant nos. CNS-0821258 and CNS-1228778) and the SCREMS program (grant no. DMS-0821311), with additional substantial support from UMBC. Co-author Danielle Sykes was supported, in part, by the UMBC National Security Agency (NSA) Scholars Program through a contract with the NSA. Graduate assistants Sai K. Popuri and Nadeesri Wij...
Prediction of precipitation using simulations of various climate variables provided by Global Climate Models (GCMs) as covariates is often required for regional hydrological assessment studies. In this paper, we use a sufficient dimension reduction method to analyze monthly precipitation data over the Missouri River Basin (MRB). At each location, effective reduced sets of monthly historical simulated data from a neighborhood provided by MIROC5, a Global Climate Model, are first obtained via a semi-continuous adaptation of Sliced Inverse Regression, a sufficient dimension reduction approach. These reduced sets are used subsequently in a modified Nadaraya-Watson method for prediction. We implement the method on a computing cluster and demonstrate that it is scalable. We observe a significant speedup in the runtime when implemented in parallel. This is an attractive alternative to the traditional spatio-temporal analysis of the entire region given the large number of locations and t...
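A minimal sketch of the two-stage idea on synthetic data follows. It uses the standard sliced inverse regression from the dr package rather than the paper's semi-continuous adaptation, and all data are simulated.

    library(dr)  # provides sliced inverse regression via dr()

    set.seed(1)
    n <- 500
    X <- matrix(rnorm(n * 10), n, 10)           # stand-in for neighborhood GCM covariates
    y <- exp(X[, 1] + 0.5 * X[, 2]) + rnorm(n)  # response driven by one linear direction

    # Stage 1: sufficient dimension reduction by sliced inverse regression
    fit  <- dr(y ~ X, method = "sir", nslices = 10)
    dir1 <- drop(X %*% fit$evectors[, 1])       # first estimated SIR direction

    # Stage 2: Nadaraya-Watson kernel regression of y on the reduced predictor
    nw <- ksmooth(dir1, y, kernel = "normal", bandwidth = 0.5)
    plot(dir1, y); lines(nw, col = "red", lwd = 2)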
Programming with big data in R (pbdR), a package used to implement high-performance computing in the statistical software R, uses block cyclic distribution to organize large data across many processes. Because computations performed on large matrices are often not associative, a systematic approach must be used during parallelization to divide the matrix correctly. The block cyclic distribution method stresses a balanced load across processes by allocating sections of data to corresponding nodes. This method yields well-divided data on which each process computes individually, so that a final result is calculated more efficiently. A nontrivial problem occurs when using block cyclic distribution: which combinations of block sizes and grid layouts are most effective? These two factors greatly influence computational efficiency, and therefore it is crucial to study and understand their relationship. To analyze the effects of block size and processor grid layout, we carry out a perform...
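To make the block-cyclic layout concrete, here is a small helper, written as a plain R sketch rather than a pbdR API call, that computes which process in an nprow x npcol grid owns a given global matrix entry under block size br x bc.

    # Sketch of the 2D block-cyclic owner computation (not a pbdR function).
    # Global entry (i, j), 1-indexed, falls in block (ceiling(i/br), ceiling(j/bc));
    # blocks are dealt out cyclically over the process grid.
    block_cyclic_owner <- function(i, j, br, bc, nprow, npcol) {
      prow <- (ceiling(i / br) - 1) %% nprow  # grid row owning this block row
      pcol <- (ceiling(j / bc) - 1) %% npcol  # grid column owning this block column
      c(prow = prow, pcol = pcol)
    }

    # Example: entry (9, 5) of a matrix with 4 x 4 blocks on a 2 x 2 grid
    block_cyclic_owner(9, 5, br = 4, bc = 4, nprow = 2, npcol = 2)  # process (0, 1)

Small blocks cycle the data more finely, giving better balance at the cost of more bookkeeping; large blocks do the opposite, which is why block size and grid shape interact with performance.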
Lameness remains a significant cause of production losses and a growing welfare concern, and may be a greater economic burden than clinical mastitis. The need for accurate, continuous automated detection systems continues to grow because US prevalence of lameness is 12.5%, while individual herds may experience prevalences of 27.8-50.8%. To that end, the first force-plate system restricted to the vertical dimension identified lame cows with 85% specificity and 52% sensitivity. These results led to the hypothesis that the addition of transverse and longitudinal dimensions could improve the sensitivity of lameness detection. To address the hypothesis, we upgraded the original force-plate system to measure ground reaction forces (GRFs) across three directions. GRFs and locomotion scores were generated from randomly selected cows, and logistic regression was used to develop a model that characterised relationships of locomotion scores to the GRFs. This preliminary study showed 76 variables across ...
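The modelling step described above can be sketched in R as a logistic regression of a binary lameness indicator on GRF summaries. The variable names and all numbers below are invented for illustration; they are not the study's data or its candidate variables.

    set.seed(1)
    n <- 200
    grf <- data.frame(
      vert_peak  = rnorm(n, 6000, 500),  # vertical peak force (N), hypothetical
      trans_peak = rnorm(n, 300, 60),    # transverse peak force (N), hypothetical
      long_peak  = rnorm(n, 400, 80)     # longitudinal peak force (N), hypothetical
    )
    grf$lame <- rbinom(n, 1, plogis(-2 + 0.004 * (6000 - grf$vert_peak)))

    fit <- glm(lame ~ vert_peak + trans_peak + long_peak, data = grf, family = binomial)
    summary(fit)
    head(predict(fit, type = "response"))  # estimated probabilities of lameness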
Machine Learning and Data Mining Approaches to Climate Science, 2015
We consider the problem of improving the quality of downscaled daily precipitation data over the Missouri River Basin (MRB) at the resolution of the observed data, which are based on surface observations. We use the observed precipitation as the response variable and simulated historical data provided by MIROC5 (Model of Interdisciplinary Research on Climate) as the independent variable to evaluate the use of a standard Tobit model in relation to simple linear regression. Although the Tobit approach is able to incorporate the zeros into the downscaling process and produce zero predictions with more accuracy, it is not as successful in predicting the magnitude of the positive precipitation due to its heavy model dependency. The paper also lays the groundwork for a more extensive spatiotemporal modeling approach to be pursued in the future.
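A sketch of the comparison on synthetic left-censored data follows; it uses tobit() from the AER package as one standard Tobit implementation, which may differ in detail from the paper's setup.

    library(AER)  # provides tobit(), a wrapper for censored Gaussian regression

    set.seed(1)
    n <- 365
    sim    <- rnorm(n, 2, 1)                 # stand-in for MIROC5 simulated values
    latent <- -1 + 1.2 * sim + rnorm(n)      # latent precipitation scale
    obs    <- pmax(latent, 0)                # observed precipitation; zeros retained

    fit_lm    <- lm(obs ~ sim)               # ignores the point mass at zero
    fit_tobit <- tobit(obs ~ sim, left = 0)  # models the censoring at zero explicitly

    coef(fit_lm)
    coef(fit_tobit)
    # Note: predicting positive precipitation from the Tobit model requires the
    # censored-scale expectation, not the latent linear predictor; that model
    # dependency underlies the limitation noted above.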
In a previous paper, Neerchal, Lacayo and Nussbaum (2007) explored the well-known problem of finding the optimal sample size for obtaining a confidence interval of a pre-assigned precision (or length) for the proportion parameter of a finite or infinite binary population. We illustrated some special problems that arise due to the discreteness of the population distribution and to measuring precision by the length of the interval rather than by the variance. Specifically, the confidence level of an interval of fixed length does not necessarily increase as the sample size increases. However, when such confidence levels are computed using normal approximations, we see a monotonic behavior. In this paper, we consider the corresponding problem under the Poisson approximation and show that for this distribution monotonicity does not hold and one should beware of this seeming peculiarity in recommending sample sizes for studies involving estimation of means or proport...
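The non-monotonicity under the Poisson model can be verified directly. For X ~ Poisson(n*lambda), the interval X/n +/- d covers lambda exactly when n*(lambda - d) <= X <= n*(lambda + d), so the exact confidence level is a Poisson probability; the sketch below plots it against n (lambda and d are arbitrary illustrative values).

    # Exact confidence level of the fixed-half-width interval X/n +/- d
    coverage <- function(n, lambda, d) {
      lo <- ceiling(n * (lambda - d))
      hi <- floor(n * (lambda + d))
      ppois(hi, n * lambda) - ppois(max(lo, 0) - 1, n * lambda)
    }

    lambda <- 1; d <- 0.2
    cov <- sapply(10:60, coverage, lambda = lambda, d = d)
    plot(10:60, cov, type = "b", xlab = "n", ylab = "exact confidence level")
    # The curve oscillates as it rises: increasing n does not always raise coverage.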
The beta-binomial distribution introduced by Skellam has been applied in many teratology problems for modelling the litter effect. Recently, Morel and Nagaraj proposed a new distribution for modelling clustered multinomial data when the clustering is believed to be caused by clumped sampling. It turns out that the distribution is a mixture of two binomial distributions and accommodates the estimation of an additional parameter to account for the intra-litter effect. The new distribution arises from a cluster mechanism in which some individuals within a cluster exhibit the same behaviour while the remaining individuals from the cluster react independently of each other. Such a mechanism is a natural model in teratology problems, where typically a genetic trait is passed with a certain probability to the foetuses of the same litter. In this article, we use the new distribution to model binary responses with logistic regression. We analyse data from a teratology experiment to demonstrate that the new model provides a useful addition to current methodology. The experiment investigates the synergistic effect of the anticonvulsant phenytoin and trichloropropene oxide on the prenatal development of inbred mice. In a simulation study we investigate the type I error rate and the power of the maximum likelihood ratio test when the data follow a finite mixture distribution.
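Since the key structural fact is that the new distribution is a two-component binomial mixture, a generic sketch of such a mixture (not the exact Morel-Nagaraj parameterisation) shows how it generates intra-litter correlation and overdispersion.

    # Generic two-binomial mixture: with probability rho a cluster follows the
    # 'clumped' success probability p1, otherwise the background probability p2.
    dbinmix <- function(y, n, p1, p2, rho) {
      rho * dbinom(y, n, p1) + (1 - rho) * dbinom(y, n, p2)
    }
    rbinmix <- function(m, n, p1, p2, rho) {
      comp <- rbinom(m, 1, rho)           # component indicator for each cluster
      rbinom(m, n, ifelse(comp == 1, p1, p2))
    }

    set.seed(1)
    y <- rbinmix(1000, n = 10, p1 = 0.9, p2 = 0.3, rho = 0.2)
    var(y)             # empirical variance of litter counts
    10 * 0.42 * 0.58   # plain binomial variance at the same overall mean (p = 0.42)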
Journal of the American Statistical Association, 1998
Two parametric extra variation models are considered. Approximate closed-form expressions are given for the Fisher information matrices. The expressions are useful in computing maximum likelihood estimates and obtaining large cluster efficiencies. A simulation study shows that the approximations perform very well even in clusters of moderate size. The models are applied in illustrative examples. A goodness-of-fit test is developed that
Journal of Statistical Planning and Inference, 2008
Overdispersion or extra variation is a common phenomenon that occurs when binomial (multinomial) data exhibit larger variances than permitted by the binomial (multinomial) model. This arises when the data are clustered or when the assumption of independence is violated. Goodness-of-fit (GOF) tests available in the overdispersion literature have focused on testing for the presence of overdispersion in the data, and hence they are not applicable for choosing between several competing overdispersion models. In this paper, we consider a GOF test proposed by Neerchal and Morel [1998. Large cluster results for two parametric multinomial extra variation models. J. Amer. Statist. Assoc. 93(443), 1078–1087], and study its distributional properties and performance characteristics. This statistic is a direct analogue of the usual Pearson chi-squared statistic, but is also applicable when the clusters are not necessarily of the same size. As this test statistic is for testing model adequacy against the alternative that the model is not adequate, it is applicable in testing two competing overdispersion models.
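A generic Pearson-type comparison for equal-sized clusters can be sketched as below; this is an analogue for intuition only, not the exact statistic of Neerchal and Morel (1998), which also handles unequal cluster sizes.

    # Observed frequencies of y = 0, ..., n across clusters versus the expected
    # frequencies under a fitted model pmf, combined Pearson-style.
    pearson_gof <- function(y, n, pmf) {
      obs   <- tabulate(y + 1, nbins = n + 1)  # counts of y = 0, 1, ..., n
      expct <- length(y) * pmf(0:n)            # model-based expected counts
      sum((obs - expct)^2 / expct)
    }

    set.seed(1)
    y <- rbinom(500, 10, 0.3)
    pearson_gof(y, 10, function(k) dbinom(k, 10, 0.3))  # small value: adequate fit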
The objective of the study was to evaluate the relationship of veterinary clinical assessments of lameness to probability estimates of lameness predicted from vertical kinetic measures. We hypothesized that algorithm-derived probability estimates of lameness would accurately reflect vertical measures in lame limbs even though vertical changes may not occur in all forms of lameness. Kinetic data were collected from sound (n=179) and unilaterally lame (n=167) dairy cattle with a 1-dimensional, parallel force-plate system that registered vertical ground reaction force signatures of all four limbs as cows freely exited the milking parlour. Locomotion was scored for each hind limb using a 1–5 locomotion score system (1=sound, 5=severely lame). Pain response in the interdigital space was quantified with an algometer, and pain response in the claw was quantified with a hoof tester fitted with a pressure gauge. Lesions were assigned severity scores (1=minimal pathology to 5=severe pathology...
Freshwater biological monitoring and assessment programs using biological indicators of ecological integrity (biocriteria) are integral to successful water resources planning and decision making. In the United States, the Clean Water Act requires every state to evaluate whether or not the designated aquatic life use (defined in its water quality standards) is being attained for each river and stream, and to submit a biennial list of impaired waters for US Environmental Protection Agency (EPA) approval. Economic constraints (personnel, equipment, transportation, laboratory, and data management costs) on water quality monitoring present a considerable challenge to states in reaching this goal. Many US states are using biocriteria-based statewide monitoring of the condition of surface waters to meet this challenge. The sampling effort of state monitoring programs, however, is often not sufficient to provide reliable estimates of stream condition for individual watersheds. Fortunately, other organizations such as county and municipal governments, regional water management authorities, and volunteer watershed groups are also gathering valuable stream monitoring data. When more than one monitoring program is conducted in a local watershed, it is desirable to integrate the estimates to (1) provide consistent estimates of stream condition, (2) increase the effective sample size and hence the precision of the estimates, and (3) improve the spatial coverage of the stream network. In this paper, we show how a composite estimator can be used to combine the results of more than one probability-based survey to estimate mean condition in watersheds. As an example, we estimate the mean Fish Index of Biotic Integrity (IBI) for the Seneca Creek watershed in the State of Maryland, combining estimates from the Maryland Biological Stream Survey (MBSS) and the Montgomery County Stream Monitoring Program. Separate estimates are provided for the network of streams that are in the sample frame of both surveys and for the expanded stream network covered only by the Montgomery County survey. The composite estimate of mean Fish IBI for the streams common to both programs has a lower standard error than the MBSS estimate and yields a consistent estimate of stream condition that can be used to evaluate the Seneca Creek watershed for inclusion in the State's priority list of impaired waters. For the expanded stream network, the composite estimation also increased the precision of the mean Fish IBI relative to the Montgomery County estimate. The combination of surveys across space results in increased precision because of increased sample sizes and the application of weights that minimize the variance, and hence can provide more definitive classification of water bodies and reduce the need for follow-up monitoring. Integration of survey estimates can improve communication with the public and ultimately lead to more reliable water resource management decisions. State, provincial, and regional authorities in the US and other countries can use the composite indicator estimation technique presented here to derive more information from their limited monitoring resources and make better water resource protection decisions.
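The composite estimator at the heart of the paper is the familiar inverse-variance-weighted combination of two independent survey estimates. A sketch with invented numbers (not the MBSS or Montgomery County results):

    # Variance-minimizing composite of two independent estimates of the same mean
    composite <- function(m1, se1, m2, se2) {
      w   <- (1 / se1^2) / (1 / se1^2 + 1 / se2^2)  # inverse-variance weight
      est <- w * m1 + (1 - w) * m2
      se  <- sqrt(1 / (1 / se1^2 + 1 / se2^2))      # never exceeds min(se1, se2)
      c(estimate = est, se = se)
    }

    composite(m1 = 3.4, se1 = 0.30, m2 = 3.1, se2 = 0.45)  # hypothetical IBI means

Because the composite standard error is always at most the smaller of the two input standard errors, pooling the surveys can only sharpen the watershed estimate.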