-
The 2010 Census Confidentiality Protections Failed, Here's How and Why
Authors:
John M. Abowd,
Tamara Adams,
Robert Ashmead,
David Darais,
Sourya Dey,
Simson L. Garfinkel,
Nathan Goldschlag,
Daniel Kifer,
Philip Leclerc,
Ethan Lew,
Scott Moore,
Rolando A. Rodríguez,
Ramy N. Tadros,
Lars Vilhuber
Abstract:
Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. The tabular publications in Summary File 1 thus have prohibited disclosure risk similar to the unreleased confidential microdata. Reidentification studies confirm that an attacker can, within blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with nonmodal characteristics) with 95% accuracy, the same precision as the confidential data achieve and far greater than statistical baselines. The flaw in the 2010 Census framework was the assumption that aggregation prevented accurate microdata reconstruction, justifying weaker disclosure limitation methods than were applied to 2010 Census public microdata. The framework used for 2020 Census publications defends against attacks that are based on reconstruction, as we also demonstrate here. Finally, we show that alternatives to the 2020 Census Disclosure Avoidance System with similar accuracy (enhanced swapping) also fail to protect confidentiality, and those that partially defend against reconstruction attacks (incomplete suppression implementations) destroy the primary statutory use case: data for redistricting all legislatures in the country in compliance with the 1965 Voting Rights Act.
Submitted 18 December, 2023;
originally announced December 2023.
-
Variable Selection for Kernel Two-Sample Tests
Authors:
Jie Wang,
Santanu S. Dey,
Yao Xie
Abstract:
We consider the variable selection problem for two-sample tests, aiming to select the most informative variables to distinguish samples from two groups. To solve this problem, we propose a framework based on the kernel maximum mean discrepancy (MMD). Our approach seeks a group of variables with a pre-specified size that maximizes the variance-regularized MMD statistics. This formulation also corresponds to the minimization of asymptotic type-II error while controlling type-I error, as studied in the literature. We present mixed-integer programming formulations and develop exact and approximation algorithms with performance guarantees for different choices of kernel functions. Furthermore, we provide a statistical testing power analysis of our proposed framework. Experiment results on synthetic and real datasets demonstrate the superior performance of our approach.
Submitted 12 October, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
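As a rough illustration of the quantity this abstract builds on, the sketch below computes a plain (biased, V-statistic) estimate of the squared kernel MMD restricted to a chosen subset of variables. It is not the authors' method: the variance regularization and the mixed-integer selection step are omitted, and the Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions made for this example.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, variables, bandwidth=1.0):
    """Biased estimate of squared MMD using only the selected variable indices."""
    px = [[row[j] for j in variables] for row in xs]
    py = [[row[j] for j in variables] for row in ys]

    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(px, px) + mean_kernel(py, py) - 2.0 * mean_kernel(px, py)


# Hypothetical data: the two groups differ only in variable 0 (a mean shift).
rng = random.Random(0)
xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
ys = [[rng.gauss(2, 1), rng.gauss(0, 1)] for _ in range(100)]

mmd_informative = mmd2_biased(xs, ys, [0])  # variable that separates the groups
mmd_noise = mmd2_biased(xs, ys, [1])        # variable with identical distributions
```

Scoring every candidate subset this way and keeping the largest value is the brute-force analogue of the selection problem; the abstract's contribution is solving that search exactly or approximately with guarantees.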
-
Modelling spatially autocorrelated detection probabilities in spatial capture-recapture using random effects
Authors:
Soumen Dey,
Ehsan M. Moqanaki,
Cyril Milleret,
Pierre Dupont,
Mahdieh Tourani,
Richard Bischof
Abstract:
Spatial capture-recapture (SCR) models are now widely used for estimating density from repeated individual spatial encounters. SCR accounts for the inherent spatial autocorrelation in individual detections by modelling detection probabilities as a function of distance between the detectors and individual activity centres. However, additional spatial heterogeneity in detection probability may still creep in due to environmental or sampling characteristics. If unaccounted for, such variation can lead to pronounced bias in population size estimates. Using simulations, we describe and test three Bayesian SCR models that use generalized linear mixed models (GLMM) to account for latent heterogeneity in baseline detection probability across detectors using: independent random effects (RE), spatially autocorrelated random effects (SARE), and a two-group finite mixture model (FM). Overall, SARE provided the least biased population size estimates (median RB: -9% to 6%). When spatial autocorrelation was high, SARE also performed best at predicting the spatial pattern of heterogeneity in detection probability. At intermediate levels of autocorrelation, spatially-explicit estimates of detection probability obtained with FM were more accurate than those generated by SARE and RE. In cases where the number of detections per detector is realistically low (at most 1), all GLMMs considered here may require dimension reduction of the random effects by pooling baseline detection probability parameters across neighboring detectors ("aggregation") to avoid over-parameterization. The added complexity and computational overhead associated with SCR-GLMMs may only be justified in extreme cases of spatial heterogeneity. However, even in less extreme cases, detecting and estimating spatially heterogeneous detection probability may assist in planning or adjusting monitoring schemes.
Submitted 12 May, 2022;
originally announced May 2022.
-
Fuzzy Forests For Feature Selection in High-Dimensional Survey Data: An Application to the 2020 U.S. Presidential Election
Authors:
Sreemanti Dey,
R. Michael Alvarez
Abstract:
An increasingly common methodological issue in the field of social science is high-dimensional and highly correlated datasets that are unamenable to the traditional deductive framework of study. Analysis of candidate choice in the 2020 Presidential Election is one area in which this issue presents itself: in order to test the many theories explaining the outcome of the election, it is necessary to use data such as the 2020 Cooperative Election Study Common Content, with hundreds of highly correlated features. We present the Fuzzy Forests algorithm, a variant of the popular Random Forests ensemble method, as an efficient way to reduce the feature space in such cases with minimal bias, while also maintaining predictive performance on par with common algorithms like Random Forests and logit. Using Fuzzy Forests, we isolate the top correlates of candidate choice and find that partisan polarization was the strongest factor driving the 2020 presidential election.
Submitted 5 March, 2022;
originally announced March 2022.
-
Estimating Software Reliability Using Size-biased Modelling
Authors:
Soumen Dey,
Ashis Kumar Chakraborty
Abstract:
Software reliability estimation is one of the most active areas of research in software testing. Since time between failures (TBF) has often been challenging to record, software testing data are commonly recorded test-case-wise in a discrete setup. We have developed a Bayesian generalised linear mixed model (GLMM) based on software testing detection data and a size-biased strategy which not only estimates the software reliability, but also estimates the total number of bugs present in the software. Our approach provides a flexible, unified modelling framework and can be adapted to various real-life situations. We have assessed the performance of our model via a simulation study and found that each of the key parameters could be estimated with a satisfactory level of accuracy. We have also applied our model to two empirical software testing data sets. While there can be other fields of study for application of our model (e.g., hydrocarbon exploration), we anticipate that our novel modelling approach to estimate software reliability could be very useful for the users and can potentially be a key tool in the field of software reliability estimation.
Submitted 20 April, 2022; v1 submitted 16 February, 2022;
originally announced February 2022.
-
Learning a Shared Model for Motorized Prosthetic Joints to Predict Ankle-Joint Motion
Authors:
Sharmita Dey,
Sabri Boughorbel,
Arndt F. Schilling
Abstract:
Control strategies for active prostheses or orthoses use sensor inputs to recognize the user's locomotive intention and generate corresponding control commands for producing the desired locomotion. In this paper, we propose a learning-based shared model for predicting ankle-joint motion for different locomotion modes like level-ground walking, stair ascent, stair descent, slope ascent, and slope descent without the need to classify between them. Features extracted from hip and knee joint angular motion are used to continuously predict the ankle angles and moments using a Feed-Forward Neural Network-based shared model. We show that the shared model is adequate for predicting the ankle angles and moments for different locomotion modes without explicitly classifying between the modes. The proposed strategy shows the potential for devising a high-level controller for an intelligent prosthetic ankle that can adapt to different locomotion modes.
Submitted 14 November, 2021;
originally announced November 2021.
-
Local Convolutions Cause an Implicit Bias towards High Frequency Adversarial Examples
Authors:
Josue Ortega Caro,
Yilong Ju,
Ryan Pyle,
Sourav Dey,
Wieland Brendel,
Fabio Anselmi,
Ankit Patel
Abstract:
Adversarial Attacks are still a significant challenge for neural networks. Recent work has shown that adversarial perturbations typically contain high-frequency features, but the root cause of this phenomenon remains unknown. Inspired by theoretical work on linear full-width convolutional models, we hypothesize that the local (i.e. bounded-width) convolutional operations commonly used in current neural networks are implicitly biased to learn high frequency features, and that this is one of the root causes of high frequency adversarial examples. To test this hypothesis, we analyzed the impact of different choices of linear and nonlinear architectures on the implicit bias of the learned features and the adversarial perturbations, in both spatial and frequency domains. We find that the high-frequency adversarial perturbations are critically dependent on the convolution operation because the spatially-limited nature of local convolutions induces an implicit bias towards high frequency features. The explanation for the latter involves the Fourier Uncertainty Principle: a spatially-limited (local in the space domain) filter cannot also be frequency-limited (local in the frequency domain). Furthermore, using larger convolution kernel sizes or avoiding convolutions (e.g. by using Vision Transformers architecture) significantly reduces this high frequency bias, but not the overall susceptibility to attacks. Looking forward, our work strongly suggests that understanding and controlling the implicit bias of architectures will be essential for achieving adversarial robustness.
Submitted 8 March, 2023; v1 submitted 19 June, 2020;
originally announced June 2020.
-
Deep-n-Cheap: An Automated Search Framework for Low Complexity Deep Learning
Authors:
Sourya Dey,
Saikrishna C. Kanala,
Keith M. Chugg,
Peter A. Beerel
Abstract:
We present Deep-n-Cheap -- an open-source AutoML framework to search for deep learning models. This search includes both architecture and training hyperparameters, and supports convolutional neural networks and multi-layer perceptrons. Our framework is targeted for deployment on both benchmark and custom datasets, and as a result, offers a greater degree of search space customizability as compared to a more limited search over only pre-existing models from literature. We also introduce the technique of 'search transfer', which demonstrates the generalization capabilities of the models found by our framework to multiple datasets.
Deep-n-Cheap includes a user-customizable complexity penalty which trades off performance with training time or number of parameters. Specifically, our framework results in models offering performance comparable to state-of-the-art while taking 1-2 orders of magnitude less time to train than models from other AutoML and model search frameworks. Additionally, this work investigates and develops various insights regarding the search process. In particular, we show the superiority of a greedy strategy and justify our choice of Bayesian optimization as the primary search methodology over random / grid search.
Submitted 5 September, 2020; v1 submitted 27 March, 2020;
originally announced April 2020.
-
Drift-Adjusted And Arbitrated Ensemble Framework For Time Series Forecasting
Authors:
Anirban Chatterjee,
Subhadip Paul,
Uddipto Dutta,
Smaranya Dey
Abstract:
Time series forecasting is at the core of many practical applications, such as sales forecasting for business and rainfall forecasting for agriculture. Though this problem has been extensively studied for years, it is still considered challenging due to the complex and evolving nature of time series data. Typical methods proposed for time series forecasting model linear or non-linear dependencies between data observations. However, it is generally accepted that no one method is universally effective for all kinds of time series data. Attempts have been made to use dynamic, weighted combinations of heterogeneous and independent forecasting models, and this has been found to be a promising direction for tackling the problem. This method is based on the assumption that different forecasters have different specializations and varying performance for different distributions of data, and weights are dynamically assigned to multiple forecasters accordingly. However, in many practical time series datasets, the distribution of data slowly evolves with time. We propose to employ a re-weighting based method to adjust the weights assigned to the various forecasters in order to account for such distribution drift. Exhaustive testing was performed against both real-world and synthesized time series. Experimental results show the competitiveness of the method in comparison to state-of-the-art approaches for combining forecasters and handling drift.
Submitted 16 March, 2020;
originally announced March 2020.
-
Asymptotic Performance Analysis of Non-Bayesian Quickest Change Detection with an Energy Harvesting Sensor
Authors:
Subhrakanti Dey
Abstract:
In this paper, we consider a non-Bayesian sequential change detection based on the Cumulative Sum (CUSUM) algorithm employed by an energy harvesting sensor where the distributions before and after the change are assumed to be known. In a slotted discrete-time model, the sensor, exclusively powered by randomly available harvested energy, obtains a sample and computes the log-likelihood ratio of the two distributions if it has enough energy to sense and process a sample. If it does not have enough energy in a given slot, it waits until it harvests enough energy to perform the task in a future time slot. We derive asymptotic expressions for the expected detection delay (when a change actually occurs), and the asymptotic tail distribution of the run-length to a false alarm (when a change never happens). We show that when the average harvested energy ($\bar H$) is greater than or equal to the energy required to sense and process a sample ($E_s$), standard existing asymptotic results for the CUSUM test apply since the energy storage level at the sensor is greater than $E_s$ after a sufficiently long time. However, when the $\bar H < E_s$, the energy storage level can be modelled by a positive Harris recurrent Markov chain with a unique stationary distribution. Using asymptotic results from Markov random walk theory and associated nonlinear Markov renewal theory, we establish asymptotic expressions for the expected detection delay and asymptotic exponentiality of the tail distribution of the run-length to a false alarm in this non-trivial case. Numerical results are provided to support the theoretical results.
Submitted 14 January, 2020;
originally announced January 2020.
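The slotted sensing rule described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's model: the Gaussian pre- and post-change distributions, the unbounded battery, the per-slot harvest sequence, and the function name `cusum_energy_harvesting` are all hypothetical choices made for the example.

```python
def cusum_energy_harvesting(samples, harvested, e_s, threshold,
                            mu0=0.0, mu1=1.0, sigma=1.0):
    """Run CUSUM, sensing only in slots where the stored energy covers e_s.

    Returns the slot index of the first alarm, or None if no alarm is raised.
    Assumes Gaussian observations with known means mu0 (pre-change) and
    mu1 (post-change) and common standard deviation sigma.
    """
    battery = 0.0  # assumed unbounded energy storage
    stat = 0.0     # CUSUM statistic
    for t, (x, h) in enumerate(zip(samples, harvested)):
        battery += h              # energy harvested in this slot
        if battery < e_s:
            continue              # not enough energy: wait for a later slot
        battery -= e_s            # spend energy to sense and process a sample
        # Log-likelihood ratio of post-change vs pre-change density at x.
        llr = ((x - mu0) ** 2 - (x - mu1) ** 2) / (2.0 * sigma ** 2)
        stat = max(0.0, stat + llr)
        if stat >= threshold:
            return t
    return None


# Hypothetical runs: every observation equals the post-change mean.
post_change = [1.0] * 20
alarm_scarce = cusum_energy_harvesting(post_change, [0.5] * 20, e_s=1.0, threshold=2.0)
alarm_plenty = cusum_energy_harvesting(post_change, [1.0] * 20, e_s=1.0, threshold=2.0)
```

With half the required energy arriving per slot, the sensor samples only every second slot, so the alarm is raised later than when each slot harvests enough energy to sense. This is the detection-delay effect that the abstract analyses for the case $\bar H < E_s$.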
-
Attentive Modality Hopping Mechanism for Speech Emotion Recognition
Authors:
Seunghyun Yoon,
Subhadeep Dey,
Hwanhee Lee,
Kyomin Jung
Abstract:
In this work, we explore the impact of the visual modality, in addition to speech and text, on the accuracy of the emotion detection system. Traditional approaches tackle this task by fusing the knowledge from the various modalities independently to perform emotion classification. In contrast to these approaches, we tackle the problem by introducing an attention mechanism to combine the information. In this regard, we first apply a neural network to obtain hidden representations of the modalities. Then, the attention mechanism is defined to select and aggregate important parts of the video data by conditioning on the audio and text data. Furthermore, the attention mechanism is again applied to attend to important parts of the speech and textual data by considering the other modalities. Experiments are performed on the standard IEMOCAP dataset using all three modalities (audio, text, and video). The achieved results show a significant improvement of 3.65% in terms of weighted accuracy compared to the baseline system.
Submitted 22 April, 2020; v1 submitted 29 November, 2019;
originally announced December 2019.
-
Order Matters at Fanatics Recommending Sequentially Ordered Products by LSTM Embedded with Word2Vec
Authors:
Jing Pan,
Weian Sheng,
Santanu Dey
Abstract:
A unique challenge for e-commerce recommendation is that customers are often interested in products that are more advanced than their already purchased products, but not the reverse. The few existing recommender systems modeling unidirectional sequences output a limited number of categories or continuous variables. To model the ordered sequence, we design the first recommendation system that both embeds purchased items with Word2Vec and models the sequence with a stateless LSTM RNN. The click-through rate of this recommender system in production outperforms its solely Word2Vec-based predecessor. Developed in 2017, it was perhaps the first published real-world application that makes distributed predictions of a single-machine-trained Keras model on Spark slave nodes at a scale of more than 0.4 million columns per row.
Submitted 21 November, 2019;
originally announced November 2019.
-
Exploiting semi-supervised training through a dropout regularization in end-to-end speech recognition
Authors:
Subhadeep Dey,
Petr Motlicek,
Trung Bui,
Franck Dernoncourt
Abstract:
In this paper, we explore various approaches for semi-supervised learning in an end-to-end automatic speech recognition (ASR) framework. The first step in our approach involves training a seed model on the limited amount of labelled data. Additional unlabelled speech data is employed through a data selection mechanism to obtain the best hypothesized output, further used to retrain the seed model. However, uncertainties of the model may not be well captured with a single hypothesis. As opposed to this technique, we apply a dropout mechanism to capture the uncertainty by obtaining multiple hypothesized text transcripts of a speech recording. We assume that the diversity of automatically generated transcripts for an utterance will implicitly increase the reliability of the model. Finally, the data selection process is also applied on these hypothesized transcripts to reduce the uncertainty. Experiments on the freely available TEDLIUM corpus and Adobe's proprietary internal dataset show that the proposed approach significantly reduces ASR errors compared to the baseline model.
Submitted 8 August, 2019;
originally announced August 2019.
-
Deep Residual Autoencoders for Expectation Maximization-inspired Dictionary Learning
Authors:
Bahareh Tolooshams,
Sourav Dey,
Demba Ba
Abstract:
We introduce a neural-network architecture, termed the constrained recurrent sparse autoencoder (CRsAE), that solves convolutional dictionary learning problems, thus establishing a link between dictionary learning and neural networks. Specifically, we leverage the interpretation of the alternating-minimization algorithm for dictionary learning as an approximate Expectation-Maximization algorithm to develop autoencoders that enable the simultaneous training of the dictionary and regularization parameter (ReLU bias). The forward pass of the encoder approximates the sufficient statistics of the E-step as the solution to a sparse coding problem, using an iterative proximal gradient algorithm called FISTA. The encoder can be interpreted either as a recurrent neural network or as a deep residual network, with two-sided ReLU non-linearities in both cases. The M-step is implemented via a two-stage back-propagation. The first stage relies on a linear decoder applied to the encoder and a norm-squared loss. It parallels the dictionary update step in dictionary learning. The second stage updates the regularization parameter by applying a loss function to the encoder that includes a prior on the parameter motivated by Bayesian statistics. We demonstrate in an image-denoising task that CRsAE learns Gabor-like filters, and that the EM-inspired approach for learning biases is superior to the conventional approach. In an application to recordings of electrical activity from the brain, we demonstrate that CRsAE learns realistic spike templates and speeds up the process of identifying spike times by 900x compared to algorithms based on convex optimization.
Submitted 18 October, 2020; v1 submitted 18 April, 2019;
originally announced April 2019.
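The sparse-coding step this abstract attributes to FISTA can be sketched generically as below. This is textbook FISTA for the problem of minimizing 0.5 * ||y - D x||^2 + lam * ||x||_1 with a small dense dictionary, written in plain Python. It is not the CRsAE encoder itself: the paper works with convolutional dictionaries and learns the dictionary and bias, and the names `fista`, `soft_threshold`, `lam`, and `step` are assumptions for this example.

```python
import math


def soft_threshold(v, tau):
    """Elementwise soft-thresholding; equivalent to a two-sided ReLU with bias tau."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def fista(d, y, lam, step, iters=200):
    """Approximately solve min_x 0.5*||y - d x||^2 + lam*||x||_1.

    `step` should be at most 1/L, where L is the largest eigenvalue of d^T d.
    """
    dt = transpose(d)
    x = [0.0] * len(d[0])
    z = x[:]   # extrapolated (momentum) point
    t = 1.0
    for _ in range(iters):
        residual = [a - b for a, b in zip(matvec(d, z), y)]
        grad = matvec(dt, residual)  # gradient of the quadratic term at z
        x_new = soft_threshold(
            [zi - step * gi for zi, gi in zip(z, grad)], step * lam
        )
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = [xn + ((t - 1.0) / t_new) * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x


# Hypothetical check: with an identity dictionary the minimizer is the
# soft-thresholded observation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
code = fista(identity, [3.0, 0.2, -1.5], lam=0.5, step=1.0)
```

Each iteration is a linear map followed by a thresholding nonlinearity, which is why unrolling the loop can be read as a recurrent or deep residual network, as the abstract notes.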
-
Pre-Defined Sparse Neural Networks with Hardware Acceleration
Authors:
Sourya Dey,
Kuan-Wen Huang,
Peter A. Beerel,
Keith M. Chugg
Abstract:
Neural networks have proven to be extremely powerful tools for modern artificial intelligence applications, but computational and storage complexity remain limiting factors. This paper presents two compatible contributions towards reducing the time, energy, computational, and storage complexities associated with multilayer perceptrons. Pre-defined sparsity is proposed to reduce the complexity during both training and inference, regardless of the implementation platform. Our results show that storage and computational complexity can be reduced by factors greater than 5X without significant performance loss. The second contribution is an architecture for hardware acceleration that is compatible with pre-defined sparsity. This architecture supports both training and inference modes and is flexible in the sense that it is not tied to a specific number of neurons. For example, this flexibility implies that various sized neural networks can be supported on various sized Field Programmable Gate Arrays (FPGAs).
Submitted 3 December, 2018;
originally announced December 2018.
-
Bayesian Model Selection for a Class of Spatially-Explicit Capture Recapture Models
Authors:
Soumen Dey,
Mohan Delampady,
Arjun M. Gopalaswamy
Abstract:
A vast amount of ecological knowledge generated recently has hinged upon the ability of model selection methods to discriminate among various ecological hypotheses. The last decade has seen the rise of Bayesian hierarchical models in ecology. Consequently, popular tools, such as the AIC, become largely inapplicable and other tools are not universally applicable. We focus on a class of competing Bayesian spatially explicit capture recapture (SECR) models and first apply some of the recommended Bayesian model selection tools: (1) Bayes Factor - using (a) Gelfand-Dey (b) harmonic mean methods, (2) DIC, (3) WAIC and (4) the posterior predictive loss function. In all, we evaluate 25 variants of model selection tools in our study. We evaluate these model selection tools from the standpoint of model selection and parameter estimation by contrasting the choice recommended by a tool with a 'true' model. In all, we generate 120 simulated data sets using the true model and assess the frequency with which the true model is selected and how well the tool estimates N (population size). We find that when information content is low, no particular tool can be recommended to help realise, simultaneously, both the goals of model selection and parameter estimation. In such scenarios, we recommend that practitioners utilise our application of Bayes Factor for parameter estimation and recommend the posterior predictive loss approach for model selection when information content is low. When both the objectives are taken together, we recommend the use of our applications of Bayes Factor for Bayesian SECR models. Our study reveals that although new model selection tools are emerging (e.g., WAIC) in the applied statistics literature, an uncritical absorption of these new tools (i.e. without assessing their efficacies for the problem at hand) into ecological practice may mislead inferences.
Submitted 4 October, 2018;
originally announced October 2018.
-
Bayesian analysis of absolute continuous Marshall-Olkin bivariate Pareto distribution with location and scale parameters
Authors:
Biplab Paul,
Arabin Kumar Dey,
Sanku Dey
Abstract:
This paper provides two novel slice-sampling approaches for estimating the parameters of the absolutely continuous Marshall-Olkin bivariate Pareto distribution with location and scale parameters. We carry out the Bayesian analysis using gamma priors for the shape and scale parameters and truncated normal priors for the location parameters. Credible intervals and coverage probabilities are also provided for all methods. A real-life data analysis is shown for illustrative purposes.
Submitted 17 September, 2018;
originally announced September 2018.
-
Scalable Convolutional Dictionary Learning with Constrained Recurrent Sparse Auto-encoders
Authors:
Bahareh Tolooshams,
Sourav Dey,
Demba Ba
Abstract:
Given a convolutional dictionary underlying a set of observed signals, can a carefully designed auto-encoder recover the dictionary in the presence of noise? We introduce an auto-encoder architecture, termed constrained recurrent sparse auto-encoder (CRsAE), that answers this question in the affirmative. Given an input signal and an approximate dictionary, the encoder finds a sparse approximation using FISTA. The decoder reconstructs the signal by applying the dictionary to the output of the encoder. The encoder and decoder in CRsAE parallel the sparse-coding and dictionary update steps in optimization-based alternating-minimization schemes for dictionary learning. As such, the parameters of the encoder and decoder are not independent, a constraint which we enforce for the first time. We derive the back-propagation algorithm for CRsAE. CRsAE is a framework for blind source separation that, only knowing the number of sources (dictionary elements), and assuming sparsely-many can overlap, is able to separate them. We demonstrate its utility in the context of spike sorting, a source separation problem in computational neuroscience. We demonstrate the ability of CRsAE to recover the underlying dictionary and characterize its sensitivity as a function of SNR.
Submitted 12 July, 2018;
originally announced July 2018.
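The CRsAE abstract above states that the encoder finds a sparse approximation using FISTA. As a rough illustration of that one step only, here is a minimal pure-Python sketch of FISTA for the generic lasso problem, minimising 0.5*||y - A x||^2 + lam*||x||_1. The function names (`soft_threshold`, `fista`), the dense (non-convolutional) matrix `A`, and the fixed step size are assumptions made for this example; they are not taken from the paper and do not reproduce the CRsAE architecture.

```python
import math


def soft_threshold(v, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def matvec(A, x):
    """Compute A x for a dense matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r for a dense matrix stored as a list of rows."""
    n_cols = len(A[0])
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(n_cols)]


def fista(A, y, lam, step, n_iter=200):
    """Sketch of FISTA for 0.5*||y - A x||^2 + lam*||x||_1.

    `step` is assumed to satisfy step <= 1/L, where L is the largest
    eigenvalue of A^T A (hypothetical caller-supplied value).
    """
    x = [0.0] * len(A[0])
    z = x[:]          # extrapolated point used for the gradient step
    t = 1.0           # momentum parameter
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(A, z), y)]
        grad = rmatvec(A, residual)
        # Gradient step followed by the l1 proximal step.
        x_new = soft_threshold(
            [zi - step * g for zi, g in zip(z, grad)], step * lam
        )
        # Nesterov-style momentum update.
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = [xn + ((t - 1.0) / t_new) * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x


# Toy usage: with an identity dictionary the minimiser is soft_threshold(y, lam).
A_demo = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_demo = [3.0, 0.05, -2.0]
x_demo = fista(A_demo, y_demo, lam=0.1, step=1.0)
```

In the toy case the dictionary is the identity, so the small middle entry of `y_demo` is shrunk to zero and the other two are shrunk towards zero by `lam`, which is the sparsifying behaviour the encoder relies on.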
-
Morse Code Datasets for Machine Learning
Authors:
Sourya Dey,
Keith M. Chugg,
Peter A. Beerel
Abstract:
We present an algorithm to generate synthetic datasets of tunable difficulty on classification of Morse code symbols for supervised machine learning problems, in particular, neural networks. The datasets are spatially one-dimensional and have a small number of input features, leading to high density of input information content. This makes them particularly challenging when implementing network complexity reduction methods. We explore how network performance is affected by deliberately adding various forms of noise and expanding the feature set and dataset size. Finally, we establish several metrics to indicate the difficulty of a dataset, and evaluate their merits. The algorithm and datasets are open-source.
Submitted 30 November, 2018; v1 submitted 11 July, 2018;
originally announced July 2018.
-
Predicting Gross Movie Revenue
Authors:
Sharmistha Dey
Abstract:
'There is no terror in the bang, only in the anticipation of it' - Alfred Hitchcock.
Yet there is everything in correctly anticipating the bang a movie would make at the box office. Movies are a high-profile, billion-dollar industry, and prediction of movie revenue can be very lucrative. Predicted revenues can be used for planning both the production and distribution stages. For example, projected gross revenue can be used to plan the remuneration of the actors and crew members as well as other parts of the budget [1].
Success or failure of a movie can depend on many factors: star power, release date, budget, MPAA (Motion Picture Association of America) rating, plot and the highly unpredictable human reactions. The sheer number of exogenous variables makes the manual revenue prediction process extremely difficult. However, in the era of computer and data sciences, volumes of data can be efficiently processed and modelled. Hence the tough job of predicting the gross revenue of a movie can be simplified with the help of modern computing power and the historical data available in movie databases [2].
Submitted 3 April, 2018;
originally announced April 2018.
-
The Frechet distribution: Estimation and Application an Overview
Authors:
Pedro Luiz Ramos,
Francisco Louzada,
Eduardo Ramos,
Sanku Dey
Abstract:
In this article, we consider the problem of estimating the parameters of the Fréchet distribution from both frequentist and Bayesian points of view. First we briefly describe different frequentist approaches, namely, maximum likelihood, method of moments, percentile estimators, L-moments, ordinary and weighted least squares, maximum product of spacings, and maximum goodness-of-fit estimators, and compare them with respect to mean relative estimates, mean squared errors and the 95% coverage probability of the asymptotic confidence intervals using extensive numerical simulations. Next, we consider the Bayesian inference approach using reference priors. The Metropolis-Hastings algorithm is used to draw Markov chain Monte Carlo samples, which are in turn used to compute the Bayes estimates and to construct the corresponding credible intervals. Five real data sets related to the minimum flow of water on the Piracicaba River in Brazil are used to illustrate the applicability of the discussed procedures.
Submitted 16 January, 2018;
originally announced January 2018.
-
A spatially explicit capture recapture model for partially identified individuals when trap detection rate is less than one
Authors:
Soumen Dey,
Mohan Delampady,
K. Ullas Karanth,
Arjun M. Gopalaswamy
Abstract:
Spatially explicit capture recapture (SECR) models have gained enormous popularity for solving abundance estimation problems in ecology. In this study, we develop a novel Bayesian SECR model that disentangles the process of animal movement through a detector from the process of recording data by a detector in the face of imperfect detection. We integrate this complexity into an advanced version of a recent SECR model involving partially identified individuals (Royle, 2015). We assess the performance of our model over a range of realistic simulation scenarios and demonstrate that estimates of population size $N$ improve when we utilize the proposed model relative to the model that does not explicitly estimate trap detection probability (Royle, 2015). We confront and investigate the proposed model with a spatial capture-recapture data set from a camera trapping survey on tigers (Panthera tigris) in Nagarahole, southern India. Trap detection probability is estimated at 0.489, which justifies the need for our model in field situations. We discuss possible extensions, future work and the relevance of our model to other statistical applications beyond ecology.
Submitted 28 December, 2017;
originally announced December 2017.
-
SolarisNet: A Deep Regression Network for Solar Radiation Prediction
Authors:
Subhadip Dey,
Sawon Pratiher,
Saon Banerjee,
Chanchal Kumar Mukherjee
Abstract:
Effective utilization of photovoltaic (PV) plants requires global solar radiation (GSR) forecasting models that are robust to weather variability. Random weather turbulence phenomena, coupled with the assumptions of the clear sky model as suggested by Hottel, pose significant challenges to parametric and non-parametric models in GSR conversion rate estimation. Also, a decent GSR estimate requires a costly high-tech radiometer and expert-dependent instrument handling and measurements, which are subjective. As such, a computer aided monitoring (CAM) system to evaluate PV plant operation feasibility by employing smart grid past data analytics and deep learning is developed. Our algorithm, SolarisNet, is a 6-layer deep neural network trained on data collected at two weather stations located near the Kalyani meteorological site, West Bengal, India. The daily GSR prediction performance of SolarisNet exceeds the existing state of the art, and its efficacy in inferring insights from past GSR data to comprehend daily and seasonal GSR variability, along with its competence for short-term forecasting, is discussed.
Submitted 10 December, 2017; v1 submitted 22 November, 2017;
originally announced November 2017.
-
Bayesian analysis of three parameter singular Marshall-Olkin bivariate Pareto distribution
Authors:
Biplab Paul,
Arabin Kumar Dey,
Sanku Dey,
Debasis Kundu
Abstract:
This paper provides a Bayesian analysis of the singular Marshall-Olkin bivariate Pareto distribution. We consider the three-parameter singular Marshall-Olkin bivariate Pareto distribution. We consider two types of prior: the reference prior and the gamma prior. Bayes estimates of the parameters are calculated using a slice-cum-Gibbs sampler and the Lindley approximation. Credible intervals are also provided for all methods and all prior distributions. A data analysis is included for illustrative purposes.
Submitted 29 September, 2017; v1 submitted 18 September, 2017;
originally announced September 2017.
-
Multi-sensor Transmission Management for Remote State Estimation under Coordination
Authors:
Kemi Ding,
Yuzhe Li,
Subhrakanti Dey,
Ling Shi
Abstract:
This paper considers remote state estimation in a cyber-physical system (CPS) using multiple sensors. The measurements of each sensor are transmitted to a remote estimator over a shared channel, where simultaneous transmissions from other sensors are regarded as interference signals. In such a competitive environment, each sensor needs to choose its transmission power for sending data packets, taking into account the other sensors' behavior. To model this interactive decision-making process among the sensors, we introduce a multi-player non-cooperative game framework. To overcome the inefficiency arising from the Nash equilibrium (NE) solution, we propose a correlation policy, along with the notion of correlation equilibrium (CE). An analytical comparison of the game value between the NE and the CE is provided, with and without power expenditure constraints for each sensor. Numerical simulations also demonstrate the comparison results.
Submitted 27 March, 2017;
originally announced March 2017.
-
Bayesian Analysis of Modified Weibull distribution under progressively censored competing risk model
Authors:
Arabin Kumar Dey,
Abhilash Jha,
Sanku Dey
Abstract:
In this paper we study the Bayesian analysis of the modified Weibull distribution under a progressively censored competing risk model. The study is carried out for progressively censored data. We use deterministic scan Gibbs sampling combined with slice sampling to generate samples from the posterior distribution. The posterior distribution is formed by taking the reference prior as the prior distribution. A real-life data analysis is shown for illustrative purposes.
Submitted 21 May, 2016;
originally announced May 2016.
-
A multilevel multinomial logistic regression model for identifying risk factors of anemia in children aged 6-59 months in northeastern states of India
Authors:
Sanku Dey,
Enayetur Raheem
Abstract:
In this article, we use a multilevel multinomial logistic regression model to identify the risk factors of anemia in children of the northeastern states of India. The data consisted of 10,136 children in the age group 6-59 months. We considered the level of anemia as the outcome variable with four ordinal categories (severe, moderate, mild, and non-anemic) based on hemoglobin concentration in blood as per WHO guidelines. A two-level random intercept model was considered with state of residence as the level-2 variable. The intra-class correlation (ICC) between states is 0.0577, indicating that approximately 6% of the total variation in the response variable is accounted for by the state of residence. Several multilevel models were compared, and a final model was chosen based on the deviance test. We observed the predicted probability of being at or below the severely anemic level to be 0.1247, at the moderately anemic level 0.3578, at the mildly anemic level 0.0698, and of being non-anemic 0.4477. We found that age at marriage (OR=1.13, 95% CI: 1.05, 1.21) and the number of children ever born (OR=1.09, 95% CI: 1.03, 1.15) have a significant effect on being at or below the lower hemoglobin level (severely anemic). Furthermore, age of child (OR=0.92, 95% CI: 0.86, 1.00) was a significant predictor, indicating that the odds of severe anemia decrease if the child is 48 months or older.
Submitted 11 April, 2015;
originally announced April 2015.
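The anemia abstract above reports an intra-class correlation of 0.0577 for a two-level random intercept logistic model. The abstract does not say how the ICC was computed; a common convention for logistic multilevel models is the latent-variable definition, in which the level-1 residual variance is fixed at pi^2/3. The sketch below assumes that convention (an assumption, not something the abstract confirms) and uses hypothetical helper names to show what between-state variance the reported ICC would imply under it.

```python
import math

# Level-1 residual variance of the standard logistic distribution.
# Using it here is an assumption (latent-variable ICC definition).
LOGISTIC_RESIDUAL_VAR = math.pi ** 2 / 3


def icc_from_variance(sigma2_u):
    """ICC implied by a random-intercept variance sigma2_u."""
    return sigma2_u / (sigma2_u + LOGISTIC_RESIDUAL_VAR)


def variance_from_icc(icc):
    """Random-intercept variance implied by a given ICC (inverse of the above)."""
    return icc * LOGISTIC_RESIDUAL_VAR / (1.0 - icc)


# Between-state variance implied by the ICC quoted in the abstract,
# under the assumed latent-variable definition.
implied_state_variance = variance_from_icc(0.0577)
```

Under this assumed definition, an ICC near 0.06 corresponds to a between-state intercept variance of roughly 0.2 on the logit scale, which is consistent with the abstract's reading that state of residence explains about 6% of the total variation.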