We consider the problem of active feature elicitation in which, given some examples with all the features (say, the full Electronic Health Record), and many examples with only some of the features (say, demographics), the goal is to identify the set of examples on which more information (say, lab tests) needs to be collected. The observation is that some set of features may be more expensive, personal or cumbersome to collect. We propose a classifier-independent, similarity metric-independent, general active learning approach which identifies examples that are dissimilar to the ones with the full set of data and acquires the complete set of features for these examples. Motivated by four real clinical tasks, our extensive evaluation demonstrates the effectiveness of this approach. To demonstrate the generalization capabilities of the proposed approach, we consider different divergence metrics and classifiers and present consistent results across the domains.
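As a rough illustration of the selection step described in this abstract (not the authors' exact procedure), the sketch below scores each partially observed example by its average Euclidean distance to the fully observed pool on the shared features and requests the remaining features for the most dissimilar ones; the function name, the metric, and the toy data are assumptions, and any other divergence could be substituted.

    import numpy as np

    def select_for_elicitation(full_X_shared, partial_X_shared, budget=10):
        # Average Euclidean distance (on the shared features) from each partially
        # observed example to the fully observed pool; any divergence could be
        # plugged in here instead.
        diffs = partial_X_shared[:, None, :] - full_X_shared[None, :, :]
        avg_dist = np.linalg.norm(diffs, axis=2).mean(axis=1)
        # The most dissimilar examples are the candidates for full feature acquisition.
        return np.argsort(-avg_dist)[:budget]

    # Toy usage with random data standing in for demographics-only records.
    rng = np.random.default_rng(0)
    full = rng.normal(size=(50, 5))       # examples with the full feature set
    partial = rng.normal(size=(200, 5))   # examples with shared features only
    print(select_for_elicitation(full, partial, budget=5))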
We consider the problem of learning generalized first-order representations of concepts from a single example. To address this challenging problem, we augment an inductive logic programming learner with two novel algorithmic contributions. First, we define a distance measure between candidate concept representations that improves the efficiency of the search for the target concept and its generalization. Second, we leverage richer human inputs in the form of advice to improve the sample-efficiency of learning. We prove that the proposed distance measure is semantically valid and use it to derive a PAC bound. Our experimental analysis on diverse concept learning tasks demonstrates both the effectiveness and efficiency of the proposed approach over a first-order concept learner using only examples.
Anomaly detection for time-series data has become an essential task for many data-driven applications, fueled by an abundance of data and out-of-the-box machine learning algorithms. In many real-world settings, developing a reliable anomaly model is highly challenging due to insufficient anomaly labels and the prohibitively expensive cost of obtaining anomaly examples. This imposes a significant bottleneck on reliably evaluating model quality for model selection and parameter tuning. As a result, many existing anomaly detection algorithms fail to show their promised performance after deployment. In this paper, we propose LaF-AD, a novel anomaly detection algorithm with label-free model selection for unlabeled time-series data. Our proposed algorithm performs fully unsupervised ensemble learning across a large number of candidate parametric models. We develop a model variance metric that quantifies the sensitivity of anomaly probability with a bootstrapping method. Then it makes a collecti...
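The abstract does not spell out the variance metric, so the sketch below only illustrates the general idea under stated assumptions: fit a candidate detector on bootstrap resamples of the unlabeled series, measure how much its anomaly scores vary across resamples, and prefer the candidate whose scores are most stable. IsolationForest and the contamination grid are placeholders, not the paper's candidate models.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    def bootstrap_score_variance(series, make_model, n_boot=20, seed=0):
        # Mean per-point variance of anomaly scores across models fit on
        # bootstrap resamples; lower variance means a more stable candidate.
        rng = np.random.default_rng(seed)
        X = series.reshape(-1, 1)
        scores = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(X), size=len(X))
            model = make_model().fit(X[idx])
            scores.append(model.score_samples(X))  # higher score = more normal
        return np.var(np.stack(scores), axis=0).mean()

    # Toy usage: choose the contamination setting with the most stable scores.
    rng = np.random.default_rng(1)
    series = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 5)])
    candidates = {c: (lambda c=c: IsolationForest(contamination=c, random_state=0))
                  for c in (0.01, 0.05, 0.10)}
    variances = {c: bootstrap_score_variance(series, make) for c, make in candidates.items()}
    print(min(variances, key=variances.get), variances)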
Practical machine learning applications involving time series data, such as firewall log analysis to proactively detect anomalous behavior, are concerned with real-time analysis of streaming data. Consequently, we need to update the ML models as the statistical characteristics of such data may shift frequently with time. One alternative explored in the literature is to retrain models with updated data whenever the model’s accuracy is observed to degrade. However, these methods rely on near real-time availability of ground truth, which is rarely fulfilled. Further, in applications with seasonal data, temporal concept drift is confounded by seasonal variation. In this work we propose an approach called Unsupervised Temporal Drift Detector (UTDD) to flexibly account for seasonal variation, efficiently detect temporal concept drift in time series data in the absence of ground truth, and subsequently adapt our ML models to concept drift for better generalization.
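UTDD itself is not described in detail here, so the following is only a hedged sketch of one way to detect drift without labels while accounting for seasonality: difference the series at the seasonal lag to remove a stable seasonal pattern, then compare a recent window against the preceding reference window with a two-sample Kolmogorov-Smirnov test. The lag, window size, and significance level are illustrative choices.

    import numpy as np
    from scipy.stats import ks_2samp

    def drift_detected(series, season_lag, window, alpha=0.01):
        # Seasonal differencing removes a stable seasonal pattern (and level);
        # the latest `window` points are then compared against the preceding
        # reference window of equal size.
        deseason = series[season_lag:] - series[:-season_lag]
        recent = deseason[-window:]
        reference = deseason[-2 * window:-window]
        stat, p_value = ks_2samp(reference, recent)
        return p_value < alpha

    # Toy usage: inject a gradual drift on top of a daily (24-step) seasonal signal.
    rng = np.random.default_rng(2)
    t = np.arange(24 * 60)
    series = np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.2, t.size)
    series[-144:] += np.linspace(0.0, 3.0, 144)   # simulated concept drift
    print(drift_detected(series, season_lag=24, window=120))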
In this work-in-progress paper, we outline a speculative method for learning causal models directly from data, without any interventions or inductive bias. Our ensemble approach uncovers some interesting relations for understanding post-partum depression based on family and socio-economic factors.
Time series forecasting is a fundamental task emerging from diverse data-driven applications. Many advanced autoregressive methods such as ARIMA [8] have been used to develop forecasting models. Recently, deep learning based methods such as DeepAr [16], NeuralProphet [1], and Seq2Seq [30] have been explored for the time series forecasting problem. In this paper, we propose a novel time series forecasting model, DeepGB. We formulate and implement a variant of Gradient boosting [18] wherein the weak learners are DNNs whose weights are incrementally found in a greedy manner over iterations. In particular, we develop a new embedding architecture that improves the performance of many deep learning models on time series using this Gradient boosting [18] variant. We demonstrate that our model outperforms existing comparable state-of-the-art models on real-world sensor data and a public dataset.
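A minimal sketch of the boosting scheme described above, assuming squared loss and small MLPs standing in for the paper's embedding-based DNN weak learners: each stage fits a network to the current residuals and is added to the ensemble with a fixed shrinkage factor.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def fit_boosted_dnn(X, y, n_stages=5, lr=0.5):
        # Greedy additive model: each stage fits a small MLP to the residuals
        # of the current ensemble, i.e. gradient boosting under squared loss.
        prediction = np.full(len(y), y.mean())
        stages = []
        for _ in range(n_stages):
            residual = y - prediction
            weak = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
            weak.fit(X, residual)
            prediction += lr * weak.predict(X)
            stages.append(weak)
        return y.mean(), stages

    def predict_boosted(model, X, lr=0.5):
        base, stages = model
        return base + lr * sum(stage.predict(X) for stage in stages)

    # Toy usage on a lagged sine series (lag features standing in for embeddings).
    t = np.arange(500, dtype=float)
    y = np.sin(2 * np.pi * t / 50)
    X = np.stack([np.roll(y, k) for k in (1, 2, 3)], axis=1)[3:]
    model = fit_boosted_dnn(X, y[3:])
    print(predict_boosted(model, X)[:5])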
Arithmetic Circuits (ACs) and Sum-Product Networks (SPNs) have recently gained significant interest by virtue of being tractable deep probabilistic models. We propose the first gradient-boosted method for structure learning of discriminative ACs (DACs), called DACBOOST. In discrete domains, ACs are essentially equivalent to mixtures of trees, so DACBOOST decomposes a large AC into smaller tree-structured ACs and learns them in a sequential, additive manner. This non-parametric way of learning DACs results in a model with very few tuning parameters, making the learned model significantly more efficient. We demonstrate on standard and real data sets the efficiency of DACBOOST compared to state-of-the-art DAC learners, without sacrificing effectiveness.
We consider the problem of learning generalized first-order representations of concepts from a small number of examples. We augment an inductive logic programming learner with two novel contributions. First, we define a distance measure between candidate concept representations that improves the efficiency of the search for the target concept and generalization. Second, we leverage richer human inputs in the form of advice to improve the sample efficiency of learning. We prove that the proposed distance measure is semantically valid and use it to derive a PAC bound. Our experiments on diverse learning tasks demonstrate both the effectiveness and efficiency of our approach.
We consider the problem of learning Relational Logistic Regression (RLR). Unlike standard logistic regression, the features of RLR are first-order formulae with associated weight vectors instead of scalar weights. We turn the problem of learning RLR into learning these vector-weighted formulae and develop a learning algorithm based on the recently successful functional-gradient boosting methods for probabilistic logic models. We derive the functional gradients and show how the weights can be learned simultaneously in an efficient manner. Our empirical evaluation on standard and novel data sets demonstrates the superiority of our approach over other methods for learning RLR.
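The relational machinery (vector-weighted first-order formulae) does not fit a short sketch, but the functional-gradient step the abstract builds on is the standard one for log-loss: the pointwise gradient is I(y = 1) - P(y = 1 | x), and each boosting stage fits a regressor to it. A propositional illustration, with plain count features standing in for formula groundings as an assumption:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def functional_gradient_boost(X, y, n_stages=20, lr=0.5):
        # psi(x) accumulates regression trees fit to the pointwise functional
        # gradient y - sigmoid(psi(x)) of the log-likelihood.
        psi = np.zeros(len(y))
        trees = []
        for _ in range(n_stages):
            prob = 1.0 / (1.0 + np.exp(-psi))
            gradient = y - prob                  # I(y = 1) - P(y = 1 | x)
            tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, gradient)
            psi += lr * tree.predict(X)
            trees.append(tree)
        return trees

    # Toy usage: two count-valued features standing in for formula groundings.
    rng = np.random.default_rng(3)
    X = rng.poisson(2.0, size=(300, 2)).astype(float)
    y = (X[:, 0] > X[:, 1]).astype(float)
    print(len(functional_gradient_boost(X, y)))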
We consider the problem of active feature elicitation in which, given some examples with all the features (say, the full Electronic Health Record), and many examples with only some of the features (say, demographics), the goal is to identify the set of examples on which more information (say, lab tests) needs to be collected. The observation is that some set of features may be more expensive, personal or cumbersome to collect. We propose a classifier-independent, similarity metric-independent, general active learning approach which identifies examples that are dissimilar to the ones with the full set of data and acquires the complete set of features for these examples. Motivated by four real clinical tasks, our extensive evaluation demonstrates the effectiveness of this approach.
Analysis of large observational data sets generated by a reactive system is a common challenge in debugging system failures and determining their root cause. One of the major problems is that such observational data suffer from survivorship bias. Examples include analyzing traffic logs from networks, and simulation logs from circuit design. In such applications, users want to detect non-spurious correlations from observational data and obtain actionable insights about them. In this paper, we introduce Log to Neuro-symbolic (Log2NS), a framework that combines probabilistic analysis from machine learning (ML) techniques on observational data with certainties derived from symbolic reasoning on an underlying formal model. We apply the proposed framework to network traffic debugging by employing the following steps. To detect patterns in network logs, we first generate global embedding vector representations of entities such as IP addresses, ports, and applications. Next, we represent...
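The first step mentioned above is learning embeddings for log entities (IP addresses, ports, applications). A hedged sketch of one standard way to do this, not the Log2NS pipeline itself, treats each log record as a "sentence" of entity tokens and trains a skip-gram model; gensim and the toy records are assumptions.

    from gensim.models import Word2Vec

    # Toy firewall-style records: source IP, destination port, application.
    log_lines = [
        ["10.0.0.1", "443", "https"],
        ["10.0.0.1", "53", "dns"],
        ["10.0.0.2", "443", "https"],
        ["10.0.0.3", "22", "ssh"],
        ["10.0.0.2", "53", "dns"],
    ] * 50  # repeat so the toy corpus has enough co-occurrences

    # Skip-gram embeddings: entities that co-occur in the same records end up nearby.
    model = Word2Vec(sentences=log_lines, vector_size=16, window=2,
                     min_count=1, sg=1, epochs=20, seed=0)

    print(model.wv.most_similar("443", topn=3))  # entities that behave like port 443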