There are several excellent image datasets with pixel-level annotations available in the computer vision community to enable semantic segmentation of scenes, motivated by applications such as autonomous driving. Examples of such datasets include Cityscapes [5] or Vistas [9]. However, data is scarce for training computer vision models for unmanned aerial vehicles (UAVs), also known as drones. We propose a framework to compensate for this lack of training data and still obtain generalizable models for segmentation of images/videos acquired by drones. We start with street-view annotations, i.e., pixel-labeled images captured at the street-view level – either provided in a dataset or generated by running an existing “street-view” semantic segmentation model. Then, we consider images at varying poses or elevation angles captured by a drone. By leveraging good segmentations of the street-view data, we train parameters of a “helper” network that learns to nominally change the internal feat…
A14 Highly parallel analysis of gene expression has recently been used to identify gene sets or “signatures” to improve patient diagnosis and risk stratification. These signatures are usually identified by way of traditional statistical testing procedures that, due to the dimensionality of microarrays, lead to a high number of false prognostic signatures. A method was developed to test batches of a user-specified number of randomly chosen signatures in patient microarray datasets. The percentage of randomly generated signatures yielding prognostic value was assessed using ROC analysis by calculating the area under the curve (AUC) in six publicly available cancer patient microarray datasets. Within these datasets, we found that a signature consisting of randomly selected genes has an average 10% chance of being called prognostic when assessed in a single dataset, but can range from 1% to ~40% depending on the dataset in question. Increasing the number of validation datasets markedly reduced this number. We further show that the use of an arbitrary cut-off value for evaluation of signature significance is not suitable for this type of research, but should be defined for each dataset separately. Our method can be used to establish and evaluate signature performance of any derived signature within single or multiple datasets by comparing its performance to thousands of randomly generated signatures.
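The random-signature test described above can be sketched as follows. This is a minimal null simulation, not the paper's implementation: the data are synthetic, the signature score (mean expression of the signature genes) and the fixed 0.6 AUC cutoff are hypothetical stand-ins for the datasets and per-dataset thresholds the abstract argues for.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Null simulation: expression is uncorrelated with outcome, so every
# "prognostic" random signature is by construction a false positive.
rng = np.random.default_rng(0)
n_patients, n_genes, sig_size, n_sigs = 200, 5000, 50, 1000
X = rng.normal(size=(n_patients, n_genes))   # patients x genes
y = rng.integers(0, 2, n_patients)           # binary outcome labels
cutoff = 0.6                                 # hypothetical fixed cutoff

hits = 0
for _ in range(n_sigs):
    genes = rng.choice(n_genes, size=sig_size, replace=False)
    score = X[:, genes].mean(axis=1)         # signature score per patient
    a = auc(score, y)
    if max(a, 1.0 - a) >= cutoff:            # direction-agnostic call
        hits += 1
false_positive_rate = hits / n_sigs
```

The fraction of random signatures clearing the cutoff is the quantity the abstract reports varying from 1% to ~40% across datasets, which is why a single arbitrary cutoff is inadequate and the threshold should be calibrated per dataset against this random-signature distribution.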
2009 International Conference on Machine Learning and Applications, 2009
Missing data is a given in the medical domain, so machine learning models should have satisfactory performance even when missing data occurs. Our previous work has focused on support vector machines (SVM), but we hypothesize that Bayesian networks (BN) can handle missing data better. To test the hypothesis, we trained a BN and an SVM model for 2-year survival on…
Data mining represents an alternative approach to identify new predictors of multifactorial diseases. This work aimed at building an accurate predictive model for incident hypertension using data mining procedures. The primary study population consisted of 1605 normotensive individuals aged 20-79 years with 5-year follow-up from the population-based Study of Health in Pomerania (SHIP). The initial set was randomly split into a training and a testing set. We used a probabilistic graphical model applying a Bayesian network to create a predictive model for incident hypertension and compared the predictive performance with the established Framingham risk score for hypertension. Finally, the model was validated in 2887 participants from INTER99, a Danish community-based intervention study. In the training set of SHIP data, the Bayesian network used a small subset of relevant baseline features including age, mean arterial pressure, rs16998073, serum glucose and urinary albumin concentrations. Furthermore, we detected relevant interactions between age and serum glucose as well as between rs16998073 and urinary albumin concentrations [area under the receiver operating characteristic curve (AUC) 0.76]. The model was confirmed in the SHIP validation set (AUC 0.78) and externally replicated in INTER99 (AUC 0.77). Compared to the established Framingham risk score for hypertension, the predictive performance of the new model was similar in the SHIP validation set and moderately better in INTER99. Data mining procedures identified a predictive model for incident hypertension, which included innovative and easy-to-measure variables. The findings promise great applicability in screening settings and clinical practice.
… acceptable to the user, and to help the human expert more easily identify errors in the conclusion reached by the system [4]. On the other hand, when building classifiers from (medical) data sets, the best performance is often achieved by “black-box” systems, such as … \(\sum_{i=1} w_i x_i^*\) …
Proceedings of the AAAI Conference on Artificial Intelligence
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences – a topic being actively studied in the community. To address this limitation, we propose Nyströmformer – a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in…
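The Nyström approximation of softmax attention can be sketched in NumPy as follows. This is a simplified single-head sketch, not the released model: the landmarks are segment means of the queries and keys, and NumPy's exact pseudoinverse stands in for the iterative approximation the paper uses to keep the whole computation O(n).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, m=8):
    """Nystrom approximation of softmax attention with m landmarks.

    The full n x n attention matrix is never formed; only n x m and
    m x m softmax blocks are computed.
    """
    n, d = Q.shape
    s = 1.0 / np.sqrt(d)
    # Landmarks: segment means of queries/keys (one simple choice).
    idx = np.array_split(np.arange(n), m)
    Qm = np.stack([Q[i].mean(axis=0) for i in idx])   # m x d
    Km = np.stack([K[i].mean(axis=0) for i in idx])   # m x d
    F = softmax(Q @ Km.T * s)    # n x m: queries vs. landmark keys
    A = softmax(Qm @ Km.T * s)   # m x m: landmark-landmark kernel
    B = softmax(Qm @ K.T * s)    # m x n: landmark queries vs. keys
    return F @ np.linalg.pinv(A) @ (B @ V)
```

A useful sanity check: with one landmark per token (m = n), the three factors coincide and, since A pinv(A) A = A, the approximation reduces exactly to standard softmax attention.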
Proceedings of the AAAI Conference on Artificial Intelligence
We present a machine-learning-guided process that can efficiently extract factor tables from unstructured rate filing documents. Our approach combines multiple deep-learning-based models that work in tandem to create structured representations of tabular data present in unstructured documents such as PDF files. On the machine-learning side, this process combines CNNs to detect tables, language-based models to extract table metadata, and conventional computer vision techniques to improve the accuracy of the extracted tabular data. The extracted tabular data is validated through an intuitive user interface. This process, which we call Harvest, significantly reduces the time needed to extract tabular information from PDF files, enabling analysis of such data at a speed and scale that was previously unattainable.
The problem of learning by aggregating the opinions or knowledge of multiple sources does not fit the usual single-annotator learning scenario. In these problems, ground-truth may not exist and multiple annotators are available. In particular, active learning offers new challenges as, in addition to a data point, a knowledge source must also be optimally selected. This is of interest in a crowdsourcing setting as annotators may have varying expertise or be adversarial; thus, the information they can offer will vary considerably. We propose an approach to address this situation by focusing on maximizing the information that an annotator label provides about the true (but unknown) label of data points.
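The selection criterion above can be sketched as a mutual-information computation. This is a hedged illustration, not the paper's method: it assumes each annotator is summarized by a known confusion matrix P(A|Y), which together with the label prior P(Y) gives the mutual information I(Y; A) that the annotator's label carries about the true label.

```python
import numpy as np

def annotator_information(prior, confusion):
    """Mutual information I(Y; A) between the true label Y and an
    annotator's label A, given P(Y) (prior) and P(A|Y) (confusion),
    in nats: sum_{y,a} p(y,a) log( p(y,a) / (p(y) p(a)) )."""
    joint = prior[:, None] * confusion        # p(y, a)
    pa = joint.sum(axis=0)                    # marginal p(a)
    indep = prior[:, None] * pa               # p(y) p(a)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / indep[mask])).sum())

def pick_annotator(prior, confusions):
    """Select the annotator whose label is most informative about Y."""
    scores = [annotator_information(prior, c) for c in confusions]
    return int(np.argmax(scores))
```

Under this model a perfectly reliable annotator (identity confusion matrix) yields I(Y; A) = H(Y), while a random or uniformly adversarial one yields zero, so querying the former is always preferred.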
In this paper we describe RoCKET (Robust Classification and Knowledge Extraction from Text), a machine learning platform and interface that allows users to extract information and answer non-trivial questions about a large corpus of unstructured documents. Our approach leverages state-of-the-art text representations, active learning and crowdsourcing to efficiently label concepts and train algorithms to classify documents without requiring extensive domain expertise from the users. We claim and show empirical evidence that (1) our implementation of active learning algorithms provides a more efficient labeling experience than passive learning, (2) our text representations improve performance over baseline bag-of-words models when the number of labeled examples is small, and (3) RoCKET can be applied in industry settings, and more specifically in the insurance domain, where it is a valuable tool to extract relevant customer-related information during claims processes.
The ubiquitous adoption of Conversational Agents (CA) in commercial settings is changing the way industries interact with their customers. Intent classification is an important first step in designing an efficient CA. Every intent that the CA can recognize is represented by a set of natural language examples that are used by the system to learn how to map any user’s utterance to the corresponding intent. However, when a new intent is introduced, there are usually not enough examples to train the intent appropriately. In this paper we propose a hybrid system that combines a traditional Deep Neural Network-based classification approach with few-shot learning strategies. The simple yet effective proposed approach achieves good performance for newly introduced intents with few training examples while maintaining performance for previously known intents. We show the potential of the proposed approach on data generated by a deployed chat system for the insurance domain. To demonstra…
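The few-shot side of such a hybrid can be sketched as nearest-prototype classification: each new intent is represented by the mean embedding of its few labelled examples, and an utterance is assigned to the closest prototype. This is a toy illustration, not the deployed system: the bag-of-words `embed` function and the tiny vocabulary are hypothetical stand-ins for a learned sentence encoder, and the DNN fallback for well-trained intents is omitted.

```python
import numpy as np
from collections import Counter

def embed(text, vocab):
    """Toy unit-norm bag-of-words embedding over a fixed vocabulary;
    a real system would use a learned sentence encoder instead."""
    counts = Counter(text.lower().split())
    v = np.array([counts[w] for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def build_prototypes(examples, vocab):
    """One prototype per intent: the mean embedding of its examples."""
    return {intent: np.mean([embed(t, vocab) for t in texts], axis=0)
            for intent, texts in examples.items()}

def fewshot_predict(utterance, prototypes, vocab):
    """Assign the utterance to the intent with the most similar prototype."""
    x = embed(utterance, vocab)
    sims = {intent: float(x @ p) for intent, p in prototypes.items()}
    return max(sims, key=sims.get)
```

In the hybrid setting described above, this prototype score would only be consulted for newly introduced intents (or gated by a similarity threshold), with mature intents still handled by the trained DNN classifier.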
Papers by Glenn Fung