Named entity recognition (NER) is one of the fundamental tasks in natural-language processing (NLP). Though the combination of different classifiers has been widely applied in several well-studied languages, this is the first time the method has been applied to Vietnamese. In this article, we describe how voting techniques can improve the performance of Vietnamese NER. By combining several state-of-the-art machine-learning algorithms using voting strategies, our final system outperforms the individual algorithms, achieving an F-measure of 89.12. A detailed discussion of the challenges of NER in Vietnamese is also presented.
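The abstract does not specify the voting scheme; a minimal sketch of token-level majority voting over several NER taggers' label sequences (the tagger names and example labels below are hypothetical, not from the paper) could look like this:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token label sequences from several NER taggers by
    simple majority voting (ties broken by the first tagger's label)."""
    voted = []
    for labels in zip(*predictions):          # labels for one token, across taggers
        counts = Counter(labels)
        best, best_n = labels[0], counts[labels[0]]
        for lab, n in counts.items():
            if n > best_n:
                best, best_n = lab, n
        voted.append(best)
    return voted

# Three hypothetical taggers labelling the same 4-token sentence
crf_out = ["B-PER", "I-PER", "O", "B-LOC"]
svm_out = ["B-PER", "O",     "O", "B-LOC"]
mem_out = ["B-PER", "I-PER", "O", "O"]

print(majority_vote([crf_out, svm_out, mem_out]))
# → ['B-PER', 'I-PER', 'O', 'B-LOC']
```

More elaborate schemes (weighted or stacked voting) follow the same shape, replacing the counter with per-tagger weights.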
Low-density languages are also known as lesser-known, poorly-described, less-resourced, minority or less-computerized languages because they have fewer resources available. Collecting and annotating a voluminous corpus for NLP applications in these languages proves quite challenging. To develop any NLP application for a low-density language, one needs an annotated corpus and a standard annotation scheme. Because of their non-standard usage in text and other linguistic nuances, these languages pose significant challenges that are both linguistic and technical in nature. The present paper highlights some of the underlying issues and challenges in developing statistical POS taggers applying SVM and CRF++ for Sambalpuri, a less-resourced Eastern Indo-Aryan language. A corpus of approximately 121k tokens was collected from the web and converted into Unicode encoding. The whole corpus is annotated under the BIS (Bureau of Indian Standards) annotation scheme devised for Odia under the ILCI (Indian Languages Corpora Initiative) Project. The taggers are trained and tested with approximately 80k and 13k tokens, respectively. The SVM tagger achieves 83% accuracy, while the CRF++ tagger achieves 71.56%, lower than the former.
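As an illustration of the kind of context features such CRF++/SVM taggers typically consume, here is a hedged sketch; the feature template is illustrative, not the paper's actual one:

```python
def token_features(tokens, i):
    """Context features of the kind typically fed to CRF++ / SVM-based
    POS taggers (this feature template is an illustrative assumption)."""
    w = tokens[i]
    return {
        "word": w,
        "prefix2": w[:2],                                   # crude affix cues
        "suffix3": w[-3:],
        "is_digit": w.isdigit(),
        "prev": tokens[i - 1] if i > 0 else "<BOS>",        # left context
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",  # right context
    }

sent = ["the", "cat", "sat"]
print(token_features(sent, 1))
```

In CRF++ itself these features are declared in a template file rather than in code, but the information content is the same.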
In this paper we describe a semi-supervised approach to person re-identification that combines discriminative models of person identity with a Conditional Random Field (CRF) to exploit the local manifold approximation induced by the nearest neighbor graph in feature space. The linear discriminative models learned on a few gallery images provide coarse separation of probe images into identities, while a graph topology defined by distances between all person images in feature space leverages local support for label propagation in the CRF. We evaluate our approach using multiple scenarios on several publicly available datasets, where the number of identities varies from 28 to 191 and the number of images ranges between 1,003 and 36,171. We demonstrate that the discriminative model and the CRF are complementary and that the combination of both leads to significant improvement over state-of-the-art approaches. We further demonstrate how the performance of our approach improves with increas...
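A minimal sketch of graph-based label propagation of identity scores, in the spirit of the local support the CRF exploits; the toy graph, weights and update rule here are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.9, n_iter=100):
    """Label propagation on an affinity graph W: unlabeled nodes absorb
    identity scores from neighbors, while labeled nodes keep a (1 - alpha)
    pull toward their initial one-hot labels in Y."""
    S = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F

# Toy graph: nodes 0-2 form one cluster (node 0 is a labeled gallery image
# of identity 0), nodes 3-4 form another (node 3 is labeled identity 1).
W = np.array([[0, 1, 1, 0,   0],
              [1, 0, 1, 0,   0],
              [1, 1, 0, 0.1, 0],
              [0, 0, 0.1, 0, 1],
              [0, 0, 0,   1, 0]], dtype=float) + 1e-9
Y = np.zeros((5, 2))
Y[0, 0] = 1.0   # gallery label: identity 0
Y[3, 1] = 1.0   # gallery label: identity 1
F = propagate_labels(W, Y)
print(F.argmax(axis=1))   # → [0 0 0 1 1]
```

The unlabeled probe nodes inherit the identity of the cluster they sit in, which is the manifold effect the paper's CRF formalizes.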
Strong ground motions can trigger soil liquefaction that will alter the propagating signal and induce ground failure. Significant damage to structures and lifelines has been evidenced after recent earthquakes such as Christchurch, New Zealand and Tohoku, Japan, in 2011. Accurate prediction of structures’ seismic risk requires careful modeling of the nonlinear behavior of soil-structure interaction (SSI) systems. In general, seismic risk analysis is described as the convolution between the natural hazard and the vulnerability of the system. This thesis is a contribution to the numerical modeling of liquefaction evaluation and mitigation. For this purpose, the finite element method (FEM) in the time domain is used as the numerical tool. The main numerical model consists of a reinforced concrete building with a shallow rigid foundation standing on saturated cohesionless soil. As the initial step of the seismic risk analysis, the first part of the thesis is devoted to the characteriz...
Håvard Rue and Leonhard Held, Gaussian Markov Random Fields: Theory and Applications (Monographs on Statistics and Applied Probability 104). Chapman & Hall/CRC, 2005.
This study focuses on developing statistical POS taggers for Odia using two distinct algorithms: CRF (probabilistic) and SVM (classifier-based). Approximately 400k tokens have been used to develop both, with the training and testing data amounting to 236k and 123k tokens respectively. For annotating the whole ILCI corpus, the BIS annotation scheme has been adopted with some modifications. As far as the experimental setup is concerned, similar features have been selected to train both models. Evaluation has been conducted using precision and recall measures for CRF and known-unknown word accuracy for SVM. A comprehensive error analysis has been conducted to identify the types of errors committed by both in common, based on which 5-fold manual error correction and a final evaluation have been conducted. After identifying and discussing the issues, different solutions have been proposed: formulation of linguistic rules, corpus-driven methods, word sense disambiguation, and application of external tools such as NER, WSD and a morph analyser. Finally, the taggers have been made available online using JSP and JST technology. Both taggers, CRF++ (94.39 and 88.87) and SVM (96.85 and 93.59), have outperformed the existing Odia POS taggers in terms of both reliability and accuracy. To ensure the quality of the output, an inter-annotator (IA) agreement study has been conducted.
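The per-tag precision and recall measures used for the CRF evaluation can be computed as in this sketch; the tagset and data below are toy examples, not the study's corpus:

```python
def precision_recall(gold, pred, tag):
    """Per-tag precision and recall over aligned gold/predicted
    tag sequences (flat token-level evaluation)."""
    tp = sum(1 for g, p in zip(gold, pred) if p == tag and g == tag)
    fp = sum(1 for g, p in zip(gold, pred) if p == tag and g != tag)
    fn = sum(1 for g, p in zip(gold, pred) if p != tag and g == tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy BIS-style tags: NN (noun), VM (verb), PSP (postposition)
gold = ["NN", "VM", "NN", "PSP", "NN"]
pred = ["NN", "NN", "NN", "PSP", "VM"]
p, r = precision_recall(gold, pred, "NN")
print(round(p, 2), round(r, 2))   # → 0.67 0.67
```

Macro-averaging these numbers over all tags gives the overall precision/recall figures reported in such evaluations.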
In this paper, we propose an efficient approach to the multi-label interactive image segmentation problem by applying a higher-order Conditional Random Fields model that incorporates superpixels as higher-order energy terms. CRF models have been exploited for segmentation for years, but they require a training set to provide the necessary information; a fully unsupervised strategy is therefore fairly restrictive given the variety of image contexts and categories. For this reason, user interaction becomes essential for addressing the multi-label segmentation problem while exploiting the CRF framework. Promising experiments are conducted on the MSRC and Berkeley datasets, comparing against the original Conditional Random Fields framework.
Due to the dominating influence of the Partially Observable Markov Decision Process (POMDP) framework used in spoken dialog systems, most previously proposed dialog state tracking methods favor generative models. In this work, however, we adopt a discriminative approach to model the evolution of the belief state within a spoken dialog system; more specifically, we use Conditional Random Fields (CRFs). Although we are not the first to apply CRFs to dialog state tracking, the proposed approach treats dialog state tracking as a sequence tagging problem, in the hope of capturing the evolving user goals during a dialog. Equipped with an incremental decoding strategy as well as user goal change detection, our results show that both sequence modeling and goal change information bring advantages to the task.
This research work presents a probability-based CRF++ part-of-speech (POS) tagger for the Odia language. A corpus of approximately 600k tokens has been annotated manually in the Indian Languages Corpora Initiative (ILCI) project for Odia. The whole Odia corpus has been annotated based on the Bureau of Indian Standards (BIS) tagset developed by the DIT, Govt. of India, with some modifications under the ILCI. The tagger has been trained and tested with 236,793 and 128,646 tokens respectively. It provides 94.39% accuracy on seen data and 88.87% on unseen data in terms of precision and recall. In addition, this study conducts an inter-annotator (IA) agreement study and an error analysis to identify the salient erroneous labels committed by the automatic tagger, and provides various suggestions to improve its efficiency. Furthermore, this study also presents the user-interface architecture and its functionalities.
Process mining techniques focus on extracting insights into processes from event logs. In many cases, events recorded in the event log are too fine-grained, causing process discovery algorithms to discover incomprehensible process models or process models that are not representative of the event log. We show that when process discovery algorithms can only discover an unrepresentative process model from a low-level event log, structure in the process can in some cases still be discovered by first abstracting the event log to a higher level of granularity. This gives rise to the challenge of bridging the gap between an original low-level event log and a desired high-level perspective on this log, such that a more structured or more comprehensible process model can be discovered. We show that supervised learning can be leveraged for the event abstraction task when annotations with high-level interpretations of the low-level events are available for a subset of the sequences (i.e., traces). We present a method to generate feature vector representations of events based on XES extensions, and describe an approach to abstract events in an event log with Conditional Random Fields using these event features. Furthermore, we propose a sequence-focused metric for evaluating supervised event abstraction results that aligns closely with the tasks of process discovery and conformance checking. We conclude by demonstrating the usefulness of supervised event abstraction for obtaining more structured and/or more comprehensible process models using both real-life and synthetic event data.
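The paper's abstraction step uses Conditional Random Fields over XES-based features; as a much simpler stand-in that conveys the task, the following sketch learns a per-event majority mapping from annotated traces and collapses repeated high-level activities (the event and activity names are invented for illustration):

```python
from collections import Counter, defaultdict

def learn_abstraction(annotated_traces):
    """Baseline stand-in for the paper's CRF: map each low-level event
    to its most frequent high-level label across the annotated traces."""
    votes = defaultdict(Counter)
    for low, high in annotated_traces:
        for ev, lab in zip(low, high):
            votes[ev][lab] += 1
    return {ev: c.most_common(1)[0][0] for ev, c in votes.items()}

def abstract_trace(mapping, trace):
    """Apply the learned mapping, collapsing consecutive repeats of the
    same high-level activity into one abstracted event."""
    high = [mapping[ev] for ev in trace]
    return [h for i, h in enumerate(high) if i == 0 or h != high[i - 1]]

# Hypothetical low-level sensor events annotated with high-level activities
annotated = [
    (["door", "kettle", "cup"], ["enter", "coffee", "coffee"]),
    (["door", "cup", "kettle"], ["enter", "coffee", "coffee"]),
]
mapping = learn_abstraction(annotated)
print(abstract_trace(mapping, ["door", "kettle", "cup"]))  # → ['enter', 'coffee']
```

A CRF improves on this baseline by conditioning each label on the surrounding sequence rather than on the event in isolation.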
Fully automated interpretation and understanding of remotely sensed data by a computer has been a challenge for many decades, and many approaches have been developed over the years. Significant advances in knowledge-based image understanding, machine learning and artificial intelligence have led to this topic being the focus of much research in recent years.
This book highlights the different theoretical and application-oriented aspects of, and potential solutions to, the topic of automated remote sensing data analysis. Both classical knowledge-based and modern machine learning-oriented concepts are described. The field is specialized and dynamic, as well as interdisciplinary and multilayered. Written by an international team of experts, the book has therefore been split into parts dealing with concepts and applications, and the focus is on elucidating the complementarity of different lines of research rather than providing a complete set of scientific approaches.
Part A of this book gives insight into the basic theories and concepts of feature extraction, image understanding and the respective assessment strategies as well as into geometric, radiometric and sensor-related fundamentals of remote sensing technology. Part B focuses on various scientific and practical applications of remote sensing data analysis. These range from the automatic detailed reconstruction of complex 3D environments to visual tracking of objects in image sequences as well as monitoring natural and anthropogenic long-term processes on a regional scale. Part C sketches recent trends in automatic analysis of remote sensing data.
Named Entity Recognition is a fundamental task in Natural Language Processing. It is a subtask of information extraction that identifies and classifies proper nouns into predefined categories such as person, location, organization, time and date. This paper focuses on NER approaches and discusses the work done so far on identifying named entities in various languages. The authors conducted a comparative study of named entity recognition and found that the CRF approach has proven best for identifying named entities in Indian languages.
Pitambar Behera, M.A., B.Ed., M.Phil., Ph.D.
Activity Recognition is an integral component of ubiquitous computing. Recognizing an activity is a challenging task since activities can be concurrent, interleaved or ambiguous and can consist of multiple actors (which would require parallel activity recognition). This paper investigates how the discriminative nature of Conditional Random Fields (CRF) can be exploited to enhance the accuracy of recognizing activities when compared to that achieved using generative models. It aims to apply CRF to recognize complex activities, analyze the model trained by CRF and evaluate the performance of CRF against existing models using Stochastic Gradient Descent (which is suitable for online learning).
In this paper, we survey various techniques that can be used to perform change detection on a pair of images taken at different times. Each of these techniques analyzes multitemporal images and identifies modifications, if any. The advantages and disadvantages of each technique are identified and scrutinized to evaluate its performance. A comparative analysis of the techniques is performed to determine the most suitable technique for different scenarios, such as video surveillance and infrastructure monitoring.
This research aims to classify cheating activity during exams from video observation. The method uses a Conditional Random Field (CRF) for classifying and detecting several classes of cheating activities. The method used to detect the locations of the body joints is a Multimodal Decomposable Model (MODEC) with superpixel segmentation. The joints used are the head, shoulders, elbows, and wrists. The superpixel method is Simple Linear Iterative Clustering (SLIC). A comparison between MODEC and MODEC + SLIC as feature detectors for the CRF showed that MODEC + SLIC is capable of providing better activity classification. In our experiments, cheating activities can on average be detected up to 83.9% of the time. Moving beyond detecting only the class of motion segments, we also devised a point-in-time event detection system, likewise using CRF. The times of occurrence of three consecutive cheating activities are determined from a sequence of video frames.
Rainfall infiltration in an unsaturated soil slope induces loss of suction (and even positive pore-water pressures), which can eventually lead to failure. This paper investigates the probability and the size of failure of an unsaturated slope with spatially variable void ratio, subjected to a constant-intensity rainfall. The random finite element method is employed in conjunction with a Monte Carlo simulation to stochastically evaluate the factor of safety and the size of the sliding mass. The results indicate that the mean value and the variability of these two quantities depend on both the correlation length and the coefficient of variation of the void ratio field. This dependency is more prominent during the transient regime than at steady states. Notably, the factor of safety in some cases can be low while the corresponding sliding mass is relatively small, whereas in other instances the factor of safety might remain large though the associated sliding mass is very sizeable. The correlation between the factor of safety and the size of the sliding mass shifts from positive to negative as the rainfall progresses. A simple quadrant plot is suggested to assess the risk associated with slope failure, taking into account both the factor of safety and the size of failure rather than the factor of safety alone, as is usually the case. The study also demonstrates an application of a numerical approach to assess stability of geostructures composed of complex multiphase materials such as unsaturated soils or frozen soils.
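A hedged sketch of Monte Carlo estimation of a probability of failure, with the factor of safety drawn from an assumed lognormal distribution; the distribution choice and its parameters are illustrative, not the paper's random finite element model:

```python
import math
import random

def prob_failure(mean_fs, cov, n=50_000, seed=0):
    """Monte Carlo estimate of P(FoS < 1), with the factor of safety
    sampled from a lognormal of given mean and coefficient of variation
    (the lognormal assumption is illustrative)."""
    rng = random.Random(seed)
    sigma_ln = math.sqrt(math.log(1 + cov ** 2))      # lognormal parameters
    mu_ln = math.log(mean_fs) - 0.5 * sigma_ln ** 2   # from mean and COV
    fails = sum(1 for _ in range(n)
                if rng.lognormvariate(mu_ln, sigma_ln) < 1.0)
    return fails / n

pf = prob_failure(mean_fs=1.3, cov=0.2)
print(round(pf, 3))   # a mean FoS of 1.3 with 20 % COV still fails occasionally
```

In the paper's setting, each Monte Carlo realization is a full random-field finite element analysis rather than a single draw, but the failure-counting logic is the same.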
Intrusion detection systems are now an essential component of the overall network. With rapid advancement in network technologies, including higher bandwidths and the ease of connectivity of wireless and handheld devices, the main focus of intrusion detection has shifted from simple signature-matching approaches to detecting attacks based on analyzing contextual information that may be specific to individual networks and applications. As a result, anomaly and hybrid intrusion detection approaches have gained significance. Denial of Service (DoS), Probe, User to Root (U2R) and Remote to Local (R2L) attacks are some of the common attacks that affect network resources. Intrusion detection faces a number of challenges: an intrusion detection system must reliably detect malicious activities in a network and cope with large amounts of network traffic. In this paper, we address these two issues of accuracy and efficiency using Conditional Random Fields and a Layered Approach. Finally, we demonstrate that high attack detection accuracy can be achieved by using a memetic algorithm for feature selection with Layered Conditional Random Fields.
Geographic Information Retrieval (GIR) systems rely on the identification and disambiguation of place names in documents to determine the region about which they are relevant. The place names are mapped into geographic concepts and used to assign an encompassing concept (a scope) to each document. However, sometimes a single scope is too restrictive and insufficient for capturing the geographic semantics of a document. We propose as an alternative to abstract the geographic semantics of a document as a geographic signature, which is a list of maximally disambiguated geographic references found in a document. A signature can be used in multiple GIR applications, such as in building a geographic index for a document collection. We perform the disambiguation of the possible geographic meanings using semantic similarity measures.
Data-driven Spoken Language Understanding (SLU) systems need semantically annotated data, which is expensive, time-consuming and prone to human error. Active learning has been successfully applied to automatic speech recognition and utterance classification. In general, corpus annotation for SLU involves tasks such as sentence segmentation, chunking or frame labeling, and predicate-argument annotation. In such cases, human annotations are subject to errors that increase with annotation complexity. We investigate two alternative noise-robust active learning strategies that are either data-intensive or supervision-intensive. The strategies detect likely erroneous examples and significantly improve SLU performance for a given labeling cost. We apply uncertainty-based active learning with conditional random fields to the concept segmentation task for SLU. We perform annotation experiments on two databases, namely ATIS (English) and Media (French). We show that our noise-robust algo...
In this paper, we present the results of user requirement elicitation for a search system for grey literature in archaeology, specifically Dutch excavation reports. This search system uses Named Entity Recognition and Information Retrieval techniques to create an effective and effortless search experience. Specifically, we used Conditional Random Fields to identify entities, with an average accuracy of 56%. This is a baseline result, and we identified many possibilities for improvement. These entities were indexed in ElasticSearch and a user interface was developed on top of the index. This proof of concept was used in user requirement elicitation and evaluation with a group of end users. Feedback from this group indicated that there is a dire need for such a system, and that the first results are promising.
In this paper we apply the Conditional Random Fields approach to modeling human navigational behavior based on mouse movements in order to recognize web user tasks. Inferring the activity of web users is an important topic in Human-Computer Interaction, and many studies have been performed to understand how users interact with web interfaces when performing a given activity. The experimental evaluation and analysis of the results of the model presented in this paper demonstrate its efficiency in human task recognition.
Spatial variability of soil materials has long been recognized as an important factor influencing the reliability of geo-structures. This study stochastically investigates the influence of spatial variability of shear strength on the stability of heterogeneous slopes, focusing on the auto-correlation function, auto-correlation distance and cross-correlation between soil parameters. The finite element method is merged with random field theory to probabilistically evaluate the factor of safety and probability of failure via Monte Carlo simulations. The simulation procedure is explained in detail, with suggestions on improving the efficiency of the Monte Carlo process. A simple procedure to create cross-correlation between random variables, which allows direct comparison of the influence of each strength variable, is discussed. The results show that the auto-correlation distance and cross-correlation can significantly influence slope stability, while the choice of auto-correlation function has only a minor effect. An equation relating the probability of failure to the auto-correlation distance is suggested in light of the analyses performed in this work and other results from the literature.
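The paper mentions a simple procedure for creating cross-correlation between random variables. One standard way, sketched here under assumed standard-normal marginals (not necessarily the paper's procedure), is to multiply independent samples by a Cholesky factor of the target correlation matrix:

```python
import numpy as np

def correlated_samples(n, rho, seed=0):
    """Generate cross-correlated standard-normal samples for two soil
    strength parameters (e.g. cohesion and friction angle) by applying
    a Cholesky factor of the target correlation matrix."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    Z = rng.standard_normal((2, n))     # independent standard normals
    return L @ Z                        # rows are the correlated variables

# Negative cross-correlation between cohesion and friction angle is a
# commonly assumed case in the slope stability literature.
X = correlated_samples(100_000, rho=-0.5)
print(round(float(np.corrcoef(X)[0, 1]), 2))   # ≈ -0.5
```

The same factorization extends to any number of variables and is typically followed by a marginal transformation (e.g. to lognormal) before feeding the samples into the finite element model.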
Machine Translation and Word Sense Disambiguation are among the most popular applications of Natural Language Processing: Machine Translation offers a cheap way to understand content in another language during conversation, while Word Sense Disambiguation helps to obtain the correct meaning of a particular word in the context in which it is used. Our system uses a hybrid approach to disambiguate words and thereby improve machine translation results. A Conditional Random Field algorithm combined with a decision list using direct mapping is a simple method that yields good disambiguation results. In our system, the Conditional Random Field divides the data into categories and calculates the frequency of words with respect to each category; the sentence is assigned the sense of the category with the maximum frequency. The accuracy of our system on correct sentences is 81.2%, based on the tested sentences only.
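The frequency-based category scoring described above can be sketched as follows; the frequency tables, words and categories are invented for illustration and do not come from the paper's data:

```python
def disambiguate(sentence_tokens, category_freq):
    """Pick the sense category whose vocabulary overlaps the sentence
    most, weighting each overlapping word by its per-category frequency
    (ties resolve to the first category encountered)."""
    scores = {
        cat: sum(freqs.get(tok, 0) for tok in sentence_tokens)
        for cat, freqs in category_freq.items()
    }
    return max(scores, key=scores.get)

# Hypothetical frequency tables for two senses of the word "bank"
category_freq = {
    "finance": {"money": 9, "loan": 7, "account": 6},
    "river":   {"water": 8, "shore": 5, "fish": 4},
}
print(disambiguate(["deposit", "money", "account"], category_freq))  # → finance
```

A CRF additionally conditions on the surrounding label sequence, but the category-frequency signal sketched here is the core evidence such a system aggregates.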