ABSTRACT The present article is concerned with the problem of automatic database population via information extraction (IE) from web pages obtained from heterogeneous sources, such as those retrieved by a domain crawler. Specifically, we address the task of filling single multi-field templates from individual documents, a common scenario that involves free-format documents with the same communicative goal, such as job adverts, CVs, or meeting/seminar announcements. We discuss challenges that arise in this scenario and propose solutions to them at different levels of the processing of web page content. Our main focus is on the issue of information extraction, which we address with a two-step machine learning approach that first aims to determine segments of a page that are likely to contain relevant facts and then delimits specific natural language expressions with which to fill template fields. We also present a range of techniques for the enrichment of web pages with semantic annotations, such as recognition of named entities and domain terminology, and coreference resolution, and examine their effect on the information extraction method. We evaluate the developed IE system on the task of automatically populating a database with information on language resources available on the web.
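The two-step idea in this abstract can be illustrated with a minimal sketch: a first pass keeps only page segments likely to contain relevant facts, and a second pass delimits the expressions that fill template fields. The cue words, field names, and regex patterns below are hypothetical stand-ins for the trained classifiers described in the paper.

```python
import re

# Step 1 (segment filter) and step 2 (field filler) of a hypothetical
# template-filling pipeline; the cue set and patterns are illustrative.

RELEVANCE_CUES = {"corpus", "language", "license", "size"}

def segment_is_relevant(segment: str, threshold: int = 1) -> bool:
    """Step 1: keep segments containing enough domain cue words."""
    tokens = set(re.findall(r"[a-z]+", segment.lower()))
    return len(tokens & RELEVANCE_CUES) >= threshold

FIELD_PATTERNS = {
    "language": re.compile(r"\b(English|German|French)\b"),
    "size": re.compile(r"\b(\d[\d,]*)\s*(?:tokens|words)\b"),
}

def fill_template(page_segments):
    """Step 2: extract field fillers only from relevant segments."""
    template = {}
    for seg in page_segments:
        if not segment_is_relevant(seg):
            continue
        for field, pat in FIELD_PATTERNS.items():
            m = pat.search(seg)
            if m and field not in template:
                template[field] = m.group(1)
    return template

segments = [
    "Contact us at info@example.org for details.",
    "The corpus contains 1,200,000 tokens of English newswire text.",
]
print(fill_template(segments))  # only the second segment contributes
```

The point of the first stage is purely to reduce noise: the field patterns never see segments that fail the relevance check, mirroring how the paper's span extractor operates only on segments the first classifier accepts.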
Social media proves to be a major source of timely information during mass emergencies. A considerable amount of recent research has aimed at developing methods to detect social media messages that report such disasters at early stages. In contrast to previous work, the goal of this paper is to identify messages relating to a very broad range of possible emergencies, including technological and natural disasters. The challenge of this task is data heterogeneity: messages relating to different types of disasters tend to have different feature distributions. This makes the classification problem harder to learn; a classifier trained on certain emergency types tends to perform poorly when tested on other types of disasters. To counteract the negative effects of data heterogeneity, we present two novel methods. The first is an ensemble method, which combines multiple classifiers, each specific to one emergency type, to classify previously unseen texts, and the second is a semi-supervised...
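The ensemble idea sketched in this abstract, one classifier per known emergency type with their scores combined on unseen texts, can be illustrated as follows. The keyword-based scorers, emergency types, and threshold are hypothetical stand-ins for the trained per-type models.

```python
# Minimal sketch of a per-type ensemble: each emergency type has its own
# scorer, and the averaged score decides whether a text is flagged.
# Keyword scorers are illustrative substitutes for trained classifiers.

TYPE_KEYWORDS = {
    "earthquake": {"quake", "tremor", "magnitude"},
    "flood": {"flood", "river", "evacuate"},
    "industrial": {"explosion", "plant", "chemical"},
}

def type_score(text: str, keywords: set) -> float:
    """Hypothetical per-type classifier: fraction of cue words present."""
    tokens = set(text.lower().split())
    return len(tokens & keywords) / len(keywords)

def ensemble_predict(text: str, threshold: float = 0.1) -> bool:
    """Average the per-type scores and flag the text if the combined
    score clears the threshold."""
    scores = [type_score(text, kw) for kw in TYPE_KEYWORDS.values()]
    return sum(scores) / len(scores) >= threshold

print(ensemble_predict("magnitude 6 tremor felt downtown"))  # flagged
print(ensemble_predict("great coffee at the new cafe"))      # not flagged
```

Averaging over type-specific models is one simple way to soften the heterogeneity problem the abstract describes: a message only needs to match one type's model strongly to be detected, even if it looks nothing like the other types.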
Proceedings of the International AAAI Conference on Web and Social Media
The paper addresses the problem of forecasting consumer expenditure from social media data. Previous research on the topic exploited the intuition that search engine traffic reflects purchase intentions and constructed predictive models of consumer behaviour from search query volumes. In contrast, we derive predictors from explicit expressions of purchase intentions found in social media posts. Two types of predictors created from these expressions are explored: those based on word embeddings and those based on topical word clusters. We introduce a new clustering method, which takes into account the temporal co-occurrence of words, in addition to their semantic similarity, in order to create predictors relevant to the forecasting problem. The predictors are evaluated against baselines that use only macroeconomic variables, and against models trained on search traffic data. Conducting experiments with three different regression methods on Facebook and Twitter data, we find that both word...
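The combined word similarity underlying the clustering method in this abstract can be sketched as a mix of semantic similarity (cosine over embeddings) and temporal co-occurrence (correlation of the words' frequency series over time). The embeddings, time series, and mixing weight alpha below are toy values, not the paper's.

```python
import math

# Sketch of a similarity that mixes semantics with temporal behaviour.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def combined_similarity(emb_a, emb_b, ts_a, ts_b, alpha=0.5):
    """Weighted mix of semantic similarity and temporal co-occurrence."""
    return alpha * cosine(emb_a, emb_b) + (1 - alpha) * pearson(ts_a, ts_b)

# Toy data: "buy"/"purchase" are near-synonyms whose daily frequencies
# move together; "rain" is unrelated on both counts.
emb = {"buy": [1.0, 0.1], "purchase": [0.9, 0.2], "rain": [0.0, 1.0]}
ts = {"buy": [3, 5, 9, 12], "purchase": [2, 4, 8, 11], "rain": [10, 2, 7, 1]}

print(combined_similarity(emb["buy"], emb["purchase"],
                          ts["buy"], ts["purchase"]))  # high
print(combined_similarity(emb["buy"], emb["rain"],
                          ts["buy"], ts["rain"]))      # low
```

A clustering built on such a score groups words that not only mean similar things but also spike at the same times, which is what makes the resulting clusters useful as forecasting predictors.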
Proceedings of the 2nd International Conference on Advanced Research Methods and Analytics (CARMA 2018), 2018
Consumer expenditure constitutes the largest component of Gross Domestic Product in developed countries, and forecasts of consumer spending are therefore an important tool that governments and central banks use in their policy-making. In this paper we examine methods to forecast consumer spending from user-generated content, such as search engine queries and social media data, which hold the promise of producing forecasts much more efficiently than traditional surveys. Specifically, the aim of the paper is to study the relative utility of evidence about purchase intentions found in Google Trends versus that found in Twitter posts for the problem of forecasting consumer expenditure. Our main findings are that, firstly, the Google Trends indicators and the indicators extracted from Twitter are both beneficial for the forecasts: adding them as exogenous variables into the regression model produces improvements over the pure AR baseline, consistently across all forecast horizons. Secondly, we...
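The modelling setup this abstract describes, an autoregressive baseline extended with an exogenous purchase-intention indicator, can be sketched in a few lines. The `fit_arx` model form, the synthetic series, and the coefficients are all illustrative assumptions, not the paper's specification.

```python
# Toy ARX sketch: spending y[t] is regressed on its own lag plus an
# exogenous indicator x[t] (e.g. a Google Trends or Twitter volume).
# Two coefficients are fitted by OLS via the 2x2 normal equations.

def fit_arx(y, x):
    """Fit y[t] = a*y[t-1] + b*x[t] by least squares (no intercept)."""
    y_lag, y_cur, x_cur = y[:-1], y[1:], x[1:]
    s11 = sum(v * v for v in y_lag)
    s12 = sum(u * v for u, v in zip(y_lag, x_cur))
    s22 = sum(v * v for v in x_cur)
    r1 = sum(u * v for u, v in zip(y_lag, y_cur))
    r2 = sum(u * v for u, v in zip(x_cur, y_cur))
    det = s11 * s22 - s12 * s12        # Cramer's rule for the 2x2 system
    a = (r1 * s22 - r2 * s12) / det
    b = (s11 * r2 - s12 * r1) / det
    return a, b

# Noiseless synthetic series generated by y[t] = 0.5*y[t-1] + 2*x[t],
# so the fit should recover the generating coefficients.
x = [1.0, 2.0, 1.5, 3.0, 2.5, 1.0]
y = [10.0]
for t in range(1, len(x)):
    y.append(0.5 * y[-1] + 2.0 * x[t])

a, b = fit_arx(y, x)
print(a, b)
```

Comparing such a model against the pure AR version (drop the `b*x[t]` term) is exactly the kind of exogenous-variable test the abstract reports: any consistent error reduction is attributable to the added indicator.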
Probabilistic syntactic parsing has made rapid progress, but is reaching a performance ceiling. More semantic resources need to be included. We exploit a number of semantic resources to improve the parsing accuracy of a dependency parser. We compare semantic lexica on this task, then extend the back-off chain by penalising underspecified decisions. Further, a simple distributional semantics approach is tested. Selectional restrictions are employed to boost interpretations that are semantically plausible. We also show that self-training can improve parsing even without a re-ranker, as we can rely on a sufficiently good estimation of parsing accuracy. Parsing large amounts of data and using it in self-training allows us to learn world knowledge from the distribution of syntactic relations. We show that the performance of the parser improves considerably due to our extensions.
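The self-training loop mentioned in this abstract has a generic shape: train on labeled data, apply the model to unlabeled data, keep only outputs with a sufficiently high confidence estimate, and retrain on the enlarged set. The sketch below shows that loop with a trivial nearest-centroid classifier standing in for the dependency parser; all names and data are hypothetical.

```python
# Generic self-training loop; the nearest-centroid classifier on 1-D
# points is an illustrative stand-in for a parser with a confidence
# estimate attached to its output.

def train(points):
    """Compute one centroid per label from (value, label) pairs."""
    groups = {}
    for x, label in points:
        groups.setdefault(label, []).append(x)
    return {lab: sum(xs) / len(xs) for lab, xs in groups.items()}

def predict(cents, x):
    """Return (label, confidence); confidence falls with distance."""
    label = min(cents, key=lambda lab: abs(cents[lab] - x))
    return label, 1.0 / (1.0 + abs(cents[label] - x))

def self_train(labeled, unlabeled, min_conf=0.5):
    cents = train(labeled)
    augmented = list(labeled)
    for x in unlabeled:
        label, conf = predict(cents, x)
        if conf >= min_conf:          # keep only confident outputs
            augmented.append((x, label))
    return train(augmented)           # retrain on the enlarged set

labeled = [(0.0, "low"), (1.0, "low"), (9.0, "high"), (10.0, "high")]
cents = self_train(labeled, [0.4, 9.6, 5.0])
print(cents)  # 5.0 is ambiguous, so it is filtered out
```

The filtering step is what the abstract's "sufficiently good estimation of parsing accuracy" enables: without a usable confidence signal, self-training would pollute the training set with its own errors.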
ABSTRACT Public email announcements present a number of unique challenges for an Information Extraction (IE) system, such as the presence of both free and semi-structured text, inconsistent document layout, and widely varying formats of template fillers. In this paper we describe a study of the parametrisation of an IE method to determine the settings that best suit the specifics of the task at hand.
Papers by Viktor Pekar