ABSTRACT The present article is concerned with the problem of automatic database population via information extraction (IE) from web pages obtained from heterogeneous sources, such as those retrieved by a domain crawler. Specifically, we address the task of filling single multi-field templates from individual documents, a common scenario that involves free-format documents with the same communicative goal, such as job adverts, CVs, or meeting/seminar announcements. We discuss challenges that arise in this scenario and propose solutions to them at different levels of the processing of web page content. Our main focus is on the issue of information extraction, which we address with a two-step machine learning approach that first aims to determine segments of a page that are likely to contain relevant facts and then delimits specific natural language expressions with which to fill template fields. We also present a range of techniques for the enrichment of web pages with semantic annotations, such as recognition of named entities and domain terminology, and coreference resolution, and examine their effect on the information extraction method. We evaluate the developed IE system on the task of automatically populating a database with information on language resources available on the web.
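The two-step idea in this abstract can be illustrated with a minimal sketch: a first pass keeps only page segments likely to contain relevant facts, and a second pass delimits the expressions that fill template fields. The cue words, field names, and regex patterns below are hypothetical stand-ins for the trained classifiers described in the paper.

```python
import re

# Step 1 (segment filter) and step 2 (field filler) of a hypothetical
# template-filling pipeline; the cue set and patterns are illustrative.

RELEVANCE_CUES = {"corpus", "language", "license", "size"}

def segment_is_relevant(segment: str, threshold: int = 1) -> bool:
    """Step 1: keep segments containing enough domain cue words."""
    tokens = set(re.findall(r"[a-z]+", segment.lower()))
    return len(tokens & RELEVANCE_CUES) >= threshold

FIELD_PATTERNS = {
    "language": re.compile(r"\b(English|German|French)\b"),
    "size": re.compile(r"\b(\d[\d,]*)\s*(?:tokens|words)\b"),
}

def fill_template(page_segments):
    """Step 2: extract field fillers only from relevant segments."""
    template = {}
    for seg in page_segments:
        if not segment_is_relevant(seg):
            continue
        for field, pat in FIELD_PATTERNS.items():
            m = pat.search(seg)
            if m and field not in template:
                template[field] = m.group(1)
    return template

segments = [
    "Contact us at info@example.org for details.",
    "The corpus contains 1,200,000 tokens of English newswire text.",
]
print(fill_template(segments))  # only the second segment contributes
```

The point of the first stage is purely to reduce noise: the field patterns never see segments that fail the relevance check, mirroring how the paper's span extractor operates only on segments the first classifier accepts.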
Social media proves to be a major source of timely information during mass emergencies. A considerable amount of recent research has aimed at developing methods to detect social media messages that report such disasters at early stages. In contrast to previous work, the goal of this paper is to identify messages relating to a very broad range of possible emergencies, including technological and natural disasters. The challenge of this task is data heterogeneity: messages relating to different types of disasters tend to have different feature distributions. This makes the classification problem harder to learn; a classifier trained on certain emergency types tends to perform poorly when tested on other types of disasters. To counteract the negative effects of data heterogeneity, we present two novel methods. The first is an ensemble method, which combines multiple classifiers, each specific to one emergency type, to classify previously unseen texts, and the second is a semi-supervised...
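The ensemble idea sketched in this abstract, one classifier per known emergency type with their scores combined on unseen texts, can be illustrated as follows. The keyword-based scorers, emergency types, and threshold are hypothetical stand-ins for the trained per-type models.

```python
# Minimal sketch of a per-type ensemble: each emergency type has its own
# scorer, and the averaged score decides whether a text is flagged.
# Keyword scorers are illustrative substitutes for trained classifiers.

TYPE_KEYWORDS = {
    "earthquake": {"quake", "tremor", "magnitude"},
    "flood": {"flood", "river", "evacuate"},
    "industrial": {"explosion", "plant", "chemical"},
}

def type_score(text: str, keywords: set) -> float:
    """Hypothetical per-type classifier: fraction of cue words present."""
    tokens = set(text.lower().split())
    return len(tokens & keywords) / len(keywords)

def ensemble_predict(text: str, threshold: float = 0.1) -> bool:
    """Average the per-type scores and flag the text if the combined
    score clears the threshold."""
    scores = [type_score(text, kw) for kw in TYPE_KEYWORDS.values()]
    return sum(scores) / len(scores) >= threshold

print(ensemble_predict("magnitude 6 tremor felt downtown"))  # flagged
print(ensemble_predict("great coffee at the new cafe"))      # not flagged
```

Averaging over type-specific models is one simple way to soften the heterogeneity problem the abstract describes: a message only needs to match one type's model strongly to be detected, even if it looks nothing like the other types.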
Proceedings of the International AAAI Conference on Web and Social Media
The paper addresses the problem of forecasting consumer expenditure from social media data. Previous research on the topic exploited the intuition that search engine traffic reflects purchase intentions and constructed predictive models of consumer behaviour from search query volumes. In contrast, we derive predictors from explicit expressions of purchase intentions found in social media posts. Two types of predictors created from these expressions are explored: those based on word embeddings and those based on topical word clusters. We introduce a new clustering method, which takes into account the temporal co-occurrence of words, in addition to their semantic similarity, in order to create predictors relevant to the forecasting problem. The predictors are evaluated against baselines that use only macroeconomic variables, and against models trained on search traffic data. Conducting experiments with three different regression methods on Facebook and Twitter data, we find that both word...
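The combined word similarity underlying the clustering method in this abstract can be sketched as a mix of semantic similarity (cosine over embeddings) and temporal co-occurrence (correlation of the words' frequency series over time). The embeddings, time series, and mixing weight alpha below are toy values, not the paper's.

```python
import math

# Sketch of a similarity that mixes semantics with temporal behaviour.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def combined_similarity(emb_a, emb_b, ts_a, ts_b, alpha=0.5):
    """Weighted mix of semantic similarity and temporal co-occurrence."""
    return alpha * cosine(emb_a, emb_b) + (1 - alpha) * pearson(ts_a, ts_b)

# Toy data: "buy"/"purchase" are near-synonyms whose daily frequencies
# move together; "rain" is unrelated on both counts.
emb = {"buy": [1.0, 0.1], "purchase": [0.9, 0.2], "rain": [0.0, 1.0]}
ts = {"buy": [3, 5, 9, 12], "purchase": [2, 4, 8, 11], "rain": [10, 2, 7, 1]}

print(combined_similarity(emb["buy"], emb["purchase"],
                          ts["buy"], ts["purchase"]))  # high
print(combined_similarity(emb["buy"], emb["rain"],
                          ts["buy"], ts["rain"]))      # low
```

A clustering built on such a score groups words that not only mean similar things but also spike at the same times, which is what makes the resulting clusters useful as forecasting predictors.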
Proceedings of the 2nd International Conference on Advanced Research Methods and Analytics (CARMA 2018), 2018
Consumer expenditure constitutes the largest component of Gross Domestic Product in developed countries, and forecasts of consumer spending are therefore an important tool that governments and central banks use in their policy-making. In this paper we examine methods to forecast consumer spending from user-generated content, such as search engine queries and social media data, which hold the promise of producing forecasts much more efficiently than traditional surveys. Specifically, the aim of the paper is to study the relative utility of evidence about purchase intentions found in Google Trends versus that found in Twitter posts for the problem of forecasting consumer expenditure. Our main findings are that, firstly, the Google Trends indicators and the indicators extracted from Twitter are both beneficial for the forecasts: adding them as exogenous variables into the regression model produces improvements over the pure AR baseline, consistently across all forecast horizons. Secondly, we...
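The modelling setup this abstract describes, an autoregressive baseline extended with an exogenous purchase-intention indicator, can be sketched in a few lines. The `fit_arx` model form, the synthetic series, and the coefficients are all illustrative assumptions, not the paper's specification.

```python
# Toy ARX sketch: spending y[t] is regressed on its own lag plus an
# exogenous indicator x[t] (e.g. a Google Trends or Twitter volume).
# Two coefficients are fitted by OLS via the 2x2 normal equations.

def fit_arx(y, x):
    """Fit y[t] = a*y[t-1] + b*x[t] by least squares (no intercept)."""
    y_lag, y_cur, x_cur = y[:-1], y[1:], x[1:]
    s11 = sum(v * v for v in y_lag)
    s12 = sum(u * v for u, v in zip(y_lag, x_cur))
    s22 = sum(v * v for v in x_cur)
    r1 = sum(u * v for u, v in zip(y_lag, y_cur))
    r2 = sum(u * v for u, v in zip(x_cur, y_cur))
    det = s11 * s22 - s12 * s12        # Cramer's rule for the 2x2 system
    a = (r1 * s22 - r2 * s12) / det
    b = (s11 * r2 - s12 * r1) / det
    return a, b

# Noiseless synthetic series generated by y[t] = 0.5*y[t-1] + 2*x[t],
# so the fit should recover the generating coefficients.
x = [1.0, 2.0, 1.5, 3.0, 2.5, 1.0]
y = [10.0]
for t in range(1, len(x)):
    y.append(0.5 * y[-1] + 2.0 * x[t])

a, b = fit_arx(y, x)
print(a, b)
```

Comparing such a model against the pure AR version (drop the `b*x[t]` term) is exactly the kind of exogenous-variable test the abstract reports: any consistent error reduction is attributable to the added indicator.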
Probabilistic syntactic parsing has made rapid progress, but is reaching a performance ceiling. More semantic resources need to be included. We exploit a number of semantic resources to improve the parsing accuracy of a dependency parser. We compare semantic lexica on this task, then extend the back-off chain by penalising underspecified decisions. Further, a simple distributional semantics approach is tested. Selectional restrictions are employed to boost interpretations that are semantically plausible. We also show that self-training can improve parsing even without a re-ranker, as we can rely on a sufficiently good estimation of parsing accuracy. Parsing large amounts of data and using it in self-training allows us to learn world knowledge from the distribution of syntactic relations. We show that the performance of the parser improves considerably due to our extensions.
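The self-training loop mentioned in this abstract has a generic shape: train on labeled data, apply the model to unlabeled data, keep only outputs with a sufficiently high confidence estimate, and retrain on the enlarged set. The sketch below shows that loop with a trivial nearest-centroid classifier standing in for the dependency parser; all names and data are hypothetical.

```python
# Generic self-training loop; the nearest-centroid classifier on 1-D
# points is an illustrative stand-in for a parser with a confidence
# estimate attached to its output.

def train(points):
    """Compute one centroid per label from (value, label) pairs."""
    groups = {}
    for x, label in points:
        groups.setdefault(label, []).append(x)
    return {lab: sum(xs) / len(xs) for lab, xs in groups.items()}

def predict(cents, x):
    """Return (label, confidence); confidence falls with distance."""
    label = min(cents, key=lambda lab: abs(cents[lab] - x))
    return label, 1.0 / (1.0 + abs(cents[label] - x))

def self_train(labeled, unlabeled, min_conf=0.5):
    cents = train(labeled)
    augmented = list(labeled)
    for x in unlabeled:
        label, conf = predict(cents, x)
        if conf >= min_conf:          # keep only confident outputs
            augmented.append((x, label))
    return train(augmented)           # retrain on the enlarged set

labeled = [(0.0, "low"), (1.0, "low"), (9.0, "high"), (10.0, "high")]
cents = self_train(labeled, [0.4, 9.6, 5.0])
print(cents)  # 5.0 is ambiguous, so it is filtered out
```

The filtering step is what the abstract's "sufficiently good estimation of parsing accuracy" enables: without a usable confidence signal, self-training would pollute the training set with its own errors.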
ABSTRACT Public email announcements present a number of unique challenges for an Information Extraction (IE) system, such as the presence of both free and semi-structured text, inconsistent document layout, and widely varying formats of template fillers. In this paper we describe a study of the parametrisation of an IE method to determine the settings that best suit the specifics of the task at hand.
Papers by Viktor Pekar