-
Public Discourse about COVID-19 Vaccinations: A Computational Analysis of the Relationship between Public Concerns and Policies
Authors:
Katarina Boland,
Christopher Starke,
Felix Bensmann,
Frank Marcinkowski,
Stefan Dietze
Abstract:
Societies worldwide have witnessed growing rifts separating advocates and opponents of vaccinations and other COVID-19 countermeasures. With the rollout of vaccination campaigns, German-speaking regions exhibited much lower vaccination uptake than other European regions. While Austria, Germany, and Switzerland (the DACH region) caught up over time, it remains unclear which factors contributed to these changes. Scrutinizing public discourses can help shed light on the intricacies of vaccine hesitancy and inform policy-makers tasked with making far-reaching decisions: policies need to effectively curb the spread of the virus while respecting fundamental civic liberties and minimizing undesired consequences. This study draws on Twitter data to analyze the topics prevalent in the public discourse. It further maps the topics to different phases of the pandemic and policy changes to identify potential drivers of change in public attention. We use a hybrid pipeline to detect and analyze vaccination-related tweets using topic modeling, sentiment analysis, and a minimum of social scientific domain knowledge to analyze the discourse about vaccinations in the light of the COVID-19 pandemic in the DACH region. We show that skepticism regarding the severity of the COVID-19 virus and towards efficacy and safety of vaccines were among the prevalent topics in the discourse on Twitter but that the most attention was given to debating the theme of freedom and civic liberties. Especially during later phases of the pandemic, when implemented policies restricted the freedom of unvaccinated citizens, increased vaccination uptake could be observed. At the same time, increasingly negative and polarized sentiments emerge in the discourse. This suggests that these policies might have effectively attenuated vaccination hesitancy but were not successfully dispersing citizens' doubts and concerns.
Submitted 7 May, 2024;
originally announced July 2024.
-
Toward FAIR Semantic Publishing of Research Dataset Metadata in the Open Research Knowledge Graph
Authors:
Raia Abu Ahmad,
Jennifer D'Souza,
Matthäus Zloch,
Wolfgang Otto,
Georg Rehm,
Allard Oelen,
Stefan Dietze,
Sören Auer
Abstract:
Search engines these days can serve datasets as search results. Datasets get picked up by search technologies based on structured descriptions on their official web pages, informed by metadata ontologies such as the Dataset content type of schema.org. Despite this promotion of the content type dataset as a first-class citizen of search results, a vast proportion of datasets, particularly research datasets, still need to be made discoverable and, therefore, largely remain unused. This is due to the sheer volume of datasets released every day and the inability of metadata to reflect a dataset's content and context accurately. This work seeks to improve this situation for a specific class of datasets, namely research datasets, which are the result of research endeavors and are accompanied by a scholarly publication. We propose the ORKG-Dataset content type, a specialized branch of the Open Research Knowledge Graph (ORKG) platform, which provides descriptive information and a semantic model for research datasets, integrating them with their accompanying scholarly publications. This work aims to establish a standardized framework for recording and reporting research datasets within the ORKG-Dataset content type. This, in turn, increases research dataset transparency on the web for their improved discoverability and applied use. In this paper, we present a proposal -- the minimum FAIR, comparable, semantic description of research datasets in terms of salient properties of their supporting publication. We design a specific application of the ORKG-Dataset semantic model based on 40 diverse research datasets on scientific information extraction.
Submitted 12 April, 2024;
originally announced April 2024.
-
Enhancing Software-Related Information Extraction via Single-Choice Question Answering with Large Language Models
Authors:
Wolfgang Otto,
Sharmila Upadhyaya,
Stefan Dietze
Abstract:
This paper describes our participation in the Shared Task on Software Mentions Disambiguation (SOMD), with a focus on improving relation extraction in scholarly texts through generative Large Language Models (GLMs) using single-choice question-answering. The methodology prioritises the use of in-context learning capabilities of GLMs to extract software-related entities and their descriptive attributes, such as distributive information. Our approach uses Retrieval-Augmented Generation (RAG) techniques and GLMs for Named Entity Recognition (NER) and Attributive NER to identify relationships between extracted software entities, providing a structured solution for analysing software citations in academic literature. The paper provides a detailed description of our approach, demonstrating how using GLMs in a single-choice QA paradigm can greatly enhance IE methodologies. Our participation in the SOMD shared task highlights the importance of precise software citation practices and showcases our system's ability to overcome the challenges of disambiguating and extracting relationships between software mentions. This sets the groundwork for future research and development in this field.
Submitted 19 April, 2024; v1 submitted 8 April, 2024;
originally announced April 2024.
-
Dissecting Paraphrases: The Impact of Prompt Syntax and supplementary Information on Knowledge Retrieval from Pretrained Language Models
Authors:
Stephan Linzbach,
Dimitar Dimitrov,
Laura Kallmeyer,
Kilian Evang,
Hajira Jabeen,
Stefan Dietze
Abstract:
Pre-trained Language Models (PLMs) are known to contain various kinds of knowledge. One method to infer relational knowledge is through the use of cloze-style prompts, where a model is tasked to predict missing subjects or objects. Typically, designing these prompts is a tedious task because small differences in syntax or semantics can have a substantial impact on knowledge retrieval performance. Simultaneously, evaluating the impact of either prompt syntax or information is challenging due to their interdependence. We designed CONPARE-LAMA - a dedicated probe, consisting of 34 million distinct prompts that facilitate comparison across minimal paraphrases. These paraphrases follow a unified meta-template enabling the controlled variation of syntax and semantics across arbitrary relations. CONPARE-LAMA enables insights into the independent impact of either syntactical form or semantic information of paraphrases on the knowledge retrieval performance of PLMs. Extensive knowledge retrieval experiments using our probe reveal that prompts following clausal syntax have several desirable properties in comparison to appositive syntax: i) they are more useful when querying PLMs with a combination of supplementary information, ii) knowledge is more consistently recalled across different combinations of supplementary information, and iii) they decrease response uncertainty when retrieving known facts. In addition, range information can boost knowledge retrieval performance more than domain information, even though domain information is more reliably helpful across syntactic forms.
Submitted 2 April, 2024;
originally announced April 2024.
-
TACO -- Twitter Arguments from COnversations
Authors:
Marc Feger,
Stefan Dietze
Abstract:
Twitter has emerged as a global hub for engaging in online conversations and as a research corpus for various disciplines that have recognized the significance of its user-generated content. Argument mining is an important analytical task for processing and understanding online discourse. Specifically, it aims to identify the structural elements of arguments, denoted as information and inference. These elements, however, are not static and may require context within the conversation they are in, yet there is a lack of data and annotation frameworks addressing this dynamic aspect on Twitter. We contribute TACO, the first dataset of Twitter Arguments utilizing 1,814 tweets covering 200 entire conversations spanning six heterogeneous topics annotated with an agreement of 0.718 Krippendorff's alpha among six experts. Second, we provide our annotation framework, incorporating definitions from the Cambridge Dictionary, to define and identify argument components on Twitter. Our transformer-based classifier achieves an 85.06% macro F1 baseline score in detecting arguments. Moreover, our data reveals that Twitter users tend to engage in discussions involving informed inferences and information. TACO serves multiple purposes, such as training tweet classifiers to manage tweets based on inference and information elements, while also providing valuable insights into the conversational reply patterns of tweets.
Submitted 30 March, 2024;
originally announced April 2024.
-
nuScenes Knowledge Graph -- A comprehensive semantic representation of traffic scenes for trajectory prediction
Authors:
Leon Mlodzian,
Zhigang Sun,
Hendrik Berkemeyer,
Sebastian Monka,
Zixu Wang,
Stefan Dietze,
Lavdim Halilaj,
Juergen Luettin
Abstract:
Trajectory prediction in traffic scenes involves accurately forecasting the behaviour of surrounding vehicles. To achieve this objective it is crucial to consider contextual information, including the driving path of vehicles, road topology, lane dividers, and traffic rules. Although studies demonstrated the potential of leveraging heterogeneous context for improving trajectory prediction, state-of-the-art deep learning approaches still rely on a limited subset of this information. This is mainly due to the limited availability of comprehensive representations. This paper presents an approach that utilizes knowledge graphs to model the diverse entities and their semantic connections within traffic scenes. Further, we present nuScenes Knowledge Graph (nSKG), a knowledge graph for the nuScenes dataset, that models explicitly all scene participants and road elements, as well as their semantic and spatial relationships. To facilitate the usage of the nSKG via graph neural networks for trajectory prediction, we provide the data in a format, ready-to-use by the PyG library. All artefacts can be found here: https://github.com/boschresearch/nuScenes_Knowledge_Graph
Submitted 15 December, 2023;
originally announced December 2023.
-
GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets
Authors:
Wolfgang Otto,
Matthäus Zloch,
Lu Gan,
Saurav Karmakar,
Stefan Dietze
Abstract:
Named Entity Recognition (NER) models play a crucial role in various NLP tasks, including information extraction (IE) and text understanding. In academic writing, references to machine learning models and datasets are fundamental components of various computer science publications and necessitate accurate models for identification. Despite the advancements in NER, existing ground truth datasets do not treat fine-grained types like ML model and model architecture as separate entity types, and consequently, baseline models cannot recognize them as such. In this paper, we release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets. In order to provide a nuanced understanding of how ML models and datasets are mentioned and utilized, our dataset also contains annotations for informal mentions like "our BERT-based model" or "an image CNN". You can find the ground truth dataset and code to replicate model training at https://data.gesis.org/gsap/gsap-ner.
Submitted 16 November, 2023;
originally announced November 2023.
-
Large Language Models and Knowledge Graphs: Opportunities and Challenges
Authors:
Jeff Z. Pan,
Simon Razniewski,
Jan-Christoph Kalo,
Sneha Singhania,
Jiaoyan Chen,
Stefan Dietze,
Hajira Jabeen,
Janna Omeliyanenko,
Wen Zhang,
Matteo Lissandrini,
Russa Biswas,
Gerard de Melo,
Angela Bonifati,
Edlira Vakaj,
Mauro Dragoni,
Damien Graux
Abstract:
Large Language Models (LLMs) have taken Knowledge Representation -- and the world -- by storm. This inflection point marks a shift from explicit knowledge representation to a renewed focus on the hybrid representation of both explicit knowledge and parametric knowledge. In this position paper, we will discuss some of the common debate points within the community on LLMs (parametric knowledge) and Knowledge Graphs (explicit knowledge) and speculate on opportunities and visions that the renewed focus brings, as well as related research topics and challenges.
Submitted 11 August, 2023;
originally announced August 2023.
-
Which Factors are associated with Open Access Publishing? A Springer Nature Case Study
Authors:
Fakhri Momeni,
Stefan Dietze,
Philipp Mayr,
Kristin Biesenbender,
Isabella Peters
Abstract:
Open Access (OA) facilitates access to articles. However, authors or funders often must pay the publishing costs, which prevents authors who do not receive financial support from participating in OA publishing and from benefiting from the citation advantage of OA articles. OA may therefore exacerbate existing inequalities in the publication system rather than overcome them. To investigate this, we studied 522,411 articles published by Springer Nature. Employing correlation and regression analyses, we describe the relationship between authors affiliated with countries from different income levels, their choice of publishing model, and the citation impact of their papers. A machine learning classification method helped us to explore the importance of different features in predicting the publishing model. The results show that authors eligible for APC waivers publish more in gold-OA journals than others. In contrast, authors eligible for an APC discount have the lowest ratio of OA publications, leading to the assumption that this discount insufficiently motivates authors to publish in gold-OA journals. We found a strong correlation between the journal rank and the publishing model in gold-OA journals, whereas the OA option is mostly avoided in hybrid journals. The results also show that the countries' income level, seniority, and experience with OA publications are the most predictive factors for OA publishing in hybrid journals.
Submitted 25 April, 2023; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Investigating the contribution of author- and publication-specific features to scholars' h-index prediction
Authors:
Fakhri Momeni,
Philipp Mayr,
Stefan Dietze
Abstract:
Evaluation of researchers' output is vital for hiring committees and funding bodies, and it is usually measured via their scientific productivity, citations, or a combined metric such as the h-index. Assessing young researchers is more critical because it takes a while to accumulate citations and increase the h-index. Hence, predicting the h-index can help to discover the researchers' scientific impact. In addition, identifying the influential factors to predict the scientific impact is helpful for researchers seeking solutions to improve it. This study investigates the effect of author, paper and venue-specific features on the future h-index. For this purpose, we used machine learning methods to predict the h-index and feature analysis techniques to advance the understanding of feature impact. Utilizing the bibliometric data in Scopus, we defined and extracted two main groups of features. The first group, which we name 'prior impact-based features', relates to prior scientific impact and includes the number of publications, received citations, and h-index. The second group is 'non-impact-based features' and contains the features related to author, co-authorship, paper, and venue characteristics. We explored their importance in predicting the h-index for researchers in three different career phases. We also examined the temporal dimension of prediction performance for different feature categories to find out which features are more reliable for long- and short-term prediction. We referred to the gender of the authors to examine the role of this author characteristic in the prediction task. Our findings showed that gender has a very slight effect in predicting the h-index. We found that non-impact-based features are more robust predictors for younger scholars than for seniors in the short term. Also, prior impact-based features lose their predictive power more than other features in the long term.
Submitted 9 August, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Still Haven't Found What You're Looking For -- Detecting the Intent of Web Search Missions from User Interaction Features
Authors:
Ran Yu,
Limock,
Stefan Dietze
Abstract:
Web search is among the most frequent online activities. Whereas traditional information retrieval techniques focus on the information need behind a user query, previous work has shown that user behaviour and interaction can provide important signals for understanding the underlying intent of a search mission. An established taxonomy distinguishes between transactional, navigational and informational search missions, where in particular the latter involve a learning goal, i.e. the intent to acquire knowledge about a particular topic. We introduce a supervised approach for classifying online search missions into either of these categories by utilising a range of features obtained from the user interactions during an online search mission. Applying our model to a dataset of real-world query logs, we show that search missions can be categorised with an average F1 score of 63% and accuracy of 69%, while performance on informational and navigational missions is particularly promising (F1>75%). This suggests the potential to utilise such supervised classification during online search to better facilitate retrieval and ranking as well as to improve affiliated services, such as targeted online ads.
Submitted 4 July, 2022;
originally announced July 2022.
-
SciTweets -- A Dataset and Annotation Framework for Detecting Scientific Online Discourse
Authors:
Salim Hafid,
Sebastian Schellhammer,
Sandra Bringay,
Konstantin Todorov,
Stefan Dietze
Abstract:
Scientific topics, claims and resources are increasingly debated as part of online discourse, where prominent examples include discourse related to COVID-19 or climate change. This has led to both significant societal impact and increased interest in scientific online discourse from various disciplines. For instance, communication studies aim at a deeper understanding of biases, quality or spreading pattern of scientific information whereas computational methods have been proposed to extract, classify or verify scientific claims using NLP and IR techniques. However, research across disciplines currently suffers from both a lack of robust definitions of the various forms of science-relatedness as well as appropriate ground truth data for distinguishing them. In this work, we contribute (a) an annotation framework and corresponding definitions for different forms of scientific relatedness of online discourse in Tweets, (b) an expert-annotated dataset of 1261 tweets obtained through our labeling framework reaching an average Fleiss Kappa $κ$ of 0.63, (c) a multi-label classifier trained on our data able to detect science-relatedness with 89% F1 and also able to detect distinct forms of scientific knowledge (claims, references). With this work we aim to lay the foundation for developing and evaluating robust methods for analysing science as part of large-scale online discourse.
Submitted 6 July, 2022; v1 submitted 15 June, 2022;
originally announced June 2022.
-
The many facets of academic mobility and its impact on scholars' career
Authors:
Fakhri Momeni,
Fariba Karimi,
Philipp Mayr,
Isabella Peters,
Stefan Dietze
Abstract:
International mobility in academia can enhance the human and social capital of researchers and consequently their scientific outcome. However, there is still a very limited understanding of the different mobility patterns among scholars with various socio-demographic characteristics. The aim of this study is twofold. First, we investigate to what extent individual factors are associated with the mobility of researchers. Second, we explore the relationship between mobility and scientific activity and impact. For this purpose, we used a bibliometric approach to track the mobility of authors. To compare the scientific outcomes of researchers, we considered the number of publications and received citations as indicators, as well as the number of unique co-authors in all their publications. We also analysed the co-authorship network of researchers and compared centrality measures of mobile and non-mobile researchers. Results show that researchers from North America and Sub-Saharan Africa, particularly female ones, have the lowest and highest tendency towards international mobility, respectively. Having international co-authors increases the probability of international movement. Our findings uncover gender inequality in international mobility across scientific fields and countries. Across genders, researchers in the Physical sciences have the highest rate of mobility and those in the Social sciences the lowest. We observed more mobility for Social scientists at the advanced career stage, while researchers in other fields prefer to move at earlier career stages. Also, we found a positive correlation between mobility and scientific outcomes, but no apparent difference between females and males. Comparing the centrality of mobile and non-mobile researchers in the co-authorship networks reveals a higher social capital advantage for mobile researchers.
Submitted 29 March, 2022; v1 submitted 14 March, 2022;
originally announced March 2022.
-
SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search
Authors:
Christian Otto,
Markus Rokicki,
Georg Pardi,
Wolfgang Gritz,
Daniel Hienert,
Ran Yu,
Johannes von Hoyer,
Anett Hoppe,
Stefan Dietze,
Peter Holtz,
Yvonne Kammerer,
Ralph Ewerth
Abstract:
The emerging research field Search as Learning (SAL) investigates how the Web facilitates learning through modern information retrieval systems. SAL research requires significant amounts of data that capture both search behavior of users and their acquired knowledge in order to obtain conclusive insights or train supervised machine learning models. However, the creation of such datasets is costly and requires interdisciplinary efforts in order to design studies and capture a wide range of features. In this paper, we address this issue and introduce an extensive dataset based on a user study, in which 114 participants were asked to learn about the formation of lightning and thunder. Participants' knowledge states were measured before and after Web search through multiple-choice questionnaires and essay-based free recall tasks. To enable future research in SAL-related tasks we recorded a plethora of features and person-related attributes. Besides the screen recordings, visited Web pages, and detailed browsing histories, a large number of behavioral features and resource features were monitored. We underline the usefulness of the dataset by describing three already published use cases.
Submitted 7 January, 2022;
originally announced January 2022.
-
Reduced Order Model Predictive Control for Parametrized Parabolic Partial Differential Equations
Authors:
Saskia Dietze,
Martin A. Grepl
Abstract:
Model Predictive Control (MPC) is a well-established approach to solve infinite horizon optimal control problems. Since optimization over an infinite time horizon is generally infeasible, MPC determines a suboptimal feedback control by repeatedly solving finite time optimal control problems. Although MPC has been successfully used in many applications, applying MPC to large-scale systems -- arising, e.g., through discretization of partial differential equations -- requires the solution of high-dimensional optimal control problems and thus poses immense computational effort.
We consider systems governed by parametrized parabolic partial differential equations and employ the reduced basis (RB) method as a low-dimensional surrogate model for the finite time optimal control problem. The reduced order optimal control serves as feedback control for the original large-scale system. We analyze the proposed RB-MPC approach by first developing a posteriori error bounds for the errors in the optimal control and associated cost functional. These bounds can be evaluated efficiently in an offline-online computational procedure and allow us to guarantee asymptotic stability of the closed-loop system using the RB-MPC approach in several practical scenarios. We also propose an adaptive strategy to choose the prediction horizon of the finite time optimal control problem. Numerical results are presented to illustrate the theoretical properties of our approach.
Submitted 25 October, 2022; v1 submitted 31 October, 2021;
originally announced November 2021.
-
SoMeSci- A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles
Authors:
David Schindler,
Felix Bensmann,
Stefan Dietze,
Frank Krüger
Abstract:
Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci (Software Mentions in Science), a gold standard knowledge graph of software mentions in scientific articles. It contains high-quality annotations (IRR: $\kappa = 0.82$) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL or citations. Moreover, we distinguish between different types, such as application, plugin or programming environment, as well as different types of mentions, such as usage or creation. To the best of our knowledge, SoMeSci is the most comprehensive corpus about software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. Finally, we sketch potential use cases and provide baseline results.
Submitted 20 August, 2021;
originally announced August 2021.
-
Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption
Authors:
Christian Otto,
Ran Yu,
Georg Pardi,
Johannes von Hoyer,
Markus Rokicki,
Anett Hoppe,
Peter Holtz,
Yvonne Kammerer,
Stefan Dietze,
Ralph Ewerth
Abstract:
In informal learning scenarios the popularity of multimedia content, such as video tutorials or lectures, has significantly increased. Yet, the users' interactions, navigation behavior, and consequently learning outcome, have not been researched extensively. Related work in this field, also called search as learning, has focused on behavioral or text resource features to predict learning outcome and knowledge gain. In this paper, we investigate whether we can exploit features representing multimedia resource consumption to predict knowledge gain (KG) during Web search from in-session data, that is, without prior knowledge about the learner. For this purpose, we suggest a set of multimedia features related to image and video consumption. Our feature extraction is evaluated in a lab study with 113 participants where we collected data for a given search-as-learning task on the formation of thunderstorms and lightning. We automatically analyze the monitored log data and utilize state-of-the-art computer vision methods to extract features about the seen multimedia resources. Experimental results demonstrate that multimedia features can improve KG prediction. Finally, we provide an analysis on feature importance (text and multimedia) for KG prediction.
Submitted 11 June, 2021;
originally announced June 2021.
-
Better Together -- An Ensemble Learner for Combining the Results of Ready-made Entity Linking Systems
Authors:
Renato Stoffalette João,
Pavlos Fafalios,
Stefan Dietze
Abstract:
Entity linking (EL) is the task of automatically identifying entity mentions in text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. Throughout the past decade, a plethora of EL systems and pipelines have become available, where performance of individual systems varies heavily across corpora, languages or domains. Linking performance varies even between different mentions in the same text corpus, where, for instance, some EL approaches are better able to deal with short surface forms while others may perform better when more context information is available. To this end, we argue that performance may be optimised by exploiting results from distinct EL systems on the same corpus, thereby leveraging their individual strengths on a per-mention basis. In this paper, we introduce a supervised approach which exploits the output of multiple ready-made EL systems by predicting the correct link on a per-mention basis. Experimental results obtained on existing ground truth datasets and exploiting three state-of-the-art EL systems show the effectiveness of our approach and its capacity to significantly outperform the individual EL systems as well as a set of baseline methods.
Submitted 14 January, 2021;
originally announced January 2021.
-
The Role of Word-Eye-Fixations for Query Term Prediction
Authors:
Masoud Davari,
Daniel Hienert,
Dagmar Kern,
Stefan Dietze
Abstract:
Throughout the search process, the user's gaze on inspected SERPs and websites can reveal his or her search interests. Gaze behavior can be captured with eye tracking and described with word-eye-fixations. Word-eye-fixations contain the user's accumulated gaze fixation duration on each individual word of a web page. In this work, we analyze the role of word-eye-fixations for predicting query terms. We investigate the relationship between a range of in-session features, in particular, gaze data, with the query terms and train models for predicting query terms. We use a dataset of 50 search sessions obtained through a lab study in the social sciences domain. Using established machine learning models, we can predict query terms with comparably high accuracy, even with only little training data. Feature analysis shows that the categories Fixation, Query Relevance and Session Topic contain the most effective features for our task.
Submitted 5 August, 2020;
originally announced August 2020.
-
Exploiting stance hierarchies for cost-sensitive stance detection of Web documents
Authors:
Arjun Roy,
Pavlos Fafalios,
Asif Ekbal,
Xiaofei Zhu,
Stefan Dietze
Abstract:
Fact checking is an essential challenge when combating fake news. Identifying documents that agree or disagree with a particular statement (claim) is a core task in this process. In this context, stance detection aims at identifying the position (stance) of a document towards a claim. Most approaches address this task through a 4-class classification model where the class distribution is highly imbalanced. Therefore, they are particularly ineffective in detecting the minority classes (for instance, 'disagree'), even though such instances are crucial for tasks such as fact-checking by providing evidence for detecting false claims. In this paper, we exploit the hierarchical nature of stance classes, which allows us to propose a modular pipeline of cascading binary classifiers, enabling performance tuning on a per step and class basis. We implement our approach through a combination of neural and traditional classification models that highlight the misclassification costs of minority classes. Evaluation results demonstrate state-of-the-art performance of our approach and its ability to significantly improve the classification performance of the important 'disagree' class.
Submitted 17 May, 2021; v1 submitted 29 July, 2020;
originally announced July 2020.
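The cascade of binary classifiers described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class labels other than 'disagree', the three-step ordering, and the keyword-based toy predicates are all assumptions made for the example, and in practice each step would be a trained model with its own tuning.

```python
import re
from typing import Callable

# Each step is a binary decision over a (claim, document) pair.
BinaryStep = Callable[[str, str], bool]


def tokens(text: str) -> set:
    """Lowercased alphabetic tokens of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def cascade_stance(claim: str, document: str, is_related: BinaryStep,
                   takes_stance: BinaryStep, agrees: BinaryStep) -> str:
    """Resolve a 4-class stance through three cascading binary decisions."""
    if not is_related(claim, document):
        return "unrelated"
    if not takes_stance(claim, document):
        return "discuss"
    return "agree" if agrees(claim, document) else "disagree"


# Hypothetical toy predicates standing in for trained classifiers.
SUPPORT_CUES = {"true", "confirmed"}
DENIAL_CUES = {"false", "hoax"}


def toy_is_related(claim: str, document: str) -> bool:
    return bool(tokens(claim) & tokens(document))


def toy_takes_stance(claim: str, document: str) -> bool:
    return bool(tokens(document) & (SUPPORT_CUES | DENIAL_CUES))


def toy_agrees(claim: str, document: str) -> bool:
    return not (tokens(document) & DENIAL_CUES)


def classify(claim: str, document: str) -> str:
    return cascade_stance(claim, document, toy_is_related,
                          toy_takes_stance, toy_agrees)


claim = "vaccine causes fever"
print(classify(claim, "the vaccine claim is a hoax"))
```

Keeping each step as a separate callable is what allows per-step tuning, for example a lower decision threshold in the last step to favor recall on the minority 'disagree' class.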
-
TweetsCOV19 -- A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic
Authors:
Dimitar Dimitrov,
Erdal Baran,
Pavlos Fafalios,
Ran Yu,
Xiaofei Zhu,
Matthäus Zloch,
Stefan Dietze
Abstract:
Publicly available social media archives facilitate research in the social sciences and provide corpora for training and testing a wide range of machine learning and natural language processing methods. With respect to the recent outbreak of the Coronavirus disease 2019 (COVID-19), online discourse on Twitter reflects public opinion and perception related to the pandemic itself as well as mitigating measures and their societal impact. Understanding such discourse, its evolution, and interdependencies with real-world events or (mis)information can foster valuable insights. On the other hand, such corpora are crucial facilitators for computational methods addressing tasks such as sentiment analysis, event detection, or entity recognition. However, obtaining, archiving, and semantically annotating large amounts of tweets is costly. In this paper, we describe TweetsCOV19, a publicly available knowledge base of currently more than 8 million tweets, spanning October 2019 - April 2020. Metadata about the tweets as well as extracted entities, hashtags, user mentions, sentiments, and URLs are exposed using established RDF/S vocabularies, providing an unprecedented knowledge base for a range of knowledge discovery tasks. Next to a description of the dataset and its extraction and annotation process, we present an initial analysis and use cases of the corpus.
Submitted 15 August, 2020; v1 submitted 25 June, 2020;
originally announced June 2020.
-
A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs
Authors:
Matthäus Zloch,
Maribel Acosta,
Daniel Hienert,
Stefan Dietze,
Stefan Conrad
Abstract:
As the availability and the inter-connectivity of RDF datasets grow, so does the necessity to understand the structure of the data. Understanding the topology of RDF graphs can guide and inform the development of, e.g., synthetic dataset generators, sampling methods, index structures, or query optimizers. In this work, we propose two resources: (i) a software framework able to acquire, prepare, and perform a graph-based analysis on the topology of large RDF graphs, and (ii) results on a graph-based analysis of 280 datasets from the LOD Cloud with values for 28 graph measures computed with the framework. We present a preliminary analysis based on the proposed resources and point out implications for synthetic dataset generators. Finally, we identify a set of measures that can be used to characterize graphs in the Semantic Web.
Submitted 3 July, 2019;
originally announced July 2019.
-
Data4UrbanMobility: Towards Holistic Data Analytics for Mobility Applications in Urban Regions
Authors:
Nicolas Tempelmeier,
Yannick Rietz,
Iryna Lishchuk,
Tina Kruegel,
Olaf Mumm,
Vanessa Miriam Carlow,
Stefan Dietze,
Elena Demidova
Abstract:
With the increasing availability of mobility-related data, such as GPS-traces, Web queries and climate conditions, there is a growing demand to utilize this data to better understand and support urban mobility needs. However, data available from the individual actors, such as providers of information, navigation and transportation systems, is mostly restricted to isolated mobility modes, whereas holistic data analytics over integrated data sources is not sufficiently supported. In this paper we present our ongoing research in the context of holistic data analytics to support urban mobility applications in the Data4UrbanMobility (D4UM) project. First, we discuss challenges in urban mobility analytics and present the D4UM platform we are currently developing to facilitate holistic urban data analytics over integrated heterogeneous data sources along with the available data sources. Second, we present the MiC app - a tool we developed to complement available datasets with intermodal mobility data (i.e. data about journeys that involve more than one mode of mobility) using a citizen science approach. Finally, we present selected use cases and discuss our future work.
Submitted 26 March, 2019;
originally announced March 2019.
-
Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty
Authors:
Renato Stoffalette João,
Pavlos Fafalios,
Stefan Dietze
Abstract:
Entity Linking (EL) is the task of automatically identifying entity mentions in a piece of text and resolving them to a corresponding entity in a reference knowledge base like Wikipedia. There is a large number of EL tools available for different types of documents and domains, yet EL remains a challenging task where the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real applications. A priori approximations of the difficulty to link a particular entity mention can facilitate flagging of critical cases as part of semi-automated EL systems, while detecting latent factors that affect the EL performance, like corpus-specific features, can provide insights on how to improve a system based on the special characteristics of the underlying corpus. In this paper, we first introduce a consensus-based method to generate difficulty labels for entity mentions on arbitrary corpora. The difficulty labels are then exploited as training data for a supervised classification task able to predict the EL difficulty of entity mentions using a variety of features. Experiments over a corpus of news articles show that EL difficulty can be estimated with high accuracy, revealing also latent features that affect EL performance. Finally, evaluation results demonstrate the effectiveness of the proposed method to inform semi-automated EL pipelines.
Submitted 13 December, 2018;
originally announced December 2018.
-
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets
Authors:
Pavlos Fafalios,
Vasileios Iosifidis,
Eirini Ntoutsi,
Stefan Dietze
Abstract:
Publicly available social media archives facilitate research in a variety of fields, such as data science, sociology or the digital humanities, where Twitter has emerged as one of the most prominent sources. However, obtaining, archiving and annotating large amounts of tweets is costly. In this paper, we describe TweetsKB, a publicly available corpus of currently more than 1.5 billion tweets, spanning almost 5 years (Jan'13-Nov'17). Metadata information about the tweets as well as extracted entities, hashtags, user mentions and sentiment information are exposed using established RDF/S vocabularies. Next to a description of the extraction and annotation process, we present use cases to illustrate scenarios for entity-centric information exploration, data integration and knowledge discovery facilitated by TweetsKB.
Submitted 23 October, 2018;
originally announced October 2018.
-
Time-Aware and Corpus-Specific Entity Relatedness
Authors:
Nilamadhaba Mohapatra,
Vasileios Iosifidis,
Asif Ekbal,
Stefan Dietze,
Pavlos Fafalios
Abstract:
Entity relatedness has emerged as an important feature in a plethora of applications such as information retrieval, entity recommendation and entity linking. Given an entity, for instance a person or an organization, entity relatedness measures can be exploited for generating a list of highly related entities. However, the relation of an entity to some other entity depends on several factors, with time and context being two of the most important ones (where, in our case, context is determined by a particular corpus). For example, the entities related to the International Monetary Fund are different now compared to some years ago, while these entities may also differ greatly in the context of a USA news portal compared to a Greek news portal. In this paper, we propose a simple but flexible model for entity relatedness which considers time- and entity-aware word embeddings by exploiting the underlying corpus. The proposed model does not require external knowledge and is language independent, which makes it widely useful in a variety of applications.
Submitted 23 October, 2018;
originally announced October 2018.
-
Detecting, Understanding and Supporting Everyday Learning in Web Search
Authors:
Ran Yu,
Ujwal Gadiraju,
Stefan Dietze
Abstract:
Web search is among the most ubiquitous online activities, commonly used to acquire new knowledge and to satisfy learning-related objectives through informational search sessions. The importance of learning as an outcome of web search has been recognized widely, leading to a variety of research at the intersection of information retrieval, human computer interaction and learning-oriented sciences. Given the lack of explicit information, understanding of users and their learning needs has to be derived from their search behavior and resource interactions. In this paper, we introduce the involved research challenges and survey related work on the detection of learning needs, understanding of users, e.g. with respect to their knowledge state, learning tasks and learning progress throughout a search session as well as the actual consideration of learning needs throughout the retrieval and ranking process. In addition, we summarise our own research contributing to the aforementioned tasks and describe our research agenda in this context.
Submitted 28 June, 2018;
originally announced June 2018.
-
Predicting User Knowledge Gain in Informational Search Sessions
Authors:
Ran Yu,
Ujwal Gadiraju,
Peter Holtz,
Markus Rokicki,
Philipp Kemkes,
Stefan Dietze
Abstract:
Web search is frequently used by people to acquire new knowledge and to satisfy learning-related objectives. In this context, informational search missions with an intention to obtain knowledge pertaining to a topic are prominent. The importance of learning as an outcome of web search has been recognized. Yet, there is a lack of understanding of the impact of web search on a user's knowledge state. Predicting the knowledge gain of users can be an important step forward if web search engines that are currently optimized for relevance can be molded to serve learning outcomes. In this paper, we introduce a supervised model to predict a user's knowledge state and knowledge gain from features captured during the search sessions. To measure and predict the knowledge gain of users in informational search sessions, we recruited 468 distinct users using crowdsourcing and orchestrated real-world search sessions spanning 11 different topics and information needs. By using scientifically formulated knowledge tests, we calibrated the knowledge of users before and after their search sessions, quantifying their knowledge gain. Our supervised models utilise and derive a comprehensive set of features from the current state of the art and compare performance of a range of feature sets and feature selection strategies. Through our results, we demonstrate the ability to predict and classify the knowledge state and gain using features obtained during search sessions, exhibiting superior performance to an existing baseline in the knowledge state prediction task.
Submitted 2 May, 2018;
originally announced May 2018.
-
Inferring Missing Categorical Information in Noisy and Sparse Web Markup
Authors:
Nicolas Tempelmeier,
Elena Demidova,
Stefan Dietze
Abstract:
Embedded markup of Web pages has seen widespread adoption throughout the past years driven by standards such as RDFa and Microdata and initiatives such as schema.org, where recent studies show an adoption by 39% of all Web pages as early as 2016. While this constitutes an important information source for tasks such as Web search, Web page classification or knowledge graph augmentation, individual markup nodes are usually sparsely described and often lack essential information. For instance, from 26 million nodes describing events within the Common Crawl in 2016, 59% of nodes provide fewer than six statements and only 257,000 nodes (0.96%) are typed with more specific event subtypes. Nevertheless, given the scale and diversity of Web markup data, nodes that provide missing information can be obtained from the Web in large quantities, in particular for categorical properties. Such data constitutes potential training data for inferring missing information to significantly augment sparsely described nodes. In this work, we introduce a supervised approach for inferring missing categorical properties in Web markup. Our experiments, conducted on properties of events and movies, show F1 scores of 79% and 83%, respectively, significantly outperforming existing baselines.
Submitted 1 March, 2018;
originally announced March 2018.
-
Improving Entity Retrieval on Structured Data
Authors:
Besnik Fetahu,
Ujwal Gadiraju,
Stefan Dietze
Abstract:
The increasing amount of data on the Web, in particular of Linked Data, has led to a diverse landscape of datasets, which makes entity retrieval a challenging task. Explicit cross-dataset links, for instance to indicate co-references or related entities, can significantly improve entity retrieval. However, only a small fraction of entities are interlinked through explicit statements. In this paper, we propose a two-fold entity retrieval approach. In a first, offline preprocessing step, we cluster entities based on the x-means and spectral clustering algorithms. In the second step, we propose an optimized retrieval model which takes advantage of our precomputed clusters. For a given set of entities retrieved by the BM25F retrieval approach and a given user query, we further expand the result set with relevant entities by considering features of the queries, entities and the precomputed clusters. Finally, we re-rank the expanded result set with respect to the relevance to the query. We perform a thorough experimental evaluation on the Billion Triple Challenge (BTC12) dataset. The proposed approach shows significant improvements compared to the baseline and state-of-the-art approaches.
Submitted 30 March, 2017;
originally announced March 2017.
-
Condensation of collective charge ordering in Chromium
Authors:
A. Singer,
M. Marsh,
S. Dietze,
V. Uhlíř,
Y. Li,
D. A. Walko,
E. M. Dufresne,
G. Srajer,
M. P. Cosgriff,
P. G. Evans,
E. E. Fullerton,
O. G. Shpyrko
Abstract:
Here we report on the dynamics of the structural order parameter in a chromium film using synchrotron radiation in response to photo-induced ultra-fast excitations. Following transient optical excitations the effective lattice temperature of the film rises close to the Néel temperature and the charge density wave (CDW) amplitude is reduced. Although we expect the electronic charge ordering to vanish shortly after the excitation we observe that the CDW is never completely disrupted, which is revealed by its unmodified period at elevated temperatures. We attribute the persistence of the CDW to the long-lived periodic lattice displacement in chromium. The long-term evolution shows that the CDW revives to its initial strength within 1 ns, which appears to behave in accordance with the temperature dependence in equilibrium. This study highlights the fundamental role of the lattice distortion in charge ordered systems and its impact on the re-condensation dynamics of the charge ordered state in strongly correlated materials.
Submitted 20 November, 2014;
originally announced November 2014.