-
Efficient Entity Candidate Generation for Low-Resource Languages
Authors:
Alberto García-Durán,
Akhil Arora,
Robert West
Abstract:
Candidate generation is a crucial module in entity linking. It also plays a key role in multiple NLP tasks that have been proven to beneficially leverage knowledge bases. Nevertheless, it has often been overlooked in the monolingual English entity linking literature, as naive approaches obtain very good performance. Unfortunately, the existing approaches for English cannot be successfully transfer…
▽ More
Candidate generation is a crucial module in entity linking. It also plays a key role in multiple NLP tasks that have been proven to beneficially leverage knowledge bases. Nevertheless, it has often been overlooked in the monolingual English entity linking literature, as naive approaches obtain very good performance. Unfortunately, the existing approaches for English cannot be successfully transferred to poorly resourced languages. This paper constitutes an in-depth analysis of the candidate generation problem in the context of cross-lingual entity linking with a focus on low-resource languages. Among other contributions, we point out limitations in the evaluation conducted in previous works. We introduce a characterization of queries into types based on their difficulty, which improves the interpretability of the performance of different methods. We also propose a light-weight and simple solution based on the construction of indexes whose design is motivated by more complex transfer learning based neural approaches. A thorough empirical analysis on 9 real-world datasets under 2 evaluation settings shows that our simple solution outperforms the state-of-the-art approach in terms of both quality and efficiency for almost all datasets and query types.
△ Less
Submitted 30 June, 2022;
originally announced June 2022.
-
Wikipedia Reader Navigation: When Synthetic Data Is Enough
Authors:
Akhil Arora,
Martin Gerlach,
Tiziano Piccardi,
Alberto García-Durán,
Robert West
Abstract:
Every day millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers describe trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available…
▽ More
Every day millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers describe trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data due to the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated by using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in 6 analyses across 8 Wikipedia language versions. Overall, we find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%. This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource: clickstream data can closely capture reader navigation on Wikipedia and provides a sufficient approximation for most practical downstream applications relying on reader data. More broadly, this study provides an example for how clickstream-like data can generally enable research on user navigation on online platforms while protecting users' privacy.
△ Less
Submitted 5 January, 2022; v1 submitted 3 January, 2022;
originally announced January 2022.
-
Low-Rank Subspaces for Unsupervised Entity Linking
Authors:
Akhil Arora,
Alberto García-Durán,
Robert West
Abstract:
Entity linking is an important problem with many applications. Most previous solutions were designed for settings where annotated training data is available, which is, however, not the case in numerous domains. We propose a light-weight and scalable entity linking method, Eigenthemes, that relies solely on the availability of entity names and a referent knowledge base. Eigenthemes exploits the fac…
▽ More
Entity linking is an important problem with many applications. Most previous solutions were designed for settings where annotated training data is available, which is, however, not the case in numerous domains. We propose a light-weight and scalable entity linking method, Eigenthemes, that relies solely on the availability of entity names and a referent knowledge base. Eigenthemes exploits the fact that the entities that are truly mentioned in a document (the "gold entities") tend to form a semantically dense subset of the set of all candidate entities in the document. Geometrically speaking, when representing entities as vectors via some given embedding, the gold entities tend to lie in a low-rank subspace of the full embedding space. Eigenthemes identifies this subspace using the singular value decomposition and scores candidate entities according to their proximity to the subspace. On the empirical front, we introduce multiple strong baselines that compare favorably to (and sometimes even outperform) the existing state of the art. Extensive experiments on benchmark datasets from a variety of real-world domains showcase the effectiveness of our approach.
△ Less
Submitted 14 October, 2021; v1 submitted 18 April, 2021;
originally announced April 2021.
-
Recursive input and state estimation: A general framework for learning from time series with missing data
Authors:
Alberto García-Durán,
Robert West
Abstract:
Time series with missing data are signals encountered in important settings for machine learning. Some of the most successful prior approaches for modeling such time series are based on recurrent neural networks that transform the input and previous state to account for the missing observations, and then treat the transformed signal in a standard manner.
In this paper, we introduce a single unif…
▽ More
Time series with missing data are signals encountered in important settings for machine learning. Some of the most successful prior approaches for modeling such time series are based on recurrent neural networks that transform the input and previous state to account for the missing observations, and then treat the transformed signal in a standard manner.
In this paper, we introduce a single unifying framework, Recursive Input and State Estimation (RISE), for this general approach and reformulate existing models as specific instances of this framework. We then explore additional novel variations within the RISE framework to improve the performance of any instance. We exploit representation learning techniques to learn latent representations of the signals used by RISE instances. We discuss and develop various encoding techniques to learn latent signal representations. We benchmark instances of the framework with various encoding functions on three data imputation datasets, observing that RISE instances always benefit from encoders that learn representations for numerical values from the digits into which they can be decomposed.
△ Less
Submitted 17 April, 2021;
originally announced April 2021.
-
Node Attribute Completion in Knowledge Graphs with Multi-Relational Propagation
Authors:
Eda Bayram,
Alberto Garcia-Duran,
Robert West
Abstract:
The existing literature on knowledge graph completion mostly focuses on the link prediction task. However, knowledge graphs have an additional incompleteness problem: their nodes possess numerical attributes, whose values are often missing. Our approach, denoted as MrAP, imputes the values of missing attributes by propagating information across the multi-relational structure of a knowledge graph.…
▽ More
The existing literature on knowledge graph completion mostly focuses on the link prediction task. However, knowledge graphs have an additional incompleteness problem: their nodes possess numerical attributes, whose values are often missing. Our approach, denoted as MrAP, imputes the values of missing attributes by propagating information across the multi-relational structure of a knowledge graph. It employs regression functions for predicting one node attribute from another depending on the relationship between the nodes and the type of the attributes. The propagation mechanism operates iteratively in a message passing scheme that collects the predictions at every iteration and updates the value of the node attributes. Experiments over two benchmark datasets show the effectiveness of our approach.
△ Less
Submitted 10 November, 2020;
originally announced November 2020.
-
MMKG: Multi-Modal Knowledge Graphs
Authors:
Ye Liu,
Hui Li,
Alberto Garcia-Duran,
Mathias Niepert,
Daniel Onoro-Rubio,
David S. Rosenblum
Abstract:
We present MMKG, a collection of three knowledge graphs that contain both numerical features and (links to) images for all entities as well as entity alignments between pairs of KGs. Therefore, multi-relational link prediction and entity matching communities can benefit from this resource. We believe this data set has the potential to facilitate the development of novel multi-modal learning approa…
▽ More
We present MMKG, a collection of three knowledge graphs that contain both numerical features and (links to) images for all entities as well as entity alignments between pairs of KGs. Therefore, multi-relational link prediction and entity matching communities can benefit from this resource. We believe this data set has the potential to facilitate the development of novel multi-modal learning approaches for knowledge graphs.We validate the utility ofMMKG in the sameAs link prediction task with an extensive set of experiments. These experiments show that the task at hand benefits from learning of multiple feature types.
△ Less
Submitted 13 March, 2019;
originally announced March 2019.
-
Learning Representations of Missing Data for Predicting Patient Outcomes
Authors:
Brandon Malone,
Alberto Garcia-Duran,
Mathias Niepert
Abstract:
Extracting actionable insight from Electronic Health Records (EHRs) poses several challenges for traditional machine learning approaches. Patients are often missing data relative to each other; the data comes in a variety of modalities, such as multivariate time series, free text, and categorical demographic information; important relationships among patients can be difficult to detect; and many o…
▽ More
Extracting actionable insight from Electronic Health Records (EHRs) poses several challenges for traditional machine learning approaches. Patients are often missing data relative to each other; the data comes in a variety of modalities, such as multivariate time series, free text, and categorical demographic information; important relationships among patients can be difficult to detect; and many others. In this work, we propose a novel approach to address these first three challenges using a representation learning scheme based on message passing. We show that our proposed approach is competitive with or outperforms the state of the art for predicting in-hospital mortality (binary classification), the length of hospital visits (regression) and the discharge destination (multiclass classification).
△ Less
Submitted 12 November, 2018;
originally announced November 2018.
-
Knowledge Graph Completion to Predict Polypharmacy Side Effects
Authors:
Brandon Malone,
Alberto García-Durán,
Mathias Niepert
Abstract:
The polypharmacy side effect prediction problem considers cases in which two drugs taken individually do not result in a particular side effect; however, when the two drugs are taken in combination, the side effect manifests. In this work, we demonstrate that multi-relational knowledge graph completion achieves state-of-the-art results on the polypharmacy side effect prediction problem. Empirical…
▽ More
The polypharmacy side effect prediction problem considers cases in which two drugs taken individually do not result in a particular side effect; however, when the two drugs are taken in combination, the side effect manifests. In this work, we demonstrate that multi-relational knowledge graph completion achieves state-of-the-art results on the polypharmacy side effect prediction problem. Empirical results show that our approach is particularly effective when the protein targets of the drugs are well-characterized. In contrast to prior work, our approach provides more interpretable predictions and hypotheses for wet lab validation.
△ Less
Submitted 22 October, 2018;
originally announced October 2018.
-
Learning Sequence Encoders for Temporal Knowledge Graph Completion
Authors:
Alberto García-Durán,
Sebastijan Dumančić,
Mathias Niepert
Abstract:
Research on link prediction in knowledge graphs has mainly focused on static multi-relational data. In this work we consider temporal knowledge graphs where relations between entities may only hold for a time interval or a specific point in time. In line with previous work on static knowledge graphs, we propose to address this problem by learning latent entity and relation type representations. To…
▽ More
Research on link prediction in knowledge graphs has mainly focused on static multi-relational data. In this work we consider temporal knowledge graphs where relations between entities may only hold for a time interval or a specific point in time. In line with previous work on static knowledge graphs, we propose to address this problem by learning latent entity and relation type representations. To incorporate temporal information, we utilize recurrent neural networks to learn time-aware representations of relation types which can be used in conjunction with existing latent factorization methods. The proposed approach is shown to be robust to common challenges in real-world KGs: the sparsity and heterogeneity of temporal expressions. Experiments show the benefits of our approach on four temporal KGs. The data sets are available under a permissive BSD-3 license 1.
△ Less
Submitted 10 September, 2018;
originally announced September 2018.
-
A Comparative Study of Distributional and Symbolic Paradigms for Relational Learning
Authors:
Sebastijan Dumancic,
Alberto Garcia-Duran,
Mathias Niepert
Abstract:
Many real-world domains can be expressed as graphs and, more generally, as multi-relational knowledge graphs. Though reasoning and learning with knowledge graphs has traditionally been addressed by symbolic approaches, recent methods in (deep) representation learning has shown promising results for specialized tasks such as knowledge base completion. These approaches abandon the traditional symbol…
▽ More
Many real-world domains can be expressed as graphs and, more generally, as multi-relational knowledge graphs. Though reasoning and learning with knowledge graphs has traditionally been addressed by symbolic approaches, recent methods in (deep) representation learning has shown promising results for specialized tasks such as knowledge base completion. These approaches abandon the traditional symbolic paradigm by replacing symbols with vectors in Euclidean space. With few exceptions, symbolic and distributional approaches are explored in different communities and little is known about their respective strengths and weaknesses. In this work, we compare representation learning and relational learning on various relational classification and clustering tasks and analyse the complexity of the rules used implicitly by these approaches. Preliminary results reveal possible indicators that could help in choosing one approach over the other for particular knowledge graphs.
△ Less
Submitted 24 March, 2020; v1 submitted 29 June, 2018;
originally announced June 2018.
-
Towards a Spectrum of Graph Convolutional Networks
Authors:
Mathias Niepert,
Alberto Garcia-Duran
Abstract:
We present our ongoing work on understanding the limitations of graph convolutional networks (GCNs) as well as our work on generalizations of graph convolutions for representing more complex node attribute dependencies. Based on an analysis of GCNs with the help of the corresponding computation graphs, we propose a generalization of existing GCNs where the aggregation operations are (a) determined…
▽ More
We present our ongoing work on understanding the limitations of graph convolutional networks (GCNs) as well as our work on generalizations of graph convolutions for representing more complex node attribute dependencies. Based on an analysis of GCNs with the help of the corresponding computation graphs, we propose a generalization of existing GCNs where the aggregation operations are (a) determined by structural properties of the local neighborhood graphs and (b) not restricted to weighted averages. We show that the proposed approach is strictly more expressive while requiring only a modest increase in the number of parameters and computations. We also show that the proposed generalization is identical to standard convolutional layers when applied to regular grid graphs.
△ Less
Submitted 4 May, 2018;
originally announced May 2018.
-
TransRev: Modeling Reviews as Translations from Users to Items
Authors:
Alberto Garcia-Duran,
Roberto Gonzalez,
Daniel Onoro-Rubio,
Mathias Niepert,
Hui Li
Abstract:
The text of a review expresses the sentiment a customer has towards a particular product. This is exploited in sentiment analysis where machine learning models are used to predict the review score from the text of the review. Furthermore, the products costumers have purchased in the past are indicative of the products they will purchase in the future. This is what recommender systems exploit by le…
▽ More
The text of a review expresses the sentiment a customer has towards a particular product. This is exploited in sentiment analysis where machine learning models are used to predict the review score from the text of the review. Furthermore, the products costumers have purchased in the past are indicative of the products they will purchase in the future. This is what recommender systems exploit by learning models from purchase information to predict the items a customer might be interested in. We propose TransRev, an approach to the product recommendation problem that integrates ideas from recommender systems, sentiment analysis, and multi-relational learning into a joint learning objective. TransRev learns vector representations for users, items, and reviews. The embedding of a review is learned such that (a) it performs well as input feature of a regression model for sentiment prediction; and (b) it always translates the reviewer embedding to the embedding of the reviewed items. This allows TransRev to approximate a review embedding at test time as the difference of the embedding of each item and the user embedding. The approximated review embedding is then used with the regression model to predict the review score for each item. TransRev outperforms state of the art recommender systems on a large number of benchmark data sets. Moreover, it is able to retrieve, for each user and item, the review text from the training set whose embedding is most similar to the approximated review embedding.
△ Less
Submitted 18 April, 2018; v1 submitted 30 January, 2018;
originally announced January 2018.
-
Learning Graph Representations with Embedding Propagation
Authors:
Alberto Garcia-Duran,
Mathias Niepert
Abstract:
We propose Embedding Propagation (EP), an unsupervised learning framework for graph-structured data. EP learns vector representations of graphs by passing two types of messages between neighboring nodes. Forward messages consist of label representations such as representations of words and other attributes associated with the nodes. Backward messages consist of gradients that result from aggregati…
▽ More
We propose Embedding Propagation (EP), an unsupervised learning framework for graph-structured data. EP learns vector representations of graphs by passing two types of messages between neighboring nodes. Forward messages consist of label representations such as representations of words and other attributes associated with the nodes. Backward messages consist of gradients that result from aggregating the label representations and applying a reconstruction loss. Node representations are finally computed from the representation of their labels. With significantly fewer parameters and hyperparameters an instance of EP is competitive with and often outperforms state of the art unsupervised and semi-supervised learning methods on a range of benchmark data sets.
△ Less
Submitted 9 October, 2017;
originally announced October 2017.
-
KBLRN : End-to-End Learning of Knowledge Base Representations with Latent, Relational, and Numerical Features
Authors:
Alberto Garcia-Duran,
Mathias Niepert
Abstract:
We present KBLRN, a framework for end-to-end learning of knowledge base representations from latent, relational, and numerical features. KBLRN integrates feature types with a novel combination of neural representation learning and probabilistic product of experts models. To the best of our knowledge, KBLRN is the first approach that learns representations of knowledge bases by integrating latent,…
▽ More
We present KBLRN, a framework for end-to-end learning of knowledge base representations from latent, relational, and numerical features. KBLRN integrates feature types with a novel combination of neural representation learning and probabilistic product of experts models. To the best of our knowledge, KBLRN is the first approach that learns representations of knowledge bases by integrating latent, relational, and numerical features. We show that instances of KBLRN outperform existing methods on a range of knowledge base completion tasks. We contribute a novel data sets enriching commonly used knowledge base completion benchmarks with numerical features. The data sets are available under a permissive BSD-3 license. We also investigate the impact numerical features have on the KB completion performance of KBLRN.
△ Less
Submitted 11 June, 2018; v1 submitted 14 September, 2017;
originally announced September 2017.
-
Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs
Authors:
Daniel Oñoro-Rubio,
Mathias Niepert,
Alberto García-Durán,
Roberto González,
Roberto J. López-Sastre
Abstract:
A visual-relational knowledge graph (KG) is a multi-relational graph whose entities are associated with images. We explore novel machine learning approaches for answering visual-relational queries in web-extracted knowledge graphs. To this end, we have created ImageGraph, a KG with 1,330 relation types, 14,870 entities, and 829,931 images crawled from the web. With visual-relational KGs such as Im…
▽ More
A visual-relational knowledge graph (KG) is a multi-relational graph whose entities are associated with images. We explore novel machine learning approaches for answering visual-relational queries in web-extracted knowledge graphs. To this end, we have created ImageGraph, a KG with 1,330 relation types, 14,870 entities, and 829,931 images crawled from the web. With visual-relational KGs such as ImageGraph one can introduce novel probabilistic query types in which images are treated as first-class citizens. Both the prediction of relations between unseen images as well as multi-relational image retrieval can be expressed with specific families of visual-relational queries. We introduce novel combinations of convolutional networks and knowledge graph embedding methods to answer such queries. We also explore a zero-shot learning scenario where an image of an entirely new entity is linked with multiple relations to entities of an existing KG. The resulting multi-relational grounding of unseen entity images into a knowledge graph serves as a semantic entity representation. We conduct experiments to demonstrate that the proposed methods can answer these visual-relational queries efficiently and accurately.
△ Less
Submitted 3 May, 2019; v1 submitted 7 September, 2017;
originally announced September 2017.
-
Net2Vec: Deep Learning for the Network
Authors:
Roberto Gonzalez,
Filipe Manco,
Alberto Garcia-Duran,
Jose Mendes,
Felipe Huici,
Saverio Niccolini,
Mathias Niepert
Abstract:
We present Net2Vec, a flexible high-performance platform that allows the execution of deep learning algorithms in the communication network. Net2Vec is able to capture data from the network at more than 60Gbps, transform it into meaningful tuples and apply predictions over the tuples in real time. This platform can be used for different purposes ranging from traffic classification to network perfo…
▽ More
We present Net2Vec, a flexible high-performance platform that allows the execution of deep learning algorithms in the communication network. Net2Vec is able to capture data from the network at more than 60Gbps, transform it into meaningful tuples and apply predictions over the tuples in real time. This platform can be used for different purposes ranging from traffic classification to network performance analysis.
Finally, we showcase the use of Net2Vec by implementing and testing a solution able to profile network users at line rate using traces coming from a real network. We show that the use of deep learning for this case outperforms the baseline method both in terms of accuracy and performance.
△ Less
Submitted 10 May, 2017;
originally announced May 2017.
-
Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus
Authors:
Iulian Vlad Serban,
Alberto García-Durán,
Caglar Gulcehre,
Sungjin Ahn,
Sarath Chandar,
Aaron Courville,
Yoshua Bengio
Abstract:
Over the past decade, large-scale supervised learning corpora have enabled machine learning researchers to make substantial advances. However, to this date, there are no large-scale question-answer corpora available. In this paper we present the 30M Factoid Question-Answer Corpus, an enormous question answer pair corpus produced by applying a novel neural network architecture on the knowledge base…
▽ More
Over the past decade, large-scale supervised learning corpora have enabled machine learning researchers to make substantial advances. However, to this date, there are no large-scale question-answer corpora available. In this paper we present the 30M Factoid Question-Answer Corpus, an enormous question answer pair corpus produced by applying a novel neural network architecture on the knowledge base Freebase to transduce facts into natural language questions. The produced question answer pairs are evaluated both by human evaluators and using automatic evaluation metrics, including well-established machine translation and sentence similarity metrics. Across all evaluation criteria the question-generation model outperforms the competing template-based baseline. Furthermore, when presented to human evaluators, the generated questions appear comparable in quality to real human-generated questions.
△ Less
Submitted 29 May, 2016; v1 submitted 22 March, 2016;
originally announced March 2016.
-
Combining Two And Three-Way Embeddings Models for Link Prediction in Knowledge Bases
Authors:
Alberto Garcia-Duran,
Antoine Bordes,
Nicolas Usunier,
Yves Grandvalet
Abstract:
This paper tackles the problem of endogenous link prediction for Knowledge Base completion. Knowledge Bases can be represented as directed graphs whose nodes correspond to entities and edges to relationships. Previous attempts either consist of powerful systems with high capacity to model complex connectivity patterns, which unfortunately usually end up overfitting on rare relationships, or in app…
▽ More
This paper tackles the problem of endogenous link prediction for Knowledge Base completion. Knowledge Bases can be represented as directed graphs whose nodes correspond to entities and edges to relationships. Previous attempts either consist of powerful systems with high capacity to model complex connectivity patterns, which unfortunately usually end up overfitting on rare relationships, or in approaches that trade capacity for simplicity in order to fairly model all relationships, frequent or not. In this paper, we propose Tatec a happy medium obtained by complementing a high-capacity model with a simpler one, both pre-trained separately and then combined. We present several variants of this model with different kinds of regularization and combination strategies and show that this approach outperforms existing methods on different types of relationships by achieving state-of-the-art results on four benchmarks of the literature.
△ Less
Submitted 2 June, 2015;
originally announced June 2015.
-
Irreflexive and Hierarchical Relations as Translations
Authors:
Antoine Bordes,
Nicolas Usunier,
Alberto Garcia-Duran,
Jason Weston,
Oksana Yakhnenko
Abstract:
We consider the problem of embedding entities and relations of knowledge bases in low-dimensional vector spaces. Unlike most existing approaches, which are primarily efficient for modeling equivalence relations, our approach is designed to explicitly model irreflexive relations, such as hierarchies, by interpreting them as translations operating on the low-dimensional embeddings of the entities. P…
▽ More
We consider the problem of embedding entities and relations of knowledge bases in low-dimensional vector spaces. Unlike most existing approaches, which are primarily efficient for modeling equivalence relations, our approach is designed to explicitly model irreflexive relations, such as hierarchies, by interpreting them as translations operating on the low-dimensional embeddings of the entities. Preliminary experiments show that, despite its simplicity and a smaller number of parameters than previous approaches, our approach achieves state-of-the-art performance according to standard evaluation protocols on data from WordNet and Freebase.
△ Less
Submitted 26 April, 2013;
originally announced April 2013.