Analyzing Linked Data Quality with LiQuate

Maria-Esther Vidal

Analyzing Linked Data Quality with LiQuate

2014, Lecture Notes in Computer Science

Analyzing Linked Data Quality with LiQuate Edna Ruckhaus, Maria-Esther Vidal, Simón Castillo, Oscar Burguillos, and Oriana Baldizan Universidad Simón Bolı́var, Venezuela {eruckhaus,mvidal,scastillo,oburguillos,obaldizan}@ldc.usb.ve Abstract. The number of datasets in the Linking Open Data (LOD) cloud as well as LOD-based applications have exploded in the last years. However, because of data source heterogeneity, published data may suffer of redundancy, inconsistencies, or may be incomplete; thus, results generated by LOD-based applications may be imprecise, ambiguous, or unreliable. We demonstrate the capabilities of LiQuate (Linked Data Quality Assessment), a tool that relies on Bayesian Networks to analyze the quality of data and links in the LOD cloud. 1 Introduction Linking Open Data initiatives have made a diversity of collections available, and facilitate scientists the mining of linked datasets to discover patterns or suggest potential new associations. To ensure trustworthy results, linked data must meet high quality standards. However, data in the LOD cloud has not been necessarily curated, and tools are required to detect possible quality problems and ambiguities produced by redundancy, inconsistencies, and incompleteness of both data and links [2]. We developed LiQuate, a tool able to identify potential quality problems and ambiguities among data and links. LiQuate relies on statistical reasoning to analyze the quality of data based on completeness and potential redundancies or inconsistencies. A Bayesian Network models the dependencies among resources that belong to a set of linked datasets [1, 3]; conditional probability tables annotate the nodes of the network and represent joint probability distributions of relationships among resources. Queries against the Bayesian Network represent the probability that different resources have redundant labels or that a link between two resources is missing; thus, the returned probabilities can suggest ambiguities or possible incompleteness in the data or links. We demonstrate the data quality validation capabilities of LiQuate and the benefits of the approach on the Biomedical datasets: Drugbank Website1 , LinkedCT2 , D2R Diseasome3 , D2R Dailymed4 , D2R Drugbank5 , Bio2RDF Drugbank6 , and DBPe1 2 3 4 5 6 http://www.drugbank.ca/ http://linkedct.org/ http://wifo5-04.informatik.uni-mannheim.de/diseasome http://wifo5-03.informatik.uni-mannheim.de/dailymed/ http://wifo5-04.informatik.uni-mannheim.de/drugbank/ http://download.bio2rdf.org/current/drugbank/drugbank.html RDF Graph Dataset RDF Document User Request LBN Structure RDF Relational Database Liquate Bayesian Network Builder Tree Aggregated CPT LiQuAtE Bayesian Network Quality Validation Request Analizer Linked Data Ambiguities Ambiguity Detector Bayesian Network Query Translator Bayesian Network Inference Engine probability Queries SAMIAM Bayesan Inference Tool Fig. 1. The LiQuate System Architecture. dia7 . This demo illustrates how queries to a Bayesian Network that models RDF data and dependencies among properties, can be used to study quality problems related to both incompleteness of links, and ambiguities among labels and links. We show the following key issues: redundancy among drug labels in the LinkedCT dataset, and incompleteness and inconsistencies of links in Biomedical datasets. The demo is published at http://liquate.ldc.usb.ve. 2 The LiQuate System As a proof of concept, LiQuate has been built on top of the Biomedical linked datasets that maintain data related to clinical trials, interventions, conditions, drugs, diseases, and the relationships among them. LiQuate exploits visualization services implemented by the D3.js JavaScript library8 . Figure 1 illustrates the LiQuate architecture. LiQuate receives a quality validation request which is expressed as one or more evidence queries against the Bayesian Network. The answer of a quality validation request is a number in the range [0.0:1.0] that indicates the probability that a given quality problem occurs among the data. Currently, three types of quality validation requests can be expressed: i) probability that labels or names of a given (type of) resource are redundant, ii) probability of incomplete links among a given set of resources, and iii) probability of inconsistent links. LiQuate is comprised of two components: the LiQuate Bayesian Network Builder and the Ambiguity Detector. The LiQuate Bayesian Network Builder is a semi-automatic off-line process; it relies on an expert’s knowledge about the properties in the RDF linked datasets that are going to be represented in the Bayesian Network. Relevant data is retrieved from SPARQL endpoints, and stored in a relational database to compute the histograms that implement the conditional probability tables (CPTs) associated with the nodes of the network. 7 8 http://wiki.dbpedia.org/Downloads32 http://d3js.org/ The demo is focused on the Ambiguity Detector: a probabilistic model that supports the analysis of the three above mentioned linked data quality problems. The Ambiguity Detector is in turn comprised of three components: 1) the Quality Validation Request Analyzer, 2) the Bayesian Network Query Translator, and 3) the Bayesian Network Inference Engine. The Quality Validation Request Analyzer receives a user request and determines if it can be satisfied with the existing Bayesian Network. The Bayesian Network Query Translator considers the user request and generates the set of queries that must be posed against the Bayesian Network. It also gathers the answers of these queries and generates an answer to the user request. Finally, the Bayesian Network Inference Engine is responsible of performing the inference process required to answer each of the queries posed against the Bayesian Network. This engine is implemented by the SamIam Bayesian Inference Tool9 . 3 Demonstration of Use Cases As of September 2011, LinkedCT contains 106,308 trials, 2.7 million entities and over 25 million RDF triples. Additionally, we consider the following datasets that are linked to LinkedCT: i) Drugbank (over 765,936 triples), ii) Diseasome (around 91,182 triples), and iii) DBPedia (links from LinkedCT 25,476). We built local RDF storage with LinkedCT triples and the triples from these three datasets that are related to LinkedCT. The Bayesian network and its corresponding CPT’s were computed and stored in the SamIam Bayesian Inference Tool. The generated network is comprised of 17 nodes and the aggregated CPTs are of up to 167, 616 entries; for the cases to be shown, the average response time of LiQuate is 4, 715 ms. Figure 2(a) illustrates the description of Biomedical linked datasets, and Figure 2(b) presents the Bayesian Network that represents the dependencies between these properties and links. Concept Network Browser plots10 and Force-Directed Graphs11 are used for visualization. We demonstrate the following use cases: Ambiguities between labels of Interventions or Drugs: starting with Alemtuzumab as an exemplar, we retrieve the intersection of Monoclonal antibodies and Antineoplastic agents. This creates a dataset of 12 drugs: Alemtuzumab, Bevacizumab, Brentuximab vedotin, Cetuximab, Catumaxomab, Edrecolomab, Gemtuzumab, Ipilimumab, Ofatumumab, Panitumumab, Rituximab, and Trastuzumab. These drugs are frequently tested in clinical trials, and there are up to 723 clinical trials with a given intervention, e.g., the intervention that corresponds to the drug Alemtuzumab is present in 112 different clinical trials, and all of these should be linked to the drug DB00087 (Alemtuzumab) in Drugbank in order for the datasets to be unambiguous. This use case illustrates the execution of a query that could indicate possible uncontrolled redundancy in the datasets. The Bayesian Network used to infer the percentage of ambiguity is visualized by 9 10 11 http://reasoning.cs.ucla.edu/samiam/help/recursiveconditioning.html http://www.findtheconversation.com/concept-map http://bl.ocks.org/mbostock/4062045 (a) LinkedCT, DrugBank (website, and two endpoints), Diseasome, and DBPedia visualized as a Concept Network Browser plot. Predicates published by the Drugbank Website are highlighted. (b) Bayesian Network for LinkedCT, DrugBank, Diseasome, and DBPedia visualized by using a Force-Directed Graph; nodes colored in orange and in blue correspond to marginal and evidence variables, respectively Fig. 2. Biomedical Linked Datasets and a LiQuate Bayesian Network. using a Force-Directed Graph; nodes colored in orange and in blue correspond to marginal and evidence variables, respectively. Incompleteness of links between LinkedCT, Drugbank, Diseasome, and DBPedia: We consider the family of the 12 drugs described above, and for each of the partitions induced by redundant labels we consider the owl:sameAs and rdfs:seeAlso links. A partition represents all of the clinical trials that are of interventional type and that have the same intervention (drug) label. For each intervention id that belongs to a partition, a query to the Bayesian Network is executed in order to determine if owl:sameAs links have been established for this intervention. General results are also presented for each of the 12 drugs. Examples of these results are: i) a percentage of redundant labels are not linked through owl:sameAs to neither Drugbank or DBPedia, but 100% of the labels are linked through rdfs:seeAlso, e.g., Bevacizumab; ii) none of the redundant labels is linked to Drugbank or DBPedia, e.g., Brentuximab vedotin, in this case, the drug is not appear in Drugbank; and iii) a percentage of redundant labels are linked to DBPedia through owl:sameAs, all of them are linked to DBPedia through rdfs:seeAlso, and none to Drugbank, e.g., Ipilimumab. Inconsistencies of links between LinkedCT, Drugbank, Diseasome, and DBPedia: We analyze if relationships that represent diseases that are possible targets of a drug, are backed up by clinical trials. For each of the 12 drugs, the query to the Bayesian network determines if for each possible disease target of a drug, there is at least one trial with this Condition (disease) and drug intervention. Conditions and interventions should be linked by owl:sameAs links to their corresponding drugs and diseases, in the Drugbank and Diseasome datasets. Approximately, 10, 000 probability queries were generated for each drug and disease and all the combinations of linked (through owl:sameAs) conditions and interventions. The marginal node is s-s-hascondition-hasintervention, and the evidence is a disease, drug, condition, intervention, and the existence of owl:sameAs links among them. The result is that 13,5% of the drugs and targeted diseases are supported by clinical trials that can be found through owl:sameAs links. Similarly, another hypothesis is that drugs that can possibly treat diseases (possibleDrug links) are supported by the same number of clinical trials. The result is 13, 5% and this number suggests that both links possibleDiseaseTarget and possibleDrug are the inverse of each other. Particularly, for the dataset of 12 drugs we can observe the following: the drugs Brentuximab vedotin, Ipilimumab and Ofatumumab do not appear in Drugbank while these drugs have been studied in a large number of clinical trials. The rest of these 12 drugs do appear in Drugbank, but are associated with much less diseases through the property possibleDiseaseTarget in Drugbank, than to conditions through a clinical trial in LinkedCT. For example, the drug Cetuximab can possibly target eighteen diseases while this drug has been tested in completed clinical trials for 82 conditions; only four of the eighteen diseases in the property possibleDiseaseTarget in Drugbank, are included in the list of 82 conditions in LinkedCT. This ambiguity can be also observed in the rest of the drugs. 4 Conclusions We present LiQuate, a data and link validation tool that relies on a Bayesian Network to identify redundancies, incompleteness and inconsistencies. We demonstrate the main quality validation capabilities of LiQuate, and illustrate different quality problems that may currently occur in the LOD cloud. Particularly, we can observe some ambiguities that suggest the experts to check for uncontrolled redundancy, incompleteness or inconsistency: i) the same label or name of intervention is assigned to different resources, ii) incomplete owl:sameAs and rdfs:seeAlso links between datasets, and iii) associations between drugs and diseases in Drugbank may not be supported by trials in LinkedCT. References 1. L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. SIGMOD Record, 30(2):461–472, 2001. 2. C. Guéret, P. T. Groth, C. Stadler, and J. Lehmann. Assessing linked data mappings using network measures. In ESWC, pages 87–102, 2012. 3. E. Ruckhaus and M.-E. Vidal. Liquate-estimating the quality of links in the linking open data cloud. In RED, pages 56–82, 2012.

Log In

Analyzing Linked Data Quality with LiQuate

Related papers

Related papers

Related topics