Slides presented at Open Data Exchange 2013, April 6 2013, Montreal, Canada. ODX13.com. Sponsored by Trudat.co
Linking the Open Data? by Petko Valtchev
1. Linking the Open Data?
Petko Valtchev
(Assoc. Prof., Dept. of CS, UQAM)
ODX'13
Montreal, April 6th
2. Why Link The Data
"I want you to put your data on the Web."
Sir T. Berners-Lee (TED'07)
• Original Web (1990s):
  • network of linked documents
• Web of Data (2000s):
  • network of interlinked data items
• Linked Open Data: Publish data on the Web:
  • max. reuse and inter-connections, min. redundancy, network effect
Data becomes really useful when it is shared and combined with other data.
3. Linking Data?
• But how should one produce such data?
1. Global identification: every data item should be identified by a URL.
2. Reachability via HTTP: accessing the URL should retrieve the data item.
3. Linked structure: outgoing links (typed!) in the data should point to additional data with URLs.
http://www.w3.org/DesignIssues/LinkedData.html
• THE language: Resource Description Framework (RDF)
1. benefits: links provide context
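The second principle (reachability via HTTP) is commonly realised through content negotiation: the client asks for an RDF serialization of whatever the URL identifies. The following is a minimal Python sketch, assuming the server honours an `Accept: text/turtle` header. The helper names `build_rdf_request` and `dereference` are made up for this illustration, and the DBpedia URL is only an example target.

```python
import urllib.request


def build_rdf_request(uri: str, media_type: str = "text/turtle") -> urllib.request.Request:
    """Prepare an HTTP GET for `uri` that asks for RDF rather than HTML."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def dereference(uri: str) -> str:
    """Fetch the description of the data item (requires network access)."""
    with urllib.request.urlopen(build_rdf_request(uri), timeout=10) as response:
        return response.read().decode("utf-8")


# Only the request is built here; calling dereference() would go to the network.
request = build_rdf_request("http://dbpedia.org/resource/Montreal")
```

The typed links found in the returned description are themselves URLs, so the same call can be repeated on them to move from one data item to the next.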
4. A Graph?
[Figure: RDF graph with the following edges]
pd:tedstr rdf:type foaf:Person
pd:tedstr foaf:name "Ted Strauss"
pd:tedstr foaf:based_near dbpedia:Montreal
dbpedia:Montreal dpprop:population 3,407,963
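Read as subject-predicate-object triples, the graph on this slide can be sketched in a few lines of Python. This is an illustration only: the expansions given for the `pd:` and `dpprop:` prefixes are assumptions (the slide does not spell them out), and `expand`, `objects` and `follow` are hypothetical helper names.

```python
# Prefix table; the `pd:` and `dpprop:` expansions are assumptions for this sketch.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/resource/",
    "dpprop": "http://dbpedia.org/property/",  # assumed expansion
    "pd": "http://example.org/people/",        # hypothetical namespace
}

# The edges of the graph, one (subject, predicate, object) tuple per edge.
TRIPLES = [
    ("pd:tedstr", "rdf:type", "foaf:Person"),
    ("pd:tedstr", "foaf:name", "Ted Strauss"),
    ("pd:tedstr", "foaf:based_near", "dbpedia:Montreal"),
    ("dbpedia:Montreal", "dpprop:population", "3,407,963"),
]


def expand(curie: str) -> str:
    """Turn a prefixed name such as foaf:Person into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def objects(subject: str, predicate: str) -> list:
    """All objects reachable from `subject` over edges typed `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def follow(start: str, *predicates: str) -> list:
    """Follow a chain of typed links from `start` and return the end nodes."""
    frontier = [start]
    for predicate in predicates:
        frontier = [o for node in frontier for o in objects(node, predicate)]
    return frontier


# Population of the place Ted is based near: a two-step path in the graph.
population = follow("pd:tedstr", "foaf:based_near", "dpprop:population")
```

Because the vocabulary terms are themselves URIs, `expand("foaf:Person")` yields an address that can in principle be looked up like any other resource.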
8. How is it Open?
• "If you want to start interlinking data then you can only do that if the data is licensed in a way that allows such interlinking."
Rufus Pollock
• But why is Open data on the Web not "linked"?
  • CSV, XML, RDBs
    • no easy integration
  • Web 2.0 Mashups?
    • data sources fixed
  • Linked Open Data (LOD) cloud - global data space
10. What for?
• Linking Open Drug Data (LODD), since 2008
  • Publish/interlink publicly available data about drugs
  • Provide answers to non-trivial questions on the LODD
• For physicians
  • Which are the equivalent drugs for a given condition?
  • What drugs are currently under clinical trial?
• For patients
  • What alternatives exist to a given drug?
  • What are the contraindications for a drug?
11. Supplemental Slides
Petko Valtchev
(Assoc. Prof., Dept. of CS, UQAM)
ODX'13
Montreal, April 6th
12. Main Entry Points into the LOD cloud
• DBpedia - a large multi-domain dataset containing extracted data from Wikipedia; it contains about 3.77M concepts, 400+M facts with abstracts in 11 different languages.
• YAGO - precise knowledge base with 1.7M entities and 15M facts derived from Wikipedia and WordNet.
• FOAF (Friend Of A Friend) - describes people, the links between them and the things they create and do.
• GoodRelations - a vocabulary for eCommerce, enabling web sites to publish details of their products and services in a machine-readable way.
• GeoNames - provides RDF descriptions of more than 6.5M geographical features worldwide.
13. Cross-Media Cultural Heritage Management with LOD
• Simon is a Maths student visiting Montreal. He is fond of reading, cinema, music and history. His friends recommended the flourishing Mile End district to him, where many cafés serve espresso and European pastry.
• Once settled down in a bar, he opens his iPad to see what is exciting about the surroundings. Knowing his preferences, the mobile app suggests an excerpt from a novel written by the local "enfant du quartier", Mordecai Richler, called "The Apprenticeship of Duddy Kravitz". The excerpt describes the life of the Jewish community on two of the area's principal streets, St. Urbain St. and "The Main" St., in the 1930s.
• Once finished, Simon feels intrigued and accepts the suggestion to go for a short walk looking for remains from that period. While sipping his coffee, Simon checks the author's biography and finds he has written another book, "Barney's Version".
• After he skims a summary, the app suggests looking at the eponymous film directed by Richard J. Lewis. While watching a trailer, he notices the youthful red-haired actress playing the first wife of the main character and, after querying the app's knowledge base, he learns that she is Rachelle Lefevre, who was born in Montreal.
• Before walking out, he checks the availability of a copy of "Barney's Version" and discovers that he can find one in the local municipal library.
• When he is on the go, the system plays "I'm Your Man", a song by Leonard Cohen, another literary celebrity from Montreal.
14. The Semantic Annotations: RDFa
• RDFa serializes RDF through HTML attributes
• similar to microformats
• @resource, @property, @href, @typeof, @rel, etc.
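To make the attribute list concrete, here is a small Python sketch that reads triples out of an annotated HTML fragment using only the standard library. It is a deliberately simplified reader and not a conforming RDFa processor: it handles only `about`, `typeof`, `property` (with optional `content`) and `rel` plus `href`, and it ignores prefix declarations and subject scoping. The class name `SimpleRDFaParser`, the sample page and the `example.org` URL are hypothetical.

```python
from html.parser import HTMLParser


class SimpleRDFaParser(HTMLParser):
    """Collects (subject, predicate, object) tuples from a tiny RDFa subset."""

    def __init__(self):
        super().__init__()
        self.subject = None    # most recent @about value (no scoping)
        self.triples = []
        self._pending = None   # @property waiting for its text content

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]
        if "typeof" in a:
            self.triples.append((self.subject, "rdf:type", a["typeof"]))
        if "rel" in a and "href" in a:
            self.triples.append((self.subject, a["rel"], a["href"]))
        if "property" in a:
            if "content" in a:
                self.triples.append((self.subject, a["property"], a["content"]))
            else:
                self._pending = a["property"]

    def handle_data(self, data):
        # The element text becomes the literal value of a pending @property.
        if self._pending and data.strip():
            self.triples.append((self.subject, self._pending, data.strip()))
            self._pending = None

    def handle_endtag(self, tag):
        self._pending = None


PAGE = """
<div about="http://example.org/people/tedstr" typeof="foaf:Person">
  <span property="foaf:name">Ted Strauss</span>
  <a rel="foaf:based_near" href="http://dbpedia.org/resource/Montreal">Montreal</a>
</div>
"""

parser = SimpleRDFaParser()
parser.feed(PAGE)
triples = parser.triples
```

The page stays ordinary HTML for human readers; the attributes only add a machine-readable layer on top of it.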
15. Cool applications of semantic annotations
• Semantic query answering:
  • Where do my colleagues live?
    • Possible answers from their own web pages (via Trudat HP)
      • dbpedia:Montreal
      • dbpedia:Laval
      • dbpedia:Toronto
  • What are their dietary restrictions?
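The "where do my colleagues live?" question can be sketched as a merge of triples harvested from several pages followed by a two-step lookup. Everything named below is illustrative: the `ex:` identifiers and page URLs are hypothetical, `foaf:knows` is used as a stand-in for a colleague relation, and the three places are the ones listed on the slide.

```python
# Triples as they might be extracted from each colleague's own annotated page.
# All `ex:` identifiers and page URLs are hypothetical.
PAGES = {
    "http://example.org/~alice": [("ex:alice", "foaf:based_near", "dbpedia:Montreal")],
    "http://example.org/~bob": [("ex:bob", "foaf:based_near", "dbpedia:Laval")],
    "http://example.org/~carol": [("ex:carol", "foaf:based_near", "dbpedia:Toronto")],
}

# Ted's own page lists the colleagues (foaf:knows is an assumed stand-in).
MY_PAGE = [("pd:tedstr", "foaf:knows", c) for c in ("ex:alice", "ex:bob", "ex:carol")]


def merge(*sources):
    """Union of several triple lists; shared identifiers join the data."""
    graph = set()
    for triples in sources:
        graph.update(triples)
    return graph


def where_do_colleagues_live(graph, me):
    """Map each colleague of `me` to the place they are based near."""
    colleagues = {o for s, p, o in graph if s == me and p == "foaf:knows"}
    return {s: o for s, p, o in graph
            if s in colleagues and p == "foaf:based_near"}


graph = merge(MY_PAGE, *PAGES.values())
answers = where_do_colleagues_live(graph, "pd:tedstr")
```

No central database is involved: each person maintains their own page, and the merge works only because the pages reuse the same identifiers and vocabulary terms.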
16. Practical take on OD vs LOD
• OD for social justice in the US (say Atlanta)?
  • Dataset 1: census data
    • Focus on a particular area, with houses distinguished as
    • inhabited by black people vs white people
  • Dataset 2: water supply data, houses connected to water lines or not
  • By superposing datasets 1 and 2, the analysis uncovered discrimination
    • ~83 % of the unconnected houses were inhabited by black people!!!
• How was it done (a guess)
  • matching addresses by comparing them as strings :-(
• LOD format - simpler and more reliable processing:
  • finding paths in the graph
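The difference between matching addresses as strings and joining on shared identifiers can be shown with a toy example. The records below are invented for illustration (the real datasets are not reproduced in the slides), the `ex:house/...` URIs are hypothetical, and `join` is a plain dictionary join standing in for a graph query.

```python
# Hypothetical records: the same three houses described in two datasets,
# with addresses written slightly differently in each.
census = [
    {"address": "12 Oak St.", "uri": "ex:house/1", "residents": 4},
    {"address": "7 Elm Ave", "uri": "ex:house/2", "residents": 2},
    {"address": "3 Pine Rd", "uri": "ex:house/3", "residents": 5},
]
water = [
    {"address": "12 Oak Street", "uri": "ex:house/1", "connected": False},
    {"address": "7 Elm Ave", "uri": "ex:house/2", "connected": True},
    {"address": "3  Pine Road", "uri": "ex:house/3", "connected": False},
]


def join(left, right, key):
    """Pair up rows from both datasets whose values for `key` are identical."""
    index = {row[key]: row for row in right}
    return [(row, index[row[key]]) for row in left if row[key] in index]


by_string = join(census, water, "address")  # only exact string matches survive
by_uri = join(census, water, "uri")         # every house is matched
```

With string comparison two of the three houses silently drop out of the analysis; with one identifier per house the join is complete, which is the reliability argument made on the slide.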
17. Data about the Data
• Reasoning about the dataset:
  • Metadata:
    • e.g. Dublin Core vocabulary
  • Notion of provenance
  • The problem of trust: anybody can publish anything
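As an illustration of metadata expressed with the Dublin Core vocabulary, the sketch below describes this slide deck itself and prints the description as N-Triples-style lines. The deck URI is hypothetical, the values are taken from the title slide, and `to_ntriples` is a made-up helper that handles only plain string literals.

```python
# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical URI for this slide deck; the values come from the title slide.
DECK = "http://example.org/slides/linking-the-open-data"

metadata = {
    "title": "Linking the Open Data?",
    "creator": "Petko Valtchev",
    "date": "2013-04-06",
}


def to_ntriples(subject: str, properties: dict) -> str:
    """Serialize string-valued properties, one triple per line."""
    lines = []
    for name, value in properties.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{DC}{name}> "{escaped}" .')
    return "\n".join(lines)


print(to_ntriples(DECK, metadata))
```

Statements of this kind (who published what, and when) are the raw material for reasoning about provenance and, from there, about how far a dataset can be trusted.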
Editor's Notes
"The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web."
A way of publishing data on the web that:
• Encourages reuse
• Reduces redundancy
• Maximises inter-connectedness
• Enables network effects
There are many ways to introduce Linked Open Data. Links provide a context, and context is important for proper processing.
• Global identifier: URI
• Access to data: via HTTP
• Data model: RDF (a graph)
A graph of resources. Vertices and edges are typed by terms provided in vocabularies: vocabularies are published in an open and distributed fashion, and they can be mixed at will. Moreover, the vocabulary terms are also resources (identified via URIs). As with XML namespaces, shortcuts (prefixes) are used to avoid overloading the code with long URLs. FOAF is a vocabulary (schema) for representing people in the way LinkedIn sees them. DBpedia is an RDF version of Wikipedia: pages are translated into structured data.
But haven't we been putting linked data on the web for years? In CSV, relational databases, XML, etc.? Well yes, but these approaches are not so easy to integrate. Web 2.0 mashups work against a fixed set of data sources. Linked Data applications operate on top of an unbounded, global data space.
TODO: Microformats schema.org
Let us dream a little bit... Once there are RDFa annotations on a good number of pages, some more or less interesting questions can be answered directly from an RDFa-aware Web browser. Example: now I am Ted again and I want to know where my colleagues from Trudat live. A far more useful question could be: I want to invite my colleagues for dinner and therefore need to know their dietary restrictions. Instead of phoning them one by one or maintaining a local database of colleagues and friends, I trust their own RDFa-enabled personal web pages.
First of all, there is no particular semantics to provide for your data to be linked to other available data. Think of an example: remember how open data could support social justice in the US? The guy took the census data of an American city (say Atlanta). The focus was on a particular area, and he distinguished between houses inhabited by black people and houses inhabited by white people. He also took the water supply data, i.e., which houses were connected to the water lines. By superposing the datasets, he discovered that ~83 % of the unconnected houses were inhabited by black people!!! This was a proof of discrimination and a judge (district) ... Well, what he did was match the addresses in both datasets: he basically compared strings. This is what it is all about, and you know strings may not always match perfectly :-( In a LOD format, a URI (URL) would be assigned to each individual address, so that there is a unique way of identifying an entity (resource). The processing would have been simpler and more reliable: finding paths in the graph, using a dedicated query language, SPARQL. But the question is: DO the governments WANT us to have that much INSIGHT and at such a low PRICE?