TR-CS-05-06
Tracking RDF Graph Provenance using RDF Molecules
April 30, 2005
The Semantic Web can be viewed as one large "universal" RDF graph distributed across many Web pages. Working with the entire graph at once is impractical for many reasons, so we usually work with a decomposition into RDF documents, each of which corresponds to an individual Web page. While this is natural and appropriate for most tasks, it is still too coarse for some. For example, many RDF documents may redundantly contain the same data, and some documents comprise large amounts of weakly related or unrelated data. Decomposing a document into its individual RDF triples is usually too fine, because information may be lost if the graph contains blank nodes. We define an intermediate decomposition of an RDF graph G into a set of RDF "molecules", each of which is a connected sub-graph of the original. The decomposition is "lossless" in that the molecules can be recombined to yield G even if their blank node IDs are "standardized apart". RDF molecules provide a useful granularity for tracking the provenance of, or evidence for, information found in an RDF graph. Tracking at the document level (e.g., finding other documents with identical graphs) may find too few matches, and working at the triple level fails for any triple containing blank nodes. RDF molecules are the finest granularity at which we can do this without loss of information. We define the RDF molecule concept in more detail, describe an algorithm to decompose an RDF graph into its molecules, and show how these can be used to find evidence to support the original graph. The decomposition algorithm and the provenance application have both been prototyped in a simple Web-based demonstration.
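The abstract does not spell out the decomposition algorithm, so the following is only a minimal sketch of one plausible reading: triples that share a blank node are grouped into the same molecule, and each triple without blank nodes forms a molecule on its own. The triple representation (plain string tuples), the `_:` prefix test for blank nodes, and the names `is_blank` and `decompose` are assumptions made for illustration, not the report's implementation.

```python
from collections import defaultdict


def is_blank(term):
    # Assumption: blank nodes are written as "_:label", as in N-Triples.
    return term.startswith("_:")


def decompose(triples):
    """Group triples into molecules: sets of triples linked by shared blank nodes."""
    triples = list(triples)
    parent = list(range(len(triples)))  # union-find forest over triple indices

    def find(i):
        # Find the representative of i, halving paths along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    first_seen = {}  # blank node -> index of the first triple mentioning it
    for idx, (s, _p, o) in enumerate(triples):
        for term in (s, o):
            if is_blank(term):
                if term in first_seen:
                    union(first_seen[term], idx)
                else:
                    first_seen[term] = idx

    groups = defaultdict(set)
    for idx, triple in enumerate(triples):
        groups[find(idx)].add(triple)
    return [frozenset(group) for group in groups.values()]


# Hypothetical example graph: three triples share blank node _:b1.
graph = [
    ("ex:alice", "foaf:knows", "_:b1"),
    ("_:b1", "foaf:name", "Bob"),
    ("_:b1", "foaf:mbox", "mailto:bob@example.org"),
    ("ex:alice", "foaf:name", "Alice"),
]
molecules = decompose(graph)
```

Under these assumptions, taking the union of all molecules gives back the original set of triples, which is the sense in which such a decomposition is lossless. Splitting the three `_:b1` triples apart would instead lose the fact that they describe the same unnamed resource.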
TechReport
Stanford