Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Optimal RDF Access to Relational Databases

Eric Prud'hommeaux

Abstract

W3C's Resource Description Framework, RDF [1], has been of great interest for networked data expression due in part to the popularity of the World Wide Web, and in part to its foundations in more traditional knowledge representation and reasoning systems. Among the attractive features of this language is the ability to distribute ontology development by naming terms using dereferencable, globally unambiguous identifiers. For this, RDF uses URIs, whose scalability has been proven by the success of the web. This paper introduces a relational database/RDF gateway called Federate and describes how Federate executes RDF queries with the same efficiency as hand-tailored SQL queries.

SQL has worked well for intranets and homogeneous trust environments. When relational data is made available to wider audiences, custodians must create custom gateways. This is often done via web sites which present content generated out of the database; RDF provides a similar path from the database to the world, but without the loss of semantic precision which occurs in using HTML. RDF's universally grounded entities and relationships provide a mechanism to unify RDBs with other RDBs, as well as static data and other sources. The resulting network of universally interpretable data can be used as a medium for future data-oriented applications.

Table of Contents

1 Introduction
2 Overview of RDF Model
3 "Intranet" Scenario
5 "Internet" Scenario
5 Language Extents
6 Algorithm
8 Example 1 - OrderTracking Billing Information (JOIN)
9 Example 2 - OrderTracking Shipping Information (OUTER JOIN)
11 Cost/Efficiency
12 Migration Between RDF and Normalized Relations
13 Modeling n-ary Relations
14 Query Distribution
14 View Update
14 Repeated Properties
15 Conclusion
16 Appendix 1: Example Tables
17 Appendix 2: Syntax and Semantics of n3
17 Appendix 3: Syntax and Semantics of algae
18 Appendix 4: Glossary
19 Appendix 5: References

Introduction

Enthusiasm for XML and semi-structured data has prompted some shift of highly structured relational data into XML documents and databases. Despite this, storage of structured data is still highly concentrated in relational databases, and, for performance and integrity checking reasons, likely to stay that way. Tools like XQuery and Federate can be used to provide a consistent interface to the vast data stores in commercial infrastructure. This consistent interface simplifies the job of defining views of the data that will be available to different actors. These views may be joined with views from other publishers, creating a much more useful fabric of data that doesn't require human intuition (and luck) to interpret.

Storing RDF data in a relational database is not novel, however, Federate provides RDF agents with the ability to query a relational database with an application-specific schema. Querying RDF stored as Statements incurs the cost of an additional join per restriction condition (WHERE or JOIN ON clause in SQL). In general, properly normalized relational databases do not need a join for unique properties, i.e. properties for which there is one and only one slot. Federate provides RDF data with the speed, compactness, and integrity constraints of conventional relational databases.

Overview of RDF Model

RDF is a language for modeling data in binary relations. Relations are expressed in statements composed of a subject, an object, and a predicate (relationship) between them. These statements combine to form a Directed Labeled Graph (DLG). The nodes and relations in RDF are commonly identified by URIs (a superset of URLs). This provides a practical global grounding for all the statements in the graph.

RDF can be expressed in XML or in non-XML syntaxes like n3 [4],[5] or ntriples. The three statements in this n3 example:

Order 2185 has a customer of called Customer 1 .
Order 2185 is for a product called Product 1004 .
Order 2185 was ordered on "Jan 1 2001 13:24:55" .

can be easily expressed in an RDF language called n3, or visualized as a graph:

@prefix db: <http://example.com/OrderTracking/relDB#>
db:Order_2185 db:customer db:Customers_1 .
db:Order_2185 db:product db:Products_1004 .
db:Order_2185 db:orderDate "Jan 1 2001 13:24:55" .
DLG with three arcs from one node

As DLGs clearly express relationships between objects, RDF is an excellent language for expressing relational data. For instance, the above statements came from a row in an sample OrderTracking database:

Orders
idcustomerproductorderDate
21851100420010101132455
21862200120020902132455
21872200420020902132455
31832200520020902132455

The data expressed may be anything to which one may assign relationships. Data and meta-data are expressed in the same language and may appear in the same document. Those familiar with UML may note that RDF imposes no distinction between the data, model and meta-model (...). Statements are unordered just as propositions in predicate logic are unordered. Ordering of lists is handled within the model with specific predicates.

"Intranet" Scenario

Database federation needs typically arise from independently developed and administered data. Relational schemes impose strict understanding of the relationships of the modeled data. However, joining two databases can be tricky as these relationships may be less clear. This can arise from incomplete understanding, but more often will arise when the custodians of the data have different needs. While the elements in the microuniverse described in two databases may coincide, it is not guaranteed they will correspond exactly to entities in both databases. For instance, a vendor database that maintains a part number for each tracked item may not correspond to a client's database which has sales names for these items, or house-defined names for groups of the items, or part numbers from that vendor interspersed with part numbers from other vendors. Sometimes the correlation of two entities can only be asserted for a particular domain of query, e.g, for the purposes of this query, butter melting frobnosticators can be considered the same as heated mystifiers. This scenario considers the case where some practical queries can ignore the impedance mismatch occuring in data that is managed by different departments in a large organization.

For example, take a large company where manufacturing, sales and shipping departments have separate databases. These databases have distinct purposes and it would be technically or politically infeasible to coerce them all into one relational schema. Conventional supply chains are maintained by a reports generated and sent to the other departments. The queries generating these reports are scripted and inflexible. The sales department can tell manufacturing how much to make and shipping where to send the product, but someone in shipping would likely not be able to join the pending orders in the shipping database against materials shipments in the manufacturing database in order to prepare for an increased work level.

Custom code can be written to handle joins across two independently maintained databases within the same organization, but it will only be useful for those two databases, and serviced queries will have to have the same attitude about what items can be merged. It is frequently impractical to provide a meta language to describe how and when items can be conditionally merged. Development of such languages leads to many years of theoretical exploration.

Mapping databases to RDF has proven quite simple. Slightly more challenging is the use of ontology languages to join disparate databases. RDFS and OWL can be used to relate these databases, either in a universal it-is-always-true-that sense, or in a premise, for-the-purposes-of-this-query, sense. This allows one to code the expertise embedded in custom database gateways in a language that has already been developed to unambiguously model data.

Instead of simply defining mappings between two databases, it is also possible to define mappings to a "house standard" of entity references. This allows the mapping problem to be addressed once for each database, rather than once for each pair of database.

"Internet" Scenario

Because the "intranet" scenario did not rely on coordination of disparate databases within an organization, the same approach can be used to manage relationships between databases published by different organizations, for instance, a vendor and a product rating service.

A product database may have some public visibility in the form of a marketing portal. This is currently done by publishing product lists in HTML, frequently with forms for purchasing the advertised products. Consumer desire to use specialized tools to browse and submit orders may motivate the vendor to publish the data in a predictable, machine-readable language. This eliminates tedious screen scraping from the more necessary task of semantic interpretation of the data.

A consumer-oriented rating service, ala Consumer Reports, can publish their product reviews reusing the vendor's entity identifiers. If both the marketing portal and the rating service provide RDF forms of the data, it will be trivial for users to join the two sources. This is one ideal of the semantic web.

In the common scenario that publishers do not agree on common identifiers for items, the same mechanisms used in the "intranet" scenario can be used to establish the useful relationships.

Language Extents

At present, Federate works with two reasoning agents, algae [2], and cwm [3]. Most of the experimentation has been done on the W3C access control database and a sample OrderTracking database.

Federate can, in theory, express any common logical connectives as SQL. The cwm port can, at this time, express only conjunction, in part because the expression of disjunction and negation is not defined for the n3 language. The development platform, algae, can express and execute a moderate set:

algaecwm

logical connectives

conjunction?a ?b ?c . ?c ?d ?e:a :b :c . :c :d :e.
disjunction?a ?b ?c || ?d ?e ?fnot expressible
negation!(?a ?b ?c)not expressible
outer join?a ?b ?c ~ ?c ?d ?enot expressible

comparison predicates

less than?a ?b ?c {?c < 5}?a ?b ?c . ?c < 5 1
greater than?a ?b ?c {?c > 5}?a ?b ?c . ?c > 5 1
less than or equal?a ?b ?c {?c <= 5}?a ?b ?c . ?c <= 5 1
greater than or equal?a ?b ?c {?c >= 5}?a ?b ?c . ?c >= 5 1

domain constraints

regular expression~http://example.com/.*not implemented

1 planned but not implemented

Algorithm

The SQL query generator parses an RDF query and examines the predicates to determine from which relation they will come. The algorithm is roughly as follows:

Initialization:
0. compile table structure into Classes with Property objects

Before processing any queries, inspect the database and compile a list of properties represented by the fields in the tables. This produces a list of table/field names indexed by URI. The URI may be generated from the database names:

http://example.com/OrderTracking/relDB#Orders_orderDate => (Orders.orderDate)
http://example.com/OrderTracking/relDB#Orders_customer => (Orders.customer)
http://example.com/OrderTracking/relDB#customers_name => (Customers.name)

or overriden by a URI read from a configuration:

http://example.com/OrderTracking/relDB#orderDate => (Orders.orderDate)
http://example.com/OrderTracking/relDB#customer => (Orders.customer)
http://xmlns.com/foaf/0.1/fullName => (Customers.name)

The compilation also interrogates the database (or auxiliary configuration for databases without the functionality) to derive the relational structure (foriegn key/primary key associations). RDF Schema information may echo some of this structure, but it is not used because it is not authoritative.

RDF Query Analysis — iterate over query terms:
1. find the table/field for the predicate
Look up the above index.
2. ground the subect
Entities that are the subject of a triple will either be named explicitly by an URI or be identified by a variable.
variable subject
Each time a new variable is introduced to identify entities in the same relation, a new table alias is generated. This allows the Addresses relation to be used as both the home address and the shipping address below. For example, a Patient table may hold both a workAddress and homeAddress:
...
?workAdr streetName ?workSName.
?workAdr streetNumber ?workSNumber.
?homeAdr streetName ?homeSName.
?homeAdr streetNumber ?homeSNumber.
...
yields the aliase Addresses_workAdr and Addresses_homeAdr.
constant subject
This step can also perform an implicit entity mapping from RDF nodes to relational tuples. Each labeled node (a node with a URI identifier) in the RDF graph identifies a tuple in one of the relations (tables). In the simple case, URIs are decomposed into the entity's table name and primary key field values. For example http://localhost/OrderTracking#Customers_id_2 identifies a Customer with id equal 2. This step transforms the published URIs to a relation name and a series of constraints.
3. ground the object
The field identified in step 1 will either be a simple value, or a foreign key to another table.
simple value
The object of the tuple will constrain or select the simple value. For instance ?homeAdr streetNumber 7. will introduce the constraint Addresses_homeAdr.streetNumber=7.
foreign key
The object of the tuple will constrain or select tuples in the table to which the foriegn key points. For instance ?who homeAddress <http://...Addresses_id=717>. will introduce a join to the Addresses table with the constraintAddresses_3.id=7.
SQL Query Assembly:
4. generate SQL query
Steps 1 thourgh 3 produce a set of table aliases, constraints, and variables assocated with field values. Assemble these into an SQL query, recording the query variables and type (URI vs. Literal) for each SELECTed column.
SQL Result Interpretation:
5. transform SQL results to RDF statements
Assemble the SELECTed values into the URIs or Literals, assign those to the variables in the query, and add the resulting triples to the returned results.

Example 1 - OrderTracking Billing Information (JOIN)

The following algae query selects the order date, product name and the customer's name and billing address for each order placed on 7-September from the example OrderTracking database. This demonstrates access to a simple, but practical JOIN on a set of database tables. It . See the Appendix 1: example tables from which this data was taken, the algae script and the resulting graph.

ns db=<http://localhost/OrderTracking#>
attach <http://www.w3.org/1999/02/26-modules/algae#dynamic> (
                    class="W3C::Rdf::SqlDB"
                    properties="../test/OrderTracking.prop")
ask (?o	       db:Orders_id		   ?orderId .
     ?o	       db:Orders_customer	   ?c .
     ?o	       db:Orders_orderDate	   20020907 .
     ?o	       db:Orders_product	   ?p .
     ?p	       db:Products_name		   ?productName .
     ?c	       db:Customers_givenName	   ?first .
     ?c	       db:Customers_familyName	   ?last .
     ?c	       db:Customers_billingAddress ?billAddr .
     ?billAddr db:Addresses_street	   ?billStreet .
     ?billAddr db:Addresses_city	   ?billCity .
     ?billAddr db:Addresses_state	   ?billState)
collect (?orderId ?productName ?first ?last
         ?billStreet ?billCity ?billState)
table view of join on OrderTracking database

Following is a breakdown of the same query expressed in n3.

n3 distinguishes variables be explicitly listing existentials and universals. Following "log:forAll" is a list of universal quantifiers.

this log:forAll :o, :d, :p, :productName, :c, :first, :last, :billAddr, :billStreet, :billCity, :billState .
<sql://rdftest@swada.w3.org/OrderTracking/> is log:authoritativeService of
	Orders:id, Orders:customer, Orders:product, Orders:orderDate, 
	Products:id, Products:name, 
	Customers:id, Customers:familyName, Customers:givenName, Customers:billingAddress, 
	Addresses:id, Addresses:street, Addresses:city, Addresses:state.

This example query doesn't rely on any external schema definition; the external keys are associated with their table/primary key pairs with "log:pointsAt".

Orders:product log:pointsAt Products:id .
Orders:customer log:pointsAt Customers:id .
Customers:billingAddress log:pointsAt Addresses:id .

The graph constraints are expressed identically to the algae query above:

 :o Orders:id :orderId .
 :o Orders:customer :c .
 :o Orders:orderDate 20020907 .
 :o Orders:product :p .

 :p Products:name :productName .

 :c Customers:givenName :last .
 :c Customers:familyName :first .
 :c Customers:billingAddress :billAddr .   

 :billAddr Addresses:street :billStreet .
 :billAddr Addresses:city :billCity .
 :billAddr Addresses:state :billState .

Both of the implementations produce the following single SQL query:

SELECT Orders_0.id AS o_id,
       Customers_0.id AS c_id,
       Products_0.id AS p_id,
Products_0.name AS productName_name,
Customers_0.givenName AS first_givenName,
Customers_0.familyName AS last_familyName,
       Addresses_0.id AS billAddr_id,
Addresses_0.street AS billStreet_street,
Addresses_0.city AS billCity_city,
Addresses_0.state AS billState_state
FROM Orders AS Orders_0
     INNER JOIN Customers AS Customers_0 ON Orders_0.customer=Customers_0.id
     INNER JOIN Products AS Products_0 ON Orders_0.product=Products_0.id
     INNER JOIN Addresses AS Addresses_0 ON Customers_0.billingAddress=Addresses_0.id
WHERE Orders_0.orderDate="20020907"
Results
orderIdproductName first last billStreet billCity billState
"2185""pool" "Biff""Thompson""123 Elm Street""EdgeCity""AV"
"2187""nose ring" "Chip""Thompson""123 Elm Street""EdgeCity""AV"
"3183""other ring""Chip""Thompson""123 Elm Street""EdgeCity""AV"

Example 2 - OrderTracking Shipping Information (OUTER JOIN)

This example extends the OrderTracking JOIN example above by adding optional arcs for shipping address and contact (person to sign for the shipment). This demonstrates the algorithm's ability to manage multiple joins to the same table. See the example tables from which this data was taken, the algae script and the resulting graph.

ns db=<http://localhost/OrderTracking#>
attach <http://www.w3.org/1999/02/26-modules/algae#dynamic> (
                    class="W3C::Rdf::SqlDB"
                    properties="../test/OrderTracking.prop")
ask (?o		db:Orders_id		    ?orderId .
     ?o		db:Orders_customer	    ?c .
     ?o		db:Orders_orderDate	    ?d {?d >= 20020908000000 && ?d < 20020909000000} .
     ?c		db:Customers_givenName	    ?first .
     ?c		db:Customers_familyName	    ?last .
     ?c		db:Customers_billingAddress ?billAddr .
     ?billAddr	db:Addresses_street	    ?billStreet .
     ?billAddr	db:Addresses_city	    ?billCity .
     ?billAddr	db:Addresses_state	    ?billState .
     ~?o	db:Orders_shippingAddress   ?shipAddr .  
     ~?shipAddr	db:Addresses_street	    ?shipStreet .
     ~?shipAddr	db:Addresses_city	    ?shipCity .  
     ~?shipAddr	db:Addresses_state	    ?shipState . 
     ~?shipAddr	db:Addresses_contact	    ?signer .    
     ~?signer	db:Customers_givenName	    ?sFirst .    
     ~?signer	db:Customers_familyName	    ?sLast       )
collect (?orderId ?first ?last ?billStreet ?billCity ?billState
         ?sFirst ?sLast ?shipStreet ?shipCity ?shipState)

specifies that the Addresses table JOINed on the relvar Orders.shippingAddress, as well as the Customers table then joined on Addresses.contact, are JOINed with an OUTER join:

SELECT Orders_0.id AS o_id,
       Customers_0.id AS c_id,
       Orders_0.orderDate AS d_orderDate,
Customers_0.givenName AS first_givenName,
Customers_0.familyName AS last_familyName,
       Addresses_0.id AS billAddr_id,
Addresses_0.street AS billStreet_street,
Addresses_0.city AS billCity_city,
Addresses_0.state AS billState_state,
       Addresses_1.id AS shipAddr_id,
Addresses_1.street AS shipStreet_street,
Addresses_1.city AS shipCity_city,
Addresses_1.state AS shipState_state,
       Customers_1.id AS signer_id,
Customers_1.givenName AS sFirst_givenName,
Customers_1.familyName AS sLast_familyName
FROM Orders AS Orders_0
     INNER JOIN Customers AS Customers_0 ON Orders_0.customer=Customers_0.id
     INNER JOIN Addresses AS Addresses_0 ON Customers_0.billingAddress=Addresses_0.id
     LEFT OUTER JOIN Addresses AS Addresses_1 ON Orders_0.shippingAddress=Addresses_1.id
     LEFT OUTER JOIN Customers AS Customers_1 ON Addresses_1.contact=Customers_1.id     
WHERE Orders_0.orderDate>="20020908000000" 
  AND Orders_0.orderDate<"20020909000000"

yielding:

Results
orderIdfirst last billStreet billCity billStatesFirst sLast shipStreet shipCity shipState
"2185""Biff""Thompson""123 Elm Street""EdgeCity""AV" NULL NULL NULL NULL NULL
"2187""Chip""Thompson""123 Elm Street""EdgeCity""AV" NULL NULL NULL NULL NULL
"3183""Chip""Thompson""123 Elm Street""EdgeCity""AV" "Eustis""Walker""245 King Street""EdgeCity""AV"

Similar tests demonstrated the rest of the features described in the language extents.

Example 3 - Product Review (query federation)

This example demonstrates a "join" between relational data and two other sources of input. The relational component comes from the Products table in the OrderTracking database, while two external documents provide an alternate price list and a set of customer reviews:

ns ot=<http://localhost/OrderTracking#>
attach <http://www.w3.org/2003/01/21-RDF-RDB-access/ns#dbs> ot:db (
                    properties="../test/OrderTracking.prop")
ask ot:db (
       ?p  ot:Products_partNo	   ?partNo .
       ?p  ot:Products_price	   ?price {?price >= 15.00 && 
					   ?price <= 50.00} .
       ?p  ot:Products_description ?description)
read <file:../test/Federate0-jwlry.n3> (inputLang="n3") 
ask ( ~?j ot:Products_partNo	   ?partNo .
      ~?j ot:Products_price	   ?jprice)
read <file:../test/Federate0-reviews.rdf> () 
ns rvw=<http://localhost/Reviews#>
ask ( ~?p rvw:rating		   ?rating .
      ~?p rvw:distribution	   ?dist)
collect (?description ?price ?jprice ?rating ?dist)

The SQL query is trivial:

SELECT Products_0.id AS p_id,
       Products_0.partNo AS partNo_partNo,
Products_0.price AS price_price,
Products_0.description AS description_description
FROM Products AS Products_0
WHERE Products_0.price>="15.00" 
  AND Products_0.price<="50.00"

The purpose of this example is to show the underlying motivation for expressing these SQL queries as RDF; that is, the ability to join with data in other formats via a uniform query language.

See "Case Study: FeDeRate for Drug Research" [7] for a pharmacology use case.

Cost/Efficiency

The above example SQL queries demonstrate how Federate transforms RDF queries into SQL queries as efficient as those tailored by an experience database programmer. The execution cost for the SQL query is identical (or better, if the RDF structure helps the querier tailor a better query) to those written into conventional data federation gateways.

Query transformation is accomplished in a parsing pass (yacc), linear-time compiled tree transformation, and a linear serialization of the compile-tree into an SQL string.

Passing the RDF query as a single SQL query over a normalized relational store is optimal in two regards: 1, it eliminates the network from the process of unification. 2, it operates on an application-optimized database instead of a generic triple store.

Migration Between RDF and Normalized Relations

One goal of Federate is to model data in RDF so that it can "round trip", that is, go from RDF statements to the normalized database and back again without changing. For this reason, it is important to distinguish the attributes in the database that represent attributes of the modeled entity from foreign keys, which are used to preserve the graph model in relational databases.

Suppose Chip places an order for a nose ring and we wish to store this in a generic triple store. This would be composed of the triples:

[ customer [ givenName "Chip" ] .
  product  [ name "nose ring" ] ] .

which implies the following triples, using _:<n> to represent an unnamed node in the graph:

_:1 customer _:2 .
_:2 givenName "Chip" .
_:1 product _:3 .
_:3 name "nose ring" .

If we wanted a normalized database to optimize storage and retrieval of this information, we could create an OrderTracking database:

Orders Customers Products
idcustomerproduct
218722004
idgivenName
2Chip
idname
2004nose ring

In doing this, we must invent some integers (2187, 2, 2004) to use as identifiers for the rows in the tables. These enable use to store the relationships conveyed by _:1, _:2 and _:3 above. The id attribute for Orders, Customers and Products is not an attribute of the tuple or the entity that it describes, but an artifact of the mechanism for storing internal structure.

For consistency between the two expressions (RDF and normalized), it is necessary to treat all external keys (Orders.customer and Orders.product above) not as integers, but as references to the entity represented by the indicated row in some table. A consequence of this constraint is that a naive relational database that does not know the external keys, but instead treats all joins as being constrained on coincident attributes, will report the object of would-be foreign key arcs as an integer instead of a resource representing a tuple.

_:1 customer 2 .

This would be analogous to a generic triple store reporting that the object of an arc is some memory address instead of the node in the graph that was the true object of that arc.

When later informed of these external keys, the object will no longer be an integer, but instead an entity reference.

_:1 customer <http://...Customers#id.204> .

The value and even type of the object of the customer arc has changed. This is appropriate as this system is indeed inconsistent and reflects a change in the state of the microuniverse.

Frequently, large organizations assign identifiers to orders and products. These are seldom small integers that should be used as primary keys, but instead published alphanumeric strings appearing in catalogs or returned to the customer who has place an order.

Orders Customers Products
idcustomerproductorderID
218722004200303120001a
idgivenName
2Chip
idproductIDname
2004nr21a.5nose ring

This would add two triples to the above list.

_:1 customer _:2 .
_:2 givenName "Chip" .
_:1 product _:3 .
_:3 name "nose ring" .
_:1 orderID "200303120001a" .
_:3 prodID "nr21a.5" .

It is possible that a custodian of the relational store would choose to publish the external keys as order or product identifiers. In such a case, the primary key would be a published attribute of the entity represented by the table row. External keys would remain as opaque references to the entity.

_:1 customer _:2 .
_:2 givenName "Chip" .
_:1 product _:3 .
_:3 name "nose ring" .
_:1 orderID "2187" .
_:3 prodID "2004" .

The current implementations of Federate assign ids to rows in tables based on the primary key. This may not be good practice, but is temptingly convenient and useful.

Modeling n-ary Relations

A common criticism of RDF is the consequence of the decision to model relations as only binary. This requires additional entities to be created when modeling n-ary relations. For example, the common supplier scenario:

supplies (3-ary relation)

requires that a Supplies entity be created though it does not model a phenomena and not an object in the microuniverse. (It could be argued that the Supplies entity models a contract or a perceived acquisition agreement.) This turns out to not be a problem for mapping to the relational world as RDBs are similarly constrained. A relation may not have duplicate tuples, implying a primary key (with potentially more than one attribute). The entity represented by this tuple can be represented by a node in an RDF graph where the node can be named by a function of the attributes in the primary key. Alternatively, the node may be unnamed but distinguished by arcs describing the attributes in the primary key.

Query Distribution

Many relational databases provide a mechanism to distribute database tuples and attributes in a manner that allows a query planner to predict which sites will have which data. The same processes can be used by an SQL query engine handling Federate requests. This could also be done by the Federate engines cwm and algae as both have unifiers to merge data from SQL with data from other sources like HTTP or rules. At present, both implementations decide which SQL store to query based on the predicate. Distributions based on the subject or object would be analogous to relation databases with tuples distributed at different sites. This should not be significantly more difficult than the attribute distribution which is already implemented.

View Update

Inserting or replacing arcs derivative of a normalized relational store is analogous to the classic view update problem in relational databases. Briefly, a view is created by defining a query involving one or more base tables and assigning the resulting table a name. This name becomes available for constructing future queries. Whether the view is populated with tuples which must be updated when the base tables are updates, or simply invokes the defining query each time it is referenced, is an implementation detail that the custodian should not have to consider. The view update problem occurs when one inserts, updates, or deletes the triples in a view. Such updates need to reflect back to updates in the base tables in such a way that the base tables and the derivative view remain consistent.

C J Date outlines one approach to view update [6]. At this time, this approach has not been implemented in Federate. Given a set up triple assertions which reflect changes to the microuniverse, it seems feasible to use this approach to create a set of SQL updates resulting in a consistent system.

Repeated Properties

RDF graphs may have nodes with repeated properties, such as shippingAddress for someone who receives goods at more than one location. In relational databases, these relationships are traditionally modeled with a many-to-many table e.g. a table with a person's primary key paired with the primary key of that person's addresses. At present, Federate has no machinery to declare that a many-to-many table corresponds to a particular repeated property.

Conclusion

Skeptics of the semantic web often see the performance of generic triple stores and the cost of wholesale migration to them as major obstacles. Tools like Federate can eliminate that cost, offering continued use of existing databases and the infrastructure that relies on them, with the data federation and consistency of the semantic web.

Most companies don't know how to evaluate the computational expense of RDF Schema and OWL reasoners. The computational expense of relational databases is well understood and accepted. Tools that publish (perhaps just in the sense of supporting queries) relational data on the semantic web allow companies to reduce perceived risk when deciding to participate in the semantic web.

Appendix 1: Example Tables

Four tables were used in the examples above:

Orders
idcustomerproductorderDateshippingAddress
21851100420020907NULL
21862200120020908NULL
21872200420020907NULL
318322005200209072
Customers
idgivenNamefamilyNamebillingAddress
1Biff Thompson1
2Chip Thompson1
3EustisWalker 2
4Elie Tweak 3
Products
idname
1001white house
1002picket fence
1003sport utility vehicle
1004pool
1005grill
2001skateboard
2002rebellious music
2003earring
2004nose ring
2005other ring
Addresses
idaptstreetcitystatecontact
1NULL123 Elm Street EdgeCityAV1
21 245 King StreetEdgeCityAV3
318b Ally 17 BigCity AV3

Appendix 2: Syntax and Semantics of n3

A subset of n3 is used this paper. It can be thought of as a series of statements, each terminated by a '.'. Each statement consists of a subject, relation, and object. Each of the nodes may be a URI enclosed in <> or a qname as described in XML namespaces. For instance, the following asserts that eric and danbri know each other:

@prefix per: <http://www.w3.org/People/all#>
@prefix thing: <http://xmlns.com/foaf/0.1/>

per:eric foaf:knows per:danbri .
per:danbri foaf:knows per:eric .

Statements may be grouped in a formula. This has the effect of removing them from the assertions of that document. These formulae are used in, among other things, rules. The standard format of a rule is:

<> log:forAll ','-separated var list .
{ statements with vars } log:implies { statements with vars } .

In fact, n3 is much more than the portion used for examples in this document. Unlike Algae, it uses triples to declare rules (and implicitly, queries) and to declare the relationship between documents and the data in those documents. For more information, see A Rough Guide to N3 [4].

Appendix 3: Syntax and Semantics of algae

Algae is a simple, pragmatic query and rules language. The functions that are relevant to this paper are:

ns prefix=uri
declare the prefix prefix for the URI uri.
attach type dbName (parameter=value...)
attach to database of type type and call it dbName.
ask dbName (triple ...)
select a graph described by the triples from the database dbName. Each triple consists of a subject, relation, and object in the same syntax as n3.
collect (variable ...)
select variables from the results of an ask

The connectives are described above in Language Extents.

Appendix 4: Glossary

tuple
(In databases), a set of attributes characterizing an entity.
RDB relation
A relation is an abstraction of what we commonly call a table. In this abstraction, there is no order to the rows or columns and no two rows may have the same attributes (values). Attributes must be referenced by name and rows must be referenced by distinguishing attributes.
RDF relation
A predicate denoted by a URI describing the relationship between the subject and object of a statement.
ER relation
(In entity-relation models) the relationship between to modeled entities. Relations may label the relationship of one entity to another and may also label the relationship of the latter entity to the former. In addition, they may constrain the number of each entity that may participate in that relation.
domain (data type)
In databases, the type of an attribute in a tuple.
RDF domain
The data type of a subject of statements with a given predicate. For instance, the domain of a predicate drives could be driver.
RDF range
The data type of a object of statements with a given predicate. For instance, the range of a predicate drives could be vehicle.
proposition
A logical statement, an assertion.
impedance mismatch
(In data modeling) the difficulty in expressing data with one conceptual model in a format designed for a different model.
microuniverse
The fraction of the real world which is modeled in a database; also called the domain of discourse.

Appendix 5: References

[1] Dave Beckett: "RDF/XML Syntax Specification (Revised)", http://www.w3.org/TR/rdf-syntax-grammar/
[2] Eric Prud'hommeaux: "Algae RDF Query Language", http://www.w3.org/2004/05/06-Algae/
[3] Tim Berners-Lee: "Cwm", http://www.w3.org/2000/10/swap/doc/cwm
[4] Tim Berners-Lee, Sandro Hawk, and Dan Connolly: "Semantic Web Tutorial Using N3", http://www.w3.org/2000/10/swap/doc/
[5] Sean B. Palmer: "A Rough Guide to N3", http://infomesh.net/2002/notation3/
[6] C. J. Date: "An Introduction to Database Systems", 6th Edition. Addison-Wesley 1995
[7] Eric Prud'hommeaux: "Case Study: FeDeRate for Drug Research", http://www.w3.org/2004/10/04-pharmaFederate/

$Date: 2004/11/09 07:22:31 $Valid XHTML 1.0!