Tue 25 Oct 2005

twinql: a SPARQLing Autopsy, Part 1

Update: Calling this an “autopsy” gives an unintentionally morbid impression — I should make it clear that nothing related to SPARQL or twinql is dead!

There's also going to be a bit of explanatory code going into this entry soon.

Prompted by Danny, here follows an explanation of how I do query pattern resolution in twinql, my SPARQL library for the Wilbur Common Lisp RDF toolkit. I have since refactored twinql into separate independent libraries, so the query engine is now its own package, wilbur-graph-patterns.
Consider a simple triple pattern, (?x rdf:type ?y). It is clear that this should return, when resolved against an RDF graph, a list of triples: all those that have rdf:type as their predicate. Wilbur supports this natively as the db-query method, and it's quite simple — replace the variables with NIL.

Example: To run that query pattern against a Wilbur store:

* (db-query *db* nil !rdf:type nil)
(#<TRIPLE !foaf:OnlineChatAccount
!rdf:type !rdfs:Class {11282EB1}>
#<TRIPLE !foaf:mbox !rdf:type
!rdf:Property {11287EF9}>
#<TRIPLE !foaf:mbox !rdf:type
!owl:InverseFunctionalProperty {11289AA9}>
#<TRIPLE !foaf:mbox !rdf:type
!owl:ObjectProperty {1128A6C9}>
… )

If we want to fetch x and y from just this ‘pattern’, we can iterate over these triples:

* (dolist (x *) 
(format t "X: ~A, Y: ~A~%"
(triple-subject x)
(triple-object x)))

X: !foaf:OnlineChatAccount, Y: !rdfs:Class
X: !foaf:mbox, Y: !rdf:Property
Triple patterns can contain 0, 1, 2, or 3 variables.
  • 0 is a question: “does the store contain this triple?”. Solving this is easy — if the triple is present, we continue with the query; otherwise, we return no results.
  • 3 is a request for all triples. In Wilbur this resolves down to (db-triples db). Again, this is comparatively easy.
  • 1 or 2 require us to do more work, unifying the results of a db-query with an existing set of bindings (see below).
Things are a little more complex if a variable is repeated (e.g., (?x ?x rdf:Property), which should bind x to rdf:type), but the principle is the same.
We now need to introduce the concept of a binding — an association between a variable and a value. Given a graph like the following:
rdf:type rdfs:label "type" .
rdfs:label rdfs:label "label" .
a query, (?x rdfs:label ?label), will return two sets of bindings:
We have a unique binding for every possible combination of values in the graph that satisfies the pattern. If we introduced another label for rdf:type, we would have another row in the results table.
Graph patterns
The next most complicated pattern is a graph pattern: a collection of triple patterns. These are likely to share variables, but may also be disjoint. E.g., to find all people I know that know John, we could use the following graph pattern:
(triple !rich:RichardNewman !foaf:knows x)
(triple x !foaf:knows !ex:John)
(triple x !rdf:type !foaf:Person))
We can see that increasing the number of disjoint patterns will swell the number of results — additional combinations produce more results — while the number of intersecting patterns (intersecting in terms of variables used) will refine the number of results. Here we take every thing I ‘know’, remove those that don't know John, then remove those that aren't explicitly typed as foaf:Person.
Resolving triple patterns
So, what happens when we run some triple patterns against a store? Let's take it in steps.

Firstly, we need to keep track of our bindings. We start off with none. Each binding is a mapping from variables to values, and we can have multiple bindings, each corresponding to a row in the result table.

Secondly, we need to keep track of which variables have been ‘touched’. This is vitally important. Essentially, we alternate between two behaviours.
  • If a triple pattern contains any variables that have already been touched, then we refine the existing bindings by intersecting them with the results of applying this triple pattern (a simple query, you will recall).
  • If a triple pattern contains no touched variables, then we union the results together.
For the former, we use each existing binding to build a new binding list. We do this by taking the new triple pattern, binding its variables with the existing binding, and running it as a query against the store. If we get results, great — our new results are what we get by merging each result with the original binding hash. If we don't, we check if this is an optional pattern. If it is, then we pass back our original result; otherwise, this pattern failed.

Clear as mud, no?
An example
Assume we're half-way through this query:
(?x foaf:knows ?y)
(?y foaf:knows ?z)
on this data:
ex:bill foaf:knows ex:john , ex:mike .
ex:john foaf:knows ex:mike, ex:steve .
ex:mike foaf:knows ex:steve .
Our current bindings, then, stand at:
Now we come to a new pattern: (?y foaf:knows ?z). We know that y has already been touched, so we bind the triple pattern's variables with our existing bindings in turn, producing 5 new queries:
(ex:john foaf:knows ?z)
(ex:mike foaf:knows ?z)
(ex:mike foaf:knows ?z)
(ex:steve foaf:knows ?z)
(ex:steve foaf:knows ?z)
Yes, those duplicates are there for a reason! That reason is that the binding gets passed along, too, so although the query is actually redundant, the results are not.

Now, each of those queries is run against our data, returning 2, 1, 1, 0, 0 triples respectively. We'll take John (the first new query) and Steve (the last) as examples in the next part!

Posted at 2005-10-25 11:02:39 by RichardLink to twinql: a SPARQLin…