This page gives an overview of the strategies explored in the XR project to join data coming from XML and RDF instances.
In the sequel, we assume an input query Q = (h, QX, QR), where QX is the set of tree patterns appearing in Q, QR is the set of triple patterns appearing in Q, and h is a tuple are variables appearing in QX ∪ QR, called the query head. We call I = (IX, IR), an XR data instance, where IX is an XML instance and IR is an RDF instance.
XML||RDF : Independent executions
In this strategies, QX and QR are evaluated on IX and IR respectively. Their results are combined into a natural join. The algorithm is given below.
XML->RDF : Bind XML, then push to RDF
QX is evaluated first over IX, then for each tuple found, values bound to URI variables are pushed (as selections) into a copy of QR and added to a union of triple pattern queries. The final union of conjunctive queries obtained is finally evaluated on IR, and provides the answer to Q over I.
RDF->XML-URI: Bind RDF, then push URIs
This approaches works based on the assumptions that the XML instance has a means of mapping URIs to the node it contains.
QR is evaluated first over IR, then for each tuple found, values bound to URI variables are pushed into a copy of QX and added to a union of tree pattern queries. The final union of conjunctive queries obtained is finally evaluated on IX, and provides the answer to Q over I.
RDF->XML-XPath: Bind RDF, then push XPaths
In situations where the XML instance has no means of mapping a given URI to a node it contains, one needs to “externally” dereference the URI, i.e. obtain an exact path to the node it refers to. Then, path expressions rather then URIs will be carried over the tree pattern queries of Q.
QR is evaluated first over IR, then for each tuple found and for each URI u bound to a URI variable v in QX, obtain the exact path of the node u refers to. Add to q a predicate stating that the node labeled with variable v should match that of the given exact path. The newly created subquery is added to a union of tree pattern queries. The final union of conjunctive queries obtained is finally evaluated on IX, and provides the answer to Q over I.
Tuple-at-a-time approaches
It may not always be possible nor efficient to evaluate a union of tree pattern queries on the XML instance. In such a case, one may evaluate multiple XML queries in sequence.
In other words, QR is evaluated first on IR, then for each tuple found, values bound to URI variables are pushed (as selections) into q, a copy of QX, and directly evaluated on IX. The result of each successive evaluations of q are combined to produce the overall result. This general approach may apply to different variants of RDF-to-XML join strategies. The following algorithm details one of them, namely RDF=>XML-URI.
Pruning algorithms
Whether or not the tree patterns of an XR query are evaluated as a single union or as a sequence of simpler queries, that execution may be expensive overall. Fortunately, it is possible to prune out some of the XML sub-queries, if we are given some insight about the underlying structure of the data.
We call dereferencing the ability to get an XML node from a URI. There may be many ways to implemented a dereferencing function, such as using an index, or a XML-URI encoding scheme. In the sequel, we assume that such a dereferencing mechanism is available.
Suppose QR has been evaluating, and for each binding, you consider pushing the URI-variable bound values to q, a copy of QX.
Let us consider a URI u bound to a variable v appearing in QX.
One first observation to make if that, if u is not an XML node URI, i.e. dereferencing fails, then pushing u to q will necessarily render it unsatisfiable. Therefore the whole binding tuple can be discarded.
Second, assuming the dereferencing provides the exact path of the node corresponding to u, you can check in linear time if this path is compatible with the tree pattern of q, in which v appears. If the path is not compatible with any of the considered tree patterns, then q is not satisfiable and q can be ignored.
Finally, given a second URI t bound to a variable w, and assuming both t and u can be dereferenced, you may want to compare the documents in which they appear. If these are different, but v and w appearing in a common tree pattern in q, then q will clearly not be satisfiable.
These pruning techniques can be applied to all variants of the RDF-to-XML algorithms, provided a dereferencing mechanism is available. The algorithm below details the pruning process for one of those algorithms: RDF=>XML-XPath.
Materialization-based approaches
The general idea behind this new set of approaches is to evaluate one side of the query, and push its results to the other’s side data instance rather than the other side’s query.
Bind RDF, then push to XML data
QR is evaluated over IR, then its result is a materialized in a temporary XML instance I’X. Finally, QX is rewritten into a new query Q’X, such that Q’X(IX ∪ I’X)=Q(I).
Bind XML, then push to RDF data
QX is evaluated over IX, then its result is a materialized in a temporary RDF instance I’R. Finally, QR is rewritten into a new query Q’R, such that Q’R(IR ∪ I’R)=Q(I).