Overview
The PAXQuery engine seamlessly parallelizes the execution of XQuery queries. By applying on-the-fly translation and optimization procedures, PAXQuery runs user queries over massive collections of XML documents in a distributed fashion. PAXQuery runs on top of Apache Flink (previously known as Stratosphere), a parallel execution platform that relies on the PACT (Parallelization Contracts) model.
After the user submits an XQuery query, the engine builds an equivalent tree of algebraic operators that work on nested tuples. The set of operators includes navigation, group by, aggregation, selection, and projection, among others.
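To make the idea of an operator tree over nested tuples more concrete, here is a minimal sketch in Python. It is not PAXQuery's actual algebra or code: the class names (Scan, Selection, GroupBy, Aggregation, Projection), the dict-based tuple representation, and the sample data are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

# Assumption: a nested tuple is modeled as a dict whose values are either
# atomic values or lists of further nested tuples.


class Operator:
    """Base class for the hypothetical algebraic operators."""

    def evaluate(self) -> Iterator[dict]:
        raise NotImplementedError


@dataclass
class Scan(Operator):
    """Leaf operator: produces the input tuples unchanged."""
    tuples: list

    def evaluate(self):
        yield from self.tuples


@dataclass
class Selection(Operator):
    """Keeps only the tuples that satisfy the predicate."""
    child: Operator
    predicate: Callable[[dict], bool]

    def evaluate(self):
        return (t for t in self.child.evaluate() if self.predicate(t))


@dataclass
class GroupBy(Operator):
    """Emits one tuple per key; group members are nested under `nested`."""
    child: Operator
    key: str
    nested: str

    def evaluate(self):
        groups = {}
        for t in self.child.evaluate():
            groups.setdefault(t[self.key], []).append(t)
        for k, members in groups.items():
            yield {self.key: k, self.nested: members}


@dataclass
class Aggregation(Operator):
    """Applies `func` to one field of the nested collection of each tuple."""
    child: Operator
    nested: str
    field: str
    output: str
    func: Callable

    def evaluate(self):
        for t in self.child.evaluate():
            values = (m[self.field] for m in t[self.nested])
            yield {**t, self.output: self.func(values)}


@dataclass
class Projection(Operator):
    """Keeps only the listed fields of each tuple."""
    child: Operator
    fields: list

    def evaluate(self):
        for t in self.child.evaluate():
            yield {f: t[f] for f in self.fields}


# Illustrative data: total price per author for books from 2005 onwards.
books = [
    {"author": "Ann", "year": 2010, "price": 30},
    {"author": "Bob", "year": 2012, "price": 20},
    {"author": "Ann", "year": 2014, "price": 50},
    {"author": "Bob", "year": 2001, "price": 15},
]
recent = Selection(Scan(books), lambda t: t["year"] >= 2005)
grouped = GroupBy(recent, key="author", nested="books")
totals = Aggregation(grouped, nested="books", field="price", output="total", func=sum)
plan = Projection(totals, ["author", "total"])
result = list(plan.evaluate())
```

Each operator only consumes its child's output, so a tree like this can be rewritten (for example, by pushing the selection closer to the scan) before it is translated into a parallel plan.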
Once the tree is built and optimized, the engine compiles it into a PACT plan consisting of implicitly parallel operators such as Map, Reduce, Match, CoGroup, or Cross. The resulting plan is handed to the Apache Flink platform, which is responsible for optimizing the PACT plan and executing it in parallel, for example over HDFS or the local filesystem.
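The following sketch illustrates, sequentially and in plain Python, how the five PACT operators named above hand records to a user function. It is not the Flink or Stratosphere API: the function names (pact_map, pact_reduce, and so on), the (key, value) record layout, and the sample data are assumptions, and the parallel execution itself is deliberately left out.

```python
from collections import defaultdict
from itertools import product


def group_by_key(records):
    """Collect the values of (key, value) records per key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def pact_map(udf, records):
    """Map: the user function is called once per input record."""
    return [out for rec in records for out in udf(rec)]


def pact_reduce(udf, records):
    """Reduce: the user function is called once per group of equal keys."""
    groups = group_by_key(records)
    return [out for key, values in groups.items() for out in udf(key, values)]


def pact_match(udf, left, right):
    """Match: called for every pair of records from both inputs with equal keys."""
    lg, rg = group_by_key(left), group_by_key(right)
    return [
        out
        for key in lg if key in rg
        for l, r in product(lg[key], rg[key])
        for out in udf(key, l, r)
    ]


def pact_cogroup(udf, left, right):
    """CoGroup: called once per key with both groups, either of which may be empty."""
    lg, rg = group_by_key(left), group_by_key(right)
    keys = list(lg) + [k for k in rg if k not in lg]
    return [out for k in keys for out in udf(k, lg.get(k, []), rg.get(k, []))]


def pact_cross(udf, left, right):
    """Cross: called for every pair in the Cartesian product of both inputs."""
    return [out for l, r in product(left, right) for out in udf(l, r)]


# Illustrative (key, value) inputs keyed by a hypothetical author id.
authors = [(1, "Ann"), (2, "Bob"), (3, "Cid")]
books = [(1, "XML"), (1, "XQuery"), (2, "Flink")]

upper = pact_map(lambda rec: [(rec[0], rec[1].upper())], books)
per_author = pact_reduce(lambda k, titles: [(k, len(titles))], books)
joined = pact_match(lambda k, a, b: [(a, b)], authors, books)
counts = pact_cogroup(lambda k, a, b: [(a[0], len(b))] if a else [], authors, books)
pairs = pact_cross(lambda l, r: [(l[1], r[1])], authors, books)
```

In this sketch, Match behaves like an equi-join and drops author 3, who has no books, whereas CoGroup still sees that key with an empty group, which is the kind of behavior needed for outer-join-like or nested results.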
Open-source release
PAXQuery is open-source and can be found here. If you would like to get involved, send us a message!
People involved (listed in alphabetical order)
Publications
- PAXQuery: Parallel Analytical XML Processing
Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu, and Juan A. M. Naranjo
Demonstration at SIGMOD 2015.
- PAXQuery: Efficient Parallel Processing of XQuery
Jesús Camacho-Rodríguez, Dario Colazzo, and Ioana Manolescu
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2015.
- PAXQuery: A Massively Parallel XQuery Processor
Jesús Camacho-Rodríguez, Dario Colazzo, and Ioana Manolescu
Short paper at the 3rd International Workshop on Data analytics in the Cloud (DanaC ’14), colocated with SIGMOD/PODS 2014.
Acknowledgements
This project has been partially funded by the ICTLabs of the European Institute of Innovation and Technology.