The research challenge is to design new techniques for processing scientific data in a distributed and parallel manner, leveraging multiple parallel machines in the cloud. In particular, this requires studying the impact of intersite data movement on the performance/cost trade-offs of our algorithms.
Our approach is to capitalize on the principles of distributed and parallel data management. In particular, we exploit: algebraic languages as the basis for automatic optimization and parallelization of scientific workflows; adaptive scheduling as the basis for multisite resource management; dynamic data partitioning as the basis for parallel data processing and data mining; data provenance queries as the basis for the dynamic steering of workflows by users; and probabilistic databases for uncertain data integration. Furthermore, we study techniques for moving data efficiently across sites.
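The role of dynamic data partitioning in parallel processing can be illustrated with a minimal sketch. This is not the implementation used in our prototypes (which operate on workflow activations inside the SWfMS engine); it is an illustrative Python fragment with hypothetical record fields, showing the partition/process/combine pattern that underlies frameworks such as MapReduce and Spark:

```python
# Hash-partition a dataset, process each partition independently,
# then combine the partial results. In a multisite deployment each
# partition could be assigned to a different machine or site, which
# is where intersite data movement costs come into play.

def partition(records, key, n_parts):
    """Assign each record to a bucket by hashing its partitioning key."""
    parts = [[] for _ in range(n_parts)]
    for rec in records:
        parts[hash(rec[key]) % n_parts].append(rec)
    return parts

def local_sum(part):
    """Per-partition task; runs independently, so partitions can be
    processed in parallel on separate workers."""
    return sum(rec["value"] for rec in part)

data = [{"site": f"s{i % 4}", "value": i} for i in range(100)]
parts = partition(data, "site", 4)
partials = [local_sum(p) for p in parts]  # parallel in a real engine
total = sum(partials)
print(total)  # 4950
```

Repartitioning dynamically (e.g., when the scheduler detects skew or a slow site) amounts to re-running `partition` with a different key or bucket count before dispatching the next round of tasks.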
We validate our techniques by building software prototypes that exploit the expertise of the two teams with data processing frameworks (MapReduce, Spark), scientific workflow management systems (SWfMS: Chiron, SciCumulus) and modern DBMS (MonetDB, SciDB). We apply these techniques to real-world scientific data obtained from our application partners in astronomy, bioinformatics and computational engineering.