SciDISC (Scientific data analysis using Data-Intensive Scalable Computing) is an associated team (“équipe associée”) between Zenith and four teams in the state of Rio de Janeiro (LNCC, COPPE/UFRJ, UFF and CEFET), active since January 2017. SciDISC is headed by Marta Mattoso (COPPE/UFRJ) and Patrick Valduriez (Zenith).
Data-intensive science requires the integration of two fairly different paradigms: high-performance computing (HPC) and data-intensive scalable computing (DISC). HPC is compute-centric and focuses on the high performance of simulation applications, typically using powerful yet expensive supercomputers. DISC, on the other hand, is data-centric and focuses on the fault tolerance and scalability of web and cloud applications, using cost-effective clusters of commodity hardware. Examples of DISC systems include big data processing frameworks such as Hadoop and Apache Spark, as well as NoSQL systems. To harness parallel processing, HPC uses a low-level programming model (such as MPI or OpenMP), while DISC relies on powerful data processing operators (Map, Reduce, Filter, …). Data storage is also quite different: supercomputers typically rely on a shared-disk infrastructure, and data must be loaded onto compute nodes before processing, whereas DISC systems rely on a shared-nothing cluster (of disk-based nodes) and data partitioning.
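To make the DISC programming style concrete, here is a minimal sketch in plain Python. It is not taken from Hadoop, Spark, or any SciDISC software; the names `partition` and `run_pipeline` and the sample readings are hypothetical, chosen only for illustration. It shows data split into disjoint partitions, with Filter, Map and Reduce applied to each partition independently before the partial results are merged, which is the pattern a shared-nothing cluster would run in parallel on separate nodes.

```python
from functools import reduce


def partition(records, n_parts):
    """Split records into n_parts disjoint partitions (round-robin).

    Hypothetical helper: a real DISC system would place each partition
    on a different node of a shared-nothing cluster.
    """
    return [records[i::n_parts] for i in range(n_parts)]


def run_pipeline(partitions, keep, transform, combine, zero):
    """Apply Filter, Map and Reduce on each partition, then merge.

    Each partition is processed without looking at the others, so the
    loop body could run on separate nodes. `combine` must be associative
    and `zero` must be its identity for the merge step to be correct.
    """
    partials = []
    for part in partitions:
        kept = filter(keep, part)          # Filter: drop unwanted records
        mapped = map(transform, kept)      # Map: transform each record
        partials.append(reduce(combine, mapped, zero))  # Reduce: local aggregate
    # Final Reduce: merge the per-partition aggregates
    return reduce(combine, partials, zero)


# Illustrative data only: sum of squares of the non-negative readings.
readings = [12.5, -1.0, 7.25, 3.0, -4.5, 9.0]
parts = partition(readings, 3)
total = run_pipeline(
    parts,
    keep=lambda x: x >= 0,
    transform=lambda x: x * x,
    combine=lambda a, b: a + b,
    zero=0.0,
)
print(total)
```

By contrast, an MPI or OpenMP program would spell out the distribution of work and the exchange of partial results explicitly, which is the low-level model the paragraph above attributes to HPC.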
Spurred by the growing need to analyze big scientific data, the convergence between HPC and DISC has recently become a topic of interest. However, simply porting the Hadoop stack to a supercomputer is not cost-effective and does not solve the scalability and fault-tolerance issues addressed by DISC. On the other hand, DISC systems have not been designed for scientific applications, which have different requirements in terms of data analysis and visualization. This project will address the grand challenge of scientific data analysis using DISC (SciDISC) by developing architectures and methods to combine simulation and data analysis.