Miguel Liroz Gistau
Apache Spark is an efficient, general-purpose large-scale data processing engine. Leveraging Hadoop's ecosystem and backed by a large and active community, Spark has established itself as one of the main alternatives for big data analytics. Unlike MapReduce, it allows the user to specify arbitrary workflows that are executed in a memory-efficient way, delivering significant performance improvements, especially for iterative algorithms. It also provides an interactive, multi-language interface and integrated libraries for SQL, machine learning, streaming and graph processing. In this talk, we describe the general architecture of the framework and its main components. In particular, we focus on the DataFrame abstraction, which brings many of the optimizations developed in the database community to Spark and greatly simplifies data management. We also provide practical guidelines on how to execute workflows efficiently in Apache Spark, and we present hadoop_g5k, a middleware implemented in the Zenith team that simplifies the deployment and management of Spark on large clusters of machines such as Grid'5000.