Real-time processing in the cloud: the Storm case

Distributed stream processing has become a leading approach for analysing large volumes of data in real time. The Internet of Things, stock trading and web traffic monitoring all continuously push data for immediate processing. To address the challenge of handling data of high volume and high velocity, several stream processing engines have emerged, including Spark Streaming based on Spark[1], Storm[2], Flink[3] and Samza[4]. These platforms are known to be used intensively at large scale by various actors (e.g. Yahoo!, Twitter, LinkedIn). The reputation of these frameworks is often measured in terms of raw performance (number of events processed per second). From our point of view, another aspect to take into account is the capacity of the framework to adapt to changes caused by external factors: for example, these systems are subject to overload and failures. Several steps have been taken in this direction [5,6,7,8].

In a cloud context, a recent effort has been made to integrate the Storm framework into OpenStack[5]. OpenStack is the leading open source solution for creating private or public clouds. The integration is provided by the Sahara[6] plugin of OpenStack. The plugin allows users to deploy Storm clusters on demand, either through the OpenStack dashboard or programmatically through the Sahara API[7].
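As an illustration of the programmatic route, here is a minimal Python sketch of the two requests an adaptation loop would most likely issue: creating a Storm cluster and resizing one of its node groups. It only builds the HTTP method, path and JSON body and sends nothing. The function names are hypothetical, and the paths and field names follow the Sahara REST API v1.1 as we understand it, so they should be checked against the Sahara release actually deployed. Authentication (a Keystone token) is left out.

```python
import json


def create_cluster_request(project_id, name, template_id, image_id, storm_version):
    """Build (method, path, body) for creating a Storm cluster.

    Assumption: Sahara v1.1 layout, where the plugin version is passed
    in the field named "hadoop_version" whatever the plugin is.
    """
    body = {
        "name": name,
        "plugin_name": "storm",
        "hadoop_version": storm_version,
        "cluster_template_id": template_id,
        "default_image_id": image_id,
    }
    return "POST", f"/v1.1/{project_id}/clusters", body


def scale_cluster_request(project_id, cluster_id, node_group, count):
    """Build (method, path, body) for resizing one node group of a cluster.

    Assumption: scaling is a PUT on the cluster resource with a
    "resize_node_groups" list, as in Sahara v1.1.
    """
    body = {"resize_node_groups": [{"name": node_group, "count": count}]}
    return "PUT", f"/v1.1/{project_id}/clusters/{cluster_id}", body


if __name__ == "__main__":
    # All identifiers below are placeholders, not real resources.
    method, path, body = scale_cluster_request(
        "my-project", "my-cluster-id", "worker", 8
    )
    print(method, path)
    print(json.dumps(body, indent=2))
```

Keeping request construction separate from the HTTP transport makes the adaptation logic easy to test without a running OpenStack deployment.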

The internship will focus on an extensive evaluation of Storm's capabilities in the OpenStack environment in terms of:

(1) scalability – How large can a Storm cluster deployed using Sahara be?

(2) performance – What impact does the cloud layer have on a running Storm cluster?

(3) self-adaptation – To what extent can the API provided by Sahara be used to dynamically adapt a running Storm cluster?

The work conducted during the internship will mainly consist of answering the three questions above. It will rely heavily on deploying Storm, OpenStack and the Sahara framework on the Grid’5000 platform[8].

Prerequisites: Good programming skills in Python are required. Knowledge of, and practical experience with, data management or more generally distributed systems are a plus.

Send your application to matthieu.simonin@inria.fr and cedric.tedeschi@inria.fr

[1] http://spark.apache.org/streaming/

[2] https://storm.apache.org/

[3] https://flink.apache.org/

[4] http://samza.apache.org/

[5] https://www.openstack.org/

[6] http://docs.openstack.org/developer/sahara/

[7] https://github.com/openstack/sahara

[8] https://www.grid5000.fr