Uniform Cloud and Edge Stream Processing for Machine Learning-based Analytics

  • Advisors: Alexandru Costan (KerData team), Nicolae Tapus (University Politehnica of Bucharest), Gabriel Antoniu (KerData team)
  • Main contact: alexandru.costan (at) irisa.fr
  • Expected start date: October 1st, 2020
  • Application deadline: as early as possible, no later than May 30, 2020

Description

The recent spectacular rise of the Internet of Things (IoT) and the associated growth of the data deluge motivated the emergence of Edge computing [1] as a means to distribute processing from centralized Clouds towards decentralized processing units close to the data sources. The key idea is to leverage computing and storage resources at the “edge” of the network, i.e., near the places where data is produced (e.g., sensors, routers, etc.). These resources can be used to filter and pre-process data or to perform (simple) local computations (for instance, a home assistant may perform a first lexical analysis before requesting a translation from the Cloud). Moreover, this approach supports the next generation of analytics and Machine Learning applications spanning from the Edge (where data is generated) to the Fog/Cloud and HPC centers (where it is processed).

This scheme led to new challenges regarding the ways to distribute processing across Cloud-based, Edge-based or hybrid Cloud/Edge-based infrastructures. State-of-the-art approaches advocate either “100% Cloud” or “100% Edge” solutions. In the former case, a plethora of Big Data stream processing engines like Apache Spark [2] and Apache Flink [3] emerged for data analytics and persistence, using the Edge devices merely as proxies that forward data to the Cloud. In the latter case, Edge devices are used to take local decisions and enable the real-time promise of the analytics, improving the reactivity and “freshness” of the results. Several Edge analytics engines emerged lately (e.g., Apache Edgent [4], Apache Minifi [5]), enabling basic local stream processing on low-performance IoT devices. The relative efficiency of one approach over the other may vary. Intuitively, it depends on many parameters, including network technology, hardware characteristics, volume of data, computing power, processing framework configuration and application requirements, to cite a few.

Problem statement

Hybrid approaches, combining both Edge and Cloud processing, require deploying two different processing engines, one for each platform. This involves two different programming models, synchronization overheads between the two engines and empirical splits of the processing workflow across the two infrastructures, which eventually lead to sub-optimal performance.

Thesis goal

This PhD thesis aims to propose a unified stream processing model for both Edge and Cloud platforms enabling the seamless execution of Machine Learning algorithms. In particular, the thesis will seek potential answers to the following research question: how much can one improve (or degrade) the performance of an application by performing computation closer to the data sources rather than keeping it in the Cloud? To this end, the PhD thesis will first devise a methodology to understand the performance trade-offs of Edge-Cloud executions, and then design a unified processing model capable of exploiting the semantics of both platforms. The model will be implemented and experimentally evaluated with representative real-life stream processing use-cases executed on hybrid Edge-Cloud testbeds. The high-level goal of this thesis is to enable the usage of a large spectrum of Machine Learning based analytics at extreme scales, to support fast decision making in real-time.
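The research question above can be illustrated with a deliberately simple latency model. The sketch below is a toy under stated assumptions, not the methodology the thesis will devise: the function names, the parameters and the example figures are all hypothetical, and the model ignores queuing, batching and contention. It compares shipping raw data to the Cloud with processing it on the Edge device first and shipping only a reduced result.

```python
def latency_cloud_only(input_bytes, work_ops, link_bytes_per_s, rtt_s, cloud_ops_per_s):
    """Ship the raw data to the Cloud, then process it there (toy model)."""
    transfer = input_bytes / link_bytes_per_s
    compute = work_ops / cloud_ops_per_s
    return rtt_s + transfer + compute


def latency_edge_first(input_bytes, work_ops, link_bytes_per_s, rtt_s,
                       edge_ops_per_s, reduction):
    """Process on the Edge device, then ship only the reduced result (toy model).

    `reduction` is the assumed fraction of the input that still has to be sent
    after local processing (0.01 means 1% of the original volume).
    """
    compute = work_ops / edge_ops_per_s
    transfer = (input_bytes * reduction) / link_bytes_per_s
    return compute + rtt_s + transfer


# Hypothetical workload: 10 MB of sensor data, 2e8 operations of processing.
workload = dict(input_bytes=10_000_000, work_ops=2e8, rtt_s=0.05)

# Slow uplink (1 MB/s): moving the computation to the Edge pays off.
slow_cloud = latency_cloud_only(link_bytes_per_s=1e6, cloud_ops_per_s=1e10, **workload)
slow_edge = latency_edge_first(link_bytes_per_s=1e6, edge_ops_per_s=1e8,
                               reduction=0.01, **workload)

# Fast uplink (1 GB/s): the weak Edge CPU becomes the bottleneck instead.
fast_cloud = latency_cloud_only(link_bytes_per_s=1e9, cloud_ops_per_s=1e10, **workload)
fast_edge = latency_edge_first(link_bytes_per_s=1e9, edge_ops_per_s=1e8,
                               reduction=0.01, **workload)

print(f"slow link: cloud={slow_cloud:.2f}s edge={slow_edge:.2f}s")
print(f"fast link: cloud={fast_cloud:.2f}s edge={fast_edge:.2f}s")
```

Even in this crude form, the same workload is faster on the Edge with a slow uplink and faster in the Cloud with a fast one, which is why the trade-off has to be measured rather than assumed.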

Target use-case

The unified processing model will be evaluated using the requirements of a real-life production application, provided by our partners from University Politehnica of Bucharest. The unified model will be used as the processing engine of the MonALISA [6] monitoring system of the ALICE experiment at CERN [7]. ALICE (A Large Ion Collider Experiment) is one of the four LHC (Large Hadron Collider) experiments run at CERN (European Organization for Nuclear Research). ALICE collects data at a rate of up to 4 Petabytes per run and produces more than 10^9 data files per year. Tens of thousands of CPUs are required to process and analyze them. More than 350 MonALISA services are running at sites around the world, collecting information about ALICE computing facilities, local- and wide-area network traffic, and the state and progress of the many thousands of concurrently running jobs. Currently, all the monitoring and alerting logic is implemented in the Cloud, with high latency. The proof-of-concept envisioned by this thesis aims to process monitoring data across the Edge and the Cloud, in order to enable:

  • faster alerts and decision making, as soon as the data is collected
  • smart predictions and Machine Learning-based analytics
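As a rough illustration of the first goal, the sketch below shows how an alert could be raised on the Edge device itself, as soon as readings arrive, while only a compact summary is forwarded to the Cloud. It is a minimal sketch under stated assumptions: the `EdgeAlerter` class, the sliding-window rule and the CPU-load readings are hypothetical and do not reflect the MonALISA API or its alerting logic.

```python
from collections import deque


class EdgeAlerter:
    """Hypothetical Edge-side monitor: alerts locally on a sliding-window mean."""

    def __init__(self, window_size, threshold):
        # A bounded deque drops the oldest reading automatically.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        """Record one reading; return True if a local alert should fire."""
        self.window.append(value)
        window_full = len(self.window) == self.window.maxlen
        mean = sum(self.window) / len(self.window)
        # Only alert once the window is full, to avoid reacting to a single spike.
        return window_full and mean > self.threshold

    def summary(self):
        """Compact aggregate that could be sent to the Cloud instead of raw data."""
        if not self.window:
            return None
        return {
            "count": len(self.window),
            "mean": sum(self.window) / len(self.window),
            "max": max(self.window),
        }


# Hypothetical CPU-load readings from one monitored node (1.0 = fully loaded).
alerter = EdgeAlerter(window_size=3, threshold=0.9)
alerts = [alerter.observe(load) for load in [0.5, 0.95, 0.97, 0.99]]
print(alerts)
print(alerter.summary())
```

The alert decision needs no round trip to the Cloud; the summary keeps the Cloud informed for longer-term analytics at a fraction of the data volume.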

Enabling technologies

In the process of designing the unified Edge-Cloud data processing framework, we will leverage in particular techniques for data processing already investigated by the participating teams as proof-of-concept software, validated in real-life environments:

  • The KerA [8] approach for Cloud-based low-latency storage for stream processing (currently under development at Inria, in collaboration with Universidad Politécnica de Madrid, in the framework of a contractual partnership between Inria and Huawei Munich). By eliminating storage redundancies between data ingestion and storage, preliminary experiments with KerA successfully demonstrated its capability to increase throughput for stream processing.
  • The Planner [9] middleware for cost-efficient execution plans placement for uniform stream analytics on Edge and Cloud. Planner automatically selects which parts of the execution graph will be executed at the Edge in order to minimize the network cost. Real-world micro-benchmarks show that Planner reduces the network usage by 40% and the makespan (end-to-end processing time) by 15% compared to the state of the art.
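To make the placement idea concrete, the sketch below picks how many leading operators of a pipeline to run at the Edge so that the fewest bytes cross the network, subject to a compute budget on the Edge device. This is not Planner's algorithm: it simplifies the execution graph to a linear chain, and the operator names, costs, selectivities and the `choose_edge_prefix` helper are all hypothetical.

```python
def choose_edge_prefix(operators, input_bytes, edge_budget):
    """Pick the number of leading operators to run at the Edge.

    operators: list of (name, cost, selectivity) tuples, in pipeline order.
        cost        - assumed compute cost of the operator on the Edge device
        selectivity - output size divided by input size (< 1 shrinks the data)
    Returns (k, bytes_sent): run the first k operators at the Edge and ship
    bytes_sent bytes to the Cloud, where the remaining operators run.
    """
    best_k, best_bytes = 0, float(input_bytes)  # k = 0 means "100% Cloud"
    used, size = 0.0, float(input_bytes)
    for k, (_name, cost, selectivity) in enumerate(operators, start=1):
        used += cost
        if used > edge_budget:
            break  # the Edge device cannot afford this operator
        size *= selectivity
        if size < best_bytes:
            best_k, best_bytes = k, size
    return best_k, best_bytes


# Hypothetical four-stage pipeline over 1000 bytes of input.
pipeline = [
    ("parse", 1, 1.0),       # same volume out as in
    ("filter", 2, 0.2),      # keeps 20% of the records
    ("enrich", 2, 1.5),      # adds fields, so the data grows
    ("aggregate", 10, 0.01), # heavy, but collapses the stream
]

small_device = choose_edge_prefix(pipeline, input_bytes=1000, edge_budget=6)
large_device = choose_edge_prefix(pipeline, input_bytes=1000, edge_budget=20)
print(small_device, large_device)
```

With the small budget, the cut falls after the filter, because enriching at the Edge would only inflate the data to be sent; with the larger budget, the whole chain fits on the Edge and only the aggregate leaves the device.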

International visibility and mobility

The thesis will be co-supervised by Nicolae Tapus from University Politehnica of Bucharest (UPB, Romania). We plan to establish a cotutelle agreement between INSA and UPB, leveraging the strong connection of Alexandru Costan with UPB (i.e., previously hosting Master, PhD and Postdoc students from UPB within the KerData team). Alexandru has played an important role in setting up the double-diploma agreement, now in place between INSA Rennes and UPB.

The Distributed Systems team led by Nicolae Tapus at UPB is the main developer (in partnership with Caltech) of the MonALISA system, used to monitor the computing infrastructure across the globe for the LHC experiments at CERN. In this context, the thesis will leverage and enhance the collaboration of INSA/IRISA with the California Institute of Technology (Caltech) and CERN.

The thesis will include collaborations with other partners of the KerData team, focused on stream processing: the Flink team of Tilmann Rabl at TU Berlin (working on the Apache Flink Big Data stream processing engine) and the team of Bogdan Nicolae at Argonne National Laboratory (working on Exascale storage and processing models).

The PhD position is mainly based in Rennes, at IRISA. The candidate is also expected to be hosted for internships of 3 to 6 months at the partners mentioned above (i.e., UPB, TU Berlin, ANL).

Industrial partnership and valorisation

The KerA software, which is central to this PhD proposal, is now the subject of exploitation plans by Huawei. The envisioned PhD work will follow the requirements that emerged from this partnership.

Building on the KerA expertise, the KerData team is now actively involved in the ZettaFlow startup project, financed by EIT Digital, which focuses on efficient Fast Data processing and management. The candidate will have the opportunity to work on challenging Edge-to-Cloud streaming use cases provided by the ZettaFlow initiative.

Leveraging the ongoing collaboration with TU Berlin (the main contributor to the Apache Flink reference framework), the prototype designed during this PhD will be integrated in the Apache Flink state-of-the-art data processing ecosystem. It will make it possible to perform both Edge and Cloud analytics and thereby derive insights from data in real time.

Interdisciplinarity

The targeted use-case of this PhD proposal (i.e., monitoring the ALICE experiment at CERN) will provide a perfect opportunity to illustrate the impact of research in computer science (more specifically in Big Data analytics) on the domain of nuclear physics. In particular, we plan to show that the Large Hadron Collider experiments at CERN can be monitored more efficiently and that physicists can react faster to the perceived phenomena, using the Edge-Cloud processing techniques designed in this thesis.

References

[1] M. Satyanarayanan, “The emergence of edge computing,” Computer, 2017.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. “Spark: cluster computing with working sets”. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud’10). USENIX Association, Berkeley, CA, USA, 2010.
[3] P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas. “State management in Apache Flink: consistent stateful distributed stream processing”. Proc. VLDB Endow. 10, 12, 2017, 1718-1729.
[4] Apache Edgent, http://edgent.apache.org, accessed in January 2019.
[5] Apache Minifi, https://nifi.apache.org/minifi/, accessed in January 2019.
[6] I. Legrand, R. Voicu, C. Cirstoiu, C. Grigoras, L. Betev, and A. Costan. “Monitoring and control of large systems with MonALISA”. Communications of the ACM 52, 9, September, 2009, 49-55.
[7] The ALICE LHC Experiment, https://home.cern/science/experiments/alice, accessed in January 2019.
[8] O.C. Marcu, A. Costan, G. Antoniu, M. Pérez-Hernández, B. Nicolae, et al. “KerA: Scalable Data Ingestion for Stream Processing”. ICDCS 2018, 38th IEEE International Conference on Distributed Computing Systems, Vienna, Austria, pp. 1480-1485, 2018.
[9] L. Prosperi, A. Costan, P. Silva, and G. Antoniu. “Planner: Cost-efficient Execution Plans Placement for Uniform Stream Analytics on Edge and Cloud”. In WORKS 2018: 13th Workflows in Support of Large-Scale Science Workshop, held in conjunction with the IEEE/ACM SC18 conference, 2018.

Requirements of the candidate

  • An excellent Master's degree in computer science or equivalent
  • Strong knowledge of computer networks and distributed systems
  • Knowledge of storage and (distributed) file systems
  • Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
  • Strong programming skills (e.g., C/C++, Java, Python)
  • Working experience in the areas of Big Data management, Cloud computing or HPC is an advantage
  • Very good communication skills in oral and written English
  • Open-mindedness, strong integration skills and team spirit

To apply:

  1. Send an email with a cover letter, CV, contact addresses of at least two professional references and copies of degree certificates to Dr. Gabriel Antoniu and Dr. Alexandru Costan. Incomplete applications will not be considered or answered.