Unified Data Management for Hybrid Analytics
Due to an ever-growing digitalization of the everyday life, massive amounts of data start to be accumulated, providing larger and large volumes of historical data (past data) on more and more monitored systems. At the same time, an up-to-date vision of the actual status of these systems is offered by the increasing number of sources of real-time data (present data). Today’s data analytics systems correlate these two types of data (past and present) to predict the future evolution of the systems to enable decision making. However, the relevance of such decisions is limited by the knowledge already accumulated in the past. Our vision consists in improving this decision process by enriching the knowledge acquired based on past data with what could be called future data generated by simulations of the system behavior under various hypothetical conditions that have not been met in the past.
Activity in 2019
Sub-goal 1: To support such analytics we advocate for unified data processing on converged, extreme-scale distributed environments. We are targeting a novel data processing architecture able to relevantly leverage and combine stream processing and batch processing in situ and in transit.
Sub-goal 2: Leveraging on the previous goal, we aim to provide a set of design principles and to identify relevant several software libraries and components based on which such an architecture can be implemented as a demonstrator.
Results: The first sub-goal lead to the design of the Sigma architecture for data processing, leveraging data-driven analytics and simulations-driven analytics. It combines batch-based and stream-based Big Data processing techniques (i.e., the Lambda architecture) with in situ/in transit data processing techniques inspired by the HPC. This allows to collect, manage and process extreme volumes of past, real-time and simulated data. The architecture relies on two layers: (1) the data-driven layer leverages the unified Lambda architecture to update the simulation model dynamically using the past and real-time data through a continuous learning loop; (2) the computation-driven layer uses in situ and in transit processing to proactively control in real-time the targeted systems. The concept of the Sigma architecture was published as a white paper at the BDEC2 workshop 2019.
The second sub-goal was achieved by investigating the potential candidate libraries able to provide a reference implementation of the Sigma architecture. Our approach consists in jointly leveraging two existing software components: the Damaris middleware for scalable in situ/in transit processing and the KerA unified system for ingestion and storage of data for scalable stream processing. Damaris has already been integrated with HPC storage systems (e.g., HDF5) to enable scalable collective writing of data generated by simulations running on tens of thousands of cores (e.g., at the end of each iteration). Such data can further be analyzed (typically through offline analysis). We developed for Damaris a new real-time storage backend based on KerA: leveraging Damaris dedicated cores and shared memory components, we are able to asynchronously write in real-time the data output of a simulation (that corresponds to multiple iterations) as a series of events, logically aggregated by a stream, with metadata describing the stream’s sub-partitions for each iteration. This is an important step towards a complete data processing and analytics workflow: by leveraging real-time streaming analytics (e.g., Apache Flink) efficiently coupled with KerA through shared memory, scientists can further apply machine learning techniques in real time, in parallel with the running simulation. This joint architecture is now at the core of a start-up creation within the KerData team: ZettaFlow, providing a new generic IoT data storage and analysis platform.
People: Ovidiu Marcu, Alexandru Costan, Gabriel Antoniu, Rolando Menchaca-Mendez
Analytical Models for Performance Evaluation of Stream Processing
In recent years we have witnessed an exponential growth on the number of devices connected to internet, from personal computers and smartphones to surveillance cameras, household appliances, autonomous vehicles, traffic lights and specially, many different kinds of sensors. All of these devices are constantly producing useful, but massive amounts of data, which are currently being processed by means of Big Data solutions in the Clouds. However, as the number of devices (and in consequence, the size of data produced) keeps growing, the current solutions might not be sufficient to handle such huge quantities of data, or might incur on prohibitive monetary or energetic costs, or in large response times that degrade the user experience or even cause dangerous situations (as in real-time applications that require to take critical decisions i.e. self-driving cars). Edge computing has been proposed as an alternative solution, which aims to process data near to where it is generated, and while it may seem to be the right alternative, it poses a series of challenges which need to be addressed. One of these challenges is to identify the situations and scenarios in which Edge processing is suitable to be applied.
Activity in 2019
Sub-goal: Determine which parts of the application should be executed on the Edge (if any) and which parts should remain on the Cloud.
Results: To address this goal, José Canepa did a 3 month internship at Inria, under the supervision of Pedro Silva and Alexandru Costan. We have modelled the application using a graph consisting of tasks as nodes and data dependencies between them as edges. The problem comes down to deploying the application graph onto the network graph, that is, operators need to be put on machines, and finding the optimal cut in the graph between the Edge and Cloud resources (i.e., nodes in the network graph). We have designed an algorithm that finds the optimal execution plan, with a rich cost model that lets users to optimize whichever goal they might be interested in, such as monetary costs, energetic consumption or network traffic, to name a few. In order to validate the cost model and the effectiveness of the algorithm, a series of experiments were designed using two real-life stream processing applications: a closed-circuit television surveillance system, and an earthquake early warning system. Two network infrastructures were designed to run the applications: the first one being a state-of-art infrastructure where all processing is done on the Cloud to serve as benchmark; and the second one being an infrastructure produced by the algorithm. Both scenarios were executed on the Grid’5000. Several experiments are currently underway, the results will be published in a journal paper (current target: a Special Issue of Future Generation of Computer Systems). The trade-offs of executing Cloud/Edge workloads with this model were published at the IEEE Big Data 2019 conference.
People: José Canepa, Pedro Silva, Alexandru Costan, Rolando Menchaca, Ricardo Menchaca
Benchmarking Edge/Cloud Processing Frameworks
With the spectacular growth of the Internet of Things, Edge processing emerged as a relevant means to offload data processing and analytics from centralized Clouds to the devices that serve as data sources (often provided with some processing capabilities). While a large plethora of frameworks for Edge processing were recently proposed, the distributed systems community has no clear means today to discriminate between them. Some preliminary surveys exist, focusing on a feature-based comparison. We claim that a step further is needed, to enable a performance-based comparison. To this purpose, the definition of a benchmark is a necessity.
Activity in 2019
Sub-goal: Design a preliminary methodology for comparing Edge processing frameworks through benchmarking.
Results: During this period, Pedro Silva focused on analyzing the existing Edge processing frameworks and preliminary benchmarking efforts. We proposed an initial version of a benchmarking methodology for Edge processing and analyzed the main challenges posed by its implementation. The benchmark methodology in presented in seven dimensions: (i) benchmark objectives, (ii) Edge processing frameworks, (iii) infrastructure, (iv) scenario applications and input data, (v) experiment parameters, (vi) evaluation metrics, and (vii) benchmark workflow. Once implemented, this benchmark would make a step beyond existing feature-based comparisons available in the related work, typically based on their documentation: it will serve to compare a large number of private and open-source Edge processing frameworks in a performance-oriented fashion. Current collaboration with José Canepa and Rolando Menchaca set the pathway towards the implementation of the benchmarks, enriched with the data models presented in the previous section. The vision of the benchmark was initially published at the PAISE workshop joint with IPDPS 2019.
People: Pedro Silva, Alexandru Costan, Gabriel Antoniu, José Canepa
Modelling Smart Cities Applications
Smart Cities are the main applicative use-case of this Associated Team. We therefore need to take into account both the characteristics of the data and the process/storage requirements of these particular applications, since they will drive the design choices of the systems supporting them. The objective is to devise clear models of the data handled by the applications in this context. The relation between the data characteristics and the processing requirements does not have to follow an one-to-one relation. In some cases, some particular types of data might need one or more types of processing, depending on the use case. For example, small and fast data coming from sensors do not always have to be processed in real-time, but they could also be processed in a batch manner at a later stage.
Activity in 2019
Sub-goal 1: Model the stream rates of data from sets of sensors in Smart Cities, specifically, from vehicles inside a closed coverage area.
Sub-goal 2: Use the previous model for accurate predictions of the resources needed to process the sensors data at the Edge and in the Cloud, as an input for auto-scaling modules.
Results: These two goals were addressed during the 3 month internship of Edgar Romo Montiel at Inria, under the supervision of Alexandru Costan and Gabriel Antoniu. Specifically, the first sub-goal lead to the design of a mathematical model to predict the time that a mobile sensor resides inside a geographical designated area. To bound the problem, users are vehicles in a V2I VANET connected to some specific applications in the Cloud such as traffic reports, navigation apps, multimedia downloading etc. The proposed model is based on Coxian distributions to estimate the time a vehicle demands Cloud services. The main challenge is to adjust its parameters and this was achieved by validating the model agains real-life data traces from the city of Luxembourg, through extensive experiments on the Grid’5000.
As a second sub-goal, the objective is to use these models to estimate the resources needed in the Cloud (or at the Edge) in order to process the whole stream of data. We designed an Auto-Scaling module able to adapt the resources based on the detected load. Using the Grid’5000, we evaluate the different choices for placing the prediction module: (i) at the Edge, close to data with less accuracy but faster results, or (ii) in the Cloud, with higher accuracy due to the global data, but higher latency as well). The results will be published at a workshop joint with an A class conference from the Cloud/Big Data domain.
People: Edgar Romo, Alexandru Costan, Mario Rivero, Gabriel Antoniu, Ricardo Menchaca-Mendez