The proliferation of small sensors and devices capable of generating valuable information in the context of the Internet of Things (IoT) has dramatically increased the amount of data flowing from connected objects to cloud infrastructures. This is particularly true for Smart City applications. These applications raise specific challenges, as they typically have to handle small data (on the order of bytes and kilobytes), arriving at high rates, from many geographically distributed sources (sensors, citizens, public open data sources, etc.) and in heterogeneous formats, which need to be processed and acted upon with high reactivity in near real-time.
Lambda architectures were recently proposed to combine batch and real-time processing, in order to complement the on-line dimension of such stream data processing with a machine/deep learning dimension and to gain more insight from historical data. Unfortunately, the lack of a scalable data management subsystem is becoming an important bottleneck for such Lambda architectures, as it widens the gap between their I/O requirements and the storage performance. In particular, the layered design of the Lambda architecture (i.e., one "fast" layer for real-time streaming from the edge and one "slow" layer for batch processing of historical data in the cloud) has several drawbacks: data is often written to disk or sent over the network twice, and there is a lack of coordination between the ingestion and storage layers, which can lead to I/O interference and increased overhead in the custom data management tools built on top of them at the processing layer. The goal of SmartFastData is to address these data challenges in order to allow Lambda architectures to truly enable a unified approach for hybrid edge/cloud analytics.
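To make the duplicated-I/O drawback concrete, the following minimal Python sketch (purely illustrative; the names `LambdaPipeline`, `speed_view` and `batch_log` are ours and do not refer to any specific system) shows how a classical Lambda deployment forwards every incoming event twice, once to the real-time view and once to the batch store, doubling the write traffic before any coordination between the two layers can take place:

```python
from dataclasses import dataclass, field

@dataclass
class LambdaPipeline:
    """Toy Lambda architecture: every event is written twice."""
    speed_view: dict = field(default_factory=dict)   # "fast" layer (real-time view)
    batch_log: list = field(default_factory=list)    # "slow" layer (historical log)
    writes: int = 0                                  # count of write operations

    def ingest(self, key, value):
        # Path 1: update the real-time view immediately.
        self.speed_view[key] = self.speed_view.get(key, 0) + value
        self.writes += 1
        # Path 2: append the same event to the batch log for later reprocessing.
        self.batch_log.append((key, value))
        self.writes += 1

    def batch_recompute(self):
        # Periodic batch job rebuilds the view from the full history.
        view = {}
        for key, value in self.batch_log:
            view[key] = view.get(key, 0) + value
        return view

pipeline = LambdaPipeline()
for k, v in [("sensor-1", 3), ("sensor-2", 5), ("sensor-1", 2)]:
    pipeline.ingest(k, v)

# Three events incurred six write operations in total.
print(pipeline.writes)                                     # 6
print(pipeline.speed_view == pipeline.batch_recompute())   # True
```

The batch recomputation produces exactly the same view as the speed layer here, which underlines that the duplication buys fault tolerance and reprocessing capability at the price of doubled I/O, the cost a unified architecture aims to remove.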
Objective 1: Design a new, unified data management architecture for hybrid edge/cloud analytics
Our key idea is to enable uniform, continuous and transparent stream processing across edge and cloud, and to move beyond the separation between the two imposed by the data management limitations of current Lambda architectures (i.e., edge and cloud analytics currently serve different workloads and applications and are not used to complement each other for the same applications). To address this goal, we will first explore the complementarity of the approaches to Big Data cloud and edge management developed by the partners. At the edge, we plan to investigate the use of low-energy bandwidth reutilization techniques, such as cognitive radio systems, as a way to increase the capacity of opportunistic networks to transport the massive amounts of data generated by the sensing systems, as advocated by the Semantic-centric Cloudlets (Instituto Politécnico Nacional), which introduce semantics-based approaches to supporting collaborative opportunistic sensing tasks.
We target an order-of-magnitude decrease in stream data access latency (from seconds to milliseconds) and a reduction by half of the overall processing time (i.e., the time to decision). The ultimate goal of the resulting architecture is an online/real-time front-end for processing at the edge, close to where data streams are generated, while the cloud is used only for off-line back-end processing, mainly dealing with archival, fault tolerance, and further processing that is not time-critical. This hybrid approach enables edge analytics to detect "what" is happening with a monitored object, while cloud analytics allows one to understand "why" it is happening.
Objective 2: Explore analytical models for performance evaluation of stream storage and ingestion systems
In order to assess and predict the cost-benefit trade-offs of different storage strategies and the corresponding quality of service experienced by users, extensive performance evaluation is needed. Such analyses are not feasible through simulation or field experimentation alone, due to the large number of parameters to investigate and the high costs involved. We plan to investigate analytical models that scale to systems composed of millions of resources, aggregating a myriad of parameters. To achieve this goal we will leverage the modelling approaches proposed by KerData to capture the latencies and heterogeneity of cloud and mobile environments, coupled with the set of edge analytics models and algorithms developed by the Mexican partner to characterise the performance of data collection and processing.
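As a minimal illustration of the kind of closed-form analytical model this objective targets (this toy example is our own, not one of the partners' models), a single stream ingestion node can be approximated as an M/M/1 queue, whose standard mean response time 1/(mu - lambda) predicts latency across arbitrarily many parameter combinations without running a single simulation:

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean time an event spends at an M/M/1 ingestion node (waiting + service).

    arrival_rate: events/s arriving at the node (lambda)
    service_rate: events/s the node can ingest (mu); requires mu > lambda
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

# Predict ingestion latency as load approaches saturation (mu = 1000 events/s).
for load in (0.5, 0.9, 0.99):
    latency_ms = 1000 * mm1_response_time(load * 1000, 1000)
    print(f"utilisation {load:.0%}: mean latency {latency_ms:.1f} ms")
```

Evaluating such a formula for millions of (arrival rate, service rate) pairs is trivial, while the same sweep via simulation or field experiments would be prohibitively expensive; real models of edge/cloud stream storage would of course need richer queueing networks, but the scalability argument is the same.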