BigdataNet (2013-2015) with UCSB (USA)

BigdataNet achievements

The project has achieved its initial objectives with respect to the definition of a P2P/cloud architecture and basic data management techniques. Furthermore, it has been the basis for a new research direction on privacy-preserving database outsourcing, which now continues as a major track in Zenith.

P2P/cloud architecture

We defined the P2P/cloud as a hybrid architecture composed of a traditional cloud infrastructure, small-scale peers (the P2P cloud), and clients. The cloud infrastructure is a pay-per-use cloud service such as Amazon EC2, while a peer is a machine with enough networking resources to relay client packets quickly; even a regular desktop PC meeting these resource requirements can act as a peer. Together, the cloud infrastructure and the peers form a two-level cloud platform.

In [Korpeoglu 2013], we develop a dynamic overlay network architecture and routing technology that supports various kinds of multi-user interactions for advanced Internet applications, irrespective of the users' geographic locations. We propose a geographic location-aware, hybrid, scalable, cloud-assisted peer-to-peer (P2P) architecture to support multipoint media streaming applications, targeting low administration cost, reduced bandwidth consumption, low latency, low initial investment cost and optimized resource usage. The main objective is an efficient media delivery system that leverages locality. The proposed architecture has three layers: at the core, the cloud handles application management; a 2-tier edge cloud supports geo-dispersed user groups; and, at the lowest level, dynamic P2P overlays serve locally clustered user groups.
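To make the locality-based assignment concrete, here is a minimal Python sketch with hypothetical names (assign_client, toy coordinates); it is not the system's code, only the shape of the decision: a client is attached to the closest edge cloud and then joins, or bootstraps, a nearby P2P overlay.

import math

def distance(a, b):
    # Euclidean distance between two (x, y) positions -- a crude locality proxy
    return math.hypot(a[0] - b[0], a[1] - b[1])

def assign_client(client_pos, edge_clouds, overlays, radius=5.0):
    # Pick the nearest edge cloud, then join (or bootstrap) a nearby P2P overlay.
    edge = min(edge_clouds, key=lambda e: distance(client_pos, e["pos"]))
    for overlay in overlays[edge["id"]]:
        if distance(client_pos, overlay["centroid"]) <= radius:
            overlay["members"].append(client_pos)      # join an existing local group
            return edge["id"], overlay
    new_overlay = {"centroid": client_pos, "members": [client_pos]}
    overlays[edge["id"]].append(new_overlay)           # start a new local group
    return edge["id"], new_overlay

edge_clouds = [{"id": "eu-west", "pos": (48.8, 2.3)}, {"id": "us-west", "pos": (34.0, -118.2)}]
overlays = {"eu-west": [], "us-west": []}
print(assign_client((43.6, 3.9), edge_clouds, overlays))   # a client near Montpellier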

Scalable query processing with big data

We addressed the problem of answering user queries in the P2P/cloud architecture. One of the main challenges is to route queries to the peers or cloud nodes where the relevant data are stored. An important requirement for query processing is thus a data partitioning solution that smartly distributes the system's data among peers and cloud nodes and allows efficient query processing, in particular for online collaborative applications.

First, we considered applications with very large databases to which data items are continuously appended. Such applications are becoming more and more common on the web, and efficient data partitioning is one of the main requirements for good performance. For applications with complex access patterns, e.g. scientific applications, workload-based partitioning can be exploited. However, existing workload-based approaches, which work in a static way, cannot be applied to very large databases. In [Liroz-Gistau 2013b, Liroz-Gistau 2013c], we proposed dynamic partitioning algorithms for continuously growing databases. These algorithms efficiently adapt the data partitioning to the arrival of new data elements by taking into account the affinity of new data with queries and fragments. In contrast to existing static approaches, our approach offers constant execution time, regardless of the database size, while obtaining very good partitioning efficiency.
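The following sketch illustrates the affinity idea under simplifying assumptions (hypothetical helper names, a toy workload of two predicate queries); it is not the published algorithm, only the shape of the placement decision: a new item is routed to the fragment whose workload queries it shares the most, using only the fragment summaries, hence in time independent of the database size.

def matching_queries(item, workload):
    # Queries (predicates) that would access the new item
    return {qid for qid, predicate in workload.items() if predicate(item)}

def place_item(item, fragments, workload, max_size=1000):
    # Assign the item to the fragment with the highest query affinity; only the
    # fragment/query summaries are inspected, never the stored data itself.
    item_queries = matching_queries(item, workload)
    best = max(
        (f for f in fragments if f["size"] < max_size),
        key=lambda f: len(item_queries & f["queries"]),
    )
    best["size"] += 1
    best["queries"] |= item_queries     # update the fragment's affinity summary
    return best["id"]

workload = {"q1": lambda t: t["year"] >= 2013, "q2": lambda t: t["site"] == "UCSB"}
fragments = [{"id": 0, "size": 0, "queries": set()}, {"id": 1, "size": 0, "queries": set()}]
print(place_item({"year": 2014, "site": "UCSB"}, fragments, workload))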

Second, we addressed the problem of high data transfers in MapReduce, a very popular parallel programming framework for the cloud. In [Liroz-Gistau 2013a], we propose a technique that repartitions tuples of the input datasets, thereby optimizing the distribution of key-value pairs over mappers and increasing data locality in reduce tasks. Our approach captures the relationships between input tuples and intermediate keys by monitoring the execution of a set of MapReduce jobs that is representative of the workload. Then, based on those relationships, it assigns input tuples to the appropriate chunks. We evaluated our approach through experiments with a Hadoop deployment on top of Grid'5000 using standard benchmarks. The results show a large reduction in data transfer during the shuffle phase compared to native Hadoop.
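As an illustration only (hypothetical names, a toy word-count map function, not the actual implementation), the sketch below captures the two steps: monitor which intermediate keys each input tuple produces, then co-locate tuples with the same key signature in the same chunk so that the shuffle moves less data.

from collections import defaultdict

def monitor(tuples, map_fn):
    # Run the map function once on a sample and record tuple -> intermediate keys
    return {t: frozenset(k for k, _ in map_fn(t)) for t in tuples}

def repartition(tuple_keys, n_chunks):
    # Greedy assignment: each distinct intermediate-key signature is pinned to one chunk
    chunk_of_signature, chunks = {}, defaultdict(list)
    for t, signature in tuple_keys.items():
        chunk = chunk_of_signature.setdefault(signature, len(chunk_of_signature) % n_chunks)
        chunks[chunk].append(t)
    return chunks

# Toy word-count-like map function over (id, text) tuples
map_fn = lambda t: [(w, 1) for w in t[1].split()]
sample = [(1, "big data"), (2, "big cloud"), (3, "big data")]
print(dict(repartition(monitor(sample, map_fn), n_chunks=2)))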

Third, we addressed the problem of data skew in the MapReduce parallel processing framework. In many cases, because of skewed intermediate data, a large fraction of the processing in the reduce side of MapReduce is done by a few nodes, or even a single node, while the others remain idle. There have been some attempts to address this problem of data skew, but only for specific cases. In particular, there is no solution when all or most of the intermediate values correspond to a single key, or to a number of keys smaller than the number of reduce workers. In this work, we propose FP-Hadoop [Akbarinia 2015, Liroz-Gistau 2015], a system that makes the reduce side of MapReduce more parallel and can efficiently deal with reduce-side data skew. We extended the programming model of MapReduce to allow reduce workers to collaborate on processing the values of an intermediate key, without affecting the correctness of the final results. This allows a large part of the reduce work to be performed using the computing resources of all workers, even in the case of highly skewed data. We implemented a prototype of FP-Hadoop by modifying Hadoop's code and conducted extensive experiments over synthetic and real datasets. The results show that FP-Hadoop makes MapReduce job processing much faster and more parallel, and can efficiently deal with skewed data. We achieve excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.
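The sketch below (plain Python threads instead of Hadoop tasks, a simple sum as the reduce function) only illustrates the principle of FP-Hadoop's intermediate reduce phase: the values of a hot key are split into blocks that several workers reduce in parallel, and the final reducer merges the small partial results. The intermediate reduce function must tolerate being applied to arbitrary subsets of the values, as partial sums do.

from concurrent.futures import ThreadPoolExecutor

def intermediate_reduce(block):
    return sum(block)                       # partial result for one block of values

def final_reduce(partials):
    return sum(partials)                    # merge the partial results

def reduce_skewed_key(values, n_workers=4, block_size=1000):
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(intermediate_reduce, blocks))   # all workers collaborate
    return final_reduce(partials)

# A single hot key holding one million values no longer blocks on one reducer
print(reduce_skewed_key(list(range(1_000_000))))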

Privacy-preserving database outsourcing

In [Allard 2015a], we consider the problem of outsourcing data querying services over massive sets of personal data to the cloud. In this context, preserving data privacy is a strong requirement, even against the cloud provider (which, even though not necessarily dishonest, may be attacked or simply negligent). Our approach consists in computing a privacy-preserving auxiliary data structure from the data and transmitting it to the cloud provider together with the data, encrypted by a semantically secure encryption scheme. Indeed, with encrypted data only, the cloud provider would be unable to process any query; some information must be leaked, and this is precisely the role of the auxiliary data structure. It is a hierarchical index satisfying differential privacy: it allows the cloud provider to run range queries on the encrypted data while tightly controlling the information that leaks from it. The gain in privacy comes, however, at a cost in bandwidth, since the set of encrypted records returned by the cloud provider is a superset of the exact query answer. We designed both the privacy mechanisms that allow perturbing the index without destroying its utility, and distributed algorithms that maintain it when data is updated.
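The following simplified sketch (one flat level instead of a hierarchy, Laplace noise, encryption omitted, hypothetical parameter names) illustrates the mechanism: the provider only ever sees differentially private counts, and a range query returns the buckets overlapping the range, i.e. a superset of the exact answer that the client decrypts and filters locally.

import random

BUCKET_WIDTH = 10          # domain [0, 100) split into fixed-width buckets (assumption)
EPSILON = 1.0              # differential privacy budget (assumption)

def laplace_noise(scale):
    # Laplace(0, scale) sampled as the difference of two exponentials
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def build_index(values):
    # Noisy per-bucket counts: the only information the cloud provider learns.
    # The real index is hierarchical and the records themselves stay encrypted.
    counts = [0] * (100 // BUCKET_WIDTH)
    for v in values:
        counts[v // BUCKET_WIDTH] += 1
    return [c + laplace_noise(1.0 / EPSILON) for c in counts]

def range_query(index, low, high):
    # The provider returns every encrypted bucket overlapping [low, high):
    # a superset of the exact answer that the client decrypts and filters.
    buckets = [b for b in range(len(index))
               if b * BUCKET_WIDTH < high and (b + 1) * BUCKET_WIDTH > low]
    estimated_size = sum(index[b] for b in buckets)   # noisy cardinality estimate
    return buckets, estimated_size

index = build_index([3, 12, 14, 37, 42, 42, 95])
print(range_query(index, 10, 45))    # buckets 1..4 and a noisy size estimate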

In [Allard 2015a], we propose Chiaroscuro, a complete solution for clustering personal data with strong privacy guarantees. The execution sequence produced by Chiaroscuro is massively distributed on personal devices, coping with arbitrary connections and disconnections. Chiaroscuro builds on our novel data structure, called Diptych, which allows the participating devices to collaborate privately by combining encryption with differential privacy. Our solution (proved correct and secure) yields a high clustering quality while minimizing the impact of the differentially private perturbation.
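As a rough illustration of the aggregation step only (one-dimensional data, Laplace noise on the local sums and counts, the encryption layer and the Diptych structure omitted), each device could contribute nothing but perturbed statistics to the centroid update, along the following lines.

import random

def laplace_noise(scale):
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def local_contribution(points, centroids, epsilon=1.0):
    # Per-device step: noisy (sum, count) for each centroid; raw points never leave
    sums, counts = [0.0] * len(centroids), [0.0] * len(centroids)
    for p in points:
        c = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        sums[c] += p
        counts[c] += 1
    return ([s + laplace_noise(1.0 / epsilon) for s in sums],
            [c + laplace_noise(1.0 / epsilon) for c in counts])

def aggregate(contributions, centroids):
    # Coordinator step: merge the noisy contributions into new centroids
    new = []
    for i, old in enumerate(centroids):
        total = sum(c[0][i] for c in contributions)
        count = sum(c[1][i] for c in contributions)
        new.append(total / count if count > 0.5 else old)
    return new

devices = [[1.0, 1.2, 0.9], [5.1, 4.8], [1.1, 5.0]]
centroids = [0.0, 6.0]
for _ in range(5):            # a few distributed iterations
    centroids = aggregate([local_contribution(d, centroids) for d in devices], centroids)
print(centroids)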

Decentralized recommendation for big data sharing

We considered the problem of big data sharing among professional communities, e.g. scientific communities, which have their own datasets and documents in a private cloud and wish to share some of them in a personalized and controlled way. Our approach to helping community members locate useful information is to exploit recommendations based on explicit personalization (e.g. using the friendship networks of scientists) over multiple private clouds in a P2P fashion. This lets us develop new decentralized, scalable recommendation protocols.

In [Servajean 2014a, 2014b], we investigate profile diversity for P2P search and recommendation of scientific documents. In scientific domains, endorsements from different communities are important indicators of the broad focus of scientific documents and should be accounted for in search and recommendation. To do so, we introduce profile diversity, a novel idea in searching scientific documents. Traditional content diversity has been thoroughly studied in centralized search and advertising, database queries, and recommendations, and addresses the problem of returning relevant but overly similar documents. We argue that content diversity alone does not suffice to find documents endorsed by different scientific communities, and that profile diversity is needed to avoid returning popular but overly focused documents. Moreover, P2P profile diversity increases recall compared with other P2P approaches and reduces the search space compared with a centralized approach. Our experiments on several datasets validate our proposal in terms of recommendation quality in a centralized architecture and in terms of recall in a P2P architecture.

In [Servajean 2015a, 2015b], we propose an original profile diversification scoring function that addresses the problem of returning redundant items and improves the quality of diversification compared to state-of-the-art solutions. We also addressed the problem of distributed and diversified recommendation in a hybrid P2P/cloud architecture. We proposed a new scoring function (usefulness) to cluster relevant users over a distributed overlay; compared with state-of-the-art solutions, we obtain major gains in recall. Finally, we proposed a threshold-based approach to return the most relevant and most popular documents while satisfying content and profile diversity constraints.
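The sketch below (hypothetical weights and a Jaccard-based profile overlap; not the published scoring function) illustrates how such a diversified selection can work: a greedy loop repeatedly picks the document that best combines relevance with profile diversity relative to the documents already selected.

def profile_diversity(candidate, selected):
    # 1 minus the maximum profile overlap (Jaccard) with already selected documents
    if not selected:
        return 1.0
    overlap = max(len(candidate["profiles"] & s["profiles"]) /
                  len(candidate["profiles"] | s["profiles"]) for s in selected)
    return 1.0 - overlap

def diversified_top_k(candidates, k, alpha=0.5):
    # Greedy selection: repeatedly pick the document maximizing relevance + diversity
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda d: alpha * d["relevance"] +
                                       (1 - alpha) * profile_diversity(d, selected))
        selected.append(best)
        pool.remove(best)
    return [d["id"] for d in selected]

docs = [
    {"id": "d1", "relevance": 0.9, "profiles": {"bio", "ml"}},
    {"id": "d2", "relevance": 0.8, "profiles": {"bio", "ml"}},
    {"id": "d3", "relevance": 0.6, "profiles": {"db", "p2p"}},
]
print(diversified_top_k(docs, k=2))     # d1, then d3 rather than the redundant d2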
