Hlib Mykhailenko has defended his Ph.D. thesis on Wednesday, June, 14 th , 2017 at 14:30, in the Inria, Euler violet room.
Title: Distributed Edge Partitioning
- Pietro Michardi (Eurecom)
- Matteo Sereno (Università degli Studi di Torino)
- Fabrice Huet (University of Nice-Sophia Antipolis)
- Guillaume Urvoy-Keller (Laboratory I3S)
- Damiano Carra (University of Verona)
- Giovanni Neglia (Inria Sophia Antipolis)
In distributed graph computation, graph partitioning is an important preliminary step because the computation time can significantly depend on how the graph has been split among the different executors. In this thesis we explore the graph partitioning problem. Recently, edge partitioning approach has been advocated as a better approach to process graphs with a power-law degree distribution, which are very common in real-world datasets. That is why we focus on edge partitioning approach. We start by an overview of existing metrics, to evaluate the quality of the graph partitioning. We briefly study existing graph processing systems: Hadoop, Giraph, Giraph++, Distributed GrahpLab, and PowerGraph with their key features. Next, we compare them to Spark, a popular big-data processing framework with its graph processing APIs \u2014 GraphX. We provide an overview of existing edge partitioning algorithms and introduce partitioner classification. We conclude that, based only on published work, it is not possible to draw a clear conclusion about the relative performances of these partitioners. For this reason, we have experimentally compared all the edge partitioners currently available for GraphX. Results suggest that Hybrid-Cut partitioner provides the best performance. We then study how it is possible to evaluate the quality of a partition before running a computation. To this purpose, we carry experiments with GraphX and we perform an accurate statistical analysis using a linear regression model. Our experimental results show that communication metrics like vertex-cut and communication cost are effective predictors in most of the cases. Finally, we propose a framework for distributed edge partitioning based on distributed simulated annealing which can be used to optimize a large family of partitioning metrics. We provide sufficient conditions for convergence to the optimum and discuss which metrics can be efficiently optimized in a distributed way. We implemented our framework with GraphX and performed a comparison with JA-BE-JA-VC, a state-of-the-art partitioner that inspired our approach. We show that our approach can provide significant improvements.