Saber Salah: Parallel Itemset Mining in Massively Distributed Environments

When: Friday, May 27, at 11
Where: Turing building, room Gilles Kahn
Abstract: First I will address the problem of frequent itemset mining in big data. We call for specific data placement techniques in massively distributed environments to improve the performance of parallel frequent itemset mining (PFIM) algorithms. We thoroughly study and investigate the impact of combining such a frequent itemset algorithm with a specific data placement strategy. We show that an adequate placement of the data in a massively distributed environment along with a specific frequent itemset mining algorithm can make a mining process either inoperative or completely significant. We propose ODPR (Optimal Data-Process Relationship) our solution for fast mining of frequent itemsets in MapReduce. Our method allows discovering itemsets from massive data sets, where standard solutions from the literature do not scale. Indeed, in a massively distributed environment, the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. Our proposal has been evaluated using real-world data sets and the results illustrate a significant scale-up obtained with very low minimum support, which confirms the effectiveness of our approach. Generally, in a massively distributed environment (e.g., MapReduce or Spark), minimizing the number of jobs results in a significant performance of the process being executed. In the case of frequent itemset mining problem, discovering frequent itemsets in just one simple job would be preferable. To this end, we propose a highly scalable, parallel frequent itemset mining algorithm, namely Parallel Absolute Top Down (PATD). PATD algorithm renders the mining process of very large databases (up to Terabytes of data) simple and compact. Its mining process is made up of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy power consumption overhead, in a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that PATD algorithm is significantly more efficient and scalable than alternative approaches.
The second problem which I will address, is discovering maximally informative k-itemsets (miki) in big data based on joint entropy. We propose PHIKS (Parallel Highly Informative K-ItemSet) a highly scalable, parallel miki mining algorithm that renders the mining process of large scale databases (up to Terabytes of data) succinct and effective. Its mining process is made up of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of the miki having different sizes, which drastically reduces the execution time of the mining process. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal by the significant scale-up obtained with high itemsets length and over very large database.

Comments are closed