Soutenance de thèse

La soutenance de thèse de Mikhail Kamalov a lieu le 1er décembre 2022 à 15:00 CET dans la salle Euler Violet, Inria Sophia Antipolis.

Titre: « Scalable algorithms for graph-based semi-supervised learning with embedding »

Composition du jury:

Directeur:

Konstantin Avrachenkov, DR, Inria, France

Rapporteurs:

Marianne Clausel, Professeur, Université de Lorraine, France
Khalid Benabdeslem, Maître de Conférences, Université Claude Bernard Lyon 1, France

Examiners:

Aurelie Boisbunon, PhD, Ericsson, France
Christophe Crespelle, Professeur, Université Côte d’Azur, France
Paulo Goncalves, DR Inria Rhône-Alpes, France

Invité:

Jonathan Daeden, PhD, MyDataModels, France

Résumé:

Nowadays, graph-based semi-supervised learning (GB-SSL) is a fast-growing area of classifying nodes in a graph with an extremely low number of labelled
nodes. However, the GB-SSL algorithms have two general limitations: the first is
the memory/time complexity that arises in all state-of-the-art GB-SSL algorithms on
extremely large graphs. In particular, the high memory consumption occurs in graph
convolution networks and leads to Out of Memory (OOM) issues on GPU or RAM; (ii)
the second one appears in all GB-SSL algorithms based on Laplacian regularization loss.
This thesis’ major contribution is divided into two parts in order to suggest strategies that
would guarantee to avoid the restrictions mentioned above. In the first part of this thesis,
we propose a novel linear algorithm called Markov-Batch Stochastic Approximation
(MBSA) for solving Personalized PageRank. MBSA updates nodes batches and proposes
a significantly better tradeoff between memory consumption and convergence rate for an
optimal classification result than other linear models. Then, we propose a novel scaling
graph convolution network, denoted as MBSA-NN, which embeds our linear MBSA.
MBSA-NN avoids OOM issues and significantly reduces time and memory consumption
on GPU and RAM. We applied MBSA-NN on several very large datasets, and we showed
that it can handle graphs with more than 10M nodes and 2M of features under one
minute on one standard machine, including preprocessing, training and inference time.
Furthermore, we show that it has significantly improved memory/time consumption
and competitive accuracy concerning the latest best GB-SSL scaling algorithms. The
second part of this thesis focuses on solutions to Laplacian regularization loss issues. For
that reason, we propose a novel framework called Graph Diffusion & PCA (GDPCA).
This framework combines a modified Principal Component Analysis with the classical
supervised loss and Laplacian regularization loss. GDPCA allows handling the case
where the adjacency matrix presents through Binary edges and avoids the Curse of
dimensionality. Also, GDPCA can be applied to non-graph datasets, such as images,
by constructing a similarity graph. Furthermore, we propose a framework that embeds
PageRank SSL in a generative model (GenPR). GenPR joint training of nodes latent
space representation and label spreading through the reweighted adjacency matrix by
node similarities in the latent space. We demonstrate that a generative model can improve
accuracy and reduce the number of iteration steps for PageRank SSL. Moreover, we show
how to embed MBSA into the GenPR framework for providing the batch training regime
of GenPR. Finally, we propose a flexible SSL framework based on stacking GDPCA
and Zoetrope Genetic Programming algorithms into a novel framework: PaZoe. This
self-labelling framework shows that graph-based and non-graph based algorithms jointly
improve the quality of predictions and outperform each component taken alone. We also show that PaZoe outperforms state-of-the-art SSL algorithms on real datasets. Note
that the one of the datasets was generated in house, taking data from industrial graded
equipment to mimic DC motors during operation