Papers, please

“Papers, please” sessions

These are paper-sharing sessions for the PhD students of CIDRE, open to all members of the team. The event takes place on the 5th floor of CS. For any information request, contact pierre-francois.gimenez@centralesupelec.fr.

The “Papers, please” sessions are now managed by the PIRAT team.


14/12/23 at 15:30: “MAGIC: Detecting Advanced Persistent Threats via Masked Graph Representation Learning” presented by Fanny Dijoud

Abstract: Advance Persistent Threats (APTs), adopted by most delicate attackers, are becoming increasing common and pose great threat to various enterprises and institutions. Data provenance analysis on provenance graphs has emerged as a common approach in APT detection. However, previous works have exhibited several shortcomings: (1) requiring attack-containing data and a priori knowledge of APTs, (2) failing in extracting the rich contextual information buried within provenance graphs and (3) becoming impracticable due to their prohibitive computation overhead and memory consumption. In this paper, we introduce MAGIC, a novel and flexible self-supervised APT detection approach capable of performing multi-granularity detection under different level of supervision. MAGIC leverages masked graph representation learning to model benign system entities and behaviors, performing efficient deep feature extraction and structure abstraction on provenance graphs. By ferreting out anomalous system behaviors via outlier detection methods, MAGIC is able to perform both system entity level and batched log level APT detection. MAGIC is specially designed to handle concept drift with a model adaption mechanism and successfully applies to universal conditions and detection scenarios. We evaluate MAGIC on three widely-used datasets, including both real-world and simulated attacks. Evaluation results indicate that MAGIC achieves promising detection results in all scenarios and shows enormous advantage over state-of-the-art APT detection approaches in performance overhead.


30/11/23 at 15:30: “Network measurement methods for locating and examining censorship devices” presented by Lucas Aubard

Abstract: Advances in networking and firewall technology have led to the emergence of network censorship devices that can perform large-scale, highly-performant content blocking. While such devices have proliferated, techniques to locate, identify, and understand them are still limited, require cumbersome manual effort, and are developed on a case-by-case basis. In this paper, we build robust, general-purpose methods to understand various aspects of censorship devices, and study devices deployed in 4 countries (Azerbaijan, Belarus, Kazakhstan, and Russia). We develop a censorship traceroute method, CenTrace, that automatically identifies the network location of censorship devices. We use banner grabs to identify vendors from potential censorship devices. To collect more features about the devices themselves, we build a censorship fuzzer, CenFuzz, that uses various HTTP request and TLS Client Hello fuzzing strategies to examine the rules and triggers of censorship devices. Finally, we use features collected using these methods to cluster censorship devices and explore device characteristics across deployments. Using CenTrace measurements, we find that censorship devices are often deployed in ISPs upstream to clients, sometimes even in other countries. Using data from banner grabs and injected block-pages, we identify 23 commercial censorship device deployments in Azerbaijan, Belarus, Kazakhstan, and Russia. We observe that certain CenFuzz strategies such as using a different HTTP method succeed in evading a large portion of these censorship devices, and observe that devices manufactured by the same vendors have similar evasion behavior using clustering. The methods developed in this paper apply consistently and rapidly across a wide range of censorship devices and enable continued understanding and monitoring of censorship devices around the world.


24/11/23 at 15:30: “ATRA: Address Translation Redirection Attack Against Hardware-based External Monitors” presented by Lionel Hemmerlé

Abstract: Hardware-based external monitors have been proposed as a trustworthy method for protecting the kernel integrity. We introduce the design and implementation of Address Translation Redirection Attack (ATRA) that enables complete evasion of the hardware-based external monitor that anchors its trust on a separate processor. ATRA circumvents the external monitor by redirecting the memory access to critical kernel objects into a non-monitored region. Despite the seriousness of the ATRA issue, the address translation integrity has been assumed in many hardware-based external monitors and the possibility of its exploitation has been suggested yet many considered hypothetical. We explore the intricate details of ATRA, explain major challenges in realizing ATRA in practice, and address them with two types of ATRA called Memory-bound ATRA and Register-bound ATRA. Our evaluations with benchmarks show that ATRA does not introduce a noticeable performance degradation to the host system, proving practical applicability of the attack to alert the researchers to seriously address ATRA in designing future external monitors.


26/10/23 at 15:30: Tuto, please: Creating and publishing a Python package, by Vincent Raulin


28/09/23 at 15:30: “Bot detection on social networks: challenges and methods” presented by Adrien Schoen


08/06/23 at 15:30: “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance” presented by Sébastien Kilian

Abstract: There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the cost associated with querying popular LLM APIs, e.g. GPT-4, ChatGPT, J1-Jumbo, and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and text can be expensive. Motivated by this, we outline and discuss three types of strategies that users can exploit to reduce the inference cost associated with using LLMs: 1) prompt adaptation, 2) LLM approximation, and 3) LLM cascade. As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
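
The LLM cascade strategy described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the FrugalGPT implementation: the two model stubs, their costs, the `toy_score` reliability function, and the threshold are all hypothetical placeholders, and the real system learns which models to combine rather than using a fixed order.

```python
from typing import Callable, List, Tuple

# Each entry is (name, cost per query, model function); all hypothetical stand-ins.
Model = Tuple[str, float, Callable[[str], str]]


def cascade(query: str, models: List[Model],
            score: Callable[[str, str], float],
            threshold: float) -> Tuple[str, str, float]:
    """Query models from cheapest to most expensive and stop at the first
    answer whose reliability score reaches the threshold."""
    total_cost = 0.0
    name, answer = "", ""
    for name, cost, model in sorted(models, key=lambda m: m[1]):
        answer = model(query)
        total_cost += cost
        if score(query, answer) >= threshold:
            break  # good enough: do not pay for the larger models
    return name, answer, total_cost


# Hypothetical stubs standing in for real LLM API calls.
def small_model(query: str) -> str:
    return "4" if query == "2+2?" else "unsure"


def large_model(query: str) -> str:
    return "a detailed answer"


def toy_score(query: str, answer: str) -> float:
    # Assumed scorer: a real one would be a learned model.
    return 0.0 if answer == "unsure" else 1.0


MODELS: List[Model] = [("large", 20.0, large_model), ("small", 1.0, small_model)]
```

With these stubs, an easy query is answered by the cheap model alone, while a hard query pays for both calls before the larger model's answer is accepted.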


25/05/23 at 14:30: “HOLMES: Real-Time APT Detection through Correlation of Suspicious Information Flows” presented by Maxime Lanvin

Abstract: In this paper, we present HOLMES, a system that implements a new approach to the detection of Advanced and Persistent Threats (APTs). HOLMES is inspired by several case studies of real-world APTs that highlight some common goals of APT actors. In a nutshell, HOLMES aims to produce a detection signal that indicates the presence of a coordinated set of activities that are part of an APT campaign. One of the main challenges addressed by our approach involves developing a suite of techniques that make the detection signal robust and reliable. At a high-level, the techniques we develop effectively leverage the correlation between suspicious information flows that arise during an attacker campaign. In addition to its detection capability, HOLMES is also able to generate a high-level graph that summarizes the attacker’s actions in real-time. This graph can be used by an analyst for an effective cyber response. An evaluation of our approach against some real-world APTs indicates that HOLMES can detect APT campaigns with high precision and low false alarm rate. The compact high-level graphs produced by HOLMES effectively summarizes an ongoing attack campaign and can assist real-time cyber-response operations.


25/05/23 at 15:30: “Reinforcement learning with parameterized actions” presented by Natan Talon

Abstract: We introduce a model-free algorithm for learning in Markov decision processes with parameterized actions—discrete actions with continuous parameters. At each step the agent must select both which action to use and which parameters to use with that action. We introduce the Q-PAMDP algorithm for learning in these domains, show that it converges to a local optimum, and compare it to direct policy search in the goal-scoring and Platform domains.


17/05/23 at 14:00: “Diglossia: detecting code injection attacks with precision and efficiency” presented by Grégor Quetel

Abstract: Code injection attacks continue to plague applications that incorporate user input into executable programs. For example, SQL injection vulnerabilities rank fourth among all bugs reported in CVE, yet all previously proposed methods for detecting SQL injection attacks suffer from false positives and false negatives. This paper describes the design and implementation of DIGLOSSIA, a new tool that precisely and efficiently detects code injection attacks on server-side Web applications generating SQL and NoSQL queries. The main problems in detecting injected code are (1) recognizing code in the generated query, and (2) determining which parts of the query are tainted by user input. To recognize code, DIGLOSSIA relies on the precise definition due to Ray and Ligatti. To identify tainted characters, DIGLOSSIA dynamically maps all application-generated characters to shadow characters that do not occur in user input and computes shadow values for all input-dependent strings. Any original characters in a shadow value are thus exactly the taint from user input. Our key technical innovation is dual parsing. To detect injected code in a generated query, DIGLOSSIA parses the query in tandem with its shadow and checks that (1) the two parse trees are syntactically isomorphic, and (2) all code in the shadow query is in shadow characters and, therefore, originated from the application itself, as opposed to user input. We demonstrate that DIGLOSSIA accurately detects both SQL and NoSQL code injection attacks while avoiding the false positives and false negatives of prior methods. By recasting the problem of detecting injected code as a string propagation and parsing problem, we gain substantial improvements in efficiency and precision over prior work. Our approach does not require any changes to the databases, Web servers, or Web browsers, adds virtually unnoticeable performance overhead, and is deployable today.
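
The shadow-character idea from this abstract can be illustrated with a toy sketch. It is not the DIGLOSSIA implementation: the mapping into the Unicode Private Use Area, the regex-based tokenizer standing in for a real SQL parser, and all function names are assumptions made for illustration. Only check (2) from the abstract is covered (all code must be in shadow characters); the parse-tree isomorphism check (1) is not reproduced.

```python
import re
from typing import List, Tuple

SHADOW_BASE = 0xE000  # Private Use Area; assumed never to occur in user input

# Toy notion of "non-code": quoted strings, numbers, whitespace.
# A real tool would use a full SQL/NoSQL parser instead.
NON_CODE = re.compile(r"'[^']*'|\d+|\s+")


def to_shadow(text: str) -> str:
    """Map ASCII application-generated characters to shadow characters."""
    return "".join(chr(SHADOW_BASE + ord(c)) for c in text)


def is_shadow(ch: str) -> bool:
    return SHADOW_BASE <= ord(ch) < SHADOW_BASE + 0x80


def build_query(parts: List[Tuple[str, bool]]) -> Tuple[str, str]:
    """parts is a list of (text, from_user) pairs.
    Returns the real query and its shadow, aligned character by character."""
    query = "".join(text for text, _ in parts)
    shadow = "".join(text if from_user else to_shadow(text)
                     for text, from_user in parts)
    return query, shadow


def has_injected_code(query: str, shadow: str) -> bool:
    """True if any code character of the query came from user input."""
    pos = 0
    while pos < len(query):
        match = NON_CODE.match(query, pos)
        if match:
            pos = match.end()  # literals and whitespace may be tainted
            continue
        if not is_shadow(shadow[pos]):
            return True  # code character that is not a shadow character
        pos += 1
    return False


def login_query(user_input: str) -> Tuple[str, str]:
    # Hypothetical application code building a query from user input.
    return build_query([("SELECT * FROM users WHERE name = '", False),
                        (user_input, True),
                        ("'", False)])
```

In this sketch, `has_injected_code(*login_query("alice"))` is false because the user input stays inside a string literal, whereas an input such as `x' OR '1'='1` contributes the `OR` keyword as code and is flagged.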


27/04/23 at 15:30: “Language Models are Few-Shot Learners” presented by Vincent Raulin

Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.


20/04/23 at 14:00: “Towards a Standard Feature Set for Network Intrusion Detection System Datasets” presented by Adrien Schoen

Abstract: Network Intrusion Detection Systems (NIDSs) are important tools for the protection of computer networks against increasingly frequent and sophisticated cyber attacks. Recently, a lot of research effort has been dedicated to the development of Machine Learning (ML) based NIDSs. As in any ML-based application, the availability of high-quality datasets is critical for the training and evaluation of ML-based NIDS. One of the key problems with the currently available datasets is the lack of a standard feature set. The use of a unique and proprietary set of features for each of the publicly available datasets makes it virtually impossible to compare the performance of ML-based traffic classifiers on different datasets, and hence to evaluate the ability of these systems to generalise across different network scenarios. To address that limitation, this paper proposes and evaluates standard NIDS feature sets based on the NetFlow network meta-data collection protocol and system. We evaluate and compare two NetFlow-based feature set variants, a version with 12 features, and another one with 43 features.


16/03/23 at 15:30: “CADE: Detecting and Explaining Concept Drift Samples for Security Applications” presented by Hélène Orsini

Abstract: Concept drift poses a critical challenge to deploy machine learning models to solve practical security problems. Due to the dynamic behavior changes of attackers (and/or the benign counterparts), the testing data distribution is often shifting from the original training data over time, causing major failures to the deployed model. To combat concept drift, we present a novel system CADE aiming to 1) detect drifting samples that deviate from existing classes, and 2) provide explanations to reason the detected drift. Unlike traditional approaches (that require a large number of new labels to determine concept drift statistically), we aim to identify individual drifting samples as they arrive. Recognizing the challenges introduced by the high-dimensional outlier space, we propose to map the data samples into a low-dimensional space and automatically learn a distance function to measure the dissimilarity between samples. Using contrastive learning, we can take full advantage of existing labels in the training dataset to learn how to compare and contrast pairs of samples. To reason the meaning of the detected drift, we develop a distance-based explanation method. We show that explaining “distance” is much more effective than traditional methods that focus on explaining a “decision boundary” in this problem context. We evaluate CADE with two case studies: Android malware classification and network intrusion detection. We further work with a security company to test CADE on its malware database. Our results show that CADE can effectively detect drifting samples and provide semantically meaningful explanations.


07/03/23 at 16:00: “Characterizing DNS query response sizes through active and passive measurements” presented by Manuel Poisson

Abstract: DNS has been one of the most important pieces in the current Internet. As an advanced feature, DNS provides chains of trusts for query responses from authoritative servers (i.e., DNSSEC). However, DNSSEC requires a larger payload size in a DNS query, which could yield packet fragmentation, truncation, and TCP fallback. In this paper, we characterize the DNS query response behavior from client (caching resolver) and server (authoritative server) views. For the client view, we analyze the offered maximum response sizes (EDNS0 size) from resolvers and actual response sizes at the servers of the ccTLD of jp (JP-DNS). For the server view, we characterize DNS response size distributions for different TLDs by actively querying the top 300K popular domain names in the Tranco list. The main findings of our work are as follows: (1) We confirm an increase of an EDNS0 size of 1232B from clients in Apr. 2021 despite not being so evident during the DNS flag day event in Oct. 2020. (2) Query truncation and TCP fallbacks almost occurred in an EDNS0 size of 512B, but its queries were legitimate A/AAAA records. (3) The response size distributions of popular domains are significantly different in TLDs, and median response size does not fit the minimum size (512B) in many TLDs. (4) We clarify that several issues affect the response size distribution: the ratio of signed zones and configurations (e.g., NSEC/NSEC3, signed algorithms).
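
The relationship between the advertised EDNS0 size, truncation, and TCP fallback mentioned in this abstract can be sketched as a simple decision rule. This is a simplified illustration: the function names are hypothetical, and the sketch ignores IP fragmentation and any server-side size limit.

```python
from typing import Optional

CLASSIC_UDP_LIMIT = 512  # bytes; DNS-over-UDP limit when no EDNS0 size is advertised


def udp_payload_limit(edns0_size: Optional[int]) -> int:
    """Largest UDP response the client says it can accept."""
    if edns0_size is None:
        return CLASSIC_UDP_LIMIT
    # Advertised sizes below 512 bytes are treated as 512.
    return max(CLASSIC_UDP_LIMIT, edns0_size)


def needs_tcp_fallback(response_size: int, edns0_size: Optional[int]) -> bool:
    """True when the response would be truncated over UDP,
    which prompts the resolver to retry over TCP."""
    return response_size > udp_payload_limit(edns0_size)
```

For example, a 1400-byte DNSSEC-signed response fits within an advertised size of 4096 bytes but not within 1232 bytes, so the latter case would be truncated and retried over TCP.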


21/02/23 at 16:00: Tuto, please: Coq, by Matthieu Baty


09/02/23 at 15:30: “DexLego: Reassembleable Bytecode Extraction for Aiding Static Analysis”, presented by Jean-Marie Mineau

Abstract: The scale of Android applications in the market is growing rapidly. To efficiently detect the malicious behavior in these applications, an array of static analysis tools are proposed. However, static analysis tools suffer from code hiding techniques like packing, dynamic loading, self modifying, and reflection. In this paper, we thus present DexLego, a novel system that performs a reassembleable bytecode extraction for aiding static analysis tools to reveal the malicious behavior of Android applications. DexLego leverages just-in-time collection to extract data and bytecode from an application at runtime, and reassembles them to a new Dalvik Executable (DEX) file offline. The experiments on DroidBench and real-world applications show that DexLego precisely reconstructs the behavior of an application in the reassembled DEX file, and significantly improves analysis result of the existing static analysis systems.


02/02/23 at 15:30: “V0LTpwn: Attacking x86 Processor Integrity from Software”, presented by Lionel Hemmerlé

Abstract: Fault-injection attacks have been proven in the past to be a reliable way of bypassing hardware-based security measures, such as cryptographic hashes, privilege and access permission enforcement, and trusted execution environments. However, traditional fault-injection attacks require physical presence, and hence, were often considered out of scope in many real-world adversary settings. In this paper we show this assumption may no longer be justified on x86. We present V0LTpwn, a novel hardware-oriented but software-controlled attack that affects the integrity of computation in virtually any execution mode on modern x86 processors. To the best of our knowledge, this represents the first attack on the integrity of the x86 platform from software. The key idea behind our attack is to undervolt a physical core to force non-recoverable hardware faults. Under a V0LTpwn attack, CPU instructions will continue to execute with erroneous results and without crashes, allowing for exploitation. In contrast to recently presented side-channel attacks that leverage vulnerable speculative execution, V0LTpwn is not limited to information disclosure, but allows adversaries to affect execution, and hence, effectively breaks the integrity goals of modern x86 platforms. In our detailed evaluation we successfully launch software-based attacks against Intel SGX enclaves from a privileged process to demonstrate that a V0LTpwn attack can successfully change the results of computations within enclave execution across multiple CPU revisions.


19/01/23 at 15:30: “Geneva: Evolving Censorship Evasion Strategies”, presented by Lucas Aubard

Abstract: Researchers and censoring regimes have long engaged in a cat-and-mouse game, leading to increasingly sophisticated Internet-scale censorship techniques and methods to evade them. In this paper, we take a drastic departure from the previously manual evade-detect cycle by developing techniques to automate the discovery of censorship evasion strategies. We present Geneva, a novel genetic algorithm that evolves packet-manipulation-based censorship evasion strategies against nation-state level censors. Geneva composes, mutates, and evolves sophisticated strategies out of four basic packet manipulation primitives (drop, tamper headers, duplicate, and fragment). With experiments performed both in-lab and against several real censors (in China, India, and Kazakhstan), we demonstrate that Geneva is able to quickly and independently re-derive most strategies from prior work, and derive novel subspecies and altogether new species of packet manipulation strategies. Moreover, Geneva discovers successful strategies that prior work posited were not effective, and evolves extinct strategies into newly working variants. We analyze the novel strategies Geneva creates to infer previously unknown behavior in censors. Geneva is a first step towards automating censorship evasion; to this end, we have made our code and data publicly available.


15/12/22 at 15:30: “Algorithms for learning regular expressions from positive data”, presented by Pierre-François Gimenez

Abstract: We describe algorithms that directly infer very simple forms of 1-unambiguous regular expressions from positive data. Thus, we characterize the regular language classes that can be learned this way, both in terms of regular expressions and in terms of (not necessarily minimal) deterministic finite automata.


01/12/22 at 14:00: “Denoising Diffusion Probabilistic Models”, presented by Adrien Schoen

Abstract: We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256×256 LSUN, we obtain sample quality similar to ProgressiveGAN.


10/11/22 at 10:00: “Positive-Unlabeled Learning with Non-Negative Risk Estimator”, presented by Hélène Orsini

Abstract: From only positive (P) and unlabeled (U) data, a binary classifier could be trained with PU learning, in which the state of the art is unbiased PU learning. However, if its model is very flexible, empirical risks on training data will go negative, and we will suffer from serious overfitting. In this paper, we propose a non-negative risk estimator for PU learning: when getting minimized, it is more robust against overfitting, and thus we are able to use very flexible models (such as deep neural networks) given limited P data. Moreover, we analyze the bias, consistency, and mean-squared-error reduction of the proposed risk estimator, and bound the estimation error of the resulting empirical risk minimizer. Experiments demonstrate that our risk estimator fixes the overfitting problem of its unbiased counterparts.
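
The difference between the unbiased and the non-negative risk estimators can be sketched numerically. This is a minimal illustration, assuming the commonly cited form of the estimators: with class prior π, the unbiased risk is π·R_p⁺ + (R_u⁻ − π·R_p⁻), and the non-negative variant clamps the bracketed term at zero. The sigmoid loss, the classifier scores, and the function names below are illustrative choices, not taken from the abstract.

```python
import math
from typing import Sequence, Tuple


def sigmoid_loss(score: float, label: int) -> float:
    """Sigmoid surrogate loss: small when the sign of score agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean_loss(scores: Sequence[float], label: int) -> float:
    return sum(sigmoid_loss(s, label) for s in scores) / len(scores)


def pu_risks(pos_scores: Sequence[float], unl_scores: Sequence[float],
             prior: float) -> Tuple[float, float]:
    """Return (unbiased_risk, non_negative_risk) for classifier scores
    on positive and unlabeled samples, given the positive class prior."""
    risk_p_pos = mean_loss(pos_scores, +1)  # positives treated as positive
    risk_p_neg = mean_loss(pos_scores, -1)  # positives treated as negative
    risk_u_neg = mean_loss(unl_scores, -1)  # unlabeled treated as negative
    # Estimate of the risk on true negatives; can go negative when overfitting.
    negative_part = risk_u_neg - prior * risk_p_neg
    unbiased = prior * risk_p_pos + negative_part
    non_negative = prior * risk_p_pos + max(0.0, negative_part)
    return unbiased, non_negative
```

With an overfitted classifier that scores every positive sample at +10 and every unlabeled sample at -10 (prior 0.5), the unbiased estimate becomes negative while the clamped estimate stays at or above zero, which is the failure mode the abstract describes.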


27/10/22 at 15:30: “A Survey on Heterogeneous Graph Embedding: Methods, Techniques, Applications and Sources”, presented by Vincent Raulin

Abstract: Heterogeneous graphs (HGs) also known as heterogeneous information networks have become ubiquitous in real-world scenarios; therefore, HG embedding, which aim to learn representations in a lower-dimension space while preserving the heterogeneous structures and semantics for downstream tasks (e.g., node/graph classification, node clustering, link prediction), has drawn considerable attentions. In this survey, we perform a comprehensive review of the recent development on HG embedding methods and techniques. We first introduce the basic concepts of HG and discuss the unique challenges brought by the heterogeneity for HG embedding; and then we systemically survey and categorize the state-of-the-art HG embedding methods based on the information they used in the learning process. In particular, for each representative HG embedding method, we provide detailed introduction and further analyze its pros and cons; meanwhile, we explore the transformativeness and applicability of different types of HG embedding methods in the real-world industrial environments. We further present several widely deployed systems that have demonstrated the success of HG embedding techniques in resolving real-world application problems. To facilitate future research and applications in this area, we also summarize the open-source code, existing graph learning platforms and benchmark datasets. Finally, we explore the additional issues and challenges of HG embedding and forecast the future research directions in this field.


12/10/22 at 10:30: “DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning”, presented by Maxime Lanvin

Abstract: Anomaly detection is a critical step towards building a secure and trustworthy system. The primary purpose of a system log is to record system states and significant events at various critical points to help debug system failures and perform root cause analysis. Such log data is universally available in nearly all computer systems. Log data is an important and valuable resource for understanding system status and performance issues; therefore, the various system logs are naturally excellent source of information for online monitoring and anomaly detection. We propose DeepLog, a deep neural network model utilizing Long Short-Term Memory (LSTM), to model a system log as a natural language sequence. This allows DeepLog to automatically learn log patterns from normal execution, and detect anomalies when log patterns deviate from the model trained from log data under normal execution. In addition, we demonstrate how to incrementally update the DeepLog model in an online fashion so that it can adapt to new log patterns over time. Furthermore, DeepLog constructs workflows from the underlying system log so that once an anomaly is detected, users can diagnose the detected anomaly and perform root cause analysis effectively. Extensive experimental evaluations over large log data have shown that DeepLog has outperformed other existing log-based anomaly detection methods based on traditional data mining methodologies.


09/06/22 at 15:30: “A Tutorial on Energy-Based Learning”, presented by Adrien Schoen

Abstract: Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.


19/05/22 at 15:30: “A Unified Approach to Interpreting Model Predictions”, presented by Maxime Lanvin

Abstract: Understanding why a model makes a certain prediction can be as crucial as the prediction’s accuracy in many applications. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, such as ensemble or deep learning models, creating a tension between accuracy and interpretability. In response, various methods have recently been proposed to help users interpret the predictions of complex models, but it is often unclear how these methods are related and when one method is preferable over another. To address this problem, we present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties. The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, we present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
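
The notion of assigning each feature an additive importance value can be illustrated with a brute-force Shapley value computation. This is a sketch under simplifying assumptions, not the SHAP library or the approximation methods from the paper: a "missing" feature is replaced by a fixed baseline value, every feature subset is enumerated (so it only scales to a handful of features), and `toy_model` is a hypothetical example.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence, Set


def shapley_values(f: Callable[[List[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley value of each feature for the prediction f(x),
    relative to f(baseline)."""
    n = len(x)

    def value(present: Set[int]) -> float:
        # Features in `present` take their value from x, the rest from the baseline.
        return f([x[i] if i in present else baseline[i] for i in range(n)])

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                phi[i] += weight * (value(present | {i}) - value(present))
    return phi


def toy_model(z: List[float]) -> float:
    # Hypothetical model: one additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]
```

For `toy_model` at `x = [1, 2, 3]` with an all-zero baseline, the additive feature receives its full contribution and the two interacting features share theirs equally; the values sum to the difference between the prediction and the baseline prediction, which is the additivity property the abstract refers to.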


05/05/22 at 15:30: “Understanding Attention and Generalization in Graph Neural Networks”, presented by Hélène Orsini

Abstract: We aim to better understand attention over nodes in graph neural networks (GNNs) and identify factors influencing its effectiveness. We particularly focus on the ability of attention GNNs to generalize to larger, more complex or noisy graphs. Motivated by insights from the work on Graph Isomorphism Networks, we design simple graph reasoning tasks that allow us to study attention in a controlled environment. We find that under typical conditions the effect of attention is negligible or even harmful, but under certain conditions it provides an exceptional gain in performance of more than 60% in some of our classification tasks. Satisfying these conditions in practice is challenging and often requires optimal initialization or supervised training of attention. We propose an alternative recipe and train attention in a weakly-supervised fashion that approaches the performance of supervised models, and, compared to unsupervised models, improves results on several synthetic as well as real datasets. Source code and datasets are available at https://github.com/bknyaz/graph_attention_pool.


21/04/22 at 15:30: “Heterogeneous Graph Attention Network”, presented by Vincent Raulin

Abstract: Graph neural network, as a powerful graph representation technique based on deep learning, has shown superior performance and attracted considerable research interest. However, it has not been fully considered in graph neural network for heterogeneous graph which contains different types of nodes and links. The heterogeneity and rich semantic information bring great challenges for designing a graph neural network for heterogeneous graph. Recently, one of the most exciting advancements in deep learning is the attention mechanism, whose great potential has been well demonstrated in various areas. In this paper, we first propose a novel heterogeneous graph neural network based on the hierarchical attention, including node-level and semantic-level attentions. Specifically, the node-level attention aims to learn the importance between a node and its meta-path based neighbors, while the semantic-level attention is able to learn the importance of different meta-paths. With the learned importance from both node-level and semantic-level attention, the importance of node and meta-path can be fully considered. Then the proposed model can generate node embedding by aggregating features from meta-path based neighbors in a hierarchical manner. Extensive experimental results on three real-world heterogeneous graphs not only show the superior performance of our proposed model over the state-of-the-arts, but also demonstrate its potentially good interpretability for graph analysis.


31/03/22 at 15:30: “Explainable Artificial Intelligence Approaches: A Survey”, presented by Pierre-François Gimenez

Abstract: The lack of explainability of a decision from an Artificial Intelligence (AI) based “black box” system/model, despite its superiority in many real-world applications, is a key stumbling block for adopting AI in many high stakes applications of different domain or industry. While many popular Explainable Artificial Intelligence (XAI) methods or approaches are available to facilitate a human-friendly explanation of the decision, each has its own merits and demerits, with a plethora of open challenges. We demonstrate popular XAI methods with a mutual case study/task (i.e., credit default prediction), analyze for competitive advantages from multiple perspectives (e.g., local, global), provide meaningful insight on quantifying explainability, and recommend paths towards responsible or human-centered AI using XAI as a medium. Practitioners can use this work as a catalog to understand, compare, and correlate competitive advantages of popular XAI methods. In addition, this survey elicits future research directions towards responsible or human-centric AI systems, which is crucial to adopt AI in high stakes applications.


17/03/22 at 15:00: “Efficient Graphlet Counting for Large Networks”, presented by Maxime Lanvin

Abstract: From social science to biology, numerous applications often rely on graphlets for intuitive and meaningful characterization of networks at both the global macro-level as well as the local micro-level. While graphlets have witnessed a tremendous success and impact in a variety of domains, there has yet to be a fast and efficient approach for computing the frequencies of these subgraph patterns. However, existing methods are not scalable to large networks with millions of nodes and edges, which impedes the application of graphlets to new problems that require large-scale network analysis. To address these problems, we propose a fast, efficient, and parallel algorithm for counting graphlets of size k={3,4}-nodes that take only a fraction of the time to compute when compared with the current methods used. The proposed graphlet counting algorithms leverages a number of proven combinatorial arguments for different graphlets. For each edge, we count a few graphlets, and with these counts along with the combinatorial arguments, we obtain the exact counts of others in constant time. On a large collection of 300+ networks from a variety of domains, our graphlet counting strategies are on average 460x faster than current methods. This brings new opportunities to investigate the use of graphlets on much larger networks and newer applications as we show in the experiments. To the best of our knowledge, this paper provides the largest graphlet computations to date as well as the largest systematic investigation on over 300+ networks from a variety of domains.
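
The per-edge idea in this abstract can be illustrated on the smallest case, graphlets of size 3. The sketch below is not the paper's algorithm (which targets k = 3 and 4 and runs in parallel); it is a minimal illustration, with a hypothetical function name, of how one directly counted quantity per edge (triangles through that edge) gives the other count (open wedges) from node degrees in constant time.

```python
def count_3node_graphlets(edges):
    """Return (triangles, open_wedges) for an undirected simple graph.

    Illustrative sketch only: the function name and input format
    (an iterable of node pairs) are assumptions, not the paper's API.
    """
    # Deduplicate undirected edges and drop self-loops.
    edge_set = {frozenset((u, v)) for u, v in edges if u != v}

    # Build adjacency sets.
    adj = {}
    for e in edge_set:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    triangle_hits = 0  # each triangle is seen once per edge, so 3 times
    wedge_hits = 0     # each open wedge is seen once per edge, so 2 times
    for e in edge_set:
        u, v = tuple(e)
        # Counted directly: triangles that contain the edge (u, v).
        tri_e = len(adj[u] & adj[v])
        triangle_hits += tri_e
        # Derived in constant time from degrees: the remaining neighbours
        # of u and of v each form an open wedge with this edge.
        wedge_hits += (len(adj[u]) - 1 - tri_e) + (len(adj[v]) - 1 - tri_e)

    return triangle_hits // 3, wedge_hits // 2


# A triangle 0-1-2 with a pendant node 3 attached to node 2.
print(count_3node_graphlets([(0, 1), (1, 2), (0, 2), (2, 3)]))  # (1, 2)
```

Only the neighbourhood intersection requires real work per edge; the wedge count follows from a subtraction, which is the kind of combinatorial shortcut the abstract refers to.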


03/03/22 at 15:30: “On the Security Risks of AutoML”, presented by Hélène Orsini

Abstract: Neural architecture search (NAS) represents an emerging machine learning (ML) paradigm that automatically searches for model architectures tailored to given tasks, which significantly simplifies the development of ML systems and propels the trend of ML democratization. Yet, thus far little is known about the potential security risks incurred by NAS, which is concerning given the increasing use of NAS-generated models in critical domains. This work represents a solid initial step towards bridging the gap. First, through an extensive empirical study of 10 popular NAS methods, we show that compared with their manually designed counterparts, NAS-generated models tend to suffer greater vulnerabilities to various malicious manipulations (e.g., adversarial evasion, model poisoning, functionality stealing). Further, with both empirical and analytical evidence, we provide possible explanations for such phenomena: given the prohibitive search space and training cost, most NAS methods favor models that converge fast at early training stages; this preference results in architectural properties associated with attack vulnerabilities (e.g., high loss smoothness, low gradient variance). Our findings not only reveal the relationships between model characteristics and attack vulnerabilities but also suggest the inherent connections underlying different attacks. Finally, we discuss potential remedies to mitigate such drawbacks, including increasing cell depth and suppressing skip connects, which lead to several promising research directions.


17/02/22 at 15:30: “PcapGAN: Packet Capture File Generator by Style-Based Generative Adversarial Networks”, presented by Adrien Schoen

Abstract: After the advent of GAN technology, many varied models have been studied and applied to various fields such as image and audio. However, in the field of cyber data, which has the same issue of data shortage, the research on data augmentation is insufficient. To solve this problem, we propose PcapGAN that can augment pcap data, a kind of network data. The proposed model includes an encoder, a data generator, and a decoder. The encoder subdivides network data into four parts. The generator generates new data for each part of the data. The decoder combines the generated data into realistic network data. We demonstrate the similarity between the generated data and original data, and validation of the generated data by increased performance of intrusion detection algorithms.


10/02/22 at 15:30: “MetaGraph2Vec: Complex Semantic Path Augmented Heterogeneous Network Embedding”, presented by Vincent Raulin

Abstract: Network embedding in heterogeneous information networks (HINs) is a challenging task, due to complications of different node types and rich relationships between nodes. As a result, conventional network embedding techniques cannot work on such HINs. Recently, metapath-based approaches have been proposed to characterize relationships in HINs, but they are ineffective in capturing rich contexts and semantics between nodes for embedding learning, mainly because (1) metapath is a rather strict single path node-node relationship descriptor, which is unable to accommodate variance in relationships, and (2) only a small portion of paths can match the metapath, resulting in sparse context information for embedding learning. In this paper, we advocate a new metagraph concept to capture richer structural contexts and semantics between distant nodes. A metagraph contains multiple paths between nodes, each describing one type of relationships, so the augmentation of multiple metapaths provides an effective way to capture rich contexts and semantic relations between nodes. This greatly boosts the ability of metapath-based embedding techniques in handling very sparse HINs. We propose a new embedding learning algorithm, namely MetaGraph2Vec, which uses metagraph to guide the generation of random walks and to learn latent embeddings of multi-typed HIN nodes. Experimental results show that MetaGraph2Vec is able to outperform the state-of-the-art baselines in various heterogeneous network mining tasks such as node classification, node clustering, and similarity search.
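
To make the notion of a type-guided random walk concrete, here is a minimal sketch of the simpler metapath case that the abstract contrasts with. It is not MetaGraph2Vec itself, which guides walks with a metagraph combining several paths. The toy graph, the type labels ("A" for author, "P" for paper, "V" for venue) and the function name are assumptions made for illustration.

```python
import random


def metapath_walk(adj, node_type, start, metapath, length, rng=None):
    """Random walk whose node types follow a cyclic metapath such as A-P-A.

    Illustrative sketch: all names and the graph encoding are hypothetical.
    """
    rng = rng or random.Random()
    if metapath[0] != metapath[-1]:
        raise ValueError("metapath must start and end with the same type")
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the metapath's first type")

    pattern = metapath[:-1]  # the part that repeats, e.g. ["A", "P"]
    walk = [start]
    while len(walk) < length:
        wanted = pattern[len(walk) % len(pattern)]
        # Keep only neighbours of the type required at this step.
        candidates = [n for n in adj.get(walk[-1], []) if node_type[n] == wanted]
        if not candidates:
            break  # no matching neighbour: the walk stops early
        walk.append(rng.choice(candidates))
    return walk


# Toy heterogeneous network: two authors, two papers, one venue.
NODE_TYPE = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
ADJ = {
    "a1": ["p1"],
    "a2": ["p1", "p2"],
    "p1": ["a1", "a2", "v1"],
    "p2": ["a2", "v1"],
    "v1": ["p1", "p2"],
}

print(metapath_walk(ADJ, NODE_TYPE, "a1", ["A", "P", "A"], 5, random.Random(0)))
```

The early stop when no neighbour has the required type shows the sparsity problem the abstract mentions: a strict single-path pattern discards many walks, which is what a metagraph with several alternative paths is meant to relieve.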
