ENS, S16

**Weighted Linear Bandits for Non-stationary environments.**

* September 30, October 4, 7, 11, 14, 18 and 21, 2019.

* 10:30-12:00

* Room 3052 of the Sophie Germain Building, Université Paris-Diderot (1, place Aurélie Nemours, 75013 Paris).

ENS, S16

**Yann Ramusat**: **Provenance-Based Routing in Probabilistic Graph Databases**.

Abstract: Optimizing routing queries over graphs is a rich research area with important applications, e.g., to road and transportation networks. Thanks to progress made during past decades, current-day systems are able to compute paths across cities in continent-sized areas, paths that are optimal in terms of distance or expected travel time. Nevertheless, the problem considered is very constrained: personal preferences cannot be handled effectively, and similar queries need to be computed separately. We explore a provenance-based framework as a way to extend the expressive power of routing queries, based on the idea of keeping track of meta-information about query results. This framework, useful for dealing with aspects such as uncertainty or preferences, cannot always benefit from the optimizations used for computing optimal routes, which leads to impractical algorithms. The aim of our PhD is to improve provenance-based routing techniques so that they can be applied to real transportation networks.

**Quentin Manière**: **Complexity of answering rooted counting queries over DL-Lite ontologies**.

12 July 2019, 10:30-11:30

ENS, S16

**Lean Kernels: A Bridge Between Justifications and Provenance**

A justification is a minimal set of constraints (or axioms) responsible for a consequence to follow from a knowledge base. Since the time required to find justifications depends on the size of the knowledge base, recent research has focused on trying to approximate the set of “relevant” axioms; that is, the union of all justifications. One such approximation is the lean kernel, which corresponds to the axioms that appear in at least one proof of the consequence. In this talk we will explore the notion of lean kernel, its properties, and its relation to the computation of provenance over knowledge bases.

Bio: Rafael Peñaloza is an Associate Professor at the University of Milano-Bicocca, Italy. He received his PhD from TU Dresden, Germany, where he remained as a post-doctoral researcher before moving briefly to the Free University of Bozen-Bolzano, Italy. His main research interests are in non-standard knowledge representation formalisms (mainly fuzzy and probabilistic logics) and reasoning services such as explanations and repairs.

28 June 2019, 10:30-11:30

ENS, S16

**Statistics on tables with non-curated entries**

“Dirty data” is said to be the data scientist's worst time sink. We investigate a specific data-quality challenge at the intersection of database curation and statistical learning.

Data tables often contain many non-numerical entries. Knowledge engineering in databases typically strives to recognize entities in these entries. For instance, deduplication or record linkage is used to match entities expressed differently across the data.

On the other hand, statistical techniques, as in machine learning, tend to cast all entries to numerical vectors, given that statistical models and regularities are easier to formulate in vector spaces. To analyze data with entries that represent discrete entities, a standard pipeline is to curate them with deduplication approaches, after which the resulting categories are “one-hot encoded”: represented in a vector space by orthogonal binary vectors. The success of such a pipeline depends crucially on the quality of the deduplication. In addition, it can create very high-dimensional vector representations that lead to statistical and computational problems in the machine learning step.

I will introduce statistical models of strings, useful to build low-dimensional representations of the entries that capture their morphological variations. These capture the string similarities between entries. They can also reveal latent categories that interpolate smoothly between various categories of entries without the need for cleaning or deduplication. Finally, we show that they lead to computationally and statistically efficient machine learning on non-curated tables.
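The contrast between one-hot encoding and a string-aware representation can be illustrated with a minimal sketch. It is not the statistical model presented in the talk: it substitutes a simple Jaccard similarity over character 3-grams, and the function names and job-title entries are hypothetical.

```python
def char_ngrams(s, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def one_hot(entry, categories):
    """Orthogonal binary encoding: 1.0 only on an exact match with a category."""
    return [1.0 if entry == c else 0.0 for c in categories]


def similarity_encode(entry, prototypes, n=3):
    """Encode an entry by its n-gram similarity to each prototype string."""
    return [ngram_similarity(entry, p, n) for p in prototypes]


# Hypothetical non-curated entry and reference categories.
categories = ["Police Officer", "Firefighter"]
entry = "Police Officer II"

exact = one_hot(entry, categories)            # no exact match: all zeros
soft = similarity_encode(entry, categories)   # close to the first category
```

With one-hot encoding the uncurated entry matches no category at all, whereas the similarity-based vector places it close to "Police Officer" without any deduplication step, and its dimension is the number of prototypes rather than the number of distinct strings.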

Bio: Gaël Varoquaux is a computer-science researcher at Inria. His research focuses on statistical learning tools for data science and scientific inference. He has pioneered the use of machine learning on brain images to map cognition and pathologies. More generally, he develops tools to make machine learning easier, with statistical models suited for real-life, uncurated data, and software for data science. He co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has contributed key methods for learning on spatial data, matrix factorizations, and modeling covariance matrices. He has a PhD in quantum physics and is a graduate of École Normale Supérieure, Paris.

3 May 2019, 10:30-11:30

ENS, S16

**Finding paths in large data graphs**

When dealing with large graphs, classical path-finding algorithms such as Dijkstra’s algorithm are unsuitable because they require too many disk accesses. To avoid the cost of these expensive accesses, while keeping a data structure of size quasi-linear in the size of the graph, we propose to guide the path search with a distance oracle obtained from a topological embedding of the graph.

I will present fresh experimental research on this topic, in which we obtain graph embeddings using learning algorithms from natural language processing. On some graphs, such as the DBLP publication graph, our topologically-guided path search visits only a small portion of the graph on average.

This is joint work with Charles Paperman.
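The idea of guiding a path search with an embedding-based distance oracle can be sketched as a greedy best-first search. This is an illustration under stated assumptions, not the speakers' algorithm: the graph, the 2D coordinates standing in for a learned embedding, and the function name `guided_path_search` are all hypothetical, and the path found is not guaranteed to be a shortest one.

```python
import heapq
import math


def guided_path_search(graph, embedding, source, target):
    """Greedy best-first search: always expand the node whose embedding is
    closest to the target's embedding.

    Returns (path, expanded), where path is None if the target is unreachable
    and expanded counts the nodes taken off the frontier.
    """
    def oracle(node):
        # Estimated distance to the target, read off the embedding.
        return math.dist(embedding[node], embedding[target])

    frontier = [(oracle(source), source)]
    parent = {source: None}
    expanded = 0
    while frontier:
        _, node = heapq.heappop(frontier)
        expanded += 1
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], expanded
        for neighbor in graph.get(node, ()):
            if neighbor not in parent:
                parent[neighbor] = node
                heapq.heappush(frontier, (oracle(neighbor), neighbor))
    return None, expanded


# Toy directed graph with a hypothetical 2D embedding.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["t"], "e": [], "t": []}
embedding = {"a": (0, 0), "b": (1, 0), "c": (0, 1),
             "d": (2, 0), "e": (0, 2), "t": (3, 0)}

path, expanded = guided_path_search(graph, embedding, "a", "t")
```

On this toy graph the search never expands the branch through `c` and `e`, because the oracle ranks those nodes as farther from the target; this is the sense in which a good embedding can limit the portion of the graph that is visited.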

ENS, S16

**Querying Attributed DL-Lite Ontologies Using Provenance Semirings**

Attributed description logic is a recently proposed formalism, targeted for graph-based representation formats, which enriches description logic concepts and roles with finite sets of attribute-value pairs, called annotations. One of the most important uses of annotations is to record provenance information. In this work, we first investigate the complexity of satisfiability and query answering for attributed DL-Lite ontologies. We then propose a new semantics, based on provenance semirings, for integrating provenance information with query answering. Finally, we establish complexity results for satisfiability and query answering under this semantics.
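To give a feel for what annotating answers with a provenance semiring means, here is a minimal generic sketch. It does not implement the semantics proposed in this work for attributed DL-Lite; it only shows the usual pattern in which annotations are multiplied within one derivation and added across alternative derivations. The `Semiring` class, `annotate_answer`, and the sample annotations are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    zero: Any                          # neutral element of plus
    one: Any                           # neutral element of times
    plus: Callable[[Any, Any], Any]    # combines alternative derivations
    times: Callable[[Any, Any], Any]   # combines facts used jointly


COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
TROPICAL = Semiring(float("inf"), 0.0, min, lambda a, b: a + b)


def annotate_answer(derivations, semiring):
    """Annotation of an answer: product within each derivation,
    sum over all derivations."""
    total = semiring.zero
    for derivation in derivations:
        product = semiring.one
        for annotation in derivation:
            product = semiring.times(product, annotation)
        total = semiring.plus(total, product)
    return total


# Two alternative derivations of the same answer, each using two facts.
derivations = [[2, 3], [1, 1]]

multiplicity = annotate_answer(derivations, COUNTING)   # 2*3 + 1*1
min_cost = annotate_answer(derivations, TROPICAL)       # min(2+3, 1+1)
holds = annotate_answer([[True, False], [True, True]], BOOLEAN)
```

The same derivations yield a multiplicity, a minimal cost, or a truth value depending only on the semiring plugged in, which is what makes the approach attractive for recording provenance.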

28 January 2019, 14:30-15:30

Inria, 2 rue Simone Iff, 75012 Paris, building C, room Jacques-Louis Lions 2

**Can We Trust SQL as a Data Analytics Tool?**

Multiple surveys show that SQL and relational databases remain the most common tools used by data scientists. But can we fully trust them? We give a few examples showing unexpected and counterintuitive behavior of even simple SQL queries that make one question analytics results obtained from relational DBMSs. The talk will then give a quick overview of two lines of work that attempt to overcome these problems. One concerns a formal semantics of SQL, to at least eliminate the element of surprise in query results. The other presents a revised evaluation scheme that restores correctness to the notoriously unpredictable behavior of SQL queries over databases with incomplete information. In conclusion we outline new directions of work to deliver trusted results from both relational SQL databases, and a new popular model of graph data.
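One well-known instance of such counterintuitive behavior involves `NOT IN` over a column containing `NULL`. Whether the talk uses this exact example is an assumption; the sketch below uses Python's standard `sqlite3` module with hypothetical `orders` and `paid` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER);
    CREATE TABLE paid (order_id INTEGER);
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO paid VALUES (1), (NULL);
""")

# "Which orders are unpaid?" written with NOT IN.
# Comparing any id with NULL yields unknown, so no row passes the filter.
unpaid_not_in = conn.execute(
    "SELECT id FROM orders "
    "WHERE id NOT IN (SELECT order_id FROM paid) ORDER BY id"
).fetchall()

# The seemingly equivalent NOT EXISTS formulation.
unpaid_not_exists = conn.execute(
    "SELECT id FROM orders o "
    "WHERE NOT EXISTS (SELECT 1 FROM paid p WHERE p.order_id = o.id) "
    "ORDER BY id"
).fetchall()

print(unpaid_not_in)      # []
print(unpaid_not_exists)  # [(2,), (3,)]
conn.close()
```

A single `NULL` in `paid` makes the first query return no rows at all, while the second returns orders 2 and 3, even though many users would consider the two queries interchangeable.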

Leonid Libkin is Professor of Foundations of Data Management in the School of Informatics at the University of Edinburgh. He was previously a Professor at the University of Toronto and a member of research staff at Bell Laboratories in Murray Hill. He received his PhD from the University of Pennsylvania in 1994. His main research interests are in the areas of data management and applications of logic in computer science. He has written five books and over 200 technical papers. His awards include a Marie Curie Chair Award, a Royal Society Wolfson Research Merit Award, and six Best Paper Awards. He has chaired programme committees of major database conferences (ACM PODS, ICDT) and was the conference chair of the 2010 Federated Logic Conference. He has given many invited conference talks and has served on multiple program committees and editorial boards. He is an ACM fellow, a fellow of the Royal Society of Edinburgh, and a member of Academia Europaea.