Smart mining of sequences with ASP

Contact

Guyet Thomas (thomas.guyet@irisa.fr), Torsten Schaub (torsten.schaub@inria.fr)

Mots-clés

Constraints programming, Answer Set Programming, pattern mining

Description

The objective of this internship is to develop a sequence mining program using declarative programming (ASP).

The pattern mining aims to discover interesting knowledge from a large databases of transactions (shopping carts for customers – itemsets) or series of transactions (sequence of actions or purchases). For two decades, research was focused on proposing very efficient algorithmic approaches dedicated to a single mining task. But recent works are focused on generic approaches, i.e. approaches that integrate as soon as possible some expert knowledge to extract only the meaningful patterns and that mines patterns of various kinds (itemsets, sequences, graphs). Most of such approaches are based on constraint programming [1]. The principles of such approach is to generate candidate patterns and to prune those that do not satisfy the constraints. The challenge is to express the pattern mining tasks with constraints that can be efficiently be solved. In this area, mining efficiently the sequential patterns is still a challenge.

The ASP (Answer Set Programming) is a form of declarative programming based on a logical paradigm. The clingo ASP solver combines a high-level modeling capacities of ASP with state-of-the-art techniques from the area of Boolean constraint solving. We believe that it could be a relevant alternative to tackle the pattern mining tasks and to benefit from the high-level modeling language to make easy the definition of expert knowledge.

The objective of the internship is to develop an ASP program to mine patterns in large databases of sequences. The proposed approach is the “mining by compression” approach. These kind of pattern mining techniques are based on the idea that a pattern is interesting, when it efficiently compresses the data [4]. The interest is twofold: it extracts interesting patterns and it builds a summary of the data that can be shown to analysts. The trainee will explore this principle as a first approach. And propose an ASP program to do it. The proposed program could be compared to other existing ASP approaches [2,3] and with existing dedicated algorithms. Evaluation will be first done on simulated datasets. A second part of the internship will cover the application development of an extraction method of interesting patterns in sequences of drug prescriptions (provided by health insurance) using declarative programming (ASP). These data are particularly interesting because they are structured by domain knowledge imposing constraints on relevant extracting patterns.

This project is part of the research program of Pr. Torsten Schaub (University Potsdam/Berlin.) – INRIA International Chair since 2014 within the DREAM team. He coordinates the development of the Potassco system (Potsdam Answer Set Solving Collection) including clingo, the ASP solver, and he leads a team experienced in the techniques of ASP programming. As part of this collaboration, the intern will have the opportunity to visit the Potassco team at Potsdam (Berlin).

The candidate will have a strong interest in the pattern mining. Knowledge of Python is required. Skills in logic programming (Prolog or ASP) or constraint programming are welcome.

 

Bibliographie

  1. Tias Guns, Anton Dries, Guido Tack, Siegfried Nijssen, and Luc De Raedt. MiningZinc: a modeling language for constraint-based mining. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence (IJCAI’13), Francesca Rossi (Ed.). AAAI Press 1365-1372, 2013.
  2. Matti Järvisalo, Itemset Mining as a Challenge Application for Answer Set Enumeration, In Logic Programming and Nonmonotonic Reasoning, pp. 304-310, 2011.
  3. Thomas Guyet, Yves Moinard, René Quiniou, Using Answer Set Programming for pattern mining, In actes de la conférence IAF, 2014.
  4. Nikolaj Tatti and Jilles Vreeken. The long and the short of it: summarising event sequences with serial episodes. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’12), 2012.