Automating the exploration of the KDD search space
Search space modeling
Automating the exploration of the KDD search space first requires a unifying framework that captures all components of KDD workflows (data, operators, models, domain knowledge) on a solid theoretical basis and allows them to be composed. No such framework exists today.
Our main plan is to investigate a type system for KDD operators, exploiting expressive type constructions such as dependent types. This solution, unexplored so far, seems natural given the functional nature of KDD operators, and it makes operator composition efficient to check. Should we need more expressiveness (for example to capture part of the operator semantics), we plan to investigate relational sketches, which have been shown to be well adapted to representing complex KDD operators. In both cases, we are interested in determining how to represent a rich variety of operators, as well as pieces of domain knowledge, in the chosen formalism.
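To illustrate the idea, here is a minimal Python sketch (with hypothetical operator and type names) of what typed operator composition checking could look like; a real dependent-type formalism would capture far richer constraints, such as schema compatibility or parameter bounds.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for a KDD operator type system:
# each operator is typed by the kind of data it consumes and produces.

@dataclass(frozen=True)
class DataType:
    kind: str  # e.g. "raw_table", "clean_table", "itemsets"

@dataclass(frozen=True)
class Operator:
    name: str
    input: DataType
    output: DataType

def composable(f: Operator, g: Operator) -> bool:
    """The workflow f;g is well-typed iff f's output type matches g's input type."""
    return f.output == g.input

clean = Operator("clean", DataType("raw_table"), DataType("clean_table"))
mine = Operator("mine_itemsets", DataType("clean_table"), DataType("itemsets"))

assert composable(clean, mine)      # clean ; mine_itemsets is a valid workflow
assert not composable(mine, clean)  # the reverse composition is rejected
```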
Search space exploration
Regardless of the formalism selected, the search space of KDD workflows is huge: efficient exploration methods and heuristics will be necessary.
Based on the type system representation, an original research direction is to adapt proof assistants (for example Coq) to treat KDD operators as “theorems” and let them perform the composition, possibly guided by hints (called tactics in this domain) coming either from a human analyst or from the results of preprocessing over the data. We will also explore other ways to traverse the search space, such as the logical language ASP (Answer Set Programming).
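As a toy illustration of this direction, the following Python sketch treats operators as inference rules (input type to output type) and synthesizes a workflow by searching for a derivation from the dataset's type to the analyst's goal type. All operator names are hypothetical, and a proof assistant would of course handle far richer types and tactics.

```python
from collections import deque

# Each operator "proves" its output type from its input type; workflow
# synthesis becomes a search for a derivation chain. Names are illustrative.
OPERATORS = [
    ("impute_missing", "raw_table", "clean_table"),
    ("discretize", "clean_table", "transactions"),
    ("mine_itemsets", "transactions", "itemsets"),
    ("cluster", "clean_table", "clusters"),
]

def synthesize(start: str, goal: str):
    """Breadth-first search for a shortest operator chain from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, chain = queue.popleft()
        if current == goal:
            return chain
        for name, src, dst in OPERATORS:
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [name]))
    return None  # no workflow found

print(synthesize("raw_table", "itemsets"))
# ['impute_missing', 'discretize', 'mine_itemsets']
```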
Model selection
Once KDD workflows are generated, their execution produces models, each of which explains some aspects of the data. Such models can, for example, provide a clustering of the data, a wide variety of patterns capturing repeating structures, or conversely anomalies in the data. A critical task is to select the models most likely to bring novel and useful knowledge to the users.
Our research interest is in “multi-model” approaches that combine results from different models. We plan to investigate techniques such as the Minimum Description Length (MDL) principle across models of different types, whereas existing approaches focus on a single type of model. We also plan to exploit user preferences through skyline patterns. Another direction is to exploit complex domain simulators as “parameterized domain knowledge” to help discover novel knowledge in unseen cases.
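To make the MDL idea concrete, here is a minimal sketch: each candidate model is scored by the two-part description length L(M) + L(D|M), the cost of describing the model plus the cost of describing the data given the model, and the model yielding the shortest total description is selected. The candidate models and their code lengths below are purely illustrative.

```python
# Two-part MDL model selection over models of different types: the
# numbers are hypothetical code lengths in bits, as a real system
# would compute them from actual encodings of models and data.
candidates = {
    # model name: (L(M) in bits, L(D|M) in bits)
    "clustering_k3": (120.0, 850.0),
    "itemset_summary": (300.0, 610.0),
    "anomaly_model": (80.0, 990.0),
}

def total_length(name: str) -> float:
    model_bits, data_bits = candidates[name]
    return model_bits + data_bits

best = min(candidates, key=total_length)
print(best, total_length(best))  # 'itemset_summary' 910.0: shortest description
```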
Pattern mining operators
Several team members are well known for their expertise in pattern mining, the family of KDD operators dedicated to discovering regularities in data.
We will continue our work in this area, focusing on the integration of domain knowledge into pattern mining algorithms. Our interest will shift to pattern mining operators designed specifically for the automated workflow discovery approach, coupled with model selection techniques based on user preferences and quality criteria. This should lead to novel trade-offs: current approaches focus on presenting few results to humans, while here the goal is to let the automated system explore the solution space as widely as possible.
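As a minimal illustration of coupling pattern mining with domain knowledge, the following toy Python miner enumerates frequent itemsets and prunes them with a user-supplied admissibility predicate standing in for domain constraints. The dataset, threshold, and constraint are hypothetical, and a real operator would use a dedicated algorithm such as Apriori or FP-growth rather than naive enumeration.

```python
from itertools import combinations

# Toy transaction database (illustrative).
TRANSACTIONS = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def mine(transactions, min_support, admissible):
    """Enumerate frequent itemsets, pruned by a domain-knowledge predicate."""
    items = sorted(set().union(*transactions))
    results = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            if support(itemset, transactions) >= min_support and admissible(itemset):
                results.append(itemset)
    return results

# Example domain constraint: the domain expert rules out singleton patterns.
patterns = mine(TRANSACTIONS, min_support=2, admissible=lambda s: len(s) >= 2)
print(sorted(map(sorted, patterns)))
# [['bread', 'butter'], ['bread', 'milk'], ['butter', 'milk']]
```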
Scaling up through in-memory approaches
A single KDD workflow often requires significant computational resources, and approaches for exploring the space of KDD workflows will require even more, while users expect fast answers. We are interested in two kinds of parallel computing platforms: first, large clusters with modern programming models such as Apache Spark, to handle huge datasets; second, multi/many-core computers, to guarantee quick computations on the analyst's laptop.
Research directions: In both cases, we are interested in designing novel data mining algorithms adapted to these specific parallel environments, which requires an adequate partitioning of the data and of the algorithms' tasks. On multi/many-core machines, an additional difficulty is to reduce the bandwidth pressure of the algorithms; we investigate cache-aware and cache-oblivious algorithms for this issue.
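As an illustration of the cluster side, here is a minimal PySpark sketch of distributed support counting: per-partition counts are computed locally and then merged, so only aggregated counts travel over the network. The dataset and threshold are illustrative, and a full pattern mining algorithm would of course partition more than item counts.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("support-count").getOrCreate()
sc = spark.sparkContext

# Toy transaction database, distributed across the cluster's partitions.
transactions = sc.parallelize([
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "butter"],
])

MIN_SUPPORT = 2
frequent_items = (
    transactions
    .flatMap(lambda t: [(item, 1) for item in t])  # local map step per partition
    .reduceByKey(lambda a, b: a + b)               # partial sums merged across partitions
    .filter(lambda kv: kv[1] >= MIN_SUPPORT)       # keep frequent items only
)
print(frequent_items.collect())  # items with support >= 2, e.g. ('bread', 3)

spark.stop()
```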
User/system interactions
The approach envisioned by LACODAM aims to provide novel and useful knowledge to users: this requires efficient interaction methods to present the extracted knowledge and to collect feedback.
First, the discovered knowledge can lead to action recommendation. The team will build on its existing expertise in incremental rule learning for action recommendation, using its best pattern mining approaches as a starting point for rule discovery. This will allow action recommendation based on complete KDD workflows found automatically. Second, we are investigating data visualization techniques through ongoing collaborations that we will extend. This work currently targets the visualization of pattern mining results; we will work on ways to visualize and interact with the workflows and models discovered by the system, in order to acquire feedback for enriching and improving it. A further point of interest is to find ways to explain and justify the results to users; here we will exploit the expertise of some LACODAM members in logical argumentation. Another interest is to give users powerful ways to express their own interest in the results, in order to enrich model selection methods. LACODAM hosts Torsten Schaub (INRIA International Chair) on a project using ASP (Answer Set Programming), a powerful logic language, to ease the design of post-mining filters.
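To illustrate preference-based filtering of results, here is a minimal Python sketch of a skyline (Pareto) filter: the user picks the quality measures they care about, and only patterns that are not dominated on all of them are kept. The patterns and measures below are illustrative; in the envisioned system, such filters would rather be expressed declaratively, for example in ASP.

```python
# Illustrative mined patterns with two quality measures (both maximized here).
patterns = [
    {"name": "p1", "support": 50, "length": 3},
    {"name": "p2", "support": 40, "length": 5},
    {"name": "p3", "support": 30, "length": 2},  # dominated by p1 and p2
]

MEASURES = ("support", "length")  # user-selected measures (hypothetical)

def dominates(a, b):
    """a dominates b if a is at least as good on every measure, better on one."""
    return (all(a[m] >= b[m] for m in MEASURES)
            and any(a[m] > b[m] for m in MEASURES))

skyline = [p for p in patterns if not any(dominates(q, p) for q in patterns)]
print([p["name"] for p in skyline])  # ['p1', 'p2']
```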
Collaborative knowledge and feedback management
The traditional data mining setting involves a single analyst working on a single dataset. In such a setting, the analyst cannot benefit from the discoveries of other analysts working on the same or similar datasets, and other analysts cannot benefit from their work either: everyone has to start from scratch each time. The LACODAM vision is to shift to online workspaces centered on a given domain and community (bioinformatics or agriculture, for example). A workspace centralizes public datasets, KDD workflows, and domain knowledge, becoming a single point of entry for analysts of the domain willing to analyze a public or private dataset. Pieces of contributed domain knowledge can be made available to the whole community, greatly reducing individual work, and feedback given on both KDD workflows and input knowledge can be used to constantly improve the system.
As a first step, we will focus on preserving and reusing domain knowledge and feedback in the KDD approaches that we will propose.