Supervisors: Zoltan Miklos (zoltan.miklos@irisa.fr) DRUID, Thomas Guyet (thomas.guyet@irisa.fr) DREAM
Description:
The principal goal of data mining is to extract useful knowledge from raw data. For example, given a large collection of unstructured Web documents, we might want to construct a graph that represents the entities (people, companies, geographic locations) and their relationships (placeOfBirth, affiliatedTo, etc.). Such an entity-relationship graph captures important knowledge and provides a useful abstraction over the unstructured textual documents. Depending on the data and the application context, we might want to mine structured knowledge represented in the form of a graph that we call a knowledge graph. While such models are widely used, their dynamic aspect has received very little attention so far: entities might merge or split over time, and their relationships might change (e.g., someone changes their affiliation, companies merge, etc.).
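To fix ideas, here is a minimal Python sketch of such an entity-relationship graph, stored as (subject, relation, object) triples; the entities, relations, facts, and the use of networkx are illustrative choices for this example, not part of the project specification.

```python
# A minimal sketch of the kind of entity-relationship graph the project
# targets; all entities, relation names, and facts are illustrative only.
import networkx as nx

kg = nx.MultiDiGraph()

# Facts as (subject, relation, object) triples, e.g. extracted from text.
triples = [
    ("Alice", "placeOfBirth", "Rennes"),
    ("Alice", "affiliatedTo", "IRISA"),
    ("IRISA", "locatedIn", "Rennes"),
]
for subj, rel, obj in triples:
    kg.add_edge(subj, obj, key=rel)

# Query: who is affiliated to some organisation?
for subj, obj, rel in kg.edges(keys=True):
    if rel == "affiliatedTo":
        print(f"{subj} --{rel}--> {obj}")
```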
The proposed project consists of the following main steps:
- Discovering a knowledge graph from raw data, relying on statistical and machine learning techniques.
- Analyzing the constructed graph based on expert feedback and on consistency rules defined by domain experts, and detecting consistency violations.
- Repairing the detected errors or suggesting potential repairs to the expert, and guiding the expert in the repair process (then returning to step 2); a small sketch of these two steps is given after this list.
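As an illustration of steps 2 and 3, the following Python sketch checks one hand-written consistency rule (a functional dependency: placeOfBirth has at most one value per entity) over a toy set of triples and reports the conflicting facts as repair candidates; the rule, the relation names, and the facts are all assumptions made for the example.

```python
# A minimal sketch of steps 2-3: checking an expert-defined consistency
# rule over a set of triples and reporting violations as repair
# candidates for the expert. Rule and facts are illustrative.

# Rule: "placeOfBirth is functional" (each entity has at most one value).
FUNCTIONAL_RELATIONS = {"placeOfBirth"}

def find_violations(triples):
    """Return {(subject, relation): {conflicting objects}} for every
    functional relation that holds more than one value."""
    values = {}
    for subj, rel, obj in triples:
        if rel in FUNCTIONAL_RELATIONS:
            values.setdefault((subj, rel), set()).add(obj)
    return {k: v for k, v in values.items() if len(v) > 1}

triples = [
    ("Alice", "placeOfBirth", "Rennes"),
    ("Alice", "placeOfBirth", "Nantes"),  # conflicting extraction
]

for (subj, rel), objs in find_violations(triples).items():
    # Step 3: each conflicting object is a candidate repair; the expert
    # (or a confidence score) decides which fact to keep.
    print(f"Violation: {subj}.{rel} has {len(objs)} values: {sorted(objs)}")
```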
While a number of techniques for these tasks already exist, the goal of this internship is to propose new approaches that specifically tackle the dynamicity of knowledge graphs. For several applications, it is precisely the evolution of this graph that carries the important information (e.g., the medical history of patients). While machine learning techniques can deal with large datasets and give good approximations of the real (dynamic) knowledge graph, we expect that human input is needed to deal with the remaining problems. Consistency constraints have already been used successfully in a different context to minimize the necessary human involvement [1]. Knowledge representation and reasoning tools, such as logic programming, will be used to model expert knowledge and to detect inconsistencies.
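To make the role of such rules on a dynamic graph concrete, here is a small Python sketch in which facts carry validity intervals and an illustrative expert rule ("a person has at most one affiliation at a time") flags overlapping facts as inconsistent. The fact format, the rule, and the data are assumptions made for this example; in the project itself, logic programming tools could play this role.

```python
# A minimal sketch of temporal inconsistency detection on a dynamic
# knowledge graph, assuming facts carry validity intervals; the
# exclusivity rule on affiliations is an illustrative expert rule.
from dataclasses import dataclass

@dataclass
class TemporalFact:
    subject: str
    relation: str
    obj: str
    start: int  # validity interval, e.g. years
    end: int

facts = [
    TemporalFact("Alice", "affiliatedTo", "IRISA", 2010, 2014),
    TemporalFact("Alice", "affiliatedTo", "INRIA", 2013, 2016),  # overlap
]

def overlapping_affiliations(facts):
    """Yield pairs of affiliation facts for the same subject whose
    validity intervals overlap, violating the exclusivity rule."""
    affs = [f for f in facts if f.relation == "affiliatedTo"]
    for i, a in enumerate(affs):
        for b in affs[i + 1:]:
            if a.subject == b.subject and a.start < b.end and b.start < a.end:
                yield a, b

for a, b in overlapping_affiliations(facts):
    print(f"Inconsistent: {a.obj} [{a.start}-{a.end}] vs {b.obj} [{b.start}-{b.end}]")
```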
In the project we will work with several datasets, including Web document collections. A particularly important dataset is an (anonymized) dataset from the French National Health Insurance containing patients' medical history records (drug deliveries, medical visits, hospital stays). We would like to abstract sequences of low-level events (e.g., drug deliveries) into high-level events, the "treatments" of a patient. Besides the real medical dataset, we will also work with synthetic datasets from this domain. A concrete application of the project results involves estimating the quality of these synthetic datasets.
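As one possible way to perform this abstraction, the sketch below groups a patient's drug-delivery events into treatment episodes whenever consecutive deliveries of the same drug are separated by less than a fixed gap. The 35-day threshold, the event format, and the data are assumptions made for illustration only, not the project's actual abstraction method.

```python
# A minimal sketch of one possible abstraction step: grouping a
# patient's drug-delivery events into "treatment" episodes. The gap
# threshold and the event format are assumptions for illustration.
from itertools import groupby
from datetime import date, timedelta

MAX_GAP = timedelta(days=35)

# Low-level events: (drug, delivery date), synthetic for illustration.
deliveries = [
    ("metformin", date(2016, 1, 5)),
    ("metformin", date(2016, 2, 3)),
    ("metformin", date(2016, 6, 20)),  # large gap: starts a new episode
]

def episodes(events, max_gap=MAX_GAP):
    """Split per-drug, date-sorted deliveries into treatment episodes."""
    out = []
    for drug, group in groupby(sorted(events), key=lambda e: e[0]):
        dates = [d for _, d in group]
        start = prev = dates[0]
        for d in dates[1:]:
            if d - prev > max_gap:
                out.append((drug, start, prev))
                start = d
            prev = d
        out.append((drug, start, prev))
    return out

for drug, start, end in episodes(deliveries):
    print(f"treatment({drug}): {start} .. {end}")
```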
Bibliography
[1] Quoc Viet Hung Nguyen, Tri Kurniawan Wijaya, Zoltan Miklos, Karl Aberer, Eliezer Levy, Victor Shafran, Avigdor Gal, and Matthias Weidlich. Minimizing Human Effort in Reconciling Match Networks. In Proceedings of the 32nd International Conference on Conceptual Modeling (ER 2013), Hong Kong, China, November 2013.
[2] Johannes Hoffart, Fabian Suchanek, Klaus Berberich, and Gerhard Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence, 194:28-61, 2013.
[3] Marc Suling, Robert Weber, and Iris Pigeot. Data Mining in Pharmacoepidemiological Databases. In Robustness and Complex Data Structures, 2013.