Abstra: Toward Generic Abstractions for Data of Any Model

Users need help selecting the datasets they need, and data producers need tools that automatically generate documentation for the data they produce. Toward this goal, we develop Abstra, an all-in-one data abstraction tool that helps users get acquainted with tabular, tree-structured, or graph-structured data.

Abstra computes a description meant for humans, based on the idea that, regardless of its syntax or data model, any dataset holds collections of entities (records) that may be linked by relationships. Abstra relies on a common graph representation of any incoming dataset; it leverages Information Extraction to detect what the dataset is about, and relies on an original algorithm to select the core entity collections and their relations. Abstractions are shown both as HTML text and as a lightweight E/R diagram. A GUI also lets users tune the abstraction parameters and explore the dataset.
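To make the idea of a common graph representation more concrete, here is a minimal Python sketch that flattens a nested, JSON-like value into nodes and labeled edges. It is an illustration only: the function name `to_graph`, the node and edge layout, and the `"item"` edge label are assumptions made for this example, not Abstra's actual internal representation.

```python
from itertools import count


def to_graph(value, label="root"):
    """Flatten a nested dict/list value into (nodes, edges).

    nodes: {node_id: label for inner nodes, or the leaf value itself}
    edges: [(parent_id, child_id, edge_label), ...]
    Hypothetical layout, chosen for illustration only.
    """
    nodes, edges = {}, []
    ids = count()

    def visit(v, name):
        node_id = next(ids)
        if isinstance(v, dict):
            # A record: one child per property, edge labeled by the key.
            nodes[node_id] = name
            for key, child in v.items():
                edges.append((node_id, visit(child, key), key))
        elif isinstance(v, list):
            # A collection: one child per element, generic "item" edge.
            nodes[node_id] = name
            for element in v:
                edges.append((node_id, visit(element, name), "item"))
        else:
            # A leaf: store the value itself.
            nodes[node_id] = v
        return node_id

    visit(value, label)
    return nodes, edges


sample = {"auctions": [{"seller": "p1", "price": 10}, {"seller": "p2"}]}
nodes, edges = to_graph(sample)
```

In this sketch the list under `auctions` plays the role of an entity collection, and each dictionary inside it is one record. Tabular or RDF inputs would need their own, analogous conversion.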

Teaser video

A short video introducing Abstra is available here.

Case Studies

How to read a description?

The dataset abstraction of the XMark dataset (3M nodes, 3M edges), describing auctions on an online website.

The dataset abstraction of the NASA dataset (140K nodes, 174K edges), describing spacecraft launches.

Gallery

Below, we provide the Entity-Relationship schema of each dataset we experimented on.

JSON

RDF

XML

Download

Abstra is developed in Java and uses Postgres to store data. You can download it at the following link:

https://gitlab.inria.fr/cedar/abstra

Results: user study results are available in the GitLab repository and in this ZIP file.

Publications

When referring to this work, please cite the article published in EDBT 2024.

Nelly Barret, Ioana Manolescu, Prajna Upadhyay. Computing Generic Abstractions from Application Datasets. EDBT 2024 (research paper).
Nelly Barret, Ioana Manolescu, Prajna Upadhyay. Computing Generic Abstractions from Application Datasets. BDA 2023 (informal publication).
Nelly Barret, Ioana Manolescu, Prajna Upadhyay. Abstra: Toward Generic Abstractions for Data of Any Model. CIKM 2022 (demonstration paper).
Nelly Barret, Ioana Manolescu, Prajna Upadhyay. Abstra: Toward Generic Abstractions for Data of Any Model. BDA 2022 (informal publication).
Nelly Barret, Ioana Manolescu, Prajna Upadhyay. Towards Generic Abstractions for Data of Any Model. BDA 2021 (short paper).
Nelly Barret. Facilitating Heterogenous Dataset Understanding. BDA 2021 (PhD paper).

How to read a description?

A description contains the main classified entity collections and their relationships. The output presents the natural-language description on the left and the Entity-Relationship schema on the right. For each entity collection, you are given its name, its size (the number in parentheses), and the properties of the records in the collection. For each property, you are given its name and its frequency in the collection records. Finally, the relationships linking entity collections are described by their names.
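The information listed above (collection name, size, and per-property frequency) can be illustrated with a short Python sketch. It assumes that records are plain dictionaries and that "frequency" means the fraction of records carrying a property; both are assumptions of this example, as are the names `summarize_collection` and `render`. It is not Abstra's implementation or output format.

```python
from collections import Counter


def summarize_collection(name, records):
    """Summarize one entity collection (hypothetical helper).

    Frequency is assumed here to be the fraction of records
    in which a property appears.
    """
    size = len(records)
    # Iterating a dict yields its keys, so each property counts once per record.
    counts = Counter(prop for record in records for prop in record)
    frequencies = {prop: n / size for prop, n in counts.items()} if size else {}
    return {"name": name, "size": size, "properties": frequencies}


def render(summary):
    """Render a one-line, human-readable description of a collection."""
    props = ", ".join(
        f"{prop} ({freq:.0%})"
        for prop, freq in sorted(summary["properties"].items())
    )
    return f'{summary["name"]} ({summary["size"]}): {props}'


records = [{"name": "a", "price": 1}, {"name": "b"}]
line = render(summarize_collection("item", records))
# line == "item (2): name (100%), price (50%)"
```

The number in parentheses after the collection name is its size, mirroring the convention described above; the percentages are the assumed per-property frequencies.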
