Zenith Seminar, Room Galera 127, May 28, 2014, 11am
CLyDE Mid-Flight: What We Have Learnt So Far about the SSD-Based IO Stack
Philippe Bonnet, INRIA and IT University of Copenhagen
Abstract: The quest for energy-proportional systems and the growing performance gap between processors and magnetic disks have led to the adoption of SSDs as the secondary storage of choice for a wide range of systems. Indeed, SSDs offer great performance (tens of flash chips wired in parallel can deliver hundreds of thousands of accesses per second) with low energy consumption. This evolution introduces a mismatch between the simple disk model that underlies the design of today’s database systems and the complex SSDs of today’s computers. This mismatch leads to unpredictable performance, with slowdowns of orders of magnitude in IO latency that can hit an application at any time. To attack this problem, the obvious approach is to construct models that capture SSDs’ performance behaviour. However, our previous work has shown the limits of this approach, because (a) performance characteristics and energy profiles vary significantly across SSDs, and (b) performance varies over time on a single device depending on the history of accesses. The CLyDE project is based on the insight that the strict layering that has been so successful for designing database systems on top of magnetic disks is no longer applicable to SSDs. In other words, our central hypothesis is that the complexity of flash devices cannot be abstracted away, as doing so results in unpredictable and suboptimal performance. We postulate that database system designers need a clear and stable distinction between efficient and inefficient patterns of access to secondary storage, so that they can adapt space allocation strategies, data representation or query processing algorithms.
We propose (i) that SSDs should expose this distinction instead of aggressively mitigating the impact of inefficient patterns at the expense of efficient ones, and (ii) that the operating system and database system should explicitly provide mechanisms to ensure that efficient access patterns are favoured. We thus advocate a co-design of SSD controllers, operating systems and database systems with appropriate cross-layer optimisations. In this talk, I will report on the lessons we have learnt so far in the project. In particular, I will describe the SSD simulation frameworks that we have developed to explore cross-layer designs: EagleTree and LightNVM. I will discuss our findings on the importance of scheduling within an SSD. I will present our contribution to the redesign of the Linux block layer, which makes it possible for Linux to keep up with SSD performance on multi-socket systems. Finally, I will present preliminary results on the co-design of file systems and SSDs. CLyDE is a joint project between the IT University of Copenhagen and INRIA Paris-Rocquencourt, started in 2012 and funded by the Danish Council for Independent Research.
Bio: Philippe Bonnet is an associate professor at the IT University of Copenhagen. Philippe is an experimental computer scientist focused on building and tuning systems for performance and energy efficiency. His research interests include database tuning, flash-based database systems, secure personal data management, and sensor data engineering.
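As a rough illustration of the efficient/inefficient access-pattern distinction discussed above, the following minimal Python sketch (not from the talk; the block size, working-set size and temporary-file layout are arbitrary assumptions) times sequential versus scattered 4 KiB writes on whatever device backs the temporary directory:

```python
# Illustrative sketch: contrast a sequential write pattern with a randomly
# scattered one over the same file. On flash, scattered small writes tend
# to trigger far more device-side work than sequential ones.
import os
import random
import tempfile
import time

BLOCK = 4096      # 4 KiB, a common page-aligned IO unit (assumption)
N_BLOCKS = 1024   # 4 MiB working set, small enough for a quick demo


def timed_writes(path, offsets):
    """Write one block at each block offset and return elapsed seconds."""
    buf = os.urandom(BLOCK)
    start = time.perf_counter()
    with open(path, "r+b") as f:
        for off in offsets:
            f.seek(off * BLOCK)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # force the data down to the device
    return time.perf_counter() - start


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.truncate(N_BLOCKS * BLOCK)  # pre-size the file
    path = tmp.name

sequential = list(range(N_BLOCKS))
scattered = sequential[:]
random.shuffle(scattered)

t_seq = timed_writes(path, sequential)
t_rnd = timed_writes(path, scattered)
print(f"sequential: {t_seq:.4f}s, random: {t_rnd:.4f}s")
os.remove(path)
```

On a single run the numbers are noisy (page cache, device history), which is precisely the point the abstract makes about unpredictable SSD behaviour.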
Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-philippe-bonnet-clyde-mid-flight-what-we-have-learnt-so-far-about-the-ssd-based-io-stack-may-28-11-00-am/
Apr 10
Launch of the Triton joint laboratory (I-Lab) with Beepeers
The Triton joint laboratory (I-Lab), between the company Beepeers (beepeers.com) and our research team, started in March 2014 with the arrival of a PhD engineer. Beepeers is an innovative startup, created in 2011, that offers companies a platform to help them develop their enterprise social networks on mobile devices (smartphones and tablets) and on the Web. Beepeers now wishes to prepare the industrialisation and large-scale deployment of its platform through a joint R&D project with Inria. Through this I-Lab, Beepeers aims to create a dedicated collaboration middleware.
Permanent link to this article: https://team.inria.fr/zenith/demarrage-du-laboratoire-commum-triton-i-lab-avec-beepeers/
Mar 28
PhD position: “A Data-Centric Execution Model for Scientific Workflows”
PhD position
Advisor: Didier Parigot
The Zenith team deals with the management of scientific applications that are computation-intensive and manipulate large amounts of data. These applications are often represented by workflows, which describe sequences of tasks (computations) and data dependencies between these tasks. Several scientific workflow environments have already been proposed [taylor07]. However, they have little support for efficiently managing large data sets. The Zenith team is developing an original approach that deals with such large data sets in a way that allows efficient placement of both tasks and data on large-scale (distributed and parallel) infrastructures for more efficient execution. To this end, we propose an original solution that combines the advantages of cloud computing and P2P technologies. This work is part of the IBC project (Institut de Biologie Computationelle – http://www.ibc-montpellier.fr), in collaboration with biologists, in particular from CIRAD and IRD, and with cloud providers. The concept of cloud computing combines several technology advances such as service-oriented architectures, resource virtualization, and novel data management systems referred to as NoSQL. These technologies enable flexible and extensible usage of resources, which is referred to as elasticity. In addition, the cloud allows users to simply outsource data storage and application execution. For the manipulation of big data, NoSQL database systems such as Google Bigtable, Hadoop HBase, Amazon Dynamo, Apache Cassandra and 10gen MongoDB have recently been proposed. Existing scientific workflow environments [taylor07] have been developed primarily to simplify the design and execution of a set of tasks on a particular infrastructure. For example, in the field of biology, the Galaxy environment allows users to introduce catalogs of functions/tasks and compose these functions with existing ones in order to build a workflow.
These environments propose a design approach that we can classify as “process-oriented”, where information about data dependencies (data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, like clusters and grids. Finally, the data produced by scientific workflows are often stored in loosely structured files for further analysis. Thus, data management is fairly basic, with data either stored on a centralized disk or directly transferred between tasks. This approach is not suitable for data-intensive applications, where data management becomes the major bottleneck in terms of data transfers. As part of a new project that develops a middleware for scientific workflows (SciFloware), the objective of this thesis is to design a declarative data-centric language for expressing scientific workflows, together with its associated execution model. A declarative language is important to provide for automatic optimization and parallelization [ogasawara11]. The execution model for this language will be decentralized, in order to yield flexible execution in distributed and parallel environments, and will capitalize on execution models developed in the context of distributed and parallel database systems [valduriez11]. To validate this work, a prototype will be implemented using the SON middleware [parigot12] and a distributed file system like HDFS. As for application fields, this work will be carried out in close relationship with the Virtual Plants team, which develops computational models of plant development to understand the physical and biological principles that drive the development of plant branching systems and organs. In particular, OpenAlea [pradal08] is a software platform for plant analysis and modelling at different scales.
It provides a scientific workflow environment to integrate different tasks for plant reconstruction, analysis, simulation and visualisation at the tissue level [lucas13] and at the plant level [boudon12]. One challenging application in biology and computer science is to process and analyse data collected on high-throughput phenotyping platforms. The SciFloware middleware, combined with OpenAlea, will improve the capability of the plant science community to analyse, at high throughput, variables that are hardly accessible in the field, such as architecture, the response of organ growth to environmental conditions, or radiation use efficiency. This will improve the ability of this community to model the genetic variability of plant responses to environmental cues associated with climate change.
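To make the “data-centric” design concrete, here is a hypothetical Python sketch of the idea described above: tasks declare only which datasets they consume and produce, and a tiny engine derives the execution order from those data dependencies alone. The API and task names are invented for illustration and are not SciFloware’s actual interface.

```python
# Minimal data-centric workflow engine: tasks are (inputs, outputs, fn)
# triples; a task runs as soon as every dataset it consumes exists.
class Workflow:
    def __init__(self):
        self.tasks = {}  # name -> (inputs, outputs, fn)

    def task(self, name, inputs, outputs, fn):
        self.tasks[name] = (inputs, outputs, fn)

    def run(self, initial):
        data = dict(initial)        # dataset name -> value
        pending = dict(self.tasks)
        while pending:
            # a task is ready once all datasets it consumes are available
            ready = [n for n, (ins, _, _) in pending.items()
                     if all(i in data for i in ins)]
            if not ready:
                raise RuntimeError("unsatisfiable data dependencies")
            for name in ready:
                ins, outs, fn = pending.pop(name)
                results = fn(*(data[i] for i in ins))
                produced = [results] if len(outs) == 1 else results
                for out, val in zip(outs, produced):
                    data[out] = val
        return data


wf = Workflow()
wf.task("clean", ["raw"],           ["clean"],  lambda xs: [x for x in xs if x >= 0])
wf.task("stats", ["clean"],         ["mean"],   lambda xs: sum(xs) / len(xs))
wf.task("scale", ["clean", "mean"], ["scaled"], lambda xs, m: [x / m for x in xs])

result = wf.run({"raw": [3, -1, 6, 9]})
print(result["scaled"])  # → [0.5, 1.0, 1.5]
```

Because the description is declarative (only data dependencies are stated), an engine is free to reorder, parallelize or distribute the tasks, which is the property the thesis aims to exploit.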
References
[ogasawara11] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339, 2011.
[valduriez11] M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Third Edition, Springer, 2011.
[taylor07] I. J. Taylor, E. Deelman, D. B. Gannon, M. Shields. Workflows for e-Science: Scientific Workflows for Grids. First Edition, Springer, 2007.
[parigot12] Ait-Lahcen, D. Parigot. A Lightweight Middleware for Developing P2P Applications with Component and Service-Based Principles. 15th IEEE International Conference on Computational Science and Engineering, 2012.
[pradal08] C. Pradal, S. Dufour-Kowalski, F. Boudon, C. Fournier, C. Godin. OpenAlea: A Visual Programming and Component-Based Software Platform for Plant Modeling. Functional Plant Biology, 2008.
[lucas13] M. Lucas et al. Lateral Root Morphogenesis Is Dependent on the Mechanical Properties of the Overlaying Tissues. Proceedings of the National Academy of Sciences, 110(13): 5229-5234, 2013.
[boudon12] F. Boudon, C. Pradal, T. Cokelaer, P. Prusinkiewicz, C. Godin. L-Py: An L-System Simulation Framework for Modeling Plant Architecture Development Based on a Dynamic Language. Frontiers in Plant Science, 3, 2012.
Contact: Didier Parigot (Firstname.Lastname@inria.fr)
Apply online
Permanent link to this article: https://team.inria.fr/zenith/a-data-centric-execution-model-for-scientific-workflows/
Mar 05
Analysis, extraction, propagation and search of data from enterprise social networks
Thesis title:
Real-time recommendation for sector-specific social networks
Thesis advisor: Didier Parigot
http://www-sop.inria.fr/members/Didier.Parigot/
Collaboration with the company Beepeers (beepeers.com), whose activity is the creation of a collaborative platform of social tools for companies.
Location: Inria Sophia-Antipolis
Funding: University grant
Introduction
In recent years, the topics of managing large volumes of data (Big Data) and open data (Open Data) have grown in importance with the rise of social networks and the Internet. Indeed, by exploiting or analysing the data being manipulated, it is possible to extract new, relevant information that makes it possible to offer new services or tools. In the context of a collaboration between our Zenith project-team and the young startup Beepeers, which markets a platform for the development of sector-specific social networks, we propose this research topic in order to enrich this platform with new advanced services based on the extraction and analysis of the data produced by these enterprise social networks.
Thesis objective
The objective of the thesis is to propose and combine various data analysis techniques (algorithms) in order to provide advanced services for the Beepeers platform. The Beepeers platform already offers a rich set of features and services that produce a mass of information and data, which will form the initial dataset for this research work.
Within this well-targeted application context, the PhD student will have to propose information extraction algorithms based on an original combination of the following techniques:
- analysis of user behaviour;
- extraction of user profiles;
- propagation or diffusion of information across the network, or between the different social networks connected to the Beepeers platform;
- recommendation of people, services or events based on the opinions of the network’s users (a feature already available in the Beepeers platform);
- extraction via continuous (persistent) database queries over the open data sites that are available and relevant to the underlying sector-specific network.
In addition, an original implementation based on a decentralized service-oriented architecture will be required, to allow the proposed solutions to scale and to support dynamic, on-demand deployment of the advanced services.
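The recommendation technique mentioned above can be sketched in a few lines. This is an illustrative user-based collaborative filter (not Beepeers’ actual system); the users, items and ratings below are invented for the example.

```python
# Minimal user-based recommender: score the items a user has not rated,
# weighting each rating by the cosine similarity between users.
from math import sqrt

ratings = {  # user -> {item: score from the network's reviews}
    "ana":   {"event_a": 5, "event_b": 3, "service_x": 4},
    "bruno": {"event_a": 4, "event_b": 2, "service_y": 5},
    "chloe": {"event_b": 5, "service_x": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = (sqrt(sum(s * s for s in u.values()))
           * sqrt(sum(s * s for s in v.values())))
    return num / den


def recommend(user, k=1):
    """Top-k unrated items, scored by similarity-weighted ratings."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, score in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * score
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(recommend("ana"))  # → ['service_y']
```

In the thesis, such a recommender would be one service among several (usage analysis, profile extraction, information propagation), deployed on the decentralized architecture described above.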
Collaboration context
This collaboration is already the subject of a strong INRIA-SME partnership, through the creation and launch this year of a joint laboratory (I-Lab) named Triton, whose R&D programme is the design of an innovative, scalable architecture for the Beepeers platform. This R&D programme will build on our expertise in decentralized service-oriented architectures through the use of our tool SON (Shared Overlay Network). The PhD student will therefore be supported in his or her proposals by the R&D team of the Triton joint laboratory, and will be able to test and validate his or her algorithms on datasets from the new Beepeers platform developed within the Triton I-Lab. In addition, the student will be able to draw on the scientific expertise of the Zenith project-team in scientific data management.
Expected results and candidate profile
The candidate should have a strong taste for the practical validation of his or her research, and good abstraction skills in order to quickly master the various data analysis and extraction techniques involved, which come from several scientific communities (databases, usage analysis, and distributed programming for the implementation). The candidate should be able to work in a team, in close collaboration with Beepeers, to carry out this research. The work should quickly find fields of application through the concrete and effective implementation of new services for the Beepeers platform.
References
Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques and Applications, eds. S. Sakr, E. Pardede, IGI Global, ISBN:9781613500538, August 2011.
Haesung Lee, Joonhee Kwon. “Efficient Recommender System Based on Graph Data for Multimedia Application.”
Desired profile
- Engineering school degree (BAC + 5) or Master 2
- Taste for teamwork
- Good level of written English
How to apply
Please send the following documents, in PDF, by email to Didier.Parigot@inria.fr:
- CV,
- cover letter targeted at the thesis topic,
- at least two letters of recommendation,
- transcripts and the list of courses taken in M1 and M2.
Permanent link to this article: https://team.inria.fr/zenith/analyse-extraction-propagation-et-recherche-des-donnees-issues-des-reseaux-sociaux-dentreprise/
Dec 09
Zenith seminar: “Data Partitioning in Parallel Data Management Systems”, by Miguel Liroz, Dec 9, 2013
“Data Partitioning in Parallel Data Management Systems”
Miguel Liroz
Room G.227
During the last years, the volume of data that is captured and generated has exploded. Advances in computer technology, which provide cheap storage and increased computing capabilities, have allowed organizations to perform complex analyses on this data and to extract valuable knowledge from it. This trend has been very important not only for industry, but has also had a significant impact on science, where enhanced instruments and more complex simulations call for an efficient management of huge quantities of data.
Parallel computing is a fundamental technique in the management of large quantities of data, as it leverages the concurrent utilization of multiple computing resources. To take advantage of parallel computing, we need efficient data partitioning techniques, which are in charge of dividing the data and assigning the partitions to the processing nodes. Data partitioning is a complex problem, as it has to consider different and often contradictory issues, such as data locality, load balancing and maximizing parallelism.
In this thesis, we study the problem of data partitioning, particularly in scientific parallel databases that are continuously growing and in the MapReduce framework. In the case of scientific databases, we consider data partitioning in very large databases to which new data is appended continuously, e.g. in astronomical applications. Existing approaches are limited, since the complexity of the workload and the continuous appends restrict the applicability of traditional approaches. We propose two partitioning algorithms that dynamically partition new data elements using a technique based on data affinity. Our algorithms obtain very good data partitions with a low execution time compared to traditional approaches. We also study how to improve the performance of the MapReduce framework using data partitioning techniques.
In particular, we are interested in efficient data partitioning of the input datasets to reduce the amount of data that has to be transferred in the shuffle phase. We design and implement a strategy that, by capturing the relationships between input tuples and intermediate keys, obtains an efficient partitioning that can be used to significantly reduce MapReduce’s communication overhead.
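A hedged sketch of the idea (not the thesis’s actual algorithms): if we know, or can estimate, which intermediate keys each input tuple will emit during the map phase, a greedy affinity-based assignment can co-locate tuples that share keys, so less data crosses nodes during the shuffle. The tuples, keys and node count below are invented for the illustration.

```python
# Greedy affinity-based partitioning: assign each input tuple to the node
# whose accumulated key set overlaps the tuple's emitted keys the most.
N_NODES = 2

# input tuple id -> set of intermediate keys it will emit in the map phase
emits = {
    "t1": {"k1", "k2"},
    "t2": {"k1"},
    "t3": {"k3"},
    "t4": {"k2", "k1"},
    "t5": {"k3"},
}


def partition(emits, n_nodes):
    """Return a tuple -> node assignment maximizing key co-location."""
    node_keys = [set() for _ in range(n_nodes)]  # keys seen per node
    node_load = [0] * n_nodes                    # tuples per node
    assign = {}
    # place tuples with many keys first, so affinities form around them
    for t, keys in sorted(emits.items(), key=lambda kv: -len(kv[1])):
        # affinity = number of shared keys; ties go to the least-loaded node
        best = max(range(n_nodes),
                   key=lambda n: (len(node_keys[n] & keys), -node_load[n]))
        assign[t] = best
        node_keys[best] |= keys
        node_load[best] += 1
    return assign


print(partition(emits, N_NODES))
```

Here the tuples emitting k1/k2 end up on one node and those emitting k3 on the other, so every intermediate key is produced on a single node and the shuffle moves no data, which is the effect the strategy above targets.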
Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-miguel-lirozdata-partitioning-in-parallel-data-management-systems-dec-9-11-am/