Zenith seminar: “CLyDE Mid-Flight: What we have learnt so far about the SSD-Based IO Stack”, by Philippe Bonnet (IT University of Copenhagen), May 28, 2014

Zenith Seminar, room Galera 127, May 28, 2014, 11am.

CLyDE Mid-Flight: What we have learnt so far about the SSD-Based IO Stack

Philippe Bonnet, INRIA and IT University of Copenhagen

Abstract: The quest for energy-proportional systems and the growing performance gap between processors and magnetic disks have led to the adoption of SSDs as the secondary storage of choice for a large range of systems. Indeed, SSDs offer great performance (tens of flash chips wired in parallel can deliver hundreds of thousands of accesses per second) with low energy consumption. This evolution introduces a mismatch between the simple disk model that underlies the design of today’s database systems and the complex SSDs of today’s computers. This mismatch leads to unpredictable performance, with orders-of-magnitude slowdowns in IO latency that can hit an application at any time. To attack this problem, the obvious approach is to construct models that capture SSDs’ performance behaviour. However, our previous work has shown the limits of this approach, because (a) performance characteristics and energy profiles vary significantly across SSDs, and (b) performance varies over time on a single device based on the history of accesses. The CLyDE project is based on the insight that the strict layering that has been so successful for designing database systems on top of magnetic disks is no longer applicable to SSDs. In other words, our central hypothesis is that the complexity of flash devices cannot be abstracted away, as doing so results in unpredictable and suboptimal performance. We postulate that database system designers need a clear and stable distinction between efficient and inefficient patterns of access to secondary storage, so that they can adapt space allocation strategies, data representation or query processing algorithms. We propose (i) that SSDs should expose this distinction instead of aggressively mitigating the impact of inefficient patterns at the expense of the efficient ones, and (ii) that the operating system and database system should explicitly provide mechanisms to ensure that efficient access patterns are favoured. We thus advocate a co-design of SSD controllers, operating system and database system, with appropriate cross-layer optimisations. In this talk, I will report on the lessons we have learnt so far in the project. In particular, I will describe the SSD simulation frameworks that we have developed to explore cross-layer designs: EagleTree and LightNVM. I will discuss our findings on the importance of scheduling within an SSD. I will present our contribution to the re-design of the Linux block layer, which makes it possible for Linux to keep up with SSD performance on multi-socket systems. Finally, I will present preliminary results on the co-design of file systems and SSDs.

CLyDE is a joint project between the IT University of Copenhagen and INRIA Paris-Rocquencourt, started in 2012 and funded by the Danish Council for Independent Research.

Bio: Philippe Bonnet is an associate professor at the IT University of Copenhagen. Philippe is an experimental computer scientist focused on building and tuning systems for performance and energy efficiency. His research interests include database tuning, flash-based database systems, secure personal data management, and sensor data engineering.
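To make the notion of "efficient versus inefficient access patterns" concrete, here is a minimal, purely illustrative micro-benchmark (not part of the talk or of the CLyDE tools): it contrasts large sequential writes with small random writes on a file placed on the SSD under test. The file path, file size and block size are arbitrary choices for illustration.

```python
# Illustrative micro-benchmark: sequential vs. random 4 KiB writes on an SSD.
import os
import random
import time

PATH = "/tmp/ssd_pattern_test.bin"   # place this file on the SSD you want to probe
FILE_SIZE = 256 * 1024 * 1024        # 256 MiB test file (arbitrary)
BLOCK = 4096                         # 4 KiB blocks

def timed_writes(offsets, payload):
    """Buffered writes of `payload` at each offset, then fsync; return elapsed seconds."""
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT)
    start = time.perf_counter()
    for off in offsets:
        os.pwrite(fd, payload, off)
    os.fsync(fd)
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed

payload = os.urandom(BLOCK)
n_blocks = FILE_SIZE // BLOCK
sequential = [i * BLOCK for i in range(n_blocks)]
rand = sequential[:]
random.shuffle(rand)                 # same blocks, shuffled order

print("sequential writes: %.2fs" % timed_writes(sequential, payload))
print("random writes:     %.2fs" % timed_writes(rand, payload))
os.remove(PATH)
```

On most devices the two patterns diverge noticeably, and the gap itself drifts over time as the drive's internal state changes, which is exactly the unpredictability the abstract refers to.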

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-philippe-bonnet-clyde-mid-flight-what-we-have-learnt-so-far-about-the-ssd-based-io-stack-may-28-11-00-am/

Launch of the Triton joint laboratory (I-Lab) with Beepeers

The Triton joint laboratory (I-Lab), with the company Beepeers (beepeers.com) and our research team, started in March 2014, with the arrival of a PhD engineer. Beepeers is an innovative young company, created in 2011, that offers companies a platform to help them develop their enterprise social networks on mobile (smartphone and tablet) and on the Web. Beepeers now wants to prepare the industrialization and large-scale deployment of its platform through a joint R&D project with Inria. Through this I-Lab, Beepeers aims to build a dedicated collaboration middleware.

More information

Permanent link to this article: https://team.inria.fr/zenith/demarrage-du-laboratoire-commum-triton-i-lab-avec-beepeers/

PhD position: “A Data-Centric Execution Model for Scientific Workflows”

PhD position

Advisor: Didier Parigot

The Zenith team deals with the management of scientific applications that are computation-intensive and manipulate large amounts of data. These applications are often represented by workflows, which describe sequences of tasks (computations) and data dependencies between these tasks. Several scientific workflow environments have already been proposed [taylor07]. However, they provide little support for efficiently managing large data sets. The Zenith team develops an original approach that deals with such large data sets in a way that allows efficient placement of both tasks and data on large-scale (distributed and parallel) infrastructures, for more efficient execution. To this end, we propose an original solution that combines the advantages of cloud computing and P2P technologies. This work is part of the IBC project (Institut de Biologie Computationnelle – http://www.ibc-montpellier.fr), in collaboration with biologists, in particular from CIRAD and IRD, and with cloud providers.

The concept of cloud computing combines several technology advances such as Service-Oriented Architectures, resource virtualization, and novel data management systems referred to as NoSQL. These technologies enable flexible and extensible usage of resources, which is referred to as elasticity. In addition, the cloud allows users to simply outsource data storage and application executions. For the manipulation of big data, NoSQL database systems, such as Google Bigtable, Hadoop HBase, Amazon Dynamo, Apache Cassandra and 10gen MongoDB, have recently been proposed.

Existing scientific workflow environments [taylor07] have been developed primarily to simplify the design and execution of a set of tasks on a particular infrastructure. For example, in the field of biology, the Galaxy environment allows users to introduce catalogs of functions/tasks and compose these functions with existing ones in order to build a workflow. These environments propose a design approach that we can classify as “process-oriented”, where information about data dependencies (data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, like clusters and grids. Finally, the data produced by scientific workflows are often stored in loosely structured files for further analysis. Thus, data management is fairly basic, with data either stored on a centralized disk or directly transferred between tasks. This approach is not suitable for data-intensive applications, because data management becomes the major bottleneck in terms of data transfers.

As part of a new project that develops a middleware for scientific workflows (SciFloware), the objective of this thesis is to design a declarative, data-centric language for expressing scientific workflows and its associated execution model. A declarative language is important to enable automatic optimization and parallelization [ogasawara11]. The execution model for this language will be decentralized, in order to yield flexible execution in distributed and parallel environments. This execution model will capitalize on execution models developed in the context of distributed and parallel database systems [valduriez11]. To validate this work, a prototype will be implemented using the SON middleware [parigot12] and a distributed file system like HDFS.
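To illustrate what "data-centric" and "declarative" mean here, the following hypothetical sketch (the operator names and classes are illustrative, not the SciFloware API) expresses a tiny workflow as an algebra of operators over datasets. Because the plan is data, an engine can optimize, reorder or parallelize it, unlike a process-oriented script where the data flow is implicit.

```python
# Toy data-centric workflow algebra: activities are operators over datasets.
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Dataset:
    items: List[object]

class Op:
    """Base class for algebraic workflow operators."""
    def run(self, ds: Dataset) -> Dataset:
        raise NotImplementedError

@dataclass
class Map(Op):
    fn: Callable[[object], object]
    def run(self, ds: Dataset) -> Dataset:
        # Items are independent: a parallel engine may split ds across nodes.
        return Dataset([self.fn(x) for x in ds.items])

@dataclass
class Filter(Op):
    pred: Callable[[object], bool]
    def run(self, ds: Dataset) -> Dataset:
        return Dataset([x for x in ds.items if self.pred(x)])

@dataclass
class Reduce(Op):
    fn: Callable[[object, object], object]
    def run(self, ds: Dataset) -> Dataset:
        acc = ds.items[0]
        for x in ds.items[1:]:
            acc = self.fn(acc, x)
        return Dataset([acc])

def execute(plan: Iterable[Op], ds: Dataset) -> Dataset:
    # A real engine would optimize the plan (e.g. push Filter early) and
    # distribute its fragments; here we simply evaluate it sequentially.
    for op in plan:
        ds = op.run(ds)
    return ds

# Toy workflow: keep long sequences, compute their lengths, sum them.
plan = [Filter(lambda s: len(s) > 3), Map(len), Reduce(lambda a, b: a + b)]
print(execute(plan, Dataset(["ACGT", "AC", "ACGTACGT"])).items)  # -> [12]
```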
As for application fields, this work will be carried out in close relationship with the Virtual Plants team, which develops computational models of plant development to understand the physical and biological principles that drive the development of plant branching systems and organs. In particular, OpenAlea [pradal08] is a software platform for plant analysis and modelling at different scales. It provides a scientific workflow environment to integrate different tasks for plant reconstruction, analysis, simulation and visualisation at the tissue level [lucas13] and at the plant level [boudon12]. One challenging application in biology and computer science is to process and analyse, at high throughput, data collected on phenotyping platforms. The SciFloware middleware, combined with OpenAlea, will improve the capability of the plant science community to analyse, at high throughput, variables that are hardly accessible in the field, such as architecture, the response of organ growth to environmental conditions, or radiation use efficiency. This will improve the ability of this community to model the genetic variability of plant responses to environmental cues associated with climate change.

References

[ogasawara11] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339, 2011.
[valduriez11] M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Third Edition, Springer, 2011.
[taylor07] I. J. Taylor, E. Deelman, D. B. Gannon, M. Shields. Workflows for e-Science: Scientific Workflows for Grids. First Edition, Springer, 2007.
[parigot12] Ait-Lahcen, D. Parigot. A Lightweight Middleware for Developing P2P Applications with Component and Service-Based Principles. 15th IEEE International Conference on Computational Science and Engineering, 2012.
[pradal08] C. Pradal, S. Dufour-Kowalski, F. Boudon, C. Fournier, C. Godin. OpenAlea: A Visual Programming and Component-Based Software Platform for Plant Modeling. Functional Plant Biology, 2008.
[lucas13] M. Lucas et al. Lateral Root Morphogenesis Is Dependent on the Mechanical Properties of the Overlaying Tissues. Proceedings of the National Academy of Sciences, 110(13): 5229-5234, 2013.
[boudon12] F. Boudon, C. Pradal, T. Cokelaer, P. Prusinkiewicz, C. Godin. L-Py: an L-system Simulation Framework for Modeling Plant Architecture Development Based on a Dynamic Language. Frontiers in Plant Science, 3, 2012.

Contact: Didier Parigot (Firstname.Lastname@inria.fr)

Apply online

Permanent link to this article: https://team.inria.fr/zenith/a-data-centric-execution-model-for-scientific-workflows/

Analysis, extraction, propagation and search of data from enterprise social networks

Thesis title:

Real-time recommendation for sector-specific social networks

Thesis advisor: Didier Parigot

http://www-sop.inria.fr/members/Didier.Parigot/

Collaboration with the company Beepeers (beepeers.com), whose activity is the creation of a collaborative platform of social tools for companies.

Location: Inria Sophia-Antipolis

Funding: University grant

Introduction

In recent years, the topics of managing large data volumes (Big Data) and open data (Open Data) have become increasingly important with the rise of social networks and the Internet. Indeed, by exploiting or analysing the data being manipulated, it is possible to extract new, relevant information that makes it possible to offer new services or tools. In the context of a collaboration between our project-team Zenith and a very young startup, Beepeers, which markets a platform for the development of sector-specific social networks, we propose this research subject in order to enrich this platform with new, advanced services based on the extraction or analysis of the data produced by these enterprise social networks.

Thesis objective

The objective of the thesis will be to propose and combine various data analysis techniques (algorithms) in order to provide advanced services for the Beepeers platform. The Beepeers platform already offers a rich set of functionalities and services that produce a mass of information and data, which will form the initial data set for this research work.

Within this well-targeted application context, the PhD student will have to propose information extraction algorithms based on an original combination of the following techniques:

  • analysis of user behaviour and usage;
  • extraction of user profiles;
  • propagation or diffusion of information across the network, or between different social networks connected to the Beepeers platform;
  • recommendation of people, services or events based on the opinions of the network’s users (a functionality already available in the Beepeers platform);
  • extraction via continuous (persistent) database queries over open data sites that are available and relevant for the underlying sector-specific network.

In addition, an original implementation based on a decentralized, service-oriented architecture will be required, to allow the proposed solutions to scale and to enable on-demand dynamic deployment of the advanced services.

Context of the collaboration

This collaboration is already the object of a strong INRIA-SME partnership through the setting up and launch this year of a joint laboratory (I-Lab), named Triton, whose R&D programme is the design of an innovative, scalable architecture for the Beepeers platform. This R&D programme will build on our expertise in decentralized, service-oriented architectures through the use of our tool SON (Shared Overlay Network). The PhD student will therefore be supported in his or her proposals by the R&D team of the Triton joint laboratory, and will be able to test and validate his or her algorithms on the data sets produced by the new Beepeers platform developed within the Triton I-Lab. In addition, the PhD student will be able to rely on the scientific expertise of the Zenith project-team in scientific data management.

Expected results and candidate profile

The candidate should have a strong taste for the practical validation of his or her research work, as well as good abstraction skills, in order to quickly master and grasp the various data analysis and extraction techniques coming from different scientific communities (databases, usage analysis, and distributed programming for the implementation). The candidate should be able to work in a team, in close collaboration with the company Beepeers, to carry out this research. This work should quickly find fields of application through the concrete and effective implementation of new services for the Beepeers platform.

References

Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques and Applications, eds. S. Sakr, E. Pardede, IGI Global, ISBN:9781613500538, August 2011.

Haesung Lee, Joonhee Kwon, “Efficient Recommender System based on Graph Data for Multimedia Application”, International Journal of Multimedia and Ubiquitous Engineering, Vol. 8, No. 4, July 2013.

Desired profile

  • Engineering school degree (BAC+5) or Master 2
  • Taste for teamwork
  • Good level of written English

How to apply

Please send the following documents by email, as PDF files, to Didier.Parigot@inria.fr:

  • CV,
  • cover letter targeted to the subject,
  • at least two recommendation letters,
  • transcripts and the list of courses taken in M2 and M1.

Permanent link to this article: https://team.inria.fr/zenith/analyse-extraction-propagation-et-recherche-des-donnees-issues-des-reseaux-sociaux-dentreprise/

Collaborative analysis of new personal data sources while preserving their confidentiality

Zenith took part in the Inria/Industry meeting on Tuesday, February 11, 2014 at ENS de Lyon. Tristan Allard presented our work on the collaborative discovery of profiles in personal data, while guaranteeing the confidentiality of the data.

The “Quantified Self” is a movement that has gained popularity in recent years. Today, it is indeed possible to collect personal data in many domains, such as daily activities, health or sports performance. This can be done through physiological sensors that communicate with the personal device of the individual wearing them, for instance a simple smartphone or “smart glasses”, or through sensors directly embedded in the device, such as accelerometers. Properly exploited, these data can bring valuable knowledge about the domains they concern. To better treat a disease, it may be important to better understand an individual’s profile in order to propose a personalized treatment. For an athlete, it would be interesting to know which category he or she belongs to, in order to adapt training sessions and design a specific programme. However, to preserve their privacy, individuals may be reluctant to share their data. This demonstration shows a prototype of such a system for computing typical profiles, in which the participants collaborate through a fully decentralized algorithm, without ever communicating their data in the clear.

http://www.inria.fr/centre/grenoble/innovation/rii-bio-informatique/demos/demo-zenith
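As a purely illustrative sketch of how participants can contribute to an aggregate without revealing their values in the clear (this is not the protocol shown in the demo, just a classic additive secret-sharing scheme), the following toy example computes a mean, the kind of statistic a profile/centroid computation needs, from values that no single peer ever sees:

```python
# Additive secret sharing: peers learn the sum (hence the mean), not the inputs.
import random

PRIME = 2**61 - 1  # arithmetic is done modulo a large prime

def share(value, n_peers):
    """Split `value` into n_peers random shares that sum to it modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_peers - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_mean(private_values):
    n = len(private_values)
    # Each participant sends one share to every peer (including itself)...
    inboxes = [[] for _ in range(n)]
    for v in private_values:
        for peer, s in enumerate(share(v, n)):
            inboxes[peer].append(s)
    # ...each peer publishes only the sum of the shares it received...
    partial_sums = [sum(inbox) % PRIME for inbox in inboxes]
    # ...and the aggregate is recovered from the published partial sums.
    return (sum(partial_sums) % PRIME) / n

# Example: weekly step counts that no participant wants to disclose individually.
print(secure_mean([52310, 70480, 61125, 48990]))  # ~58226.25
```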

Permanent link to this article: https://team.inria.fr/zenith/analyse-collaborative-des-nouveaux-gisements-personnels-de-donnees-respectant-leur-confidentialite/

Zenith seminar: “Improving the Efficiency of Multi-site Web Search Engines”, Xiao Bai, Jan 31, 2014

Zenith Seminar
January 30, 10:30 am, room 227, Galera
Improving the Efficiency of Multi-site Web Search Engines
Xiao Bai – Yahoo Labs Barcelona
Abstract: 
A multi-site web search engine is composed of a number of search sites geographically distributed around the world. Each search site is typically responsible for crawling and indexing the web pages that are in its geographical neighborhood. A query is selectively processed on a subset of the search sites that are predicted to return the best-matching results. The scalability and efficiency of multi-site web search engines have attracted a lot of research attention in recent years. In particular, research has focused on replicating important web pages across sites, forwarding queries to relevant sites, and caching results of previous queries. Yet these problems have only been studied in isolation; no prior work has properly investigated the interplay between them.
In this talk, I will present what we believe is the first comprehensive analysis of a full stack of techniques for efficient multi-site web search. Specifically, we propose a document replication technique that improves the query locality of the state-of-the-art approaches under various replication budget distribution strategies. We devise a machine learning approach to decide the query forwarding patterns, achieving a significantly lower false positive ratio than a state-of-the-art thresholding approach, with little negative impact on search result quality. We propose three result caching strategies that reduce the number of forwarded queries, and we analyze the trade-offs they introduce in terms of storage and network overheads. Finally, we show that the combination of the best-of-class techniques yields very promising search efficiency, rendering multi-site, geographically distributed web search engines an attractive alternative to centralized web search engines.
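To picture how forwarding and caching interact at a single site, here is a small illustrative sketch (not the techniques evaluated in the talk): the relevance "predictor" and the LRU cache are simple stand-ins for the learned forwarding model and the caching strategies mentioned in the abstract.

```python
# One search site: answer locally, forward selectively, cache merged results.
from collections import OrderedDict

class SearchSite:
    def __init__(self, name, local_index, cache_size=1000):
        self.name = name
        self.local_index = local_index            # query -> local results
        self.cache = OrderedDict()                # query -> merged results (LRU)
        self.cache_size = cache_size

    def estimate_relevance(self, query):
        # Stand-in for a learned relevance model.
        return 1.0 if query in self.local_index else 0.0

    def predict_remote_sites(self, query, sites, threshold=0.5):
        # Forward only to sites whose estimated relevance exceeds a threshold.
        return [s for s in sites
                if s is not self and s.estimate_relevance(query) > threshold]

    def search(self, query, sites):
        if query in self.cache:                   # cache hit: no forwarding at all
            self.cache.move_to_end(query)
            return self.cache[query]
        results = list(self.local_index.get(query, []))
        for site in self.predict_remote_sites(query, sites):
            results += site.local_index.get(query, [])   # forwarded sub-query
        self.cache[query] = results
        if len(self.cache) > self.cache_size:     # evict least recently used entry
            self.cache.popitem(last=False)
        return results

eu = SearchSite("eu", {"eiffel tower": ["eu:doc1"]})
us = SearchSite("us", {"eiffel tower": ["us:doc7"], "route 66": ["us:doc2"]})
sites = [eu, us]
print(eu.search("eiffel tower", sites))  # forwarded once...
print(eu.search("eiffel tower", sites))  # ...then served from the local cache
```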
Short Bio: Xiao Bai is a research scientist at Yahoo Labs Barcelona. Before joining Yahoo, she received her Ph.D. from INRIA Rennes (France) in 2010. She obtained her Bachelor’s degree and Master’s degree from Xi’an Jiaotong University (China) in 2004 and 2007, respectively. Between 2002 and 2004, she studied at Ecole Centrale de Lyon (France) within a Franco-Chinese exchange programme and obtained her Engineering degree (Diplôme d’Ingénieur). Her research interests include distributed data management, web search and social networks. She has worked on different problems, such as personalized query processing in P2P systems, web search (including web crawling, distributed architectures and efficiency optimization), content recommendation, and caching mechanisms for social applications.

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-xiao-bai-improving-the-efficiency-of-multi-site-web-search-engines-jan-31-10-30-am/

Seminar of the “Données Connaissances” pole: “Quality and Price of Data”, by Ruiming Tang (NUS), Jan 21, 2014

Tuesday, January 21, 2014, 2pm
Galera building, room 127

Quality and Price of Data
Ruiming Tang – National University of Singapore

In data marketplaces, people clean data, buy and sell data, and collect data. In this talk, we study the quality and the price of data. More specifically, we study three topics. The first topic is how to improve data quality by conditioning. The second topic is how to sell data according to a proposed price. The third topic is how people buy data, i.e., how to define the price of a query and propose algorithms to compute it.

In order to improve data quality (accuracy) by adding constraints or information, we study the conditioning problem. We propose a framework for representing conditioned probabilistic relational data. Conditioning is the formalization of the process of adding knowledge to a database. Some worlds may be impossible given the constraints, and the probabilities of the possible worlds are accordingly re-defined. The new constraints can come from the observation of the existence or non-existence of a tuple, from the knowledge of a specific rule, such as the existence of an exclusive set of tuples, or from the knowledge of a general rule, such as a functional dependency. We are therefore interested in computing a concise representation of the possible worlds and their respective probabilities after the addition of new constraints, namely an equivalent probabilistic database instance without constraints after conditioning. We devise and present a general algorithm for this computation. Unfortunately, the general problem involves the simplification of general Boolean expressions and is NP-hard. We therefore identify specific practical families of constraints for which we devise and present efficient algorithms.
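The following minimal illustration (not the algorithm from the talk, just the textbook semantics it builds on) conditions a tiny tuple-independent probabilistic relation: it enumerates the possible worlds, discards those violating a new constraint, and renormalizes the probabilities of the surviving worlds. The exponential enumeration is precisely why concise representations and efficient special cases matter.

```python
# Conditioning by brute-force possible-world enumeration (illustration only).
from itertools import combinations

# Tuples with their marginal probabilities (tuple-independent model).
tuples = {"t1": 0.6, "t2": 0.5, "t3": 0.3}

def world_probability(present):
    p = 1.0
    for t, prob in tuples.items():
        p *= prob if t in present else (1 - prob)
    return p

# New knowledge, e.g. "t1 and t2 form an exclusive set of tuples".
def satisfies(present):
    return not ("t1" in present and "t2" in present)

worlds = []
for k in range(len(tuples) + 1):
    for present in combinations(tuples, k):
        worlds.append((frozenset(present), world_probability(present)))

surviving = [(w, p) for w, p in worlds if satisfies(w)]
total = sum(p for _, p in surviving)            # probability mass that remains
conditioned = [(w, p / total) for w, p in surviving]

for w, p in conditioned:
    print(sorted(w), round(p, 4))
```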

We then study the relationship between the quality and the price of data. We propose a theoretical and practical pricing framework for a data market in which data consumers can trade data quality for discounted prices. In most data markets, prices are prescribed and accuracy is determined by the data. Instead, we consider a model in which accuracy can be traded for discounted prices: “what you pay for is what you get”. The data market model consists of data consumers, data providers and data market owners. The data market owners are brokers between the data providers and the data consumers. A data consumer proposes a price for the data that she requests. If the price is less than the price set by the data provider, then she gets an approximate value. The data market owners negotiate the pricing schemes with the data providers, and implement these schemes for the computation of the discounted approximate values. We propose a theoretical and practical pricing framework, with its algorithms, for the above mechanism. In this framework, the published value is randomly determined from a probability distribution. The distribution is computed such that its distance to the actual value is commensurate with the discount. The published value comes with a guarantee on the probability of being the exact value, a probability that is also commensurate with the discount. We present and formalize the principles that a healthy data market should meet for such a transaction. We define two ancillary functions and describe the algorithms that compute the approximate value from the proposed price using these functions. We prove that the functions and the algorithm meet the required principles.
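The "what you pay for is what you get" idea can be pictured with a small hypothetical sketch (this is not the framework's actual ancillary functions or guarantees): the published value is sampled from a distribution whose spread shrinks as the proposed price approaches the full price, so a larger discount buys a coarser approximation.

```python
# Toy discounted-value mechanism: spread of the noise is commensurate with the discount.
import random

def discounted_value(true_value, proposed_price, full_price, max_noise=100.0):
    """Return an approximate value whose accuracy matches the fraction of the price paid."""
    ratio = max(0.0, min(proposed_price / full_price, 1.0))  # fraction of full price paid
    if ratio == 1.0:
        return true_value                        # full price: exact value
    spread = max_noise * (1.0 - ratio)           # larger discount -> larger spread
    return random.gauss(true_value, spread)      # one possible choice of distribution

random.seed(42)
for price in (10.0, 50.0, 90.0, 100.0):
    print(price, round(discounted_value(1000.0, price, 100.0), 2))
```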

Finally, we study the price of queries, for the case where data consumers request data in the form of queries. We propose a generic data pricing model that is based on minimal provenance, i.e., minimal sets of tuples contributing to the result of a query. We show that the proposed model fulfils desirable properties such as contribution monotonicity, bounded price and contribution arbitrage-freedom. We present a baseline algorithm to compute the exact price of a query based on our pricing model, and we show that the problem is NP-hard. We therefore devise, present and compare several heuristics. We conduct a comprehensive experimental study to show their effectiveness and efficiency.

Permanent link to this article: https://team.inria.fr/zenith/seminaire-du-pole-donnees-connaissances-ruiming-tang-quality-and-price-of-data-jan-21-2pm/

IBC seminar: “Enabling Exploratory Analysis on Very Large Scientific Data” by Themis Palpanas (Univ. Paris 5), Dec 12, 2014

Enabling Exploratory Analysis on Very Large Scientific Data

There is an increasingly pressing need, by several applications in diverse domains, for developing techniques able to index and mine very large collections of data series. Examples of such applications come from biology, astronomy, the web, and other domains. It is not unusual for these applications to involve numbers of data series in the order of hundreds of millions to billions.

In this talk, we describe iSAX 2.0 and its improvements, iSAX 2.0 Clustered and iSAX2+, three methods designed for indexing and mining truly massive collections of data series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of this kind specifically tailored to a data series index. Furthermore, we observe that in several cases scientists, and data analysts in general, need to issue a set of queries as soon as possible, as a first exploratory step of the datasets. We discuss extensions of our previous techniques that adaptively create data series indexes, and at the same time are able to correctly answer user queries.
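For readers unfamiliar with the underlying representation, here is an illustrative sketch of the SAX symbolization that iSAX-style indexes build on (this is not the iSAX 2.0 code): a data series is z-normalized, reduced with Piecewise Aggregate Approximation (PAA), and each segment is mapped to a symbol via breakpoints of the standard normal distribution.

```python
# SAX symbolization of a data series (illustration of the representation only).
import statistics

# Breakpoints splitting N(0,1) into 4 equiprobable regions -> alphabet {0,1,2,3}.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    seg_len = len(series) / n_segments
    return [statistics.fmean(series[int(i * seg_len):int((i + 1) * seg_len)])
            for i in range(n_segments)]

def sax_word(series, n_segments=4):
    mu, sigma = statistics.fmean(series), statistics.pstdev(series)
    normalized = [(x - mu) / sigma for x in series]
    word = []
    for value in paa(normalized, n_segments):
        symbol = sum(value > b for b in BREAKPOINTS)   # index of the region
        word.append(symbol)
    return tuple(word)

# Two series with similar shapes map to the same (or nearby) SAX words,
# which is what lets an index group and prune large collections of data series.
print(sax_word([1, 2, 3, 4, 8, 9, 10, 11]))
print(sax_word([0, 1, 2, 3, 9, 10, 11, 12]))
```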

We show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments to index one billion data series, and experiments in mining massive data from domains as diverse as genome sequences, entomology, and web-scale image collections.

Themis Palpanas is a professor of computer science at Paris Descartes University, France. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. He has previously held positions at the IBM T.J. Watson Research Center and the University of Trento. He has also been a Visiting Professor at the National University of Singapore, worked for the University of California, Riverside, and visited Microsoft Research and the IBM Almaden Research Center. His research solutions have been implemented in world-leading commercial data management products and he is the author of eight US patents. He is the recipient of three Best Paper awards (including ICDE and PERCOM), and of the IBM Shared University Research (SUR) Award in 2012, which represents a recognition of research excellence at the worldwide level. He has been a member of the IBM Academy of Technology Study on Event Processing, and is a founding member of the Event Processing Technical Society. He served as General Chair for VLDB 2013.

Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-themis-palpanas-enabling-exploratory-analysis-on-very-large-scientific-data-dec-12-10am/

Zenith seminar: “Data Partitioning in Parallel Data Management Systems”, by Miguel Liroz, Dec 9, 2014

“Data Partitioning in Parallel Data Management Systems”

Miguel Liroz

Room G.227

During the last few years, the volume of data that is captured and generated has exploded. Advances in computer technologies, which provide cheap storage and increased computing capabilities, have allowed organizations to perform complex analyses on this data and to extract valuable knowledge from it. This trend has been very important not only for industry, but has also had a significant impact on science, where enhanced instruments and more complex simulations call for an efficient management of huge quantities of data.

Parallel computing is a fundamental technique in the management of large quantities of data, as it leverages the concurrent utilization of multiple computing resources. To take advantage of parallel computing, we need efficient data partitioning techniques, which are in charge of dividing the data and assigning the partitions to the processing nodes. Data partitioning is a complex problem, as it has to consider different and often contradictory issues, such as data locality, load balancing and maximizing parallelism.

In this thesis, we study the problem of data partitioning, in particular in scientific parallel databases that grow continuously and in the MapReduce framework. In the case of scientific databases, we consider data partitioning in very large databases to which new data is appended continuously, e.g. astronomical applications. Existing approaches are limited, since the complexity of the workload and the continuous appends restrict the applicability of traditional approaches. We propose two partitioning algorithms that dynamically partition new data elements using a technique based on data affinity. Our algorithms enable us to obtain very good data partitions in a low execution time compared to traditional approaches.

We also study how to improve the performance of the MapReduce framework using data partitioning techniques. In particular, we are interested in efficient data partitioning of the input datasets, in order to reduce the amount of data that has to be transferred in the shuffle phase. We design and implement a strategy which, by capturing the relationships between input tuples and intermediate keys, obtains an efficient partitioning that can be used to significantly reduce MapReduce’s communication overhead.
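The intuition behind the MapReduce part can be illustrated with a toy sketch (this is not the thesis's algorithms): if input tuples that produce the same intermediate key are placed on the same node, their map output can be reduced locally, so fewer tuples have to cross the network during the shuffle phase.

```python
# Key-aware input placement vs. naive round-robin placement (illustration only).
from collections import defaultdict

N_NODES = 3

def intermediate_key(t):
    # Stand-in for the map function's key extraction, e.g. a join attribute.
    return t["region"]

def partition_inputs(tuples):
    """Key-aware placement: all tuples sharing an intermediate key go to one node."""
    placement = defaultdict(list)
    for t in tuples:
        node = hash(intermediate_key(t)) % N_NODES
        placement[node].append(t)
    return placement

def shuffle_volume(placement):
    """Count tuples whose intermediate key is reduced on a different node."""
    moved = 0
    for node, tuples in placement.items():
        for t in tuples:
            if hash(intermediate_key(t)) % N_NODES != node:
                moved += 1
    return moved

inputs = [{"region": r, "value": i} for i, r in
          enumerate(["eu", "eu", "us", "asia", "us", "eu", "asia", "us"])]

# Naive round-robin placement spreads each key over several nodes.
round_robin = defaultdict(list)
for i, t in enumerate(inputs):
    round_robin[i % N_NODES].append(t)

print("shuffled tuples, round-robin:", shuffle_volume(round_robin))
print("shuffled tuples, key-aware  :", shuffle_volume(partition_inputs(inputs)))  # always 0
```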

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-miguel-lirozdata-partitioning-in-parallel-data-management-systems-dec-9-11-am/

Seminar of the “Données Connaissances” pole: “New Perspectives in Social Data Management”, Sihem Amer-Yahia (LIG), November 29, 2013

Seminar of the “Données et Connaissances” pole

Date: November 29, 11am, seminar room 127 (Galera).

New Perspectives in Social Data Management 

by Sihem Amer-Yahia

Abstract: The web has evolved from a technology platform to a social milieu where factual, opinion and behavior data interleave. A number of social applications are being built to analyze and extract value from this data, encouraging us to adopt a data-driven approach to research. I will describe a perspective on why and how social data management is fundamentally different from data management as it is taught in school today. More specifically, I will talk about data preparation, data exploration and application validation. This talk is based on published and ongoing work with colleagues at LIG, UT Austin, U. of Trento, U. of Tacoma, and Google Research.

 
Sihem Amer-Yahia is DR1 CNRS at LIG in Grenoble. 

Permanent link to this article: https://team.inria.fr/zenith/seminaire-du-pole-donnees-connaissances-29-novembre-a-11h/