Zenith Seminar, Room Galera 127, May 28, 2014, 11am
CLyDE Mid-Flight: What We Have Learnt So Far about the SSD-Based IO Stack
Philippe Bonnet, INRIA and IT University of Copenhagen
Abstract: The quest for energy-proportional systems and the growing performance gap between processors and magnetic disks have led to the adoption of SSDs as the secondary storage of choice for a wide range of systems. Indeed, SSDs offer great performance (tens of flash chips wired in parallel can deliver hundreds of thousands of accesses per second) with low energy consumption. This evolution introduces a mismatch between the simple disk model that underlies the design of today’s database systems and the complex SSDs of today’s computers. This mismatch leads to unpredictable performance, with slowdowns of orders of magnitude in IO latency that can hit an application at any time. To attack this problem, the obvious approach is to construct models that capture SSDs’ performance behaviour. However, our previous work has shown the limits of this approach, because (a) performance characteristics and energy profiles vary significantly across SSDs, and (b) performance varies over time on a single device depending on the history of accesses. The CLyDE project is based on the insight that the strict layering that has been so successful for designing database systems on top of magnetic disks is no longer applicable to SSDs. In other words, our central hypothesis is that the complexity of flash devices cannot be abstracted away, as doing so results in unpredictable and suboptimal performance. We postulate that database system designers need a clear and stable distinction between efficient and inefficient patterns of access to secondary storage, so that they can adapt space allocation strategies, data representation or query processing algorithms.
We propose (i) that SSDs should expose this distinction instead of aggressively mitigating the impact of inefficient patterns at the expense of efficient ones, and (ii) that the operating system and database system should explicitly provide mechanisms to ensure that efficient access patterns are favoured. We thus advocate a co-design of SSD controllers, operating systems and database systems with appropriate cross-layer optimisations. In this talk, I will report on the lessons we have learnt so far in the project. In particular, I will describe the SSD simulation frameworks that we have developed to explore cross-layer designs: EagleTree and LightNVM. I will discuss our findings on the importance of scheduling within an SSD. I will present our contribution to the redesign of the Linux block layer, which makes it possible for Linux to keep up with SSD performance on multi-socket systems. Finally, I will present preliminary results on the co-design of file systems and SSDs. CLyDE is a joint project between the IT University of Copenhagen and INRIA Paris-Rocquencourt, started in 2012 and funded by the Danish Council for Independent Research.
Bio: Philippe Bonnet is an associate professor at the IT University of Copenhagen. Philippe is an experimental computer scientist focused on building and tuning systems for performance and energy efficiency. His research interests include database tuning, flash-based database systems, secure personal data management, and sensor data engineering.
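As a rough illustration of the efficient/inefficient access-pattern distinction discussed above, the following minimal Python sketch (not from the talk; the block size, working-set size and temporary-file layout are arbitrary assumptions) times sequential versus scattered 4 KiB writes on whatever device backs the temporary directory:

```python
# Illustrative sketch: contrast a sequential write pattern with a randomly
# scattered one over the same file. On flash, scattered small writes tend
# to trigger far more device-side work than sequential ones.
import os
import random
import tempfile
import time

BLOCK = 4096      # 4 KiB, a common page-aligned IO unit (assumption)
N_BLOCKS = 1024   # 4 MiB working set, small enough for a quick demo


def timed_writes(path, offsets):
    """Write one block at each block offset and return elapsed seconds."""
    buf = os.urandom(BLOCK)
    start = time.perf_counter()
    with open(path, "r+b") as f:
        for off in offsets:
            f.seek(off * BLOCK)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # force the data down to the device
    return time.perf_counter() - start


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.truncate(N_BLOCKS * BLOCK)  # pre-size the file
    path = tmp.name

sequential = list(range(N_BLOCKS))
scattered = sequential[:]
random.shuffle(scattered)

t_seq = timed_writes(path, sequential)
t_rnd = timed_writes(path, scattered)
print(f"sequential: {t_seq:.4f}s, random: {t_rnd:.4f}s")
os.remove(path)
```

On a single run the numbers are noisy (page cache, device history), which is precisely the point the abstract makes about unpredictable SSD behaviour.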
Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-philippe-bonnet-clyde-mid-flight-what-we-have-learnt-so-far-about-the-ssd-based-io-stack-may-28-11-00-am/
Apr 10
Launch of the Triton joint laboratory (I-Lab) with Beepeers
The Triton joint laboratory (I-Lab), between the company Beepeers (beepeers.com) and our research team, started in March 2014 with the arrival of a PhD engineer. Beepeers is an innovative startup, created in 2011, that offers companies a platform to help them develop their enterprise social networks on mobile devices (smartphones and tablets) and on the Web. Beepeers now wishes to prepare the industrialisation and large-scale deployment of its platform through a joint R&D project with Inria. Through this I-Lab, Beepeers aims to create a dedicated collaboration middleware.
Permanent link to this article: https://team.inria.fr/zenith/demarrage-du-laboratoire-commum-triton-i-lab-avec-beepeers/
Mar 28
PhD position: “A Data-Centric Execution Model for Scientific Workflows”
PhD position
Advisor: Didier Parigot
The Zenith team deals with the management of scientific applications that are computation-intensive and manipulate large amounts of data. These applications are often represented by workflows, which describe sequences of tasks (computations) and data dependencies between these tasks. Several scientific workflow environments have already been proposed [taylor07]. However, they have little support for efficiently managing large data sets. The Zenith team is developing an original approach that deals with such large data sets in a way that allows efficient placement of both tasks and data on large-scale (distributed and parallel) infrastructures for more efficient execution. To this end, we propose an original solution that combines the advantages of cloud computing and P2P technologies. This work is part of the IBC project (Institut de Biologie Computationelle – http://www.ibc-montpellier.fr), in collaboration with biologists, in particular from CIRAD and IRD, and with cloud providers. The concept of cloud computing combines several technology advances such as service-oriented architectures, resource virtualization, and novel data management systems referred to as NoSQL. These technologies enable flexible and extensible usage of resources, which is referred to as elasticity. In addition, the cloud allows users to simply outsource data storage and application execution. For the manipulation of big data, NoSQL database systems such as Google Bigtable, Hadoop HBase, Amazon Dynamo, Apache Cassandra and 10gen MongoDB have recently been proposed. Existing scientific workflow environments [taylor07] have been developed primarily to simplify the design and execution of a set of tasks on a particular infrastructure. For example, in the field of biology, the Galaxy environment allows users to introduce catalogs of functions/tasks and compose these functions with existing ones in order to build a workflow.
These environments propose a design approach that we can classify as “process-oriented”, where information about data dependencies (data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, like clusters and grids. Finally, the data produced by scientific workflows are often stored in loosely structured files for further analysis. Thus, data management is fairly basic, with data either stored on a centralized disk or directly transferred between tasks. This approach is not suitable for data-intensive applications, where data management becomes the major bottleneck in terms of data transfers. As part of a new project that develops a middleware for scientific workflows (SciFloware), the objective of this thesis is to design a declarative data-centric language for expressing scientific workflows, together with its associated execution model. A declarative language is important to provide for automatic optimization and parallelization [ogasawara11]. The execution model for this language will be decentralized, in order to yield flexible execution in distributed and parallel environments, and will capitalize on execution models developed in the context of distributed and parallel database systems [valduriez11]. To validate this work, a prototype will be implemented using the SON middleware [parigot12] and a distributed file system like HDFS. As for application fields, this work will be carried out in close relationship with the Virtual Plants team, which develops computational models of plant development to understand the physical and biological principles that drive the development of plant branching systems and organs. In particular, OpenAlea [pradal08] is a software platform for plant analysis and modelling at different scales.
It provides a scientific workflow environment to integrate different tasks for plant reconstruction, analysis, simulation and visualisation at the tissue level [lucas13] and at the plant level [boudon12]. One challenging application in biology and computer science is to process and analyse data collected on high-throughput phenotyping platforms. The SciFloware middleware, combined with OpenAlea, will improve the capability of the plant science community to analyse, at high throughput, variables that are hardly accessible in the field, such as architecture, the response of organ growth to environmental conditions, or radiation use efficiency. This will improve the ability of this community to model the genetic variability of plant responses to environmental cues associated with climate change.
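To make the “data-centric” design concrete, here is a hypothetical Python sketch of the idea described above: tasks declare only which datasets they consume and produce, and a tiny engine derives the execution order from those data dependencies alone. The API and task names are invented for illustration and are not SciFloware’s actual interface.

```python
# Minimal data-centric workflow engine: tasks are (inputs, outputs, fn)
# triples; a task runs as soon as every dataset it consumes exists.
class Workflow:
    def __init__(self):
        self.tasks = {}  # name -> (inputs, outputs, fn)

    def task(self, name, inputs, outputs, fn):
        self.tasks[name] = (inputs, outputs, fn)

    def run(self, initial):
        data = dict(initial)        # dataset name -> value
        pending = dict(self.tasks)
        while pending:
            # a task is ready once all datasets it consumes are available
            ready = [n for n, (ins, _, _) in pending.items()
                     if all(i in data for i in ins)]
            if not ready:
                raise RuntimeError("unsatisfiable data dependencies")
            for name in ready:
                ins, outs, fn = pending.pop(name)
                results = fn(*(data[i] for i in ins))
                produced = [results] if len(outs) == 1 else results
                for out, val in zip(outs, produced):
                    data[out] = val
        return data


wf = Workflow()
wf.task("clean", ["raw"],           ["clean"],  lambda xs: [x for x in xs if x >= 0])
wf.task("stats", ["clean"],         ["mean"],   lambda xs: sum(xs) / len(xs))
wf.task("scale", ["clean", "mean"], ["scaled"], lambda xs, m: [x / m for x in xs])

result = wf.run({"raw": [3, -1, 6, 9]})
print(result["scaled"])  # → [0.5, 1.0, 1.5]
```

Because the description is declarative (only data dependencies are stated), an engine is free to reorder, parallelize or distribute the tasks, which is the property the thesis aims to exploit.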
References
[ogasawara11] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339, 2011.
[valduriez11] M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Third Edition, Springer, 2011.
[taylor07] I. J. Taylor, E. Deelman, D. B. Gannon, M. Shields. Workflows for e-Science: Scientific Workflows for Grids. First Edition, Springer, 2007.
[parigot12] Ait-Lahcen, D. Parigot. A Lightweight Middleware for Developing P2P Applications with Component and Service-Based Principles. 15th IEEE International Conference on Computational Science and Engineering, 2012.
[pradal08] C. Pradal, S. Dufour-Kowalski, F. Boudon, C. Fournier, C. Godin. OpenAlea: A Visual Programming and Component-Based Software Platform for Plant Modeling. Functional Plant Biology, 2008.
[lucas13] M. Lucas et al. Lateral Root Morphogenesis Is Dependent on the Mechanical Properties of the Overlaying Tissues. Proceedings of the National Academy of Sciences, 110(13): 5229-5234, 2013.
[boudon12] F. Boudon, C. Pradal, T. Cokelaer, P. Prusinkiewicz, C. Godin. L-Py: An L-System Simulation Framework for Modeling Plant Architecture Development Based on a Dynamic Language. Frontiers in Plant Science, 3, 2012.
Contact: Didier Parigot (Firstname.Lastname@inria.fr)
Apply online
Permanent link to this article: https://team.inria.fr/zenith/a-data-centric-execution-model-for-scientific-workflows/
Mar 05
Analysis, extraction, propagation and search of data from enterprise social networks
Thesis title:
Real-time recommendation for sector-specific social networks
Thesis advisor: Didier Parigot
http://www-sop.inria.fr/members/Didier.Parigot/
Collaboration with the company Beepeers (beepeers.com), whose activity is the creation of a collaborative platform of social tools for companies.
Location: Inria Sophia-Antipolis
Funding: University grant
Introduction
In recent years, the topics of managing large volumes of data (Big Data) and open data (Open Data) have grown in importance with the rise of social networks and the Internet. Indeed, by exploiting or analysing the data being manipulated, it is possible to extract new, relevant information that makes it possible to offer new services or tools. In the context of a collaboration between our Zenith project-team and the young startup Beepeers, which markets a platform for the development of sector-specific social networks, we propose this research topic in order to enrich this platform with new advanced services based on the extraction and analysis of the data produced by these enterprise social networks.
Thesis objective
The objective of the thesis is to propose and combine various data analysis techniques (algorithms) in order to provide advanced services for the Beepeers platform. The Beepeers platform already offers a rich set of features and services that produce a mass of information and data, which will form the initial dataset for this research work.
Within this well-targeted application context, the PhD student will have to propose information extraction algorithms based on an original combination of the following techniques:
- analysis of user behaviour;
- extraction of user profiles;
- propagation or diffusion of information across the network, or between the different social networks connected to the Beepeers platform;
- recommendation of people, services or events based on the opinions of the network’s users (a feature already available in the Beepeers platform);
- extraction via continuous (persistent) database queries over the open data sites that are available and relevant to the underlying sector-specific network.
In addition, an original implementation based on a decentralized service-oriented architecture will be required, to allow the proposed solutions to scale and to support dynamic, on-demand deployment of the advanced services.
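The recommendation technique mentioned above can be sketched in a few lines. This is an illustrative user-based collaborative filter (not Beepeers’ actual system); the users, items and ratings below are invented for the example.

```python
# Minimal user-based recommender: score the items a user has not rated,
# weighting each rating by the cosine similarity between users.
from math import sqrt

ratings = {  # user -> {item: score from the network's reviews}
    "ana":   {"event_a": 5, "event_b": 3, "service_x": 4},
    "bruno": {"event_a": 4, "event_b": 2, "service_y": 5},
    "chloe": {"event_b": 5, "service_x": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = (sqrt(sum(s * s for s in u.values()))
           * sqrt(sum(s * s for s in v.values())))
    return num / den


def recommend(user, k=1):
    """Top-k unrated items, scored by similarity-weighted ratings."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, score in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * score
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(recommend("ana"))  # → ['service_y']
```

In the thesis, such a recommender would be one service among several (usage analysis, profile extraction, information propagation), deployed on the decentralized architecture described above.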
Collaboration context
This collaboration is already the subject of a strong INRIA-SME partnership, through the creation and launch this year of a joint laboratory (I-Lab) named Triton, whose R&D programme is the design of an innovative, scalable architecture for the Beepeers platform. This R&D programme will build on our expertise in decentralized service-oriented architectures through the use of our tool SON (Shared Overlay Network). The PhD student will therefore be supported in his or her proposals by the R&D team of the Triton joint laboratory, and will be able to test and validate his or her algorithms on datasets from the new Beepeers platform developed within the Triton I-Lab. In addition, the student will be able to draw on the scientific expertise of the Zenith project-team in scientific data management.
Expected results and candidate profile
The candidate should have a strong taste for the practical validation of his or her research, and good abstraction skills in order to quickly master the various data analysis and extraction techniques involved, which come from several scientific communities (databases, usage analysis, and distributed programming for the implementation). The candidate should be able to work in a team, in close collaboration with Beepeers, to carry out this research. The work should quickly find fields of application through the concrete and effective implementation of new services for the Beepeers platform.
References
Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” Graph Data Management: Techniques and Applications, eds. S. Sakr, E. Pardede, IGI Global, ISBN:9781613500538, August 2011.
Haesung Lee, Joonhee Kwon. “Efficient Recommender System Based on Graph Data for Multimedia Application.”
Desired profile
- Engineering school degree (BAC + 5) or Master 2
- Taste for teamwork
- Good level of written English
How to apply
Please send the following documents, in PDF, by email to Didier.Parigot@inria.fr:
- CV,
- cover letter targeted at the thesis topic,
- at least two letters of recommendation,
- transcripts and the list of courses taken in M1 and M2.
Permanent link to this article: https://team.inria.fr/zenith/analyse-extraction-propagation-et-recherche-des-donnees-issues-des-reseaux-sociaux-dentreprise/
Dec 09
Zenith seminar: “Data Partitioning in Parallel Data Management Systems”, by Miguel Liroz, Dec 9, 2013
“Data Partitioning in Parallel Data Management Systems”
Miguel Liroz
Room G.227
During the last years, the volume of data that is captured and generated has exploded. Advances in computer technology, which provide cheap storage and increased computing capabilities, have allowed organizations to perform complex analyses on this data and to extract valuable knowledge from it. This trend has been very important not only for industry, but has also had a significant impact on science, where enhanced instruments and more complex simulations call for an efficient management of huge quantities of data.
Parallel computing is a fundamental technique in the management of large quantities of data, as it leverages the concurrent utilization of multiple computing resources. To take advantage of parallel computing, we need efficient data partitioning techniques, which are in charge of dividing the data and assigning the partitions to the processing nodes. Data partitioning is a complex problem, as it has to consider different and often contradictory issues, such as data locality, load balancing and maximizing parallelism.
In this thesis, we study the problem of data partitioning, particularly in scientific parallel databases that are continuously growing and in the MapReduce framework. In the case of scientific databases, we consider data partitioning in very large databases to which new data is appended continuously, e.g. in astronomical applications. Existing approaches are limited, since the complexity of the workload and the continuous appends restrict the applicability of traditional approaches. We propose two partitioning algorithms that dynamically partition new data elements using a technique based on data affinity. Our algorithms obtain very good data partitions with a low execution time compared to traditional approaches. We also study how to improve the performance of the MapReduce framework using data partitioning techniques.
In particular, we are interested in efficient data partitioning of the input datasets to reduce the amount of data that has to be transferred in the shuffle phase. We design and implement a strategy that, by capturing the relationships between input tuples and intermediate keys, obtains an efficient partitioning that can be used to significantly reduce MapReduce’s communication overhead.
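A hedged sketch of the idea (not the thesis’s actual algorithms): if we know, or can estimate, which intermediate keys each input tuple will emit during the map phase, a greedy affinity-based assignment can co-locate tuples that share keys, so less data crosses nodes during the shuffle. The tuples, keys and node count below are invented for the illustration.

```python
# Greedy affinity-based partitioning: assign each input tuple to the node
# whose accumulated key set overlaps the tuple's emitted keys the most.
N_NODES = 2

# input tuple id -> set of intermediate keys it will emit in the map phase
emits = {
    "t1": {"k1", "k2"},
    "t2": {"k1"},
    "t3": {"k3"},
    "t4": {"k2", "k1"},
    "t5": {"k3"},
}


def partition(emits, n_nodes):
    """Return a tuple -> node assignment maximizing key co-location."""
    node_keys = [set() for _ in range(n_nodes)]  # keys seen per node
    node_load = [0] * n_nodes                    # tuples per node
    assign = {}
    # place tuples with many keys first, so affinities form around them
    for t, keys in sorted(emits.items(), key=lambda kv: -len(kv[1])):
        # affinity = number of shared keys; ties go to the least-loaded node
        best = max(range(n_nodes),
                   key=lambda n: (len(node_keys[n] & keys), -node_load[n]))
        assign[t] = best
        node_keys[best] |= keys
        node_load[best] += 1
    return assign


print(partition(emits, N_NODES))
```

Here the tuples emitting k1/k2 end up on one node and those emitting k3 on the other, so every intermediate key is produced on a single node and the shuffle moves no data, which is the effect the strategy above targets.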
Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-miguel-lirozdata-partitioning-in-parallel-data-management-systems-dec-9-11-am/