Fête de la science: Zenith takes part in the Genopolys science village (3 days).

LIRMM and Inria will run a stand at the Genopolys science village for the Fête de la science. Join us on Thursday 10 and Friday 11 October for school groups, and on Saturday 12 October for the general public. On the programme: films, workshops (bottles and oceans, a kit of unplugged activities, ...) and the Datagramme game!

Permanent link to this article: https://team.inria.fr/zenith/fete-de-la-science-participation-de-zenith-au-village-des-sciences-de-genopolys-3-jours/

IBC seminar: Alexis Joly, “Pl@ntnet: interactive plant identification and collaborative information system”, Sept 20, 2pm.

Alexis Joly,
Zenith team, INRIA and LIRMM, France.

Pl@ntnet: interactive plant identification and collaborative information system.
Speeding up the collection and integration of raw botanical observation data is a crucial step towards the sustainable development of agriculture and the conservation of biodiversity. Initiated in the context of a citizen science project, the main contribution of Pl@ntNet (http://www.plantnet-project.org) is an innovative collaborative workflow focused on image-based plant identification as a means to enlist new contributors and facilitate access to botanical data. Since 2010, hundreds of thousands of geo-tagged and dated plant photographs have been collected and revised by hundreds of novice, amateur and expert botanists of a specialized social network. An image-based identification tool, available as both a web and a mobile application, is synchronized with that growing data and allows any user to query or enrich the system with new observations. An important originality is that it works with up to five different organs, contrary to previous approaches that mainly relied on the leaf. This allows querying the system at any period of the year and with complementary images composing a plant observation. Extensive experiments of the visual search engine as well as system-oriented and user-oriented evaluations of the application show that it is already very helpful to determine a plant among hundreds or thousands of species. At the time of writing, the whole framework covers about half of the plant species living in France (3776 species), which already makes it the widest existing automated identification tool.

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-alexis-jolyplntnet-interactive-plant-identification-and-collaborative-information-system-sept-20-2pm/

Zenith seminar: Irina Alles, “Time Series Clustering in the Field of Agronomy”, Sept 13, 2pm.

Irina Alles will present her work on phenotypic data clustering on September 13, at 2pm (Galera 127).

Title: Time Series Clustering in the Field of Agronomy

Abstract: This work is carried out in the field of agronomy, more precisely in the domain of plant phenotyping. Phenotyping studies the relationship between the genotype (genetic) and phenotype (behavior) of plants in several environmental scenarios. In order to understand certain plant characteristics, it compares several genetic varieties of plants in the same environment. The PhenoArch platform is a phenotyping platform enabling the monitoring of certain characteristics for more than 1000 plants. The obtained data consists of time series of plant traits such as growth, biomass and transpiration. The goal of this work is to ease the analysis of the obtained plant time series. Clustering is a widely used method in the data mining domain to divide a dataset into naturally occurring groups; it has demonstrated its benefit in a variety of fields. We will present how this technique can be applied in the field of phenotyping and its potential to ease further investigations.
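As a rough illustration of clustering applied to trait time series, here is a minimal k-means sketch. The abstract does not say which clustering method or distance the talk uses, so the choice of k-means with Euclidean distance, the function names and the toy growth curves are all assumptions made for this example.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_series(group):
    """Point-wise mean of a non-empty list of equal-length series."""
    return [sum(values) / len(values) for values in zip(*group)]


def kmeans(series, k, iterations=20):
    """Cluster series with k-means; the first k series seed the centroids."""
    centroids = [list(s) for s in series[:k]]
    labels = [0] * len(series)
    for _ in range(iterations):
        # Assignment step: attach each series to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(s, centroids[c]))
            for s in series
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, label in zip(series, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_series(members)
    return labels, centroids


# Hypothetical growth measurements for four plants at four time points.
growth = [
    [1, 3, 6, 10],  # fast-growing plant
    [1, 2, 2, 3],   # slow-growing plant
    [1, 4, 7, 11],  # fast-growing plant
    [1, 1, 2, 2],   # slow-growing plant
]
labels, centroids = kmeans(growth, k=2)
```

On this toy input the two fast-growing plants end up in one group and the two slow-growing ones in the other. Real phenotyping series would likely need normalization and a distance suited to time series, which this sketch leaves out.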

Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-irina-allestime-series-clustering-in-the-field-of-agronomy-sept-13-2pm/

Post-doc offer: Optimizing the Cloud for Data Mining

Topic: Cloud platforms rely on technologies and architectures that handle massive distribution of data and computation. They are usually provided and maintained by major companies (Amazon, Google, Yahoo, Microsoft). Hadoop is an open source platform written in Java that allows data management and processing in a cloud environment. It is maintained by the Apache Foundation and implements the Google MapReduce technology. Today, most solutions for data mining in the cloud are straightforward implementations of existing algorithms in the selected cloud programming language. A basic illustration is the MapReduce implementation of the Apriori algorithm, which performs successive counting steps that rely on the native cloud primitives.
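To make the "successive counting steps" concrete, the sketch below simulates one Apriori counting pass as a map phase followed by a reduce phase. It runs in plain Python and does not use Hadoop; the function names and the toy transactions are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, size, previous_frequent=None):
    """Emit (itemset, 1) for each candidate itemset of the given size."""
    for itemset in combinations(sorted(set(transaction)), size):
        if previous_frequent is not None and any(
            subset not in previous_frequent
            for subset in combinations(itemset, size - 1)
        ):
            continue  # Apriori pruning: some subset is not frequent
        yield itemset, 1


def reduce_phase(pairs):
    """Sum the counts emitted for each itemset."""
    counts = defaultdict(int)
    for itemset, count in pairs:
        counts[itemset] += count
    return dict(counts)


def count_frequent(transactions, size, min_support, previous_frequent=None):
    """One counting pass: map every transaction, reduce, keep frequent sets."""
    emitted = (
        pair
        for t in transactions
        for pair in map_phase(t, size, previous_frequent)
    )
    counts = reduce_phase(emitted)
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical market-basket data.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
]
singles = count_frequent(transactions, size=1, min_support=2)
pairs = count_frequent(transactions, size=2, min_support=2,
                       previous_frequent=singles)
```

Each call to `count_frequent` corresponds to one pass over the data. Since `jam` is not frequent on its own, the pair `(bread, jam)` is never emitted in the second pass.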


However, not all algorithms lend themselves to such straightforward implementations.
This work focuses on a set of major data mining algorithms and on optimizing Hadoop for them. These algorithms have to be useful for different applications (e.g., finding frequent itemsets and sequential patterns, clustering, etc.).

Missions and activities:

Your mission will consist in:

  • Proposing efficient algorithms for a set of well known data mining problems (frequent itemsets, clustering) that require specific adaptation to the cloud.
  • Implementing the proposed algorithms on top of Hadoop.
  • Performing experiments over real scientific data in an experimental platform for large scale parallel and distributed systems, to evaluate the performance of the proposed algorithms for the tackled data mining problems.


Skills and profiles:

– Strong knowledge of statistics.
– Good proficiency in English.
– Good programming skills in Java.
– A Ph.D. in computer science or mathematics.

Duration, Location and Salary:

Duration is 18 months and the location is Montpellier.

The position should be filled by September 2013 (however, a starting date as late as December 2013 may be negotiated). The position might be extended to 24 months in total (depending on the evolution of the funding).

The net salary is 2138 euros and includes social security (gross salary is 2620.84 euros).

Environment:

This post-doc will take place in the Zenith team of INRIA. It is funded by Datascale, a project funded by the French Government that involves industrial and academic partners (Bull, Armadillo, ActiveEon, Twenga, XediX, CEA, INRIA, IPGP). The project aims at developing technologies for Big Data.


The Zenith project-team of INRIA, headed by Patrick Valduriez, aims to propose new solutions related to scientific data and activities. Our research topics incorporate the management and analysis of massive and complex data, such as uncertain data, in highly distributed environments.

Our team is located in Montpellier, a very active town in the south of France. It brings together major research labs that work on environment and health, such as INRA, CIRAD and IRD. Generally speaking, these scientific activities generate extremely large amounts of complex data that need to be managed and analyzed.

Supervisors:

  • Patrick Valduriez
  • Florent Masseglia
  • Reza Akbarinia

Contact:

Please send your CV to Reza Akbarinia (reza.akbarinia@inria.fr) and/or Florent Masseglia (florent.masseglia@inria.fr).

Permanent link to this article: https://team.inria.fr/zenith/post-doc-offer-optimizing-the-cloud-for-data-mining/

Zenith seminar: Mohamed Reda Bouadjenek, “Approaches and Algorithms for Information Retrieval Based On Social Network Analysis/Mining”, July 5, 11am.

Zenith seminar

July 5, 2013, 11am, room 127, Galera

Approaches and Algorithms for Information Retrieval Based On Social Network Analysis/Mining.

Mohamed Reda Bouadjenek – Laboratoire PRiSM, Université de Versailles-Saint-Quentin-en-Yvelines

Abstract. Web 2.0 has given users new freedom in their relationship with the Web by facilitating interactions with other users who have similar tastes. Social platforms and networks are certainly the most widely adopted technologies of this new era. These platforms allow users to interact with peers, exchange messages, share resources, etc. These so-called “collaborative tasks” generate huge quantities of data. From a research perspective, this brings important and interesting challenges for many research fields.

In such a context, a crucial problem is to enable users to find relevant information with respect to their interests and needs. This task is commonly referred to as Information Retrieval (IR). However, classic IR models do not consider the social dimension of the Web. Consequently, these models, and even the IR paradigm itself, should be adapted to the socialization of the Web in order to fully leverage the social context that surrounds web pages and users. This talk presents three methods illustrating our contributions in this direction: (i) query expansion, (ii) document modeling, and (iii) result ranking. All the presented approaches are based on social annotations, extracted from folksonomies, as the source of social information.

Permanent link to this article: https://team.inria.fr/zenith/seminaire-du-pole-donnees-connaissances-mohamed-reda-bouadjenek-approaches-and-algorithms-for-information-retrieval-based-on-social-network-analysismining-5-juillet-11h00/

“Données Connaissances” cluster seminar: Manuel Serrano, “From computers to tablets: programming the diffuse Web”, July 1, 2:30pm.

“Données Connaissances” cluster seminar

Organized by the Zenith team
Monday, July 1, 2:30pm

Room Galera 127

From computers to tablets: programming the diffuse Web.

Manuel Serrano,

INRIA Sophia Antipolis

Personal computing has been profoundly transformed by smartphones and tablets. Within a few years, these devices have caught up in number, and almost in capability, with the personal computers we have been using since the 1980s. Because phones are so compact, we carry them with us (almost) all the time. And because they are closely connected to the physical world through a multitude of sensors, and to the electronic world through wide network coverage, they enable new applications that were unimaginable just a few years ago: diffuse applications.

However, diffuse programming is complex: it combines most of the difficulties of conventional programming and adds a set of new problems. In this seminar we will present Hop, a programming language designed to address these problems. It relies heavily on the architecture of the Web, which it treats as a vast execution platform. The seminar will begin with a brief historical overview of Web programming techniques, followed by a presentation of the main features of the language. A realistic application will then be presented, and some aspects of its implementation discussed in detail.

Permanent link to this article: https://team.inria.fr/zenith/seminaire-du-pole-donnees-connaissances-manuel-serrano-des-ordinateurs-aux-tablettes-la-programmation-du-web-diffus-1er-juillet-14h30/

“Données Connaissances” cluster seminar: Eliya Buyukkaya, “A peer-to-peer-based virtual environment system”, July 1, 11am.

“Données et Connaissances” cluster seminar
July 1, 2013, 11am, room 127, Galera

A peer-to-peer-based virtual environment system
Eliya Buyukkaya – ENSSAT (École Nationale Supérieure des Sciences Appliquées et de Technologie)

Abstract: 
Virtual environments (VEs) are 3-D virtual worlds in which a huge number of participants play roles and interact with their surroundings through virtual representations called avatars. VEs are traditionally supported by a client/server architecture. However, centralized architectures can lead to a bottleneck on the server due to high communication and computation overhead during peak loads. Thus, P2P overlay networks are emerging as a promising architecture for VEs. However, exploiting P2P schemes in VEs is not straightforward, and several challenging issues related to data distribution and state consistency should be considered.

One of the key aspects of P2P-based VEs is the logical platform consisting of connectivity, communication and data architectures, on which the VE is based. The connectivity architecture is the overlay topology structure, which defines how peers are connected to each other. The communication architecture is the routing protocol defining how peers can exchange messages, while the data architecture defines how data are distributed over the logical overlay. The design of these architectures has significant influence on the performance and scalability of VEs.

First, we propose a scalable connectivity architecture based on a new triangulation algorithm reducing maintenance cost of the system. Second, we construct a communication architecture built on top of the connectivity architecture ensuring that each message reaches its intended destination. Finally, we propose a data architecture ensuring the management of data with different characteristics in terms of mobility in the VE, while providing a fair data distribution and low data transfer between peers in the VE.

Permanent link to this article: https://team.inria.fr/zenith/seminaire-du-pole-donnees-connaissances-eliya-buyukkaya-a-peer-to-peer-based-virtual-environment-system-1er-juillet-11h00/

NUMEV: Data axis meeting, Tuesday May 21, 10:30am – 12pm.

The next meeting of the NUMEV Data axis will take place on Tuesday May 21, 10:30am – 12pm, La Galera, room 127.

Programme:

  • NUMEV news: calls for projects, workshop
  • Nadine Hilgert (INRA). Some research directions in functional data statistics for high-throughput phenotyping data (Phenome project). Abstract: Phenotyping platforms generate large quantities of data from measurements of various variables over time, on hundreds or thousands of plants. Making the most of these masses of data is a challenge for producing new knowledge in biology and genetics. The aim is to develop a methodology for analysing and modelling phenotyping data, in the same spirit as what was done for genotyping with the emergence of bioinformatics. I will present some open research questions and discuss possible solutions in functional data statistics.
  • Maximilien Servajean (LIRMM), Esther Pacitti (LIRMM), Sihem Amer-Yahia (LIG), Pascal Neveu (INRA). Profile diversity in search and recommendation. Abstract: We investigate profile diversity, a novel idea in searching scientific documents. Combining keyword relevance with popularity in a scoring function has been the subject of different forms of social relevance. Content diversity has been thoroughly studied in search and advertising, database queries, and recommendations. We believe our work is the first to investigate profile diversity to address the problem of returning highly popular but too-focused documents. We show how to adapt Fagin’s threshold-based algorithms to return the most relevant and most popular documents that satisfy content and profile diversities and run preliminary experiments on two benchmarks to validate our scoring function.
  • Andre Mas (I3M), Pascal Poncelet (LIRMM). Update on the VIPP project.
  • Discussions.

Permanent link to this article: https://team.inria.fr/zenith/numev-reunion-de-laxe-donnees-mardi-21-mai-10h30-12h/

PhD offer: Multisite Management of Data-intensive Scientific Workflows in the Cloud

Directors: Esther Pacitti (University Montpellier 2), Marta Mattoso (UFRJ) and Patrick Valduriez (Inria)
Contact: Patrick.Valduriez@inria.fr
Funding: The joint Microsoft-Inria Research Center
Gross salary : 1957 euros/month (36 months)

This work is part of a new project on advanced data storage and processing for cloud workflows (2013-2017) funded by Microsoft Research, in collaboration with the Kerdata INRIA team. It will be conducted within the Institut de Biologie Computationnelle in Montpellier.

Scientific workflows allow scientists to easily express multi-step computational tasks, for instance, load input data files, preprocess the data, run various analyses, and aggregate the results. A scientific workflow describes the dependencies between tasks, typically as a Directed Acyclic Graph (DAG) where the nodes are tasks (that can call programs) and the edges express the task dependencies. As scientific workflows need to deal more and more with big data, it becomes critical to process them in high-performance computing environments such as clusters or clouds. Some scientific workflow systems such as Pegasus and Swift provide parallel support but with an imperative language, which forces optimization and parallelization to be hardcoded.
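The DAG view of a workflow can be sketched with Python's standard `graphlib` module (Python 3.9+). The task names below are hypothetical and simply mirror the load, preprocess, analyze and aggregate steps mentioned above. This is not how Pegasus, Swift or Chiron represent workflows.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "load": set(),
    "preprocess": {"load"},
    "analysis_a": {"preprocess"},
    "analysis_b": {"preprocess"},
    "aggregate": {"analysis_a", "analysis_b"},
}


def execution_stages(deps):
    """Group tasks into stages; tasks within a stage are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose inputs are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(dependencies)
```

Tasks that fall in the same stage have no dependency on each other, so a scheduler could run them in parallel. Here the two analyses share a stage.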

To be amenable to automatic optimization and parallel processing, the specification of a workflow should be high-level. Recently [1], we have proposed an algebraic approach for the optimization and parallelization of data-intensive scientific workflows. This approach is based on a workflow algebra with powerful operators such as Filter, Map and Reduce, a set of algebraic transformation rules as a basis for optimization and a parallel execution model. It has been implemented in Chiron [2] in a cluster environment.
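The sketch below illustrates why an algebraic specification helps optimization. It uses simplified Filter, Map and Reduce operators over lists of dictionaries. It is not Chiron's API or the algebra defined in [1]; the operator signatures, the toy activities and the rewrite shown (applying a Filter before a Map when the predicate only reads attributes the Map leaves untouched) are assumptions chosen for illustration.

```python
def map_op(activity, relation):
    """Apply an activity to every tuple of the relation."""
    return [activity(t) for t in relation]


def filter_op(predicate, relation):
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def reduce_op(activity, key, relation):
    """Group tuples by a key attribute and apply an activity to each group."""
    groups = {}
    for t in relation:
        groups.setdefault(t[key], []).append(t)
    return [activity(k, group) for k, group in groups.items()]


invocations = []  # records each call to the (supposedly costly) activity


def simulate(t):
    """Hypothetical expensive activity: adds a 'result' attribute."""
    invocations.append(t)
    return {**t, "result": t["x"] ** 2}


def from_site_a(t):
    return t["site"] == "A"


def summarize(site, group):
    return {"site": site, "total": sum(t["result"] for t in group)}


inputs = [{"site": "A", "x": 1}, {"site": "B", "x": 2}, {"site": "A", "x": 3}]

# Plan 1: run the activity on everything, then filter.
plan_1 = reduce_op(summarize, "site",
                   filter_op(from_site_a, map_op(simulate, inputs)))
calls_plan_1 = len(invocations)

# Plan 2: filter first, which is valid because 'site' is untouched by simulate.
invocations.clear()
plan_2 = reduce_op(summarize, "site",
                   map_op(simulate, filter_op(from_site_a, inputs)))
calls_plan_2 = len(invocations)
```

Both plans return the same relation, but the second invokes the costly activity fewer times. An optimizer can only apply such a rewrite automatically when the workflow is expressed through declarative operators like these.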

In this thesis, we consider the problem of managing algebraic workflows to run efficiently in a multisite cloud environment, where each site has its own cluster, data and programs. Such an environment is well suited for scientific communities, with groups and labs located at geographically dispersed sites. The problem resembles multisite query processing in distributed and parallel database systems [3,4] and we plan to develop similar techniques for workflow decomposition, optimization and parallelization, dynamic task allocation and efficient management of intermediate data to be exchanged between sites. These techniques will be validated by a prototype implemented using the BlobSeer distributed storage system [5] on Microsoft Azure.

Note: a second Ph.D. position related to the joint project is available in the Kerdata team.

References:

[1] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. In Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339, 2011.

[2] E. Ogasawara, D. Jonas, V. Silva, C. Fernando, D. De Oliveira, F. Porto, P. Valduriez, M. Mattoso. Chiron: A Parallel Engine for Algebraic Scientific Workflows. Journal of Concurrency and Computation: Practice and Experience, 2013.

[3] M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Third Edition, Springer, ISBN 978-1-4419-8833-1, 2011.

[4] E. Pacitti, R. Akbarinia, M. El Dick. P2P Techniques for Decentralized Applications. Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2012.

[5] B. Nicolae, G. Antoniu, L. Bougé, D. Moise, A. Carpen-Amarie. BlobSeer: Next Generation Data Management for Large Scale Infrastructures. Journal of Parallel and Distributed Computing, 71 (2):168-184, 2011.

Requirements

  • Distributed programming, distributed and parallel data management, programming languages like C++, Java.
  • Fluent English (internship stays at MSR Redmond, USA, are planned).

Permanent link to this article: https://team.inria.fr/zenith/phd-2013/

Zenith seminar: Maximilien Servajean, “Profile Diversity in Search and Recommendation”, May 7, 3pm.

Maximilien will present recent work, accepted at a workshop held with WWW 2013. Galéra, room 127.

Title: Profile Diversity in Search and Recommendation

Abstract: We investigate profile diversity, a novel idea in searching scientific documents. Combining keyword relevance with popularity in a scoring function has been the subject of different forms of social relevance. Content diversity has been thoroughly studied in search and advertising, database queries, and recommendations. We believe our work is the first to investigate profile diversity to address the problem of returning highly popular but too-focused documents. We show how to adapt Fagin’s threshold-based algorithms to return the most relevant and most popular documents that satisfy content and profile diversities and run preliminary experiments on two benchmarks to validate our scoring function.
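For readers unfamiliar with threshold-based top-k processing, here is a minimal sketch of the classic threshold algorithm with a sum aggregate over two criteria (relevance and popularity). It does not include the content or profile diversity adaptations described in the talk; the function name, the integer scores and the document identifiers are made up for this example.

```python
import heapq


def threshold_algorithm(score_lists, k):
    """Top-k documents by summed score, with early termination.

    score_lists: list of dicts mapping document id -> score, one per
    criterion. Every document is assumed to appear in every dict.
    Returns the top-k (document, total) pairs and the depth reached.
    """
    # Sorted access: each list ordered by decreasing score.
    sorted_lists = [
        sorted(scores.items(), key=lambda item: item[1], reverse=True)
        for scores in score_lists
    ]
    totals = {}
    depth_limit = len(sorted_lists[0])
    for depth in range(depth_limit):
        for ranked in sorted_lists:
            doc, _ = ranked[depth]
            if doc not in totals:
                # Random access: fetch this document's score in every list.
                totals[doc] = sum(scores[doc] for scores in score_lists)
        # No unseen document can exceed the sum of the last scores seen.
        threshold = sum(ranked[depth][1] for ranked in sorted_lists)
        top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth + 1
    top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
    return top, depth_limit


# Hypothetical scores for four documents.
relevance = {"d1": 9, "d2": 8, "d3": 3, "d4": 1}
popularity = {"d1": 7, "d2": 9, "d3": 2, "d4": 4}
top, depth = threshold_algorithm([relevance, popularity], k=2)
```

In this toy run the algorithm stops after reading two entries of each list, without ever accessing `d3` or `d4`.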

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-maximilien-servajeanprofile-diversity-in-search-and-recommendation-may-7-3pm/