New project on advanced data storage and processing for cloud workflows with Microsoft

Zenith is engaging in a new project, funded by Microsoft, on the problem of advanced data storage and processing to support scientific workflows in the cloud. More here.


IBC seminar: Marta Mattoso, “Big Data Workflows – how provenance can help”, March 25, 2pm.

IBC Seminar

Monday, March 25, 2pm

Room 127, Galéra building

Organized by the Zenith team

Big Data Workflows – how provenance can help
Marta Mattoso
UFRJ, Rio de Janeiro
Brazil

Big data analyses are critical for decision support in business data processing. These analyses involve the execution of many activities, such as programs to explore data from the web, databases, data warehouses and files; data cleaning procedures; programs to aggregate data; core programs that perform the analyses; and tools to visualize and interpret the results. Each step (activity) of the analysis is performed in isolation from the others, and analysts need to manually manage the overall life cycle of the big data analysis. Big data analyses have started to be represented as pipelines or dataflows. However, current approaches lack features to provide a consistent view of many different explorations and activities as parts of a broader analysis, like a computational experiment. Scientific workflows have long provided such features for scientific experiments, and although originally designed for science, they may be useful to support the life cycle of big data analysis.

Scientific analyses typically involve experimenting with several steps using different datasets and computer programs. Scientists need to manage the composition, execution and analysis of their experiments carefully, so that the results can be trusted and the experiments reproduced. To help manage experiments, scientific workflow management systems (SWfMS) have been proposed to let scientists design workflows of different complexities and manage their execution, including high performance computing (HPC) in cloud environments. Most SWfMS also provide provenance support. Provenance tracks how the results of an experiment were produced, which is essential to make an experiment (or big data analysis) reproducible and trustworthy. Business process workflows, in contrast, focus on modeling the process rather than on managing big data flows with provenance and HPC. In this talk, we discuss provenance support along the big data analysis workflow as a way to improve the results of big data analysis, especially in a long-term view.
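To make the provenance idea concrete, here is a minimal sketch (our illustration, not from the talk, and far simpler than a real SWfMS): a hypothetical wrapper runs one workflow activity and appends a provenance record linking its outputs to its inputs, code and timing.

```python
import hashlib
import json
import time
from pathlib import Path


def checksum(path):
    """Content hash of a data file, so results can be traced to exact inputs."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_activity(name, func, inputs, outputs, log="provenance.jsonl"):
    """Run one workflow activity and append a provenance record.

    Hypothetical convention: `func(inputs, outputs)` reads the files listed in
    `inputs` and writes the files listed in `outputs`. The record is what makes
    the analysis auditable and reproducible after the fact.
    """
    start = time.time()
    func(inputs, outputs)
    record = {
        "activity": name,
        "inputs": {p: checksum(p) for p in inputs},
        "outputs": {p: checksum(p) for p in outputs},
        "started": start,
        "duration_s": time.time() - start,
    }
    with open(log, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```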


IBC seminar: Patrick Valduriez, “Parallel Techniques for Big Data”, March 22, 2pm.

This 6th seminar, organized as part of the institute's axis 5 "Data", will take place on Friday, March 22, from 2pm to 3:30pm, at IBC, room 127 (directions: http://g.co/maps/ygsrk):

Patrick Valduriez,
Zenith team, INRIA and LIRMM
http://www-sop.inria.fr/members/Patrick.Valduriez/

Parallel Techniques for Big Data

Big data has become a buzzword, referring to massive amounts of data that are very hard to deal with using traditional data management tools. In particular, the ability to produce high-value information and knowledge from big data makes it critical for many applications such as decision support, forecasting, business intelligence, research, and (data-intensive) science. Processing and analyzing massive, possibly complex data is a major challenge, since solutions must combine new data management techniques (to deal with new kinds of data) with large-scale parallelism in cluster, grid or cloud environments. Parallel data processing has long been exploited in the context of distributed and parallel database systems for highly structured data. But big data encompasses different data formats (documents, sequences, graphs, arrays, …) that require significant extensions to traditional parallel techniques. In this talk, I will discuss such extensions, from the basic techniques and architectures to NoSQL systems and MapReduce.
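As a deliberately toy illustration of the parallel data processing model mentioned above (not taken from the talk), here is a minimal in-memory MapReduce-style word count; real systems add data partitioning, distributed storage and fault tolerance.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit (word, 1) pairs from each document; each document can be processed independently."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key."""
    return {key: sum(values) for key, values in groups.items()}


docs = ["big data needs parallel processing", "parallel data processing at scale"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 1, 'data': 2, 'needs': 1, 'parallel': 2, 'processing': 2, 'at': 1, 'scale': 1}
```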


Zenith seminar: Pierre Letessier, “Découverte et exploitation d’objets visuels fréquents dans des collections multimédia”, March 21, 11am.

Pierre will give a talk on his thesis work, “Découverte et exploitation d’objets visuels fréquents dans des collections multimédia” (discovery and exploitation of frequent visual objects in multimedia collections). The talk will be given in French.

Abstract: The main objective of this thesis is the discovery of frequent visual objects in large multimedia collections (images or videos). As in many other domains (finance, genetics, …), the aim is to extract knowledge automatically or semi-automatically, using the frequency of occurrence of an object within a corpus as the relevance criterion. In the visual case, the problem differs from classical data mining (DNA, text, etc.), since the occurrences of a given object are not identical entities but must be matched to one another. This difficulty also explains why we focus on the discovery of rigid objects (logos, manufactured objects, scenery, buildings, etc.) rather than on object categories with a higher semantic level (house, car, dog, …). Although retrieval techniques for rigid objects have reached a certain maturity, the unsupervised discovery of object instances in large image collections remains difficult today. On the one hand, current methods are not efficient enough and do not scale well; on the other hand, recall and precision are still insufficient for many objects, particularly those that are very small compared to the surrounding contextual visual information, which can be very rich (for example, the logo of a political party appearing briefly in a TV news report).

A first contribution of this thesis is to provide a formal framework for the problems of discovering and mining frequent visual object instances, two problems that are defined rather loosely in the few recent works of the literature addressing them. Among other things, this modeling allowed us to highlight the close link between the size of the objects to be discovered and the complexity of the problem to be solved.

The second contribution of the thesis is a generic method for solving these two types of problems, relying on an iterative process for sampling candidate objects on the one hand, and on an efficient large-scale rigid object matching method on the other. The idea is to treat the instance search step itself as a simple black box, to which we submit image regions that have a high probability of belonging to a frequent object of the collection. A first approach studied in the thesis simply considers all image regions of the collection as equiprobable, the guiding idea being that the most frequently instantiated objects are those with the largest spatial coverage and thus the highest probability of being sampled. By generalizing this notion of coverage to the more generic one of probabilistic coverage, it then becomes possible to model the complexity of our method for any likelihood function given as input, and thus to show the importance of this step.
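As a rough illustration of this sampling scheme (our sketch, not the thesis code; the region representation, the likelihood function and the black-box matcher are all hypothetical), the loop below samples regions according to a likelihood and queries the matcher until a coverage criterion is met.

```python
import random


def discover_frequent_objects(regions, search_instances, likelihood, n_iter=1000, seed=0):
    """Iteratively sample candidate regions and query a black-box instance search.

    regions:          list of candidate image regions (any representation)
    search_instances: black box mapping a query region to the list of indices of
                      regions that match it in the collection
    likelihood:       weights a region by its estimated chance of belonging to a
                      frequent object (a constant function gives the equiprobable
                      baseline, where larger objects cover more regions and are
                      therefore sampled more often)
    """
    rng = random.Random(seed)
    weights = [likelihood(r) for r in regions]
    indices = list(range(len(regions)))
    discovered, covered = [], set()
    for _ in range(n_iter):
        q = rng.choices(indices, weights=weights, k=1)[0]
        matches = search_instances(regions[q])
        if len(matches) > 1:              # the query occurs elsewhere: a frequent object
            discovered.append((q, matches))
        covered.update(matches + [q])
        if len(covered) == len(regions):  # probabilistic coverage of the collection reached
            break
    return discovered
```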

The third contribution of the thesis focuses precisely on building a likelihood function that approximates the ideal distribution as closely as possible, while remaining scalable and efficient. It relies on an original two-level hashing approach that first efficiently generates a set of visual matches and then evaluates their relevance under weak geometric constraints. Experiments show that, unlike state-of-the-art methods, our approach can efficiently discover very small objects in millions of images.
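The following toy sketch (again our illustration, not the actual method) shows the two-level idea in spirit: a coarse hash over descriptors generates candidate matches cheaply, and a weak geometric check, here a simple agreement on the dominant scale ratio, filters them; the feature format (descriptor, scale) is an assumption.

```python
from collections import defaultdict

import numpy as np


def coarse_hash(descriptor, projections):
    """First level: signs of a few random projections give a compact bucket key."""
    return tuple((np.asarray(descriptor) @ projections > 0).astype(int).tolist())


def candidate_matches(features_a, features_b, projections):
    """Generate candidate matches by colliding descriptors that fall in the same bucket."""
    buckets = defaultdict(list)
    for i, (desc, _scale) in enumerate(features_a):
        buckets[coarse_hash(desc, projections)].append(i)
    return [(i, j)
            for j, (desc, _scale) in enumerate(features_b)
            for i in buckets.get(coarse_hash(desc, projections), [])]


def weak_geometric_filter(pairs, features_a, features_b, tol=0.3):
    """Second level: keep matches whose scale ratio agrees with the dominant ratio."""
    if not pairs:
        return []
    ratios = [features_b[j][1] / features_a[i][1] for i, j in pairs]
    dominant = float(np.median(ratios))
    return [p for p, r in zip(pairs, ratios) if abs(r - dominant) <= tol * dominant]
```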

Finally, several scenarios for exploiting the visual graphs produced by our method are proposed and evaluated, including the detection of transmedia news events and visual query suggestion.


Pl@ntNet iPhone app rolls out

The Pl@ntNet iPhone app is an image sharing and retrieval application for the identification of plants. It is developed in the context of the Pl@ntNet project by scientists from four French research organisations (INRIA, Cirad, INRA, IRD) and by members of the Tela Botanica social network, with the financial support of Agropolis Fondation.
Among other features, this free app helps identify plant species from photographs, through a visual search engine that builds on several research results of Zenith on large-scale content-based retrieval and high-dimensional data hashing.
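As a rough idea of what hashing-based visual search looks like (a generic random-projection sketch, not Pl@ntNet's actual implementation), image descriptors are binarized and compared by Hamming distance.

```python
import numpy as np


class HashIndex:
    """Toy index: binarize descriptors with random projections, rank by Hamming distance."""

    def __init__(self, dim, n_bits=64, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((dim, n_bits))  # random hyperplanes
        self.codes, self.labels = [], []

    def _code(self, descriptor):
        return np.asarray(descriptor) @ self.planes > 0   # compact binary code

    def add(self, descriptor, label):
        self.codes.append(self._code(descriptor))
        self.labels.append(label)

    def query(self, descriptor, k=5):
        code = self._code(descriptor)
        dists = [int(np.count_nonzero(code != c)) for c in self.codes]
        order = np.argsort(dists)[:k]
        return [(self.labels[i], dists[i]) for i in order]
```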


ZENITH is seeking junior researchers (INRIA is hiring 29 researchers in 2013)

INRIA is hiring 29 researchers in 2013


The competition is now open:
http://www.inria.fr/en/news/news-from-inria/researcher-competitive-selection-2013

ZENITH is seeking junior researchers in scientific data management, with expertise in areas such as data mining, distributed and parallel databases, scientific workflows, P2P computing, cloud computing, and content-based information retrieval.


A Hybrid P2P/Cloud for Large-Scale Data Sharing

Post-Doc Offer

A Hybrid P2P/Cloud for Large-Scale Data Sharing

With the advent of the Internet and the World Wide Web, there is an emerging need to develop user applications that access data and resources stored in the network. In order to facilitate the development of network-centric applications, new computational paradigms are needed that are scalable, elastic, available, and fault-tolerant. During the past decades, two dominant paradigms, Peer-to-Peer (P2P) computing and cloud computing, have become widely prevalent for distributed applications. Peer-to-peer computing is a highly decentralized paradigm that leverages computing resources at the user level to support decentralized user-level applications such as wide-scale media file sharing, telecommunication services (e.g., Skype), and others. Cloud computing, on the other hand, relies on large data centers consisting of thousands of server-class machines, with all application processing and application data centralized in the network core, i.e., the data centers [1,2]. The two paradigms are in many ways complementary and provide different trade-offs. For instance, computing and storage are almost free in P2P, but P2P suffers from the challenges of churn and the low reliability of user machines. Cloud computing, on the other hand, significantly simplifies the task of system administration in the data center, but requires a very large investment in building large-scale data centers.

This postdoc topic requires research on new distributed architectures and algorithms that leverage the above two paradigms. At present, in the commercial realm, cloud computing has emerged as the dominant paradigm. However, we contend that cloud computing is primarily suited to supporting client-server interactions. As we move towards applications that are more collaborative and require continuous interactivity (i.e., latency-sensitive applications), the cloud computing paradigm may not be able to sustain them. Examples of such applications arise in distributed gaming, group video chat, online interactive classrooms, and synchronous group interactions in online social networks. The commonality among all these applications is that they require many-to-many communication as well as streaming media flows among all members.

The goal is to develop a hybrid platform that combines the two paradigms and leverages computing, storage, and network resources both in the data centers (i.e., the cloud) and at the edges of the network (i.e., the peer or user machines). In addition, we will explore the suitability of this hybrid model for large-scale distributed data sharing through recommendation in different contexts, such as data streaming [3] and scientific online communities [4, 5]. The common issue here is that users have their own datasets (documents, videos, etc.), locally stored and controlled, and are willing to share their data in a personalized and controlled way.
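A minimal sketch of the hybrid idea (hypothetical peer and cloud interfaces, not a project deliverable): a read is served by nearby peers when possible and falls back to the cloud store on a miss, while writes are made durable in the cloud and opportunistically replicated at the edge.

```python
class HybridStore:
    """Toy hybrid P2P/cloud store: peers are asked first, the cloud is the fallback.

    Assumed (hypothetical) interfaces: each peer exposes get(key) -> value or None
    and put(key, value); the cloud object exposes get(key) and put(key, value).
    """

    def __init__(self, peers, cloud):
        self.peers = list(peers)
        self.cloud = cloud

    def get(self, key):
        for peer in self.peers:            # cheap and nearby, but peers may churn
            value = peer.get(key)
            if value is not None:
                return value
        return self.cloud.get(key)         # reliable, but adds latency and cost

    def put(self, key, value):
        self.cloud.put(key, value)         # durability in the data center
        if self.peers:
            self.peers[0].put(key, value)  # opportunistic edge replica
```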

[1] Big Data and Cloud Computing: Current State and Future Opportunities, Divyakant Agrawal, Sudipto Das, Amr El Abbadi, EDBT 2011: 530-533.
[2] Database Scalability, Elasticity, and Autonomy in the Cloud, Divyakant Agrawal, Amr El Abbadi, Sudipto Das, and Aaron J. Elmore, DASFAA (1) 2011: 2-15.
[3] Flower-CDN: a hybrid P2P overlay for efficient query processing in CDN, Manal El Dick, Esther Pacitti and Bettina Kemme, EDBT 2009: 427-438.
[4] P2Prec: A P2P Recommendation System for Large-Scale Data Sharing, Fady Draidi, Esther Pacitti and Bettina Kemme, Trans. Large-Scale Data- and Knowledge-Centered Systems 3: 87-116 (2011).
[5] Zenith: Scientific Data Management on a Large Scale, Esther Pacitti and Patrick Valduriez, ERCIM News 2012(89): (2012).

Contact: Esther.Pacitti@lirmm.fr


Zenith is seeking postdoc candidates with expertise in distributed and parallel data management, in particular, cloud and P2P computing.

In the context of the BigdataNet project between Zenith and the distributed systems team of Profs. Amr El Abbadi and Divy Agrawal at the University of California, Santa Barbara, we are seeking postdoc candidates with expertise in distributed and parallel data management (i.e., both distributed and parallel systems and data management), in particular cloud and P2P computing.

The postdoc will be for 12 to 18 months and will be located in Montpellier, France, with trips to UCSB.

The postdoc candidate should hold a Ph.D. in computer science, obtained no more than one year ago, and have strong research experience in distributed and parallel systems as well as data management, as demonstrated by publications in major journals and conferences.

Gross salary: 2620 euros / month

Adviser and contact: Patrick.Valduriez@inria.fr
Co-adviser: Esther.Pacitti@lirmm.fr


A Data-Centric Language and Execution Model for Scientific Workflows

PhD position

Advisors: Didier Parigot and Patrick Valduriez, Inria

The Zenith team deals with the management of scientific applications that are computation-intensive and manipulate large amounts of data. These applications are often represented by workflows, which describe sequences of tasks (computations) and data dependencies between these tasks. Several scientific workflow environments have already been proposed [3]. However, they have little support for efficiently managing large data sets. The Zenith team is developing an original approach that deals with such large data sets by allowing efficient placement of both tasks and data on large-scale (distributed and parallel) infrastructures, for more efficient execution. To this end, we propose an original solution that combines the advantages of cloud computing and P2P technologies. This work is part of the IBC project (Institut de Biologie Computationnelle – http://www.ibc-montpellier.fr), in collaboration with biologists, in particular from CIRAD and IRD, and cloud providers, in particular Microsoft.

The concept of cloud computing combines several technology advances such as Service-Oriented Architectures, resource virtualization, and novel data management systems referred to as NoSQL. These technologies enable flexible and extensible usage of resources, which is referred to as elasticity. In addition, the cloud allows users to simply outsource data storage and application execution. For the manipulation of big data, NoSQL database systems, such as Google Bigtable, Hadoop HBase, Amazon Dynamo, Apache Cassandra, and 10gen MongoDB, have recently been proposed.

Existing scientific workflow environments [3] have been developed primarily to simplify the design and execution of a set of tasks on a particular infrastructure. For example, in the field of biology, the Galaxy environment allows users to introduce catalogs of functions/tasks and compose them with existing functions in order to build a workflow. These environments propose a design approach that we can classify as “process-oriented”, where information about data dependencies (the data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, like clusters and grids. Finally, the data produced by scientific workflows are often stored in loosely structured files for further analysis. Data management is thus fairly basic, with data either stored on a centralized disk or directly transferred between tasks. This approach is not suitable for data-intensive applications, because data management becomes the major bottleneck in terms of data transfers.

As part of a new project that develops a middleware for scientific workflows (SciFloware), the objective of this thesis is to design a declarative, data-centric language for expressing scientific workflows, together with its associated execution model. A declarative language is important to enable automatic optimization and parallelization [1]. The execution model for this language will be decentralized, in order to allow flexible execution in distributed and parallel environments, and will capitalize on execution models developed in the context of distributed and parallel database systems [2]. To validate this work, a prototype will be implemented using the SON middleware [4] and a distributed file system such as HDFS.
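To make the data-centric flavor concrete, here is a toy sketch (our illustration, loosely inspired by the algebraic approach of [1]; it is not SciFloware code, and the operator names are hypothetical): a workflow is built as a declarative expression over datasets, which an engine could later optimize and parallelize.

```python
from functools import reduce as freduce


class Dataset:
    """A declarative dataset: operators build an expression tree instead of executing."""

    def __init__(self, op, *children, **params):
        self.op, self.children, self.params = op, children, params

    def map(self, task):
        return Dataset("map", self, task=task)          # apply a task to every item

    def filter(self, predicate):
        return Dataset("filter", self, predicate=predicate)

    def reduce(self, task):
        return Dataset("reduce", self, task=task)       # aggregate all items


def execute(expr, sources):
    """Naive sequential evaluator; a real engine could split 'map' across workers."""
    if expr.op == "source":
        return list(sources[expr.params["name"]])
    child = execute(expr.children[0], sources)
    if expr.op == "map":
        return [expr.params["task"](x) for x in child]
    if expr.op == "filter":
        return [x for x in child if expr.params["predicate"](x)]
    if expr.op == "reduce":
        return freduce(expr.params["task"], child)
    raise ValueError(expr.op)


# Example: a tiny sequence-processing workflow expressed declaratively.
sequences = Dataset("source", name="sequences")
workflow = sequences.filter(lambda s: len(s) > 3).map(len).reduce(lambda a, b: a + b)
print(execute(workflow, {"sequences": ["ACGT", "AC", "ACGTA"]}))  # 9
```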

References

[1] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. Proceedings of the VLDB Endowment (PVLDB), 4(12) : 1328-1339, 2011.

[2] M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Third Edition, Springer, 2011.

[3] I. J. Taylor, E. Deelman, D. B. Gannon, M. Shields. Workflows for e-Science: Scientific Workflows for Grids. First Edition, Springer, 2007.

[4] A. Ait-Lahcen, D. Parigot. A Lightweight Middleware for developing P2P Applications with Component and Service-Based Principles. 15th IEEE International Conference on Computational Science and Engineering, 2012.

Contact: Didier Parigot (Firstname.Lastname@inria.fr)

Apply online


Zenith seminar: Dennis Shasha, “Upstart Puzzles”, January 30, 2013.

Galéra building, room 127, at 10:30.

Dr. Dennis Shasha is a professor of Mathematical Sciences in the Department of Computer Science at NYU. Along with research and teaching in biological computing, pattern recognition, database tuning, cryptographic file systems, and the like, Dennis is well known for his mathematical puzzle column for Dr. Dobb’s, whose readers are very sharp, and for his Puzzling Adventures column for Scientific American. His puzzle writing has given birth to fictional books about a mathematical detective named Dr. Ecco. Dr. Shasha has also co-authored numerous highly technical books. Dennis speaks often at conferences and is a tireless self-promoter in the world of “mensa-like” puzzles.

More details at www.cs.nyu.edu/shasha

Title: Upstart Puzzles

Abstract: The writer of puzzles often invents puzzles to illustrate a principle. The puzzles, however, sometimes have other ideas. They speak up and say that they would be so much prettier as slight variants of their original selves.

The dilemma is that the puzzle inventor sometimes can’t solve those variants. Sometimes he finds out that his colleagues can’t solve them either, because there is no existing theory for solving them. At that point, these sassy variants deserve to be called upstarts.

We discuss a few upstarts originally inspired by the Falklands/Malvinas War, zero-knowledge proofs, hikers in Colorado, and city planning. They have given a good deal of trouble to a certain mathematical detective whom I know well.
