Zenith engages in a new project funded by Microsoft on advanced data storage and processing to support scientific workflows in the cloud.
Permanent link to this article: https://team.inria.fr/zenith/new-project-on-advanced-data-storage-and-processing-for-cloud-workflows-2013-2017-with-microsoft/
Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-marta-mattoso-big-data-workflows-how-provenance-can-help-march-25-2pm/
Permanent link to this article: https://team.inria.fr/zenith/ibc-seminar-patrick-valduriez-parallel-techniques-for-big-data-march-22-2pm/
Mar 21
Zenith seminar: Pierre Letessier, “Découverte et exploitation d’objets visuels fréquents dans des collections multimédia”, March 21, 11am.
Pierre will give a talk on his thesis work, “Découverte et exploitation d’objets visuels fréquents dans des collections multimédia” (Discovery and exploitation of frequent visual objects in multimedia collections). The talk will be given in French.
Abstract: The main objective of this thesis is the discovery of frequent visual objects in large multimedia collections (images or videos). As in many other fields (finance, genetics, …), the aim is to extract knowledge automatically or semi-automatically, using the frequency of occurrence of an object within a corpus as the relevance criterion. In the visual case, the problem differs from classical data mining (DNA, text, etc.) in that the occurrences of a given object are not identical entities but must be matched to one another. This difficulty also explains why we focus on the discovery of rigid objects (logos, manufactured objects, scenery, buildings, etc.) rather than on object categories with a higher semantic level (house, car, dog, …). Although rigid-object retrieval techniques have reached a certain maturity, the unsupervised discovery of object instances in large image collections remains difficult today: on the one hand, current methods are not efficient enough and scale poorly; on the other hand, recall and precision are still insufficient for many objects, particularly those that are very small compared to the surrounding visual context, which can be very rich (for instance, the logo of a political party appearing briefly in a television news report).
A first contribution of the thesis is to provide a formal framework for the problems of discovering and mining frequent visual object instances, which are defined in a rather confused way in the few recent works that address them. Among other things, this modeling allowed us to highlight the close link between the size of the objects to be discovered and the complexity of the problem.
The second contribution of the thesis is a generic method for solving both types of problems, based on an iterative process for sampling candidate objects and on an efficient method for matching rigid objects at large scale. The idea is to treat the instance-search step itself as a black box, to which we submit image regions that have a high probability of belonging to a frequent object of the collection. A first approach studied in the thesis simply considers all image regions of the collection as equiprobable, the guiding idea being that the most frequently instantiated objects are those with the largest spatial coverage and therefore the highest probability of being sampled. By generalizing this notion of coverage to the more generic notion of probabilistic coverage, we can then model the complexity of our method for any likelihood function given as input, and thus show the importance of this step.
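To make the idea concrete, here is a minimal sketch (in Python) of the sampling loop described above: regions are drawn with probability proportional to a likelihood function and submitted to a black-box instance search. All names (likelihood, instance_search, img.regions) are hypothetical illustrations, not the actual implementation of the thesis.

```python
import random

def discover_frequent_objects(images, likelihood, instance_search, n_iterations=1000):
    """Illustrative sketch of the iterative sampling process (hypothetical API).

    likelihood(region)            -- how likely a region is to belong to a frequent
                                     object (the baseline treats all regions as equiprobable)
    instance_search(region, imgs) -- the black-box rigid-object retrieval step
    """
    candidates = [region for img in images for region in img.regions]
    # Sampling weights: with a uniform likelihood, objects with the largest
    # spatial coverage are the most likely to be drawn.
    weights = [likelihood(r) for r in candidates]
    discovered = []
    for _ in range(n_iterations):
        seed = random.choices(candidates, weights=weights, k=1)[0]
        matches = instance_search(seed, images)
        if len(matches) > 1:
            # The sampled region recurs elsewhere: record the matched instances.
            discovered.append([seed, *matches])
    return discovered
```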
The third contribution of the thesis focuses precisely on building a likelihood function that approaches the ideal distribution as closely as possible, while remaining scalable and efficient. It relies on an original two-level hashing approach that first efficiently generates a set of visual matches, and then evaluates their relevance with respect to weak geometric constraints. Experiments show that, unlike state-of-the-art methods, our approach can efficiently discover very small objects in millions of images.
Finally, several scenarios for exploiting the visual graphs produced by our method are proposed and evaluated, including the detection of transmedia news events and visual query suggestion.
Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-pierre-letessierdecouverte-et-exploitation-dobjets-visuels-frequents-dans-des-collections-multimedia-march-21-11am/
Permanent link to this article: https://team.inria.fr/zenith/plntnet-iphone-app-roll-out/
Permanent link to this article: https://team.inria.fr/zenith/zenith-is-seeking-junior-researchers-inria-is-hiring-29-researchers-in-2013/
Feb 08
A hybrid P2P/cloud for Large Scale Data Sharing
Post-Doc Offer
With the advent of the Internet and the World Wide Web, there is a growing need to develop user applications that access data and resources stored in the network. To facilitate the development of such network-centric applications, new computational paradigms are needed that are scalable, elastic, available, and fault-tolerant. Over the past decades, two dominant paradigms, Peer-to-Peer (P2P) computing and cloud computing, have become widely prevalent for distributed applications. P2P computing is a highly decentralized paradigm that leverages computing resources at the user level to support decentralized applications such as wide-scale media file sharing, telecommunication services (e.g., Skype), and others. Cloud computing, on the other hand, relies on large data centers consisting of thousands of server-class machines, with all application processing and data centralized in the network core, i.e., the data centers [1,2]. The two paradigms are in many ways complementary and provide different trade-offs. For instance, computing and storage are almost free in P2P, but P2P suffers from churn and the low reliability of user machines. Cloud computing, on the other hand, greatly simplifies system administration in the data center but requires a very large investment in building large-scale data centers.
This postdoc topic requires research on new distributed architectures and algorithms that leverage both paradigms. At present, cloud computing has emerged as the dominant paradigm in the commercial realm. However, we contend that cloud computing is primarily suited to client-server interactions. As we move towards applications that are more collaborative and require continuous interactivity (i.e., latency-sensitive applications), the cloud computing paradigm may not be able to sustain them. Examples of such applications arise in distributed gaming, group video chat, online interactive classrooms, and synchronous group interactions in online social networks. What these applications have in common is that they require many-to-many communication as well as streaming media flows among all members.
The goal is to develop a hybrid platform that combines the two paradigms and leverages computing, storage, and network resources both in the data centers (i.e., the cloud) and at the edges of the network (i.e., the peer or user machines). We will also explore the suitability of this hybrid model for large-scale distributed data sharing through recommendation in different contexts, such as data streaming [3] and scientific online communities [4, 5]. The common issue here is that users have their own datasets (documents, videos, etc.) that are locally stored and controlled, and they are willing to share their data in a personalized and controlled way.
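As a rough illustration of the kind of hybrid architecture we have in mind, the Python sketch below serves reads from a peer overlay first and falls back to the cloud data center, which also acts as the durable copy. The PeerOverlay and CloudStore interfaces are hypothetical and do not refer to any existing system of the project.

```python
class HybridStore:
    """Sketch of a hybrid P2P/cloud data store (hypothetical interfaces)."""

    def __init__(self, peer_overlay, cloud_store):
        self.peers = peer_overlay   # e.g., a DHT over user machines (edge resources)
        self.cloud = cloud_store    # e.g., an object store in a data center

    def get(self, key):
        # Cheap path: ask the P2P overlay first (computing/storage nearly free).
        value = self.peers.lookup(key)
        if value is not None:
            return value
        # Reliable path: fall back to the cloud copy (tolerates peer churn).
        value = self.cloud.get(key)
        if value is not None:
            # Re-seed the overlay so later requests can be served by peers.
            self.peers.store(key, value)
        return value

    def put(self, key, value):
        # Write to the cloud for durability and to the peers for locality.
        self.cloud.put(key, value)
        self.peers.store(key, value)
```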
[1] Big Data and Cloud Computing: Current State and Future Opportunities, Divyakant Agrawal, Sudipto Das, Amr El Abbadi, EDBT 2011: 530-533.
[2] Database Scalability, Elasticity, and Autonomy in the Cloud, Divyakant Agrawal, Amr El Abbadi, Sudipto Das and Aaron J. Elmore, DASFAA (1) 2011: 2-15.
[3] Flower-CDN: a hybrid P2P overlay for efficient query processing in CDN, Manal El Dick, Esther Pacitti and Bettina Kemme, EDBT 2009: 427-438.
[4] P2Prec: A P2P Recommendation System for Large-Scale Data Sharing, Fady Draidi, Esther Pacitti and Bettina Kemme, Trans. Large-Scale Data- and Knowledge-Centered Systems 3: 87-116 (2011).
[5] Zenith: Scientific Data Management on a Large Scale, Esther Pacitti and Patrick Valduriez, ERCIM News 89, 2012.
Contact : Esther.Pacitti@lirmm.fr
Permanent link to this article: https://team.inria.fr/zenith/a-hybrid-p2pcloud-for-large-scale-data-sharing/
Feb 05
Zenith is seeking postdoc candidates with expertise in distributed and parallel data management, in particular, cloud and P2P computing.
In the context of the BigdataNet project between Zenith and the distributed systems team of Profs. Amr El Abbadi and Divy Agrawal at the University of California, Santa Barbara, we are seeking postdoc candidates with expertise in distributed and parallel data management (i.e., distributed and parallel systems and data management), in particular cloud and P2P computing.
The postdoc will be for 12 to 18 months and will be located in Montpellier, France, with trips to UCSB.
The postdoc candidate should hold a Ph.D. in computer science, obtained no more than one year ago, and have strong research experience in distributed and parallel systems as well as data management, as demonstrated by publications in major journals and conferences.
Gross salary: 2620 euros / month
Adviser and contact: Patrick.Valduriez@inria.fr
Co-adviser: Esther.Pacitti@lirmm.fr
Permanent link to this article: https://team.inria.fr/zenith/postoc-bigdatanet/
Feb 04
A Data-Centric Language and Execution Model for Scientific Workflows
PhD position
Advisors: Didier Parigot and Patrick Valduriez, Inria
The Zenith team deals with the management of scientific applications that are computation-intensive and manipulate large amounts of data. These applications are often represented by workflows, which describe sequences of tasks (computations) and the data dependencies between them. Several scientific workflow environments have already been proposed [3]. However, they provide little support for efficiently managing large data sets. The Zenith team is developing an original approach that handles such large data sets by allowing efficient placement of both tasks and data on large-scale (distributed and parallel) infrastructures for more efficient execution. To this end, we propose an original solution that combines the advantages of cloud computing and P2P technologies. This work is part of the IBC project (Institut de Biologie Computationnelle – http://www.ibc-montpellier.fr), in collaboration with biologists, in particular from CIRAD and IRD, and with cloud providers, in particular Microsoft.
The concept of cloud computing combines several technological advances such as Service-Oriented Architectures, resource virtualization, and novel data management systems referred to as NoSQL. These technologies enable flexible and extensible usage of resources, which is referred to as elasticity. In addition, the cloud allows users to simply outsource data storage and application execution. For the manipulation of big data, NoSQL database systems such as Google Bigtable, Hadoop HBase, Amazon Dynamo, Apache Cassandra, and 10gen MongoDB have recently been proposed.
Existing scientific workflow environments [3] have been developed primarily to simplify the design and execution of a set of tasks on a particular infrastructure. For example, in the field of biology, the Galaxy environment allows users to introduce catalogs of functions/tasks and compose them with existing functions in order to build a workflow. These environments follow a design approach that we can classify as “process-oriented”, where information about data dependencies (the data flow) is purely syntactic. In addition, the targeted execution infrastructures are mostly computation-oriented, like clusters and grids. Finally, the data produced by scientific workflows are often stored in loosely structured files for further analysis. Thus, data management is fairly basic, with data either stored on a centralized disk or transferred directly between tasks. This approach is not suitable for data-intensive applications, because data management becomes the major bottleneck in terms of data transfers.
As part of a new project that develops a middleware for scientific workflows (SciFloware), the objective of this thesis is to design a declarative, data-centric language for expressing scientific workflows, together with its associated execution model. A declarative language is important to enable automatic optimization and parallelization [1]. The execution model for this language will be decentralized, in order to allow flexible execution in distributed and parallel environments, and will capitalize on execution models developed in the context of distributed and parallel database systems [2]. To validate this work, a prototype will be implemented using the SON middleware [4] and a distributed file system such as HDFS.
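As a rough illustration of what “declarative and data-centric” means here (this is only a toy sketch, not the SciFloware language), the Python fragment below declares activities as operators over datasets with explicit data dependencies; an engine is then free to optimize the plan and run independent activities in parallel. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str                                    # task to execute (e.g., an alignment tool)
    op: str                                      # algebraic operator: "map", "filter", "reduce", ...
    inputs: list = field(default_factory=list)   # upstream activities (explicit data dependencies)

# Example: align sequence fragments, filter low-quality results, merge the rest.
align = Activity("align", op="map")
clean = Activity("clean", op="filter", inputs=[align])
merge = Activity("merge", op="reduce", inputs=[clean])

def ready(done, workflow):
    """Activities whose data dependencies are satisfied; independent ones can run in parallel."""
    return [a for a in workflow
            if a.name not in done and all(i.name in done for i in a.inputs)]

print([a.name for a in ready(set(), [align, clean, merge])])      # ['align']
print([a.name for a in ready({"align"}, [align, clean, merge])])  # ['clean']
```

Because the dependencies are part of the specification rather than hidden in scripts, an execution engine could, for instance, partition a “map” over data fragments or ship an activity to where its input resides, which is precisely what a purely syntactic, process-oriented data flow makes difficult.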
References
[1] E. Ogasawara, J. F. Dias, D. de Oliveira, F. Porto, P. Valduriez, M. Mattoso. An Algebraic Approach for Data-centric Scientific Workflows. Proceedings of the VLDB Endowment (PVLDB), 4(12): 1328-1339, 2011.
[2] M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Third Edition, Springer, 2011.
[3] I. J. Taylor, E. Deelman, D. B. Gannon, M. Shields. Workflows for e-Science: Scientific Workflows for Grids. First Edition, Springer, 2007.
[4] A. Ait-Lahcen, D. Parigot. A Lightweight Middleware for developing P2P Applications with Component and Service-Based Principles. 15th IEEE International Conference on Computational Science and Engineering, 2012.
Contact: Didier Parigot (Firstname.Lastname@inria.fr)
Permanent link to this article: https://team.inria.fr/zenith/a-data-centric-language-and-execution-model-for-scientific-workflows/
Permanent link to this article: https://team.inria.fr/zenith/zenith-scientific-seminar-dennis-shashaupstart-puzzles-january-30-2013/