H2020 High Performance Computing for Energy (2015-2017)

Zenith is now engaging in the H2020 High Performance Computing for Energy (HPC4E) project between Brazil and Europe. Building on the excellent collaboration with Brazil in the CNPq-Inria HOSCAR project, HPC4E (2015-2017) is funded (2M€) by the Horizon 2020 programme of the European Commission.

Read more.

Permanent link to this article: https://team.inria.fr/zenith/h2020-high-performance-computing-for-energy-2015-2017/

Zenith seminar: Maximilien Servajean, “Simple Games for Complex Classification Problems”

Title: Simple Games for Complex Classification Problems
Maximilien Servajean
03/07/2015, 11h
Abstract:
Pl@ntNet is a large-scale innovative participatory sensing platform relying on image-based plant identification as a means to enlist non-expert contributors and facilitate the production of botanical observation data. The iOS and Android mobile applications, which allow users to identify plants and share observations, have been downloaded by more than 350K users in 170 countries and count up to ten thousand users per day. Nowadays, the whole collection contains more than 180K images covering about 7K plant species (mainly in Western Europe). The collected data, revised by a large social network specialized in botany (Tela Botanica), is used by several international projects, including the GBIF initiative and the LifeCLEF evaluation campaign, which involves hundreds of research groups working on automatic recognition tools and methods. However, there is still a need for human validation and identification when the application fails to propose a correct identification, which happens especially for the rarest and most interesting plants. Crowdsourcing has attracted a lot of interest in recent years. In such approaches, users are asked to solve micro-tasks whose results are then aggregated using mathematical tools in order to create knowledge. Unfortunately, asking a large set of users to identify random plants is practically impossible. In this presentation, I will show some preliminary work on combining automatic identification tools and crowdsourcing in order to identify the maximum possible number of plants.
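As a toy illustration of the aggregation step mentioned in the abstract (not Pl@ntNet's actual method), crowdsourced micro-task answers can be combined by a weighted majority vote; the user ids, species names and reliability weights below are hypothetical:

```python
from collections import defaultdict

def aggregate_votes(votes, weights=None):
    """Aggregate crowdsourced species labels by (optionally weighted) majority.

    votes:   list of (user, species) pairs from micro-tasks
    weights: optional dict user -> reliability weight (default 1.0)
    """
    weights = weights or {}
    scores = defaultdict(float)
    for user, species in votes:
        scores[species] += weights.get(user, 1.0)
    # the species with the highest total score wins
    return max(scores, key=scores.get)

votes = [("u1", "Quercus ilex"), ("u2", "Quercus ilex"), ("u3", "Quercus robur")]
print(aggregate_votes(votes))                      # plain majority
print(aggregate_votes(votes, {"u3": 3.0}))         # an expert vote can outweigh
```

Real aggregation models (e.g. Dawid-Skene style estimators) also learn the reliability weights from the data rather than assuming them.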

Permanent link to this article: https://team.inria.fr/zenith/zenith-seminar-maximilien-servajean-simple-games-for-complex-classification-problems/

HDR Defense “Large-scale Content-based Visual Information Retrieval”, Alexis Joly, May 26, 2015

Alexis Joly’s HDR defense will take place on Tuesday, May 26, 2015 at 14h at LIRMM, salle des séminaires.
 
Title: Large-scale Content-based Visual Information Retrieval
 
Jury members:
– Pr. Jenny BENOIS-PINEAU, université Bordeaux 1 (reviewer)
– Pr. Bernard MERIALDO, EURECOM (reviewer)
– Dr. Yannis KOMPATSIARIS, CERTH-ITI (reviewer)
– Pr. Nozha BOUJEMAA, Inria Saclay
– Dr. Olivier BUISSON, INA
– Dr. Guillaume GRAVIER, IRISA
– Pr. William PUECH, LIRMM
– Pr. Patrick VALDURIEZ, Inria & LIRMM
Abstract: 
Rather than restricting search to the use of metadata, content-based information retrieval methods attempt to index, search and browse digital objects by means of signatures or features describing their actual content. Such methods have been intensively studied in the multimedia community to allow managing the massive amount of raw multimedia documents created every day (e.g. video will account for 84% of U.S. internet traffic by 2018). Recent years have consequently witnessed a consistent growth of content-aware and multi-modal search engines deployed on massive multimedia data. Popular multimedia search applications such as Google Images, YouTube, Shazam, TinEye or MusicID clearly demonstrate that the first generation of large-scale audio-visual search technologies is now mature enough to be deployed on real-world big data. All these successful applications greatly benefited from 15 years of research on multimedia analysis and efficient content-based indexing techniques. Yet the maturity reached by the first generation of content-based search engines does not preclude an intensive research activity in the field. There are actually still many hard problems to be solved before we can retrieve information in images or sounds as easily as we do in text documents. Content-based search methods have to reach a finer understanding of the contents as well as a higher semantic level. This requires modeling the raw signals by more and more complex and numerous features, so that the algorithms for analyzing, indexing and searching such features have to evolve accordingly. This thesis describes several of my works related to large-scale content-based information retrieval. The different contributions are presented in a bottom-up fashion reflecting a typical three-tier software architecture of an end-to-end multimedia information retrieval system.
The lowest layer is only concerned with managing, indexing and searching large sets of high-dimensional feature vectors, whatever their origin or role in the upper levels (visual or audio features, global or part-based descriptions, low or high semantic level, etc.). The middle layer works at the document level and is in charge of analyzing, indexing and searching collections of documents. It typically extracts and embeds the low-level features, implements the querying mechanisms and post-processes the results returned by the lower layer. The upper layer works at the applicative level and is in charge of providing useful and interactive functionalities to the end-user. It typically implements the front-end of the search application, the crawler, and the orchestration of the different indexing and search services.
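As an illustration of the lowest layer's task, here is a minimal brute-force k-nearest-neighbor search over feature vectors; real large-scale systems use approximate indexing structures instead, and the toy 2-D descriptors and image ids below are invented for the example:

```python
import math

def knn_search(query, index, k=2):
    """Brute-force k-nearest-neighbor search over an in-memory index.

    index: dict mapping document id -> feature vector (list of floats).
    Returns the ids of the k vectors closest to `query` (Euclidean distance).
    """
    ranked = sorted(index, key=lambda doc_id: math.dist(query, index[doc_id]))
    return ranked[:k]

# Toy 2-D "descriptors"; real visual features have hundreds of dimensions.
index = {"img_a": [0.1, 0.9], "img_b": [0.8, 0.2], "img_c": [0.15, 0.85]}
print(knn_search([0.12, 0.88], index))  # -> ['img_a', 'img_c']
```

The exhaustive scan is O(n) per query, which is exactly why sub-linear high-dimensional indexing is a research topic in its own right.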

Permanent link to this article: https://team.inria.fr/zenith/hdr-defense-of-alexis-joly-26-may-2015-14h-at-lirmm/

PhD Position “Privacy Preserving Query Processing in Decentralized Online Social Networks”, April 28, 2015

PhD Position: Privacy Preserving Query Processing in Decentralized Online Social Networks

Topic

We propose a PhD position (CIFRE) in the context of a collaboration between the INRIA Zenith team (https://team.inria.fr/zenith/) and the startup MatchUpBox (http://matchupbox.com/), which is developing a P2P social network application whose objective is to enable users to share their content while protecting sensitive data from unauthorized access.

Recently, there has been growing research interest in extending privacy-preserving techniques to social networks. One of the main challenges is to develop practical solutions that protect users' sensitive data while offering a good degree of utility for the queries executed over the published data.

The objective of this PhD thesis is to provide a query processing service that allows efficient evaluation of OLAP (online analytical processing) queries without disclosing any information about individual data. An example of such a query is the following: given a topic, what is the percentage of users interested in the topic? To realize such a service, we need a distributed architecture that allows us to control the queries asked on each user's data, and to block the answers if the privacy restrictions are not respected. The queries should be executed in a fully decentralized way, and the communications between the nodes should be secure. We also need to take into account the particular characteristics of P2P architectures, especially the dynamic behavior of peers, which can leave the system at any time.
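One classical building block for such privacy-preserving aggregates is additive secret sharing: each peer splits its private 0/1 interest bit into random shares, so that only the global count, never an individual bit, is revealed. The sketch below is only an illustration of the principle, not the technique the thesis will necessarily adopt:

```python
import random

MOD = 2**31 - 1  # arithmetic modulus for the shares

def share(value, n, modulus=MOD):
    """Split an integer into n additive shares; any n-1 shares reveal nothing."""
    shares = [random.randrange(modulus) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

def percentage_interested(interests, modulus=MOD):
    """Each peer secret-shares its private 0/1 bit across all peers;
    peers sum the shares they hold, and only the aggregate is reconstructed."""
    n = len(interests)
    all_shares = [share(bit, n, modulus) for bit in interests]
    # peer j sums the j-th share of every user; partial sums are then combined
    partials = [sum(user_shares[j] for user_shares in all_shares) % modulus
                for j in range(n)]
    count = sum(partials) % modulus
    return 100.0 * count / n

print(percentage_interested([1, 0, 1, 1]))  # -> 75.0
```

A production protocol would also have to tolerate peers leaving mid-computation, which is precisely one of the P2P challenges mentioned above.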

Missions and activities

The tasks to be realized are the following:

  • Investigating the state-of-the-art approaches for privacy preservation in social networks.
  • Proposing a distributed architecture for the service of privacy preserving query processing.
  • Proposing efficient techniques for some important OLAP queries which will be used in the application.
  • Implementing a prototype of the proposed service.
  • Evaluating the prototype in a distributed platform, such as Grid5000.

Skills and profiles

We will study all applications carefully, but an ideal applicant will have the following skills:

  • Experience in privacy preserving methods (or cryptography).
  • Strong knowledge of C++ or Java.

Environment

The Zenith project-team of INRIA (https://team.inria.fr/zenith/), headed by Patrick Valduriez, aims to propose new solutions for distributed data management. The research topics of Zenith include parallel processing of massive data, recommendation in social networks, privacy-preserving data analysis, and probabilistic data management.

MatchUpBox (http://matchupbox.com/index.html) is an innovative startup specialized in data security and privacy. It is developing a social network that helps users regain control of their data and interact safely on the Internet. Users remain anonymous whenever they are connected. All the content users put on the MatchUpBox social network is protected and accessible only to the owner and those whom he or she authorizes.

Both Zenith and MatchUpBox are located in Montpellier, a very active city in the south of France.

How to apply

To apply for this PhD position, please send your CV and a motivation letter to the following email addresses:

  • Reza Akbarinia (reza.akbarinia@inria.fr)
  • Esther Pacitti (esther.pacitti@lirmm.fr)
  • Jorick Lartigau (jorick.lartigau@matchupbox.com)

Permanent link to this article: https://team.inria.fr/zenith/privacy-preserving-query-processing-in-decentralized-online-social-networks/

Morgenstern seminar: Dennis Shasha (NYU) “The Changing Nature of Invention in Computer Science”

Colloquium Jacques Morgenstern, Thursday, April 2, 11h, Amphithéâtre Gilles Kahn, Inria Sophia Antipolis.

The Changing Nature of Invention in Computer Science

Dennis Shasha

Courant Institute of Mathematical Sciences, New York University

Zenith team, Inria, LIRMM & IBC, Montpellier

What drives inventions in computing? Necessity seems to play only a minor role. Anger at the way things are is much more powerful, because it leads to easier ways to work (the invention of new computer languages). A general dissatisfaction with the practical or theoretical structure of the world can open up whole new approaches to problems (complexity theory and cryptography). Finally, a genuine collaboration between people and machines can lead to an entirely new kind of engineering for devices that will travel to far-off planets or to hostile environments. The talk will discuss the work of several inventors in computing and engineering, their inventions, and how they came up with them and how they plan to come up with more in the future. The ensuing discussion will address the fundamental nature of invention in a world partly populated by intelligent machines.

Permanent link to this article: https://team.inria.fr/zenith/morgenstern-seminar-the-changing-nature-of-invention-in-computer-science-by-dennis-shasha-april-2-2015/

Inria Sophia-Antipolis seminar: Dennis Shasha (NYU) “Group Testing to Describe Causality in Gene Networks”

Seminar 1: Wednesday, April 1, 2015, 15h, Salle Euler bleu, Inria Sophia-Antipolis

“Group Testing to Describe Causality in Gene Networks”

Dennis Shasha
Courant Institute of Mathematical Sciences, New York University
Zenith team, Inria, LIRMM & IBC, Montpellier

Genomics is essentially the study of network effects among genes. A typical outcome of a genomic study is experimental proof of one or more causal relationships of the form "induction in gene X causes repression in gene Y". A typical way to establish such connections is to knock out genes and see which other genes are affected. Alternatively, single genes can be over-expressed. The advent of CRISPR allows the suppression or over-expression of several genes at once. The question then is how to make use of such technology to discover causal links more efficiently. We have found a way to use combinatorial group testing for this purpose. The talk explains the algorithm and validates it on the DREAM simulator for genomic networks.
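The core idea of group testing can be illustrated in a highly simplified form (assuming a single causal gene and a perfect experimental oracle, which the actual work does not require) with binary splitting: perturbing half of the candidate pool per experiment finds the causal gene in log2(n) pooled experiments instead of n individual knockouts. The gene names and oracle below are invented:

```python
def find_causal_gene(genes, affects):
    """Binary-splitting group test.

    Repeatedly perturb half of the candidate pool and keep the half whose
    perturbation still affects the target gene. Assumes exactly one causal
    gene; `affects(pool)` is the pooled-experiment oracle.
    """
    pool = list(genes)
    while len(pool) > 1:
        half = pool[:len(pool) // 2]
        pool = half if affects(half) else pool[len(pool) // 2:]
    return pool[0]

genes = [f"g{i}" for i in range(8)]
oracle = lambda pool: "g5" in pool  # toy oracle: g5 drives the target
print(find_causal_gene(genes, oracle))  # -> g5, found in 3 pooled experiments
```

Handling multiple causal genes and noisy experiments requires the more elaborate combinatorial designs the talk describes.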

Permanent link to this article: https://team.inria.fr/zenith/seminar-group-testing-to-describe-causality-in-gene-networks-by-dennis-shasha-april-1-2015/

Turing Award: databases in the spotlight on Le Monde's Binaire blog, April 20, 2015


Patrick Valduriez's post for the Binaire blog (of the newspaper Le Monde) on this year's Turing Award: Michael Stonebraker, a researcher at the Massachusetts Institute of Technology (USA), has just won the ACM's prestigious Turing Award, often considered "the Nobel Prize of computer science". In its announcement of March 25, 2015, the ACM states that Stonebraker "invented many of the concepts that are used in almost all modern database systems...". This recognition at the highest international level is an opportunity to shed light on the singular place of data management in computer science research. Read more on the Binaire blog, in the article on Michael Stonebraker's Turing Award. Photo credit: M. Scott Brauer.

Permanent link to this article: https://team.inria.fr/zenith/prix-turing-2015-les-bases-de-donnees-a-lhonneur-via-le-blog-binaire-du-monde/

PhD position “Predictive Big Data Analytics: continuous and dynamically adaptable data request and processing”, April 9, 2015

PhD position available at Inria.

Location: Sophia Antipolis
Funding: Labex UCN@SOPHIA (accepted)

Title: Predictive Big Data Analytics: continuous and dynamically adaptable data request and processing

Main advisor: Francoise Baude (PR), Scale team, CNRS I3S
Mail: Francoise.baude@unice.fr Web page: http://www-sop.inria.fr/members/francoise.baude/
Co-advisor: Didier Parigot (HDR), Zenith team, INRIA CRISAM
Mail: Didier.Parigot@inria.fr Web page: http://www-sop.inria.fr/members/Didier.Parigot/
 
Keywords: Large scale distribution, Data Stream processing, real-time data analytics, Component Model.
 
Application: The candidate should have a background in large-scale data management and component models, and be proficient in English. Send us a detailed CV, including a complete bibliography and recommendation letters.
 
Contacts: Francoise.baude@unice.fr or Didier.Parigot@inria.fr

Context:

As the popularity of Big Data explodes, more and more use cases are implemented with this kind of technology. But some use cases are not properly handled by classic Big Data models and platforms like Apache Hadoop MapReduce, because of these models' intrinsically batch nature. These are cases where online processing of new data is required as soon as it enters the system, in order to aggregate the newest information extracted from the incoming data into the current analysis results. Such online, continuous processing corresponds to what is known as continuous queries and triggers in the more focused context of databases [1][2], or as complex event processing in publish-subscribe systems [3]. More generally, processing incoming data on the fly is known as data stream processing, and in the big data area as real-time data analytics.
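The contrast with batch processing can be sketched in a few lines: a continuous query emits an updated result as each item arrives, rather than after the whole dataset is collected. The sliding-window mean below, over an invented stream of readings, is a minimal example of such a query:

```python
from collections import deque

def windowed_mean(stream, size=3):
    """Continuous-query sketch: emit the mean of the last `size` readings
    every time a new item arrives, instead of batch-processing afterwards."""
    window = deque(maxlen=size)  # old readings fall out automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)

# One output per input item, available immediately:
print(list(windowed_mean([10, 20, 30, 40])))  # -> [10.0, 15.0, 20.0, 30.0]
```

Platforms like Spark Streaming or Storm distribute exactly this kind of windowed operator over a cluster, with fault tolerance added.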

Social networks are nowadays the medium of choice for data delivery among end-users, be it in public circles or in more private spheres such as dedicated professional networks. Analyzing, understanding, recommending and rating the vast amount of data of various kinds (text, images, videos, etc.) is a feature increasingly required of the underlying systems of these social networks. In particular, the Beepers startup associated with the Zenith team is working on an alerting tool as part of its toolbox for building a social network. The goal of this joint PhD is partly motivated by this perspective, as deploying an alert amounts to a persistent (possibly sophisticated) query, and thus to a data streaming program. Through the partnership established by this common PhD research, we want to offer Beepers, as added value, the capability for a query to evolve dynamically. The motivation for this evolution is to better track an emerging trend, sentiment or opinion, as extracted by the continuous data flow analytics.

Indeed, some situations call for what could be named anticipatory analytics: given data gathered from various sources and combined to extract meaningful information, the goal is to adapt the current analytics so that it matches the anticipated coming situation, somehow ahead of time. Take short-term weather forecasting for local places: if the wind suddenly strengthens and changes direction while intense rain falls, there is a need (1) to update the short-term weather predictions, but also (2) to deploy appropriate supervision of the now-endangered zone. If the newly targeted zone is inhabited and at risk of flooding, the system must be able to trigger alerts towards the right actors: if the flood can now reach the hospital, evacuation should start, whereas it was not necessary a few minutes before, when the wind was still soft and, given its direction, the risky zone was a cornfield. Supervising electricity delivery also becomes necessary, to make sure any blackout risk during the evacuation can be anticipated. Another example of anticipatory analytics is in multimedia stream processing: a data journalist extracting data from a preferred social network and running analytics on videos to learn about happenings in a specific city area, prior to writing a press article, suspects from the broadcast content that someone appearing there is involved in a crime scene shown in a concurrently published video. The journalist accordingly adapts the current analytics to combine the two information sources around common data, namely the relatives whose accommodation lies in that city area, and continues the search focusing on these relatives' acts while still monitoring the initially recognized suspect.

From these exemplary scenarios, we clearly foresee that one trend in big data technologies is predictive analytics [16][17][18], which by nature requires a strong capability for dynamic adaptation given partial, already gained results. Within this trend, our claim is that the analytics must adapt (or, better, self-adapt) to what is happening, given of course some previously user-defined rules dictating which adaptations are relevant to trigger.

Work:

Several big data platforms geared towards real-time analytics have emerged recently: Spark Streaming [4], Twitter's Storm, and S4 [5]. These platforms allow one to define a program that eventually takes, after a compilation process, the form of a DAG (directed acyclic graph), but to our knowledge none allows adapting the program at runtime with respect to its functional/business nature. This is because languages such as StreamSQL, CQL, StreamIt, and IBM's SPL [7] are generally declarative. As a result, developers focus on expressing data processing logic, not on orchestrating runtime adaptations (i.e. in which situations functional adaptation should happen, how to monitor the adaptation needs, how to effectively modify the program without redeploying a new one from scratch). Some data workflow engines are starting to appear in the large-scale stream processing community, but without an adequate behavioral/functional adaptation capability, as the only focus so far has been adaptation to non-functional requirements dictated by scalability, fault tolerance and performance, through node replication or elasticity [6]. From our literature survey so far, only [8] sketches a solution for functional adaptivity in the stream processing languages of big data platforms, even if some earlier work on stream platforms and active databases is a good starting point [9].

As a result of several years of research in the Scale team, the Grid Component Model (GCM) [10] is a component model for applications running on distributed infrastructures, extending the Fractal component model. Zenith's research has also resulted in strong competences in component-oriented platforms featuring high dynamicity, such as SON [11], which the co-supervised PhD student could also take advantage of. Fractal defines a component model where components can be hierarchically organized, reconfigured and controlled, offering functional server interfaces and requiring client interfaces. GCM extends that model by allowing components to be remotely located, distributed, parallel, and deployed in a large-scale computing environment (cluster, grid, cloud, etc.), and by adding collective communications (multicast and gathercast interfaces). Autonomic capabilities can be expressed in the membranes of GCM components, which can drive their reconfiguration at the functional level in a totally autonomous manner [10]. If the DAG of a streaming application translates into a component-oriented program, then it can naturally benefit from the latter's intrinsic reconfiguration properties. This is one of the research questions expected to be addressed in the scope of this PhD: how to benefit from the autonomic, highly expressive, clear functional versus non-functional separation of concerns of the GCM component-oriented approach in order to support dynamic adaptation of the analytics that the streaming application implements.

Work plan:

Towards the global goal of designing a working solution for anticipatory big data analytics, the PhD work could be organized along the following guidelines:

  • Investigate the stream processing languages (mainly domain-specific languages) that can be used atop the GCM composition framework and extended to express adaptability. An approach developed in [15], relying on data flow analysis of the DSL code, could be reused to infer the GCM-based graph. Extend the GCM model and runtime to be stream-processing aware.
  • Implement the framework for writing adaptable stream processing workflows, relying ultimately on GCM and on SON (to handle the needed dynamic code deployment features).
  • Study existing and emerging stream processing platforms (mainly open source and supported by Apache) and select or extend the most appropriate one to plug the GCM-based dynamic reconfiguration of analytics and the new DSL into it; alternatively, and more ambitiously, if existing platforms cannot easily be extended, develop a complete GCM-based big data real-time analytics solution and make sure it can interact with existing big data providers (famous public social networks such as Twitter or Facebook, or dedicated social networks built with the Beepers technology).
  • Benchmark and test on relevant big data analytics use cases, extracting relevant information from social network contents (with semantic annotation) such as images, videos and text; applying, for instance, sentiment analysis to these contents to predict future situations [12], and accordingly adjusting the recommendations provided to the users of the social network [13].

Complementarity and perspectives

This research proposal is at the crossroads of middleware for large-scale platforms working on large data volumes, languages (including DSLs), data mining, and social networking. Françoise Baude from the SCALE team is an expert in runtimes and middleware for distributed languages, and a recent EU project, PLAY, allowed her to gain experience in situational awareness through complex event processing and supporting publish/subscribe platforms for semantically described web events [14]. Didier Parigot from Zenith is addressing DSLs, data flow models, and also middleware and databases [15] to support social networking, through the iLab collaboration with Beepers. Overall, the two teams share a common background, while having complementary assets applicable to emerging big data technologies in general. The two teams have here a first opportunity to collaborate. They also have a strong willingness, proved by past and current efforts, to transfer research results to industry, which has a rising interest in solutions supporting data analytics. This common PhD research stands as a nice catalyst and an opportunity to demonstrate the usability and relevance of previously developed platforms (ProActive/GCM and SON respectively, which will be used in complementary aspects) in this exciting emerging area of adaptable big data analytics.

Innovation potential:

Due to the applied nature of the research, there are obvious prospects of valorization as a "product" handled by existing or future SMEs of the Sophia-Antipolis ecosystem. Consequently, the candidate may apply for additional funding from an EIT ICT Labs Doctoral Training Center. Indeed, the Labex UCN@SOPHIA has been selected as the foundation for the Doctoral Training Center newly funded in Sophia-Antipolis from 2015. In this context, six additional months of PhD funding, an Innovation & Entrepreneurship curriculum, and opportunities to evaluate the project in other ICT Labs ecosystems are allocated. There are at least two natural ICT Labs collaborating ecosystems strongly engaged in big data analytics efforts, which consequently deserve our attention:

  • Technische Universität Berlin (TUB), which coordinates the Berlin Big Data Center (BBDC), a national center on Big Data recently established by the German Federal Ministry of Education and Research (BMBF). TUB established the Data Analytics Laboratory (DAL) in 2011 to serve as a focal point for innovative research. TUB is the birthplace of one of the leading open-source big data analytics platforms, Stratosphere (now Apache Flink), with an active worldwide user community. This is one of the possible systems that could be extended by the thesis work to provide adaptive analytics.
  • The University of Trento (UNITN), an associated partner of the EIT ICT Labs Trento CLC, as is the Politecnico di Milano. Both also offer an ecosystem active in big data analytics solutions. Polimi and its ecosystem benefit from the expertise of the researchers working in the Collaborative Innovation Center on Big Data (CIC), jointly developed with IBM.

References :

[1] Shivnath Babu and Jennifer Widom. 2001. Continuous queries over data streams. SIGMOD Rec. 30, 3 (September 2001), 109-120

[2] Scheuermann, Peter, and Goce Trajcevski. “Active Database Systems.” Wiley Encyclopedia of Computer Science and Engineering (2008).

[3] Eugene Wu, Yanlei Diao, and Shariq Rizvi. 2006. High-performance complex event processing over streams. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data (SIGMOD ’06)

[4] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized streams: fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13)

[5] Neumeyer, Leonardo, et al. “S4: Distributed stream computing platform.” Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010.

[6] Raphaël Barazzutti, Thomas Heinze, Andre Martin, Emanuel Onica, Pascal Felber, Christof Fetzer, Zbigniew Jerzak, Marcelo Pasin, Etienne Riviere: Elastic Scaling of a High-Throughput Content-Based Publish/Subscribe Engine. ICDCS 2014: 567-576

[7] Martin Hirzel, Henrique Andrade, Bugra Gedik, Vibhore Kumar, Giuliano Losa, Mark Mendell, Howard Nasgaard, Robert Soulé, Kun-Lung Wu. SPL Stream Processing Language Specification. IBM, 2009

[8] Gabriela Jacques-Silva, Bugra Gedik, Rohit Wagle, Kun-Lung Wu, Vibhore Kumar – Building User-defined Runtime Adaptation Routines for Stream Processing Applications – The Very Large Data Base Endowment Journal (VLDB), 2012

[9] Trajcevski, Goce, et al. “Evolving triggers for dynamic environments.” Advances in Database Technology-EDBT 2006. Springer Berlin Heidelberg, 2006. 1039-1048.

[10] F. Baude, L. Henrio, C. Ruz. Programming Distributed and Adaptable Autonomous Components – the GCM/ProActive Framework. Software: Practice and Experience, Wiley, in press, 2015

[11] Ayoub Ait Lahcen, Didier Parigot. “A Lightweight Middleware for Developing P2P Applications with Component and Service-Based Principles”. 2012 IEEE 15th International Conference on Computational Science and Engineering (CSE 2012)

[12] Asur, Sitaram, and Bernardo A. Huberman. “Predicting the future with social media.” Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on. Vol. 1. IEEE, 2010.

[13] Fady Draidi, Esther Pacitti, Didier Parigot, and Guillaume Verger. 2011. P2Prec: a social-based P2P recommendation system. In Proceedings of the 20th ACM international conference on Information and knowledge management (CIKM ’11)

[14] N. Stojanovic, R. Stühmer, F. Baude, P. Gibert. Tutorial: Where Event Processing Grand Challenge meets Real-time Web: PLAY Event Marketplace. DEBS ’12, the 6th ACM International Conference on Distributed Event-Based Systems, ACM, July 2012, p. 341-352

[15] Ayoub Ait Lahcen. Developing Component-Based Applications with a Data-Centric Approach and within a Service-Oriented P2P Architecture: Specification, Analysis and Middleware. PhD thesis co-supervised by D. Parigot, Dec 2012

[16] Boulos, Maged N. Kamel, et al. “Social Web mining and exploitation for serious applications: Technosocial Predictive Analytics and related technologies for public health, environmental and national security surveillance.” Computer Methods and Programs in Biomedicine 100.1 (2010): 16-23.

[17] Lozada, Brian A. “The Emerging Technology of Predictive Analytics: Implications for Homeland Security.” Information Security Journal: A Global Perspective 23.3 (2014): 118-122.

[18] Doyle, Andy, et al. “The EMBERS architecture for streaming predictive analytics.” Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 2014.

Permanent link to this article: https://team.inria.fr/zenith/predictive-big-data-analytics-continuous-and-dynamically-adaptable-data-request-and-processing/

IBC seminar: Marta Mattoso (UFRJ) “Exploratory Analysis of Raw Data Files through Dataflows”

IBC seminar, LIRMM data and knowledge pole, Zenith
Tuesday, March 17, 11h, Salle 1/124, Campus Saint-Priest – Bâtiment 5

Exploratory Analysis of Raw Data Files through Dataflows*
Marta Mattoso, COPPE/UFRJ, Rio de Janeiro

Abstract: Scientific applications generate raw data files at very large scale. Most of these files follow a standard format established by the application's domain, such as HDF5, NetCDF and FITS. These formats are supported by a variety of programming languages, libraries and programs. Because of their scale, analyzing these files requires writing specific programs. Scientific data analysis systems such as database management systems (DBMS) are not well suited because of time-consuming data loading, large-scale data transformation, and legacy code incompatible with a DBMS. Recently there have been several proposals for indexing and querying raw data files without the overhead of a DBMS. Systems like NoDB, SDS and RAW offer query support over a raw data file after a scientific program has generated it. However, these solutions focus on the analysis of one single large file. When a large number of files are all related and required for the evaluation of one scientific hypothesis, the relationships must be managed manually or by writing specific programs. In this talk we will discuss current approaches to raw data analysis and present our approach, which combines DBMS and raw data analysis. It takes advantage of the provenance database support of scientific workflow management systems (SWfMS). When scientific applications are managed by an SWfMS, the data being generated is registered in the provenance database, so this provenance data may act as a description of these files. When the SWfMS is dataflow-aware, it also registers selected data and pointers to domain data, all in the same database. The resulting database becomes an important access path to the large number of files generated by the scientific workflow execution, and is thus complementary to single raw data file analysis support.

*Joint work with Vitor Silva, Daniel Oliveira and Patrick Valduriez
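A minimal sketch of the idea (with an invented schema and file names, not the actual system's design) is a provenance table that records, for each workflow task, the raw file produced together with a few selected domain attributes, so that queries on the table locate the relevant files without opening them all:

```python
import sqlite3

# Toy provenance store: each workflow task records the raw file it produced
# together with selected domain attributes extracted during execution.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE provenance (task TEXT, file TEXT, max_pressure REAL)")
db.executemany("INSERT INTO provenance VALUES (?, ?, ?)", [
    ("solve_t1", "/data/out_001.h5", 101.3),
    ("solve_t2", "/data/out_002.h5", 250.8),
    ("solve_t3", "/data/out_003.h5", 99.1),
])

# Find which raw files are worth opening, without scanning them all
rows = db.execute(
    "SELECT file FROM provenance WHERE max_pressure > 150").fetchall()
print([r[0] for r in rows])  # -> ['/data/out_002.h5']
```

The provenance database thus serves as the access path the abstract describes: only the files selected by the query need to be read with an HDF5-style library.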

Permanent link to this article: https://team.inria.fr/zenith/seminar-exploratory-analysis-of-raw-data-files-through-dataflows-by-marta-mattoso-march-17/

Zenith seminar: Aleksandra Levchenko “Integrating Big Data and Relational Data with a Functional SQL-like Query Language”

Zenith seminar: Thursday, March 12, 11h, salle 1/124, bât. 5, Campus Saint-Priest

Integrating Big Data and Relational Data with a Functional SQL-like Query Language (*)
Aleksandra Levchenko
Zenith, Inria and LIRMM, University of Montpellier, France
Odessa National Polytechnic University, Ukraine

Abstract: Multistore systems have recently been proposed to provide integrated access to multiple, heterogeneous data stores through a single query engine. In particular, much attention is being paid to the integration of unstructured big data, typically stored in HDFS, with relational data. One main solution is to use a relational query engine that allows SQL-like queries to retrieve data from HDFS, which requires the system to provide a relational view of the unstructured data and hence is not always feasible. In this talk we introduce a functional SQL-like query language that can integrate data retrieved from different data stores and take full advantage of the functionality of the underlying data processing frameworks, by allowing the ad-hoc use of user-defined map/filter/reduce operators in combination with traditional SQL statements. Furthermore, the query language allows for optimization by enabling subquery rewriting, so that filter conditions can be pushed inside and executed at the data store as early as possible. Our approach is validated with two data stores and representative queries that demonstrate the usability of the query language and evaluate the benefits of query optimization.

(*) Joint work with Carlyna Bondiombouy, Boyan Kolev and Patrick Valduriez
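The filter-pushdown optimization mentioned in the abstract can be sketched with a toy simulation (invented data, not the actual system): the rewritten plan ships the condition to the data store, so only matching rows are transferred to the query engine:

```python
# A simulated data store of 100 rows.
store = [{"id": i, "score": i * 10} for i in range(100)]

def scan(predicate=None):
    """Simulated data-store scan; `predicate` runs inside the store."""
    return [row for row in store if predicate is None or predicate(row)]

# Naive plan: retrieve everything, then filter in the query engine.
naive = [row for row in scan() if row["score"] > 950]

# Rewritten plan: the same condition pushed down into the store.
pushed = scan(lambda row: row["score"] > 950)

assert naive == pushed          # same answer either way
print(len(store), len(pushed))  # 100 rows in the store, 4 transferred
```

The answer is identical, but the pushed-down plan moves only 4 rows instead of 100 across the engine/store boundary, which is the benefit the query rewriting targets.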

Permanent link to this article: https://team.inria.fr/zenith/zenith-working-seminar-aleksandra-levchenko-thrusday-12-march-11h/