SIGMOD 2013: Fact checking and analyzing the Web

Fact checking and analyzing the Web
by François Goasdoué, Konstantinos Karanasos, Yannis Katsis, Julien Leblay, Ioana Manolescu, Stamatis Zampetakis
Demonstration in SIGMOD 2013

Permanent link to this article: https://team.inria.fr/oak/2013/02/07/sigmod-2013-fact-checking-and-analyzing-the-web/

Katerina Tzompanaki: Design and Implementation of a tool for formulating recall-oriented structured queries on semantic networks

14.00, room 455, PCRI

Abstract
In recent years there has been a trend towards the creation of massive metadata repositories, usually based on RDF/S technology, as in the domain of cultural heritage. ISO 21127 (the CIDOC Conceptual Reference Model) is a rich conceptual model (or ontology) capable of describing the world stored in such repositories. Simpler models, like those consisting only of "core metadata" as in Dublin Core, lack the expressiveness and the potential to integrate the knowledge and to apply reasoning on it. Nevertheless, the more complex structure complicates information searching: declarative SPARQL query formulation becomes harder for the user due to the large number of ontology classes and properties, while on the other hand keyword search does not take advantage of the information structure.

To address this problem, we suggest a new approach: we introduce a simpler model consisting of a few fundamental classes and relationships, intended for querying purposes only. Information search with this model is easier and more intuitive for users, since its size and structure resemble those of the core metadata. Additionally, this model provides high recall rates because the fundamental relationships capture all potential paths over the CIDOC-CRM and also include property propagation through these paths. The latter, however, introduces a statistical factor that may degrade precision, since a property is not necessarily propagated along a path. Precision can be improved by creating specializations of the fundamental relationships or by adding more constraints to the queries.

To define the paths over the CIDOC-CRM schema that correspond to each fundamental relationship, we have created a "paths' language" which is designed to be easy to write and to be understood by non-expert users. We then built a tool that uses this language, permits the editing and validation of the fundamental relationships and their translation to SPARQL, and provides additional supporting functions.
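As an illustration only, the translation of a fundamental relationship into SPARQL could be sketched as follows. The abstract does not describe the syntax of the paths' language, so the Step representation, the function names, the example relationship and the choice of CIDOC-CRM properties and namespace below are all assumptions; the sketch relies only on SPARQL 1.1 property paths (sequence "/", alternative "|", repetition "*").

```python
from dataclasses import dataclass

# Assumed namespace for the CIDOC-CRM RDFS encoding (illustrative).
CRM = "http://www.cidoc-crm.org/cidoc-crm/"


@dataclass(frozen=True)
class Step:
    """One hop of a path; hypothetical stand-in for the paths' language."""
    prop: str             # local name of a CIDOC-CRM property
    repeat: bool = False  # True means "zero or more times"


def path_to_sparql(steps):
    """Render one path as a SPARQL 1.1 property-path sequence."""
    return "/".join(f"crm:{s.prop}{'*' if s.repeat else ''}" for s in steps)


def relationship_to_query(paths):
    """A fundamental relationship is taken here as a union of alternative paths."""
    alternatives = "|".join(f"({path_to_sparql(p)})" for p in paths)
    return (
        f"PREFIX crm: <{CRM}>\n"
        f"SELECT ?s ?o WHERE {{ ?s {alternatives} ?o }}"
    )


# Hypothetical "thing is located in place" relationship: follow parts of the
# thing, then its location, then the enclosing places.
located_in = [
    [
        Step("P46_is_composed_of", repeat=True),
        Step("P53_has_former_or_current_location"),
        Step("P89_falls_within", repeat=True),
    ],
    [Step("P53_has_former_or_current_location")],
]

if __name__ == "__main__":
    print(relationship_to_query(located_in))
```

Keeping each relationship as a list of alternative paths mirrors the idea that one simple relationship stands for many concrete paths in the underlying schema, which is where the recall gain described above would come from.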

The proposed approach proved adequate for expressing real research queries originating from scientists in the domain of cultural heritage who were independent of this work. The results of queries performed on repositories consisting of real metadata were encouraging, reaching up to 100% recall when the repository's information was well structured. Moreover, we have shown that using combined fundamental relationships (FRs) in a query can improve precision.

Permanent link to this article: https://team.inria.fr/oak/2013/02/01/katerina-tzompanaki-design-and-implementation-of-a-tool-for-formulating-recall-oriented-structured-queries-on-semantic-networks/

OAKSaD associated team with UCSD

OAKSaD has been accepted as an international associated team between OAK and the database group of UCSD (A. Deutsch, Y. Papakonstantinou).

Congrats and lots of success!

Permanent link to this article: https://team.inria.fr/oak/2013/01/16/oaksad/

Oak is an Inria project

Oak has been formally approved by Inria as a project (having been a team since April 2012).

Congrats and lots of success!

Permanent link to this article: https://team.inria.fr/oak/2013/01/10/oak-is-a-project/

EDBT 2013: Processing XML Queries and Updates on Map/Reduce Clusters

Processing XML Queries and Updates on Map/Reduce Clusters
by Nicole Bidoit, Dario Colazzo, Noor Malla, Maurizio Nolé, Carlo Sartiani and Federico Ulliana
Demonstration in EDBT 2013

Permanent link to this article: https://team.inria.fr/oak/2012/12/26/edbt-2013-processing-xml-queries-and-updates-on-mapreduce-clusters/

EDBT 2013: Web Data Indexing in the Cloud: Efficiency and Cost Reductions

Web Data Indexing in the Cloud: Efficiency and Cost Reductions
by Jesús Camacho-Rodríguez, Dario Colazzo and Ioana Manolescu
in EDBT 2013

Permanent link to this article: https://team.inria.fr/oak/2012/12/21/edbt-2013-web-data-indexing-in-the-cloud-efficiency-and-cost-reductions/

EDBT 2013: Efficient Query Answering against Dynamic RDF Databases

Efficient Query Answering against Dynamic RDF Databases
by François Goasdoué, Ioana Manolescu and Alexandra Roatiş
in EDBT 2013

Permanent link to this article: https://team.inria.fr/oak/2012/12/21/edbt-2013-efficient-query-answering-against-dynamic-rdf-databases/

Yanlei Diao: Scalable, Low-Latency Data Analytics and its Applications

14.00, room 445, PCRI

Abstract
An integral part of many data-intensive applications is the need to collect and analyze enormous data sets, such as click streams, search logs, and sensor streams, to derive answers and insights with low latencies. Concurrently, new programming models and architectures have been developed for large-scale cluster computing, exemplified by recent MapReduce systems. However, these systems are designed for batch processing and require the data set to be fully loaded into the cluster before analytical queries can run, hence causing high delays in query answers.

In this talk, I present the design of a scalable, low-latency analytics platform, called Scalla, that fundamentally transforms the existing cluster computing paradigm into an incremental parallel processing paradigm, which provides the combined benefits of massive parallelism, incremental answers, and I/O efficiency. Our technical contributions include replacing an existing popular mechanism for partitioned parallelism with a purely hash-based mechanism and using dynamic frequency analysis to offer in-memory processing for most of the data. In this talk, I will also examine two application scenarios, click stream analysis, which has been used in our evaluation, and genomic data analysis, which is a new project that leverages Scalla for massive-scale genomic data processing and analysis.
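The combination of hash-based partitioning and incremental answers can be pictured with a small sketch. This is not Scalla's implementation: the class and function names are hypothetical, and the per-key click count is just an assumed example of an aggregate that can be updated as each record arrives rather than after a full load.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Route a key to a partition with a stable hash (no sorting involved)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class IncrementalCounter:
    """Per-key counts kept in memory and updated one record at a time."""

    def __init__(self, num_partitions):
        self.partitions = [Counter() for _ in range(num_partitions)]

    def ingest(self, key):
        # Each record updates the state of exactly one partition.
        self.partitions[partition_of(key, len(self.partitions))][key] += 1

    def count(self, key):
        # An answer is available at any time, not only after a full batch.
        return self.partitions[partition_of(key, len(self.partitions))][key]


if __name__ == "__main__":
    counter = IncrementalCounter(num_partitions=4)
    for url in ["/home", "/search", "/home", "/cart", "/home"]:
        counter.ingest(url)
        print(url, counter.count(url))
```

Because every key always hashes to the same partition, partitions can be updated independently and queried while data is still arriving.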

Short bio
Yanlei Diao is an Associate Professor of Computer Science at the University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on large-scale data analysis, data streams, uncertain data management, and flash memory databases. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in 1998.

Yanlei Diao was a recipient of the NSF Career Award and the IBM Scalable Innovation Faculty Award, and was a finalist of the Microsoft Research New Faculty Fellowship. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin. Her PhD dissertation “Query Processing for Large-Scale XML Message Brokering” won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention. She is an associate editor of PVLDB 2013 and has served on the organizing committees of SIGMOD, CIDR, DMSN, the New Researcher Symposium, and the New England Database Summit. She has served on program committees of numerous international conferences and workshops.

Permanent link to this article: https://team.inria.fr/oak/2012/12/20/yanlei-diao-scalable-low-latency-data-analytics-and-its-applications/

Themis Palpanas: Entity Resolution for Big Data

11.00, room 445, PCRI

Abstract
Highly heterogeneous data have boomed during the last decade, due to their largely distributed way of production: corporations of any size, individual users as well as automatic extraction tools have contributed a constantly increasing volume of heterogeneous and noisy information. Entity Resolution (ER) helps to reduce the corresponding entropy by identifying those pieces of information that refer to the same real-world objects.

Typically, blocking techniques are used to scale ER to large volumes of data. However, most of these techniques rely on schema information and are inapplicable to highly heterogeneous settings. Our work goes beyond existing blocking techniques, by introducing a novel methodology that is inherently crafted for voluminous, highly heterogeneous, and noisy data collections.

At the core of our approach lie three independent, but complementary steps: block-building (using redundant block assignments for effectiveness), meta-blocking (reducing the number of necessary blocks), and block processing (increasing efficiency of ER operations). Our experimental evaluation with three large-scale, real-world data sets demonstrates that our methodology can successfully handle very large and highly heterogeneous datasets, achieving an excellent balance between effectiveness and efficiency.
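To make the block-building step concrete, here is a minimal sketch of schema-agnostic token blocking, offered as an assumption about what such a step could look like rather than as the method presented in the talk. Every token of every attribute value defines a block, so an entity is placed redundantly in several blocks and no schema alignment is needed. The function names and the sample entities are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(entities):
    """Map each token to the set of entity ids whose values contain it."""
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():  # attribute names are ignored
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # A block with a single entity yields no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Distinct pairs of entities that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    entities = {
        "e1": {"name": "Themis Palpanas"},
        "e2": {"fullName": "T. Palpanas", "city": "Trento"},
        "e3": {"title": "Oak team"},
    }
    print(candidate_pairs(build_blocks(entities)))
```

Only the pairs returned by candidate_pairs would then be compared in detail, instead of all pairs of entities.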

Short bio
Themis Palpanas is a professor of computer science at the University of Trento, Italy. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. Before joining the University of Trento, he worked at the IBM T.J. Watson Research Center. He has also been a Visiting Professor at the National University of Singapore, worked for the University of California, Riverside, and visited Microsoft Research and the IBM Almaden Research Center. His research solutions have been implemented in world-leading commercial data management products and he is the author of eight US patents, three of which are part of commercial products in multi-billion dollar markets. He is the recipient of three Best Paper awards. He has been a member of the IBM Academy of Technology Study on Event Processing, and is a founding member of the Event Processing Technical Society. He is General Chair for VLDB 2013, has served on the program committees of several top database and data mining conferences, and also serves as a reviewer for the European Commission Framework Programme, the Natural Sciences and Engineering Research Council of Canada (NSERC), the Netherlands Organisation for Scientific Research (NWO), and the Qatar National Research Fund (QNRF).

Slides
Themis Palpanas – Entity Resolution for Big Data

Permanent link to this article: https://team.inria.fr/oak/2012/12/19/themis-palpanas-entity-resolution-for-big-data/

Vassilis Christophides: Continuous Queries over Text Streams

10.30, room 445, PCRI

Abstract
Web 2.0 technologies have transformed the Web from a publishing-only environment into a vibrant information place where yesterday’s end users have become today’s content generators. The vast amounts of user-generated content available in various social media (Facebook, Twitter, blogs, discussion forums), in conjunction with traditional information producers (e.g., newspapers, television, radio), pose new challenges in achieving effective, near real-time information awareness. In this talk we present recent results on helping users state their interests as continuous textual queries, which are matched on the fly against incoming information items originating from different sources. In particular, we will present three complementary problems related to effective and efficient continuous filtering systems, namely (a) scalable indices for textual subscriptions in a Pub/Sub system, (b) top-k continuous query evaluation with scoring functions, and (c) multi-query optimization for continuous textual mashups. The advocated solutions aim to meet the requirements for processing very large volumes of information of varying quality published at different rates.
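Problem (a) can be illustrated with a small sketch of an inverted index over subscriptions. It is an assumed, simplified design and not the index presented in the talk: subscriptions are taken to be conjunctions of keywords, and an incoming item matches a subscription when it contains all of its terms. The class name, method names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


class SubscriptionIndex:
    """Inverted index from terms to the subscriptions that require them."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> subscription ids
        self.sizes = {}                   # subscription id -> number of terms

    def subscribe(self, sub_id, terms):
        terms = {t.lower() for t in terms}
        self.sizes[sub_id] = len(terms)
        for term in terms:
            self.postings[term].add(sub_id)

    def match(self, text):
        """Return ids of subscriptions whose terms all occur in the item."""
        hits = Counter()
        for term in set(text.lower().split()):
            for sub_id in self.postings.get(term, ()):
                hits[sub_id] += 1
        return sorted(s for s, n in hits.items() if n == self.sizes[s])


if __name__ == "__main__":
    index = SubscriptionIndex()
    index.subscribe("q1", ["sigmod", "demo"])
    index.subscribe("q2", ["rdf"])
    print(index.match("SIGMOD demo on fact checking"))
```

Indexing the subscriptions rather than the items means that each incoming item only touches the subscriptions sharing at least one of its terms, instead of being tested against every registered query.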

Short bio
Vassilis Christophides is a Professor at the Computer Science Department (CSD) of the University of Crete. His main research interests include Databases and Web Information Systems, Digital Libraries and Scientific Systems. He has published over 100 articles in high-quality international conferences, workshops and journals. He has received the 2004 SIGMOD Test of Time Award and the Best Paper Award at the 2nd and 6th International Semantic Web Conferences in 2003 and 2007.

Slides
Vassilis Christophides – Continuous Queries over Text Streams

Permanent link to this article: https://team.inria.fr/oak/2012/12/18/vassilis-christophides-continuous-queries-over-text-streams/