11.00, room 445, PCRI
Abstract
Highly heterogeneous data have proliferated over the last decade, largely because they are produced in a distributed fashion: corporations of any size, individual users, and automatic extraction tools all contribute a constantly increasing volume of heterogeneous and noisy information. Entity Resolution (ER) helps to reduce the corresponding entropy by identifying those pieces of information that refer to the same real-world objects.
Typically, blocking techniques are used to scale ER to large volumes of data. However, most of these techniques rely on schema information and are inapplicable to highly heterogeneous settings. Our work goes beyond existing blocking techniques by introducing a novel methodology designed specifically for voluminous, highly heterogeneous, and noisy data collections.
At the core of our approach lie three independent but complementary steps: block-building (using redundant block assignments for effectiveness), meta-blocking (reducing the number of necessary blocks), and block processing (increasing efficiency of ER operations). Our experimental evaluation with three large-scale, real-world datasets demonstrates that our methodology can successfully handle very large and highly heterogeneous datasets, achieving an excellent balance between effectiveness and efficiency.
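To make the idea of schema-agnostic block-building with redundant assignments more concrete, here is a minimal Python sketch. It assumes simple token blocking, where every token that appears in any attribute value becomes a block key; the abstract does not say which block-building technique the talk uses, so this is an illustration and not the speaker's method. The function names `build_blocks` and `candidate_pairs` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(entities):
    """Map each token found in any attribute value to the ids of the
    entities that contain it. Attribute names are ignored, so no schema
    knowledge is needed (assumed token-blocking scheme)."""
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                # Redundancy: one entity lands in one block per token.
                blocks[token].add(entity_id)
    # A block holding a single entity yields no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect each distinct pair once, even when it co-occurs in
    several blocks."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records with different attribute names (heterogeneous schemas).
entities = {
    "e1": {"name": "Jane Smith", "city": "Trento"},
    "e2": {"fullName": "smith jane", "location": "Trento Italy"},
    "e3": {"title": "John Doe", "town": "Toronto"},
}

blocks = build_blocks(entities)
pairs = candidate_pairs(blocks)
print(sorted(blocks))
print(sorted(pairs))
```

In this toy input, e1 and e2 share several blocks although their attribute names differ, while e3 is never compared with either. Avoiding such repeated co-occurrences and pruning unpromising blocks is the kind of work the abstract attributes to the later steps; the sketch only deduplicates pairs and does not attempt meta-blocking.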
Short bio
Themis Palpanas is a professor of computer science at the University of Trento, Italy. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. Before joining the University of Trento, he worked at the IBM T.J. Watson Research Center. He has also been a Visiting Professor at the National University of Singapore, worked for the University of California, Riverside, and visited Microsoft Research and the IBM Almaden Research Center. His research solutions have been implemented in world-leading commercial data management products and he is the author of eight US patents, three of which are part of commercial products in multi-billion dollar markets. He is the recipient of three Best Paper awards. He has been a member of the IBM Academy of Technology Study on Event Processing, and is a founding member of the Event Processing Technical Society. He is General Chair for VLDB 2013, has served on the program committees of several top database and data mining conferences, and also serves as a reviewer for the European Commission Framework Programme, the Natural Sciences and Engineering Research Council of Canada (NSERC), the Netherlands Organisation for Scientific Research (NWO), and the Qatar National Research Fund (QNRF).