Lunchtime Seminar - Ad hoc Integration for Big Data – Top-k Entity Augmentation Using Web Tables
Abstract:
In the era of Big Data, the number and variety of data sources are increasing every day. However, not all of this new data is available in well-structured databases or data warehouses. Instead, heterogeneous collections of individual datasets, such as data lakes, are becoming more prevalent. This new wealth of data, though not integrated, has enormous potential for generating value in ad-hoc analysis processes, which are becoming increasingly common as data management practices grow more agile. However, today’s database management systems lack support for ad-hoc integration of such heterogeneous data sources.
In this talk, we introduce the so-called entity augmentation problem, which aims at extending a given set of entities with an additional, user-requested attribute that is not yet defined for them. We propose a novel algorithm for producing consistent integration results from a large corpus of data sources and provide an overview of REA, our Web table retrieval and matching system, which implements the proposed conceptual solution. Finally, we give an overview of our Dresden Web Table Corpus (DWTC), which consists of 125 million Web tables that we extracted from a public Web crawl and made available to the research community.
Speaker:
Wolfgang Lehner is a full professor and head of the database technology group at TU Dresden, Germany. His research is dedicated to database system architecture, specifically looking at crosscutting aspects from information fusion algorithms down to hardware-related aspects in main-memory centric settings. He is part of TU Dresden's excellence cluster, with research topics in energy-aware scheduling, resilient data structures on unreliable hardware, and orchestration of wildly heterogeneous systems. He is also a principal investigator with the DFG-funded CRC HAEC as well as Germany's national "Competence Center for Scalable Data Services and Solutions" (ScaDS). Wolfgang also maintains a close research relationship with the SAP HANA development team in Walldorf, Seoul, and Waterloo. He serves the community on many program committees, is an elected member of the VLDB Endowment, serves on the review board of the German Research Foundation (DFG), and is an appointed member of the Academy of Europe.
More info at: http://wwwdb.inf.tu-dresden.de/lehner