Big Data Integration by Xin Luna Dong Divesh Srivastava
Author: Xin Luna Dong, Divesh Srivastava
Language: eng
Format: epub, pdf
Tags: Publisher: Morgan & Claypool Publishers, Compositor: Windfall Software
Published: 2015-03-04T10:36:57+00:00
3. RECORD LINKAGE
number of (non-matching) record pairs are pruned, and hence not compared in a pairwise
fashion.
2. Edge-centric pruning typically outperforms node-centric pruning in efficiency, discarding
more superfluous pairwise comparisons, while maintaining a high recall when the fraction of
matching record pairs is expected to be low. In this case, the highly weighted edges are more
likely to correspond to the matching record pairs.
3. Weight-based pruning typically achieves a better recall than cardinality-based pruning.
Depending on the threshold, the latter can be much more efficient than the former, but this is
achieved with a moderate loss in recall.
4. Among the proposed edge weighting schemes, ARCS consistently achieves the highest
performance. This is because ARCS downweights the co-occurrence of records in high
cardinality blocks, which is analogous to the use of IDF (inverse document frequency) in
document search.
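The weighting and pruning ideas in the list above can be illustrated with a minimal sketch. It assumes that the ARCS weight of a record pair is the sum, over the blocks the two records share, of the reciprocal of each block's cardinality, with cardinality taken to be the number of pairwise comparisons in the block. The function names, the toy blocks, the threshold, and the value of k are all hypothetical, and the two pruning helpers are simplified global (edge-centric) variants rather than the exact algorithms evaluated in the literature.

```python
from collections import defaultdict
from itertools import combinations


def arcs_weights(blocks):
    """Return {(r1, r2): weight} for every record pair that shares a block.

    Assumption: each shared block contributes 1 / (number of pairwise
    comparisons in that block), so co-occurrence in a large block counts
    for less than co-occurrence in a small one (similar in spirit to IDF).
    """
    weights = defaultdict(float)
    for records in blocks.values():
        members = sorted(set(records))
        comparisons = len(members) * (len(members) - 1) // 2
        if comparisons == 0:
            continue  # a singleton block induces no comparisons
        for pair in combinations(members, 2):
            weights[pair] += 1.0 / comparisons
    return dict(weights)


def prune_by_weight(weights, threshold):
    """Weight-based pruning: keep every edge whose weight reaches the threshold."""
    return {pair for pair, weight in weights.items() if weight >= threshold}


def prune_by_cardinality(weights, k):
    """Cardinality-based pruning: keep only the k highest-weighted edges."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return {pair for pair, _ in ranked[:k]}


# Hypothetical blocking output: block id -> record ids placed in that block.
blocks = {
    "small_1": ["a", "b"],
    "small_2": ["a", "b"],
    "large": ["a", "b", "c", "d"],
}

weights = arcs_weights(blocks)
kept_by_weight = prune_by_weight(weights, threshold=1.0)
kept_by_cardinality = prune_by_cardinality(weights, k=2)
```

In this toy input, the pair (a, b) shares two small blocks and one large block, so it receives a much higher weight than pairs that co-occur only in the large block. Weight-based pruning retains a variable number of edges depending on the threshold, whereas cardinality-based pruning retains a fixed number regardless of how many edges carry a high weight.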
3.3 ADDRESSING THE VELOCITY CHALLENGE
In the big data era, many of the data sources are very dynamic and the number of data sources is
also rapidly exploding. This high velocity of data updates can quickly make previous linkage results
obsolete. Since it is expensive to perform batch record linkage each time there is a data update, it
would be ideal to perform incremental record linkage, to be able to quickly update existing linkage
results when data updates arrive.
3.3.1 INCREMENTAL RECORD LINKAGE
While there has been a significant body of work on record linkage in the literature over the past few
decades, incremental record linkage has started to receive attention only in recent years [Whang and
Garcia-Molina 2010, Whang and Garcia-Molina 2014, Gruenheid et al. 2014].
The main focus of the works by Whang and Garcia-Molina [2010] and Whang and Garcia-Molina
[2014] is the evolution of pairwise matching rules over time. Whang and Garcia-Molina
[2014] briefly discuss the case of evolving data, and identify a general incremental condition
under which incremental record linkage can be easily performed using the batch linkage method.
Gruenheid et al. [2014] address the general case where the batch linkage algorithm may not be
general incremental, and propose incremental techniques that explore the trade-offs between the
quality of the linkage results and the efficiency of the incremental algorithms.
Challenges for Incremental Linkage
Recall that record linkage computes a partitioning P of the input records R, such that each partition
in P identifies the records in R that refer to the same entity.
A natural approach to incremental linkage is to compare each inserted record with the
existing clusters, and then either put it into an existing cluster (i.e., it refers to an already
known entity) or create a new cluster for it (i.e., it refers to a new entity). However, linkage
algorithms can make mistakes, and the extra information from the data updates can often help
identify and fix such mistakes, as illustrated next with an example.
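The natural approach described above can be sketched as follows. This is only an illustration under stated assumptions: the token-based Jaccard similarity, the threshold of 0.5, the rule of scoring a cluster by its most similar member, and the sample records are all hypothetical and are not taken from Table 3.4.

```python
def jaccard(a, b):
    """Hypothetical pairwise similarity: Jaccard overlap of lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def naive_incremental_insert(clusters, record, similarity, threshold):
    """Add one record to the best-matching cluster, or open a new cluster.

    Each existing cluster (assumed non-empty) is scored by its most similar
    member. Earlier clustering decisions are never revisited.
    """
    best_index, best_score = None, None
    for index, cluster in enumerate(clusters):
        score = max(similarity(record, member) for member in cluster)
        if score >= threshold and (best_score is None or score > best_score):
            best_index, best_score = index, score
    if best_index is None:
        clusters.append([record])  # treated as a new entity
    else:
        clusters[best_index].append(record)  # treated as a known entity
    return clusters


# Hypothetical existing linkage result and two inserted records.
clusters = [["AA 49 EWR SFO"], ["UA 53 JFK LAX"]]
naive_incremental_insert(clusters, "AA 49 EWR SFO 18:05", jaccard, 0.5)
naive_incremental_insert(clusters, "DL 12 BOS SEA", jaccard, 0.5)
```

Because the sketch only ever appends to a cluster or creates a new one, it cannot split or merge existing clusters, so any mistake in the earlier linkage result stays in place no matter what evidence the new records bring.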
Example 3.12. Table 3.4 shows the records from the Flights domain, organized according to the
order in which the data updates arrived, where Flights0 is the initial set of records, and
Flights1 and Flights2 are two updates.
Assume that the initial set of records Flights0 consists of seven records: r213, r214, r215, r224, r231, r232, and r233. Figure 3.9 illustrates the pairwise matching graph obtained by applying the same pairwise similarity as in Example 3.
