Big Data Integration by Xin Luna Dong Divesh Srivastava

Author: Xin Luna Dong, Divesh Srivastava
Language: eng
Format: epub, pdf
Tags: Publisher: Morgan & Claypool Publishers, Compositor: Windfall Software
Published: 2015-03-04T10:36:57+00:00


3. RECORD LINKAGE

number of (non-matching) record pairs are pruned, and hence not compared in a pairwise

fashion.

2. Edge-centric pruning typically outperforms node-centric pruning in efficiency, discarding

more superfluous pairwise comparisons, while maintaining a high recall when the fraction of

matching record pairs is expected to be low. In this case, the high weighted edges are more

likely to correspond to the matching record pairs.

3. Weight-based pruning typically achieves a better recall than cardinality-based pruning.

Depending on the threshold, the latter can be much more efficient than the former, but this is

achieved with a moderate loss in recall.

4. Among the proposed edge weighting schemes, ARCS consistently achieves the highest

performance. This is because ARCS downweights the co-occurrence of records in high

cardinality blocks, which is analogous to the use of IDF (inverse document frequency) in

document search.
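
The intuition in item 4 can be sketched in a few lines of Python. This is a minimal illustration, not the scheme as specified in the original work: it assumes that ARCS weights an edge by summing, over the blocks two records share, the reciprocal of the number of pairwise comparisons in each block, so that co-occurrence in a large block contributes little. The function names, the toy blocks, and the choice of the mean edge weight as the pruning threshold are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def arcs_weights(blocks):
    """Weight each record pair by summing 1 / (comparisons in block)
    over every block the two records share (assumed ARCS form)."""
    weights = defaultdict(float)
    for records in blocks.values():
        members = sorted(set(records))
        n_pairs = len(members) * (len(members) - 1) // 2
        if n_pairs == 0:
            continue  # a singleton block yields no comparisons
        for pair in combinations(members, 2):
            # Large blocks contribute little, much like IDF for common terms.
            weights[pair] += 1.0 / n_pairs
    return dict(weights)


def weight_based_edge_pruning(weights, threshold=None):
    """Keep edges whose weight reaches the threshold.
    Defaulting to the mean edge weight is an assumption of this sketch."""
    if not weights:
        return set()
    if threshold is None:
        threshold = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= threshold}


# Hypothetical blocks: b2 is a large block, b1 and b3 are small ones.
blocks = {
    "b1": ["r1", "r2"],
    "b2": ["r1", "r2", "r3", "r4"],
    "b3": ["r3", "r4"],
}
weights = arcs_weights(blocks)
kept = weight_based_edge_pruning(weights)
print(sorted(kept))
```

In this toy input, the pairs that share a small block receive a much higher weight than the pairs that co-occur only in the large block b2, so only the former survive pruning.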

3.3

ADDRESSING THE VELOCITY CHALLENGE

In the big data era, many of the data sources are very dynamic and the number of data sources is

also rapidly exploding. This high velocity of data updates can quickly make previous linkage results

obsolete. Since it is expensive to perform batch record linkage each time there is a data update, it

would be ideal to perform incremental record linkage, to be able to quickly update existing linkage

results when data updates arrive.

3.3.1

INCREMENTAL RECORD LINKAGE

While there has been a significant body of work on record linkage in the literature over the past few

decades, incremental record linkage has started to receive attention only in recent years [Whang and

Garcia-Molina 2010, Whang and Garcia-Molina 2014, Gruenheid et al. 2014].

The main focus of the works by Whang and Garcia-Molina [2010] and Whang and Garcia-Molina

[2014] is the evolution of pairwise matching rules over time. Whang and Garcia-Molina

[2014] briefly discuss the case of evolving data, and identify a general incremental condition under which incremental record linkage can be easily performed using the batch linkage method.

Gruenheid et al. [2014] address the general case where the batch linkage algorithm may not be general incremental, and propose incremental techniques that explore the trade-offs between quality of

the linkage results and efficiency of the incremental algorithms.

Challenges for Incremental Linkage

Recall that record linkage computes a partitioning P of the input records R, such that each partition

in P identifies the records in R that refer to the same entity.

A natural approach to incremental linkage is to compare each inserted record with the

existing clusters, and then either put it into an existing cluster (i.e., it refers to an already known

entity) or create a new cluster for it (i.e., it refers to a new entity). However, linkage algorithms

can make mistakes and the extra information from the data updates can often help identify and fix

such mistakes, as illustrated next with an example.
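
The naive strategy described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the function names, the token-based Jaccard similarity, the rule of scoring a cluster by its most similar member, and the threshold value are all hypothetical choices, and the toy records are not taken from Table 3.4.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed similarity function)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def naive_incremental_link(clusters, new_record, similarity, threshold=0.8):
    """Add new_record to the most similar existing cluster if the
    similarity reaches the threshold; otherwise start a new cluster.
    Existing clusters are never split or merged, which is exactly the
    limitation noted in the text."""
    best_idx, best_score = None, threshold
    for idx, cluster in enumerate(clusters):
        if not cluster:
            continue
        # Score a cluster by its most similar member (an assumption).
        score = max(similarity(new_record, r) for r in cluster)
        if score >= best_score:
            best_idx, best_score = idx, score
    if best_idx is None:
        clusters.append([new_record])
    else:
        clusters[best_idx].append(new_record)
    return clusters


# Toy records arriving one at a time.
clusters = []
for rec in ["alpha beta gamma", "alpha beta delta", "omega sigma"]:
    naive_incremental_link(clusters, rec, jaccard, threshold=0.5)
print(clusters)
```

Because each record is only ever appended to a cluster or placed in a new one, an early wrong decision can never be revisited, even when later records provide evidence against it.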

Example 3.12 Table 3.4 shows the records from the Flights domain, organized according to the order in which the data updates arrived, where Flights0 is the initial set of records, and Flights1 and

Flights2 are two updates.

Assume that the initial set of records Flights0 consists of seven records: r213, r214, r215, r224, r231, r232, and r233. Figure 3.9 illustrates the pairwise matching graph obtained by applying the same pairwise similarity as in Example 3.


