An Efficient Duplicate Record Detection Using q-Grams Array Inverted Index

FERRO, Alfredo; PULVIRENTI, Alfredo
2010-01-01

Abstract

Duplicate record detection is a crucial task in the data cleaning process of data warehouse systems. Many approaches have been proposed to address this problem: some focus on the accuracy of the detected duplicates, others on the efficiency of the comparison process. Following the first direction, we introduce two similarity functions based on the concept of q-grams that improve the accuracy of duplicate detection with respect to other well-known measures. We also reduce the number and the running time of record comparisons by building an inverted index on a sorted list of q-grams, called the q-grams array. We then extend this approach to perform clustering based on the proposed q-grams array. Finally, an experimental analysis on synthetic and real data shows the efficiency of the novel indexing method for both record comparison and clustering.
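As a rough illustration of the general idea of a q-gram inverted index, the following minimal sketch builds a sorted mapping from q-grams to record ids and compares only records that share at least one q-gram. It is not the authors' method: the abstract does not define the two proposed similarity functions or the layout of the q-grams array, so plain Jaccard similarity on q-gram sets is swapped in here, and the names `qgrams`, `build_index`, `jaccard`, and `find_duplicates`, the `#` padding character, and the `threshold` value are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of overlapping q-grams of a lowercased, '#'-padded string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(records, q=2):
    """Map each q-gram to the list of record ids containing it, keyed in sorted order."""
    index = defaultdict(list)
    for rid, text in enumerate(records):
        for gram in qgrams(text, q):
            index[gram].append(rid)
    return dict(sorted(index.items()))


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed stand-in for the paper's measures)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def find_duplicates(records, q=2, threshold=0.5):
    """Return pairs of record ids whose q-gram sets are at least `threshold` similar."""
    index = build_index(records, q)
    # Only records that share a q-gram become candidates, which avoids
    # comparing every record against every other record.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(ids, 2))
    grams = [qgrams(r, q) for r in records]
    return sorted(
        (i, j) for i, j in candidates
        if jaccard(grams[i], grams[j]) >= threshold
    )
```

For example, `find_duplicates(["John Smith", "Jon Smith", "Mary Jones"])` flags only the first two records as a likely duplicate pair under these assumed settings.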
2010
978-3-642-15104-0

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/71399