Incorporating clustering into set similarity join algorithms: The SjClust framework

Ribeiro, Leonardo Andrade
•
CUZZOCREA, Alfredo Massimiliano
•
Bezerra, Karen Aline Alves
•
do Nascimento, Ben Hur Bahia
2016
  • conference object

Abstract
Data cleaning and integration rely on duplicate record identification, which aims to detect records that represent the same real-world entity. Similarity join is widely used to detect pairs of similar records, in combination with a subsequent clustering algorithm that groups together records referring to the same entity. Unfortunately, the clustering algorithm is used strictly as a post-processing step, which slows down overall performance and means that final results are produced only at the end of the whole process. Motivated by this limitation, in this paper we propose and experimentally assess SjClust, a framework that integrates similarity join and clustering into a single operation. The basic idea is to introduce a variety of cluster representations that are merged smoothly during the set similarity task carried out by the join algorithm. An optimization is further applied on top of this framework. In an extensive experimental evaluation, SjClust outperforms the original set similarity join algorithm by an order of magnitude in most settings.
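To make the idea of joining and clustering in a single pass concrete, here is a minimal sketch. It is not the SjClust algorithm from the paper: it assumes Jaccard similarity over token sets, a single greedy pass, and a deliberately simple cluster representation (the union of the members' tokens). The names `jaccard` and `join_and_cluster` are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def join_and_cluster(records: List[Set[str]], threshold: float) -> List[List[int]]:
    """Greedy single-pass sketch: compare each record against cluster
    representatives instead of against every earlier record.

    Assumption (not from the paper): a cluster is represented by the union
    of its members' tokens, and a record joins the first cluster whose
    representative reaches the similarity threshold.
    """
    clusters = []  # each entry: {"rep": set of tokens, "members": list of record indices}
    for idx, tokens in enumerate(records):
        for cluster in clusters:
            if jaccard(tokens, cluster["rep"]) >= threshold:
                cluster["members"].append(idx)
                cluster["rep"] |= tokens  # merge the record into the representation
                break
        else:
            # No sufficiently similar cluster: the record starts a new one.
            clusters.append({"rep": set(tokens), "members": [idx]})
    return [c["members"] for c in clusters]


records = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"x", "y"}, {"x", "y", "z"}]
groups = join_and_cluster(records, threshold=0.6)
```

Because each record is compared with one representative per cluster rather than with every earlier record, clusters are available as soon as the pass reaches them, which is the general motivation the abstract gives for avoiding a separate post-processing step.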
DOI
10.1007/978-3-319-44403-1_12
WOS
WOS:000389020100012
Archive
http://hdl.handle.net/11368/2898280
info:eu-repo/semantics/altIdentifier/scopus/2-s2.0-84981278159
http://www.springer.com/la/book/9783319444055
Rights
closed access
license: digital rights management not defined
FVG url
https://arts.units.it/request-item?handle=11368/2898280
Subjects
  • Theoretical Computer ...

  • Computer Science (all...

Web of Science© citations
2
Date retrieved
Mar 17, 2024
Views
1
Date retrieved
Apr 19, 2024