Scaling Density-Based Clustering to Large Collections of Sets

Research output: Chapter in Book/Report/Conference proceeding/Legal commentaryConference contributionpeer-review

Abstract

We study techniques for clustering large collections of sets into DBSCAN clusters. Sets are often used as a representation of complex objects to assess their similarity. The similarity of two objects is then computed based on the overlap of their set representations, for example, using Hamming distance. Clustering large collections of sets is challenging. A baseline that executes the standard DBSCAN algorithm suffers from poor performance due to the unfavorable neighborhood-by-neighborhood order in which the sets are retrieved. The DBSCAN order requires the use of a symmetric index, which is less effective than its asymmetric counterpart. Precomputing and materializing the neighborhoods to gain control over the retrieval order often turns out to be infeasible due to excessive memory requirements.
We propose a new, density-based clustering algorithm that processes data points in any user-defined order and does not need to materialize neighborhoods. Instead, so-called backlinks are introduced that are sufficient to achieve a correct clustering. Backlinks require only linear space while there can be a quadratic number of neighbors. To the best of our knowledge, this is the first DBSCAN-compliant algorithm
that can leverage asymmetric indexes in linear space. Our empirical evaluation
suggests that our algorithm combines the best of two worlds: it achieves the
runtime performance of materialization-based approaches while retaining the
memory efficiency of non-materializing techniques.
Original languageEnglish
Title of host publicationEDBT 2021 - Proceedings of the 24th International Conference on Extending Database Technology
Subtitle of host publication24th International Conference on Extending Database Technology, Proceedings
EditorsYannis Velegrakis, Yannis Velegrakis, Demetris Zeinalipour, Panos K. Chrysanthis, Panos K. Chrysanthis, Francesco Guerra
PublisherOpenProceedings.org
Pages109-120
Number of pages12
ISBN (Electronic)978-3-89318-084-4
ISBN (Print)978-3-89318-084-4
DOIs
Publication statusPublished - Mar 2021

Publication series

NameAdvances in Database Technology - EDBT
Volume2021-March
ISSN (Electronic)2367-2005

Bibliographical note

Publisher Copyright:
© 2021 Copyright held by the owner/author(s).

Keywords

  • clustering
  • sets
  • hamming distance
  • process mining
  • density-based clustering
  • dbscan

Fields of Science and Technology Classification 2012

  • 102 Computer Sciences

Cite this