Scaling Density-Based Clustering to Large Collections of Sets

Publikation: Beitrag in Buch/Bericht/Konferenzband/GesetzeskommentarKonferenzbeitragPeer-reviewed

Abstract

We study techniques for clustering large collections of sets into DBSCAN clusters. Sets are often used as a representation of complex objects to assess their similarity. The similarity of two objects is then computed based on the overlap of their set representations, for example, using Hamming distance. Clustering large collections of sets is challenging. A baseline that executes the standard DBSCAN algorithm suffers from poor performance due to the unfavorable neighborhood-by-neighborhood order in which the sets are retrieved. The DBSCAN order requires the use of a symmetric index, which is less effective than its asymmetric counterpart. Precomputing and materializing the neighborhoods to gain control over the retrieval order often turns out to be infeasible due to excessive memory requirements.
We propose a new, density-based clustering algorithm that processes data points in any user-defined order and does not need to materialize neighborhoods. Instead, so-called backlinks are introduced that are sufficient to achieve a correct clustering. Backlinks require only linear space while there can be a quadratic number of neighbors. To the best of our knowledge, this is the first DBSCAN-compliant algorithm
that can leverage asymmetric indexes in linear space. Our empirical evaluation
suggests that our algorithm combines the best of two worlds: it achieves the
runtime performance of materialization-based approaches while retaining the
memory efficiency of non-materializing techniques.
OriginalspracheEnglisch
TitelEDBT 2021 - Proceedings of the 24th International Conference on Extending Database Technology
Untertitel24th International Conference on Extending Database Technology, Proceedings
Redakteure/-innenYannis Velegrakis, Yannis Velegrakis, Demetris Zeinalipour, Panos K. Chrysanthis, Panos K. Chrysanthis, Francesco Guerra
Herausgeber (Verlag)OpenProceedings.org
Seiten109-120
Seitenumfang12
ISBN (elektronisch)978-3-89318-084-4
ISBN (Print)978-3-89318-084-4
DOIs
PublikationsstatusVeröffentlicht - März 2021

Publikationsreihe

NameAdvances in Database Technology - EDBT
Band2021-March
ISSN (elektronisch)2367-2005

Bibliographische Notiz

Funding Information:
We thank Alexander Miller, Mateusz Pawlik, Thomas Hütter, Manuel Widmoser, Manuel Kocher, Daniel Ulrich Schmitt, Konstantin Thiel, Daniel Grittner, Christian Böhm, and Claudia Plant for valuable discussions, and Manuel Kocher for typesetting Figures 1 and 3. This work was partially supported by the Austrian Science Fund (FWF): P 29859.

Publisher Copyright:
© 2021 Copyright held by the owner/author(s).

Systematik der Wissenschaftszweige 2012

  • 102 Informatik

Dieses zitieren