TY - GEN
T1 - Scaling Density-Based Clustering to Large Collections of Sets
AU - Kocher, Daniel
AU - Mann, Willi
AU - Augsten, Nikolaus
N1 - Publisher Copyright:
© 2021 Copyright held by the owner/author(s).
PY - 2021/3
Y1 - 2021/3
N2 - We study techniques for clustering large collections of sets into DBSCAN clusters. Sets are often used as a representation of complex objects to assess their similarity. The similarity of two objects is then computed based on the overlap of their set representations, for example, using Hamming distance. Clustering large collections of sets is challenging. A baseline that executes the standard DBSCAN algorithm suffers from poor performance due to the unfavorable neighborhood-by-neighborhood order in which the sets are retrieved. The DBSCAN order requires the use of a symmetric index, which is less effective than its asymmetric counterpart. Precomputing and materializing the neighborhoods to gain control over the retrieval order often turns out to be infeasible due to excessive memory requirements.We propose a new, density-based clustering algorithm that processes data points in any user-defined order and does not need to materialize neighborhoods. Instead, so-called backlinks are introduced that are sufficient to achieve a correct clustering. Backlinks require only linear space while there can be a quadratic number of neighbors. To the best of our knowledge, this is the first DBSCAN-compliant algorithm that can leverage asymmetric indexes in linear space. Our empirical evaluation suggests that our algorithm combines the best of two worlds: it achieves the runtime performance of materialization-based approaches while retaining the memory efficiency of non-materializing techniques.
AB - We study techniques for clustering large collections of sets into DBSCAN clusters. Sets are often used as a representation of complex objects to assess their similarity. The similarity of two objects is then computed based on the overlap of their set representations, for example, using Hamming distance. Clustering large collections of sets is challenging. A baseline that executes the standard DBSCAN algorithm suffers from poor performance due to the unfavorable neighborhood-by-neighborhood order in which the sets are retrieved. The DBSCAN order requires the use of a symmetric index, which is less effective than its asymmetric counterpart. Precomputing and materializing the neighborhoods to gain control over the retrieval order often turns out to be infeasible due to excessive memory requirements.We propose a new, density-based clustering algorithm that processes data points in any user-defined order and does not need to materialize neighborhoods. Instead, so-called backlinks are introduced that are sufficient to achieve a correct clustering. Backlinks require only linear space while there can be a quadratic number of neighbors. To the best of our knowledge, this is the first DBSCAN-compliant algorithm that can leverage asymmetric indexes in linear space. Our empirical evaluation suggests that our algorithm combines the best of two worlds: it achieves the runtime performance of materialization-based approaches while retaining the memory efficiency of non-materializing techniques.
KW - clustering
KW - sets
KW - hamming distance
KW - process mining
KW - density-based clustering
KW - dbscan
UR - https://www.youtube.com/embed/IJHjQAIHR0Y?autoplay=1
UR - https://www.youtube.com/watch?v=6TfwPa9Gmxc
UR - https://edbt2021proceedings.github.io/ads/a85.png
UR - http://www.scopus.com/inward/record.url?scp=85113711003&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/185be470-c4d4-32bf-a9f6-2b9aae47eb78/
U2 - 10.5441/002/edbt.2021.11
DO - 10.5441/002/edbt.2021.11
M3 - Conference contribution
SN - 978-3-89318-084-4
T3 - Advances in Database Technology - EDBT
SP - 109
EP - 120
BT - EDBT 2021 - Proceedings of the 24th International Conference on Extending Database Technology
A2 - Velegrakis, Yannis
A2 - Velegrakis, Yannis
A2 - Zeinalipour, Demetris
A2 - Chrysanthis, Panos K.
A2 - Chrysanthis, Panos K.
A2 - Guerra, Francesco
PB - OpenProceedings.org
ER -