FINEX: A Fast Index for Exact & Flexible Density-Based Clustering

Konstantin Emil Thiel*, Daniel Kocher, Nikolaus Augsten, Thomas Hütter, Willi Mann, Daniel Ulrich Schmitt

*Corresponding author for this work

Publication: Contribution to journal, Article, Peer-reviewed

Abstract

Density-based clustering is a popular concept to find groups of similar objects (i.e., clusters) in a dataset. It is applied in various domains, e.g., process mining and anomaly detection, and comes with two user parameters (𝜀, 𝑀𝑖𝑛𝑃𝑡𝑠) that determine the clustering result but are typically unknown in advance. Thus, users need to interactively test various settings until satisfactory clusterings are found. This requires efficient algorithms, which are currently only available for specific data models, e.g., vector space data. We identify the following limitations of data model-independent approaches: (a) Expensive neighborhood computations are ineffectively pruned. (b) Existing indexes only return approximations, where objects are falsely labeled noise. (c) Existing indexes are inflexible as they restrict users to specifying density only via 𝜀, whereas 𝑀𝑖𝑛𝑃𝑡𝑠 is constant, which limits explorable clusterings. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexity of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient, flexible, and data model-independent. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments with 8 large real-world datasets from various domains, FINEX shows runtime improvements of at least one order of magnitude over the state-of-the-art technique for exact clustering.
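For readers unfamiliar with the two parameters, the following is a minimal sketch of the plain, index-free notion of density-based clustering (in the style of DBSCAN) that the abstract refers to. It is not FINEX and does not reflect the authors' implementation. The function names `region_query` and `dbscan`, the label constants, and the toy data are assumptions chosen for illustration. The only thing the sketch requires of the data is a distance function, which is what "data model-independent" means here. Note that every object triggers a full neighborhood scan, which is the kind of expensive computation the abstract says should be avoided or pruned.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

NOISE = -1       # label for objects that belong to no cluster
UNVISITED = -2   # internal marker, never returned to the caller


def region_query(data: Sequence[T], i: int, eps: float,
                 dist: Callable[[T, T], float]) -> List[int]:
    """Indices of all objects within distance eps of data[i], including i itself."""
    return [j for j in range(len(data)) if dist(data[i], data[j]) <= eps]


def dbscan(data: Sequence[T], eps: float, min_pts: int,
           dist: Callable[[T, T], float]) -> List[int]:
    """Return one cluster label per object; NOISE (-1) marks noise objects."""
    labels = [UNVISITED] * len(data)
    cluster = 0
    for i in range(len(data)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(data, i, eps, dist)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may still become a border object later
            continue
        labels[i] = cluster            # data[i] is a core object: start a cluster
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # reachable from a core object: border object
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(data, j, eps, dist)
            if len(j_neighbors) >= min_pts:
                frontier.extend(j_neighbors)   # only core objects expand the cluster
        cluster += 1
    return labels


# Toy usage with 1-D points and absolute difference as the distance (assumed data).
points = [1.0, 1.2, 1.4, 5.0, 5.1, 5.3, 9.0]
print(dbscan(points, eps=0.5, min_pts=2, dist=lambda a, b: abs(a - b)))
# prints [0, 0, 0, 1, 1, 1, -1]
```

Rerunning the same call with a different `eps` or `min_pts` repeats every neighborhood scan from scratch, which illustrates why interactive parameter exploration benefits from an index.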
Original language: English
Article number: 71
Pages (from - to): 1-25
Number of pages: 25
Journal: Proceedings of the ACM on Management of Data (PACMMOD)
Volume: 1
Issue number: 1
DOIs
Publication status: Published - June 2023
Event: ACM SIGMOD International Conference on Management of Data - Seattle, Washington, United States
Duration: 18 June 2023 to 23 June 2023
https://2023.sigmod.org/

Fields of Science and Technology Classification 2012

  • 102 Computer Sciences
