TY - JOUR
T1 - FINEX: A Fast Index for Exact & Flexible Density-Based Clustering
AU - Thiel, Konstantin Emil
AU - Kocher, Daniel
AU - Augsten, Nikolaus
AU - Hütter, Thomas
AU - Mann, Willi
AU - Schmitt, Daniel Ulrich
PY - 2023/6
Y1 - 2023/6
N2 - Density-based clustering is a popular concept to find groups of similar objects (i.e., clusters) in a dataset. It is applied in various domains, e.g., process mining and anomaly detection, and comes with two user parameters (𝜀, 𝑀𝑖𝑛𝑃𝑡𝑠) that determine the clustering result but are typically unknown in advance. Thus, users need to interactively test various settings until satisfactory clusterings are found. This requires efficient algorithms, which are currently only available for specific data models, e.g., vector space data. We identify the following limitations of data model-independent approaches: (a) Expensive neighborhood computations are ineffectively pruned. (b) Existing indexes only return approximations, where objects are falsely labeled as noise. (c) Existing indexes are inflexible as they restrict users to specifying density only via 𝜀, whereas 𝑀𝑖𝑛𝑃𝑡𝑠 is constant, which limits explorable clusterings. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexity of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient, flexible, and data model-independent. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments with 8 large real-world datasets from various domains, FINEX shows runtime improvements of at least one order of magnitude over the state-of-the-art technique for exact clustering.
AB - Density-based clustering is a popular concept to find groups of similar objects (i.e., clusters) in a dataset. It is applied in various domains, e.g., process mining and anomaly detection, and comes with two user parameters (𝜀, 𝑀𝑖𝑛𝑃𝑡𝑠) that determine the clustering result but are typically unknown in advance. Thus, users need to interactively test various settings until satisfactory clusterings are found. This requires efficient algorithms, which are currently only available for specific data models, e.g., vector space data. We identify the following limitations of data model-independent approaches: (a) Expensive neighborhood computations are ineffectively pruned. (b) Existing indexes only return approximations, where objects are falsely labeled as noise. (c) Existing indexes are inflexible as they restrict users to specifying density only via 𝜀, whereas 𝑀𝑖𝑛𝑃𝑡𝑠 is constant, which limits explorable clusterings. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexity of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient, flexible, and data model-independent. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments with 8 large real-world datasets from various domains, FINEX shows runtime improvements of at least one order of magnitude over the state-of-the-art technique for exact clustering.
U2 - 10.1145/3588925
DO - 10.1145/3588925
M3 - Article
SN - 2836-6573
VL - 1
SP - 1
EP - 25
JO - Proceedings of the ACM on Management of Data (PACMMOD)
JF - Proceedings of the ACM on Management of Data (PACMMOD)
IS - 1
M1 - 71
T2 - ACM SIGMOD International Conference on Management of Data
Y2 - 18 June 2023 through 23 June 2023
ER -