MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins

Manuel Widmoser*, Daniel Kocher, Nikolaus Augsten, Willi Mann

*Corresponding author for this work

Research output: Contribution to conferencePaperpeer-review

Abstract

Given two collections of sets, the set similarity join reports all pairs of sets that are within a given distance threshold. State-of-the-art solutions employ an inverted list index and several heuristics to compute the join result efficiently. Prefix-based solutions benefit from infrequent set elements, known as tokens, and spend considerable time scanning long lists if the token frequency is not sufficiently skewed. Partition-based methods are less sensitive to the token distribution but suffer from a significantly larger memory footprint, limiting their applicability as the threshold or the set sizes grow. Solutions from the domain of metric-based similarity search are designed to reduce the overall number of distance computations. Generic metric techniques cannot compete with state-of-the-art similarity joins tailored to sets, which in turn do not exploit metric filter opportunities.
We propose MetricJoin, the first exact set similarity join technique that leverages the metric properties of set distance functions. In contrast to its competitors, MetricJoin is robust, i.e., datasets with different characteristics can be joined efficiently in terms of runtime and memory. Our algorithm embeds sets in vector space, organizes long inverted lists in spatial indexes, and employs an effective metric filter to prune unqualified sets. MetricJoin requires only linear space in the collection size and substantially reduces the number of sets that must be considered. In our performance studies, MetricJoin outperforms state-of-the-art solutions by up to an order of magnitude in runtime and generates up to five orders of magnitude fewer candidates.
Original languageEnglish
Pages1045-1058
Number of pages14
DOIs
Publication statusPublished - Apr 2023
EventIEEE International Conference on Data Engineering 2023 - Anaheim, United States
Duration: 3 Apr 20237 Apr 2023

Conference

ConferenceIEEE International Conference on Data Engineering 2023
Abbreviated titleICDE
Country/TerritoryUnited States
CityAnaheim
Period3/04/237/04/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Fields of Science and Technology Classification 2012

  • 102 Computer Sciences

Cite this