Projects per year
Abstract
Given two collections of sets, the set similarity join reports all pairs of sets that are within a given distance threshold. State-of-the-art solutions employ an inverted list index and several heuristics to compute the join result efficiently. Prefix-based solutions benefit from infrequent set elements, known as tokens, and spend considerable time scanning long lists if the token frequency is not sufficiently skewed. Partition-based methods are less sensitive to the token distribution but suffer from a significantly larger memory footprint, limiting their applicability as the threshold or the set sizes grow. Solutions from the domain of metric-based similarity search are designed to reduce the overall number of distance computations. Generic metric techniques cannot compete with state-of-the-art similarity joins tailored to sets, which in turn do not exploit metric filter opportunities.
We propose MetricJoin, the first exact set similarity join technique that leverages the metric properties of set distance functions. In contrast to its competitors, MetricJoin is robust, i.e., datasets with different characteristics can be joined efficiently in terms of runtime and memory. Our algorithm embeds sets in vector space, organizes long inverted lists in spatial indexes, and employs an effective metric filter to prune unqualified sets. MetricJoin requires only linear space in the collection size and substantially reduces the number of sets that must be considered. In our performance studies, MetricJoin outperforms state-of-the-art solutions by up to an order of magnitude in runtime and generates up to five orders of magnitude fewer candidates.
We propose MetricJoin, the first exact set similarity join technique that leverages the metric properties of set distance functions. In contrast to its competitors, MetricJoin is robust, i.e., datasets with different characteristics can be joined efficiently in terms of runtime and memory. Our algorithm embeds sets in vector space, organizes long inverted lists in spatial indexes, and employs an effective metric filter to prune unqualified sets. MetricJoin requires only linear space in the collection size and substantially reduces the number of sets that must be considered. In our performance studies, MetricJoin outperforms state-of-the-art solutions by up to an order of magnitude in runtime and generates up to five orders of magnitude fewer candidates.
Original language | English |
---|---|
Pages | 1045-1058 |
Number of pages | 14 |
DOIs | |
Publication status | Published - Apr 2023 |
Event | IEEE International Conference on Data Engineering 2023 - Anaheim, United States Duration: 3 Apr 2023 → 7 Apr 2023 |
Conference
Conference | IEEE International Conference on Data Engineering 2023 |
---|---|
Abbreviated title | ICDE |
Country/Territory | United States |
City | Anaheim |
Period | 3/04/23 → 7/04/23 |
Bibliographical note
Publisher Copyright:© 2023 IEEE.
Fields of Science and Technology Classification 2012
- 102 Computer Sciences
Projects
- 1 Active
-
DESQ: DESQ - Declarative and Efficient Similarity Queries
Augsten, N. (Principal Investigator)
1/12/21 → 30/11/25
Project: Research
Research output
- 1 Patent
-
Method for Large-Scale Set Similarity Joins
Widmoser, M. (Inventor), Augsten, N. (Inventor), Kocher, D. (Inventor) & Mann, W. (Inventor), 30 Aug 2023, IPC No. G06F16/22, Patent No. EP4235451Research output: Patent
-
MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins
Widmoser, M. (Speaker)
5 Jun 2023Activity: Talk or presentation › Poster presentation › science to science / art to art
-
MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins
Augsten, N. (Speaker)
5 Apr 2023Activity: Talk or presentation › Oral presentation › science to science / art to art
-
MetricJoin: Leveraging Metric Properties for Robust Exact Set Similarity Joins
Widmoser, M. (Speaker)
5 Apr 2023Activity: Talk or presentation › Poster presentation › science to science / art to art