Projekte pro Jahr
Abstract
We study the set similarity join problem, which retrieves all pairs of similar sets from two collections of sets for a given distance function. Existing exact solutions employ a signature-based filter-verification framework: If two sets are similar, they must have at least one signature in common, otherwise they can be pruned safely. We observe that the choice of the signature scheme has a significant impact on the performance. Unfortunately, choosing a good signature scheme is hard because the performance heavily depends on the characteristics of the underlying dataset.
To address this problem, we propose a hybrid signature composition that leverages the most selective portion of each signature scheme. Sets with an unselective primary signature are detected, and the signatures are replaced with a more selective secondary signature. We propose a generic framework called TwoL and a cost model to balance the computational overhead and the selectivity of the signature schemes. We implement our framework with two complementary signature schemes for Jaccard similarity and Hamming distance, resulting in effective two-level hybrid indexes that join datasets with diverse characteristics efficiently. TwoL consistently outperforms state-of-the-art set similarity joins on a benchmark with 13 datasets that cover a wide range of data characteristics.
To address this problem, we propose a hybrid signature composition that leverages the most selective portion of each signature scheme. Sets with an unselective primary signature are detected, and the signatures are replaced with a more selective secondary signature. We propose a generic framework called TwoL and a cost model to balance the computational overhead and the selectivity of the signature schemes. We implement our framework with two complementary signature schemes for Jaccard similarity and Hamming distance, resulting in effective two-level hybrid indexes that join datasets with diverse characteristics efficiently. TwoL consistently outperforms state-of-the-art set similarity joins on a benchmark with 13 datasets that cover a wide range of data characteristics.
Originalsprache | Englisch |
---|---|
Seiten | 2686-2698 |
Seitenumfang | 13 |
DOIs | |
Publikationsstatus | Veröffentlicht - Juli 2023 |
Veranstaltung | 49th International Conference on Very Large Data Bases 2023 - Vancouver, Kanada Dauer: 28 Aug. 2023 → 1 Sept. 2029 Konferenznummer: 49 https://vldb.org/2023/ |
Konferenz
Konferenz | 49th International Conference on Very Large Data Bases 2023 |
---|---|
Kurztitel | VLDB 2023 |
Land/Gebiet | Kanada |
Ort | Vancouver |
Zeitraum | 28/08/23 → 1/09/29 |
Internetadresse |
Systematik der Wissenschaftszweige 2012
- 102 Informatik
Projekte
- 1 Laufend
-
DESQ: DESQ - Declarative and Efficient Similarity Queries
Augsten, N. (Projektleitung)
1/12/21 → 30/11/25
Projekt: Forschung
Publikationen
- 2 Patentschrift
-
Method for Stable Set Similarity Join
Schmitt, D. U. (Erfinder/-in), Augsten, N. (Erfinder/-in), Kocher, D. (Erfinder/-in) & Mann, W. (Erfinder/-in), 2 Nov. 2023, IPC Nr. G06F 16/22 2019.1, Patent Nr. WO2023208903A1Publikation: Patent › Patentschrift
-
Method for Stable Set Similarity Joins
Schmitt, D. U. (Erfinder/-in), Augsten, N. (Erfinder/-in), Kocher, D. (Erfinder/-in), Mann, W. (Erfinder/-in) & Miller, A. (Erfinder/-in), 1 Nov. 2023, IPC Nr. G06F16/22, Patent Nr. EP4270210A1Publikation: Patent › Patentschrift
-
A Two-Level Signature Scheme for Stable Set Similarity Joins
Schmitt, D. U. (Redner/in)
10 Feb. 2024Aktivität: Gastvortrag oder Vortrag › Vortrag › science to science / art to art
-
A Two-Level Signature Scheme for Stable Set Similarity Joins
Schmitt, D. U. (Redner/in)
30 Aug. 2023Aktivität: Gastvortrag oder Vortrag › Poster-Präsentation › science to science / art to art
-
A Two-Level Signature Scheme for Stable Set Similarity Joins
Schmitt, D. U. (Redner/in)
30 Aug. 2023Aktivität: Gastvortrag oder Vortrag › Vortrag › science to science / art to art