SWOOP: top-k similarity joins over set streams

Willi Mann, Nikolaus Augsten, Christian S. Jensen, Mateusz Pawlik

Publikation: Beitrag in FachzeitschriftArtikelPeer-reviewed

Abstract

We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams, such as Twitter streams that emit tweets as sets of words. Using a sliding window model, the top-k result changes as new sets enter the window or existing ones leave the window. Specifically, when a set arrives, it may form a new top-k result pair with any set already in the window. When a set leaves the window, all its pairings in the top-k result must be replaced with other pairs. It is therefore not sufficient to maintain the k most similar pairs since less similar pairs may become top-k pairs later. We propose SWOOP, a highly scalable stream join algorithm. Novel indexing techniques and sophisticated filters efficiently prune obsolete pairs as new sets enter the window. SWOOP incrementally maintains a provably minimal stock of similar pairs to update the top-k result at any time. Empirical studies confirm that SWOOP is able to support stream rates that are orders of magnitude faster than the rates supported by existing approaches.
OriginalspracheEnglisch
Aufsatznummer13
Seitenumfang13
FachzeitschriftThe VLDB Journal
Jahrgang34
Ausgabenummer1
DOIs
PublikationsstatusVeröffentlicht - 23 Dez. 2024

Systematik der Wissenschaftszweige 2012

  • 102 Informatik

Dieses zitieren