Scalable Similarity Queries over Complex Data

Publication: Other contributions › Other contribution › Research › Peer-reviewed

Abstract

Our goal is to study and advance the efficient evaluation of similarity queries for complex data. In this thesis, we focus on two types of similarity queries: (1) top-k subtree similarity queries for trees and (2) density-based clustering for sets. A similarity query evaluates predicates that are based on some notion of similarity rather than equality, e.g., two data items are compared using a similarity function such as the overlap similarity for sets or the tree edit distance for trees. The similarity predicates, however, introduce additional complexity during query evaluation and prohibit the use of standard techniques (like hashing or sorting) to evaluate the respective operators. We develop new index structures and algorithms that are tailored to similarity predicates.

The top-k subtree similarity query retrieves the k subtrees in a large document tree that are most similar to a given query tree. The similarity between two trees is assessed using the well-known tree edit distance. Previous solutions are either memory-efficient but slow (i.e., they scan the entire document for each query) or fast but require quadratic space in the input size (i.e., an index is built to answer queries fast). We present SlimCone, a solution that is based on a linear-space index and retrieves promising subtrees first. Promising subtrees share many node labels with the query tree. SlimCone avoids quadratic space by building relevant parts of the index on the fly. In our experiments on synthetic and real-world data, SlimCone outperforms the state of the art with respect to runtime by up to four orders of magnitude. Furthermore, SlimCone outperforms the existing index-based solution in terms of memory usage, indexing time, and the number of inspected subtrees.
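The notion of "promising" subtrees can be illustrated with a minimal sketch: rank candidate subtrees by the number of node labels they share with the query tree (a multiset intersection). The function names and the flat label-list representation are illustrative only; they are not SlimCone's actual index structure, which is built lazily and interleaved with tree edit distance verification.

```python
from collections import Counter

def label_overlap(query_labels, subtree_labels):
    """Number of shared node labels (multiset intersection)."""
    q, s = Counter(query_labels), Counter(subtree_labels)
    return sum((q & s).values())

def rank_candidates(query_labels, subtrees):
    """Order subtrees so the most promising (highest label overlap)
    come first; ties keep their input order."""
    return sorted(subtrees, key=lambda t: -label_overlap(query_labels, t[1]))

# Toy example: each subtree is (id, list of its node labels).
subtrees = [
    ("t1", ["a", "b", "c"]),
    ("t2", ["x", "y"]),
    ("t3", ["a", "b", "b", "c"]),
]
ranking = rank_candidates(["a", "b", "c"], subtrees)
# "t1" and "t3" share three labels with the query, "t2" shares none.
```

A top-k algorithm can then verify candidates in this order with the (expensive) tree edit distance and stop early once no remaining candidate can beat the current k-th best result.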

Density-based clustering is a technique to partition data into clusters, i.e., dense regions that are separated by low-density regions. The popular DBSCAN algorithm recursively expands dense neighborhoods until a low-density neighborhood is reached. It relies on symmetric indexes, i.e., indexes that return all neighbors for a particular point independently of the processing order. For sets, however, the most efficient indexes are asymmetric. An asymmetric index assumes a processing order and returns only a specific part of the neighborhood (all unprocessed neighbors). Thus, they cannot be used in DBSCAN. A baseline that precomputes and materializes all neighbors before executing DBSCAN suffers from a large memory footprint (quadratic in the input size). To the best of our knowledge, we develop the first DBSCAN-compliant solution for sets. Our Spread algorithm uses asymmetric indexes and requires only linear space. Spread imposes a processing order on the sets and uses backlinks that keep sufficient information to derive a correct clustering. In our experiments, Spread is competitive with the materialization-based solution in terms of runtime and retains the memory efficiency of DBSCAN. We also present MC-Spread, an extension of Spread to multi-core environments. In our experiments, MC-Spread scales well with the number of cores for datasets with small neighborhoods that are expensive to compute.
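The symmetric-neighborhood assumption that Spread relaxes can be seen in a textbook DBSCAN sketch over sets, where two sets are neighbors when their overlap reaches a threshold. This is a minimal illustration with a brute-force, symmetric neighbor computation (quadratic time); it is not the Spread algorithm, which replaces these symmetric lookups with asymmetric index probes and backlinks.

```python
def overlap(a, b):
    """Overlap similarity: size of the intersection of two sets."""
    return len(a & b)

def dbscan_sets(sets, t, minpts):
    """Textbook DBSCAN over sets. Two sets are neighbors when their
    overlap is at least t (a set counts as its own neighbor). Returns
    one cluster label per input set; -1 marks noise."""
    n = len(sets)
    labels = [None] * n
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(n) if overlap(sets[i], sets[j]) >= t]
        if len(neigh) < minpts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        labels[i] = cid
        queue = [j for j in neigh if j != i]
        while queue:                # recursively expand dense neighborhoods
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = [k for k in range(n) if overlap(sets[j], sets[k]) >= t]
            if len(jn) >= minpts:   # j is a core point: keep expanding
                queue.extend(k for k in jn if labels[k] is None)
        cid += 1
    return labels

# Toy example: two dense groups of overlapping sets and one outlier.
data = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {10, 11, 12}, {11, 12, 13}, {20}]
labels = dbscan_sets(data, t=2, minpts=2)
```

Note that every neighborhood query here returns the full neighborhood regardless of processing order; an asymmetric set index would return only the unprocessed part, which is why plugging it into this loop directly would yield wrong core-point decisions.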
Original language: English
Number of pages: 124
Publication status: Published - 2021

Fields of Science Classification 2012

  • 102 Computer Sciences
