Project Details
Description
A query paradigm, which has recently received much attention, is the similarity
query. In a similarity query, two data objects "match" if they are similar.
Similarity queries are required in scenarios where equality and exact matches
are not effective, for example, when dealing with errors in the data.
In the past, the main research focus regarding similarity queries was on
physical operators and access methods: new algorithms and indexes for specific
combinations of operator (e.g., top-k selection), data type (e.g., string), and
similarity function (e.g., edit distance) were proposed. The evaluation of
similarity queries in a larger systems context, however, has received little
attention. Thus, similarity queries are poorly supported by database management systems. Applications that require advanced similarity features cannot rely on general-purpose systems that transparently handle data storage and querying.
Instead, similarity queries must be dealt with in an ad-hoc way, for example, by
implementing user defined functions in the database system or writing custom
applications. Both approaches are cumbersome and inefficient.
In this project we depart from the mainstream in similarity research with its
narrow focus on individual physical operators and study similarity queries from
a broader systems perspective. We want to develop a deep understanding of all
aspects of similarity queries that are required to build a general-purpose query
processor for this query type. The overall goal is the integration of
similarity queries into declarative databases (relational or non-relational) and
their efficient processing in a systems context.
We identify three core requirements for a similarity query processing system:
declarative, efficient, combinable, extendable (DECE), i.e., declarative queries that combine equality and similarity predicates should be processed efficiently. We believe that all DEC requirements must be satisfied to build a useful end-to-end system for similarity queries.
The key ideas for integrating similarity queries into systems are (a) the
development of a new similarity algebra that extends relational algebra with a
minimal set of orthogonal operators, (b) dynamic rewriting techniques that
expand similarity queries with efficient filters, and (c) a cost model to assess
the resulting query plans.
The focus of this project is on similarity queries over sets, strings, trees,
and graphs, which cover a wide range of applications. We expect that many of our results will be transferable to other domains, for example, feature vectors and multimedia retrieval.
query. In a similarity query, two data objects "match" if they are similar.
Similarity queries are required in scenarios where equality and exact matches
are not effective, for example, when dealing with errors in the data.
In the past, the main research focus regarding similarity queries was on
physical operators and access methods: new algorithms and indexes for specific
combinations of operator (e.g., top-k selection), data type (e.g., string), and
similarity function (e.g., edit distance) were proposed. The evaluation of
similarity queries in a larger systems context, however, has received little
attention. Thus, similarity queries are poorly supported by database management systems. Applications that require advanced similarity features cannot rely on general-purpose systems that transparently handle data storage and querying.
Instead, similarity queries must be dealt with in an ad-hoc way, for example, by
implementing user defined functions in the database system or writing custom
applications. Both approaches are cumbersome and inefficient.
In this project we depart from the mainstream in similarity research with its
narrow focus on individual physical operators and study similarity queries from
a broader systems perspective. We want to develop a deep understanding of all
aspects of similarity queries that are required to build a general-purpose query
processor for this query type. The overall goal is the integration of
similarity queries into declarative databases (relational or non-relational) and
their efficient processing in a systems context.
We identify three core requirements for a similarity query processing system:
declarative, efficient, combinable, extendable (DECE), i.e., declarative queries that combine equality and similarity predicates should be processed efficiently. We believe that all DEC requirements must be satisfied to build a useful end-to-end system for similarity queries.
The key ideas for integrating similarity queries into systems are (a) the
development of a new similarity algebra that extends relational algebra with a
minimal set of orthogonal operators, (b) dynamic rewriting techniques that
expand similarity queries with efficient filters, and (c) a cost model to assess
the resulting query plans.
The focus of this project is on similarity queries over sets, strings, trees,
and graphs, which cover a wide range of applications. We expect that many of our results will be transferable to other domains, for example, feature vectors and multimedia retrieval.
| Acronym | DESQ |
|---|---|
| Status | Active |
| Effective start/end date | 1/12/21 → 30/11/26 |
-
SHINE: A Scalable HNSW Index in Disaggregated Memory
Widmoser, M., Kocher, D. & Augsten, N., 18 Aug 2025, arXiv.org.Research output: Working paper/Preprint › Preprint
Open Access -
Extensible and Robust Evaluation of Similarity Queries
Schmitt, D. U., Hütter, T. & Augsten, N., May 2025, p. 3868-3882. 15 p.Research output: Contribution to conference › Paper › peer-review
Open Access -
SWOOP: top-k similarity joins over set streams
Mann, W., Augsten, N., Jensen, C. S. & Pawlik, M., 23 Dec 2024, In: The VLDB Journal. 34, 1, 13 p., 13.Research output: Contribution to journal › Article › peer-review
Datasets
-
JSON Datasets
Hütter, T. (Creator), Zenodo, 2022
DOI: 10.5281/zenodo.5807299, https://github.com/DatabaseGroup/jedi-datasets
Dataset
Prizes
-
ACM SIGMOD Best Artifact Award
Hütter, T. (Recipient), Augsten, N. (Recipient), Kirsch, C. (Recipient), Carey, M. J. (Recipient) & Li, C. (Recipient), 2023
Prize
-
SWOOP: Top-k Similarity Joins over Set Streams
Augsten, N. (Speaker)
4 Sept 2025Activity: Talk or presentation › Poster presentation › science to science / art to art
-
SWOOP: top-k similarity joins over set streams
Augsten, N. (Speaker)
4 Sept 2025Activity: Talk or presentation › Oral presentation › science to science / art to art
-
JEDI: These aren't the JSON documents you're looking for...
Hütter, T. (Speaker)
16 Jun 2022Activity: Talk or presentation › Oral presentation › science to science / art to art