A query paradigm, which has recently received much attention, is the similarity
query. In a similarity query, two data objects "match" if they are similar.
Similarity queries are required in scenarios where equality and exact matches
are not effective, for example, when dealing with errors in the data.
In the past, the main research focus regarding similarity queries was on
physical operators and access methods: new algorithms and indexes for specific
combinations of operator (e.g., top-k selection), data type (e.g., string), and
similarity function (e.g., edit distance) were proposed. The evaluation of
similarity queries in a larger systems context, however, has received little
attention. Thus, similarity queries are poorly supported by database management systems. Applications that require advanced similarity features cannot rely on general-purpose systems that transparently handle data storage and querying.
Instead, similarity queries must be dealt with in an ad-hoc way, for example, by
implementing user defined functions in the database system or writing custom
applications. Both approaches are cumbersome and inefficient.
In this project we depart from the mainstream in similarity research with its
narrow focus on individual physical operators and study similarity queries from
a broader systems perspective. We want to develop a deep understanding of all
aspects of similarity queries that are required to build a general-purpose query
processor for this query type. The overall goal is the integration of
similarity queries into declarative databases (relational or non-relational) and
their efficient processing in a systems context.
We identify three core requirements for a similarity query processing system:
declarative, efficient, combinable, extendable (DECE), i.e., declarative queries that combine equality and similarity predicates should be processed efficiently. We believe that all DEC requirements must be satisfied to build a useful end-to-end system for similarity queries.
The key ideas for integrating similarity queries into systems are (a) the
development of a new similarity algebra that extends relational algebra with a
minimal set of orthogonal operators, (b) dynamic rewriting techniques that
expand similarity queries with efficient filters, and (c) a cost model to assess
the resulting query plans.
The focus of this project is on similarity queries over sets, strings, trees,
and graphs, which cover a wide range of applications. We expect that many of our results will be transferable to other domains, for example, feature vectors and multimedia retrieval.