Minimal Edit-Based Diffs for Large Trees

Research output: Chapter in Book/Report/Conference proceeding/Legal commentaryConference contributionResearchpeer-review

Abstract

Hierarchically structured data are commonly represented as trees and have given rise to popular data formats like XML or JSON. An interesting query computes the difference between two versions of a tree, expressed as the minimum set of node edits (deletion, insertion, label rename) that transform one tree into another, commonly known as the tree edit distance. Unfortunately, the fastest tree edit distance algorithms run in cubic time and quadratic space and are therefore not feasible for large inputs. In this paper, we leverage the fact that the difference between two versions of a tree is typically much smaller than the overall tree size. We propose a new tree edit distance algorithm that is linear in the tree size for similar trees. Our algorithm is based on the new concept of top node pairs and avoids redundant distance computations, the main issue with previous solutions for tree diffs. We empirically evaluate the runtime of our algorithm on large synthetic and real-world trees; our algorithm clearly outperforms the state of the art, often by orders of magnitude.
Original languageEnglish
Title of host publicationProceedings of the ACM International Conference on Information and Knowledge Management
Pages 1225-1234
DOIs
Publication statusPublished - 2020

Fields of Science and Technology Classification 2012

  • 102 Computer Sciences

Cite this