Request for software: byte-oriented merge with moves
2021-11-13 Permalink
Most merge tools use line-oriented comparison algorithms, and employ heuristics to detect moved lines, if at all.
Instead, the comparison should compute the shortest sequence of operations required to transform one input into the other, where the permitted operations include character deletion, insertion, replacement, and a substring move:
Why byte-oriented? Contrary to popular belief, lines are an unnecessary abstraction when dealing with most programming languages. It is frequent to wrap long lines of code, or apply multiple independent changes on the same line. If indentation was changed, I’d rather see only that one tab character inserted rather than an entire line highlighted (‘ignore whitespace’ is a worse alternative).
Why with moves? It’s common to have code rearranged while a conflicting edit renames an identifier, say. Conflicting changes of such sort could be resolved automatically; or if marked as conflicted to play safe, could be resolved by applying ‘left then right’ without resorting to manual editing.
This problem is known to be NP-hard though. However, good practical approximations exist. In particular one such algorithm is described in The String Edit Distance Matching Problem with Moves by G. Cormode and S. Muthukrishnan.
The algorithm works by constructing an edit-sensitive parsing tree in O(N log* N) time. Such a tree is tolerant to local edits of the string under the above operations. Consequently a metric between two strings can be approximated by comparing the cardinalities of node labels of their parsing trees.
It should be possible to efficiently traverse such two trees and emit a sequence of plausible edit operations that would transform one of them into the other. Then the two sequences between ‘left’ and ‘right’ in a 3-way merge can be compared and merged.