Comment by vintermann

6 months ago

That turns it into specialized compression, of which DNA already has plenty. Many forms of specialized compression even allow string-related queries directly on the compressed data.

There are plenty of data formats where data is interspersed with fixed delimiters at fixed intervals.

  • At fixed intervals? I'm not so sure about that. Generally, if the intervals are fixed you shouldn't need delimiters at all, since you already know where one thing begins and another ends.

    Anyway, BWT-based compressors like Bzip2 do a good job on "repetition, but with random differences" — better than LZ-based compressors. However, they are not competitive on speed, and the gap has widened as computers have gotten faster, since the Burrows-Wheeler transform parallelizes poorly and is inherently cache-unfriendly.
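    You can get a feel for this claim from the stdlib alone. The sketch below builds synthetic "repetition with random differences" (identical records, one random byte flipped each) and compresses it with both zlib (DEFLATE, LZ-based) and bz2 (BWT-based); exact sizes and which one wins will vary with the data shape and compression levels, so treat it as a probe, not a benchmark.

    ```python
    import bz2
    import random
    import zlib

    # "Repetition, but with random differences": many copies of the same
    # 60-byte record, each with one byte replaced at a random position.
    random.seed(0)
    base = bytearray(b"ACGT" * 15)  # 60-byte record
    records = []
    for _ in range(20_000):
        rec = bytearray(base)
        rec[random.randrange(len(rec))] = random.choice(b"ACGT")
        records.append(bytes(rec))
    data = b"\n".join(records)

    lz = zlib.compress(data, 9)  # LZ77 + Huffman (DEFLATE)
    bw = bz2.compress(data, 9)   # BWT + MTF + Huffman (bzip2)

    print(f"original: {len(data):>9,} bytes")
    print(f"zlib -9 : {len(lz):>9,} bytes")
    print(f"bzip2 -9: {len(bw):>9,} bytes")
    ```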

    • All kinds of data: block-justified text, database files with fixed-size row tables, even HTTP chunked encoding tends to produce same-size blocks with the same delimiters, and so on.

      I really don't see how better supporting this "second-order repetition" pattern in the encoding would be such a big problem. LZ variants already track repeating strings.
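      A toy greedy LZ77-style matcher (a sketch, not a real codec; the window size, minimum match length, and row layout are all made up for illustration) shows the point: with fixed-size rows, the longest match at most positions sits exactly one row back, so "delimiters at fixed intervals" fall out of plain string matching.

      ```python
      # Rows of fixed size ROW: an 8-digit counter, dot padding, '|' delimiter.
      ROW = 32
      rows = [f"{i:08d}".encode() + b"." * (ROW - 9) + b"|" for i in range(500)]
      data = b"".join(rows)

      WINDOW = 64     # tiny search window, enough to reach one row back
      MIN_MATCH = 4
      tokens = []     # (distance, length) matches or single-byte literals
      i = 0
      while i < len(data):
          best_len, best_dist = 0, 0
          for dist in range(1, min(i, WINDOW) + 1):
              length = 0
              # overlapping matches are allowed, as in real LZ77
              while i + length < len(data) and data[i + length - dist] == data[i + length]:
                  length += 1
              if length > best_len:
                  best_len, best_dist = length, dist
          if best_len >= MIN_MATCH:
              tokens.append((best_dist, best_len))
              i += best_len
          else:
              tokens.append(data[i])
              i += 1

      matches = [t for t in tokens if isinstance(t, tuple)]
      row_dist = sum(1 for d, _ in matches if d == ROW)
      print(f"{len(matches)} matches, {row_dist} at distance {ROW} (one row back)")
      ```

      Most emitted matches land at distance exactly ROW, i.e. the fixed-interval delimiter structure is already being exploited by the ordinary match finder.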