Comment by rini17

6 months ago

In general, this might be a good preprocessing step: check for punctuation that repeats at fixed intervals, remove it before compression, and restore it after decompression.
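To make the idea concrete, here is a rough Python sketch of such a preprocessing step. Everything in it is an assumption for illustration: the function names (`find_periodic_delimiter`, `strip_delimiter`, `restore_delimiter`, `pack`, `unpack`) are made up, zlib stands in for whatever compressor you would actually use, and the detection heuristic is deliberately naive (it accepts any byte whose occurrences are evenly spaced across the whole input, so it can also pick up a data byte that happens to be periodic, which is still lossless).

```python
import zlib


def find_periodic_delimiter(data: bytes):
    """Return (byte, period, offset) for a byte that occurs only at
    offset, offset + period, ... across the whole input, else None."""
    positions = {}
    for i, b in enumerate(data):
        positions.setdefault(b, []).append(i)
    for b, pos in positions.items():
        if len(pos) < 2:
            continue  # a single occurrence gives no period to measure
        period = pos[1] - pos[0]
        evenly_spaced = all(q - p == period for p, q in zip(pos, pos[1:]))
        # The pattern must start within the first period and run to the end.
        spans_input = pos[0] < period and pos[-1] + period >= len(data)
        if evenly_spaced and spans_input:
            return b, period, pos[0]
    return None


def strip_delimiter(data: bytes):
    """Remove the periodic byte; return (payload, metadata or None)."""
    found = find_periodic_delimiter(data)
    if found is None:
        return data, None
    b, period, offset = found
    # The byte occurs nowhere else, so filtering by value is enough.
    payload = bytes(x for x in data if x != b)
    return payload, (b, period, offset, len(data))


def restore_delimiter(payload: bytes, meta):
    """Reinsert the delimiter at its original positions."""
    if meta is None:
        return payload
    b, period, offset, total = meta
    rest = iter(payload)
    out = bytearray()
    for i in range(total):
        if i >= offset and (i - offset) % period == 0:
            out.append(b)
        else:
            out.append(next(rest))
    return bytes(out)


def pack(data: bytes):
    """Strip the delimiter, then compress (zlib is only a stand-in)."""
    payload, meta = strip_delimiter(data)
    return zlib.compress(payload, 9), meta


def unpack(blob: bytes, meta):
    """Decompress, then put the delimiter back."""
    return restore_delimiter(zlib.decompress(blob), meta)
```

For example, on FASTA-style text with a newline after every 60 bases, `pack` would drop the newlines and only keep a small tuple describing where they go. A real implementation would also have to serialize that metadata next to the compressed stream.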

5 comments

vintermann 6 months ago

That turns it into specialized compression, which DNA already has plenty of. Many forms of specialized compression even allow string-related queries directly on the compressed data.

rini17 6 months ago

There are plenty of data formats where data is interspersed with fixed delimiters in fixed intervals.
- vintermann 6 months ago
  
  In fixed intervals? I'm not so sure about that. Generally, if the intervals are fixed, you shouldn't need delimiters, since you already know where one thing begins and another ends.
  
  Anyway, BWT-based compressors like bzip2 do a good job on "repetition, but with random differences", better than LZ-based compressors. However, they are not competitive on speed, and they have gotten relatively worse as computers got faster, since the Burrows-Wheeler transform can't be parallelized very well and is inherently cache-unfriendly.
  
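If you want to check the "repetition, but with random differences" point on your own data, a quick comparison with Python's standard `zlib` (LZ77-based) and `bz2` (BWT-based) modules is enough. The sketch below is only an illustration: `make_sample` and `compressed_sizes` are hypothetical helper names, and the synthetic input (one random ACGT unit repeated with a 1% substitution rate) is an assumption, not real sequencing data. No result is claimed here; the numbers depend heavily on the input.

```python
import bz2
import random
import zlib


def make_sample(unit_len=2000, copies=50, mutation_rate=0.01, seed=0):
    """Build a synthetic sequence: one random unit repeated many times,
    with each base independently replaced at the given rate."""
    rng = random.Random(seed)
    unit = [rng.choice("ACGT") for _ in range(unit_len)]
    out = []
    for _ in range(copies):
        out.extend(
            rng.choice("ACGT") if rng.random() < mutation_rate else base
            for base in unit
        )
    return "".join(out).encode("ascii")


def compressed_sizes(data: bytes):
    """Return the raw size and the size after each compressor."""
    return {
        "raw": len(data),
        "zlib (LZ77)": len(zlib.compress(data, 9)),
        "bzip2 (BWT)": len(bz2.compress(data, 9)),
    }


if __name__ == "__main__":
    for name, size in compressed_sizes(make_sample()).items():
        print(f"{name:12} {size:8d} bytes")
```

Swapping `make_sample()` for the contents of a real file is the more useful test. Timing the two calls alongside the sizes would also show the speed gap mentioned above.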

bede 6 months ago

Yes, it sounds like 7-Zip/LZMA can do this using custom filters, among other more exotic (and slow) statistical compression approaches.