Comment by thechao
5 years ago
Can you explain more? I deal with a shocking number of poorly supported binary formats, and this tool looks awesome. Buuuuttt... a lot of these formats are “headers with offsets to structured data”. They’re intermixed logs, so multiple incompatible binary formats are freely mixed in the same file.
I can most easily speak to the PDF example, which is harder than the target I was going after due to it being a mixture of text and binary, with a lot of optional behavior (EOL is CR or LF or CRLF seemingly at random, for one painful example)
I took a few minutes just to kick the tires on the startxref of PDF, to get a feel for how the substream business plays out, and then stopped when I got to the part about how the offset position is written in a dynamically sized ascii string but represents an offset in the file
As best I can tell, the actual definition would involve a hypothetical `repeat: until-backward` where it starts at EOF (-5 in our case, due to the known EOF constant), reads backward until it hits LF (and/or CR!), captures that as the startxref offset, reads backward eating CR/LF, skips backward `strlen("startxref")` bytes, and that is when the tomfoolery starts about reading the "xref" stanza, which, again, is an ascii description of more binary offsets, using zero-prefix padded numbers because of course it does
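For anyone who wants the backward walk spelled out, here is roughly what it looks like done by hand in Python instead of as a declarative spec. This is only a sketch under my own assumptions: `find_startxref` and `parse_xref_entry` are names I made up, it locates the last `%%EOF` with `rfind` rather than assuming it sits exactly 5 bytes from the end, and it ignores incremental updates, xref streams, and everything else a real PDF reader has to handle.

```python
def find_startxref(data: bytes) -> int:
    """Walk backward from the last %%EOF and return the startxref offset."""
    pos = data.rfind(b"%%EOF")
    if pos == -1:
        raise ValueError("no %%EOF marker found")

    # Eat the EOL run (CR, LF, or CRLF) that precedes %%EOF.
    while pos > 0 and data[pos - 1] in b"\r\n":
        pos -= 1

    # Capture the dynamically sized run of ASCII digits.
    end = pos
    while pos > 0 and data[pos - 1] in b"0123456789":
        pos -= 1
    if pos == end:
        raise ValueError("no digits before %%EOF")
    offset = int(data[pos:end])

    # Eat the EOL run between the keyword and the digits.
    while pos > 0 and data[pos - 1] in b"\r\n":
        pos -= 1

    # The bytes just before that must be the keyword itself.
    keyword = b"startxref"
    if data[max(0, pos - len(keyword)):pos] != keyword:
        raise ValueError("digits are not preceded by startxref")
    return offset


def parse_xref_entry(entry: bytes) -> tuple[int, int, str]:
    """Split one classic xref entry: zero-padded offset, generation, n/f flag."""
    offset, generation, kind = entry.split()
    return int(offset), int(generation), kind.decode("ascii")
```

The point is that every step is "look at what you just read, then decide where to go", which is easy in imperative code and awkward to state as a fixed layout.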
Don't get me wrong -- it's entirely possible that kaitai is targeting _strictly binary_ formats written by sane engineering teams, but the file format I was going after had a boatload of that jumping-around, repeating structs-of-offsets, too, so my holding up PDF as a worst-case example isn't ludicrous, either