Comment by mdaniel

6 years ago

I'm super-on-board with declarative binary parsing schemes (i.e. not lua or python) but my own attempt to use kaitai was met with the same frustration that presumably caused PDF to be missing from the format gallery list (http://formats.kaitai.io/): jump or offset-based structures

At the time, I tried using a stunt like defining one huge blob that "eats" the main file, then reaching back into it as we learned more, but it looks like somewhere along the way they acquired a "substream" behavior (https://github.com/kaitai-io/kaitai_struct_doc/blob/c53060f7...) so maybe it's worth another look
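
For anyone unfamiliar with the "eat the whole file, then reach back into it" stunt, here is a minimal Python sketch of the idea. The layout (a `DEMO` magic followed by a little-endian offset and length) is entirely made up for illustration and is not any real format, and this uses plain `struct` rather than anything Kaitai generates.

```python
import struct

# Hypothetical header: 4-byte magic, u32le payload offset, u32le payload length.
HEADER = "<4sII"


def read_record(blob: bytes) -> bytes:
    """Treat the whole file as one blob, then slice into it using the header's offset."""
    magic, offset, length = struct.unpack_from(HEADER, blob, 0)
    if magic != b"DEMO":
        raise ValueError("bad magic")
    if offset + length > len(blob):
        raise ValueError("record runs past the end of the file")
    # "Reach back" into the blob at the absolute position the header told us about.
    return blob[offset:offset + length]


# Build a tiny sample: 12-byte header, 4 bytes of padding, payload at offset 16.
payload = b"hello"
sample = struct.pack(HEADER, b"DEMO", 16, len(payload)) + b"\x00" * 4 + payload
print(read_record(sample))
```

The awkward part in a declarative scheme is exactly that slice: the position is only known after parsing something else first.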

2 comments


thechao 6 years ago

Can you explain more? I deal with a shocking number of poorly supported binary formats, and this tool looks awesome. Buuuuttt... a lot of these formats are “headers with offsets to structured data”. They’re intermixed logs, so multiple incompatible binary formats are freely mixed in the same file.

mdaniel 6 years ago
I can most easily speak to the PDF example, which is harder than the target I was going after due to it being a mixture of text and binary, with a lot of optional behavior (EOL is CR or LF or CRLF seemingly at random, for one painful example)
I took a few minutes just to kick the tires on the startxref of PDF, to get a feel for how the substream business plays out, and then stopped when I got to the part about how the offset position is written in a dynamically sized ascii string but represents an offset in the file
    meta:
      id: pdf
      file-extension: pdf
      endian: le
    seq:
      - id: magic_bytes
        contents: '%PDF-'
    instances:
      startxref_hack:
        type: startxref_hack0
        pos: _root._io.size - 24  # pick a reasonable guess to wind backward
        size: 24 - 5
      eof_marker:
        # contents: '%%EOF'
        type: str
        encoding: ASCII
        size: 5
        # this isn't strictly accurate,
        # due to any optional CRLF trailing bytes
        pos: _root._io.size - 5
    types:
      eat_until_lf:
        seq:
          - id: dummy
            type: u1
            repeat: until
            repeat-until: _ == 0xA
      startxref_hack0:
        seq:
          # this isn't accurate, since we may have jumped
          # into "endobj\n" or worse
          - id: junk
            type: eat_until_lf
          - id: startxref_kw
            contents: 'startxref'
          - id: startxref_crlf
            type: eat_until_lf
          - id: startxref_offset
            type: str
            encoding: ASCII
            terminator: 0xA
As best I can tell, the actual definition would need a hypothetical `repeat: until-backward`: start at EOF (-5 in our case, due to the known EOF constant), read backward until it hits LF (and/or CR!), capture that as the startxref offset, read backward eating CR/LF, then skip backward `strlen("startxref")` bytes. That's when the tomfoolery starts about reading the "xref" stanza, which, again, is an ASCII description of more binary offsets, using zero-padded numbers because of course it does
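For comparison, the backward scan described above is only a few lines of imperative code. This is a rough Python sketch, not a real PDF parser: the function names and the 1024-byte search window are my own choices, and it assumes the trailer is the last thing in the file and that each classic xref entry looks like `0000000017 00000 n` followed by a two-byte line ending.

```python
import re


def find_startxref(data: bytes, window: int = 1024) -> int:
    """Search the tail of the file backward for 'startxref' and return its offset value."""
    tail = data[-window:]
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker in the last %d bytes" % window)
    kw = tail.rfind(b"startxref", 0, eof)
    if kw == -1:
        raise ValueError("no startxref keyword before %%EOF")
    # Whatever sits between the keyword and %%EOF should be a decimal number
    # surrounded by CR, LF, or CRLF in any combination.
    between = tail[kw + len(b"startxref"):eof]
    m = re.fullmatch(rb"[\r\n \t]*(\d+)[\r\n \t]*", between)
    if m is None:
        raise ValueError("startxref is not followed by a decimal offset")
    return int(m.group(1))


def parse_xref_entry(line: bytes) -> tuple:
    """Split one zero-padded xref entry into (offset, generation, kind)."""
    m = re.fullmatch(rb"(\d{10}) (\d{5}) ([nf])\s*", line)
    if m is None:
        raise ValueError("not an xref entry: %r" % line)
    return int(m.group(1)), int(m.group(2)), m.group(3).decode("ascii")


sample = b"%PDF-1.4\n...\nendobj\r\nstartxref\r\n1234\r\n%%EOF\r\n"
print(find_startxref(sample))
print(parse_xref_entry(b"0000000017 00000 n \n"))
```

The point is not that this is hard to write by hand, but that "search backward, then interpret ASCII digits as a seek position" has no obvious spelling in a purely forward-reading declarative spec.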
Don't get me wrong -- it's entirely possible that kaitai is targeting _strictly binary_ formats written by sane engineering teams, but the file format I was going after had a boatload of that jumping-around, repeating structs-of-offsets, too, so my holding up PDF as a worst-case example isn't ludicrous, either