Comment by mdaniel
5 years ago
I can most easily speak to the PDF example, which is harder than the target I was going after because it is a mixture of text and binary with a lot of optional behavior (EOL is CR or LF or CRLF, seemingly at random, for one painful example).
I took a few minutes just to kick the tires on the startxref of PDF, to get a feel for how the substream business plays out, and then stopped when I got to the part about how the offset is written as a variable-length ASCII string but represents a byte offset in the file:
    meta:
      id: pdf
      file-extension: pdf
      endian: le
    seq:
      - id: magic_bytes
        contents: '%PDF-'
    instances:
      startxref_hack:
        type: startxref_hack0
        pos: _root._io.size - 24
        # pick a reasonable guess to wind backward
        size: 24 - 5
      eof_marker:
        # contents: '%%EOF'
        type: str
        encoding: ASCII
        size: 5
        # this isn't strictly accurate,
        # due to any optional CRLF trailing bytes
        pos: _root._io.size - 5
    types:
      eat_until_lf:
        seq:
          - id: dummy
            type: u1
            repeat: until
            repeat-until: _ == 0xA
      startxref_hack0:
        seq:
          # this isn't accurate, since we may have jumped
          # into "endobj\n" or worse
          - id: junk
            type: eat_until_lf
          - id: startxref_kw
            contents: 'startxref'
          - id: startxref_crlf
            type: eat_until_lf
          - id: startxref_offset
            type: str
            encoding: ASCII
            terminator: 0xA
As best I can tell, the actual definition would involve a hypothetical `repeat: until-backward` that starts at EOF (-5 in our case, due to the known EOF constant), reads backward until it hits LF (and/or CR!), captures that as the startxref offset, reads backward eating CR/LF, and skips backward `strlen("startxref")` bytes. That is when the tomfoolery starts about reading the "xref" stanza, which, again, is an ASCII description of more binary offsets, using zero-padded numbers because of course it does.
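To make the backward walk concrete, here is a rough Python sketch of the same steps done imperatively, outside of Kaitai. The function name `find_startxref` and its helpers are made up for illustration, and it assumes the whole file is already in memory, that nothing but CR/LF follows `%%EOF`, and that only CR/LF separates the keyword, the number, and the marker. Real PDFs can break all of those assumptions.

```python
EOL = b"\r\n"  # either byte counts as line-ending material


def find_startxref(data: bytes) -> int:
    """Walk backward from EOF and return the decimal offset after 'startxref'."""

    def skip_eol_backward(p: int) -> int:
        # Step left over any run of CR and/or LF bytes.
        while p > 0 and data[p - 1] in EOL:
            p -= 1
        return p

    def expect_backward(p: int, token: bytes) -> int:
        # Check that `token` ends exactly at p, then step left past it.
        start = p - len(token)
        if start < 0 or data[start:p] != token:
            raise ValueError(f"expected {token!r} ending at byte {p}")
        return start

    pos = skip_eol_backward(len(data))    # optional trailing CR/LF
    pos = expect_backward(pos, b"%%EOF")  # the known EOF constant
    pos = skip_eol_backward(pos)

    # The offset is a variable-length run of ASCII digits; collect it backward.
    end = pos
    while pos > 0 and data[pos - 1:pos].isdigit():
        pos -= 1
    if pos == end:
        raise ValueError("no decimal offset before %%EOF")
    offset = int(data[pos:end])

    pos = skip_eol_backward(pos)
    expect_backward(pos, b"startxref")    # confirm the keyword is where we think
    return offset
```

Called on the raw bytes of a file, this returns the integer that a parser would then seek to in order to start on the xref stanza. It does nothing about the stanza itself.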
Don't get me wrong -- it's entirely possible that Kaitai is targeting _strictly binary_ formats written by sane engineering teams, but the file format I was going after had a boatload of that jumping-around and repeating structs-of-offsets, too, so my holding up PDF as a worst-case example isn't ludicrous, either.