Comment by mdaniel

5 years ago

I can most easily speak to the PDF example, which is harder than the target I was going after because it is a mixture of text and binary, with a lot of optional behavior (EOL is CR, LF, or CRLF, seemingly at random, for one painful example).

I took a few minutes just to kick the tires on the startxref section of PDF, to get a feel for how the substream business plays out, and stopped when I got to the part where the xref position is written as a variable-length ASCII string but represents a byte offset in the file:

    meta:
      id: pdf
      file-extension: pdf
      endian: le
    seq:
    - id: magic_bytes
      contents: '%PDF-'
    instances:
      startxref_hack:
        type: startxref_hack0
        pos: _root._io.size - 24
        # pick a reasonable guess to wind backward
        size: 24 - 5
      eof_marker:
        # contents: '%%EOF'
        type: str
        encoding: ASCII
        size: 5
        # this isn't strictly accurate,
        # due to any optional CRLF trailing bytes
        pos: _root._io.size - 5

    types:
      eat_until_lf:
        seq:
        - id: dummy
          type: u1
          repeat: until
          repeat-until: _ == 0xA

      startxref_hack0:
        seq:
        # this isn't accurate, since we may have jumped
        # into "endobj\n" or worse
        - id: junk
          type: eat_until_lf
        - id: startxref_kw
          contents: 'startxref'
        - id: startxref_crlf
          type: eat_until_lf
        - id: startxref_offset
          type: str
          encoding: ASCII
          terminator: 0xA

As best I can tell, the actual definition would need a hypothetical `repeat: until-backward` that starts at EOF (-5 in our case, due to the known `%%EOF` constant), reads backward until it hits LF (and/or CR!), captures that as the startxref offset, reads backward eating CR/LF, and skips backward `strlen("startxref")` bytes. That is when the tomfoolery starts about reading the "xref" stanza, which, again, is an ASCII description of more binary offsets, using zero-padded numbers because of course it does.
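For comparison, here is roughly what that backward walk looks like when done imperatively. This is only a sketch in Python, not anything Kaitai can express today: the function names `find_startxref` and `parse_xref_entry` and the 1024-byte tail window are my own assumptions, and it ignores incremental updates, xref streams, and damaged files.

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the decimal offset that follows the last 'startxref' keyword.

    tail_size is an assumed search window, not something the format mandates.
    """
    tail = data[-tail_size:]
    idx = tail.rfind(b"startxref")  # search from the end of the window
    if idx < 0:
        raise ValueError("no startxref keyword near end of file")
    # keyword, EOL (CR, LF, or CRLF), ASCII digits, EOL, then the EOF marker
    m = re.match(rb"startxref[ \t]*[\r\n]+[ \t]*(\d+)[ \t]*[\r\n]+%%EOF", tail[idx:])
    if m is None:
        raise ValueError("malformed startxref trailer")
    return int(m.group(1))


def parse_xref_entry(line: bytes):
    """Parse one classic xref entry: 10-digit offset, 5-digit generation, n/f flag."""
    m = re.match(rb"(\d{10}) (\d{5}) ([nf])", line)
    if m is None:
        raise ValueError("malformed xref entry")
    # int() accepts the zero-padded digit strings directly
    return int(m.group(1)), int(m.group(2)), m.group(3) == b"n"
```

The point is that both steps depend on text that has to be converted to a number before you know where to seek next, which is exactly the part a declarative `pos:` expression struggles with.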

Don't get me wrong -- it's entirely possible that Kaitai is targeting _strictly binary_ formats written by sane engineering teams, but the file format I was going after had a boatload of that jumping around and repeating structs-of-offsets, too, so my holding up PDF as a worst-case example isn't ludicrous, either.