Comment by mdaniel

5 years ago

I'm super-on-board with declarative binary parsing schemes (i.e. not lua or python) but my own attempt to use kaitai was met with the same frustration that presumably caused PDF to be missing from the format gallery list (http://formats.kaitai.io/): jump or offset-based structures

At the time, I tried using a stunt like defining one huge blob that "eats" the main file, then reaching back into it as we learned more, but it looks like somewhere along the way they acquired a "substream" behavior (https://github.com/kaitai-io/kaitai_struct_doc/blob/c53060f7...) so maybe it's worth another look

Can you explain more? I deal with a shocking number of poorly supported binary formats, and this tool looks awesome. Buuuuttt... a lot of these formats are “headers with offsets to structured data”. They’re intermixed logs, so multiple incompatible binary formats are freely mixed in the same file.

  • I can most easily speak to the PDF example, which is harder than the target I was going after due to it being a mixture of text and binary, with a lot of optional behavior (EOL is CR or LF or CRLF seemingly at random, for one painful example)

    I took a few minutes just to kick the tires on the startxref of PDF, to get a feel for how the substream business plays out, and then stopped when I got to the part about how the offset position is written in a dynamically sized ascii string but represents an offset in the file

        meta:
          id: pdf
          file-extension: pdf
          endian: le
        seq:
        - id: magic_bytes
          contents: '%PDF-'
        instances:
          startxref_hack:
            type: startxref_hack0
            pos: _root._io.size - 24
            # pick a reasonable guess to wind backward
            size: 24 - 5
          eof_marker:
            # contents: '%%EOF'
            type: str
            encoding: ASCII
            size: 5
            # this isn't strictly accurate,
            # due to any optional CRLF trailing bytes
            pos: _root._io.size - 5
    
        types:
          eat_until_lf:
            seq:
            - id: dummy
              type: u1
              repeat: until
              repeat-until: _ == 0xA
    
          startxref_hack0:
            seq:
            # this isn't accurate, since we may have jumped
            # into "endobj\n" or worse
            - id: junk
              type: eat_until_lf
            - id: startxref_kw
              contents: 'startxref'
            - id: startxref_crlf
              type: eat_until_lf
            - id: startxref_offset
              type: str
              encoding: ASCII
              terminator: 0xA
    

    As best I can tell, the actual definition would need a hypothetical `repeat: until-backward` that starts at EOF (-5 in our case, due to the known EOF constant), reads backward until it hits LF (and/or CR!), captures that as the startxref offset, reads backward eating CR/LF, and skips backward `strlen("startxref")` bytes. Only then does the tomfoolery start: reading the "xref" stanza, which, again, is an ASCII description of more binary offsets, using zero-prefix padded numbers because of course it does
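
    For comparison, here is roughly what that backward walk looks like in imperative code. This is only a sketch in plain Python, not anything Kaitai provides: the function names `read_startxref` and `parse_xref_entry` are made up for illustration, it assumes only CR/LF separate the trailer tokens, and it assumes the classic 20-byte xref entry layout (10-digit offset, 5-digit generation, `n` or `f`, 2-byte line ending).

```python
from typing import Tuple

EOL = b"\r\n"            # assumption: only CR and LF separate the trailer tokens
DIGITS = b"0123456789"
EOF_MARKER = b"%%EOF"
KEYWORD = b"startxref"


def _skip_eol_backward(data: bytes, pos: int) -> int:
    """Move pos left past any run of CR/LF bytes ending just before pos."""
    while pos > 0 and data[pos - 1] in EOL:
        pos -= 1
    return pos


def read_startxref(data: bytes) -> int:
    """Walk backward from end of file and return the startxref offset."""
    pos = _skip_eol_backward(data, len(data))   # optional EOL after %%EOF
    if data[pos - len(EOF_MARKER):pos] != EOF_MARKER:
        raise ValueError("file does not end with %%EOF")
    pos -= len(EOF_MARKER)
    pos = _skip_eol_backward(data, pos)         # EOL between number and %%EOF
    end = pos
    while pos > 0 and data[pos - 1] in DIGITS:  # dynamically sized ASCII number
        pos -= 1
    if pos == end:
        raise ValueError("no ASCII digits before %%EOF")
    offset = int(data[pos:end])
    pos = _skip_eol_backward(data, pos)         # EOL between keyword and number
    if data[pos - len(KEYWORD):pos] != KEYWORD:
        raise ValueError("startxref keyword not found")
    return offset


def parse_xref_entry(entry: bytes) -> Tuple[int, int, bool]:
    """Parse one 20-byte xref entry into (offset, generation, in_use).

    For a free entry ('f') the first field is the next free object number,
    not a byte offset.
    """
    if len(entry) != 20:
        raise ValueError("xref entries are expected to be 20 bytes")
    kind = entry[17:18]
    if kind not in (b"n", b"f"):
        raise ValueError("entry type must be 'n' or 'f'")
    return int(entry[0:10]), int(entry[11:16]), kind == b"n"
```

    With a trailer of `startxref\r\n1234\r\n%%EOF\r\n`, `read_startxref` returns 1234, and `parse_xref_entry(b"0000000017 00000 n \n")` returns `(17, 0, True)`. The point is that every step depends on a position computed from text found by scanning in reverse, which is exactly the part a forward-only declarative `seq` has no way to say.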

    Don't get me wrong -- it's entirely possible that kaitai is targeting _strictly binary_ formats written by sane engineering teams, but the file format I was going after had a boatload of that jumping-around, repeating structs-of-offsets, too, so my holding up PDF as a worst-case example isn't ludicrous, either