← Back to context

Comment by trevor-e

5 years ago

While we are on the topic of file formats, I stumbled across a cool project the other day that aims to make parsing well-known binary formats much easier: http://kaitai.io/

In high school I was obsessed with the game Starcraft. It came with its own map editor that would save to its own proprietary format. Some smart people came along and reverse engineered that format and allowed us to do all kinds of neat things we weren't supposed to. I see modding communities for games are more popular than ever, and finding this project brought back lots of great memories.

I'm super-on-board with declarative binary parsing schemes (i.e. not lua or python) but my own attempt to use kaitai was met with the same frustration that presumably caused PDF to be missing from the format gallery list (http://formats.kaitai.io/): jump or offset-based structures

At the time, I tried using a stunt like defining one huge blob that "eats" the main file, then reaching back into it as we learned more, but it looks like somewhere along the way they acquired a "substream" behavior (https://github.com/kaitai-io/kaitai_struct_doc/blob/c53060f7...) so maybe it's worth another look

  • Can you explain more? I deal with a shocking number of poorly supported binary formats, and this tool looks awesome. Buuuuttt... a lot of these formats are “headers with offsets to structured data”. They’re intermixed logs, so multiple incompatible bi art formats are freely mixed in the same file.

    • I can most easily speak to the PDF example, which is harder than the target I was going after due to it being a mixture of text and binary, with a lot of optional behavior (EOL is CR or LF or CRLF seemingly at random, for one painful example)

      I took a few minutes just to kick the tires on the startxref of PDF, to get a feel for how the substream business plays out, and then stopped when I got to the part about how the offset position is written in a dynamically sized ascii string but represents an offset in the file

          meta:
            id: pdf
            file-extension: pdf
            endian: le
          seq:
          - id: magic_bytes
            contents: '%PDF-'
          instances:
            startxref_hack:
              type: startxref_hack0
              pos: _root._io.size - 24
              # pick a reasonable guess to wind backward
              size: 24 - 5
            eof_marker:
              # contents: '%%EOF'
              type: str
              encoding: ASCII
              size: 5
              # this isn't strictly accurate,
              # due to any optional CRLF trailing bytes
              pos: _root._io.size - 5
      
          types:
            eat_until_lf:
              seq:
              - id: dummy
                type: u1
                repeat: until
                repeat-until: _ == 0xA
      
            startxref_hack0:
              seq:
              # this isn't accurate, since we may have jumped
              # into "endobj\n" or worse
              - id: junk
                type: eat_until_lf
              - id: startxref_kw
                contents: 'startxref'
              - id: startxref_crlf
                type: eat_until_lf
              - id: startxref_offset
                type: str
                encoding: ASCII
                terminator: 0xA
      

      As best I can tell, the actual definition would involve a hypothetical `repeat: until-backward` where it starts at EOF (-5 in our case, due to the known EOF constant), reads backward until it hits LF (and/or CR!), captures that as the startxref offset, reads backward eating CR/LF, skips backward `strlen("startxref")` bytes, and then is when the tomfoolery starts about reading the "xref" stanza, which, again, is a ascii description of more binary offsets, using zero-prefix padded numbers because of course it does

      Don't get me wrong -- it's entirely possible that kaitai is targeting _strictly binary_ formats written by sane engineering teams, but the file format I was going after had a boatload of that jumping-around, repeating structs-of-offsets, too, so my holding up PDF as a worst-case example isn't ludicrous, either