Comment by saagarjha

8 hours ago

I have to admit that using C syntax as a string to parse something from Python is definitely a choice. I'm not even sure I would use C structs to lay things out in C…

I asked an LLM to rewrite it for me using the Python built-in struct module, and it gave me this:

   import sys
   import struct
   from collections import namedtuple
   
   # Bake the layout once into a reusable, precompiled object.
   HEADER = struct.Struct(">4sIIIQQ16sQQIIII")
   
   # struct only knows positions, not names — pair it with a namedtuple
   # to recover the named-field access that cstruct gives you for free.
   Header = namedtuple("Header", [
       "magic", "field4", "field8", "fieldC",
       "field10", "field18", "field20",
       "field30", "field38",
       "field40", "field44", "field48", "field4C",
   ])
   
   with open(sys.argv[1], "rb") as fh:
       header = Header._make(HEADER.unpack(fh.read(HEADER.size)))
   
   print(header)

To me, this seems significantly less readable... less Pythonic, even. The printed output is also less readable.

  • No, I think Python's struct module is also really bad. My point is: if you are making a new DSL for laying out arbitrary formats, why not do something better than what we have?

    • Author here, this is a valid point but there are also valid reasons to choose C structures. The larger framework that this is a part of is primarily targeted towards people working in cybersecurity, not software engineers. Cybersecurity people are very often not great software engineers and there is a high throughput of “throwaway” scripts, or “make a quick hacky change”. C is commonly already well understood, a bespoke DSL usually is not and requires a learning step. You can “hit the ground running”, so to say.

      And, as a bonus, creating, say, a filesystem implementation is now often as easy as copy/pasting existing C structure definitions, either from the original source (which is usually C) or from reversing tools such as IDA/Ghidra.

      There’s no right or wrong way in my opinion, just preferences.

    • I would assume dissect.cstruct was written for interop with C programs that use C structs, or for formats documented as C structs, not as a greenfield tool for arbitrary formats.

      C structs seem less bad than Python's struct format strings, so why not use them? In particular, why write a struct parser and invent a new DSL for it when you can use an existing, well-known DSL you might already understand?

    • OK, so what's your alternative then? It's easy to say you don't like something, but the onus is on you to show there's something actually better.

      The library used in the author's post seems perfectly readable to me, enough that it didn't even register until I read your comment. Could it be tweaked slightly to not use C syntax? Sure, but it's still going to need roughly the same pattern of identifier + type (including size). Types in C are straightforward so long as you don't have functions/pointers (which have the "inside out" problem, but they're not needed for binary encodings), so you're going to be looking at pretty trivial changes to syntax. Certainly not enough to warrant this level of quibbling.
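
To make the "identifier + type (including size)" point concrete, here is a minimal sketch of the same 80-byte header written as plain Python name/format pairs rather than C syntax. It uses only the standard struct module; `FIELDS`, `make_parser` and `parse_header` are hypothetical names invented for this illustration, and the field names are simply reused from the snippet earlier in the thread.

```python
import struct
from collections import namedtuple

# Hypothetical layout: (field name, struct format code) pairs.
# Names and sizes mirror the header snippet earlier in the thread.
FIELDS = [
    ("magic", "4s"),
    ("field4", "I"), ("field8", "I"), ("fieldC", "I"),
    ("field10", "Q"), ("field18", "Q"),
    ("field20", "16s"),
    ("field30", "Q"), ("field38", "Q"),
    ("field40", "I"), ("field44", "I"), ("field48", "I"), ("field4C", "I"),
]


def make_parser(fields, byte_order=">"):
    """Build a (Struct, namedtuple class) pair from name/format pairs."""
    layout = struct.Struct(byte_order + "".join(code for _, code in fields))
    record = namedtuple("Header", [name for name, _ in fields])
    return layout, record


HEADER, Header = make_parser(FIELDS)


def parse_header(data):
    """Unpack the first HEADER.size bytes of data into a Header."""
    if len(data) < HEADER.size:
        raise ValueError(f"need {HEADER.size} bytes, got {len(data)}")
    return Header._make(HEADER.unpack_from(data))
```

Each name now sits next to its type, which is roughly what the C declaration gives you, only with struct format codes in place of C type names.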

In the video/image space, most code we deal with day to day is still C. There are lots more Rust plugins in the GStreamer ecosystem, but 90%+ is still C.

  • The article is about disk images though :)

    But yeah, while I think the `cstruct` helper function to describe a binary data layout in Python is more elegant than the builtin alternatives, it would have been much less painful to just go with a minimal C command line program (or any other programming language where a struct directly maps to memory). Python and most other scripting languages have been built for manipulating text data, but suck when working with binary data.
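
For comparison, the standard library's ctypes module gets fairly close to "a struct that maps directly to memory". The following is only a sketch under assumptions: `RawHeader` and `read_header` are hypothetical names, the field names are borrowed from the struct-module snippet earlier in the thread, and the layout relies on natural alignment happening to produce no padding for this particular header.

```python
import ctypes


class RawHeader(ctypes.BigEndianStructure):
    # Hypothetical 80-byte big-endian header; every field is already
    # naturally aligned, so no packing directive is needed here.
    _fields_ = [
        ("magic", ctypes.c_char * 4),
        ("field4", ctypes.c_uint32),
        ("field8", ctypes.c_uint32),
        ("fieldC", ctypes.c_uint32),
        ("field10", ctypes.c_uint64),
        ("field18", ctypes.c_uint64),
        ("field20", ctypes.c_ubyte * 16),
        ("field30", ctypes.c_uint64),
        ("field38", ctypes.c_uint64),
        ("field40", ctypes.c_uint32),
        ("field44", ctypes.c_uint32),
        ("field48", ctypes.c_uint32),
        ("field4C", ctypes.c_uint32),
    ]


def read_header(data):
    """Copy the leading bytes of data into a RawHeader instance."""
    size = ctypes.sizeof(RawHeader)
    if len(data) < size:
        raise ValueError(f"need {size} bytes, got {len(data)}")
    return RawHeader.from_buffer_copy(data[:size])
```

Fields are then read as attributes, e.g. `read_header(buf).field10`. Whether this is less painful than a small C program is a matter of taste; it mainly shows that the direct-mapping approach is available from Python too.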