
Comment by ajross

3 months ago

It's actually the simplest scheme. Reparse from the top whenever you need to query a setting. When you see one, exit. No need to even bother to store an intermediate representation. No idea if this matches the actual ssh implementation, but that's the way many historical parsers worked. The idea of cooking your text file on disk (into precious RAM!) is fairly modern.
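For illustration, the scheme described above (rescan the file from the top on every query, stop at the first match, keep nothing in memory between queries) could be sketched roughly like this. The `lookup` name, the whitespace-separated `Key value` line format, and the sample settings are assumptions made for the example; this is not taken from the ssh source.

```python
import os
import tempfile


def lookup(path, key):
    """Rescan the file from the top and return the first value for key."""
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0].lower() == key.lower():
                return parts[1]  # early exit: first match wins
    return None  # key not present anywhere in the file


# Demo on a throwaway file (hypothetical contents).
SAMPLE = "# demo config\nPort 2222\nUser alice\nPort 22\n"
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as tmp:
    tmp.write(SAMPLE)
    demo_path = tmp.name

port = lookup(demo_path, "port")         # "2222": the scan stops at the first Port line
missing = lookup(demo_path, "Hostname")  # None: the whole file is read, no match
os.unlink(demo_path)
```

Every call reopens and rereads the file, so the cost of a query is one scan up to the first match, and nothing is retained afterwards.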

Nope, the actual ssh implementation parses all the config files once, at startup, using buffered file I/O and getline(). That means that on systems with a modern libc, the whole config file (if it's small enough, less than 4 KiB IIRC?) gets read into RAM and then getline() results are served from that buffer.

The scheme you propose is insane, and if it was ever used (can you actually back that up? The disk I/O would kill your performance for anything remotely loaded), it was rightfully abandoned for a much faster and simpler scheme.
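As a rough sketch of the parse-once approach this comment argues for: read the config a single time at startup into an in-memory table, then answer every later query from that table. The `parse_config` name, the `Key value` line format, and the first-value-wins rule are assumptions chosen for the example, not a description of the real ssh code.

```python
import io


def parse_config(stream):
    """Read every line once and return a dict of settings."""
    settings = {}
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # ignore lines without a value
        key, value = parts[0].lower(), parts[1]
        settings.setdefault(key, value)  # keep the first value seen for a key
    return settings


# Parsed once "at startup"; io.StringIO stands in for an open config file.
CONFIG = parse_config(io.StringIO("# demo\nPort 2222\nUser alice\nPort 22\n"))

port = CONFIG.get("port")         # answered from memory, no file I/O
missing = CONFIG.get("hostname")  # None
```

After the one pass, each query is a dictionary lookup. The trade-off is that the parsed table stays resident for the life of the process.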

  • > getline() results are served from that buffer.

    So... it doesn't parse them once! It just does its own[1] buffering layer and implements... exactly the algorithm I described? Not seeing where you're getting the "Nope" here, except to be focusing on the one historical note about RAM that I put in parentheses.

    [1] Somewhat needless given the OS has already done this. It only saves the syscall overhead.

    • Sorry, my comment was intended to be a reply to the one of yours that said

          I/O is done piecewise, a line at a time. The file is never "loaded up". Again
          you're applying an intuition about how parsers are presented to college
          students (suck it all into RAM and build a big in-memory representation of
          the syntax tree) that doesn't match the way actual config file parsers work
          (read a line and interpret it, then read another).
      

      So, the whole file is usually loaded up (if it's short enough). At that point you might as well parse all of it once, instead of re-reading it from disk and redoing the same work over and over; parsed configs (if they are parsed into flags/enums, and not literal strings) usually take about the same memory as, or less than, a FILE structure from libc does on the whole.

      The complexity of the algorithm is about the same: either the early exit is there or it isn't (in fact, the version with the early exit, now that I think of it, has larger cyclomatic complexity, but whatever).

Loading up your parsing code and reopening the file every time a setting is queried sounds to me like it would increase the average memory use of most programs.

  • The ssh config format has almost no context, and the code is static and always "loaded up". I can all but guarantee this isn't correct. Modern hackers tend to wildly overestimate the complexity of ancient tasks like parsing.

    • If you're actually concerned about the handfuls of bytes a settings object would take, you would make the page/segment containing parser code able to be unloaded from memory.

  • You don't care about average memory use, you care about peak memory use.

    • Same criticism. When the program is in the middle of busy runtime activity, with all the memory that entails, it's the worst time to also load up the parser.

    • Doesn't really sound much better. You still load up the file(s) and the parser either way, so parsing everything once vs. on demand is just a question of computation time, and considering how many config options are used, the on-demand approach just seems really wasteful, especially after startup.
