Comment by tux3

4 years ago

The moral of the story, as far as I'm concerned: do NOT parse strings in C! Use a library, preferably in a higher-level language.

C string handling is a mess of viciously surprising APIs; juggling those particular footguns is almost certainly not your least bad option.

I used to work for people that processed emails and loaded them into databases with perl scripts. One day someone asked me if I could help, because the script they were running on a batch of emails was inexplicably freezing or running out of memory, I forget the exact details.

There were maybe a few thousand or tens of thousands of emails, and so, I came to look at the issue with my usual attitude which is that if it isn't running instantly on a GHz computer of any kind, something must be horrifically wrong. Not to mention running out of memory.

It turned out that essentially every email was cc'ed to several thousand people; you know, the thread went to a whole company or something. The file size itself wasn't huge in the scheme of things, but our perl script (written by someone much more elevated than me, with a master's degree) read the whole file at once into memory, and expanded the headers to perl hashes, multiplying the size.

But there was no reason the whole thing had to be in memory. People learn to program in CS 101 these days as if memory is infinite, I guess, because gigabytes might as well be infinity; but multiply a moderate per-header overhead by thousands of recipients by tens of thousands of emails, and suddenly all your gigabytes are gone.

Another thing I remember was when I was given a T-SQL report that typically ran on a few thousand documents, and was asked to run it on all of the millions on all our databases. It was hundreds or thousands of times too slow, but it was running a SQL statement in a procedural loop, once per document, and the whole thing could be turned into a single set-based statement.

So my experience has basically taught me that if something is within a factor of two of optimal, it's good enough, but an incredible amount of code, regardless of high or low level, is way worse than that.

But I've never gotten much pay or glory for fixing this sort of thing. Sometimes I daydream about being a high paid consultant who goes and turns three nested loops into two, or two into one.

  • The other week I optimized some processing in a web app from 3 minutes to 11 seconds.

    The customer was ecstatic, but I still think it is 2 orders of magnitude too slow.

  • There is a whole lot of low-hanging fruit in the world. When I am new at a job, I am impressed if I don't find several improvements of an order of magnitude or two.

Funny, the ergonomics of the Swift string API are so bad that I've started learning a lower level language for parsing, etc.

Here's my favorite WTF: https://christiantietze.de/posts/2020/01/string-index-offset...

  • You can fault the docs but what's the problem with the API? Why should you be surprised that accessing a collection with 'nil' index is a runtime error? What else could it be?

    A simple fix in the doc seems to solve the confusion:

    "Returns an index that is the specified distance from the given index, unless that distance is beyond a given limiting index [in which case it returns nil]".

    It does say "returns an index ... unless ..". So yeah, a bit too terse. But what is the issue with the API?

        //  /!\  Warning! Do not use in production!   /!\
        let s = "Swift"
        if let i = s.index(s.startIndex, offsetBy: 5, limitedBy: s.endIndex) {
           print(s[i])
        }
    

    "What does it print? Nothing, you hope?"

    Seriously? 'print(s[nil])' should print nothing? How about 'let c = s[nil]'. Should that just silently pass? That is the runtime error, btw. It won't even get to print(). (Valid to question the entirely non-informative error, however.)

    • If s.index() returned nil, the "if" would test false and the s[i] would not be reached.

      The problem is that it returns non-nil, but the _limit_ is broken in this case: using s.endIndex as a limit means you can get non-nil but bogus indices returned. And yes, this is the fault of the docs for using a broken last arg to the API, but there's no really clean way to use this API as designed, afaict. At least not if you want to limit to "end of string" as opposed to "some index into the string that I already know to be valid".

      1 reply →

    • The problem is that `s.index` with an offset that lands exactly on the `limitedBy` limit returns a non-nil index, rather than nil, but that index is invalid (out of bounds) and causes the program to blow up...

          let s = "Swift"
          
          print(s.index(s.startIndex, offsetBy: 4, limitedBy: s.endIndex))
          print(s.index(s.startIndex, offsetBy: 5, limitedBy: s.endIndex))
          print(s.index(s.startIndex, offsetBy: 6, limitedBy: s.endIndex))
      

      outputs:

          Optional(Swift.String.Index(_rawBits: 262401))
          Optional(Swift.String.Index(_rawBits: 327681)) <-  this is unexpected
          nil

      1 reply →

For me the moral of the story is: do not use (whatever)scanf() for anything other than toy programs. In most cases, implementing your own tokenizer is significantly easier than reasoning about what scanf() really does (even ignoring accidentally-quadratic implementation details) and whether it can be adapted to your use case. For both of these cases of reading numbers, that means using str(c)spn() to get the length of a candidate token and then strto-something() to convert it.

  • Can you ELI-CS-undergraduate? Tokenizing is normally for when you're writing a compiler or a DSL, right?

    • Yes. But any data format you read, particularly any plaintext data format you read, is essentially interpreting or compiling a DSL. On a typical job, people are writing compilers much more often than they think!

    • It is a general term for the process of breaking a string into "tokens" which have a sort of meaning. Definitely a common task in compilers, but not limited to it.

I would argue the reverse - there is a higher chance of these accidentally-quadratic problems with more abstraction layers, convenience functions, and advanced language syntax. But I agree we shouldn't write parsing in C, though for other reasons :)

  • I say there's an equal chance, and it's equally easy to fix, but a high-level language provides fewer distractions and temptations to work on small optimizations which don't matter. Getting in the weeds with the details is how you lose sight of the big picture and allocate your time poorly.

My biggest issue with C strings is that by requiring a zero terminator and a single `char const *`, they force a lot of copies (when truncating) and calls to strlen() (the caller doesn't know the length).

Had it been a `char const *`/size_t pair, or a pair of `char const *` for first/last, it would be both safer and faster. I prefer the latter, as a pair of pointers doesn't require updating both values when iterating with the first pointer.

Wouldn't this be an argument to go in the opposite direction? If you are using high-level functionality whose implementation details you don't know, you are running the risk of unintended consequences.

I am a C programmer who has implemented string-to-number parsing for this very reason. I know exactly what it does and how fast it is.

If you do use code you didn't write, the chance of a standard library being poorly implemented is probably lower than for most other libraries, so picking a non-standard lib as a guarantee against bad performance seems misguided.

  • I think it goes both ways in that you either go full low level and write yourself everything (for questionable benefits), or you use a (possibly higher level) language with sane standard library, but the important thing is the quality of said library.

    • I find that writing everything yourself, especially simple things like a text-to-integer parser, is very valuable because it takes very little time and it levels up your understanding of the system. I'm starting to believe that you rarely understand something until you have implemented it. Therefore, implementation is the best way to learn.

      1 reply →

I've taken it a step further (or three). I'm building a whole computer on the principle[1] (one possible way of looking at it) of avoiding parsing as far as possible.

https://github.com/akkartik/mu

Mu takes the dwm[2] principle of avoid-config-files/modify-sources-to-reconfigure to the limit. The idea is that there are at any given time only 3 languages in the computer:

1. A self-hosted notation for a subset of 32-bit x86 machine code

2. A memory-safe statement-oriented language where most statements map 1:1 to machine-code instructions. Written in level 1 above.

3. A high-level interpreted language.

The vision is to fix 1 and 2 but allow 3 to fork wantonly. If you want a Lisp for your HLL, make a fork. If you want a Python-like syntax, make a fork. But, and this is the important part, in any given computer/repo there is only ever one language. Only one non-trivial parser. (Levels 1 and 2 have extremely uniform syntax.)

As best I can tell, the #1 way to avoid the need to run a fuzzer is to avoid writing parsers. Just say no.

[1] https://lobste.rs/s/to8wpr/configuration_files_are_canary_wa...

[2] https://dwm.suckless.org

  • Hi, levels 1 and 2 look really cool, but I may not understand the point of only having these languages “running” on a computer. Both 2 and 3 are interpreted, and the interpreter is the area you want to minimize?

    What about a program written in 3 that can compile to either 1 or 2? Why would it hurt anything to have a different language somehow made possible to run here?

    • I'm not sure I follow your question, but the idea is that the computer the end user receives has a single opinionated language for programming it. Kinda like Basic for microcomputers. The end user is of course welcome to build their own languages. That is encouraged! But multiple languages make the computer more difficult for others to comprehend. My goal is to convince you that, all things being equal, a computer with fewer languages is in your best interest. Less code, fewer bugs, fewer security holes.

      (Level 2 is not interpreted, it's compiled. Skipping level 2 would be totally in line with my principle above. But it would be more painful, I think. You'd basically be programming in machine code.)

      5 replies →

Wrong conclusion. The problem is the broken C string library in POSIX. Don't use it! Wrong design. Zero-termination is too fragile and the cause of the evil.

Rather, use buffers with known lengths, and for strings you need to know the string (= Unicode) rules. Nobody does that. Know libunistring. I have my own, because libunistring is too slow, but know it.

For my string libraries I rather follow the STL, with ranges/views and a boehmgc core, and const vs. dynamic strings. So I will not step into the accidental-strlen and buffer-overflow traps.

E.g. for input buffers, know whether they are zero-terminated and const. With the GTA post I pointed out the libFuzzer design flaw: it gives you an ASCII input buffer which is not zero-terminated. Even strtol/strtod cannot be used then. You need to copy the buffer, terminate it, and only then can you use the broken string libc. Not talking about sscanf, which I only use as sscanf_s if available, or _snscanf/_snscanf_s. Microsoft does many things wrong, but its libc is far superior to glibc, BSD, or musl. musl is better than glibc, but also lacks in this regard.

I think a better moral is "don't roll your own parser unless your core competency/product is the parser". Especially in a corporate situation.

  • Don't roll your own parser? How the hell would you get anything done? Unless you don't count regular expressions or something, I can't imagine somehow avoiding problems requiring parsers, especially on any unix-based system.

    • There are a lot of tasks that only need to work with existing, commonplace file formats with existing high-quality parsers (e.g., JSON, XML, sqlite, ...).

      1 reply →

    • >How the hell would you get anything done?

      This is a sign that writing (trivial) parsers is a core competency for you. However, that doesn't mean that writing all parsers is your core competency. Especially not for an industry standard like JSON, which should have plenty of libraries that take care of the problem.

  • Disagree.

    Writing parsers is the same level of complexity as FAANG-level interview questions. Any decent developer should be able to write a parser from scratch.

    • Any decent developer can write a parser. Most can't write a fast, secure, and robust one for a complex grammar (or at least don't have enough time to before they need to get back to whatever their main job is).

Agreed, that would never happen if a developer simply used parseInt(text.slice(offset, offset+40)).