Simple Dynamic Strings library for C

12 years ago (github.com)

I think the library is great, and don't have a whole lot to say about it, but wanted to mention one tangentially related thing:

The README on this repo is awesome. Opening up with advantages and disadvantages? Awesome. Plenty of code examples covering all of the major use cases? Awesome. Quick overview of the internals? AWESOME! Quick two line note about how to use the library in your project? Awesome.

I'm tempted to rant about how I wish documentation was taken more seriously, and that programmers seem to make it a point of pride that spending the first half hour with a library figuring out how to actually use it is just something we have to deal with as programmers, but I won't do so aside from this single sentence.

  • man pages used to be treated with the same reverence, but nowadays nobody seems to bother. I have pretty extensive man pages for some of my open source stuff, but it doesn't seem to help with users downstream.

    • Don't get discouraged; people who greatly appreciate a well written man page are still around. They're the users you might not hear about because the fine documentation actually helped them. :-)

      2 replies →

    • To be honest, I pretty much hate man pages.

      They almost never have examples(Is that some sort of rule?), they have no "quick summary of the shit you use 95% of the time", and they get generally written as a novel, seemingly from the perspective of the developer of the util rather than consumer.

      Mind you, I have used man pages many times before, but only because it was the best source of information, not because it was a particularly efficent one.

      The Markdown README of this string library is a thousand times better than any man page I have seen.

      11 replies →

    • That would be because GNU decided that man pages weren't good enough for them and decided to use info pages.

Mixing signed and unsigned arithmetic operands.

Not doing trivial overflow checking before arithmetic.

Using results from said arithmetic as indices or starting points for memory writes.

My gut says that any application using this library to process untrusted input is exploitable.

Of course, it was never advertised as being secure :-)

Which is unusual, as many "better strings" libraries make claims about the security of the traditional way of string handling (and then go on to do it wrong).

Edit: To give a concrete example, someone here recommended BString just a few days ago. That library has a "security statement" in which it claims to prevent overflow type attacks by using signed integers instead of size_t, and then checking whether the result is negative after arithmetic. But signed overflow is undefined behavior in C, and these are not guaranteed to wrap around. And yes such things have been exploited. I don't know how likely or easy bstring is to "own" but in any case it's doinnit wrong.

And yes I checked the code, it does what it claims to do..

  • Hello clarry, I don't think there are APIs that if not grossly misused will lead to security issues in general. There is one issue I'm aware of that perhaps is not easily exploitable but surely is unexpected behavior, that is, overflow when you reach a 2GB string: that requires a check in sdsMakeRoomFor() for sure.

    Note that this could be fixed by using uint64_t type for example in the header, however in the original incarnation of SDS inside Redis this was not possible for memory usage concerns. In the standalone library I believe it is instead a great idea, since 2GB is no longer an acceptable limit.

    From the point of view of the security the most concerning function looks like sdsrange(), there are sanity checks in place to avoid that indexes received from the outside can lead to issues, but I'll do a security review in the future in order to formally check for issues.

    • If it's any reassurance, I had a look at it, and although there is unnecessary signed/unsigned arithmetic, I could not find anything exploitable. Change the types of variables representing an object size, length, array index, and so on to size_t (from int) and it will be fine.

      2 replies →

    • I think when doing malloc(n * m) or similar the most cautious thing to do would be to check for overflow even if you don't think it's exploitable. Especially for a library. Witness for example that OpenBSD's calloc does an overflow check.

      I often leave this out of my own code, but not without feeling somewhat guilty about it.

    • In general, just because you cannot exploit something doesn't mean nobody else can. What to you seems at worst "merely" unexpected behavior is a very good starting point for someone who actually writes exploits.

      Consider what happens if an attacker controls initlen passed to sdsnewlen() (also via sdsnew() if he controls the initial string). He can overflow the arithmetic on malloc/calloc. Now either the memcpy or the NUL-byte assignment can write out of bounds, which is potentially exploitable, especially if the attacker has a nearly unbounded number of attempts (which they often do as far as online-facing services go) or if they can do things to affect the program's memory layout -- which they likely can if they're feeding the program with data.

      But if these two aren't enough for a successful exploit (I wouldn't make any bets!), the initlen is assigned to sh->len (whoops, conversion to signed... and possible overflow), which then can completely throw off the arithmetic in sdsMakeRoomFor. Here I would actually be quite willing to bet an attacker can arrange these numbers so that he can do a more controlled out-of-bounds write in one of the functions that rely on sdsMakeRoomFor.

      These two functions aren't necessarily the only problematic ones; they're just the ones I noticed after a minute of scanning.

      And I don't really agree with what eliteraspberrie said; using the right types is a good start but it's absolutely not fine until after you actually do all the overflow checking; for an example of this, see the OpenBSD man page for malloc: http://www.openbsd.org/cgi-bin/man.cgi?query=malloc

      Rule of thumb: whenever you do any arithmetic, ask yourself, can it overflow (how do you know it doesn't?) and who's in charge of making sure it doesn't? In this case it's your API's responsibility. This is exactly the same thing you do whenever you call a function: you need to know if it can fail, and if it can, you need to check the return value and do the right thing.

      In the years I've spent reading CVEs (and the broken code that caused the alert), proof-of-concept exploits as well as real world exploits, I've learned that some incredibly subtle and seemingly insignificant things can be exploited. As a general rule, don't worry about exploitability, it's usually not worth your time to prove one way or another. Just fix the code whenever you see something that could go wrong. Make it easy for the next guy who audits your code; so he too can tell that your code is secure, just by seeing the right checks in the right place.

      If you need a refresher on security issues in C code, I recommend Chapter 6 from The Art of Software Security Assment, made available free of charge by the authors: http://ptgmedia.pearsoncmg.com/images/0321444426/samplechapt...

      It's a good read for anyone doing C these days, and covers a good deal of problems, including the one discussed here. It comes with snippets of known vulnerable real world code too.

  • > prevent overflow type attacks by using signed integers instead of size_t, and then checking whether the result is negative after arithmetic

    Whoa!

    As if checking for overflow with unsigned is impossible...

  • > My gut says that any application using this library to process untrusted input is exploitable

    Well, it is extracted from Redis so that should be easy to test

  • I just started working at a place where size_t is being cast to "int" all over the code, with similar reasoning. Thanks for the ammunition.

I forked sds.c from Redis given that it was very useful for the project for years, so I guess it may be useful in other contexts as well. There are a few changes at API level compared to the Redis sds.c file, however most of the work went into documenting it.

Nice, thanks for contributing (even more).

I haven't read much of the code at all, but a minor suggestion would be to change:

    struct sdshdr *sh = (void*)(s-(sizeof(struct sdshdr)));

to

    struct sdshdr *sh = (void*)(s-sizeof *sh);

since the entire idea is that the pointer on the left-hand side is to the type whose size should be subtracted, I think it's better not to repeat the type but to "lock it" to the pointer instead. This also (of course) means we can drop the parenthesis with sizeof, since those are only needed when its argument is a type name.

  • Thanks unwind, this is much better indeed! Fixing.

    EDIT: Fixed here: https://github.com/antirez/sds/commit/c636fc6cd25e455a75dca2...

    Seems everything is still working but I need definitely more unit tests... Note everything is covered right now.

    • A more correct version might be:

          #include <stddef.h>
      
          int buf_offset = (int)offsetof(struct sdshdr, buf);
          struct sdshdr *sh = s - buf_offset;
      

      This is because the compiler might insert padding between your struct elements and the flexible array member. In your case, you're using only int types in the header, so padding shouldn't be an issue on most architectures, but consider the following:

          struct header {
              int i;
              char c;
              char data[];
          };
      

      sizeof(struct header) on my machine is 8, but "data" starts at an offset of 5 from the beginning of the struct. So, to go from "data" pointer to the pointer representing the beginning of the struct, you will need to subtract 5, not 8. Here is a test program:

          #include <stdio.h>
          #include <stddef.h>
      
          struct header {
              int i;
              char c;
              char data[];
          };
      
          int main(void)
          {
              struct header h = {0};
              printf("offsetof %zu\n", offsetof(struct header, data));
              printf("sizeof %zu\n", sizeof h);
              printf("start %p, data start %p, delta %d\n",
                      (void *)&h, (void *)(h.data), (int)((char *)h.data - (char *)&h));
      
              return 0;
          }
      

      http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n983.htm is relevant for this case as well.

      4 replies →

  • Heh, you're the first person I meet who advocates dropping the parens around the sizeof argument in any situation. That does look terribly alien to me.

    • Really? I always do. If you don't, it makes the variable look like a type. It's the same reason why I dislike extra parens around the value in the return statement. Don't make it look like something it isn't.

This is neat, but please add a pointer to deallocation function to sds header and use it in sdsfree. This will allow to return sds strings from functions in shared libraries. A common problem it that the user of the library might use a different version of libc than the shared library (this is especially typical on Windows), and when it calls sdsfree on an sds string allocated with a different libc, something awful will happen (in the best case, it will just leak memory). By storing a pointer to the deallocation function in the code which allocated the string you can make sure that it is always released by the same libc version that allocated it.

The biggest missing thing in C imho is destructors, because you need to manually clean up everything whenever you leave a scope or function. That means calling "free" at every exit point.

Or alternatively, putting one "cleanup:" label at the end of the function and using "goto cleanup;" instead of break or return everywhere.

But goto is considered harmful, so that handy pattern seems unclean.

My question is: do you consider the "goto cleanup" pattern clean or not, and if not, what are better alternatives?

  • Many people are correctly citing the usage of goto in the Linux kernel. One thing I would like to add is what Greg Kroah-Hartman says about goto in one of his talks: only jump forward, never backwards. This is precisely what you do when you goto cleanup.

    I, personally, think the pattern is clean. A labelled break is basically a goto anyway, yet not always available in your language. I never liked needing a flag to exit some nested structure early. I don't find reading that clean at all.

    Also, a minor rant here: Dijkstra's "Go To Statement Considered Harmful" is often mentioned, sometime wordlessly, when goto is brought up, but it seems many of those people misunderstood what was actually being argued. Although he mentions being "convinced that the go to statement should be abolished from all 'higher level' programming languages," his main gripe was that the "unbridled use of the go to statement has an immediate consequence that it becomes terribly hard to find a meaningful set of coordinates in which to describe the process progress." He felt it was "as it stands is just too primitive; it is too much an invitation to make a mess of one's program." So, as others have pointed out already, his main issue was with goto being used in a way that resulted in unstructured, hard to understand programs.

    • I had points taken off from a C programming assignment in college once because I used goto for cleanup, and the grader blindly took off points for using goto. It took me quite a few back & forths with him to convince him that my usage was fine. His words were: "the use of goto is rather dangerous and usually leads to spaghetti code and reduction in performance (because of prediction + cache reasons), wherever you used goto you could've used functions".

      2 replies →

  • Goto for cleanup is not bad IMHO. I use it extensively...

    $ cd redis; grep goto *.c | wc -l 251

    All the instances are like:

        goto cleanup;
        goto error;
        goto badfmt;
        goto numargserr;
    

    In this context is easy to read and makes the code structure better.

  • FWIW, gcc does have destructors for stack-allocated data as an extension: http://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html#i...

        #include <stdio.h>
        
        typedef struct foo foo_t;
        
        void foo_cleanup(foo_t *f);
        
        struct foo { int i; };
        
        void foo_cleanup(foo_t *f) {
            printf("f %d\n", f->i);
        }
        
        int main() {
        
            foo_t a __attribute__((cleanup(foo_cleanup))) = { 7 };
        
            printf("l %d\n", __LINE__);
        
            {
                foo_t b __attribute__((cleanup(foo_cleanup))) = { 8 };
                
                printf("l %d\n", __LINE__);
            }
        
            printf("l %d\n", __LINE__);
        }
    
    
        l 18
        l 23
        f 8
        l 26
        f 7

  • Yea, I agree. I love C and was a C purist and flat-out refused to learn C++ well and use it in practice. This recently changed and now I quite enjoy C++. Destructors are one reason why. The whole idea of resource-managing classes and RAII is clever and really useful.

    The hard part is that it takes a lot to be a good C++ programmer. C is such a smaller language, you can master a handful of best practices, gain experience, and become a competent C programmer. In contrast, C++ is such a large language you have to be active in learning everything — read Meyers, Modern C++, GoF's Design Patterns, etc. Anything short of this, and you're going to accidentally reinvent the wheel or do something stupid. But still, C++(11) is a terrific language that I'm learning to love.

  • > But goto is considered harmful,

    Dijkstra never considered the kind of goto you describe as harmful. This is a distinction that's lost on most programmers now that the kind of languages that did have harmful gotos are dead and forgotten.

    What he argued against was gotos that jump across procedure or scope boundaries.

    • I think part of the problem is that everyone heard that goto is harmful but noone ever reads the original letter, even though its reads like a short blog post (assuming they had blogs back then)

      http://www.u.arizona.edu/~rubinson/copyright_violations/Go_T...

      According to Dijkstra, the really evil gotos are the ones that force you to keep a track of the whole execution path in order to figure out the current state of the program. Its much easier when you can look at a line of code and statically know what the program state will be like at that point.

      A for loop is better than goto because you can look at a line inside it and instantly know that it will run a number of times, with the index changing in increasing order.

      gotos for resource cleanup are good because you can look at a given line of code and know what resources still need to be freed.

      Code that gets rid of gotos and break statements by blindly replacing them with tons of flags is just as bad as gotos because you still need to look at the whole execution path to figure out the state in the flags.

      2 replies →

    • In fact, his original title was "a case against the goto statement". It was the editor of CACM that gave it the link-baity (for the time) title "considered harmful".

  • "But goto is considered harmful"

    GOTO _was_ considered harmful back in 1968(!) when structured programming was still in its infancy. Code of that time was usually peppered with GOTOs and therefore a pain to read.

    Using GOTO for cleanup tasks is perfectly fine because it increases readability.

  • "goto cleanup" is fine, it's precisely how exceptions work anyway.

    It's a commonly used pattern, most famously in the Linux kernel.

  • If you read much of the original discussion around the "Go To Statement Considered Harmful", this pattern is frequently pointed out as a reasonable use case absent proper exception handling.

    Indeed, from the original editorial: "I remember having read the explicit recommendation to restrict the use of the go to statement to alarm exits, but I have not been able to trace it[.]"

    It is the right way to handle exceptional conditions in modern C.

  • This is a question I've wondered myself. It's nice to see some different answers and reasoning here.

    The solution I came up with myself is to have a destroy() function that is capable of cleaning up no matter what state the object is in.

    Example: https://github.com/andrewrk/libgroove/blob/bc7d72589eb297342...

    Notice all the calls to groove_playlist_destroy. And then in the destroy function it checks the conditions it has to before cleaning up.

    I'm not saying it's necessarily the best answer but I think it's a pretty clean strategy.

  • Well here is one alternative: allocate temporary strings on a stack basis. Have some higher level functions reset the stack pointer. So now things are allocated temporarily by default, and only in the (hopefully) rare case where you need a long-lived item do you promote it to long lived. Promoting could mean copying the string to the heap, or it could just mean marking it so that the stack cleanup mechanism skips it (so perhaps you have a linked list of temporary items or something similar).

  • You still have to 'free' in destructors in C++. Destructors don't free your memory for your custom types for you.

    • You need to free heap-allocated objects, but stack allocated objects are important and useful (particularly with RAII) and cleanly reversing such allocations eliminates a class of bugs.

      4 replies →

    • You mean heap allocated types? Stack allocated will be deallocated implicitly.

Recently I've been getting into some C and having programmed in a number of other languages know enough to be sure I wanted to have a solid understanding of properly handling encodings and strings in general. I've obtained a good understanding, but now I am looking to abstract some of the low-level mechanics. Unfortunately most of the libraries I have seen (such as bstring) use structs. It was nice to see this approach taken. Thanks for extracting it and making it easy to find and learn about!

On a related note - does anyone have recommendations for a similar small library for dealing with conversions from char * to wchar_t * and basic encoding duties? I'm working cross-platform, and so far I've stitched together some functions wrapping stuff like wcsrtombs() and WideCharToMultiByte().

Excellent news - the Redis source code is a really good source of high quality dependency-free C code and tends to be the first place I look.

Some time ago I forked the SDS code and added a set of additional utility functions around this for a project I was working on at the time (basic file reading, regex, LZF compression, Blowfish encrypt/decrypt, SHA256 etc).

This was from a fairly old SDS version so this looks like a good opportunity to sync up with the library version.

Repository is here: https://github.com/paulchakravarti/sdsutils

Disadvantage #1 can be simply solved by using a pointer to the "sds" type instead.

Then you can be sure that

  sdscat(&s, "Some more data")

updates s to always point at the right memory address and you can't introduce hard to find bugs by forgetting to assign to s which the compiler wont warn you about. If you'd pass just "s" instead of "&s" as the first parameter, the compiler would error out.

So all functions modifying the string should take a pointer to it.

  • To be consistent between allocation and reallocation (like malloc() and realloc()).

    And you still want to access the old value if reallocation failed.

My biggest gripe with C's string functions is that it's really easy to make off-by-one errors when trying to do anything with two strings, due to NUL and the way they seem to handle it inconsistently. I'm constantly having to check the docs for any given string function to find out how it handles NUL (which isn't always in an obvious spot thanks to the design of man pages). And once I've found it, I have to come up with some contrived example that usually needs to be written in a comment just to make sure I've used its intended algorithm correctly and didn't cause an off-by-one error. If this library hides all that for me, I'm sold.

Sigh, another string library. And this is the reason why I prefer C++ over C. You don't end up writing another string library for the 300th time. There aren't that many differences between string libraries anyway. Most of them are just structs with a pointer to a memory block, plus a length field.

  • > Sigh, another string library. And this is the reason why I prefer C++ over C. You don't end up writing another string library for the 300th time.

    You say that, and yet every C++ project I've ever touched in my life has had its own string class with various levels of horror attached. My favorite was the one that stored everything internally as 32-bit characters to be Unicode safe, and was never used in a codebase that had to deal with Unicode.

  • Yep, but this one has not the usual layout of struct+pointer (that's the whole point), and I'm not sure it qualifies as "yet another" since it was written in 2006.

    • It appears to me to be "yet another" because the length before the string approach is usually referred to as a b-string.

      windows (for example) has had a comprehensive b-string library (type is called BSTR) for about 20 years - due to it's age and provenance it has the downside of thinking a character is a 16-bit value...

      1 reply →

    • You're right, sorry. I only scanned the document, saw the struct, and closed the tab. The way you use a header before the data is similar to how Ruby represents its strings internally.

  • > And this is the reason why I prefer C++ over C. You don't end up writing another string library for the 300th time.

    Ironically, in almost all of my uses of C++, I did end up writing or using a custom string library there too. (I was doing mostly console games or language interpreters.)

  • Yes, many people who have written much C have written some version of this library.

  • I'm not so happy with C++ string either. Main complaint is that I like the use of C strings as "semi-predicates"- you can test for NULL to indicate a failure. I wrote my own String class at one point to provide this feature:

      // These provide test if assigned/not assigned
      inline operator void*() const
        {
        return (void *)s;
        }
    
      inline bool operator!() const
        {
        if(s) return 0;
        else return 1;
        }

  • Not only that.

    You also have x vector libraries for helping with bound checking in secure sensitive code and macro based generic data structures.

    No thanks.

Making sds a typedef for char* is very convenient. But it makes it very easy to pass an sds to a function that expects a C string without checking for null bytes.

Ruby, Java, Perl, PHP have all had security problems when interacting with C because they failed to properly distinguish binary-safe strings and C strings.

http://insecure.org/news/P55-07.txt http://cwe.mitre.org/data/definitions/626.html

  • I'd prefer a typesafe version (that would be a library with a struct type). It could even be a trivial wrapper struct for the char *.

Disadvantage #1 seems unnecessary. Instead of this:

  s = sdscat(s,"Some more data");

Why not do this?

  sdscat(&s,"Some more data");

The latter would make the use-after-free error they're describing impossible. (Disadvantage #2, changing one reference but not others, would remain. And callers would still need to check for NULL if they intend to handle ENOMEM gracefully.)

I assert there's no meaningful performance difference between the two.

  • To be consistent between allocation and reallocation (like malloc() and realloc()).

    And you still want to access the old value if reallocation failed.

    • > To be consistent between allocation and reallocation (like malloc() and realloc()).

      Consistency is a tool that is often useful to promote program correctness. You're suggesting using it to accomplish the reverse.

      If it's important aesthetically that all sds operations be consistent in this regard, you can structure the allocation interface in the same way:

        sdserror sdsnew(sds*);
      
        sds mystring = NULL;
        sdsnew(&mystring, "Hello World!");
      

      This is a common practice in relatively new interfaces like pthread_create.

      > And you still want to access the old value if reallocation failed.

      If you want to provide commit-or-rollback semantics, you could signal error via return value rather than by replacing s with NULL:

        if (sdscat(&s,"Some more data") != SDS_SUCCESS) {
          /* failure path; s is unchanged */
        } else {
          /* s now has "Some more data in it" */
        }
      

      but this may be completely useless, depending on the environment(s) in which the library or program is intended to be used. On 64-bit Linux systems with memory overcommit enabled (the default), this failure path essentially shouldn't ever happen. Instead, some process (maybe yours, maybe not) will be picked by the kernel OOM killer. Many programs just use an allocate-or-die interface, as the sds README mentions.

      1 reply →

    • To be fair, the library already assigns returns from realloc in such a way that you cannot retrieve the old data if realloc failed.

      Not nice I know.

NB: In the following function:

    /* Like sdscatpritf() but gets va_list instead of being variadic. */
    sds sdscatvprintf(sds s, const char *fmt, va_list ap) {
        va_list cpy;
        char *buf, *t;
        size_t buflen = 16;
    
        while(1) {
            buf = malloc(buflen);
            if (buf == NULL) return NULL;
            buf[buflen-2] = '\0';
            va_copy(cpy,ap);
            vsnprintf(buf, buflen, fmt, cpy);
            va_end(cpy); // <--- add this ----
            if (buf[buflen-2] != '\0') {
                free(buf);
                buflen *= 2;
                continue;
            }
            break;
        }
        t = sdscat(s, buf);
        free(buf);
        return t;
    }

From the `man va_copy` on my system:

    Each  invocation  of  va_copy()  must  be  matched  by  
    a  corresponding invocation of va_end() in the same
    function.

This is not likely to be a problem in most systems, but it can't hurt to be formally correct.

I've had the pleasure of getting pretty deep into sds back when I was tinkering with thredis, and it's definitely very cool.

The CString C++ class from Microsoft's MFC and now also ATL used since forever the trick of having both the length and the reference counting in the single allocation with the characters themselves, as well as additional 0-termination character even if it was initialized with non-zero terminated character.

In the case of disadvantage #1: s = sdscat(s,"Some more data");

You could fix that by adding a "remote pointer" header field. Inside of sdscat, you would allocate a new sds struct, and set the remote pointer header field in `s` to the new sds structs location. You could also try to do a realloc and maybe youll get the same starting pointer back again.

How much better is this dynamic approach than something like immutable strings? (not that char* is immutable)

Since with the sds library, you could potentially be getting back a new pointer for each operation, you could just as easily be working from an immutable string library. Do immutable strings have poor performance? I've never really considered it.

Disadvantage #1 isn't really a disadvantage in modern computing, it's the right thing to do.

  • I strongly disagree. It is a tradeoff between that and advantages #1 and #3; with the bstring library, for example, you pass in a mutable string and the function mutates it.

    In no world is passing in a mutable value, having it mutate it, but then having to reassigning your variable superior.

    Passing in an immutable value, and then assigning a fresh value is reasonable, but that's not what SDS is doing, AFAICT.

    I think gcc has some decorations you can have to tell it to never ignore the return function.

    I'm curious why s = sdscat(s,"Some more data"); didin't end up like: sdscat(&s,"Some more data");

So, this is an interesting approach to strings:

It's basically a malloc()/free() implementation with printf, some formatting, and strcat bolted onto it. (Strictly speaking, it may or may not use free lists or whatever, but the use of the header and returned pointer is quite similar.)

And that's awesome.

I remember lifting the 'sds' string implementation from within the redis back in the day. Thanks antirez