DynamoDB cannot store empty strings

9 years ago (forums.aws.amazon.com)

I'm surprised this is news in 2016. Anyone who uses DynamoDB hits this immediately and suffers a bit from it.

Would you like more insider advice on using DynamoDB? My startup Level Money used it as a primary data store and has kept it throughout our lifespan. We're migrating away from it now, but I wouldn't necessarily discourage people from using it.

NEVER SCAN ONE TABLE AND UPDATE ROWS AS YOU ENCOUNTER THEM, OR USE NATIVE KEY ORDERING TO TRIGGER UPDATES ANYWHERE ELSE!

Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise. If you iterate across things in native order without randomization and at a very high speed then you will overload individual shards. You can end up at 10X write provisioning and still get rate limit responses. Randomizing the traversal of the keyspace fixes this.

Which is yeah, really really bad for sufficiently large datasets. You have to get creative to randomize it sufficiently in some cases.

Still, it's quite nice to have something like DynamoDB handling your scaling early on. It's a surprisingly useful design for a database and keeps you rom over-relying on relational properties which eventually don't scale. It also forces you to develop a story for cross-table transactional queries and their failures quite early in your platform's life cycle. Forcing that discipline is almost always healthy.

If you're not careful though, it becomes frightfully expensive. Before we understood why we were rate limiting we panicked and ended up with >$12k/mo in DDB costs. Not really a sustainable cost for a very small company.

  • > Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise. If you iterate across things in native order without randomization and at a very high speed then you will overload individual shards. You can end up at 10X write provisioning and still get rate limit responses. Randomizing the traversal of the keyspace fixes this.

    It's not based on SQL, but the fact that the table is sharded and has the throughput characteristics you describe is well documented (i.e. not at all secret :p) http://docs.aws.amazon.com/amazondynamodb/latest/developergu...

    • Yes, it's documented that, to quote the page you linked, "a single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units."

      The documentation also explains that scanning a table runs the risk of saturating the capacity of a partition (1):

      > As a table or index grows, the Scan operation slows. The Scan operation examines every item for the requested values, and can use up the provisioned throughput for a large table or index in a single operation. (...)

      > The larger the table or index being scanned, the more time the Scan will take to complete. In addition, a sequential Scan might not always be able to fully utilize the provisioned read throughput capacity: Even though DynamoDB distributes a large table's data across multiple physical partitions, a Scan operation can only read one partition at a time. For this reason, the throughput of a Scan is constrained by the maximum throughput of a single partition. (emphasis mine)

      Whenever I take an important dependency on a product, I make it a habit to read or skim virtually all of the product's documentation from beginning to end. Documentation for complex technologies is something to study. It's served me very well and I'd recommend the practice to others. With this approach you'll find that you just "know" (or can quickly look up) things that tend to surprise other people. Even if not all of the knowledge is in your working memory, you'll have a vague recollection of reading "something about that" and will be able to come back quickly to what you read.

      (1) http://docs.aws.amazon.com/amazondynamodb/latest/developergu...

      5 replies →

  • > Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise.

    SQL is a querying mechanism not an underlying storage protocol. Converting from DynamoDB syntax to SQL would be a bazaar choice for them.

    If you look at the limitations of Dynamo it becomes fairly clear what they do. Nearest I can figure it is close to this:

    Each hash key resolves to a number of possible servers the data can be on. Data is replicated across several of these servers. For redundancy. The hash key determines which shard to use.

    On individual machines, each set of data is stored by a compound key of hash key and sort key (if there is a sort key). The data is probably stored on disk sequentially by sort key or close to it. They possibly use something like LevelDB for this.[1]

    If you have Global Secondary indexes it is literally treated as a separate table that is replicated to automatically. This is why users were doing anyway so Amazon just made it easy.

    For those of you who do not know dynamo well. There are three basic read operations GetItem, Query, and Scan. Scans CANNOT be sorted, this is an important implementation detail.

    A query can hit all the records for a single hash key very quickly because all records with the hash key exist on the same hardware. And they can be sorted because as mentioned earlier they are sorted by key on storage. Which is why you cannot query on more than one hash key at once. And why you can query for exact sort keys, greater than, less than, or between but not non-sequential. Because for performance Dynamo will only return sequential records from a query.

    Scans, nearest I can tell, are map reduce across all shards that your table might exist on.

    In conclusion. DynamoDB is a key value store with compound keys where the records are stored in order by key.

    [1] In fact, I would not be surprised if Dynamo is mapped on top of LevelDB.

    • I used to work on the DynamoDB team. Throwaway account because my normal account can be tied back to my real name.

      > Each hash key resolves to a number of possible servers the data can be on. Data is replicated across several of these servers. For redundancy. The hash key determines which shard to use.

      > On individual machines, each set of data is stored by a compound key of hash key and sort key (if there is a sort key). The data is probably stored on disk sequentially by sort key or close to it.

      This is pretty much exactly correct. The hash key maps to a quorum group of 3 servers (the key is hashed, with each quorum group owning a range of the hash space). One of those 3 is the master and coordinates writes as well as answering strongly consistent queries; eventually consistent queries can be answered by any of the 3 replicas.

      > They possibly use something like LevelDB for this.[1]

      Sigh...if only. I don't remember the exact timeline but LevelDB either didn't exist when we started development or wasn't stable enough to be on our radar.

      DynamoDB is this very elegant system of highly-available replication, millisecond latencies, Paxos distributed state machines, etc. Then at the very bottom of it all there's a big pile of MySQL instances. Plus some ugly JNI/C++ code allowing the Java parts to come in through a side door of the MySQL interface, bypassing most of the query optimizer (since none of the queries are complex) and hitting InnoDB almost directly.

      There was a push to implement WiredTiger as an alternative storage engine, and migrate data transparently over time as it proved to be more performant. However, 10gen bought WiredTiger and their incentive to work with us vanished, as MongoDB was and is one of Dynamo's biggest competitors.

      1 reply →

    • > SQL is a querying mechanism not an underlying storage protocol

      My phone corrected "MySQL" to SQL and before I noticed the edit window expired. My apologies for the error.

      1 reply →

  • Oracle doesn't allow empty strings either (well, it does but they are the same as NULL). It's not a big deal if you understand what it's doing.

    • Strange. MySQL understands both empty strings and NULL, and they are not equivalent. Informix understands empty strings, and IBM mentions that Oracle does not. Microsoft SQL server understands empty strings, but apparently whitespace at the end is truncated in comparisons, so " " is equal to " ".

  • Most likely DynamoDB is BigTable like database, Amazon may even used forked version of Cassandra or HBase.

    All of this databases, got hot partition issue. If you cause to many read/write to one server you run into issues. That's why key schema is very important in non-trivial use-cases.

    Classic anti-pattern is to use date or other increasing number as prefix. It is also problem when you use S3.

  • did you consider using EMR and Hbase? its not very expensive to run a provisioned 3 node managed cluster.

    • I'm fairly sure that EMR and HBase would give us miserably bad performance characteristics for the kinds of workload we're describing here.

      Why would I consider it for sub-millisecond latency tasks? Has something changed there I'm unaware of?

      2 replies →

What is exceptionally irritating about this is that there is no valid reason for not being able to represent an empty string, and the lack of support makes the service inelegant and harder to use.

1. Empty strings are straightforward to represent.

2. Not allowing them violates the principle of least surprise, which makes the service harder to use.

3. Not supporting them adds needless complexity to client applications.

4. An empty string has obvious meaning and a place in everyday applications.

  • It just blows my mind that someone building a database in the 21st century doesn't know that zero is not a special case. In my book, anyone who doesn't get that is a beginning programmer, not even intermediate.

    I know it took people a while to get this originally. In the old days, for example, Fortran 'do' loops were always executed at least once, regardless of the values of the bounds. Nowadays we know better: you check the values at the top of the loop, not (just) the bottom, to handle the zero-iteration case correctly.

    But now it should be burned into every developer's brain: counts can be zero, strings can be empty, lists can be empty, etc. etc. etc., and handling these cases correctly is critical. Zero is not a special case!!!

What is the point of this trending? To try and get AWS to fix it? I must admit this was a gotcha for me at first but it is documented right on their standard Limits page. I'm guessing the fix is non-trivial or they probably would have just done it.

http://docs.aws.amazon.com/amazondynamodb/latest/developergu...

  • This limitation should be in a giant alert at the top of all DynamoDB documentation. They are being misleading in their documentation:

    > A map is similar to a JSON object. There are no restrictions on the data types that can be stored in a map element, and the elements in a map do not have to be of the same type.

    > Maps are ideal for storing JSON documents in DynamoDB. The following example shows a map that contains a string, a number, and a nested list that contains another map.

  • Not being able to serialize an empty string is abysmally poor design.

    • There's a difference between poor design and performance optimizations. Lacking any information, I'm inclined to commit an appeal to authority and guess that it's probably a performance optimization.

I've bumped into this. It's a serious problem.

If you think you're going to migrate an existing system to DynamoDB, then this is one of the things you need to plan for.

The workarounds for this are rather unappealing, perhaps heretical.

Oracle cannot store empty strings (they are treated as NULL).

  • Wow, I've been working with Oracle for a million years and I didn't know that.

    Fortunately, now that I'm over a million years old I'm probably not going to freak out about it no matter how wrong it seems. I might relax my pro-NULL politics a bit though.

    • This was one of the first things I recall learning about Oracle when I had my first internship dealing with it. It boggles my mind that you can work with Oracle for more than a week and not run into this.

      4 replies →

  • As far as I can tell Oracle can store empty strings they just happen to be treated exactly like NULL. That's at least marginally better. Still, it's hard to imagine people being willing to use Oracle nowadays.

    • Why? Because you think sql is not good or because you think there are better relational databases?

      I'd say that Oracle delivers quite much value for your money for many applications.

      I have worked with all other relevant relational databases and also with a few nosql databases, in distributed systems and monoliths.

      It's hard to make a recommendation that can be applied generically. The best answer is "It depends, and it's complicated"

      One thing is sure. Nosql is not the future (quite the opposite) although sometimes -in rather special circumstances- the right solution. Graph databases are perhaps more interesting, and perhaps event sourcing, but the promises are way larger than the deliveries.

      Edit. I forgot the obvious reason. You try to stay clear of the Dark Side.

      4 replies →

    • > Still, it's hard to imagine people being willing to use Oracle nowadays.

      I use Oracle every day, because that is one of the options for our customers rely on to run their critical Fortune 500 infrastructure. The other being SQL Server.

    • This rears its head when you want to make a string column nullable. If you do that, you can't store empty strings in it. Really broken IMHO.

    • Postgresql is pretty good, however Oracle still can do some crazy things, at the cost of a correspondingly insane amount (IMHO) of database admin / sysadmin overhead.

    • > it's hard to imagine people being willing to use Oracle nowadays.

      Only when it's the lesser of the super-evils you have to pick from. I'd pick Oracle over Progress(OpenEdge), but not much else :)

I used DynamoDB for an inherited project around the time I had gotten really good with MongoDB. I don't have any idea why anyone would ever use DynamoDB. It's like Mongo, without any of the good parts, that you can't run locally or any non-Amazon server. I don't see the benefit.

Edit: Specifically, I was using some pretty advanced aggregation stuff. At the time I used it, DynamoDB had zero aggregation whatsoever.

  • The reasons we use it * doesn't require any setup (no provision servers, maintining backups, etc) * automatic backups * practically free during development * scales when you need it too

    That said, we've run into issues with the partitioning. If we were given access to the partition information and the ability to reduce partitions when needed, that would solve those problems.

  • I also have done 3 years of MongoDB before I (unwillingly) wrote an app with it... worst development experience I've ever had. I wouldn't have used it were it not for my corporate overlords at the time.

    My new rule of thumb: Use plain databases for most things and only use DynamoDB if there's a precise feature in your system that it is well suited for.

    DynamoDB is a horrible general use database.

    • It's specifically not a general use database, so it's not surprising that it doesn't have general use database features. If you want that on AWS, use RDS or Aurora. If you need to scale in certain ways, then Dynamo can certainly be worth the restrictions it imposes, because it is very scalable.

    • Care to elaborate? MongoDB for app development on paper seems like a great solution.

      The NOSQL model fits great with a Restful API layer. The process of getting and posting data over HTTP should be a lot simpler with Restful API/MongoDB when compared to relational database solution.

      I know Restful APIs are not perfect either, so care to fill in the blanks of why its was the worst dev experience for you.

  • I used MongoDB in the past too. Now, I'm using Elasticsearch as a document store. Aggregations are good, and you can do some very advanced stuff with it (geographic clustering, scan left, range and exact match, range buckets, histograms). Of course it comes with a set of limitations, but it might be worth your time to investigate (especially if you want to use it also for full text search).

    • I've actually become quite familiar with Elasticsearch since then, whereas I stopped using Mongo (that project died). ES is indeed very powerful.

  • To be fair, you can run DynamoDB locally: https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

    • DynamoDB-Local is a service that implements DynamoDB's API, including its query language, datatypes, and grammar, but stores everything in SQLite. It's great for playing with the API or substituting a mock implementation that doesn't drain your wallet, but it's very unlikely have the backend implementation and the performance characteristics -- or even all of the exact bugs -- of "real" DynamoDB.

    • last time I used this, it had many significant differences from the real DynamoDB. Different enough that I don't really understand what purpose DynamoDB Local is actually supposed to serve.

    • Ah, I wonder if this was true a few years ago (early 2014) when I was working with it.

  • You can run it locally for development. Just look for dynamodb-local images on the docker hub.

  • Dynamodb has performance guarantees and you don't have to get into nightmare of maintaining MongoDB cluster.

In my previous team, we instead use the string "null" to represent NULL. Its pretty dumb but we haven't thought of any other way.

What I like about DynamoDB is that you can scale the table easily. You don't need a dedicated DBA to maintain your database.

For what I don't like

1. You pay for what IOPS you allocated 2. You can't share free IOPS to other tables 3. Getting a snapshot of the whole DB is impossible, DB backups are not transactional 4. Use EMR or DataPipeline for backups 5. If you reach your IOPS limit, you need to retry your writes/updates instead of delayed ack's. Others uses libraries that limits writes but its per server and doesn't account the free writes on other servers since the accounting is on the client level

Before they announce tons of new stuff to please the shareholders, there are much awaited enhancements on existing products which are much over due. Case in point this thread. Keep the developers happy. Just saying.

I dealt with this. It's absurd. So many smart coding conventions are completely broken by this absurd convention.

Don't use Dynamo unless you for some reason have to. It's half assed and not getting many updates.

"As a broader question, do you have a specific use case for having an empty string in an AttributeValue?"

Can anyone explain to me why this isn't a bizarre question? They can't imagine a use-case for empty strings on their own?

  • I would quote them back their own marketing: "Using the AWS SDK you can write applications that store JSON documents directly into Amazon DynamoDB tables"

Why is this news?

It might be annoying but it's super easy to work around. It makes sense to not store empty strings, they provide no value and it would only take up space.

  • How do empty strings provide no value? It's a valid value for strings, one which many programs encounter regularly, and it's reasonable to want to store them. Refusing to store empty strings makes as much sense as refusing to store zeroes in a numeric field, because zero is "nothing."

  • > It makes sense to not store empty strings, they provide no value

    `""`, just like `NULL`, is semantically difference from absence of record.

Just hit this problem again this morning. Quite annoying. I figured out what string was empty eventually, but i've solved this on other projects by just removing all keys with empty values to get around possible errors.

  • The problem we have is that we're trying to store JSON provided by the user. Meaning that we need to persist empty strings provided by them. The only way I can see around this issue is to replace all empty strings with a special string token on the way in, and replace all those special string tokens on the way out.

    • See that's the best use case I've seen in this entire discussion. If you don't get to control if blank vs null is significant then I can see how you'd have a real issue on your hand.

      I'm guessing you have to have it in a form you can search it (so you can't just GZIP it or something like that)?

      1 reply →

Can someone give me some examples of cases where there is a strong semantic difference that you'd want to represent with a blank string vs a null?

I'm interested in abstract concepts, not "I'm trying to port from Oracle/Mongo/whatever and this is how it works". Assume your designing against some generic record store.

I'm having a ton of trouble thinking of a case where that's a big problem (again, ignoring trying to save time in porting) and I was wondering if someone could provide me a few examples.

  • It's the difference between "this value is unknown" and "this value is known and empty". It's similar to why JavaScript has both null and undefined.

    Here's an example. Let's say a user is taking a survey and the last question is: "Please share any additional thoughts if you have them." A null could mean the user did not answer this question, probably because they haven't gotten to it yet. An empty string could mean the user was asked this question and actually didn't have anything to say.

    • That's basically the only example I could think of: tracking if someone filled out a form field.

      But in that case how can you tell the difference? I mean if they never clicked in the field or tabbed into it and submitted the form is that meaningfully different from they clicked into it and didn't type anything?

      I can think of ways to work around it (keep a boolean for 'was filled out', a number that means 'they got up to field 17', etc.) but I'm still not sure you can draw a conclusion of 'they skipped it' vs 'they didn't enter anything'.

      7 replies →

    • To be fair, this would often be modeled by omitting the field completely. The lack of the field suggests the user hasn't gotten that far. The presence of the field with a NULL value means they didn't provide anything.

  • In a relational database there's a clear difference between "I did a join and there was no value on the other table" and, "this cell was intentionally left blank".

    I haven't used DynamoDB enough to tell if that applies here or not.

    • You basically don't join in Dynamo (as I remember), so that doesn't fit.

      Either way, you'd still need to give a good reason for "this space intentionally left blank" because I'm not seeing much in the well of compelling reasons.

  • > Can someone give me some examples of cases where there is a strong semantic difference that you'd want to represent with a blank string vs a null?

    A null middle name means "we don't know their middle name, we added it to the schema after that user signed up or didn't ask them". An empty string means they explicitly don't have a middle name.

    • I entirely agree with your point with regard to a fixed-schema, row-based, relational-like datastore. But in a KV-store like DynamoDB, there's a different set of possibilities:

      [1] key present, value undefined or null or some other special type

      [2] key present, value is empty object of its expected type

      [3] key absent

      In schemaless datastores where the value's content somehow determines its datatype, it can be difficult to enforce a distinction between scenarios [1] and [2]. Meanwhile, in an externally-schema'd datastore (like most RDBMS), you don't have option [3]. I am familiar with the practice of mapping "we don't have a known-good value for this" to omitting the key in an KV-store, while in an RDBMS that semantic meaning is mapped to a NULL instead.

      5 replies →

    • Although I think the limitation is an incontrovertible issue in other ways, that specific example sounds like a domain modelling blunder to me, since it is overloading a scalar value to encode value-object metadata.

      2 replies →

  • A few thoughts.

    0 <> NULL <> ""

    They are all values that matter because they can exist.

    Blank can have business meaning. Also, I suspect that there is a reason blank is separate from NULL in the ASCII table. They are different signals, and changing one to the other is fundamentally changing the meaning of the signal.

    Blank often has business meaning on forms, like "leave this field blank to indicate the user has acknowledged the field but chose to enter a blank value" vs the "user never saw this field and no value was entered, or the system did not store a value for the user's input".

    The old saying "NULL is not nothing"?

  • You could represent a path through a grid as a string using U=up, R=right, D=down, L=left. Then a solution to a maze could be 'UUUURRRRLRUULLD'. An empty string indicates a solution to a maze in 0 zero steps. A null string indicates that there is no solution.

    • I'll give you that, but it's a very contrived example. I'm trying to think more of real world problems.

      Many people have mentioned the idea of keeping track of if someone doesn't have a middle name versus you don't know their middle name. That makes more sense as an example (although I'm not sure how critical that is).

      4 replies →

    • I think this is a bad use of `NULL`.

      You should instead use a list of all possible solutions, or a sublist of that list (if you only need one solution) because then the empty list can indicate there are no solutions.

      If you then want to do something you can iterate over the list: This will protect you from "null pointer" exceptions, promote more functional programming, and is usually less code as well.

          js: if(x !== null) { … }
          q: $[not null x;…;'`oops]
      

      versus:

          js: x.map((r) => …)
          q: {…} each x

It also doesn't store empty lists, sets, maps, etc, not just top level, but nested inside dynamo maps. The particular SDK i was using silently removes keys with empty (whatever) values from the map. Dynamo maps in general were full of gotchas and we ended up just serializing as json since the only gotcha there is integer keys becoming strings.

Also dynamo's capacity provisioning can be scaled up any time (although scaling up is slow-ish), but only down once per day, which means that unlike the promise of ec2 you end up having to essentially pay for your peak load at all times.

If I had it to do over again I'd have stuck with postgres, the only reason we used dynamo in the first place was to throw our boss a bone to soften our flat refusal to use Amazon Simple Workflow, which is an even bigger quagmire.

I tried using DynamoDB. It felt really unintuitive and like I had to bend my thought process to fit it.

I misconfigured some parameters and ended up with a big bill for a database that was doing pretty much nothing.

I don't see the upside of DynamoDB.

  • > I don't see the upside of DynamoDB.

    Wide column, no-sql data store where you pay for provisioned throughput, and storage over 25GB.

    All maintence is taken care by Amazon, no sharding, scale, patching, or tuning required.

    If you need a non relational data store for use inside an AWS deployment, you would be hardpressed to find a better alternative.

    • > inside an AWS deployment

      That's the key, isn't it?

      I've been trying to wrap my head around DynamoDB for something that wouldn't necessarily live inside AWS and it seems like an awful lot of trouble.

      But I guess if you're in the same AWS Region and you're doing high-volume stuff then the cost savings probably get compelling real fast.

    • > If you need a non relational data store for use inside an AWS deployment, you would be hardpressed to find a better alternative.

      RDS Postgres supports HStore and JSON types.

      4 replies →

How about BigTable and Cassandra?

  • Google Cloud Bigtable will let you store empty strings as values. I'm not quite as sure about whether you can successfully set a 0-byte value and distinguish it from an empty string ['\0'] value; the answer may depend on the client library/language/API that you use to interact with the service.

    [Source/Disclaimer: I work on Google Cloud Bigtable.]

I recommend people look into Azure's DocumentDB. I've used Dynamo before and even though it was fast and reliable, development experience was a pain.

I'm slowly migrating a self-hosted MongoDB database to DocumentDB and things are going pretty well. The only thing missing for me is aggregation and transparent encryption, but they are working on those.

Simple workaround: prefix every string with a fixed character. Advantage: it will leave the lexicographical sort order unchanged.