← Back to context

Comment by KirinDave

9 years ago

I'm surprised this is news in 2016. Anyone who uses DynamoDB hits this immediately and suffers a bit from it.

Would you like more insider advice on using DynamoDB? My startup Level Money used it as a primary data store and has kept it throughout our lifespan. We're migrating away from it now, but I wouldn't necessarily discourage people from using it.

NEVER SCAN ONE TABLE AND UPDATE ROWS AS YOU ENCOUNTER THEM, OR USE NATIVE KEY ORDERING TO TRIGGER UPDATES ANYWHERE ELSE!

Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise. If you iterate across things in native order without randomization and at a very high speed then you will overload individual shards. You can end up at 10X write provisioning and still get rate limit responses. Randomizing the traversal of the keyspace fixes this.

Which is yeah, really really bad for sufficiently large datasets. You have to get creative to randomize it sufficiently in some cases.

Still, it's quite nice to have something like DynamoDB handling your scaling early on. It's a surprisingly useful design for a database and keeps you rom over-relying on relational properties which eventually don't scale. It also forces you to develop a story for cross-table transactional queries and their failures quite early in your platform's life cycle. Forcing that discipline is almost always healthy.

If you're not careful though, it becomes frightfully expensive. Before we understood why we were rate limiting we panicked and ended up with >$12k/mo in DDB costs. Not really a sustainable cost for a very small company.

> Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise. If you iterate across things in native order without randomization and at a very high speed then you will overload individual shards. You can end up at 10X write provisioning and still get rate limit responses. Randomizing the traversal of the keyspace fixes this.

It's not based on SQL, but the fact that the table is sharded and has the throughput characteristics you describe is well documented (i.e. not at all secret :p) http://docs.aws.amazon.com/amazondynamodb/latest/developergu...

  • Yes, it's documented that, to quote the page you linked, "a single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units."

    The documentation also explains that scanning a table runs the risk of saturating the capacity of a partition (1):

    > As a table or index grows, the Scan operation slows. The Scan operation examines every item for the requested values, and can use up the provisioned throughput for a large table or index in a single operation. (...)

    > The larger the table or index being scanned, the more time the Scan will take to complete. In addition, a sequential Scan might not always be able to fully utilize the provisioned read throughput capacity: Even though DynamoDB distributes a large table's data across multiple physical partitions, a Scan operation can only read one partition at a time. For this reason, the throughput of a Scan is constrained by the maximum throughput of a single partition. (emphasis mine)

    Whenever I take an important dependency on a product, I make it a habit to read or skim virtually all of the product's documentation from beginning to end. Documentation for complex technologies is something to study. It's served me very well and I'd recommend the practice to others. With this approach you'll find that you just "know" (or can quickly look up) things that tend to surprise other people. Even if not all of the knowledge is in your working memory, you'll have a vague recollection of reading "something about that" and will be able to come back quickly to what you read.

    (1) http://docs.aws.amazon.com/amazondynamodb/latest/developergu...

    • That's fantastic, but let's just take a look at the documentation a few years ago when I discovered this:

      https://web.archive.org/web/20130102210613/http://docs.aws.a...

      Huh, worded slightly differently, isn't it? They DO allude to this perhaps being the case in https://web.archive.org/web/20121221003912/http://docs.aws.a... but it's not made clear at any point that scans will yield up keys neatly bundled by shard.

      It is not the case even now that the scan operation is using your capacity to cause this, it's because of the way shards enumerate their keys and that is not done simultaneously and mixed for you before sending it. You can redirect write traffic to another table and still often exhibit a rate limit effect even though the scan isn't consuming writes for that table.

      That, I think, is still quite surprising.

      You can take for granted how great the docs are now, although I still submit that this aspect of the system is quite poorly documented. AWS in general is fantastic at conveying API endpoints and very poor at offering a new developer a narrative on how to use the product.

      The reason that the docs are as good as they are now: people like me have been around yelling at Amazon for years to improve their documentation, and telling tech reps to better document things. I hope you in your capacity do the same. And I will continue to offer insights like this on forums like this precisely because there are lots of relatively new platform engineers here. It's one of the few things I _like_ doing on hacker news.

    • Pyxl101, I find it surprising that your comment was downvoted. Typically people downvote for one of 2 reasons, either they disagree with your tone or they disagree with your facts, but in your case your tone seems reasonable and your facts seem accurate. It does seem unreasonable that people downvoted you.

      3 replies →

> Secretly DynamoDB is just a bunch of SQL databases with floating masters, or so we surmise.

SQL is a querying mechanism not an underlying storage protocol. Converting from DynamoDB syntax to SQL would be a bazaar choice for them.

If you look at the limitations of Dynamo it becomes fairly clear what they do. Nearest I can figure it is close to this:

Each hash key resolves to a number of possible servers the data can be on. Data is replicated across several of these servers. For redundancy. The hash key determines which shard to use.

On individual machines, each set of data is stored by a compound key of hash key and sort key (if there is a sort key). The data is probably stored on disk sequentially by sort key or close to it. They possibly use something like LevelDB for this.[1]

If you have Global Secondary indexes it is literally treated as a separate table that is replicated to automatically. This is why users were doing anyway so Amazon just made it easy.

For those of you who do not know dynamo well. There are three basic read operations GetItem, Query, and Scan. Scans CANNOT be sorted, this is an important implementation detail.

A query can hit all the records for a single hash key very quickly because all records with the hash key exist on the same hardware. And they can be sorted because as mentioned earlier they are sorted by key on storage. Which is why you cannot query on more than one hash key at once. And why you can query for exact sort keys, greater than, less than, or between but not non-sequential. Because for performance Dynamo will only return sequential records from a query.

Scans, nearest I can tell, are map reduce across all shards that your table might exist on.

In conclusion. DynamoDB is a key value store with compound keys where the records are stored in order by key.

[1] In fact, I would not be surprised if Dynamo is mapped on top of LevelDB.

  • I used to work on the DynamoDB team. Throwaway account because my normal account can be tied back to my real name.

    > Each hash key resolves to a number of possible servers the data can be on. Data is replicated across several of these servers. For redundancy. The hash key determines which shard to use.

    > On individual machines, each set of data is stored by a compound key of hash key and sort key (if there is a sort key). The data is probably stored on disk sequentially by sort key or close to it.

    This is pretty much exactly correct. The hash key maps to a quorum group of 3 servers (the key is hashed, with each quorum group owning a range of the hash space). One of those 3 is the master and coordinates writes as well as answering strongly consistent queries; eventually consistent queries can be answered by any of the 3 replicas.

    > They possibly use something like LevelDB for this.[1]

    Sigh...if only. I don't remember the exact timeline but LevelDB either didn't exist when we started development or wasn't stable enough to be on our radar.

    DynamoDB is this very elegant system of highly-available replication, millisecond latencies, Paxos distributed state machines, etc. Then at the very bottom of it all there's a big pile of MySQL instances. Plus some ugly JNI/C++ code allowing the Java parts to come in through a side door of the MySQL interface, bypassing most of the query optimizer (since none of the queries are complex) and hitting InnoDB almost directly.

    There was a push to implement WiredTiger as an alternative storage engine, and migrate data transparently over time as it proved to be more performant. However, 10gen bought WiredTiger and their incentive to work with us vanished, as MongoDB was and is one of Dynamo's biggest competitors.

    • > Then at the very bottom of it all there's a big pile of MySQL instances. Plus some ugly JNI/C++ code allowing the Java parts to come in through a side door of the MySQL interface, bypassing most of the query optimizer (since none of the queries are complex) and hitting InnoDB almost directly.

      Ah! I knew it! It's an orchestration layer on top of a mysql layer with floating masters.

      I remember back when DDB was first launched we all sat around at Powerset and tried to figure out how Amazon did it given the two pieces of information we were given: "it is not based off the dynamo paper" and "it is based off both open source and proprietary code". We figured it had to be MySQL or Postgres that they were referring to.

      I didn't know you were completely bypassing the query layer though.

  • > SQL is a querying mechanism not an underlying storage protocol

    My phone corrected "MySQL" to SQL and before I noticed the edit window expired. My apologies for the error.

    • Kudos for bring write on that one. I'm wouldnot have guessed it was mySQL.

      I'm slightly less surprised because it apparently bypasses the SQL optimization engine and goes straight to InnoDB (per the other post). I have seen people write key-value stores based on MySQL+InnoDB before (sometimes bypassing SQL entirely).

Oracle doesn't allow empty strings either (well, it does but they are the same as NULL). It's not a big deal if you understand what it's doing.

  • Strange. MySQL understands both empty strings and NULL, and they are not equivalent. Informix understands empty strings, and IBM mentions that Oracle does not. Microsoft SQL server understands empty strings, but apparently whitespace at the end is truncated in comparisons, so " " is equal to " ".

Most likely DynamoDB is BigTable like database, Amazon may even used forked version of Cassandra or HBase.

All of this databases, got hot partition issue. If you cause to many read/write to one server you run into issues. That's why key schema is very important in non-trivial use-cases.

Classic anti-pattern is to use date or other increasing number as prefix. It is also problem when you use S3.

did you consider using EMR and Hbase? its not very expensive to run a provisioned 3 node managed cluster.

  • I'm fairly sure that EMR and HBase would give us miserably bad performance characteristics for the kinds of workload we're describing here.

    Why would I consider it for sub-millisecond latency tasks? Has something changed there I'm unaware of?

    • well i was just curious. im kinda wondering if anything among hbase, Cassandra, Mongodb etc are equivalent in performance at your scale. i see that you mentioned rethinkdb below.. but after their collapse, im wondering what are the options you are looking at.

      1 reply →