Achieving 100M database inserts per second using Apache Accumulo and D4M [pdf]

9 years ago (ieee-hpec.org)

46 comments

espeed

I hate to be a hater.

But the big issue with databases I've worked with is not how many inserts you do per second, even spinning rust, if properly reasoned can do -serious- inserts per second in append only data structures like myisam, redis even lucene. However the issue comes when you want to read that data or, more horribly, update that data. Updates, by definition are a read and a write to commuted data, this can cause fragmentation and other huge headaches.

I'd love to see someone do updates 1,000,000/s

jamesblonde 9 years ago
MySQL Cluster (NDB - not Innodb, not MyISM. NDB is the network database engine - built by Ericsson, taken on by MySQL. It is a distributed, in-memory, no shared state DB). >50% of mobile calls use NDB as a Home Location Registry, and they can handle >1m transactional writes per second. We have verified the benchmarks, that you can get millions of transactions/sec (updates are about 50% slower than reads) on commodity hardware with just gigabit ethernet. Oracle claim up to 200m transactions/sec on Infiniband and good hardware: http://www.slideshare.net/frazerClement/200-million-qps-on-c...
Two amazing facts about MySQL Cluster: (1) It's open source (2) It's never penetrated the Silicon Valley Echo Chamber, but is still the world's best DB for write-intensive transactional workloads.
- viraptor 9 years ago
  
  Another reason may be that it seems you need to be an expert in internal workings of the ndb to deal with it. I tried to use it. Over a month ended up with a few cases of "can't start the database", "can't update", etc. with no real solutions. There's not that much information about ndb on the internet, so googling doesn't help. The best help I got was from mysql-ndb irc channel, but in practice when things went bad, people said something like "if you send me the database, I'll help you figure it out". This does not work in practice.
  I feel like it would be more popular if people actually wrote about how they're using it. But how do you start when even basic configuration bugs are still open without progress: https://bugs.mysql.com/bug.php?id=28292
  (this was a few years ago or so, maybe things changed)
  
  1 reply →
doikor 9 years ago
Some systems (like Cassandra) has upserts and can do writes without reading. Though you loose any kind of "transaction" safety in that you are not sure what you are writing on top of. But in my experience for the vast majority of cases that is ok.
- gopalv 9 years ago
  
  > Some systems (like Cassandra) has upserts and can do writes without reading.
  That forces the latest-version resolution into a read-side problem - systems like that cannot handle any sort of complete table-scans but can only really work for a pre-known key.
  With a known unique key, this is somewhat possible to use systems like this, but extremely expensive when what you need is BETWEEN '2017-01-01' and '2017-01-02'.
  
  3 replies →
X86BSD 9 years ago
You are so right. Without giving to much detail the wife is working on a project with a branch of government. They use Oracle. Her teams biggest problems are and have been the DB falling over during updates or joins. Were talking billions of rows of government data. Running out of table space, loads taking literally days to complete. Thing's that I have never seen happen. I'm not sure if it's the DB architecture itself or clueless people doing really dumb things. I only get to see these failures second hand.
- throwawayish 9 years ago
  
  If they upgrade to all in-memory databases their full-table scans will run faster
  /s
espeed 9 years ago
See LMDB (embedded) and ScyllaDB (distributed):
The Lightning Memory-Mapped Database (LMDB) https://www.youtube.com/watch?v=Rx1-in-a1Xc
ScyllaDB http://www.scylladb.com/
- espeed 9 years ago
  
  Also Apache Geode (distributed, in-memory, consistent, durable, crazy fast) -- seriously solid tech.
  http://geode.apache.org/
  
  1 reply →
itchyouch 9 years ago

Not really a database per se, but trading system matching engines can maintain updates in excess of 1M/s. Of course these are small 60 byte messages updating 10-15 bytes of a given order book in something like a RB-tree, so it's not as impressive.
What is impressive I guess is that they can sustain 1M updates/sec at a 99.9th percentile update round-trip of ~10 microseconds or so, with medians at around ~4 microseconds.
Xorlev 9 years ago

Since each ingest process talked directly to the Accumulo tablet local to it, it really measured loopback+RPC+DFS performance. Knowing how these things usually go, it might have been 100M rows/s but only 100k-1M RPCs/s. It's still quite impressive, but it's important to keep it in perspective. For example, I believe Google's C* 1M writes/s demo also included real network overhead from driver processes. Additionally, that was with the WAL on, vs. this Accumulo run which disabled the WAL.
Our graph store (HBase, SSD) on 10 nodes can easily support 3M edges/s read/stored, but thats ~40k RPCs/s given our column sizes and average batch size.
eru 9 years ago
> However the issue comes when you want to read that data or, more horribly, update that data.
A log structure for your database would make the update case more similar to the append case, wouldn't it?
(There are definite limits to that technique, but it does work for eg some file systems---which are also a type of database.)
- throwawayish 9 years ago
  
  A transaction log needs to be cleaned up, which takes time. You might amortize cleanup over many small objects.
  > but it does work for eg some file systems---which are also a type of database
  No mainstream FS uses the txn log as a primary store. The log(s) are only around for active writes to metadata, as soon as the second write is out the spot in the log can/will be reused. Similar to the clean up case FSes try to batch as many ops as they fit into their log(s) before flushing the transaction.
  Apart from the CoW herd (ZFS, btrfs, HAMMER) most FSes either can't or won't journal data by default anyway. -- Which is a fairly often overlooked point, many people seem to assume that a journaling FS means that everything, include their application data, is journaled, which isn't just wrong, but can also be a rather dangerous assumption; depending on application.
  
  1 reply →
CyberDildonics 9 years ago

In a key value store? That can be done at 1,000,000/s per core.

espeed 9 years ago

This is work is related to the MIT D4M Course, GraphBLAS, and Graphulo:

Standards for Graph Algorithm Primitives http://www.netlib.org/utk/people/JackDongarra/PAPERS/GraphPr...

GraphBLAS: A Programming Specification for Graph Analysis [video] https://www.youtube.com/watch?v=6tnzSiq8QBo

http://graphblas.org

Graphulo: Graph Analytics in Apache Accumulo [video] https://www.youtube.com/watch?v=nsmFjZNl60s

https://github.com/Accla/graphulo

MIT D4M: Signal Processing on Databases http://www.mit.edu/~kepner/D4M/

https://ocw.mit.edu/resources/res-ll-005-d4m-signal-processi...

Video Lectures: https://www.youtube.com/watch?v=zNGKX-4PRsk&list=PLUl4u3cNGP...

Book: Graph Algorithms in the Language of Linear Algebra http://epubs.siam.org/doi/book/10.1137/1.9780898719918

D4M: Bringing Associative Arrays to Database Engines (2015) [pdf] https://arxiv.org/pdf/1508.07371.pdf

privacyfornow 9 years ago

I think this is a piece of the pie. The thing to recognize that I think is just as important is that it is possible to state several common use cases (even synchronous microservices) as collections of append only immutable logs for system of record and an in-memory read/view state for readers and mutating functions.

I am using this pattern in risk, fraud and commerce and once new members in my teams get over the mental barrier of decoupling the append only log from state, it all just clicks for them.

freeman478 9 years ago

Looks interesing ! Isn't this some kind of event sourcing or am I missing something ?

nrjdhsbsid 9 years ago

So... When do you reach the point Where it's better to just use a hashtable in ram? Super high speed "in memory" databases are still beat by manipulating the data structures yourself.

I feel like there's very limited applications where all out speed is important and it's better to use a database than just do the operation yourself in ram and save the network overhead.

You can insert billions of items per second into a hashtable, and when you're working in your own app memory transactions aren't needed.

jandrewrogers 9 years ago

The key point is that they managed to do it with Accumulo; the insertion rate through storage is otherwise unremarkable. For 10 GbE clusters, line-rate insertion has been relatively easy to achieve for several years now.

An important point is that they disabled all of the durability, replication, and safety features. Graph500 records are quite small so that insertion rate given the size of their cluster implies average throughput that is significantly less than line-rate.

hmottestad 9 years ago

I was wondering. Does anyone here use Accumulo internally or for a client?

I had not heard of it before and the line "widely used for government applications" made me wonder why I hadn't. I'm a consultant working with graphs in Norway and this database is completely new to me.

Scaevolus 9 years ago
It was developed by the NSA based on Google's Bigtable paper, so it has a lot of US government users. I think Apache HBase or Cassandra are far more popular NoSQL solutions for most users.
- mikecb 9 years ago
  
  If you don't need cell level access control, you probably wouldn't be looking for it. I guess hbase now offers this too.
threeseed 9 years ago
No. It's not popular at all since you can't get enterprise support for it.
With Cassandra you can goto DataStax and if you are using HBase often you are using Hadoop therefore you can get support from Hortonworks or Cloudera.
- mikecb 9 years ago
  
  Really? Sqrrl and Cloudera offer support.
  
  1 reply →
espeed 9 years ago

The principles applied in the paper aren't limited to Accumulo.

lngnmn 9 years ago

The word database means transactions committed to a persistent, durable storage (such that the data could survive a reboot).

castis 9 years ago
iirc "database" means an organized collection of data. persistence is just a nice feature.
- threeseed 9 years ago
  
  It literally means: "a structured set of data held in a computer, especially one that is accessible in various ways."
  You are right that restart survivable persistence has absolutely nothing to do with it.
  
  1 reply →
- flukus 9 years ago
  
  So I can write a linked list and call it a database?
  
  7 replies →
- lngnmn 9 years ago
  
  Nope. Old definitions still holds. If some hipsters are re-using the word to bullshit people it does not mean that correct connotations are deprecated.
throwawayish 9 years ago

I correct you. You are wrong.