Comment by kissgyorgy

3 years ago

Can you elaborate why would it be simpler to backup terabytes of files instead of just one?

Not GP, but one disadvantage of updating one huge file is that it's harder to do efficient incremental backups. In theory it can still be done if your backup software supports e.g. content-defined chunking (there was a recent HN thread about Google's rsync-with-fastcdc tool). If you store your assets as separate files instead, though, you get incremental backups trivially with off-the-shelf software like plain old rsync [1].

[1]: https://www.cyberciti.biz/faq/linux-unix-apple-osx-bsd-rsync...
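To illustrate the content-defined chunking idea mentioned above: the trick is to place chunk boundaries based on the file's *content* (a rolling hash) rather than at fixed offsets, so an edit in the middle of a huge file only invalidates a chunk or two. A minimal Python sketch, with an illustrative rolling hash and parameters that are not from any particular tool:

```python
import hashlib
import random

def chunk(data: bytes, avg_size: int = 1024) -> list[bytes]:
    """Split data at content-defined boundaries using a 32-bit
    shift-based rolling hash. Bytes older than ~32 positions shift
    out of the hash, so boundaries resynchronize shortly after an
    edit instead of shifting for the rest of the file."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        # Cut when the hash hits a 1-in-avg_size value, with a
        # minimum chunk size to avoid degenerate tiny chunks.
        if h % avg_size == avg_size - 1 and i - start >= avg_size // 4:
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return chunks

def digests(chunks: list[bytes]) -> set[str]:
    return {hashlib.sha256(c).hexdigest() for c in chunks}

# Insert one byte into the middle of a "huge file": only the
# chunk(s) around the edit change; everything else dedupes
# against the previous backup.
random.seed(0)
original = random.randbytes(200_000)
edited = original[:100_000] + b"\x00" + original[100_000:]
changed = digests(chunk(original)) ^ digests(chunk(edited))
```

With fixed-size blocks, that one-byte insertion would shift every later block and defeat deduplication; content-defined boundaries resynchronize a few bytes after the edit, which is why FastCDC-style tools can back up a single multi-terabyte file incrementally.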

Wow, that is actually an amazing performance curiosity once you add parallelism to the mix. I guess this would depend on the M.2 spec?

  • If you're using 16 PCIe 4.0 lanes you max out at 32 GB/s, although commercial drives tend to have much lower throughput than that maximum (~7.5 GB/s for a good NVMe drive). Cat6a Ethernet tops out at 10 gigabits per second, and plenty of earlier versions have lower caps, e.g. 1 gigabit. My guess is you'll most likely be limited by either disk or network hardware before needing CPU parallelism, if all you're doing is copying bytes from one to the other.

    • The other end being a network socket in this case? But that socket might be two servers over? Meh, ideally they've optimized that as well.

      So absolutely it is a network problem which means custom fiber?

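The back-of-the-envelope comparison above can be made concrete. A throwaway sketch using only the rough figures quoted in the thread (PCIe 4.0 x16, a ~7.5 GB/s NVMe drive, 10 Gbit/s and 1 Gbit/s Ethernet):

```python
# Which link saturates first when streaming bytes from disk to network?
# All numbers are the approximate figures from the comment above.
GB = 1e9  # decimal gigabytes, as drive vendors quote them

links = {
    "PCIe 4.0 x16":  32 * GB,    # ~32 GB/s theoretical maximum
    "NVMe drive":    7.5 * GB,   # good consumer drive, sequential
    "10GbE (Cat6a)": 10e9 / 8,   # 10 Gbit/s -> 1.25 GB/s
    "1GbE":          1e9 / 8,    # 1 Gbit/s -> 0.125 GB/s
}

bottleneck = min(links, key=links.get)
print(f"bottleneck: {bottleneck} at {links[bottleneck] / GB:.3f} GB/s")

# Even 10GbE moves only 1.25 GB/s, about six times slower than the
# drive, so a single copying thread that keeps the NIC full is enough;
# CPU parallelism buys nothing until the network catches up.
```

This backs up the guess above: for a plain byte-copy, the network (and then the disk) saturates long before a single CPU core does.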