Comment by dangoodmanUT
7 hours ago
it seems like when you snapshot, you snapshot memory AND the filesystem (immutable ftw), that's pretty awesome
i am dying to know: firecracker still? I know you have an upcoming post abt it, but i'm incredibly impatient when it comes to cool new infra
Alright, nerd-snipe snooping research post happening now!
Seems like they are using JuiceFS under the hood, with an overlay root for the CoW semantics. JuiceFS gives them instant clones (because they're not copying the whole rootfs), while changes land in an overlayfs upper layer and are probably synced back to S3 via a custom block device they have mounted into firecracker.
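(to make that concrete, here's a minimal sketch of what an overlay root over a JuiceFS lowerdir could look like - the metadata URL and every path here are placeholders i made up, not anything observed on a sprite:)

```
# hypothetical: base rootfs lives read-only on JuiceFS, per-VM writes go to an upper layer
juicefs mount redis://meta.example:6379/1 /jfs
mkdir -p /vm1/upper /vm1/work /vm1/rootfs
mount -t overlay overlay \
  -o lowerdir=/jfs/images/base-rootfs,upperdir=/vm1/upper,workdir=/vm1/work \
  /vm1/rootfs
# /vm1/rootfs is now CoW: reads fall through to the shared base, writes stay in /vm1/upper
```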
You can also see they are using JuiceFS directly for the "policy" mount (which I'm assuming is the network policy functionality). iirc JuiceFS has support for block devices too, so maybe they are using that to back the rootfs overlay.
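(for anyone who wants to poke at the same thing, the mount table is where this shows up - something along these lines, the exact mount names on a sprite will differ:)

```
# FUSE mounts show up in the mount table; JuiceFS typically reports fuse.juicefs as the fstype
grep -i fuse /proc/mounts
grep -i juicefs /proc/mounts
```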
One concerning thing is the `/var/lib/docker` mount - i ran this in an ubuntu container, did they... attach it? Maybe that's a coincidence, but docker is not installed on the sprite by default. (the terminal is also super busted when used through an ubuntu container)
https://pastebin.com/raw/kt6q9fuA (edit: moved terminal output to pastebin because it was so ugly here)
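(the `/var/lib/docker` thing is easy to check from inside the sprite with plain util-linux tools - nothing sprite-specific here:)

```
# prints a line with source + fstype if /var/lib/docker is its own mount,
# nothing if it's just a directory on the rootfs
findmnt /var/lib/docker
# and a wider view of what else is attached
findmnt -t overlay,fuse,fuse.juicefs
```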
I played with a similar stack recently, my guess is they are:

1. making some base vm and snapshotting it
2. when you create a vm, just restoring a copy and pushing metadata to it (probably via one of the mounts)
3. storing any changes you make to the rootfs on the juicefs block device (the overlay), which is relatively minimal compared to the base os

JuiceFS also supports snapshotting, so that's probably how they support memory + filesystem snapshot and restore so quickly.
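If it is firecracker under the hood, steps 1-2 map pretty directly onto its stock snapshot API - roughly this against the API socket (socket paths and file names are placeholders, and this is just upstream firecracker, nothing sprite-specific):

```
# pause the base VM, then take a full snapshot (vm state + guest memory)
curl --unix-socket /tmp/fc-base.sock -X PATCH http://localhost/vm \
  -H 'Content-Type: application/json' \
  -d '{"state": "Paused"}'
curl --unix-socket /tmp/fc-base.sock -X PUT http://localhost/snapshot/create \
  -H 'Content-Type: application/json' \
  -d '{"snapshot_type": "Full", "snapshot_path": "base.snap", "mem_file_path": "base.mem"}'

# each new sprite: start a fresh firecracker process and restore from the shared snapshot
curl --unix-socket /tmp/fc-clone.sock -X PUT http://localhost/snapshot/load \
  -H 'Content-Type: application/json' \
  -d '{"snapshot_path": "base.snap", "mem_backend": {"backend_type": "File", "backend_path": "base.mem"}, "resume_vm": true}'
```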
interestingly, it seems they provision a max disk size of maybe 100GB for total checkpoints?
```
NAME   TYPE  SIZE  FSTYPE  MOUNTPOINTS
loop0  loop  100G          /.sprite/checkpoints/active
```
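that loop device could be as simple as a sparse file attached at boot - e.g. (pure guess at the mechanism; the mountpoint is just what lsblk reports):

```
# a 100G sparse file takes ~no space on disk until checkpoints are actually written
truncate -s 100G /checkpoints.img
losetup /dev/loop0 /checkpoints.img
# lsblk shows no FSTYPE, so maybe it's used raw; otherwise a mkfs + mount would follow
mkfs.ext4 /dev/loop0
mount /dev/loop0 /.sprite/checkpoints/active
```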
fuse is definitely being used within the VM, i can see a fuse mount and an id being assigned. They're probably using JuiceFS directly for the policy mount because that doesn't need to be cached on local NVMe, just consistent. The local-NVMe -> S3 write-through probably runs on the hypervisor through a custom block device they attach to the firecracker VMM. This might just be the `--cache-dir` + `--writeback` cache options in JuiceFS. Wild guess: one file per block.
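i.e. something like this on the hypervisor side (`--cache-dir`, `--cache-size` and `--writeback` are real juicefs mount flags; the metadata URL, paths and sizes are made up):

```
# writes land in the local NVMe cache first and are flushed to object storage asynchronously
juicefs mount redis://meta.example:6379/1 /jfs \
  --cache-dir /nvme/jfs-cache \
  --cache-size 102400 \
  --writeback
# --cache-size is in MiB, so 102400 ~= 100GiB of local cache
```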
guessing the "s3" here is tigris, since fly.io seems to have a relationship with them, and that probably keeps latency down for the filesystem
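(tigris speaks the S3 API, so on the juicefs side it'd just be a normal S3-compatible target - the bucket URL, metadata URL and volume name below are placeholders:)

```
# juicefs only needs an S3-compatible endpoint for the data store
juicefs format \
  --storage s3 \
  --bucket https://my-bucket.s3.example.com \
  redis://meta.example:6379/1 sprites-vol
```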
i think firecracker - just snooping around a sprite i see a lot of virtio-mmio devices, and afaik Cloud Hypervisor (CHV) would be exposing those over PCI instead
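for reference, this is the kind of thing that gives it away from inside the guest (plain sysfs/procfs, nothing exotic):

```
# firecracker wires virtio devices up over MMIO, so they appear as platform devices
# plus virtio_mmio.device=... entries on the kernel command line
ls /sys/bus/platform/devices/ | grep -i virtio
grep -o 'virtio_mmio[^ ]*' /proc/cmdline
# cloud hypervisor would normally expose virtio over PCI instead, visible via lspci
lspci 2>/dev/null | grep -i virtio
```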