> An OS isn't large. Your spotify/slack/browser instance is of comparable size.
A fairly recent Windows 11 Pro image is ~26GB unpacked and 141k dirents. After finishing OOBE it's already running like >100 processes, >1000 threads, and >100k handles. My Chrome install is ~600MB and 115 dirents. (Not including UserData.) It runs ~1 process per tab. Comparable in scope and complexity? That's debatable, but I tend to agree that modern browsers are pretty similar in scope to what an OS should be. (The other day my "web browser" flashed the firmware on the microcontroller for my keyboard.)
They're not even close to "being comparable in size," although I guess that says more about Windows.
Basically all code pages should be the same if some other VM has the same version of ubuntu and running the same version of spotify/slack.
And remember that as well as RAM savings, you also get 'instant loading' because there is no need to do slow SSD accesses to load hundreds of megabytes of a chromium binary to get slack running...
The second I read "shared block cache" my brain went to containers.
If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck, nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload. (Different ISA, different syscalls, different executable format, etc.)
Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest, the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. have a concept of layers, where you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; it's much harder to do at the level of a block device.
Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore, storing the information to dedup common blocks is non-trivial: go look at the manpage for ZFS's deduplication and it is littered w/ warnings about the performance, memory, and storage implications of dealing with the dedup table.
People run containers for two reasons:
#1. They cannot control their devs with python dependencies.
#2. Everyone runs containers! Can't be left behind.
I've tried to use virtio-pmem + DAX for the page cache to not be duplicated between the guest and the host. In practice the RAM overhead of virtio-pmem is unacceptable and it doesn't support discard operations at all. So yes a better solution would be needed.
Well that's all nice, but that would also need to be compute-efficient for it to be worthwhile and near-real-time dedupe of memory pages would be a REALLY tough challenge.
I believe we do this on Windows for Windows Sandbox. It works well but you will take a hit on performance to do the block resolution compared to always paging into physical memory.
Are you sure you're not thinking "copy on write" rather than "zero copy"? The latter implies you can predict in advance which pages will be the same forever...
The pages would be copy-on-write, but since this would mostly be for code pages, they would never be written, and therefore never copied.
By 'zero copy', I mean that when a guest tries to read a page, if another guest has that page in RAM, then no copy operation is done to get it into the memory space of the 2nd guest.
No mention of Cloud Hypervisor [1]…perhaps they don’t know about it? It’s based in part on Firecracker and supports free page reporting, virtio-blk-pci, PCI passthrough, and (I believe) discard in virtio-blk.
We do, and we'd love to use it in the future. We've found that it's not ready for prime time yet and it's missing some features. The biggest problem was that it does not support discard operations yet. Here's a short writeup we did about VMMs that we considered: https://github.com/hocus-dev/hocus/blob/main/rfd/0002-worksp...
The article did an ok job of explaining the firecracker limitations they ran into but it was extremely skimpy when it came to qemu and just rushed to the conclusion “we did a lot of work so try our product.”
Other than making sure we release unused memory to the host, we didn't customize QEMU that much. Although we do have a cool layered storage solution - basically a faster alternative to QCOW2 that's also VMM independent. It's called overlaybd, and was created and implemented in Alibaba. That will probably be another blog post. https://github.com/containerd/overlaybd
I think their usecase makes a lot of sense as their workloads consume a predefined amount of ram. As a customer you rent a VM with a specified amount of memory so fly.io does not care about reclaiming it from a running VM.
Depends on if they're using smart memory allocation to keep costs lower, IE, if they can pattern that certain workloads only need N amount of memory at Y time, they can effectively borrow memory from one VM for usage in another that has an opposite statistical likelihood of needing that memory.
This is why paying for dedicated memory is often more expensive than its counter part, because that dedicated memory is not considered as part of pooling.
> The main issue we've had with QEMU is that it has too many options you need to configure. For instance, enabling your VM to return unused RAM to the host requires at least three challenging tasks
This just works on Hyper-V Linux guests btw. For all the crap MS gets they do some things very right.
I came to the same conclusion as OP. QEMU is the most stable, hackable, well-supported VM hypervisor on the market. Setting it up is a pain, but once you get it set up with all your custom scripts, you never have to do it again. Ever. Even in your next project.
I know that Firecracker does not let you bind mount volumes, but QEMU does. So, we changed to QEMU from Firecracker. If you run the workloads in Kubernetes, you just have to change a single value in a yaml file to change the runtime.
I would be scared to let unknown persons use QEMU that bind mounts volumes as that is a huge security risk. Firecracker, I think, was designed from the start to run un-sanitized workloads, hence, no bind mounting.
I know a good way to make a process make the most of the hardware and play cooperatively with other processes: don't use virtualization.
I will never understand the whole virtual machine and cloud craze. Your operating system is better than any hypervisor at sharing resources efficiently.
Automatic scaling is great. Cloud parallelization (a.k.a fork) is absolutely wild once you get it rolling. Code deployments are incredibly simple. Never having to worry about physical machines or variable traffic loads is worth the small overhead they charge me for the wrapper. The generic system wide permissions model is an absolute joy once you get over the learning curve.
After reading the README of virtualization tools (and looking at the author) I discovered the benefits of using them. I recommend also giving that a try.
Tl;dr: We tried to misuse technology and we failed. If Firecracker was developed for a single binary executed for a short period of time, why do you try to use it for multiple executables running for a long time? Does it make any sense to even try?
"Firecracker is an alternative to QEMU that is purpose-built for running serverless functions and containers safely and efficiently, and nothing more." [1]
Interesting. I guess we are reading a different website.
Listen people, Firecracker is NOT A HYPERVISOR. A hypervisor runs right on the hardware. KVM is a hypervisor. Firecracker is a process that controls KVM. If you want to call firecracker (and QEMU, when used in conjunction with KVM) a VMM ("virtual machine monitor") I won't complain. But please please please, we need a word for what KVM and Xen are, and "hypervisor" is the best fit. Stop using that word for a user-level process like Firecracker.
Nitpick: it’s not accurate to say that a hypervisor, by definition, runs right on the hardware. Xen (as a type-1 hypervisor) has this property; KVM (as a type-2 hypervisor) does not. It’s important to remember that the single core responsibility of a hypervisor is to divide hardware resources and time between VMs, and this decision-making doesn’t require bare-metal.
For those unfamiliar, the informal distinction between type-1 and type-2 is that type-1 hypervisors are in direct control of the allocation of all resources of the physical computer, while type-2 hypervisors operate as some combination of being “part of” / “running on” a host operating system, which owns and allocates the resources. KVM (for example) gives privileged directions to the Linux kernel and its virtualization kernel module for how to manage VMs, and the kernel then schedules and allocates the appropriate system resources. Yes, the type-2 hypervisor needs kernel-mode primitives for managing VMs, and the kernel runs right on the hardware, but those primitives aren’t making management decisions for the division of hardware resources and time between VMs. The type-2 hypervisor is making those decisions, and the hypervisor is scheduled by the OS like any other user-mode process.
Type-1 and type-2 hypervisor is terminology that should at this point be relegated to the past.
It was never popularly used in a way accurate to the origin of the classification - the original paper by Popek and Goldberg talked about formal proofs for the two types, and those have very little to do with how the terms began being used in the 90s and 00s. Things have changed a lot with computers since the 70s, when the paper was written and the terminology was coined.
So, language evolves, and Type-1 and Type-2 came to mean something else in common usage. And this might have made sense to differentiate something like esx from vmware workstation in their capabilities, but it's lost that utility in trying to differentiate Xen from KVM for the overwhelming majority of use cases.
Why would I say it's useless in trying to differentiate, say, Xen and KVM? Couple of reasons:
1) There's no performance benefit to type-1 - a lot of performance sits on the device emulation side, and both are going to default to qemu there. Other parts are based heavily on CPU extensions, and Xen and KVM have equal access there. Both can pass through hardware, support sr-iov, etc., as well.
2) There's no overhead benefit in Xen - you still need a dom0 VM, which is going to arguably be even more overhead than a stripped down KVM setup. There's been work on dom0less Xen, but it's frankly in a rough state and the related drawbacks make it challenging to use in a production environment.
Neither term provides any real advantage or benefit in reasoning between modern hypervisors.
According to the actual paper that introduced the distinction, and adjusting for change of terminology in the last 50 years, a type-1 hypervisor runs in kernel space and a type-2 hypervisor runs in user space. x86 is not virtualizable by a type-2 hypervisor, except by software emulation of the processor.
What actually can change is the amount of work that the kernel-mode hypervisor leaves to a less privileged (user space) component.
Although I’ll note that the line between a VMM and hypervisor are not always clear. E.g., KVM includes some things that other hypervisors delegate to the VMM (such as instruction completion). And macOS’s hypervisor.framework is almost a pass through to the CPU’s raw capabilities.
I think you could help me answer the question that has been in my mind for a month :)
Is there any article that tells the difference and relationship between KVM, QEMU, libvirt, virt-manager, Xen, Proxmox etc. with their typical use cases?
KVM is a Linux kernel module that uses the CPU's virtualization extensions to accelerate VMs to near bare-metal speeds.
Qemu is a user space system emulator. It can emulate different architectures like ARM, x86, etc. in software. It can also emulate devices, networking, disks, etc. It's invoked via the command line.
The reason you'll see Qemu/KVM a lot is because Qemu is the emulator, the things actually running the VM. And it utilizes KVM (on linux, OSX has HVF, for example) to accelerate the VM when the host architecture matches the VM's.
Libvirt is an XML based API on top of Qemu (and others). It allows you to define networks, VMs (it calls them domains), and much more with a unified XML schema through libvirtd.
Virsh is a CLI tool to manage libvirtd. Virt-manager is a GUI to do the same.
Proxmox is Debian under the hood with Qemu/KVM running VMs. It provides a robust web UI and easy clustering capabilities. Along with nice to haves like easy management of disks, ceph, etc. You can also manage Ceph through an API with Terraform.
Xen is an alternative hypervisor (like ESXi). Instead of running on top of Linux, Xen has its own microkernel. This means less flexibility (there's no Linux underneath running things), but also simpler management and less attack surface. I haven't played much with Xen though; KVM is kind of the de facto standard, but IIRC AWS used to use a modified Xen before KVM came along and ate Xen's lunch.
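To make the libvirt layer concrete: a "domain" is one XML document, and tools like virt-manager just generate and submit XML like this. A minimal sketch (the name, disk path, and network here are made up for illustration):

```xml
<!-- Minimal sketch of a libvirt domain definition; name and paths
     are placeholders, not a working production config. -->
<domain type='kvm'>
  <name>demo</name>
  <memory unit='MiB'>512</memory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
  </os>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/demo.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
```

You'd register and boot it with `virsh define demo.xml` followed by `virsh start demo`; libvirt then translates the XML into the corresponding qemu command line.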
KVM is kernel-based virtual machine, with libvirt being its API abstraction over all of it. QEMU is a virtual machine host that leverages kvm or software virtualization to spin up machines on the host. virt-manager does the same. Xen is another virtual machine host, like KVM. Proxmox is a virtual machine manager (like QEMU, virt-manager) but is web based. Libvirt will provide abstraction for kvm,qemu,xen
Use cases: proxmox web interface exposed on your local network on a KVM Linux box that uses QEMU to manage VM’s. Proxmox will allow you to do that from the web. QEMU is great for single or small fleet of machines but should be automated for any heavy lifting. Proxmox will do that.
I don't know if _one_ such article exists, but here is a piece of tech doc from oVirt (yet another tool) that shows how - or that - VDSM is used by oVirt to communicate with QEMU through libvirt: https://www.ovirt.org/develop/architecture/architecture.html...
In really simple terms, so simple that I'm not 100% sure they are correct:
* KVM is a hypervisor, or rather it lets you turn linux into a hypervisor [1], which will let you run VMs on your machine. I've heard KVM is rather hard to work with (steep learning curve). (Xen is also a hypervisor.)
* QEMU is a wrapper-of-a-sorts (a "machine emulator and virtualizer" [2]) which can be used on top of KVM (or Xen). "When used as a virtualizer, QEMU achieves near native performance by executing the guest code directly on the host CPU. QEMU supports virtualization when executing under the Xen hypervisor or using the KVM kernel module in Linux." [2]
* libvirt "is a toolkit to manage virtualization platforms" [3] and is used, e.g., by VDSM to communicate with QEMU.
* virt-manager is "a desktop user interface for managing virtual machines through libvirt" [4]. The screenshots on the project page should give an idea of what its typical use-case is - think VirtualBox and similar solutions.
* Proxmox is the above toolstack (-ish) but as one product.
I think people just pick the coolest sounding term. Imagine someone is sharing what they are working on, what’s cooler sounding “I am working on a virtual machine monitor” or “I am working on a hypervisor”. Hypervisor just sounds futuristic and awesome.
It’s like with “isomorphic” code. That just sounds much cooler than “js that runs on the client and the server”.
I'd love to get a clear explanation of what libvirt actually does. As far as I can tell it's a qemu argument assembler and launcher. For my own use-case, I just launch qemu from systemd unit files:
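For what it's worth, such a unit might look roughly like the sketch below; the binary path, image path, memory size, and flags are placeholders for illustration, not the parent commenter's actual setup:

```ini
# Hypothetical systemd unit launching a KVM-accelerated qemu guest.
[Unit]
Description=demo VM
After=network.target

[Service]
ExecStart=/usr/bin/qemu-system-x86_64 \
  -enable-kvm -machine q35 -cpu host \
  -m 2G -smp 2 \
  -drive file=/var/lib/vms/demo.qcow2,if=virtio \
  -nic user,model=virtio-net-pci \
  -nographic
Restart=on-failure

[Install]
WantedBy=multi-user.target
```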
It's a lot of glue to present a consistent interface but it also does the management part.
"API to virtualization system" would probably be closest approximation but it also does some more advanced stuff like coordinating cross-host VM migration
"Firecracker...'s excellent for running short-lived workloads...A little-known fact about Firecracker is its lack of support... for long-lived workloads."
At CodeSandbox we use Firecracker for hosting development environments, and I agree with the points. Though I don't think that means you should not use Firecracker for running long-lived workloads.
We reclaim memory with a memory balloon device, for the disk trimming we discard (& compress) the disk, and for i/o speed we use io_uring (which we only use for scratch disks, the project disks are network disks).
It's a tradeoff. It's more work and does require custom implementations. For us that made sense, because in return we get a lightweight VMM that we can more easily extend with functionality like memory snapshotting and live VM cloning [1][2].
[1]: https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...
[2]: https://codesandbox.io/blog/cloning-microvms-using-userfault...
I don't know if this is relevant, but I've been intrigued by DragonflyBSD's "vkernel" [0] feature which (supposedly) allows for cloning the entire runtime state of the machine (established TCP connections, etc.) into a completely new userland memory space. I think they use it mostly for kernel debugging right now, but it's interesting to think about the possibilities of being able to just clone an entire running operating system to a new computer without interrupting even a single instruction.
[0] https://www.dragonflybsd.org/docs/handbook/vkernel/
These blogs are wonderful. I'd read them before figuring out firecracker snapshot/restore, but wanted to say it here.
> i/o speed we use io_uring
custom io_uring based driver for the VM block devices? or what do you mean here?
Thank you!
> custom io_uring based driver for the VM block devices? or what do you mean here?
We're using the async io backend that's shipped with Firecracker for our scratch disks.
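For context, each Firecracker block device is configured with a `PUT /drives/{drive_id}` call on its API socket, and the io_uring-backed engine is selected per drive via the `io_engine` field. A minimal sketch of the request body (the helper function is ours, not part of Firecracker):

```python
import json

def drive_config(drive_id, path_on_host, *, root=False, read_only=False,
                 io_engine="Async"):
    """Body for Firecracker's PUT /drives/{drive_id} API call.

    io_engine: "Sync" (the default engine) or "Async" (io_uring-backed),
    which is how e.g. scratch disks can opt into io_uring per drive.
    """
    return json.dumps({
        "drive_id": drive_id,
        "path_on_host": path_on_host,
        "is_root_device": root,
        "is_read_only": read_only,
        "io_engine": io_engine,
    })
```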
Someone posted this and then immediately deleted their comment: https://qemu.readthedocs.io/en/latest/system/i386/microvm.ht...
I didn't know it existed until they posted, but QEMU has a Firecracker-inspired target:
> microvm is a machine type inspired by Firecracker and constructed after its machine model.
> It’s a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.
"the fork was very very bad for eating soup - this is a story about how we migrated to a spoon"
...firecracker does fine what it was designed to - short running fast start workloads.
(oh, and the article starts by slightly misusing a bunch of technical terms, firecracker's not technically a hypervisor per se)
it's not that simple; many other companies running longer-lived jobs, including their competition, use Firecracker
so while Firecracker was designed for things running just a few seconds, there are many places running it with jobs that run way longer than that
the problem is, if you want to make it work with long-running general-purpose images you don't control, you have to put a ton of work into making it work nicely at all levels of your infrastructure and code ... which is costly ... and which a startup competing on an online dev environment (as opposed to e.g. a VM hosting service) probably shouldn't waste time on
So AFAIK the decision in the article makes sense, but the reasons listed for it are oversimplified to the point that you could say they aren't quite right. Idk why; it could be anything from the engineer genuinely believing that, to avoiding issues with some shareholder/project lead who is obsessed with "we need to use Firecracker because the competition does too".
..so is it more to support directly deploying functions to the cloud? Like, what AWS Lambda and CloudFront Functions might be built on?
I'm pretty sure firecracker was literally created to underlie AWS Lambda.
EDIT: Okay, https://www.geekwire.com/2018/firecracker-amazon-web-service... says my "pretty sure" memory is in fact correct.
yes, it was created originally for AWS Lambda
mainly it's optimized to run code only briefly (init time max 10s, max usage 15min, and default max request time 130s AFAIK)
also it's focused on thin serverless functions, e.g. deserialize some request, run some thin, simple business logic, and then delegate to other lambdas based on it. These kinds of functions often have similar memory usage per call, and if a call is an outlier the VM instance can just be discarded soon after (i.e. at most after starting up a new instance, i.e. at most 10s later)
"Firecracker's RAM footprint starts low, but once a workload inside allocates RAM, Firecracker will never return it to the host system."
Firecracker has a balloon device you can inflate (ie: acquire as much memory inside the VM as possible) and then deflate... returning the memory to the host. You can do this while the VM is running.
https://github.com/firecracker-microvm/firecracker/blob/main...
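For reference, the balloon is driven over Firecracker's API socket: a `PUT /balloon` before boot installs the device, and a runtime `PATCH /balloon` changes the target size (inflating reclaims guest memory for the host; deflating returns it). A sketch of the request bodies; the helper names are ours, the field names are from the linked docs:

```python
import json

def configure_balloon(amount_mib, deflate_on_oom=True):
    """Pre-boot PUT /balloon body: install the balloon device."""
    return ("PUT", "/balloon", json.dumps({
        "amount_mib": amount_mib,
        "deflate_on_oom": deflate_on_oom,
        "stats_polling_interval_s": 1,
    }))

def resize_balloon(amount_mib):
    """Runtime PATCH /balloon body: inflate (reclaim) or deflate (return)."""
    return ("PATCH", "/balloon", json.dumps({"amount_mib": amount_mib}))
```

In practice both calls go to Firecracker's unix domain socket, e.g. via `curl --unix-socket`.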
The first footnote says: "If you squint hard enough, you'll find that Firecracker does support dynamic memory management with a technique called ballooning. However, in practice, it's not usable. To reclaim memory, you need to make sure that the guest OS isn't using it, which, for a general-purpose workload, is nearly impossible."
> is nearly impossible
for many mostly "general purpose" use cases it's quite viable, or else ~fly.io~ AWS Fargate wouldn't be able to use it
this doesn't mean it's easy to implement the necessary automated tooling etc.
so depending on your dev resources and priorities it might be a bad choice
still, I feel the article was being a bit subtly judgemental, moving some quite relevant parts into a footnote and omitting that this "supposedly unusable tool" is used successfully by various other companies...
as if it was written by an engineer being overly defensive about their decision, having had to defend it for the 100th time because shareholders, customers, and higher-level management just wouldn't shut up about "but that uses Firecracker"
> which, for a general-purpose workload, is nearly impossible
That depends on the workload and the maximum memory allocated to the guest OS.
A lot of workloads rely on the OS cache/buffers to manage IO, so unless RAM is quite restricted you can make a call to release that pretty easily prior to having the balloon driver do its thing. In fact I'd not be surprised to be told the balloon process does this automatically itself.
If the workload does its own IO management and memory allocation (something like SQL Server, which will eat what RAM it can and does its own IO caching), or the VM's memory allocation is too small for OS caching to be a significant use after the rest of the workload (you might pare memory down to the bare minimum like this for a “fairly static content” server that doesn't see much variation in memory needs and can be allowed to swap a little if things grow temporarily), then I'd believe it is more difficult. That is hardly the use case for firecracker though, so if that is the sort of workload being run, perhaps reassessing the tool used for the job was the right call.
Having said that my use of VMs is generally such that I can give them a good static amount of RAM for their needs and don't need to worry about dynamic allocation, so I'm far from a subject expert here.
And, isn't Firecracker more geared towards short-lived VMs, quick to spin up, do a job, spin down immediately (or after only a short idle timeout if the VM might answer another request that comes in immediately or is already queued)? If so, you are better off cycling VMs, which is probably happening anyway, than messing around with memory balloons. Again, I'm not talking from a position of personal experience here, so corrections/details welcome!
I'm struggling to understand how qemu with free page reporting isn't exactly the same as a firecracker balloon.
Yeah, it's a pretty hard problem, as you'd need to defragment physical memory (while fixing all the virtual-to-physical mappings) to make a contiguous block to free.
A bit disingenuous to make a broad sweeping claim, then have a footnote which contradicts that claim, and upon closer inspection even that claim is incorrect.
It's absolutely usable in practice, it just makes oversubscription more challenging.
That and the fact that this was after "several weeks of testing" tells me this team doesn't have much virtualization experience. Firecracker is designed to quickly virtualize 1 headless stateless app (like a container), not run hundreds of different programs in a developer environment.
Yes, we use this at CodeSandbox for reclaiming memory to the host (and to reduce snapshot size when we hibernate the VM).
I really want VMs to integrate 'smarter' with the host.
For example, if I'm running 5 VMs, there is a good chance that many of the pages are identical. Not only do I want those pages to be deduplicated, but I want them to be zero-copy (ie. not deduplicated after-the-fact by some daemon).
To do that, the guest block cache needs to be integrated with the host block-cache, so that whenever some guest application tries to map data from disk, the host notices that another virtual machine has already caused this data to be loaded, so we can just map the same page of already loaded data into the VM that is asking.
This seems like a security issue waiting to happen when you’re running code from different users.
https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.ht...
zero-copy is harder as one system upgrade on one of them will trash it, but KSM is overall pretty effective at saving some memory on similar VMs
KVM has KSM (kernel samepage merging) since a long time ago that de-duplicates pages.
It has side channel attacks so be careful when enabling: https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
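Worth noting that KSM only scans memory a process has explicitly opted in, which QEMU does by marking guest RAM MADV_MERGEABLE. A minimal sketch of that opt-in (Python for illustration; the 2 MiB anonymous mapping here is just a stand-in for guest RAM):

```python
import mmap

# Anonymous, page-aligned region standing in for guest RAM.
region = mmap.mmap(-1, 2 * 1024 * 1024)
region.write(b"\x00" * len(region))

# Opt the region in to KSM scanning (Linux-only attribute; guarded
# so the sketch is a no-op on platforms without it).
if hasattr(mmap, "MADV_MERGEABLE"):
    region.madvise(mmap.MADV_MERGEABLE)
```

Whether merging then actually happens depends on the host-side scanner being switched on via /sys/kernel/mm/ksm/run.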
But that makes a copy first, and only later notices that the pages are the same and merges them again.
Better to not make copies in the first place.
Doubt it is worth the hassle. How many do you really expect to be identical?
An OS isn't large. Your spotify/slack/browser instance is of comparable size. Says more about browser based apps but still.
> An OS isn't large. Your spotify/slack/browser instance is of comparable size.
A fairly recent Windows 11 Pro image is ~26GB unpacked and 141k dirents. After finishing OOBE it's already running like >100 processes, >1000 threads, and >100k handles. My Chrome install is ~600MB and 115 dirents. (Not including UserData.) It runs ~1 process per tab. Comparable in scope and complexity? That's debatable, but I tend to agree that modern browsers are pretty similar in scope to what an OS should be. (The other day my "web browser" flashed the firmware on the microcontroller for my keyboard.)
They're not even close to "being comparable in size," although I guess that says more about Windows.
Basically all code pages should be the same if some other VM has the same version of ubuntu and running the same version of spotify/slack.
And remember that as well as RAM savings, you also get 'instant loading' because there is no need to do slow SSD accesses to load hundreds of megabytes of a chromium binary to get slack running...
If you already know so much about your application(s), are you sure you need virtualization?
The second I read "shared block cache" my brain went to containers.
If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck, nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload. (Different ISA, different syscalls, different executable format, etc.)
Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest, the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. have a concept of layers and you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; it's much harder to do at the level of a block device.
Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore, storing the information to dedup common blocks is non-trivial: go look at the manpage for ZFS deduplication and it is littered with warnings about the performance, memory, and storage implications of dealing with the dedup table.
[1]: https://www.youtube.com/watch?v=coFIEH3vXPw
People run containers for two reasons: #1. They cannot control their devs with python dependencies. #2. Everyone runs containers! Can't be left behind.
I've tried to use virtio-pmem + DAX for the page cache to not be duplicated between the guest and the host. In practice the RAM overhead of virtio-pmem is unacceptable and it doesn't support discard operations at all. So yes a better solution would be needed.
OpenVZ does this. If you have 5 VMs each loading the same library then memory is conserved, as I understand it.
kvm does the same with KSM.
Well that's all nice, but that would also need to be compute-efficient for it to be worthwhile and near-real-time dedupe of memory pages would be a REALLY tough challenge.
Pretty straightforward for disk blocks. Many VM disks are already de-duped, either through snapshotting or through copy-on-write host filesystems.
The host block cache will end up deduplicating it automatically because all the 'copies' lead back to the same block on disk.
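On a copy-on-write host filesystem you can see this sharing directly: cloning an image with reflinks makes both files point at the same extents, so the host block cache only ever holds one copy (a sketch; requires a filesystem with reflink support such as XFS or btrfs):

```shell
# Clone an image so all blocks are shared copy-on-write.
cp --reflink=always base.img clone.img

# Inspect extents: both files map to the same physical blocks
# until one of them is written to.
filefrag -v base.img clone.img
```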
I believe we do this on Windows for Windows Sandbox. It works well but you will take a hit on performance to do the block resolution compared to always paging into physical memory.
https://learn.microsoft.com/en-us/windows/security/applicati...
Are you sure you're not thinking "copy on write" rather than "zero copy"? The latter implies you can predict in advance which pages will be the same forever...
The pages would be copy-on-write, but since this would mostly be for code pages, they would never be written, and therefore never copied.
By 'zero copy', I mean that when a guest tries to read a page, if another guest has that page in RAM, then no copy operation is done to get it into the memory space of the 2nd guest.
No mention of Cloud Hypervisor [1]…perhaps they don’t know about it? It’s based in part on Firecracker and supports free page reporting, virtio-blk-pci, PCI passthrough, and (I believe) discard in virtio-blk.
[1]: https://www.cloudhypervisor.org/
We do, and we'd love to use it in the future. We've found that it's not ready for prime time yet and it's missing some features. The biggest problem was that it does not support discard operations yet. Here's a short writeup we did about VMMs that we considered: https://github.com/hocus-dev/hocus/blob/main/rfd/0002-worksp...
Thanks for the link to the elaboration! FYI footnotes 3 and 4 seem to be swapped.
The article did an ok job of explaining the firecracker limitations they ran into but it was extremely skimpy when it came to qemu and just rushed to the conclusion “we did a lot of work so try our product.”
yeah I was reading so I could find out what they did.
I understand that they need to sell their product but jeez. don't leave us hanging like that
I didn't want to go into all the technical details, but we have another write-up that goes into details about RAM management: https://github.com/hocus-dev/hocus/blob/main/rfd/0003-worksp...
Other than making sure we release unused memory to the host, we didn't customize QEMU that much. Although we do have a cool layered storage solution - basically a faster alternative to QCOW2 that's also VMM independent. It's called overlaybd, and was created and implemented at Alibaba. That will probably be another blog post. https://github.com/containerd/overlaybd
Fly uses Firecracker, and they host long-running processes. I wonder what's their opinion about it.
I think their use case makes a lot of sense, as their workloads consume a predefined amount of RAM. As a customer you rent a VM with a specified amount of memory, so fly.io does not care about reclaiming it from a running VM.
Depends on whether they're using smart memory allocation to keep costs lower, i.e., if they can spot patterns like certain workloads only needing N amount of memory at time Y, they can effectively borrow memory from one VM for use in another that has an opposite statistical likelihood of needing that memory.
This is why paying for dedicated memory is often more expensive than its counterpart, because that dedicated memory is not considered part of the pooling.
We like Firecracker. People should use whatever makes sense for them.
> The main issue we've had with QEMU is that it has too many options you need to configure. For instance, enabling your VM to return unused RAM to the host requires at least three challenging tasks
This just works on Hyper-V Linux guests btw. For all the crap MS gets they do some things very right.
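For comparison, the QEMU side of returning unused RAM is roughly one device flag once everything else is in place (a sketch; free-page-reporting needs QEMU 5.1+ and a guest kernel with the virtio-balloon driver, and guest.img is a placeholder):

```shell
qemu-system-x86_64 \
  -machine q35 -accel kvm -m 4G \
  -drive file=guest.img,if=virtio \
  -device virtio-balloon,free-page-reporting=on,deflate-on-oom=on
```

The harder tasks the article alludes to are presumably the guest-side plumbing and deciding when to inflate and deflate, not the flag itself.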
It kind of just works. It’s actually broken under Debian 13 for some reason; memory usage infinitely balloons if the feature is enabled.
13 is debian-testing so I guess Microsoft still has time to make it work - last I checked it wasn't yet on Azure supported list.
Presumably this doesn't use the "microvm" machine type in QEMU? (also on front page right now https://news.ycombinator.com/item?id=36673945)
I came to the same conclusion as OP. QEMU is the most stable, hackable, well-supported VM hypervisor on the market. Setting it up is a pain, but once you get it set up with all your custom scripts, you never have to do it again. Ever. Even in your next project.
I toyed with it a bit and was delighted to get it running. Only to discover getting even basic networking going is another mission in itself.
Light is cool but for many tasks that level of Spartan is overkill
If I’m investing time in light it might as well be wasm tech
I know that Firecracker does not let you bind mount volumes, but QEMU does. So, we changed to QEMU from Firecracker. If you run the workloads in Kubernetes, you just have to change a single value in a yaml file to change the runtime.
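If that single value is the runtime class, as with Kata Containers, which offers both Firecracker- and QEMU-backed handlers, the change looks roughly like this (a sketch; handler names such as kata-qemu and kata-fc depend on how your cluster is set up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  runtimeClassName: kata-qemu   # previously e.g. kata-fc (Firecracker)
  containers:
  - name: app
    image: alpine
```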
I would be scared to let unknown persons use QEMU that bind mounts volumes as that is a huge security risk. Firecracker, I think, was designed from the start to run un-sanitized workloads, hence, no bind mounting.
> you just have to change a single value in a yaml file
Most dangerous 12-word sentence.
I know a good way to make a process make the most of the hardware and play cooperatively with other processes: don't use virtualization.
I will never understand the whole virtual machine and cloud craze. Your operating system is better than any hypervisor at sharing resources efficiently.
In this context (the blog post) and the reason firecracker was created, was to isolate workloads.
And if youre running untrusted code, then using a virtualized environment is the easiest (id even say best) way to go about it.
> cloud craze.
Automatic scaling is great. Cloud parallelization (a.k.a fork) is absolutely wild once you get it rolling. Code deployments are incredibly simple. Never having to worry about physical machines or variable traffic loads is worth the small overhead they charge me for the wrapper. The generic system wide permissions model is an absolute joy once you get over the learning curve.
After reading the README of virtualization tools (and looking at the author) I discovered the benefits of using them. I recommend also giving that a try.
I do have to use it since someone early on in the company I work at decided to do everything with AWS and Kubernetes.
The fact of the matter is that it's just inefficient, slow and expensive.
Bare metal is simple, fast, and keeps you in control.
I want to segregate, not to share.
Tl;dr: We tried to misuse technology and we failed. If Firecracker was developed for a single binary executed for a short period of time, why do you try to use it for multiple executables running for a long time? Does it make any sense to even try?
AWS uses Firecracker to execute long-running Fargate tasks; it's hardly misuse.
Where in the "sales" pitch on the fancy-CSS website as well as the README does it say only to use it for single-shot workloads?
I think the complaints are perfectly valid.
"Firecracker is an alternative to QEMU that is purpose-built for running serverless functions and containers safely and efficiently, and nothing more." [1]
Interesting. I guess we are reading a different website.
1. https://firecracker-microvm.github.io/
Listen people, Firecracker is NOT A HYPERVISOR. A hypervisor runs right on the hardware. KVM is a hypervisor. Firecracker is a process that controls KVM. If you want to call firecracker (and QEMU, when used in conjunction with KVM) a VMM ("virtual machine monitor") I won't complain. But please please please, we need a word for what KVM and Xen are, and "hypervisor" is the best fit. Stop using that word for a user-level process like Firecracker.
Nitpick: it’s not accurate to say that a hypervisor, by definition, runs right on the hardware. Xen (as a type-1 hypervisor) has this property; KVM (as a type-2 hypervisor) does not. It’s important to remember that the single core responsibility of a hypervisor is to divide hardware resources and time between VMs, and this decision-making doesn’t require bare-metal.
For those unfamiliar, the informal distinction between type-1 and type-2 is that type-1 hypervisors are in direct control of the allocation of all resources of the physical computer, while type-2 hypervisors operate as some combination of being “part of” / “running on” a host operating system, which owns and allocates the resources. KVM (for example) gives privileged directions to the Linux kernel and its virtualization kernel module for how to manage VMs, and the kernel then schedules and allocates the appropriate system resources. Yes, the type-2 hypervisor needs kernel-mode primitives for managing VMs, and the kernel runs right on the hardware, but those primitives aren’t making management decisions for the division of hardware resources and time between VMs. The type-2 hypervisor is making those decisions, and the hypervisor is scheduled by the OS like any other user-mode process.
Type-1 and type-2 hypervisor is terminology that should at this point be relegated to the past.
It was never popularly used in a way accurate to the origin of the classification. The original paper by Popek and Goldberg talked about formal proofs for the two types, and those have very little to do with how the terms began being used in the 90s and 00s. Things have changed a lot with computers since the 70s, when the paper was written and the terminology was coined.
So, language evolves, and Type-1 and Type-2 came to mean something else in common usage. And this might have made sense to differentiate something like esx from vmware workstation in their capabilities, but it's lost that utility in trying to differentiate Xen from KVM for the overwhelming majority of use cases.
Why would I say it's useless in trying to differentiate, say, Xen and KVM? Couple of reasons:
1) There's no performance benefit to type-1 - a lot of performance sits on the device emulation side, and both are going to default to qemu there. Other parts are based heavily on CPU extensions, and Xen and KVM have equal access there. Both can pass through hardware, support sr-iov, etc., as well.
2) There's no overhead benefit in Xen - you still need a dom0 VM, which is going to arguably be even more overhead than a stripped down KVM setup. There's been work on dom0less Xen, but it's frankly in a rough state and the related drawbacks make it challenging to use in a production environment.
Neither term provides any real advantage or benefit in reasoning between modern hypervisors.
According to the actual paper that introduced the distinction, and adjusting for change of terminology in the last 50 years, a type-1 hypervisor runs in kernel space and a type-2 hypervisor runs in user space. x86 is not virtualizable by a type-2 hypervisor, except by software emulation of the processor.
What actually can change is the amount of work that the kernel-mode hypervisor leaves to a less privileged (user space) component.
For more detail see https://www.spinics.net/lists/kvm/msg150882.html
KVM is a type-1 hypervisor [1]
[1]: https://www.redhat.com/en/topics/virtualization/what-is-KVM
Keep fighting the good fight, friend.
Although I'll note that the line between a VMM and a hypervisor is not always clear. E.g., KVM includes some things that other hypervisors delegate to the VMM (such as instruction completion). And macOS's hypervisor.framework is almost a pass-through to the CPU's raw capabilities.
I think you could help me answer the question that has been in my mind for a month :)
Is there any article that tells the difference and relationship between KVM, QEMU, libvirt, virt-manager, Xen, Proxmox etc. with their typical use cases?
KVM is the Linux kernel's support for the CPU virtualization extensions, accelerating VMs to near bare-metal speeds.
Qemu is a user space system emulator. It can emulate in software different architectures like ARM, x86, etc. It can also emulate drivers, networking, disks, etc. It's invoked via the command line.
The reason you'll see Qemu/KVM a lot is because Qemu is the emulator, the thing actually running the VM. And it utilizes KVM (on Linux; OSX has HVF, for example) to accelerate the VM when the host architecture matches the VM's.
Libvirt is an XML based API on top of Qemu (and others). It allows you to define networks, VMs (it calls them domains), and much more with a unified XML schema through libvirtd.
Virsh is a CLI tool to manage libvirtd. Virt-manager is a GUI to do the same.
Proxmox is Debian under the hood with Qemu/KVM running VMs. It provides a robust web UI and easy clustering capabilities. Along with nice to haves like easy management of disks, ceph, etc. You can also manage Ceph through an API with Terraform.
Xen is an alternative hypervisor (like ESXi). Instead of running on top of Linux, Xen has its own microkernel. This means less flexibility (there's no full Linux underneath running things), but also simpler management and less attack surface. I haven't played much with Xen though; KVM is kind of the de facto standard, but IIRC AWS used to use a modified Xen before KVM came along and ate Xen's lunch.
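The emulator-vs-accelerator split is visible right in how QEMU is invoked (a sketch; the image and kernel paths are placeholders):

```shell
# Cross-architecture: QEMU's TCG emulates the guest CPU in software (slow,
# but runs e.g. an ARM guest on an x86 host).
qemu-system-aarch64 -machine virt -cpu cortex-a72 -m 1G -nographic -kernel Image

# Same architecture: hand CPU virtualization to KVM for near-native speed.
qemu-system-x86_64 -machine q35 -accel kvm -m 1G -nographic -drive file=guest.img,if=virtio
```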
KVM is the kernel-based virtual machine, with libvirt being an API abstraction over all of it. QEMU is a virtual machine host that leverages KVM or software virtualization to spin up machines on the host. virt-manager does the same. Xen is another virtual machine host, like KVM. Proxmox is a virtual machine manager (like QEMU, virt-manager) but is web based. Libvirt provides abstraction for KVM, QEMU, and Xen.
Use cases: the Proxmox web interface exposed on your local network on a KVM Linux box that uses QEMU to manage VMs. Proxmox will allow you to do that from the web. QEMU is great for a single machine or small fleet of machines but should be automated for any heavy lifting. Proxmox will do that.
I don't know if _one_ such article exists, but here is a piece of tech doc from oVirt (yet another tool) that shows how - or that - VDSM is used by oVirt to communicate with QEMU through libvirt: https://www.ovirt.org/develop/architecture/architecture.html...
In really simple terms, so simple that I'm not 100% sure they are correct:
* KVM is a hypervisor, or rather it lets you turn linux into a hypervisor [1], which will let you run VMs on your machine. I've heard KVM is rather hard to work with (steep learning curve). (Xen is also a hypervisor.)
* QEMU is a wrapper-of-a-sorts (a "machine emulator and virtualizer" [2]) which can be used on top of KVM (or Xen). "When used as a virtualizer, QEMU achieves near native performance by executing the guest code directly on the host CPU. QEMU supports virtualization when executing under the Xen hypervisor or using the KVM kernel module in Linux." [2]
* libvirt "is a toolkit to manage virtualization platforms" [3] and is used, e.g., by VDSM to communicate with QEMU.
* virt-manager is "a desktop user interface for managing virtual machines through libvirt" [4]. The screenshots on the project page should give an idea of what its typical use-case is - think VirtualBox and similar solutions.
* Proxmox is the above toolstack (-ish) but as one product.
---
[1] https://www.redhat.com/en/topics/virtualization/what-is-KVM
[2] https://wiki.qemu.org/Main_Page
[3] https://libvirt.org/
[4] https://virt-manager.org/
I think people just pick the coolest sounding term. Imagine someone is sharing what they are working on, what’s cooler sounding “I am working on a virtual machine monitor” or “I am working on a hypervisor”. Hypervisor just sounds futuristic and awesome.
It’s like with “isomorphic” code. That just sounds much cooler than “js that runs on the client and the server”.
> virtual machine monitor
Is it good to think of libvirt as a virtual machine monitor, or is that more "virtual machine management"?
I'd love to get a clear explanation of what libvirt actually does. As far as I can tell it's a qemu argument assembler and launcher. For my own use-case, I just launch qemu from systemd unit files:
https://wiki.archlinux.org/title/QEMU#With_systemd_service
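For readers unfamiliar with that approach, a unit along these lines is all it takes (a sketch; the QEMU flags, unit name, and image path are placeholders):

```ini
# /etc/systemd/system/myvm.service
[Unit]
Description=QEMU guest "myvm"
After=network.target

[Service]
ExecStart=/usr/bin/qemu-system-x86_64 -machine q35 -accel kvm -m 2G \
  -nographic -drive file=/var/lib/vms/myvm.img,if=virtio
Restart=on-failure

[Install]
WantedBy=multi-user.target
```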
It's a lot of glue to present a consistent interface but it also does the management part.
"API to virtualization system" would probably be closest approximation but it also does some more advanced stuff like coordinating cross-host VM migration
"Firecracker...'s excellent for running short-lived workloads...A little-known fact about Firecracker is its lack of support... for long-lived workloads."
Okay.