- What is microVM snapshotting?
- Snapshotting in Firecracker
- Firecracker Snapshotting characteristics
- Snapshot versioning
- Snapshot API
- Provisioning host disk space for snapshots
- Ensure continued network connectivity for clones
- Snapshot security and uniqueness
- Vsock device limitation
MicroVM snapshotting is a mechanism through which a running microVM and its
resources can be serialized and saved to an external medium in the form of a
snapshot
. This snapshot can be later used to restore a microVM with its guest
workload at that particular point in time.
Warning
The Firecracker snapshot feature is in developer preview on all CPU micro-architectures listed in README. See this section for more info.
A Firecracker microVM snapshot can be used for loading it later in a different Firecracker process, and the original guest workload is being simply resumed.
The original guest which the snapshot is created from, should see no side effects from this process (other than the latency introduced by the snapshot creation process).
Both network and vsock packet loss can be expected on guests that are resumed from snapshots in another Firecracker process. It is also not guaranteed that the state of the network connections survives the process.
In order to make restoring possible, Firecracker snapshots save the full state of the following resources:
- the guest memory,
- the emulated HW state (both KVM and Firecracker emulated HW).
The state of the components listed above is generated independently, which brings flexibility to our snapshotting support. This means that taking a snapshot results in multiple files that are composing the full microVM snapshot:
- the guest memory file,
- the microVM state file,
- zero or more disk files (depending on how many the guest had; these are managed by the users).
The design allows sharing of memory pages and read only disks between multiple microVMs. When loading a snapshot, instead of loading at resume time the full contents from file to memory, Firecracker creates a MAP_PRIVATE mapping of the memory file, resulting in runtime on-demand loading of memory pages. Any subsequent memory writes go to a copy-on-write anonymous memory mapping. This has the advantage of very fast snapshot loading times, but comes with the cost of having to keep the guest memory file around for the entire lifetime of the resumed microVM.
The Firecracker snapshot design offers a very simple interface to interact with snapshots but provides no functionality to package or manage them on the host.
The threat containment model states that the host, host/API communication and snapshot files are trusted by Firecracker.
To ensure a secure integration with the snapshot functionality, users need to secure snapshot files by implementing authentication and encryption schemes while managing their lifecycle or moving them across the trust boundary, like for example when provisioning them from a repository to a host over the network.
Firecracker is optimized for fast load/resume, and it's designed to do some very basic sanity checks only on the vm state file. It only verifies integrity using a 64-bit CRC value embedded in the vm state file, but this is only a partial measure to protect against accidental corruption, as the disk files and memory file need to be secured as well. It is important to note that CRC computation is validated before trying to load the snapshot. Should it encounter failure, an error will be shown to the user and the Firecracker process will be terminated.
The Firecracker snapshot create/resume performance depends on the memory size, vCPU count and emulated devices count. The Firecracker CI runs snapshot tests on all supported platforms.
The snapshot functionality is still in developer preview due to the following:
- Poor entropy and replayable randomness when resuming multiple microvms from the same snapshot. We do not recommend to use snapshotting in production if there is no mechanism to guarantee proper secrecy and uniqueness between guests. Please see Snapshot security and uniqueness.
- Currently on aarch64 platforms only lower 128 bits of any register are saved
due to the limitations of
get/set_one_reg
fromkvm-ioctls
crate that Firecracker uses to interact with KVM. This creates an issue with newer aarch64 CPUs with support for registers with width greater than 128 bits, because these registers will be truncated before being stored in the snapshot. This can lead to uVM failure if restored from such snapshot. Because registers wider than 128 bits are usually used in SVE instructions, the best way to mitigate this issue is to ensure that the software run in uVM does not use SVE instructions during snapshot creation. An alternative way is to use CPU templates to disable SVE related features in uVM. - High snapshot latency on 5.4+ host kernels due to cgroups V1. We strongly recommend to deploy snapshots on cgroups V2 enabled hosts for the implied kernel versions - related issue.
- Guest network connectivity is not guaranteed to be preserved after resume. For recommendations related to guest network connectivity for clones please see Network connectivity for clones.
- Vsock device does not have full snapshotting support. Please see Vsock device limitation.
- Snapshotting on arm64 works for both GICv2 and GICv3 enabled guests. However, restoring between different GIC version is not possible.
- If a CPU template is not used on x86_64,
overwrites of
MSR_IA32_TSX_CTRL
MSR value will not be preserved after restoring from a snapshot. - Resuming from a snapshot that was taken during early stages of the guest kernel boot might lead to crashes upon snapshot resume. We suggest that users take snapshot after the guest microVM kernel has booted. Please see VMGenID device limitation.
- Fresh Firecracker microVMs are booted using
anonymous
memory, while microVMs resumed from snapshot load memory on-demand from the snapshot and copy-on-write to anonymous memory. - Resuming from a snapshot is optimized for speed, while taking a snapshot involves some extra CPU cycles for synchronously writing dirty memory pages to the memory snapshot file. Taking a snapshot of a fresh microVM, on which dirty pages tracking is not enabled, results in the full contents of guest memory being written to the snapshot.
- The memory file and microVM state file are generated by Firecracker on snapshot creation. The disk contents are not explicitly flushed to their backing files.
- The API calls exposing the snapshotting functionality have clear Prerequisites that describe the requirements on when/how they should be used.
- The Firecracker microVM's MMDS config is included in the snapshot. However, the data store is not persisted across snapshots.
- Configuration information for metrics and logs are not saved to the snapshot. These need to be reconfigured on the restored microVM.
- On x86_64, if a vCPU has MSR_IA32_TSC_DEADLINE set to 0 when a snapshot is taken, Firecracker replaces it with the MSR_IA32_TSC value from the same vCPU. This is to guarantee that the vCPU will continue receiving TSC interrupts after restoring from the snapshot even if an interrupt is lost when taking a snapshot.
The microVM state snapshot file uses a data format that has a version in the
form of MAJOR.MINOR.PATCH
. Each Firecracker binary supports a fixed version of
the snapshot data format. When creating a snapshot, Firecracker will use the
supported data format version. When loading snapshots, Firecracker will check
that the snapshot version is compatible with the version it supports. More
information about the snapshot data format and details about snapshot data
format versions can be found at versioning.
Firecracker exposes the following APIs for manipulating snapshots: Pause
,
Resume
and CreateSnapshot
can be called only after booting the microVM,
while LoadSnapshot
is allowed only before boot.
To create a snapshot, first you have to pause the running microVM and its vCPUs with the following API command:
curl --unix-socket /tmp/firecracker.socket -i \
-X PATCH 'http://localhost/vm' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"state": "Paused"
}'
Prerequisites: The microVM is booted. Successive calls of this request keep
the microVM in the Paused
state. Effects:
- on success: microVM is guaranteed to be
Paused
. - on failure: no side-effects.
Now that the microVM is paused, you can create a snapshot, which can be either a
full
one or a diff
one. Full snapshots always create a complete, resume-able
snapshot of the current microVM state and memory. Diff snapshots save the
current microVM state and the memory dirtied since the last snapshot (full or
diff). Diff snapshots are not resume-able, but can be merged into a full
snapshot. In this context, we will refer to the base as the first memory file
created by a /snapshot/create
API call and the layer as a memory file created
by a subsequent /snapshot/create
API call. The order in which the snapshots
were created matters and they should be merged in the same order in which they
were created. To merge a diff
snapshot memory file on top of a base, users
should copy its content over the base. This can be done using the rebase-snap
(deprecated) or snapshot-editor
tools provided with the firecracker release:
rebase-snap
(deprecated) example:
rebase-snap --base-file path/to/base --diff-file path/to/layer
snapshot-editor
example:
snapshot-editor edit-memory rebase \
--memory-path path/to/base \
--diff-path path/to/layer
After executing the command above, the base would be a resumable snapshot memory file describing the state of the memory at the moment of creation of the layer. More layers which were created later can be merged on top of this base.
This process needs to be repeated for each layer until the one describing the
desired memory state is merged on top of the base, which is constantly updated
with information from previously merged layers. Please note that users should
not merge state files which resulted from /snapshot/create
API calls and they
should use the state file created in the same call as the memory file which was
merged last on top of the base.
For creating a full snapshot, you can use the following API command:
curl --unix-socket /tmp/firecracker.socket -i \
-X PUT 'http://localhost/snapshot/create' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"snapshot_type": "Full",
"snapshot_path": "./snapshot_file",
"mem_file_path": "./mem_file",
}'
Details about the required and optional fields can be found in the swagger definition.
Note: If the files indicated by snapshot_path
and mem_file_path
don't
exist at the specified paths, then they will be created right before generating
the snapshot. If they exist, the files will be truncated and overwritten.
Prerequisites: The microVM is Paused
.
Effects:
-
on success:
- The file indicated by
snapshot_path
(e.g./path/to/snapshot_file
) contains the devices' model state and emulation state. The one indicated bymem_file_path
(e.g./path/to/mem_file
) contains a full copy of the guest memory. - The generated snapshot files are immediately available to be used (current process releases ownership). At this point, the block devices backing files should be backed up externally by the user. Please note that block device contents are only guaranteed to be committed/flushed to the host FS, but not necessarily to the underlying persistent storage (could still live in host FS cache).
- If diff snapshots were enabled, the snapshot creation resets then the dirtied page bitmap and marks all pages clean (from a diff snapshot point of view).
- The file indicated by
-
on failure: no side-effects.
Notes:
- The separate block device file components of the snapshot have to be handled by the user.
For creating a diff snapshot, you should use the same API command, but with
snapshot_type
field set to Diff
.
Note: If not specified, snapshot_type
is by default Full
.
curl --unix-socket /tmp/firecracker.socket -i \
-X PUT 'http://localhost/snapshot/create' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"snapshot_type": "Diff",
"snapshot_path": "./snapshot_file",
"mem_file_path": "./mem_file",
}'
Prerequisites: The microVM is Paused
.
Note: On a fresh microVM, track_dirty_pages
field should be set to true
,
when configuring the /machine-config
resource, while on a snapshot loaded
microVM, enable_diff_snapshots
from PUT /snapshot/load
request body, should
be set.
Effects:
- on success:
- The file indicated by
snapshot_path
contains the devices' model state and emulation state, same as when creating a full snapshot. The one indicated bymem_file_path
contains this time a diff copy of the guest memory; the diff consists of the memory pages which have been dirtied since the last snapshot creation or since the creation of the microVM, whichever of these events was the most recent. - All the other effects mentioned in the Effects paragraph from Creating full snapshots section apply here.
- The file indicated by
- on failure: no side-effects.
Note: This is an example of an API command that enables dirty page tracking:
curl --unix-socket /tmp/firecracker.socket -i \
-X PUT 'http://localhost/machine-config' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"vcpu_count": 2,
"mem_size_mib": 1024,
"smt": false,
"track_dirty_pages": true
}'
Enabling this support enables KVM dirty page tracking, so it comes at a cost (which consists of CPU cycles spent by KVM accounting for dirtied pages); it should only be used when needed.
Creating a snapshot will not influence state, will not stop or end the microVM, it can be used as before, so the microVM can be resumed if you still want to use it. At this point, in case you plan to continue using the current microVM, you should make sure to also copy the disk backing files.
You can resume the microVM by sending the following API command:
curl --unix-socket /tmp/firecracker.socket -i \
-X PATCH 'http://localhost/vm' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"state": "Resumed"
}'
Prerequisites: The microVM is Paused
. Successive calls of this request are
ignored (microVM remains in the running state). Effects:
- on success: microVM is guaranteed to be
Resumed
. - on failure: no side-effects.
If you want to load a snapshot, you can do that only before the microVM is configured (the only resources that can be configured prior are the Logger and the Metrics systems) by sending the following API command:
curl --unix-socket /tmp/firecracker.socket -i \
-X PUT 'http://localhost/snapshot/load' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"snapshot_path": "./snapshot_file",
"mem_backend": {
"backend_path": "./mem_file",
"backend_type": "File"
},
"enable_diff_snapshots": true,
"resume_vm": false
}'
The backend_type
field represents the memory backend type used for loading the
snapshot. Accepted values are:
File
- rely on the kernel to handle page faults when loading the contents of the guest memory file into memory.Uffd
- use a dedicated user space process to handle page faults that occur for the guest memory range. Please refer to this for more details on handling page faults in the user space.
The meaning of backend_path
depends on the backend_type
chosen:
- if using
File
, thenbackend_path
should contain the path to the snapshot's memory file to be loaded. - when using
Uffd
,backend_path
refers to the path of the unix domain socket used for communication between Firecracker and the user space process that handles page faults.
When relying on the OS to handle page faults, the command below is also
accepted. Note that mem_file_path
field is currently under the deprecation
policy. mem_file_path
and mem_backend
are mutually exclusive, therefore
specifying them both at the same time will return an error.
curl --unix-socket /tmp/firecracker.socket -i \
-X PUT 'http://localhost/snapshot/load' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"snapshot_path": "./snapshot_file",
"mem_file_path": "./mem_file",
"enable_diff_snapshots": true,
"resume_vm": false
}'
Details about the required and optional fields can be found in the swagger definition.
Prerequisites: A full memory snapshot and a microVM state file must be provided. The disk backing files, network interfaces backing TAPs and/or vsock backing socket that were used for the original microVM's configuration should be set up and accessible to the new Firecracker process (in which the microVM is resumed). These host-resources need to be accessible at the same relative paths to the new Firecracker process as they were to the original one.
Effects:
- on success:
- The complete microVM state is loaded from snapshot into the current Firecracker process.
- It then resets the dirtied page bitmap and marks all pages clean (from a diff snapshot point of view).
- The loaded microVM is now in the
Paused
state, so it needs to be resumed for it to run. - The memory file (pointed by
backend_path
when usingFile
backend type, or pointed bymem_file_path
) must be considered immutable from Firecracker and host point of view. It backs the guest OS memory for read access through the page cache. External modification to this file corrupts the guest memory and leads to undefined behavior. - The file indicated by
snapshot_path
, that is used to load from, is released and no longer used by this process. - If
enable_diff_snapshots
is set, then diff snapshots can be taken afterwards. - If
resume_vm
is set, the vm is automatically resumed if load is successful.
- on failure: A specific error is reported and then the current Firecracker process is ended (as it might be in an invalid state).
Notes: Please, keep in mind that only by setting to true
enable_diff_snapshots
, when loading a snapshot, or track_dirty_pages
, when
configuring the machine on a fresh microVM, you can then create a diff
snapshot. Also, track_dirty_pages
is not saved when creating a snapshot, so
you need to explicitly set enable_diff_snapshots
when sending
LoadSnapshot
command if you want to be able to do diff snapshots from a loaded
microVM. Another thing that you should be aware of is the following: if a fresh
microVM can create diff snapshots, then if you create a full snapshot, the
memory file contains the whole guest memory, while if you create a diff one,
that file is sparse and only contains the guest dirtied pages. With these in
mind, some possible snapshotting scenarios are the following:
Boot from a fresh microVM
->Pause
->Create snapshot
->Resume
->Pause
->Create snapshot
-> ... ;Boot from a fresh microVM
->Pause
->Create snapshot
->Resume
->Pause
->Resume
-> ... ->Pause
->Create snapshot
-> ... ;Load snapshot
->Resume
->Pause
->Create snapshot
->Resume
->Pause
->Create snapshot
-> ... ;Load snapshot
->Resume
->Pause
->Create snapshot
->Resume
->Pause
->Resume
-> ... ->Pause
->Create snapshot
-> ... ; whereCreate snapshot
can refer to either a full or a diff snapshot for all the aforementioned flows.
It is also worth knowing, a microVM that is restored from snapshot will be resumed with the guest OS wall-clock continuing from the moment of the snapshot creation. For this reason, the wall-clock should be updated to the current time, on the guest-side. More details on how you could do this can be found at a related FAQ.
Depending on VM memory size, snapshots can consume a lot of disk space. Firecracker integrators must ensure that the provisioned disk space is sufficient for normal operation of their service as well as during failure scenarios. If the service exposes the snapshot triggers to customers, integrators must enforce proper disk quotas to avoid any DoS threats that would cause the service to fail or function abnormally.
For recommendations related to continued network connectivity for multiple clones created from a single Firecracker microVM snapshot please see this doc.
When snapshots are used in a such a manner that a given guest's state is resumed from more than once, guest information assumed to be unique may in fact not be; this information can include identifiers, random numbers and random number seeds, the guest OS entropy pool, as well as cryptographic tokens. Without a strong mechanism that enables users to guarantee that unique things stay unique across snapshot restores, we consider resuming execution from the same state more than once insecure.
For more information please see this doc
Boot microVM A -> ... -> Create snapshot S -> Terminate
-> Load S in microVM B -> Resume -> ...
Here, microVM A terminates after creating the snapshot without ever resuming work, and a single microVM B resumes execution from snapshot S. In this case, unique identifiers, random numbers, and cryptographic tokens that are meant to be used once are indeed only used once. In this example, we consider microVM B secure.
Boot microVM A -> ... -> Create snapshot S -> Resume -> ...
-> Load S in microVM B -> Resume -> ...
Here, both microVM A and B do work starting from the state stored in snapshot S. Unique identifiers, random numbers, and cryptographic tokens that are meant to be used once may be used twice. It doesn't matter if microVM A is terminated before microVM B resumes execution from snapshot S or not. In this example, we consider both microVMs insecure as soon as microVM A resumes execution.
Boot microVM A -> ... -> Create snapshot S -> ...
-> Load S in microVM B -> Resume -> ...
-> Load S in microVM C -> Resume -> ...
[...]
Here, both microVM B and C do work starting from the state stored in snapshot S. Unique identifiers, random numbers, and cryptographic tokens that are meant to be used once may be used twice. It doesn't matter at which points in time microVMs B and C resume execution, or if microVM A terminates or not after the snapshot is created. In this example, we consider microVMs B and C insecure, and we also consider microVM A insecure if it resumes execution.
Virtual Machine Generation Identifier (VMGenID) is a virtual device that allows VM guests to detect when they have resumed from a snapshot. It works by exposing a cryptographically random 16-bytes identifier to the guest. The VMM ensures that the value of the indentifier changes every time the VM a time shift happens in the lifecycle of the VM, e.g. when it resumes from a snapshot.
Linux supports VMGenID since version 5.18. When Linux detects a change in the identifier, it uses its value to reseed its internal PRNG. Moreover, since version 6.8 Linux VMGenID driver also emits to userspace a uevent. User space processes can monitor this uevent for detecting snapshot resume events.
Firecracker supports VMGenID device on x86 platforms. Firecracker will always enable the device. During snapshot resume, Firecracker will update the 16-byte generation ID and inject a notification in the guest before resuming its vCPUs.
As a result, guests that run Linux versions >= 5.18 will re-seed their in-kernel PRNG upon snapshot resume. User space applications can rely on the guest kernel for randomness. State other than the guest kernel entropy pool, such as unique identifiers, cached random numbers, cryptographic tokens, etc will still be replicated across multiple microVMs resumed from the same snapshot. Users need to implement mechanisms for ensuring de-duplication of such state, where needed. On guests that run Linux versions >= 6.8, users can make use of the uevent that VMGenID driver emits upon resuming from a snapshot, to be notified about snapshot resume events.
Vsock must be inactive during snapshot. Vsock device can break if snapshotted while having active connections. Firecracker snapshots do not capture any inflight network or vsock (through the linux unix domain socket backend) traffic that has left or not yet entered Firecracker.
The above, coupled with the fact that Vsock control protocol is not resilient to vsock packet loss, leads to Vsock device breakage when doing a snapshot while there are active Vsock connections.
As a solution to the above issue, active Vsock connections prior to snapshotting
the VM are forcibly closed by sending a specific event called
VIRTIO_VSOCK_EVENT_TRANSPORT_RESET
. The event is sent on SnapshotCreate
. On
SnapshotResume
, when the VM becomes active again, the vsock driver closes all
existing connections. Listen sockets still remain active. Users wanting to build
vsock applications that use the snapshot capability have to take this into
consideration. More details about this event can be found in the official Virtio
document here,
section 5.10.6.6 Device Events.
Firecracker handles sending the reset
event to the vsock driver, thus the
customers are no longer responsible for closing active connections.
During snashot resume, Firecracker updates the 16-byte generation ID of the VMGenID device and injects an interrupt in the guest before resuming vCPUs. If the snapshot was taken at the very early stages of the guest kernel boot process proper interrupt handling might not be in place yet. As a result, the kernel might not be able to handle the injected notification and crash. We suggest to users that they take snapshots only after the guest kernel has completed booting, to avoid this issue.
We have a mechanism in place to experiment with snapshot compatibility across supported host kernel versions by generating snapshot artifacts through this tool and checking devices' functionality using this test. The microVM snapshotted is built from this configuration file. The test restores the snapshot and ensures that all the devices set-up in the configuration file (network devices, disk, vsock, balloon and MMDS) are operational post-load.
In those tests the instance is fixed, except some combinations where we also test across the same CPU family (Intel x86, Gravitons). In general cross-CPU snapshots are not supported
The tables below reflect the snapshot compatibility observed on the AWS instances we support.
all means all currently supported Intel/AMD/ARM metal instances (m6g, m7g, m5n, c5n, m6i, m6a). It does not mean cross-instance, i.e. a snapshot taken on m6i won't work on an m6g instance.
CPU family | taken on host kernel | restored on host kernel | working? |
---|---|---|---|
x86_64 | 4.14 | 5.10 | Y |
all | 5.10 | 4.14 | N |
all | 5.10 | 6.1 | Y |
all | 6.1 | 5.10 | Y |
What doesn't work:
- Graviton 4.14 <-> 5.10 does not restore due to register incompatibility.
- Intel 5.10 -> 4.14 does not restore because unresponsive net devices
- AMD m6a 5.10 -> 4.14 does not restore due to mismatch in MSRs