Skip to content

feat: switch to a per-device metrics model for vsock and rng devices#5599

Open
aerosouund wants to merge 12 commits intofirecracker-microvm:mainfrom
aerosouund:vmm-parallel
Open

feat: switch to a per-device metrics model for vsock and rng devices#5599
aerosouund wants to merge 12 commits intofirecracker-microvm:mainfrom
aerosouund:vmm-parallel

Conversation

@aerosouund
Copy link
Copy Markdown
Contributor

@aerosouund aerosouund commented Dec 29, 2025

Changes

Building on the original PR and addressing the reviewer comments.
With the following remarks:

  • Vsock device metrics are keyed in the btreemap by the guest CID which is hardcoded for test instance to be 52. if the desire is to run the tests in a given file, it may be needed to not hardcode it
  • Entropy device metrics use a random generated UUID as the key of the map, because entropy devices don't have a unique identifier in the system that distinguishes them from one another (if one exists, let me know about it)

Reason

Closes #4709

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@aerosouund aerosouund force-pushed the vmm-parallel branch 5 times, most recently from ddcbe7a to 3c724e5 Compare December 29, 2025 20:26
@aerosouund aerosouund marked this pull request as ready for review December 29, 2025 20:26
Copy link
Copy Markdown
Contributor

@Manciukic Manciukic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contibution!

I only had a quick look, but I think we can simplify the logic for devices that only allow a single instance. Maybe a simple RwLock<Arc<Metrics>> would do the job.

Once we align on the solution for single-entity devices, we should also tackle all other devices using global metrics (like mem and pmem which we recently added), and simplify their unit tests as well.

Please keep different devices in different commits for clarity (and feel free to tackle one at a time if you prefer).

I'm also not a big fan of the current per-device metrics for net and block sharing the same metrics Arc if the ID happens to be the same, which defeats the purpose of not sharing metrics. As in real life we can't have devices with the same ID, I'd rather have the device create the object and replace whatever is in the hashmap rather than the other way around.

Comment thread src/vmm/src/devices/virtio/net/tap.rs Outdated
Comment thread src/vmm/src/devices/virtio/rng/metrics.rs Outdated
Comment thread src/vmm/src/devices/virtio/rng/device.rs Outdated
@aerosouund
Copy link
Copy Markdown
Contributor Author

@Manciukic

Interesting thoughts for sure.

I'm also not a big fan of the current per-device metrics for net and block sharing the same metrics Arc if the ID happens to be the same, which defeats the purpose of not sharing metrics. As in real life we can't have devices with the same ID, I'd rather have the device create the object and replace whatever is in the hashmap rather than the other way around.

How does this fit with your previous statement of storing a single Arc in the RwLock ? single instance devices create their arc during initialization and store it directly in the RwLock and multi instance devices also create their own metrics but store it in the hashmap themselves rather than call alloc to give them a metrics instance ?

I think it sounds like a good way forward. Let me know if i misunderstood or you have any other opinions

@Manciukic
Copy link
Copy Markdown
Contributor

single instance devices create their arc during initialization and store it directly in the RwLock and multi instance devices also create their own metrics but store it in the hashmap themselves rather than call alloc to give them a metrics instance ?

Yes, that's what I meant. It seems it would simplify your implementation for the rng device quite much

@aerosouund aerosouund force-pushed the vmm-parallel branch 4 times, most recently from 3cd82a4 to 602c7a0 Compare January 16, 2026 14:03
@aerosouund
Copy link
Copy Markdown
Contributor Author

Hello @Manciukic

I have switched to single instance model in the rng device using a OnceLock and made the vsock device own the creation of its metrics instance. If the current pattern looks satisfactory, i can switch the net and block devices to follow the same pattern, and address the other comments on the PR. Let me know what you think

Comment thread src/vmm/src/devices/virtio/rng/device.rs Outdated
Comment thread src/vmm/src/devices/virtio/rng/metrics.rs Outdated
Comment thread src/vmm/src/devices/virtio/rng/device.rs Outdated
@aerosouund aerosouund force-pushed the vmm-parallel branch 3 times, most recently from 50c057d to 9b6b94c Compare January 22, 2026 18:28
@aerosouund
Copy link
Copy Markdown
Contributor Author

@Manciukic
Reworked the net and block devices to own their metrics creation and left a question regarding the macro thing you mentioned. let me know what you think

@Manciukic Manciukic added the Status: WIP Indicates that an issue is currently being worked on or triaged label Jan 28, 2026
@Manciukic Manciukic self-assigned this Jan 28, 2026
Copy link
Copy Markdown
Contributor

@Manciukic Manciukic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only skimmed through it but I think the approach is the right one. A few things to iron out:

  • We can improve all those unit tests that are reading the old values, doing an operation, and reading the new value and comparing. This was needed because of the global metrics. They should all now become simple assertions on the value of the metric after the operation. We don't need that macro anymore as well
  • We should split changes to different devices in different commits to make it simpler to review. The same for unrelated changes and the improvement to tests
  • Please ensure the PR passes the formatter and buildchecks by running the devtool (I left instructions in one of the comments)
  • We should do the same for all other devices to enable testing in parallel. If you have the time that'd be highly appreciated!

Thanks again for the contribution and sorry for the late reply!

Comment thread src/vmm/src/devices/virtio/net/tap.rs Outdated
Comment thread src/vmm/src/devices/virtio/rng/device.rs Outdated
Comment thread src/vmm/src/devices/virtio/rng/device.rs Outdated
Comment thread src/vmm/src/devices/virtio/rng/device.rs Outdated
Comment thread src/vmm/src/devices/virtio/rng/device.rs
Comment thread src/vmm/src/devices/virtio/rng/metrics.rs
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 83.47826% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.82%. Comparing base (717f593) to head (b1be2d0).
⚠️ Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
src/vmm/src/devices/virtio/balloon/device.rs 61.53% 10 Missing ⚠️
src/vmm/src/devices/virtio/rng/device.rs 66.66% 6 Missing ⚠️
...rc/vmm/src/devices/virtio/balloon/event_handler.rs 16.66% 5 Missing ⚠️
src/vmm/src/devices/virtio/mem/device.rs 84.00% 4 Missing ⚠️
src/vmm/src/devices/virtio/vsock/unix/muxer.rs 63.63% 4 Missing ⚠️
src/vmm/src/devices/virtio/vsock/device.rs 83.33% 3 Missing ⚠️
src/vmm/src/devices/virtio/vsock/event_handler.rs 62.50% 3 Missing ⚠️
src/vmm/src/devices/virtio/balloon/metrics.rs 75.00% 1 Missing ⚠️
src/vmm/src/devices/virtio/mem/metrics.rs 75.00% 1 Missing ⚠️
src/vmm/src/devices/virtio/rng/metrics.rs 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5599      +/-   ##
==========================================
- Coverage   82.88%   82.82%   -0.06%     
==========================================
  Files         276      276              
  Lines       29725    29825     +100     
==========================================
+ Hits        24638    24704      +66     
- Misses       5087     5121      +34     
Flag Coverage Δ
5.10-m5n.metal 83.12% <83.47%> (-0.07%) ⬇️
5.10-m6a.metal 82.45% <83.47%> (-0.07%) ⬇️
5.10-m6g.metal 79.74% <83.47%> (-0.06%) ⬇️
5.10-m6i.metal 83.12% <83.47%> (-0.07%) ⬇️
5.10-m7a.metal-48xl 82.44% <83.47%> (-0.07%) ⬇️
5.10-m7g.metal 79.74% <83.47%> (-0.06%) ⬇️
5.10-m7i.metal-24xl 83.09% <83.47%> (-0.08%) ⬇️
5.10-m7i.metal-48xl 83.10% <83.47%> (-0.06%) ⬇️
5.10-m8g.metal-24xl 79.74% <83.47%> (-0.06%) ⬇️
5.10-m8g.metal-48xl 79.74% <83.47%> (-0.06%) ⬇️
5.10-m8i.metal-48xl 83.09% <83.47%> (-0.07%) ⬇️
5.10-m8i.metal-96xl 83.10% <83.47%> (-0.07%) ⬇️
6.1-m5n.metal 83.14% <83.47%> (-0.07%) ⬇️
6.1-m6a.metal 82.48% <83.47%> (-0.07%) ⬇️
6.1-m6g.metal 79.74% <83.47%> (-0.06%) ⬇️
6.1-m6i.metal 83.15% <83.47%> (-0.07%) ⬇️
6.1-m7a.metal-48xl 82.47% <83.47%> (-0.07%) ⬇️
6.1-m7g.metal 79.74% <83.47%> (-0.06%) ⬇️
6.1-m7i.metal-24xl 83.16% <83.47%> (-0.07%) ⬇️
6.1-m7i.metal-48xl 83.16% <83.47%> (-0.07%) ⬇️
6.1-m8g.metal-24xl 79.74% <83.47%> (-0.06%) ⬇️
6.1-m8g.metal-48xl 79.74% <83.47%> (-0.06%) ⬇️
6.1-m8i.metal-48xl 83.16% <83.47%> (-0.07%) ⬇️
6.1-m8i.metal-96xl 83.17% <83.47%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Manciukic
Copy link
Copy Markdown
Contributor

FYI I've kicked off the CI build: https://buildkite.com/firecracker/firecracker-pr/builds/15702

@aerosouund
Copy link
Copy Markdown
Contributor Author

aerosouund commented Feb 5, 2026

@Manciukic
Thanks for taking the time to review it. Will address your comments along with the CI failures shortly

@aerosouund aerosouund force-pushed the vmm-parallel branch 2 times, most recently from 482b325 to b4873fe Compare March 5, 2026 01:37
@Manciukic
Copy link
Copy Markdown
Contributor

Alright, I've kicked off a new set of workflows.

Comment on lines +47 to +50
seq.serialize_entry("balloon", &METRICS)?;
if let Some(metrics) = METRICS.get() {
seq.serialize_entry("balloon", &metrics)?;
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd serialize a new "zero" object if missing to maintain the current behaviour.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the confusion, I think I may have suggested this in the first place

Copy link
Copy Markdown
Contributor Author

@aerosouund aerosouund Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Manciukic
apologies for the confusion. done, and fixed the failing test.

integration_tests/functional/test_metrics.py::test_flush_metrics[vmlinux-5.10.245-PCI_ON] PASSED [ 16%]
integration_tests/functional/test_metrics.py::test_flush_metrics[vmlinux-5.10.245-PCI_OFF] PASSED [ 33%]
integration_tests/functional/test_metrics.py::test_net_metrics[vmlinux-5.10.245-PCI_ON] PASSED [ 50%]
integration_tests/functional/test_metrics.py::test_net_metrics[vmlinux-5.10.245-PCI_OFF] PASSED [ 66%]
integration_tests/functional/test_metrics.py::test_block_metrics[vmlinux-5.10.245-PCI_ON] PASSED [ 83%]
integration_tests/functional/test_metrics.py::test_block_metrics[vmlinux-5.10.245-PCI_OFF] PASSED [100%]

@aerosouund aerosouund force-pushed the vmm-parallel branch 2 times, most recently from 390540e to 3b99399 Compare March 25, 2026 23:56
@aerosouund
Copy link
Copy Markdown
Contributor Author

@Manciukic
addressed the failure in the functional test for vsock tests, this should be hopefully the last of the errors (fingers crossed). i tried running the full functional test suite after my fix and it passes locally. highly appreciate you triggering another set of tests

@Manciukic
Copy link
Copy Markdown
Contributor

Sorry, busy weeks! I've kicked off the CI to sanity check the PR.

@aerosouund
Copy link
Copy Markdown
Contributor Author

aerosouund commented Apr 5, 2026

@Manciukic
functional tests passed!
the cargo dependencies check failed but i just needed a merge to get the dependencies updated with whats in main to avoid versions with vulnerabilities. can you please trigger another test run ?

report saved to: ../test_results/test-report.json
==================================================== slowest 10 durations =====================================================
69.82s call     integration_tests/security/test_sec_audit.py::test_cargo_audit
0.00s teardown integration_tests/security/test_sec_audit.py::test_cargo_audit
0.00s setup    integration_tests/security/test_sec_audit.py::test_cargo_audit
================================================ 1 passed in 69.83s (0:01:09) =================================================
[Firecracker devtool 2026-04-05T21:56:26+00:00] Finished test run ...

- replace check_metric_after_block with traditional asserts
- replace access to METRICS for incrementing with self.metrics
- make report_balloon_event_fail take an arc of BalloonDeviceMetrics
as an argument to avoid having to reset the METRICS instance after
every test. given that each balloon device will instantiate the metrics
instance during the start of every test

Signed-off-by: aerosouund <aerosound161@gmail.com>
- replace check_metric_after_block with traditional asserts
- replace access to METRICS for incrementing with self.metrics
- change values in tests with asserts that check the total not the delta

Signed-off-by: aerosouund <aerosound161@gmail.com>
- delete the PmemMetricsPerDevice type
- turn METRICS into an rwlock of the metrics btreemap directly
- make a device create its own metrics instance then add it to
the METRICS map which is keyed on the device ids

Signed-off-by: aerosouund <aerosound161@gmail.com>
- replace check_metric_after_block with traditional asserts
- replace access to METRICS for incrementing with self.metrics
- change some test values to assert the absolute value not the delta

Signed-off-by: aerosouund <aerosound161@gmail.com>
- delete the NetMetricsPerDevice type and make METRICS
  an rwlock of a btreemap directly, keyed on the net device id field.
- during net device initialization push the an arc of device metrics to
  the metrics map.

Signed-off-by: aerosouund <aerosound161@gmail.com>
- replace access to METRICS for incrementing with self.metrics
- add derive default to VirtioMemDeviceMetrics
- make the METRICS variable a oncelock of an arc of
  VirtioMemDeviceMetrics and initialize the lock during device init

Signed-off-by: aerosouund <aerosound161@gmail.com>
- turn the METRICS static variable to a btreemap of metrics arc's keyed
  on guest cid.
- the vsock multiplexer is the central piece and the entity that gets
  created first among the different vsock structs. make it create an
  arc of VsockDeviceMetrics and add that to the METRICS map.
- add aggregate() function to VsockDeviceMetrics.
- add unit tests in metrics.rs for aggregation and formatting.
- remove check_metrics_after_block use and replace it with traditional
  asserts. update the assertion values in select tests to make it
  reflect the real value not the delta
- make the connection type either fetch the metrics map entry or create
  its own if not found (test scenario)

Signed-off-by: aerosouund <aerosound161@gmail.com>
- it has been replaced by pure asserts

Signed-off-by: aerosouund <aerosound161@gmail.com>
- its not in use anywhere after the previous commits

Signed-off-by: aerosouund <aerosound161@gmail.com>
- make the vsock device struct take an option of the metric instance
- since the vsock muxer is what creates the metric instance and stores
  it in a map and the device needing to increment to it too, make it a
  parameter of the device initialization as an Option. because in the
  test cases you may want to supply a dummy backend so it won't have a
  valid metrics instance. if the option is None, the device creates its
  own metrics instance. but doesn't add it to the map to avoid
  overriding a muxer's instance and having to handle an existing
  instance in the METRICS map during muxer init.

Signed-off-by: aerosouund <aerosound161@gmail.com>
- change affects balloon, rng and mem devices

Signed-off-by: aerosouund <aerosound161@gmail.com>
- vsock metrics instances were being serialized under the vm cid. not
  the word "vsock" as the schema expects

Signed-off-by: aerosouund <aerosound161@gmail.com>
@aerosouund
Copy link
Copy Markdown
Contributor Author

@Manciukic
sorry for the delay. fixed the merge conflicts and the style checks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status: WIP Indicates that an issue is currently being worked on or triaged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow Running vmm Unittests in Parallel

2 participants