Conversation

@dancrossnyc (Contributor):

No description provided.

@dancrossnyc dancrossnyc requested review from citrus-it, iximeow and papertigers and removed request for papertigers December 23, 2025 02:17
@dancrossnyc (Contributor, Author):

Cc @papertigers @rcgoodfellow

@dancrossnyc dancrossnyc force-pushed the virtio-modern branch 2 times, most recently from e7b1927 to 59f9711 Compare December 23, 2025 21:18
@askfongjojo askfongjojo added this to the 18 milestone Dec 29, 2025
@citrus-it (Contributor) left a comment:

I've reviewed this mostly from a VirtIO/viona perspective.

}
}

impl MigrateMulti for PciVirtioViona {
Contributor:
It seems we'll need to push at least the number of selected queue pairs into the migration state here.

Contributor Author:
We do; that information is bundled up into the VirtioQueues type and its migration machinery.

Contributor:
I was expecting to see both 'max pairs' and 'use pairs' in the state so we know what to pass to viona on resume/migration. Will we end up setting the right number of pairs in that case based on negotiated features?

Contributor Author:
That is implicitly encapsulated in the distinction between the vector size and length in VirtQueues.
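
For concreteness, a minimal sketch of the kind of encoding being described, using invented names rather than the actual propolis types: the set of allocated queues bounds "max pairs", and a separate in-use length records what the driver negotiated, so serializing both is enough to tell viona what to set up on resume.

```rust
// Hypothetical sketch only; names are invented for illustration.
struct QueueSet {
    /// Every queue the device offers; its length bounds "max pairs".
    queues: Vec<u16>, // e.g. per-queue sizes
    /// How many queues the driver negotiated for use ("use pairs").
    in_use: usize,
}

impl QueueSet {
    fn max_capacity(&self) -> usize {
        self.queues.len()
    }

    fn len(&self) -> usize {
        self.in_use
    }
}
```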

@papertigers papertigers self-requested a review December 30, 2025 20:58
@dancrossnyc dancrossnyc force-pushed the virtio-modern branch 5 times, most recently from 9586765 to 2c7aa84 Compare January 6, 2026 19:14
@citrus-it citrus-it self-requested a review January 7, 2026 11:22
@dancrossnyc dancrossnyc force-pushed the virtio-modern branch 6 times, most recently from bf65fd9 to 45326c0 Compare January 8, 2026 02:14
@iximeow (Member) left a comment:
I've gone through most of this except for queue.rs and part of viona.rs, but I've got to task switch, so there's no reason to hold the half-finished review. Generally, thanks for getting this together, and apologies for the forever delay in review.

Comment on lines 144 to 150
    /// Maps a VirtIO Device ID to a PCI Device Class.
    pub fn pci_class(self) -> u8 {
        use crate::hw::pci as pci_hw;
        match self {
            Self::Network => pci_hw::bits::CLASS_NETWORK,
            Self::Block | Self::NineP => pci_hw::bits::CLASS_STORAGE,
            _ => pci_hw::bits::CLASS_UNCLASSIFIED,
Member:
I realize this just moves the class definitions from PciVirtioState::new() to one more central lookup, but there's not really an authoritative spec definition for the device classes to use, is there? looking at QEMU I see that 9P is classed as CLASS_NETWORK (o_O?) and other devices seem to be PCI_CLASS_OTHERS == 0xff, where CLASS_UNCLASSIFIED has us setting the class to 0.

while I'm primarily wondering if there's a reference to note (or non-reference to lament :) ), we actually don't expect to get CLASS_UNCLASSIFIED at any point right? seems like this could take the same approach as pci_dev_id and return an Err if we're translating an unexpected DeviceId.

Contributor Author:
9P is network in QEMU? That's kind of weird; it's literally a file protocol. sigh Then again, it's designed to go over a network, so....

I think we probably do expect to see some unclassified things eventually. For instance, if we ever implement virtio console or the entropy device. But we don't have any now, so...result it is.
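
As a sketch of the Result-returning mapping being agreed on, mirroring pci_dev_id; the `UnsupportedDevice` error type here is a placeholder, not the type in the tree:

```rust
// Sketch: translating an unexpected DeviceId becomes an error rather than
// silently falling back to CLASS_UNCLASSIFIED. `UnsupportedDevice` is a
// placeholder error type, not the actual propolis error.
pub fn pci_class(self) -> Result<u8, UnsupportedDevice> {
    use crate::hw::pci as pci_hw;
    match self {
        Self::Network => Ok(pci_hw::bits::CLASS_NETWORK),
        Self::Block | Self::NineP => Ok(pci_hw::bits::CLASS_STORAGE),
        other => Err(UnsupportedDevice(other)),
    }
}
```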

Member:
re. 9P, that's my read of https://github.com/qemu/qemu/blob/3b5fe75e2c30e249acabe29924385782014c7632/hw/virtio/virtio-pci.c#L303-L307 at least. elsewhere it does seem like QEMU structures set a "storage class" bit. I'm not sure what actually comes out at runtime. I dunno, we're not typically leaning on 9P so I'm not too pressed about that difference if so.

Comment on lines 354 to 365
+        let vendor_id = VENDOR_VIRTIO;
+        let sub_vendor_id = VENDOR_VIRTIO;
+        let device_id = device_type.pci_dev_id(mode).expect("Device ID");
+        let sub_device_id = device_type.pci_sub_dev_id();
+        let device_class = device_type.pci_class();
         let mut builder = pci::Builder::new(pci::Ident {
-            vendor_id: VENDOR_VIRTIO,
-            device_id: dev_id,
-            sub_vendor_id: VENDOR_VIRTIO,
-            sub_device_id: sub_dev_id,
-            class: dev_class,
+            vendor_id,
+            device_id,
+            sub_vendor_id,
+            sub_device_id,
+            device_class,
             ..Default::default()
Member:
does it make sense to have a device_type.pci_ident() that does this shuffling? then it could return a Result and pci_dev_id/pci_sub_dev_id/pci_class can be Result-ful as appropriate for their mappings.

Contributor Author:
I've added it, but I'm honestly kind of ambivalent about it. I don't know if it stays on the right side of the balance of finding common ground between the "generic" VirtIO stuff and the VirtIO PCI stuff; maybe the impl bits should have been moved to virtio/pci.rs. But let me know what you think.
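
Roughly the shape under discussion, as a sketch: field names follow the Ident snippet above, `mode` is taken as given from the surrounding code, and it assumes pci_dev_id/pci_class have been made Result-ful (with a placeholder error type) as suggested.

```rust
// Sketch: bundle the fallible lookups behind one constructor so callers
// don't repeat the shuffling. `UnsupportedDevice` is a placeholder error.
pub fn pci_ident(self, mode: Mode) -> Result<pci::Ident, UnsupportedDevice> {
    Ok(pci::Ident {
        vendor_id: VENDOR_VIRTIO,
        device_id: self.pci_dev_id(mode)?,
        sub_vendor_id: VENDOR_VIRTIO,
        sub_device_id: self.pci_sub_dev_id(),
        device_class: self.pci_class()?,
        ..Default::default()
    })
}
```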

Comment on lines +384 to +401
// Note: we don't presently support a non-zero multiplier for the
// notification register, so we don't need to size this for the
// number of queues; hence the fixed size.
Member:
me not really understanding the value of a non-zero multiplier: if I'm reading correctly, the guest notifies a device by writing a virtqueue number to this register, so it's not like a write to this register notifies all queues. but with a non-zero multiplier you could do something like write to queue 1 to say that.. queue 0 has available buffers? is it just an error to notify queue N about a virtqueue other than N?

Contributor:
The data written to the register can be more arbitrary if VIRTIO_F_NOTIF_CONFIG_DATA is negotiated, but that still doesn't need a multiplier.
If the multiplier was chosen to put each register on a separate cache line I suppose there could be a benefit there.
Qemu has a flag (VIRTIO_PCI_FLAG_PAGE_PER_VQ) that sets the multiplier to 0x1000, otherwise it uses 0x4.

Contributor Author:
I suppose it's also possible to support multiple queue notifications simultaneously if you use the multiplier, using multiple threads or something in the host.
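
For reference, the relationship from the VirtIO 1.x spec that the multiplier feeds into: a queue's doorbell lives at cap.offset + queue_notify_off * notify_off_multiplier within the notify BAR region, so a multiplier of zero collapses every queue onto one register, which is why the fixed-size region above suffices. A tiny illustrative helper (not a function in the propolis tree):

```rust
/// Computes a queue's notification offset within the notify capability
/// region, following the VirtIO 1.x layout. Purely illustrative.
fn notify_offset(cap_offset: u64, queue_notify_off: u16, multiplier: u32) -> u64 {
    cap_offset + u64::from(queue_notify_off) * u64::from(multiplier)
}
```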

Comment on lines 652 to 656
        if let Some(queue) = self.queues.get(state.queue_select)
        {
            let mut size = queue.size.lock().unwrap();
            *size = offered;
        }
Member:
Suggested change:

-        if let Some(queue) = self.queues.get(state.queue_select)
-        {
-            let mut size = queue.size.lock().unwrap();
-            *size = offered;
-        }
+        let Some(queue) = self.queues.get(state.queue_select) else {
+            // Invalid queue: drop the write.
+            return;
+        };
+        let mut size = queue.size.lock().unwrap();
+        *size = offered;

here the wrapping might end up hideouser.. but in general this helps fight the rightward drift of getting the queue and gives us a clearer place to say why the non-behavior on queue_select is reasonable.

not demanding this change here, especially because this continues the prior pattern. but if you don't get to it here I'll probably do it in a followup next week.

Contributor Author:
The wrapping was worse, but I introduced a temporary to hold the queue select value and it was ok.

In general, I haven't stressed too hard on things like this because I think we ought to redo the entire mechanism.
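
For reference, the reworked form presumably looks something like this sketch of the snippet above, with a temporary holding the selector:

```rust
// Sketch: a temporary for the selector keeps the let-else on one line.
let qid = state.queue_select;
let Some(queue) = self.queues.get(qid) else {
    // Invalid queue selector: silently drop the write.
    return;
};
let mut size = queue.size.lock().unwrap();
*size = offered;
```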

Comment on lines +841 to +910
    fn pci_cfg_cap_read(
        &self,
        dev: &dyn VirtioDevice,
        id: &PciCfgCapReg,
        op: &mut ReadOp,
    ) {
        let _todo = dev;
        match id {
            PciCfgCapReg::Common(common_id) => match common_id {
                CommonCfgCapReg::CfgType => {
                    op.write_u8(VirtioCfgCapTag::Pci as u8)
                }
                CommonCfgCapReg::Bar => op.write_u8(0), // TODO: Handle
                CommonCfgCapReg::Offset => op.write_u32(0), // TODO: Handle
                CommonCfgCapReg::Length => op.write_u32(0), // TODO: Handle
                _ => self.common_cfg_cap_read(common_id, op),
            },
            PciCfgCapReg::PciData => {
                // TODO: We actually need to handle this.
                op.write_u32(0);
            }
        }
    }

    fn pci_cfg_cap_write(
        &self,
        dev: &dyn VirtioDevice,
        id: &PciCfgCapReg,
        op: &mut WriteOp,
    ) {
        let _todo = (dev, op);
        match id {
            PciCfgCapReg::Common(common_id) => {
                match common_id {
                    CommonCfgCapReg::Bar => {
                        // TODO: Store the bar
                    }
                    CommonCfgCapReg::Offset => {
                        // TODO: Store the offset
                    }
                    CommonCfgCapReg::Length => {
                        // TODO: Store the length
                    }
                    // Everything else is read-only for the driver.
                    _ => {}
                }
            }
            PciCfgCapReg::PciData => {
                // TODO: Handle the write.
            }
        }
    }
Member:
how TODO is this? it seems like leaving this unplumbed would be pretty unfortunate but clearly it's not necessary to see a working NIC?

Contributor Author:
We haven't found a driver that requires it, so it's pretty "meh, should be done... eventually." Given the time constraints, it didn't seem worth it at the moment.

Comment on lines 859 to 862
// The notification addresses (both port and MMIO) for the device can change
// due to guest action, or other administrative tasks within propolis. We
// want to update the in-kernel IO port hook only in the former case, when
// the device emulation is actually running.
Member:
I wouldn't move this comment out onto the function itself: it's really "why is it correct to guard hdl.set_notify_*_port on self.indicator.state() == IndicatedState::Run".

maybe leave this one where it was, and in notify_mmio_addr_update leave a `// Only update the kernel for the same reason as notify_port_update()` note? I think what you're going for is that this really describes the if in both functions, at least.

Contributor Author:
I've word-smithed (or perhaps munged, if not outright mangled) the comments here. PTAL.

Comment on lines 40 to 42
pub const RX_QUEUE_SIZE: u16 = 0x800;
pub const TX_QUEUE_SIZE: u16 = 0x100;
pub const CTL_QUEUE_SIZE: u16 = 32;
Member:
VqSize::new is already const, is there a reason to not make these VqSize and call it here?

Contributor Author:
Done.
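
Presumably along these lines, assuming `VqSize::new(u16)` is the const constructor mentioned above:

```rust
// Sketch of the constants as typed values rather than bare u16s.
pub const RX_QUEUE_SIZE: VqSize = VqSize::new(0x800);
pub const TX_QUEUE_SIZE: VqSize = VqSize::new(0x100);
pub const CTL_QUEUE_SIZE: VqSize = VqSize::new(32);
```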

Comment on lines 55 to 57
/// Types and so forth for supporting the control queue.
pub mod control {
Member:
these types and definitions come from VirtIO also, right? I see what looks like The Truth in 5.1.6.5-ish, but that was after a bit of confusion thinking these would be defined by viona for the VMM to control it.

Contributor Author:
Oops; I missed this in my earlier swing. I added a comment here.

@dancrossnyc dancrossnyc force-pushed the virtio-modern branch 5 times, most recently from 2f401e9 to 8e2f55f Compare January 8, 2026 21:59
@iximeow (Member) left a comment:
ok, I'm +1 on getting this in especially given the time constraints. I'd mentioned on the side I'd prefer to separate the renaming and the meatier changes for multiqueue into a pair of commits that remain distinct in git history - I still think this is worth doing and I'll come back here with a proposal branch splitting this into such a pair (or we get really itchy to get this landed and into dogfood)

I know when we've talked about it you'd mentioned that you've tested across Linux, Windows, OmniOS, and I think a BSD, right? meaning that my questions re. virtio spec and TODOs above, we actually have pretty good confidence these don't leave guests in a risky spot.

Additionally, I recall you mentioned migrating guests in propolis-standalone and their NICs still working, so if I'm recalling correctly, that's great. I saw the conversation where Angela wasn't easily reproducing the relative improvement in net perf, do you know what's up with that?

on the whole, thanks for working on this (and Andy for the earlier review, and the kernel work!!). I hope to get back to you shortly with the proposed pair of commits.


        Ok(Self { queues })

    pub fn set_len(&self, len: usize) {
        assert!(0 < len && len <= self.max_capacity());
Member:
this should error rather than assert (so migration errors instead of panic, and maybe then we'd reconsider the guard before the call in viona.rs), but as a followup

Contributor Author:
Done.
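
A minimal sketch of the fallible shape (field and error names invented here), so a bad incoming length surfaces as a migration error rather than a panic:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct Queues {
    len: AtomicUsize,
    capacity: usize,
}

impl Queues {
    fn set_len(&self, len: usize) -> Result<(), String> {
        if len == 0 || len > self.capacity {
            // Propagate an error the migration path can report.
            return Err(format!("queue count {len} outside 1..={}", self.capacity));
        }
        self.len.store(len, Ordering::Release);
        Ok(())
    }
}
```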

Comment on lines 764 to +992
     pub fn get(&self, qid: u16) -> Option<&Arc<VirtQueue>> {
-        self.queues.get(usize::from(qid))
+        let len = self.len();
+        let qid = usize::from(qid);
+        // XXX: This special case is for viona, which always puts the
+        // control queue at the end of queue vector. None of the other
+        // devices currently handle queues specially in this way, but we
+        // should come up with some better mechanism here.
+        if qid + 1 == len {
Member:
there's a similar mapping for the nvme device where we have a translation from the nvme queue ID to the ID of the block device queue, since the nvme admin queue is otherwise in the same namespace from the guest perspective. something like that here might make sense (VirtIoQueueId -> VionaQueueId) but also might be too typeful. might be worth checking out after this lands.

Contributor Author:
Ack
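
The nvme-style translation might look roughly like this; every name here is invented, and it only illustrates pinning the control queue to the last viona slot:

```rust
// Sketch: newtypes make the guest-visible virtqueue index distinct from the
// index into viona's queue vector, where the control queue sits last.
#[derive(Clone, Copy, PartialEq, Eq)]
struct VirtioQueueId(u16);

#[derive(Clone, Copy, PartialEq, Eq)]
struct VionaQueueId(u16);

fn to_viona(qid: VirtioQueueId, ctl_qid: VirtioQueueId, used: u16) -> VionaQueueId {
    if qid == ctl_qid {
        // The control queue is always the final entry in viona's vector.
        VionaQueueId(used - 1)
    } else {
        VionaQueueId(qid.0)
    }
}
```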

Comment on lines 1000 to 1004
    pub fn iter(
        &self,
    ) -> std::iter::Chain<
        std::slice::Iter<'_, Arc<VirtQueue>>,
        std::array::IntoIter<&Arc<VirtQueue>, 1>,
Member:
we can't get away with impl Iter<Item = &Arc<VirtQueue>>? :( also worth trying after this lands.

Contributor Author:
It appears to work!
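
That is, the verbose Chain<...> signature collapses to an opaque return type. A self-contained sketch with invented field names:

```rust
use std::sync::Arc;

struct VirtQueue;

struct Queues {
    data: Vec<Arc<VirtQueue>>,
    ctl: Arc<VirtQueue>,
}

impl Queues {
    // Same chaining as before, but the return type no longer spells out the
    // iterator adapters.
    fn iter(&self) -> impl Iterator<Item = &Arc<VirtQueue>> {
        self.data.iter().chain([&self.ctl])
    }
}
```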

Comment on lines +727 to +736
    fn set_features(&self, feat: u64) -> Result<(), ()> {
        self.hdl.set_features(feat).map_err(|_| ())?;
        if (feat & VIRTIO_NET_F_MQ) != 0 {
            self.hdl.set_pairs(PROPOLIS_MAX_MQ_PAIRS).map_err(|_| ())?;
            self.set_use_pairs(PROPOLIS_MAX_MQ_PAIRS)?;
        }
        Ok(())
Member:
boring lifecycle q: can a guest set_features repeatedly without reinitializing the device? that is, if you set_features(VIRTIO_NET_F_MQ), negotiate a number of queues, and then set_features(VIRTIO_NET_F_MQ) again, we'll forget the negotiated number and reset to max. I don't know if VirtIO allows that, if this is expected, etc.

happy to consider this in followup, because this won't cause badness outside the guest and relies on a guest driver being odd.

Contributor Author:
No, it's not allowed, as per the spec.

Comment on lines +1157 to +1165
     /// Sets the address that viona recognizes for virtqueue notifications
     ///
-    /// The viona driver is able to install an IO port hook in the associated VM
-    /// at a specified address in order to process `out` operations which would
-    /// result in the in-kernel emulated virtqueues being notified of available
-    /// buffers.
+    /// Viona can install a hook in the associated VM at a specified address (in
+    /// either the guest port or physical address spaces) to recognize guest
+    /// writes that notify in-kernel emulated virtqueues of available buffers.
     ///
     /// With a non-zero argument, viona will attempt to attach such a hook,
     /// replacing any currently in place. When the argument is None, any
     /// existing hook is torn down.
-    fn set_notify_iop(&self, port: Option<NonZeroU16>) -> io::Result<()> {
+    fn set_notify_io_port(&self, port: Option<NonZeroU16>) -> io::Result<()> {
Member:
note for myself: I wonder if we can structure this differently (distinct impl block?) to talk about the notification generally, and either comment on the functions directly instead of trying to genericize the comment on set_notify_io_port, or find that the higher-level lifecycle comment on the impl block obviates the per-function comments entirely.

Contributor Author:
Ack.

}
}

fn is_ctl_queue(&self, vq: &Arc<VirtQueue>) -> bool {
Member:
for the ctl operations here I'd take &VirtQueue so it's clearer at a glance that we're not cloning references to the queues or anything funky like that. also small, also happy for this to be followup

Contributor Author:
Done.
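
As a sketch of the before/after shape (fields invented for illustration): taking &VirtQueue means an &Arc<VirtQueue> simply deref-coerces at the call site, and it's obvious no reference count is being bumped.

```rust
use std::sync::Arc;

struct VirtQueue {
    id: u16,
}

struct Device {
    ctl_id: u16,
}

impl Device {
    // Borrow the queue itself rather than the Arc that owns it.
    fn is_ctl_queue(&self, vq: &VirtQueue) -> bool {
        vq.id == self.ctl_id
    }
}

fn example(dev: &Device, q: &Arc<VirtQueue>) -> bool {
    dev.is_ctl_queue(q) // &Arc<VirtQueue> coerces to &VirtQueue here
}
```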

Comment on lines +422 to +425
        let res = match self.ctl_msg(vq, &mut chain, &mem) {
            Ok(_) => control::Ack::Ok,
            Err(_) => control::Ack::Err,
        } as u8;
Member:
minor Rust thing: I'd do the conversion as let res = (<expr producing control::Ack>).as_u8() or something. doing the as u8 here has you jump back to Ack to double check that Ack variants are numbered, that it does actually have a u8 repr (or discriminants at least are in range), ..

also fine to fix after.
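
Something like the following keeps the u8-ness attached to the type; variant names follow the snippet above, the 0/1 values match the virtio-net OK/ERR ack values, and the method name is just a suggestion:

```rust
#[repr(u8)]
enum Ack {
    Ok = 0,  // VIRTIO_NET_OK
    Err = 1, // VIRTIO_NET_ERR
}

impl Ack {
    // Centralizes the cast so callers don't need to know the repr.
    const fn as_u8(self) -> u8 {
        self as u8
    }
}
```

The call site then reads `let res = match ... { Ok(_) => Ack::Ok, Err(_) => Ack::Err }.as_u8();`.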

@dancrossnyc (Contributor, Author):

> ok, I'm +1 on getting this in especially given the time constraints. I'd mentioned on the side I'd prefer to separate the renaming and the meatier changes for multiqueue into a pair of commits that remain distinct in git history - I still think this is worth doing and I'll come back here with a proposal branch splitting this into such a pair (or we get really itchy to get this landed and into dogfood)

I think we just need to get it in, at this point.

> I know when we've talked about it you'd mentioned that you've tested across Linux, Windows, OmniOS, and I think a BSD, right? meaning that my questions re. virtio spec and TODOs above, we actually have pretty good confidence these don't leave guests in a risky spot.

Yeah. Linux, Windows, FreeBSD, and OmniOS (both with legacy and transitional drivers). Guests seem to behave pretty happily.

> Additionally, I recall you mentioned migrating guests in propolis-standalone and their NICs still working, so if I'm recalling correctly, that's great.

That's right. I stopped/restored a VM (running FreeBSD, which does the weird thing of using multiqueue with the legacy register layout [!!]) and observed that ping continued to work and so on.

> I saw the conversation where Angela wasn't easily reproducing the relative improvement in net perf, do you know what's up with that?

Not really, no. Andy and I empirically observed that viona is not doing a great job of distributing data across the various queues; it's using the output of the kernel's mac_pkt_hash mod the number of ring pairs.

> on the whole, thanks for working on this (and Andy for the earlier review, and the kernel work!!). I hope to get back to you shortly with the proposed pair of commits.

Thanks!

@dancrossnyc dancrossnyc merged commit 6990099 into master Jan 9, 2026
11 checks passed
@dancrossnyc dancrossnyc deleted the virtio-modern branch January 9, 2026 21:57