Conversation

@dancrossnyc (Contributor):

No description provided.

@dancrossnyc dancrossnyc requested review from citrus-it, iximeow and papertigers and removed request for papertigers December 23, 2025 02:17
@dancrossnyc (Contributor, Author):

Cc @papertigers @rcgoodfellow

@dancrossnyc dancrossnyc force-pushed the virtio-modern branch 2 times, most recently from e7b1927 to 59f9711 Compare December 23, 2025 21:18
@askfongjojo askfongjojo added this to the 18 milestone Dec 29, 2025
@citrus-it (Contributor) left a comment:

I've reviewed this mostly from a VirtIO/viona perspective.

}
}

impl MigrateMulti for PciVirtioViona {
Contributor:
It seems we'll need to push at least the number of selected queue pairs into the migration state here.

Contributor Author:
We do; that information is bundled up into the VirtioQueues type and its migration machinery.

Contributor:
I was expecting to see both 'max pairs' and 'use pairs' in the state so we know what to pass to viona on resume/migration. Will we end up setting the right number of pairs in that case based on negotiated features?

Contributor Author:
That is implicitly encapsulated in the distinction between the vector size and length in VirtQueues.
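
For concreteness, a minimal sketch of the kind of encoding being described, using invented names rather than the actual propolis types: the set of allocated queues bounds "max pairs", and a separate in-use length records what the driver negotiated, so serializing both is enough to tell viona what to set up on resume.

```rust
// Hypothetical sketch only; names are invented for illustration.
struct QueueSet {
    /// Every queue the device offers; its length bounds "max pairs".
    queues: Vec<u16>, // e.g. per-queue sizes
    /// How many queues the driver negotiated for use ("use pairs").
    in_use: usize,
}

impl QueueSet {
    fn max_capacity(&self) -> usize {
        self.queues.len()
    }

    fn len(&self) -> usize {
        self.in_use
    }
}
```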

@papertigers papertigers self-requested a review December 30, 2025 20:58
@dancrossnyc dancrossnyc force-pushed the virtio-modern branch 5 times, most recently from 9586765 to 2c7aa84 Compare January 6, 2026 19:14
@citrus-it citrus-it self-requested a review January 7, 2026 11:22
@dancrossnyc dancrossnyc force-pushed the virtio-modern branch 6 times, most recently from bf65fd9 to 45326c0 Compare January 8, 2026 02:14
@iximeow (Member) left a comment:
I've gone through most of this except for queue.rs and part of viona.rs, but I've got to task switch, so there's no reason to hold the half-finished review. Generally, thanks for getting this together, and apologies for the forever delay in review.

Comment on lines 144 to 150
    /// Maps a VirtIO Device ID to a PCI Device Class.
    pub fn pci_class(self) -> u8 {
        use crate::hw::pci as pci_hw;
        match self {
            Self::Network => pci_hw::bits::CLASS_NETWORK,
            Self::Block | Self::NineP => pci_hw::bits::CLASS_STORAGE,
            _ => pci_hw::bits::CLASS_UNCLASSIFIED,
Member:
I realize this just moves the class definitions from PciVirtioState::new() to one more central lookup, but there's not really an authoritative spec definition for the device classes to use, is there? looking at QEMU I see that 9P is classed as CLASS_NETWORK (o_O?) and other devices seem to be PCI_CLASS_OTHERS == 0xff, where CLASS_UNCLASSIFIED has us setting the class to 0.

while I'm primarily wondering if there's a reference to note (or non-reference to lament :) ), we actually don't expect to get CLASS_UNCLASSIFIED at any point right? seems like this could take the same approach as pci_dev_id and return an Err if we're translating an unexpected DeviceId.

Contributor Author:
9P is network in QEMU? That's kind of weird; it's literally a file protocol. sigh Then again, it's designed to go over a network, so....

I think we probably do expect to see some unclassified things eventually. For instance, if we ever implement virtio console or the entropy device. But we don't have any now, so...result it is.
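
As a sketch of the Result-returning mapping being agreed on, mirroring pci_dev_id; the `UnsupportedDevice` error type here is a placeholder, not the type in the tree:

```rust
// Sketch: translating an unexpected DeviceId becomes an error rather than
// silently falling back to CLASS_UNCLASSIFIED. `UnsupportedDevice` is a
// placeholder error type, not the actual propolis error.
pub fn pci_class(self) -> Result<u8, UnsupportedDevice> {
    use crate::hw::pci as pci_hw;
    match self {
        Self::Network => Ok(pci_hw::bits::CLASS_NETWORK),
        Self::Block | Self::NineP => Ok(pci_hw::bits::CLASS_STORAGE),
        other => Err(UnsupportedDevice(other)),
    }
}
```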

Member:
re. 9P, that's my read of https://github.com/qemu/qemu/blob/3b5fe75e2c30e249acabe29924385782014c7632/hw/virtio/virtio-pci.c#L303-L307 at least. elsewhere it does seem like QEMU structures set a "storage class" bit. I'm not sure what actually comes out at runtime. I dunno, we're not typically leaning on 9P so I'm not too pressed about that difference if so.

Comment on lines 354 to 365
+        let vendor_id = VENDOR_VIRTIO;
+        let sub_vendor_id = VENDOR_VIRTIO;
+        let device_id = device_type.pci_dev_id(mode).expect("Device ID");
+        let sub_device_id = device_type.pci_sub_dev_id();
+        let device_class = device_type.pci_class();
         let mut builder = pci::Builder::new(pci::Ident {
-            vendor_id: VENDOR_VIRTIO,
-            device_id: dev_id,
-            sub_vendor_id: VENDOR_VIRTIO,
-            sub_device_id: sub_dev_id,
-            class: dev_class,
+            vendor_id,
+            device_id,
+            sub_vendor_id,
+            sub_device_id,
+            device_class,
             ..Default::default()
Member:
does it make sense to have a device_type.pci_ident() that does this shuffling? then it could return a Result and pci_dev_id/pci_sub_dev_id/pci_class can be Result-ful as appropriate for their mappings.

Contributor Author:
I've added it, but I'm honestly kind of ambivalent about it. I don't know if it stays on the right side of the balance of finding common ground between the "generic" VirtIO stuff and the VirtIO PCI stuff; maybe the impl bits should have been moved to virtio/pci.rs. But let me know what you think.
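
Roughly the shape under discussion, as a sketch: field names follow the Ident snippet above, `mode` is taken as given from the surrounding code, and it assumes pci_dev_id/pci_class have been made Result-ful (with a placeholder error type) as suggested.

```rust
// Sketch: bundle the fallible lookups behind one constructor so callers
// don't repeat the shuffling. `UnsupportedDevice` is a placeholder error.
pub fn pci_ident(self, mode: Mode) -> Result<pci::Ident, UnsupportedDevice> {
    Ok(pci::Ident {
        vendor_id: VENDOR_VIRTIO,
        device_id: self.pci_dev_id(mode)?,
        sub_vendor_id: VENDOR_VIRTIO,
        sub_device_id: self.pci_sub_dev_id(),
        device_class: self.pci_class()?,
        ..Default::default()
    })
}
```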

Comment on lines +384 to +401
// Note: we don't presently support a non-zero multiplier for the
// notification register, so we don't need to size this for the
// number of queues; hence the fixed size.
Member:
me not really understanding the value of a non-zero multiplier: if I'm reading correctly, the guest notifies a device by writing a virtqueue number to this register, so it's not like a write to this register notifies all queues. but with a non-zero multiplier you could do something like write to queue 1 to say that.. queue 0 has available buffers? is it just an error to notify queue N about a virtqueue other than N?

Contributor:
The data written to the register can be more arbitrary if VIRTIO_F_NOTIF_CONFIG_DATA is negotiated, but that still doesn't need a multiplier.
If the multiplier was chosen to put each register on a separate cache line I suppose there could be a benefit there.
Qemu has a flag (VIRTIO_PCI_FLAG_PAGE_PER_VQ) that sets the multiplier to 0x1000, otherwise it uses 0x4.

Contributor Author:
I suppose it's also possible to support multiple queue notifications simultaneously if you use the multiplier, using multiple threads or something in the host.
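
For reference, the relationship from the VirtIO 1.x spec that the multiplier feeds into: a queue's doorbell lives at cap.offset + queue_notify_off * notify_off_multiplier within the notify BAR region, so a multiplier of zero collapses every queue onto one register, which is why the fixed-size region above suffices. A tiny illustrative helper (not a function in the propolis tree):

```rust
/// Computes a queue's notification offset within the notify capability
/// region, following the VirtIO 1.x layout. Purely illustrative.
fn notify_offset(cap_offset: u64, queue_notify_off: u16, multiplier: u32) -> u64 {
    cap_offset + u64::from(queue_notify_off) * u64::from(multiplier)
}
```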

Comment on lines 652 to 656
        if let Some(queue) = self.queues.get(state.queue_select)
        {
            let mut size = queue.size.lock().unwrap();
            *size = offered;
        }
Member:
Suggested change:

-        if let Some(queue) = self.queues.get(state.queue_select)
-        {
-            let mut size = queue.size.lock().unwrap();
-            *size = offered;
-        }
+        let Some(queue) = self.queues.get(state.queue_select) else {
+            // Invalid queue: drop the write.
+            return;
+        };
+        let mut size = queue.size.lock().unwrap();
+        *size = offered;

here the wrapping might end up hideouser.. but in general this helps fight the rightward drift of getting the queue and gives us a clearer place to say why the non-behavior on queue_select is reasonable.

not demanding this change here, especially because this continues the prior pattern. but if you don't get to it here I'll probably do it in a followup next week.

Contributor Author:
The wrapping was worse, but I introduced a temporary to hold the queue select value and it was ok.

In general, I haven't stressed too hard on things like this because I think we ought to redo the entire mechanism.
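
For reference, the reworked form presumably looks something like this sketch of the snippet above, with a temporary holding the selector:

```rust
// Sketch: a temporary for the selector keeps the let-else on one line.
let qid = state.queue_select;
let Some(queue) = self.queues.get(qid) else {
    // Invalid queue selector: silently drop the write.
    return;
};
let mut size = queue.size.lock().unwrap();
*size = offered;
```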

Comment on lines +841 to +910
    fn pci_cfg_cap_read(
        &self,
        dev: &dyn VirtioDevice,
        id: &PciCfgCapReg,
        op: &mut ReadOp,
    ) {
        let _todo = dev;
        match id {
            PciCfgCapReg::Common(common_id) => match common_id {
                CommonCfgCapReg::CfgType => {
                    op.write_u8(VirtioCfgCapTag::Pci as u8)
                }
                CommonCfgCapReg::Bar => op.write_u8(0), // TODO: Handle
                CommonCfgCapReg::Offset => op.write_u32(0), // TODO: Handle
                CommonCfgCapReg::Length => op.write_u32(0), // TODO: Handle
                _ => self.common_cfg_cap_read(common_id, op),
            },
            PciCfgCapReg::PciData => {
                // TODO: We actually need to handle this.
                op.write_u32(0);
            }
        }
    }

    fn pci_cfg_cap_write(
        &self,
        dev: &dyn VirtioDevice,
        id: &PciCfgCapReg,
        op: &mut WriteOp,
    ) {
        let _todo = (dev, op);
        match id {
            PciCfgCapReg::Common(common_id) => {
                match common_id {
                    CommonCfgCapReg::Bar => {
                        // TODO: Store the bar
                    }
                    CommonCfgCapReg::Offset => {
                        // TODO: Store the offset
                    }
                    CommonCfgCapReg::Length => {
                        // TODO: Store the length
                    }
                    // Everything else is read-only for the driver.
                    _ => {}
                }
            }
            PciCfgCapReg::PciData => {
                // TODO: Handle the write.
            }
        }
    }
Member:
how TODO is this? it seems like leaving this unplumbed would be pretty unfortunate but clearly it's not necessary to see a working NIC?

Contributor Author:
We haven't found a driver that requires it, so it's pretty "meh, should be done... eventually." Given the time constraints, it didn't seem worth it at the moment.

Comment on lines 859 to 862
// The notification addresses (both port and MMIO) for the device can change
// due to guest action, or other administrative tasks within propolis. We
// want to update the in-kernel IO port hook only in the former case, when
// the device emulation is actually running.
Member:
I wouldn't move this comment out onto the function itself: it's really "why is it correct to guard hdl.set_notify_*_port on self.indicator.state() == IndicatedState::Run".

maybe leave this one where it was, and in notify_mmio_addr_update leave a `// Only update the kernel for the same reason as notify_port_update()` note? I think what you're going for is that this really describes the if in both functions, at least.

Contributor Author:
I've word-smithed (or perhaps munged, if not outright mangled) the comments here. PTAL.

Comment on lines 40 to 42
pub const RX_QUEUE_SIZE: u16 = 0x800;
pub const TX_QUEUE_SIZE: u16 = 0x100;
pub const CTL_QUEUE_SIZE: u16 = 32;
Member:
VqSize::new is already const, is there a reason to not make these VqSize and call it here?

Contributor Author:
Done.
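
Presumably along these lines, assuming `VqSize::new(u16)` is the const constructor mentioned above:

```rust
// Sketch of the constants as typed values rather than bare u16s.
pub const RX_QUEUE_SIZE: VqSize = VqSize::new(0x800);
pub const TX_QUEUE_SIZE: VqSize = VqSize::new(0x100);
pub const CTL_QUEUE_SIZE: VqSize = VqSize::new(32);
```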

Comment on lines 55 to 57
/// Types and so forth for supporting the control queue.
pub mod control {
Member:
these types and definitions come from VirtIO also, right? I see what looks like The Truth in 5.1.6.5-ish, but that was after a bit of confusion thinking these would be defined by viona for the VMM to control it.

Contributor Author:
Oops; I missed this in my earlier swing. I added a comment here.

@dancrossnyc dancrossnyc force-pushed the virtio-modern branch 5 times, most recently from 2f401e9 to 8e2f55f Compare January 8, 2026 21:59
@iximeow (Member) left a comment:
ok, I'm +1 on getting this in especially given the time constraints. I'd mentioned on the side I'd prefer to separate the renaming and the meatier changes for multiqueue into a pair of commits that remain distinct in git history - I still think this is worth doing and I'll come back here with a proposal branch splitting this into such a pair (or we get really itchy to get this landed and into dogfood)

I know when we've talked about it you'd mentioned that you've tested across Linux, Windows, OmniOS, and I think a BSD, right? meaning that my questions re. virtio spec and TODOs above, we actually have pretty good confidence these don't leave guests in a risky spot.

Additionally, I recall you mentioned migrating guests in propolis-standalone and their NICs still working, so if I'm recalling correctly, that's great. I saw the conversation where Angela wasn't easily reproducing the relative improvement in net perf, do you know what's up with that?

on the whole, thanks for working on this (and Andy for the earlier review, and the kernel work!!). I hope to get back to you shortly with the proposed pair of commits.


        Ok(Self { queues })

    pub fn set_len(&self, len: usize) {
        assert!(0 < len && len <= self.max_capacity());
Member:
this should error rather than assert (so migration errors instead of panic, and maybe then we'd reconsider the guard before the call in viona.rs), but as a followup

Contributor Author:
Done.
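
A minimal sketch of the fallible shape (field and error names invented here), so a bad incoming length surfaces as a migration error rather than a panic:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct Queues {
    len: AtomicUsize,
    capacity: usize,
}

impl Queues {
    fn set_len(&self, len: usize) -> Result<(), String> {
        if len == 0 || len > self.capacity {
            // Propagate an error the migration path can report.
            return Err(format!("queue count {len} outside 1..={}", self.capacity));
        }
        self.len.store(len, Ordering::Release);
        Ok(())
    }
}
```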

Comment on lines 764 to +992
     pub fn get(&self, qid: u16) -> Option<&Arc<VirtQueue>> {
-        self.queues.get(usize::from(qid))
+        let len = self.len();
+        let qid = usize::from(qid);
+        // XXX: This special case is for viona, which always puts the
+        // control queue at the end of queue vector. None of the other
+        // devices currently handle queues specially in this way, but we
+        // should come up with some better mechanism here.
+        if qid + 1 == len {
Member:
there's a similar mapping for the nvme device where we have a translation from the nvme queue ID to the ID of the block device queue, since the nvme admin queue is otherwise in the same namespace from the guest perspective. something like that here might make sense (VirtIoQueueId -> VionaQueueId) but also might be too typeful. might be worth checking out after this lands.

Contributor Author:
Ack
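
The nvme-style translation might look roughly like this; every name here is invented, and it only illustrates pinning the control queue to the last viona slot:

```rust
// Sketch: newtypes make the guest-visible virtqueue index distinct from the
// index into viona's queue vector, where the control queue sits last.
#[derive(Clone, Copy, PartialEq, Eq)]
struct VirtioQueueId(u16);

#[derive(Clone, Copy, PartialEq, Eq)]
struct VionaQueueId(u16);

fn to_viona(qid: VirtioQueueId, ctl_qid: VirtioQueueId, used: u16) -> VionaQueueId {
    if qid == ctl_qid {
        // The control queue is always the final entry in viona's vector.
        VionaQueueId(used - 1)
    } else {
        VionaQueueId(qid.0)
    }
}
```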

Comment on lines 1000 to 1004
    pub fn iter(
        &self,
    ) -> std::iter::Chain<
        std::slice::Iter<'_, Arc<VirtQueue>>,
        std::array::IntoIter<&Arc<VirtQueue>, 1>,
Member:
we can't get away with impl Iter<Item = &Arc<VirtQueue>>? :( also worth trying after this lands.

Contributor Author:
It appears to work!
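
That is, the verbose Chain<...> signature collapses to an opaque return type. A self-contained sketch with invented field names:

```rust
use std::sync::Arc;

struct VirtQueue;

struct Queues {
    data: Vec<Arc<VirtQueue>>,
    ctl: Arc<VirtQueue>,
}

impl Queues {
    // Same chaining as before, but the return type no longer spells out the
    // iterator adapters.
    fn iter(&self) -> impl Iterator<Item = &Arc<VirtQueue>> {
        self.data.iter().chain([&self.ctl])
    }
}
```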

Comment on lines +727 to +736
    fn set_features(&self, feat: u64) -> Result<(), ()> {
        self.hdl.set_features(feat).map_err(|_| ())?;
        if (feat & VIRTIO_NET_F_MQ) != 0 {
            self.hdl.set_pairs(PROPOLIS_MAX_MQ_PAIRS).map_err(|_| ())?;
            self.set_use_pairs(PROPOLIS_MAX_MQ_PAIRS)?;
        }
        Ok(())
Member:
boring lifecycle q: can a guest set_features repeatedly without reinitializing the device? that is, if you set_features(VIRTIO_NET_F_MQ), negotiate a number of queues, and then set_features(VIRTIO_NET_F_MQ) again, we'll forget the negotiated number and reset to max. I don't know if VirtIO allows that, if this is expected, etc.

happy to consider this in followup, because this won't cause badness outside the guest and relies on a guest driver being odd.

Contributor Author:
No, it's not allowed, as per the spec.

Comment on lines +1157 to +1165
     /// Sets the address that viona recognizes for virtqueue notifications
     ///
-    /// The viona driver is able to install an IO port hook in the associated VM
-    /// at a specified address in order to process `out` operations which would
-    /// result in the in-kernel emulated virtqueues being notified of available
-    /// buffers.
+    /// Viona can install a hook in the associated VM at a specified address (in
+    /// either the guest port or physical address spaces) to recognize guest
+    /// writes that notify in-kernel emulated virtqueues of available buffers.
     ///
     /// With a non-zero argument, viona will attempt to attach such a hook,
     /// replacing any currently in place. When the argument is None, any
     /// existing hook is torn down.
-    fn set_notify_iop(&self, port: Option<NonZeroU16>) -> io::Result<()> {
+    fn set_notify_io_port(&self, port: Option<NonZeroU16>) -> io::Result<()> {
Member:
note for myself: I wonder if we can structure this differently (distinct impl block?) to talk about the notification generally, and either comment on the functions directly instead of trying to genericize the comment on set_notify_io_port, or find that the higher-level lifecycle comment on the impl block obviates the per-function comments entirely.

Contributor Author:
Ack.

}
}

fn is_ctl_queue(&self, vq: &Arc<VirtQueue>) -> bool {
Member:
for the ctl operations here I'd take &VirtQueue so it's clearer at a glance that we're not cloning references to the queues or anything funky like that. also small, also happy for this to be followup

Contributor Author:
Done.
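
As a sketch of the before/after shape (fields invented for illustration): taking &VirtQueue means an &Arc<VirtQueue> simply deref-coerces at the call site, and it's obvious no reference count is being bumped.

```rust
use std::sync::Arc;

struct VirtQueue {
    id: u16,
}

struct Device {
    ctl_id: u16,
}

impl Device {
    // Borrow the queue itself rather than the Arc that owns it.
    fn is_ctl_queue(&self, vq: &VirtQueue) -> bool {
        vq.id == self.ctl_id
    }
}

fn example(dev: &Device, q: &Arc<VirtQueue>) -> bool {
    dev.is_ctl_queue(q) // &Arc<VirtQueue> coerces to &VirtQueue here
}
```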

Comment on lines +422 to +425
        let res = match self.ctl_msg(vq, &mut chain, &mem) {
            Ok(_) => control::Ack::Ok,
            Err(_) => control::Ack::Err,
        } as u8;
Member:
minor Rust thing: I'd do the conversion as let res = (<expr producing control::Ack>).as_u8() or something. doing the as u8 here has you jump back to Ack to double check that Ack variants are numbered, that it does actually have a u8 repr (or discriminants at least are in range), ..

also fine to fix after.
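
Something like the following keeps the u8-ness attached to the type; variant names follow the snippet above, the 0/1 values match the virtio-net OK/ERR ack values, and the method name is just a suggestion:

```rust
#[repr(u8)]
enum Ack {
    Ok = 0,  // VIRTIO_NET_OK
    Err = 1, // VIRTIO_NET_ERR
}

impl Ack {
    // Centralizes the cast so callers don't need to know the repr.
    const fn as_u8(self) -> u8 {
        self as u8
    }
}
```

The call site then reads `let res = match ... { Ok(_) => Ack::Ok, Err(_) => Ack::Err }.as_u8();`.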

@dancrossnyc (Contributor, Author):

> ok, I'm +1 on getting this in especially given the time constraints. I'd mentioned on the side I'd prefer to separate the renaming and the meatier changes for multiqueue into a pair of commits that remain distinct in git history - I still think this is worth doing and I'll come back here with a proposal branch splitting this into such a pair (or we get really itchy to get this landed and into dogfood)

I think we just need to get it in, at this point.

> I know when we've talked about it you'd mentioned that you've tested across Linux, Windows, OmniOS, and I think a BSD, right? meaning that my questions re. virtio spec and TODOs above, we actually have pretty good confidence these don't leave guests in a risky spot.

Yeah. Linux, Windows, FreeBSD, and OmniOS (both with legacy and transitional drivers). Guests seem to behave pretty happily.

> Additionally, I recall you mentioned migrating guests in propolis-standalone and their NICs still working, so if I'm recalling correctly, that's great.

That's right. I stopped/restored a VM (running FreeBSD, which does the weird thing of using multiqueue with the legacy register layout [!!]) and observed that ping continued to work and so on.

> I saw the conversation where Angela wasn't easily reproducing the relative improvement in net perf, do you know what's up with that?

Not really, no. Andy and I empirically observed that viona is not doing a great job of distributing data across the various queues; it's using the output of the kernel's mac_pkt_hash mod the number of ring pairs.

> on the whole, thanks for working on this (and Andy for the earlier review, and the kernel work!!). I hope to get back to you shortly with the proposed pair of commits.

Thanks!

@dancrossnyc dancrossnyc merged commit 6990099 into master Jan 9, 2026
11 checks passed
@dancrossnyc dancrossnyc deleted the virtio-modern branch January 9, 2026 21:57