These are notes from a talk I gave walking through the virtio code in
Linux and QEMU.

Mark McLoughlin
2008-11-05

---

buffer queue
============

virtqueue_ops:

 - err = vq->add_buf(vq, sg, out_num, in_num, data);

   add a buffer to the queue; sg is a scatterlist, similar to struct
   iovec for e.g. writev() - a list of buffer pointers and lengths

   "out" buffers are buffers which the guest writes to, "in" buffers
   are buffers that the host writes to

 - data = vq->get_buf(vq, *len);

   fetch a consumed buffer from the queue; the data pointer which was
   passed to add_buf() is returned to identify which buffer this is -
   buffers are not necessarily consumed in the order they were added

   len is the length of the buffer consumed; only really useful for
   "in" buffers

 - vq->kick(vq)

   notify the other side that buffers have been added

 - vq->disable_cb(vq)

   the host will notify us each time it consumes a buffer; this
   function disables that notification

   an example: if we schedule a tasklet to pull consumed buffers from
   the queue, there's no need to get any more notifications

   one quirk is that this hint is ignored when the host has consumed
   all available buffers - we'll be notified anyway

 - more_used = vq->enable_cb(vq)

   re-enable notifications, but also check whether more buffers have
   been consumed; usually in that case we will immediately disable
   notifications again and pull the newly consumed buffers

---

virtio_net code walkthrough:

  transmit starts with net_device::hard_start_xmit(); the virtio
  implementation is start_xmit()

  ignore most of it and show the skb being added to the ->send skb
  queue and xmit_skb() being called

  jump back to the add_buf() description in the slide - the first
  param is a scatterlist; show it being populated

  show skb->cb[48] being used for the first element in the
  scatterlist - skb_vnet_hdr() and vnet_hdr_to_sg()

  skb_to_sgvec() does the rest - it adds the linear data followed by
  any paged data to the scatterlist

  finally, supply add_buf() with the sg entries as "out" buffers; use
  the skb pointer as a "token"

  show ->kick() being called in start_xmit()

Now ...
it's queued up; when the host has consumed it we'll get a callback to
skb_xmit_done(); all we do there is schedule a tasklet, implemented
by xmit_tasklet(), which eventually gets to free_old_xmit_skbs() ...
the data has been sent, so all we need to do is remove the skb from
the send list, update the stats and free it

so - what happens if the ring fills up? we save the skb for later,
stop the network device queue so we won't get any more packets, and
enable the notification so the host will interrupt us when it has
finished with some packets

But! Last chance saloon - maybe some buffers were consumed in the
mean time, and we can disable callbacks, start the queue again and
retry the transmit

brief overview of the receive side: the guest allocates skbs, adds
them as "in" buffers to the queue and waits for notification that the
buffers have been consumed; the guest then pulls them back from the
queue and passes them to the network stack with netif_receive_skb()

---

walk through the host side quickly

  ignore the details of when and where we flush the tx queue and just
  go through virtio_net_flush_tx()

  basically we pop() the buffer from the queue, send it via qemu's
  internal network bridge, push it back onto the queue and notify the
  guest side that it has been consumed

  pop() returns a struct iovec which can be passed directly to
  writev() to inject the packet into the host's networking stack via
  a tun/tap device

  pop()  == give me a buffer
  push() == I'm done with this buffer

---

we're done with the high level virtio transport abstraction; let's
look at the only current implementation of this abstraction, the
virtio ring

a ring buffer is basically a circular list of buffer descriptors; the
producing side adds buffer descriptors to the ring while the
consuming side removes them.
Both sides can proceed in parallel without locking the ring

a virtio buffer queue actually has two rings - the guest is the
producer for the "avail" ring, while the host is the producer for the
"used" ring

the queue can hold a fixed number of buffer descriptors and occupies
a number of pages depending on the ring size:

  entries | pages
  --------+------
      128 |     2
      256 |     3
      512 |     5
     1024 |     8
     2048 |    15

so, what you have is:

 - an array of buffer descriptors - address, length, flags and next,
   plus two flags: "buffer is writeable" and "next is valid"

 - the "available" ring - a circular array of indices into the buffer
   descriptor array; idx gives you the current position of the
   producer in the ring, and flags can be used to tell the host not
   to interrupt the guest when buffers have been consumed (i.e. when
   they are added to the "used" ring)

 - the "used" ring - again a circular array of indices, but each
   entry also has the number of bytes actually consumed; idx and
   flags are similar, except that the NO_NOTIFY flag tells the guest
   not to notify the host as new buffers are added to the avail ring

vring_add_buf()
 - we keep track of free descriptors
 - notice that we add to the avail ring, but don't update avail->idx

vring_kick()
 - here is where we publish the avail->idx update so the host can
   see it
 - unless we've been told not to, we notify the host

vring_get_buf()
 - first check whether used->idx has changed
 - fetch the descriptor table index from used->ring, and the length
   of the buffer used
 - fetch the data associated with the head
 - detach_buf() detaches the chain of descriptors and returns them to
   the free list
 - finally, keep track of our position in used->ring

quick look at enable_cb() and disable_cb()

---

finally, let's look at the qemu code on the host side:

  in virtio.h, the same definition of the ring layout, but with
  different structure names

  virtqueue_init() takes a pointer to the ring pages and figures out
  the location of the descriptor table, avail ring and used ring

  virtqueue_size() figures out how much memory is needed for a ring
with a given number of entries

  virtqueue_pop() reads an entry from the avail queue:

   - sanity checks the number of available entries; N.B. the
     (uint16_t) cast
   - pulls the next head - i.e. an index into the descriptor table
     that begins a chain of buffers
   - follows the chain, initializing the appropriate struct iovec as
     we go
   - checks the chain isn't too long
   - keeps track of the buffer head for push(), and also that there
     is a buffer "in flight" so that we can know not to
     notify-on-empty

   FIXME: the wmb() in next_desc() looks bogus

  virtqueue_push() simply adds the buffer head to the next entry in
  the used ring and increments the index so the guest can see it

   FIXME: the wmb() is a no-op - should it be?

GSO
===

In virtnet_probe(), go through the flags we set:

 - NETIF_F_HIGHDMA - the buffers can be allocated in highmem
 - NETIF_F_HW_CSUM - we can do partial checksums
 - NETIF_F_SG - the packet data does not need to be in a linear
   buffer
 - NETIF_F_FRAGLIST - same, except a slightly different scheme
   involving chained skbs
 - NETIF_F_TSO etc. - the host can do packet segmentation

virtio_net_hdr is initialized in xmit_skb()

receive_skb():
 - note that csum_start and csum_offset are relative to the start of
   the ethernet frame, but eth_type_trans() discards the header

look at try_fill_recv() - we need to allocate larger buffers for GSO
packets

qemu side:

 - in tap_open() we pass IFF_VNET_HDR to TUNSETIFF, meaning "frames
   should be prefixed with a virtio_net_hdr"

 - has_vnet_hdr says "the tapfd has IFF_VNET_HDR enabled";
   using_vnet_hdr means "vlan clients will supply a virtio_net_hdr"

 - in virtio_net_get_features() we expose GSO support if tun/tap has
   support for it

 - in virtio_net_set_features() we call tap_set_offload() to set the
   appropriate offload flags with TUNSETOFFLOAD; this controls what
   the host networking stack will pass down to tun/tap - so we only
   enable what the guest can support

 - okay, in flush_tx() we drop the header before sending if there
   isn't any support for virtio_net_hdr; tap_receive_iov() is called,
   which just calls writev()

 - receive: virtio_net_can_receive() is used to see if the guest can
   take any packets; tap_send() is called when the fd is readable -
   we copy into a buffer, skip over the header if needed, and pass
   the packet to virtio_net_receive()

 - there we pop() a buffer chain, copy the virtio_net_hdr in if we
   need to, and then copy the data into the buffers

 - note: data is copied twice on the receive side, once on the
   transmit side

 - work_around_broken_dhclient() is needed because some guests can't
   handle partial checksums in DHCP replies