These are notes from a talk I gave walking through the virtio code in
Linux and QEMU.

Mark McLoughlin
2008-11-05

---

buffer queue
============

virtqueue_ops:

 - err = vq->add_buf(vq, sg, out_num, in_num, data);

   add a buffer to the queue; sg is a scatterlist, similar to struct
   iovec for e.g. writev() - a list of buffer pointers and lengths

   "out" buffers are buffers which the guest writes to, "in" buffers
   are buffers that the host writes to

 - data = vq->get_buf(vq, *len);

   fetch a consumed buffer from the queue; the data pointer which was
   passed to add_buf() is returned to identify which buffer this is -
   buffers are not necessarily consumed in the order they were added

   len is the length of the buffer consumed; only really useful for
   "in" buffers

 - vq->kick(vq)

   notify the other side that buffers have been added

 - vq->disable_cb(vq)

   the host will notify us each time it consumes a buffer; this
   function disables that notification

   an example: if we schedule a tasklet to pull consumed buffers from
   the queue, there's no need to get any more notifications

   one quirk is that this hint is ignored when the host has consumed
   all available buffers - we'll be notified anyway

 - more_used = vq->enable_cb(vq)

   re-enable notifications, but also check whether more buffers have
   been consumed; usually in that case we will immediately disable
   notifications again and pull the newly consumed buffers

---

virtio_net code walkthrough:

  transmit starts with net_device::hard_start_xmit(); the virtio
  implementation is start_xmit()

  ignore most of it and show the skb being added to the ->send skb
  queue and xmit_skb() being called

  jump back to the add_buf() description in the slide - the first
  param is a scatterlist; show it being populated

  show skb->cb[48] being used for the first element in the
  scatterlist - skb_vnet_hdr() and vnet_hdr_to_sg()

  skb_to_sgvec() does the rest - it adds the linear data followed by
  any paged data to the scatterlist

  finally, supply add_buf() with the sg entries as "out" buffers; use
  the skb pointer as a "token"

  show ->kick() being called in start_xmit()

Now ...
it's queued up; when the host has consumed it we'll get a callback to
skb_xmit_done(); all we do there is schedule a tasklet, implemented
by xmit_tasklet(), which eventually gets to free_old_xmit_skbs() ...
the data has been sent, so all we need to do is remove the skb from
the send list, update the stats and free it

so - what happens if the ring fills up? we save the skb for later,
stop the network device queue so we won't get any more packets, and
enable the notification so the host will interrupt us when it has
finished with some packets

But! Last chance saloon - maybe some buffers were consumed in the
mean time, and we can disable callbacks, start the queue again and
retry the transmit

brief overview of the receive side: the guest allocates skbs, adds
them as "in" buffers to the queue and waits for notification that the
buffers have been consumed; the guest then pulls them back from the
queue and passes them to the network stack with netif_receive_skb()

---

walk through the host side quickly

  ignore the details of when and where we flush the tx queue and just
  go through virtio_net_flush_tx()

  basically we pop() the buffer from the queue, send it via qemu's
  internal network bridge, push it back onto the queue and notify the
  guest side that it has been consumed

  pop() returns a struct iovec which can be passed directly to
  writev() to inject the packet into the host's networking stack via
  a tun/tap device

  pop()  == give me a buffer
  push() == I'm done with this buffer

---

we're done with the high level virtio transport abstraction; let's
look at the only current implementation of this abstraction, the
virtio ring

a ring buffer is basically a circular list of buffer descriptors; the
producing side adds buffer descriptors to the ring while the
consuming side removes them.
Both sides can proceed in parallel without locking the ring

a virtio buffer queue actually has two rings - the guest is the
producer for the "avail" ring, while the host is the producer for the
"used" ring

the queue can hold a fixed number of buffer descriptors and occupies
a number of pages depending on the ring size:

  entries | pages
  --------+------
      128 |     2
      256 |     3
      512 |     5
     1024 |     8
     2048 |    15

so, what you have is:

 - an array of buffer descriptors - address, length, flags and next,
   plus two flags: "buffer is writeable" and "next is valid"

 - the "available" ring - a circular array of indices into the buffer
   descriptor array; idx gives you the current position of the
   producer in the ring, and flags can be used to tell the host not
   to interrupt the guest when buffers have been consumed (i.e. when
   they are added to the "used" ring)

 - the "used" ring - again a circular array of indices, but each
   entry also has the number of bytes actually consumed; idx and
   flags are similar, except that the NO_NOTIFY flag tells the guest
   not to notify the host as new buffers are added to the avail ring

vring_add_buf()
 - we keep track of free descriptors
 - notice that we add to the avail ring, but don't update avail->idx

vring_kick()
 - here is where we publish the avail->idx update so the host can
   see it
 - unless we've been told not to, we notify the host

vring_get_buf()
 - first check whether used->idx has changed
 - fetch the descriptor table index from used->ring, and the length
   of the buffer used
 - fetch the data associated with the head
 - detach_buf() detaches the chain of descriptors and returns them to
   the free list
 - finally, keep track of our position in used->ring

quick look at enable_cb() and disable_cb()

---

finally, let's look at the qemu code on the host side:

  in virtio.h, the same definition of the ring layout, but with
  different structure names

  virtqueue_init() takes a pointer to the ring pages and figures out
  the location of the descriptor table, avail ring and used ring

  virtqueue_size() figures out how much memory is needed for a ring
with a given number of entries

  virtqueue_pop() reads an entry from the avail queue:

   - sanity checks the number of available entries; N.B. the
     (uint16_t) cast
   - pulls the next head - i.e. an index into the descriptor table
     that begins a chain of buffers
   - follows the chain, initializing the appropriate struct iovec as
     we go
   - checks the chain isn't too long
   - keeps track of the buffer head for push(), and also that there
     is a buffer "in flight" so that we can know not to
     notify-on-empty

   FIXME: the wmb() in next_desc() looks bogus

  virtqueue_push() simply adds the buffer head to the next entry in
  the used ring and increments the index so the guest can see it

   FIXME: the wmb() is a no-op - should it be?

GSO
===

In virtnet_probe(), go through the flags we set:

 - NETIF_F_HIGHDMA - the buffers can be allocated in highmem
 - NETIF_F_HW_CSUM - we can do partial checksums
 - NETIF_F_SG - the packet data does not need to be in a linear
   buffer
 - NETIF_F_FRAGLIST - same, except a slightly different scheme
   involving chained skbs
 - NETIF_F_TSO etc. - the host can do packet segmentation

virtio_net_hdr is initialized in xmit_skb()

receive_skb():
 - note that csum_start and csum_offset are relative to the start of
   the ethernet frame, but eth_type_trans() discards the header

look at try_fill_recv() - we need to allocate larger buffers for GSO
packets

qemu side:

 - in tap_open() we pass IFF_VNET_HDR to TUNSETIFF, meaning "frames
   should be prefixed with a virtio_net_hdr"

 - has_vnet_hdr says "the tapfd has IFF_VNET_HDR enabled";
   using_vnet_hdr means "vlan clients will supply a virtio_net_hdr"

 - in virtio_net_get_features() we expose GSO support if tun/tap has
   support for it

 - in virtio_net_set_features() we call tap_set_offload() to set the
   appropriate offload flags with TUNSETOFFLOAD; this controls what
   the host networking stack will pass down to tun/tap - so we only
   enable what the guest can support

 - okay, in flush_tx() we drop the header before sending if there
   isn't any support for virtio_net_hdr; tap_receive_iov() is called,
   which just calls writev()

 - receive: virtio_net_can_receive() is used to see if the guest can
   take any packets; tap_send() is called when the fd is readable -
   we copy into a buffer, skip over the header if needed, and pass
   the packet to virtio_net_receive()

 - there we pop() a buffer chain, copy the virtio_net_hdr in if we
   need to, and then copy the data into the buffers

 - note: data is copied twice on the receive side, once on the
   transmit side

 - work_around_broken_dhclient() is needed because some guests can't
   handle partial checksums in DHCP replies