Discussion:
[dpdk-dev] [PATCH 00/21] Vhost-user: Implement device IOTLB support
Maxime Coquelin
2017-08-31 09:50:02 UTC
This first non-RFC revision, which targets v17.11, adds support for the
VIRTIO_F_IOMMU_PLATFORM feature by implementing a device IOTLB in the
vhost-user backend. It improves guest safety by making it possible to
isolate the Virtio device.

It makes it possible to use the Virtio PMD in the guest with the VFIO
driver, without setting the enable_unsafe_noiommu_mode parameter, so that
the DPDK application in the guest can only access the memory it has been
allowed to, preventing a malicious or buggy DPDK application in the guest
from making the vhost-user backend write to random guest memory. Note that
the Virtio-net kernel driver also supports IOMMU.

The series depends on Qemu's "vhost-user: Specify and implement
device IOTLB support" series [0], available upstream and part of the
Qemu v2.10 release.

Performance-wise, even if this series still has room for optimization,
no performance degradation is noticed with static mappings (i.e. DPDK
in the guest) with the PVP benchmark:
Traffic Generator: Moongen (lua-trafficgen)
Acceptable Loss: 0.005%
Validation run time: 1 min
Guest DPDK version/commit: v17.05
QEMU version/commit: master (6db174aed1fd)
Virtio features: default
CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
NIC: 2 x X710
Page size: 1G host/1G guest
Results (bidirectional, total of the two flows):
- base: 18.8Mpps
- base + IOTLB series, IOMMU OFF: 18.8Mpps
- base + IOTLB series, IOMMU ON: 18.8Mpps (14.5Mpps w/o PATCH 21/21)

This is explained by the fact that IOTLB misses, which are very costly,
only happen at startup time. Indeed, once used, the buffers are not
invalidated, so if the IOTLB cache is large enough, there will be only
cache hits. Also, the use of 1G huge pages reduces the IOTLB cache
lookup time by reducing the number of entries.
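
As a rough illustration of why cache size and page size matter, here is a
minimal, hypothetical IOTLB cache lookup. The structure and function names
are invented for this sketch and do not come from librte_vhost; the real
cache is more elaborate:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified IOTLB entry: maps a guest IOVA range to a
 * backend virtual address. 1G pages mean fewer, larger entries. */
struct iotlb_entry {
	uint64_t iova;  /* start of guest I/O virtual address range */
	uint64_t size;  /* range size */
	uint64_t uaddr; /* corresponding backend virtual address */
	uint8_t  perm;  /* access permission bits (e.g. 1 = RO, 3 = RW) */
};

/* Linear cache lookup: returns the translated address on a hit, or 0 on
 * a miss. A miss would trigger a costly IOTLB miss request to the
 * master, which is why a hits-only steady state keeps performance flat. */
static uint64_t
iotlb_cache_find(const struct iotlb_entry *cache, size_t n,
		 uint64_t iova, uint8_t perm)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (iova >= cache[i].iova &&
		    iova < cache[i].iova + cache[i].size &&
		    (cache[i].perm & perm) == perm)
			return cache[i].uaddr + (iova - cache[i].iova);
	}
	return 0; /* miss: caller must request an IOTLB update */
}
```

Fewer, larger entries (1G pages) shorten the scan; many 4K or 2M entries
lengthen it, which matches the degradation observed below.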

With 2M hugepages, a performance degradation is seen with IOMMU on:
Traffic Generator: Moongen (lua-trafficgen)
Acceptable Loss: 0.005%
Validation run time: 1 min
Guest DPDK version/commit: v17.05
QEMU version/commit: master (6db174aed1fd)
Virtio features: default
CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
NIC: 2 x X710
Page size: 2M host/2M guest
Results (bidirectional, total of the two flows):
- base: 18.8Mpps
- base + IOTLB series, IOMMU OFF: 18.8Mpps
- base + IOTLB series, IOMMU ON: 13.5Mpps (12.4Mpps w/o PATCH 21/21)

A possible improvement would be to merge contiguous IOTLB entries sharing
the same permissions. A very rough patch implementing this idea fixes
the performance degradation (18.8Mpps), but the work required to clean
it up would delay this series beyond v17.11.
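
The merging idea can be sketched as follows. This is a hypothetical
simplification, not the rough patch mentioned above; the entry layout is
invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IOTLB entry, as in the earlier sketch. */
struct iotlb_entry {
	uint64_t iova;
	uint64_t size;
	uint64_t uaddr;
	uint8_t  perm;
};

/* If a new entry is contiguous with an existing one on both the IOVA
 * and backend-address sides and shares the same permissions, extend the
 * existing entry in place instead of adding a new one. Fewer entries
 * mean faster lookups. Returns 1 if merged, 0 otherwise. */
static int
iotlb_try_merge(struct iotlb_entry *e, const struct iotlb_entry *add)
{
	if (e->perm != add->perm)
		return 0;

	if (add->iova == e->iova + e->size &&
	    add->uaddr == e->uaddr + e->size) {
		e->size += add->size; /* contiguous: grow in place */
		return 1;
	}
	return 0;
}
```

With 2M pages, many neighbouring mappings are contiguous in both spaces,
so merging could collapse them back into a few large entries, which is
consistent with the rough patch recovering 18.8Mpps.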

With dynamic mappings (i.e. the Virtio-net kernel driver), this is another
story. The performance is so poor it makes it almost unusable. Indeed,
since the kernel driver unmaps the buffers as soon as they are handled,
almost all descriptor buffer address translations result in an IOTLB
miss. There is not much that can be done on the DPDK side, except maybe
batching IOTLB miss requests so as not to break bursts, but that would
require a big rework. In Qemu, we may consider enabling IOMMU MAP
notifications, so that DPDK receives IOTLB updates without having to send
IOTLB miss requests.

Regarding the design choices:
- I initially intended to use the userspace RCU library[1] for the cache
implementation, but it would have added an external dependency, and the
lib is not available in all distros. Qemu, for example, got rid of this
dependency by copying parts of the userspace RCU lib into the Qemu tree,
but this is not possible for DPDK due to licensing issues (the RCU lib is
LGPL v2). Thanks to Jason's advice, I implemented the cache using
read/write locks.
- I initially implemented a per-device IOTLB cache, but the concurrent
accesses on the IOTLB lock had a huge impact on performance (~-40% in
bidirectional; expect even worse with multiqueue). I moved to a per-
virtqueue IOTLB design, which avoids this contention.
- The slave IOTLB miss request supports the reply-ack feature of the spec,
but this version doesn't block or busy-wait for the corresponding update,
so that other queues sharing the same lcore can be processed in the meantime.

For those who would like to test the series, I made it available on
gitlab[2] (vhost_user_iotlb_v1 tag). The guest kernel command line requires
the intel_iommu=on parameter, and the guest should be started with an
IOMMU device attached to the virtio-net device. For example:

./qemu-system-x86_64 \
-enable-kvm -m 4096 -smp 2 \
-M q35,kernel-irqchip=split \
-cpu host \
-device intel-iommu,device-iotlb=on,intremap \
-device ioh3420,id=root.1,chassis=1 \
-chardev socket,id=char0,path=/tmp/vhost-user1 \
-netdev type=vhost-user,id=hn2,chardev=char0 \
-device virtio-net-pci,netdev=hn2,id=v0,mq=off,mac=$MAC,bus=root.1,disable-modern=off,disable-legacy=on,iommu_platform=on,ats=on \
...

[0]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg00520.html
[1]: http://liburcu.org/
[2]: https://gitlab.com/mcoquelin/dpdk-next-virtio/commits/vhost_user_iotlb_v1

Changes since RFC:
==================
- Fix memory leak in error path reported by Jens
- Rework the wait for IOTLB update by stopping the burst to let other
queues be processed, if any. It implies the introduction of an
iotlb_pending_list, so that IOTLB miss requests aren't sent multiple
times for the same address.
- Optimize IOTLB lock usage to recover the same performance as with IOMMU off
- Fix device locking issue in the rte_vhost_dequeue_burst() error path
- Change virtio_dev_rx error handling for consistency with mergeable rx,
and to ease returning in case of IOTLB misses.
- Fix checkpatch warnings reported by ***@dpdk.org
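
The iotlb_pending_list idea above can be sketched as follows. This is a
hypothetical simplification with invented names, not the actual
librte_vhost code; a fixed-size array stands in for the real list:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PENDING_MAX 8

/* Hypothetical record of (iova, perm) miss requests already in flight,
 * so the same request is not sent to the master several times while
 * waiting for the corresponding IOTLB update. */
struct iotlb_pending {
	uint64_t iova[PENDING_MAX];
	uint8_t  perm[PENDING_MAX];
	int      count;
};

/* Returns 1 if a miss request must be sent (and records it), or 0 if
 * one is already outstanding for this address and permission. */
static int
iotlb_miss_should_send(struct iotlb_pending *p, uint64_t iova, uint8_t perm)
{
	int i;

	for (i = 0; i < p->count; i++)
		if (p->iova[i] == iova && p->perm[i] == perm)
			return 0; /* already pending: don't resend */

	if (p->count < PENDING_MAX) {
		p->iova[p->count] = iova;
		p->perm[p->count] = perm;
		p->count++;
	}
	return 1;
}
```

On receiving the IOTLB update, the corresponding pending entry would be
removed, letting future misses on that address request again if needed.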

Maxime Coquelin (21):
Revert "vhost: workaround MQ fails to startup"
vhost: make error handling consistent in rx path
vhost: protect virtio_net device struct
vhost: prepare send_vhost_message() to slave requests
vhost: add support to slave requests channel
vhost: declare missing IOMMU-related definitions for old kernels
vhost: add iotlb helper functions
vhost: iotlb: add pending miss request list and helpers
vhost-user: add support to IOTLB miss slave requests
vhost: initialize vrings IOTLB caches
vhost-user: handle IOTLB update and invalidate requests
vhost: introduce guest IOVA to backend VA helper
vhost: use the guest IOVA to host VA helper
vhost: enable rings at the right time
vhost: don't dereference invalid dev pointer after its reallocation
vhost: postpone rings addresses translation
vhost-user: translate ring addresses when IOMMU enabled
vhost-user: iommu: postpone device creation until rings are mapped
vhost: iommu: Invalidate vring in case of matching IOTLB invalidate
vhost: enable IOMMU support
vhost: iotlb: reduce iotlb read lock usage

lib/librte_vhost/Makefile | 4 +-
lib/librte_vhost/iotlb.c | 315 +++++++++++++++++++++++++++++++++++
lib/librte_vhost/iotlb.h | 64 +++++++
lib/librte_vhost/vhost.c | 340 ++++++++++++++++++++++++++++++++------
lib/librte_vhost/vhost.h | 53 +++++-
lib/librte_vhost/vhost_user.c | 376 ++++++++++++++++++++++++++++++++----------
lib/librte_vhost/vhost_user.h | 20 ++-
lib/librte_vhost/virtio_net.c | 129 +++++++++++----
8 files changed, 1130 insertions(+), 171 deletions(-)
create mode 100644 lib/librte_vhost/iotlb.c
create mode 100644 lib/librte_vhost/iotlb.h
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:03 UTC
This reverts commit 04d81227960b5c1cf2f11f492100979ead20c526.

As agreed when this workaround was introduced, it can be reverted
now that Qemu v2.10, which fixes the issue, is out.

The reply-ack feature is required for vhost-user IOMMU support.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost_user.h | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_user.h
index 35ebd7190..2ba22dbb0 100644
--- a/lib/librte_vhost/vhost_user.h
+++ b/lib/librte_vhost/vhost_user.h
@@ -49,14 +49,10 @@
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
#define VHOST_USER_PROTOCOL_F_NET_MTU 4

-/*
- * disable REPLY_ACK feature to workaround the buggy QEMU implementation.
- * Proved buggy QEMU includes v2.7 - v2.9.
- */
#define VHOST_USER_PROTOCOL_FEATURES ((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
(1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD) |\
(1ULL << VHOST_USER_PROTOCOL_F_RARP) | \
- (0ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK) | \
+ (1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK) | \
(1ULL << VHOST_USER_PROTOCOL_F_NET_MTU))

typedef enum VhostUserRequest {
--
2.13.3
Yuanhan Liu
2017-09-07 11:54:48 UTC
Post by Maxime Coquelin
This reverts commit 04d81227960b5c1cf2f11f492100979ead20c526.
As agreed when this workaround was introduced, it can be reverted
as Qemu v2.10 that fixes the issue is now out.
First of all, I'm very sorry for being so late on review!

I don't quite remember that we came to an agreement on that. I think my
old concern still stands: how do you handle a combo like DPDK v17.11
and QEMU v2.7-2.9 (assuming we take this patch in DPDK v17.11)?

Note that it's not a trivial bug, it's a fatal bug that freezes QEMU,
IIRC.

--yliu
Post by Maxime Coquelin
The reply-ack feature is required for vhost-user IOMMU support.
Maxime Coquelin
2017-09-07 12:59:59 UTC
Hi Yuanhan,
Post by Yuanhan Liu
Post by Maxime Coquelin
This reverts commit 04d81227960b5c1cf2f11f492100979ead20c526.
As agreed when this workaround was introduced, it can be reverted
as Qemu v2.10 that fixes the issue is now out.
First of all, I'm very sorry for being so late on review!
I didn't quite remember we have come an agreement on that.
I think my
old concern still counts: how do you handle the combo like DPDK v17.11
and QEMU v2.7-2.9 (assume we take this patch in DPDK v17.11) ?
Note that it's not a trivial bug, it's a fatal bug that freezes QEMU,
IIRC
IIRC, I also raised the concern initially, but after having listed the
possible solutions, I thought we agreed there was no trivial way to
detect the bug in Qemu, and so we would re-enable the feature once
upstream Qemu got fixed.

Note that we're talking about upstream here, which should not be used in
production. Distros' Qemu v2.7-v2.9 packages should be fixed, as RHEL-7.4's has been.

Any thoughts?

Maxime
Post by Yuanhan Liu
--yliu
Post by Maxime Coquelin
The reply-ack feature is required for vhost-user IOMMU support.
Maxime Coquelin
2017-09-24 10:41:59 UTC
Post by Maxime Coquelin
Hi Yuanhan,
Post by Yuanhan Liu
Post by Maxime Coquelin
This reverts commit 04d81227960b5c1cf2f11f492100979ead20c526.
As agreed when this workaround was introduced, it can be reverted
as Qemu v2.10 that fixes the issue is now out.
First of all, I'm very sorry for being so late on review!
I didn't quite remember we have come an agreement on that.
I think my
old concern still counts: how do you handle the combo like DPDK v17.11
and QEMU v2.7-2.9 (assume we take this patch in DPDK v17.11) ?
Note that it's not a trivial bug, it's a fatal bug that freezes QEMU,
IIRC
IIRC, I also raised the concern initially but after having listed the
possible solutions, I thought we agreed there were no trivial way to
detect the bug in Qemu and so we would re-enable it once upstream Qemu
got fixed.
Note that we're talking about upstream here, which should not be used in
production. Distro's Qemu v2.7/v2.9 should be fixed, as RHEL-7.4 has been.
Any thougths?
Actually, stable v2.9.1 contains the fix, so only v2.7 and v2.8 are
impacted.

This is good news; however, it prevents detecting vulnerable Qemu
versions based on the new feature bits introduced in v2.10.

Maxime
Post by Maxime Coquelin
Maxime
Post by Yuanhan Liu
    --yliu
Post by Maxime Coquelin
The reply-ack feature is required for vhost-user IOMMU support.
Maxime Coquelin
2017-08-31 09:50:04 UTC
In the non-mergeable receive case, when the copy_mbuf_to_desc()
call fails, the packet is skipped, the corresponding used element's
len field is set to the vnet header size, and processing continues with
the next packet/descriptor. This could be a problem, because we don't
know why the copy failed, and we assume the descriptor buffer is large
enough.

In the mergeable receive case, when copy_mbuf_to_desc_mergeable()
fails, the packets burst is simply stopped.

This patch makes the non-mergeable error path behave like the
mergeable one, as it seems the safest way. Doing so will also
simplify the handling of pending IOTLB miss requests.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/virtio_net.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index a5f0eebaa..b889aa0b7 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -320,11 +320,8 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,

err = copy_mbuf_to_desc(dev, descs, pkts[i], desc_idx, sz);
if (unlikely(err)) {
- used_idx = (start_idx + i) & (vq->size - 1);
- vq->used->ring[used_idx].len = dev->vhost_hlen;
- vhost_log_used_vring(dev, vq,
- offsetof(struct vring_used, ring[used_idx]),
- sizeof(vq->used->ring[used_idx]));
+ count = i;
+ break;
}

if (i + 1 < count)
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:05 UTC
The virtio_net device might be accessed while being reallocated
in case of NUMA awareness. This case might be theoretical,
but the protection will be needed anyway to guard vring pages
against invalidation.

The virtio_net devices are now protected with a readers/writer
lock, so that before reallocating a device, it is ensured
that it is not being referenced by the processing threads.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.c | 237 +++++++++++++++++++++++++++++++++++-------
lib/librte_vhost/vhost.h | 3 +-
lib/librte_vhost/vhost_user.c | 73 +++++--------
lib/librte_vhost/virtio_net.c | 17 ++-
4 files changed, 240 insertions(+), 90 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 0b6aa1cc4..429983858 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -46,16 +46,25 @@
#include <rte_string_fns.h>
#include <rte_memory.h>
#include <rte_malloc.h>
+#include <rte_rwlock.h>
#include <rte_vhost.h>

#include "vhost.h"

-struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
+struct vhost_device {
+ struct virtio_net *dev;
+ rte_rwlock_t lock;
+};

-struct virtio_net *
-get_device(int vid)
+/* Declared as static so that .lock is initialized */
+static struct vhost_device vhost_devices[MAX_VHOST_DEVICE];
+
+static inline struct virtio_net *
+__get_device(int vid)
{
- struct virtio_net *dev = vhost_devices[vid];
+ struct virtio_net *dev;
+
+ dev = vhost_devices[vid].dev;

if (unlikely(!dev)) {
RTE_LOG(ERR, VHOST_CONFIG,
@@ -65,6 +74,83 @@ get_device(int vid)
return dev;
}

+struct virtio_net *
+get_device(int vid)
+{
+ struct virtio_net *dev;
+
+ rte_rwlock_read_lock(&vhost_devices[vid].lock);
+
+ dev = __get_device(vid);
+ if (unlikely(!dev))
+ rte_rwlock_read_unlock(&vhost_devices[vid].lock);
+
+ return dev;
+}
+
+void
+put_device(int vid)
+{
+ rte_rwlock_read_unlock(&vhost_devices[vid].lock);
+}
+
+static struct virtio_net *
+get_device_wr(int vid)
+{
+ struct virtio_net *dev;
+
+ rte_rwlock_write_lock(&vhost_devices[vid].lock);
+
+ dev = __get_device(vid);
+ if (unlikely(!dev))
+ rte_rwlock_write_unlock(&vhost_devices[vid].lock);
+
+ return dev;
+}
+
+static void
+put_device_wr(int vid)
+{
+ rte_rwlock_write_unlock(&vhost_devices[vid].lock);
+}
+
+int
+realloc_device(int vid, int vq_index, int node)
+{
+ struct virtio_net *dev, *old_dev;
+ struct vhost_virtqueue *vq;
+
+ dev = rte_malloc_socket(NULL, sizeof(*dev), 0, node);
+ if (!dev)
+ return -1;
+
+ vq = rte_malloc_socket(NULL, sizeof(*vq), 0, node);
+ if (!vq) {
+ rte_free(dev);
+ return -1;
+ }
+
+ old_dev = get_device_wr(vid);
+ if (!old_dev) {
+ rte_free(vq);
+ rte_free(dev);
+ return -1;
+ }
+
+ memcpy(dev, old_dev, sizeof(*dev));
+ memcpy(vq, old_dev->virtqueue[vq_index], sizeof(*vq));
+ dev->virtqueue[vq_index] = vq;
+
+ rte_free(old_dev->virtqueue[vq_index]);
+ rte_free(old_dev);
+
+ vhost_devices[vid].dev = dev;
+
+ put_device_wr(vid);
+
+ return 0;
+}
+
static void
cleanup_vq(struct vhost_virtqueue *vq, int destroy)
{
@@ -195,7 +281,7 @@ vhost_new_device(void)
}

for (i = 0; i < MAX_VHOST_DEVICE; i++) {
- if (vhost_devices[i] == NULL)
+ if (vhost_devices[i].dev == NULL)
break;
}
if (i == MAX_VHOST_DEVICE) {
@@ -205,8 +291,10 @@ vhost_new_device(void)
return -1;
}

- vhost_devices[i] = dev;
+ rte_rwlock_write_lock(&vhost_devices[i].lock);
+ vhost_devices[i].dev = dev;
dev->vid = i;
+ rte_rwlock_write_unlock(&vhost_devices[i].lock);

return i;
}
@@ -228,10 +316,15 @@ vhost_destroy_device(int vid)
dev->notify_ops->destroy_device(vid);
}

+ put_device(vid);
+ dev = get_device_wr(vid);
+
cleanup_device(dev, 1);
free_device(dev);

- vhost_devices[vid] = NULL;
+ vhost_devices[vid].dev = NULL;
+
+ put_device_wr(vid);
}

void
@@ -249,6 +342,8 @@ vhost_set_ifname(int vid, const char *if_name, unsigned int if_len)

strncpy(dev->ifname, if_name, len);
dev->ifname[sizeof(dev->ifname) - 1] = '\0';
+
+ put_device(vid);
}

void
@@ -260,25 +355,35 @@ vhost_enable_dequeue_zero_copy(int vid)
return;

dev->dequeue_zero_copy = 1;
+
+ put_device(vid);
}

int
rte_vhost_get_mtu(int vid, uint16_t *mtu)
{
struct virtio_net *dev = get_device(vid);
+ int ret = 0;

if (!dev)
return -ENODEV;

- if (!(dev->flags & VIRTIO_DEV_READY))
- return -EAGAIN;
+ if (!(dev->flags & VIRTIO_DEV_READY)) {
+ ret = -EAGAIN;
+ goto out_put;
+ }

- if (!(dev->features & (1ULL << VIRTIO_NET_F_MTU)))
- return -ENOTSUP;
+ if (!(dev->features & (1ULL << VIRTIO_NET_F_MTU))) {
+ ret = -ENOTSUP;
+ goto out_put;
+ }

*mtu = dev->mtu;

- return 0;
+out_put:
+ put_device(vid);
+
+ return ret;
}

int
@@ -298,9 +403,11 @@ rte_vhost_get_numa_node(int vid)
RTE_LOG(ERR, VHOST_CONFIG,
"(%d) failed to query numa node: %s\n",
vid, rte_strerror(errno));
- return -1;
+ numa_node = -1;
}

+ put_device(vid);
+
return numa_node;
#else
RTE_SET_USED(vid);
@@ -312,22 +419,32 @@ uint32_t
rte_vhost_get_queue_num(int vid)
{
struct virtio_net *dev = get_device(vid);
+ uint32_t queue_num;

if (dev == NULL)
return 0;

- return dev->nr_vring / 2;
+ queue_num = dev->nr_vring / 2;
+
+ put_device(vid);
+
+ return queue_num;
}

uint16_t
rte_vhost_get_vring_num(int vid)
{
struct virtio_net *dev = get_device(vid);
+ uint16_t vring_num;

if (dev == NULL)
return 0;

- return dev->nr_vring;
+ vring_num = dev->nr_vring;
+
+ put_device(vid);
+
+ return vring_num;
}

int
@@ -343,6 +460,8 @@ rte_vhost_get_ifname(int vid, char *buf, size_t len)
strncpy(buf, dev->ifname, len);
buf[len - 1] = '\0';

+ put_device(vid);
+
return 0;
}

@@ -356,6 +475,9 @@ rte_vhost_get_negotiated_features(int vid, uint64_t *features)
return -1;

*features = dev->features;
+
+ put_device(vid);
+
return 0;
}

@@ -365,6 +487,7 @@ rte_vhost_get_mem_table(int vid, struct rte_vhost_memory **mem)
struct virtio_net *dev;
struct rte_vhost_memory *m;
size_t size;
+ int ret = 0;

dev = get_device(vid);
if (!dev)
@@ -372,14 +495,19 @@ rte_vhost_get_mem_table(int vid, struct rte_vhost_memory **mem)

size = dev->mem->nregions * sizeof(struct rte_vhost_mem_region);
m = malloc(sizeof(struct rte_vhost_memory) + size);
- if (!m)
- return -1;
+ if (!m) {
+ ret = -1;
+ goto out;
+ }

m->nregions = dev->mem->nregions;
memcpy(m->regions, dev->mem->regions, size);
*mem = m;

- return 0;
+out:
+ put_device(vid);
+
+ return ret;
}

int
@@ -388,17 +516,22 @@ rte_vhost_get_vhost_vring(int vid, uint16_t vring_idx,
{
struct virtio_net *dev;
struct vhost_virtqueue *vq;
+ int ret = 0;

dev = get_device(vid);
if (!dev)
return -1;

- if (vring_idx >= VHOST_MAX_VRING)
- return -1;
+ if (vring_idx >= VHOST_MAX_VRING) {
+ ret = -1;
+ goto out;
+ }

vq = dev->virtqueue[vring_idx];
- if (!vq)
- return -1;
+ if (!vq) {
+ ret = -1;
+ goto out;
+ }

vring->desc = vq->desc;
vring->avail = vq->avail;
@@ -409,7 +542,10 @@ rte_vhost_get_vhost_vring(int vid, uint16_t vring_idx,
vring->kickfd = vq->kickfd;
vring->size = vq->size;

- return 0;
+out:
+ put_device(vid);
+
+ return ret;
}

uint16_t
@@ -417,6 +553,7 @@ rte_vhost_avail_entries(int vid, uint16_t queue_id)
{
struct virtio_net *dev;
struct vhost_virtqueue *vq;
+ uint16_t avail_entries = 0;

dev = get_device(vid);
if (!dev)
@@ -424,15 +561,23 @@ rte_vhost_avail_entries(int vid, uint16_t queue_id)

vq = dev->virtqueue[queue_id];
if (!vq->enabled)
- return 0;
+ goto out;

- return *(volatile uint16_t *)&vq->avail->idx - vq->last_used_idx;
+
+ avail_entries = *(volatile uint16_t *)&vq->avail->idx;
+ avail_entries -= vq->last_used_idx;
+
+out:
+ put_device(vid);
+
+ return avail_entries;
}

int
rte_vhost_enable_guest_notification(int vid, uint16_t queue_id, int enable)
{
struct virtio_net *dev = get_device(vid);
+ int ret = 0;

if (dev == NULL)
return -1;
@@ -440,11 +585,16 @@ rte_vhost_enable_guest_notification(int vid, uint16_t queue_id, int enable)
if (enable) {
RTE_LOG(ERR, VHOST_CONFIG,
"guest notification isn't supported.\n");
- return -1;
+ ret = -1;
+ goto out;
}

dev->virtqueue[queue_id]->used->flags = VRING_USED_F_NO_NOTIFY;
- return 0;
+
+out:
+ put_device(vid);
+
+ return ret;
}

void
@@ -456,6 +606,8 @@ rte_vhost_log_write(int vid, uint64_t addr, uint64_t len)
return;

vhost_log_write(dev, addr, len);
+
+ put_device(vid);
}

void
@@ -470,12 +622,15 @@ rte_vhost_log_used_vring(int vid, uint16_t vring_idx,
return;

if (vring_idx >= VHOST_MAX_VRING)
- return;
+ goto out;
vq = dev->virtqueue[vring_idx];
if (!vq)
- return;
+ goto out;

vhost_log_used_vring(dev, vq, offset, len);
+
+out:
+ put_device(vid);
}

uint32_t
@@ -483,6 +638,7 @@ rte_vhost_rx_queue_count(int vid, uint16_t qid)
{
struct virtio_net *dev;
struct vhost_virtqueue *vq;
+ uint32_t queue_count;

dev = get_device(vid);
if (dev == NULL)
@@ -491,15 +647,26 @@ rte_vhost_rx_queue_count(int vid, uint16_t qid)
if (unlikely(qid >= dev->nr_vring || (qid & 1) == 0)) {
RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue idx %d.\n",
dev->vid, __func__, qid);
- return 0;
+ queue_count = 0;
+ goto out;
}

vq = dev->virtqueue[qid];
- if (vq == NULL)
- return 0;
+ if (vq == NULL) {
+ queue_count = 0;
+ goto out;
+ }

- if (unlikely(vq->enabled == 0 || vq->avail == NULL))
- return 0;
+ if (unlikely(vq->enabled == 0 || vq->avail == NULL)) {
+ queue_count = 0;
+ goto out;
+ }
+
+ queue_count = *((volatile uint16_t *)&vq->avail->idx);
+ queue_count -= vq->last_avail_idx;
+
+out:
+ put_device(vid);

- return *((volatile uint16_t *)&vq->avail->idx) - vq->last_avail_idx;
+ return queue_count;
}
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 0f294f395..18ad69c85 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -269,7 +269,6 @@ vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,

extern uint64_t VHOST_FEATURES;
#define MAX_VHOST_DEVICE 1024
-extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];

/* Convert guest physical address to host physical address */
static __rte_always_inline phys_addr_t
@@ -292,6 +291,8 @@ gpa_to_hpa(struct virtio_net *dev, uint64_t gpa, uint64_t size)
}

struct virtio_net *get_device(int vid);
+void put_device(int vid);
+int realloc_device(int vid, int vq_index, int node);

int vhost_new_device(void);
void cleanup_device(struct virtio_net *dev, int destroy);
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index ad2e8d380..5b3b8812a 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -241,62 +241,31 @@ vhost_user_set_vring_num(struct virtio_net *dev,
static struct virtio_net*
numa_realloc(struct virtio_net *dev, int index)
{
- int oldnode, newnode;
- struct virtio_net *old_dev;
- struct vhost_virtqueue *old_vq, *vq;
- int ret;
+ int oldnode, newnode, vid, ret;

- old_dev = dev;
- vq = old_vq = dev->virtqueue[index];
+ vid = dev->vid;

- ret = get_mempolicy(&newnode, NULL, 0, old_vq->desc,
+ ret = get_mempolicy(&newnode, NULL, 0, dev->virtqueue[index]->desc,
MPOL_F_NODE | MPOL_F_ADDR);

/* check if we need to reallocate vq */
- ret |= get_mempolicy(&oldnode, NULL, 0, old_vq,
+ ret |= get_mempolicy(&oldnode, NULL, 0, dev->virtqueue[index],
MPOL_F_NODE | MPOL_F_ADDR);
if (ret) {
RTE_LOG(ERR, VHOST_CONFIG,
"Unable to get vq numa information.\n");
return dev;
}
- if (oldnode != newnode) {
- RTE_LOG(INFO, VHOST_CONFIG,
- "reallocate vq from %d to %d node\n", oldnode, newnode);
- vq = rte_malloc_socket(NULL, sizeof(*vq), 0, newnode);
- if (!vq)
- return dev;
-
- memcpy(vq, old_vq, sizeof(*vq));
- rte_free(old_vq);
- }

- /* check if we need to reallocate dev */
- ret = get_mempolicy(&oldnode, NULL, 0, old_dev,
- MPOL_F_NODE | MPOL_F_ADDR);
- if (ret) {
- RTE_LOG(ERR, VHOST_CONFIG,
- "Unable to get dev numa information.\n");
- goto out;
- }
if (oldnode != newnode) {
RTE_LOG(INFO, VHOST_CONFIG,
- "reallocate dev from %d to %d node\n",
- oldnode, newnode);
- dev = rte_malloc_socket(NULL, sizeof(*dev), 0, newnode);
- if (!dev) {
- dev = old_dev;
- goto out;
- }
-
- memcpy(dev, old_dev, sizeof(*dev));
- rte_free(old_dev);
+ "reallocate vq from %d to %d node\n", oldnode, newnode);
+ put_device(vid);
+ if (realloc_device(vid, index, newnode))
+ RTE_LOG(ERR, VHOST_CONFIG, "Failed to realloc device\n");
+ dev = get_device(vid);
}

-out:
- dev->virtqueue[index] = vq;
- vhost_devices[dev->vid] = dev;
-
return dev;
}
#else
@@ -336,9 +305,10 @@ qva_to_vva(struct virtio_net *dev, uint64_t qva)
* This function then converts these to our address space.
*/
static int
-vhost_user_set_vring_addr(struct virtio_net *dev, VhostUserMsg *msg)
+vhost_user_set_vring_addr(struct virtio_net **pdev, VhostUserMsg *msg)
{
struct vhost_virtqueue *vq;
+ struct virtio_net *dev = *pdev;

if (dev->mem == NULL)
return -1;
@@ -356,7 +326,7 @@ vhost_user_set_vring_addr(struct virtio_net *dev, VhostUserMsg *msg)
return -1;
}

- dev = numa_realloc(dev, msg->payload.addr.index);
+ *pdev = dev = numa_realloc(dev, msg->payload.addr.index);
vq = dev->virtqueue[msg->payload.addr.index];

vq->avail = (struct vring_avail *)(uintptr_t)qva_to_vva(dev,
@@ -966,7 +936,7 @@ vhost_user_msg_handler(int vid, int fd)
{
struct virtio_net *dev;
struct VhostUserMsg msg;
- int ret;
+ int ret = 0;

dev = get_device(vid);
if (dev == NULL)
@@ -978,7 +948,8 @@ vhost_user_msg_handler(int vid, int fd)
RTE_LOG(ERR, VHOST_CONFIG,
"failed to get callback ops for driver %s\n",
dev->ifname);
- return -1;
+ ret = -1;
+ goto out;
}
}

@@ -994,10 +965,10 @@ vhost_user_msg_handler(int vid, int fd)
RTE_LOG(ERR, VHOST_CONFIG,
"vhost read incorrect message\n");

- return -1;
+ ret = -1;
+ goto out;
}

- ret = 0;
RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
vhost_message_str[msg.request]);

@@ -1005,7 +976,8 @@ vhost_user_msg_handler(int vid, int fd)
if (ret < 0) {
RTE_LOG(ERR, VHOST_CONFIG,
"failed to alloc queue\n");
- return -1;
+ ret = -1;
+ goto out;
}

switch (msg.request) {
@@ -1054,7 +1026,7 @@ vhost_user_msg_handler(int vid, int fd)
vhost_user_set_vring_num(dev, &msg);
break;
case VHOST_USER_SET_VRING_ADDR:
- vhost_user_set_vring_addr(dev, &msg);
+ vhost_user_set_vring_addr(&dev, &msg);
break;
case VHOST_USER_SET_VRING_BASE:
vhost_user_set_vring_base(dev, &msg);
@@ -1122,5 +1094,8 @@ vhost_user_msg_handler(int vid, int fd)
}
}

- return 0;
+out:
+ put_device(vid);
+
+ return ret;
}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index b889aa0b7..04255dc85 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -598,14 +598,19 @@ rte_vhost_enqueue_burst(int vid, uint16_t queue_id,
struct rte_mbuf **pkts, uint16_t count)
{
struct virtio_net *dev = get_device(vid);
+ int ret = 0;

if (!dev)
return 0;

if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF))
- return virtio_dev_merge_rx(dev, queue_id, pkts, count);
+ ret = virtio_dev_merge_rx(dev, queue_id, pkts, count);
else
- return virtio_dev_rx(dev, queue_id, pkts, count);
+ ret = virtio_dev_rx(dev, queue_id, pkts, count);
+
+ put_device(vid);
+
+ return ret;
}

static inline bool
@@ -1006,12 +1011,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->nr_vring))) {
RTE_LOG(ERR, VHOST_DATA, "(%d) %s: invalid virtqueue idx %d.\n",
dev->vid, __func__, queue_id);
- return 0;
+ goto out;
}

vq = dev->virtqueue[queue_id];
if (unlikely(vq->enabled == 0))
- return 0;
+ goto out;

if (unlikely(dev->dequeue_zero_copy)) {
struct zcopy_mbuf *zmbuf, *next;
@@ -1061,7 +1066,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
if (rarp_mbuf == NULL) {
RTE_LOG(ERR, VHOST_DATA,
"Failed to allocate memory for mbuf.\n");
- return 0;
+ goto out;
}

if (make_rarp_packet(rarp_mbuf, &dev->mac)) {
@@ -1180,5 +1185,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
i += 1;
}

+ put_device(vid);
+
return i;
}
--
2.13.3
Tiwei Bie
2017-09-05 04:45:17 UTC
Post by Maxime Coquelin
virtio_net device might be accessed while being reallocated
in case of NUMA awareness. This case might be theoretical,
but it will be needed anyway to protect vrings pages against
invalidation.
The virtio_net devs are now protected with a readers/writers
lock, so that before reallocating the device, it is ensured
that it is not being referenced by the processing threads.
[...]
Post by Maxime Coquelin
+struct virtio_net *
+get_device(int vid)
+{
+ struct virtio_net *dev;
+
+ rte_rwlock_read_lock(&vhost_devices[vid].lock);
+
+ dev = __get_device(vid);
+ if (unlikely(!dev))
+ rte_rwlock_read_unlock(&vhost_devices[vid].lock);
+
+ return dev;
+}
+
+void
+put_device(int vid)
+{
+ rte_rwlock_read_unlock(&vhost_devices[vid].lock);
+}
+
This patch introduced a per-device rwlock which needs to be acquired
unconditionally in the data path. So for each vhost device, the IO
threads of different queues will need to acquire/release this lock
during each enqueue and dequeue operation, which will cause cache
contention when multiple queues are enabled and handled by different
cores. With this patch alone, I saw a ~7% performance drop when enabling
6 queues in a 64-byte iofwd loopback test. Is there any way to avoid
introducing this lock to the data path?

Best regards,
Tiwei Bie
Maxime Coquelin
2017-09-05 09:24:14 UTC
Post by Tiwei Bie
Post by Maxime Coquelin
virtio_net device might be accessed while being reallocated
in case of NUMA awareness. This case might be theoretical,
but it will be needed anyway to protect vrings pages against
invalidation.
The virtio_net devs are now protected with a readers/writers
lock, so that before reallocating the device, it is ensured
that it is not being referenced by the processing threads.
[...]
Post by Maxime Coquelin
+struct virtio_net *
+get_device(int vid)
+{
+ struct virtio_net *dev;
+
+ rte_rwlock_read_lock(&vhost_devices[vid].lock);
+
+ dev = __get_device(vid);
+ if (unlikely(!dev))
+ rte_rwlock_read_unlock(&vhost_devices[vid].lock);
+
+ return dev;
+}
+
+void
+put_device(int vid)
+{
+ rte_rwlock_read_unlock(&vhost_devices[vid].lock);
+}
+
This patch introduced a per-device rwlock which needs to be acquired
unconditionally in the data path. So for each vhost device, the IO
threads of different queues will need to acquire/release this lock
during each enqueue and dequeue operation, which will cause cache
contention when multiple queues are enabled and handled by different
cores. With this patch alone, I saw ~7% performance drop when enabling
6 queues to do 64bytes iofwd loopback test. Is there any way to avoid
introducing this lock to the data path?
First, I'd like to thank you for running the MQ test.
I agree it may have a performance impact in this case.

This lock has currently two purposes:
1. Prevent referencing freed virtio_dev struct in case of numa_realloc.
2. Protect vring pages against invalidation.

For 2., it can be fixed by using the per-vq IOTLB lock (it was not the
case in my early prototypes that had per device IOTLB cache).

For 1., this is an existing problem, so we might consider it acceptable
to keep the current state. Maybe it could be improved by only
reallocating when VQ0 is not on the right NUMA node, the other VQs
not being initialized at this point.

If we do this, we might be able to get rid of this lock; I need some
more time, though, to ensure I'm not missing something.

What do you think?

Thanks,
Maxime
Post by Tiwei Bie
Best regards,
Tiwei Bie
Tiwei Bie
2017-09-05 10:07:51 UTC
Post by Maxime Coquelin
[...]
What do you think?
Cool. So it's possible that the lock in the data path will be
acquired only when the IOMMU feature is enabled. It will be
great!

Besides, I just did a very simple MQ test to verify my thoughts.
Lei (CC'ed in this mail) may do a thorough performance test for
this patch set to evaluate the performance impacts.

Best regards,
Tiwei Bie
Maxime Coquelin
2017-09-05 11:00:42 UTC
Post by Tiwei Bie
[...]
Cool. So it's possible that the lock in the data path will be
acquired only when the IOMMU feature is enabled. It will be
great!
Besides, I just did a very simple MQ test to verify my thoughts.
Lei (CC'ed in this mail) may do a thorough performance test for
this patch set to evaluate the performance impacts.
I'll try to post v2 this week including the proposed change.
Maybe it'd be better if Lei waits for the v2.

Thanks,
Maxime
Post by Tiwei Bie
Best regards,
Tiwei Bie
Tiwei Bie
2017-09-06 01:15:00 UTC
Post by Maxime Coquelin
[...]
I'll try to post v2 this week including the proposed change.
Maybe it'll be better Lei waits for the v2.
Cool. Sure. Thank you! :)

Best regards,
Tiwei Bie
Stephen Hemminger
2017-09-06 02:59:28 UTC
Post by Maxime Coquelin
1. Prevent referencing freed virtio_dev struct in case of numa_realloc.
2. Protect vring pages against invalidation.
For 2., it can be fixed by using the per-vq IOTLB lock (it was not the
case in my early prototypes that had per device IOTLB cache).
For 1., this is an existing problem, so we might consider it is
acceptable to keep current state. Maybe it could be improved by only
reallocating in case VQ0 is not on the right NUMA node, the other VQs
not being initialized at this point.
Something like RCU does a better job of protecting against freed virtio_dev.
But using RCU requires a quiescent-state callback in the main loop.
Maxime Coquelin
2017-09-06 07:50:27 UTC
Hi Stephen,
Post by Stephen Hemminger
Post by Maxime Coquelin
1. Prevent referencing freed virtio_dev struct in case of numa_realloc.
2. Protect vring pages against invalidation.
For 2., it can be fixed by using the per-vq IOTLB lock (it was not the
case in my early prototypes that had per device IOTLB cache).
For 1., this is an existing problem, so we might consider it is
acceptable to keep current state. Maybe it could be improved by only
reallocating in case VQ0 is not on the right NUMA node, the other VQs
not being initialized at this point.
Something like RCU does a better job of protecting against freed virtio_dev.
But using RCU requires quiescent callback in the main loop.
In my early prototypes, I used the urcu library for the IOTLB cache
implementation.

The problems are:
1. The lib is LGPL, so only small functions can be inlined into the
code (maybe not important, if "large functions" are only called in the
slow path).
2. It adds an external dependency, and the lib is not distributed with
every distribution (for RHEL, it is only available in the EPEL repos).

For the virtio_dev protection, my understanding might be incorrect as
this is the first time I used RCU, but the struct has one field
that the processing threads can write (broadcast_rarp).
But it may not be a problem because at worst, only broadcast_rarp
clearing would be lost, so we would broadcast RARP one more time.

Maxime
Maxime Coquelin
2017-09-06 07:15:47 UTC
Hi Tiwei,
Post by Tiwei Bie
[...]
Cool. Sure. Thank you! :)
I have made the changes; you can find the v2 in my gitlab repo:
https://gitlab.com/mcoquelin/dpdk-next-virtio/commits/vhost_iotlb_v2

I'm testing it right now, but if you'd like to run some early benchmark
before I post the series, there it is!

Thanks,
Maxime
Post by Tiwei Bie
Best regards,
Tiwei Bie
Tiwei Bie
2017-09-06 07:30:21 UTC
Post by Maxime Coquelin
Hi Tiwei,
[...]
https://gitlab.com/mcoquelin/dpdk-next-virtio/commits/vhost_iotlb_v2
I'm testing it right now, but if you'd like to run some early benchmark
before I post the series, there it is!
Got it. Thanks! :)

Best regards,
Tiwei Bie
Maxime Coquelin
2017-09-06 20:02:29 UTC
Post by Tiwei Bie
[...]
Got it. Thanks! :)
Just to let you know that I have updated my branch to remove another
regression with iommu=off, by inlining the no-IOMMU part of the
vhost_iova_to_vva call (see below for the patch, which is squashed into
my branch).

Without this, when running microbenchmarks (txonly, rxonly, ...) I
noticed a 4% perf degradation.

I think I'll have to post the series without testing PVP, because I had
to change the machine I use as a packet generator, and now I have X710
NICs that seem to be unsupported by Moongen :(.

I have been advised to use TRex instead, but I'll need some time to set
it up...

Regards,
Maxime

ps: Are you coming to Dublin?
Post by Tiwei Bie
Best regards,
Tiwei Bie
Subject: [PATCH] vhost: inline IOMMU feature check

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.c | 5 +----
lib/librte_vhost/vhost.h | 12 +++++++++++-
2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 938b3abf2..256184ac2 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -55,7 +55,7 @@

struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];

-uint64_t vhost_iova_to_vva(struct virtio_net *dev, struct
vhost_virtqueue *vq,
+uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct
vhost_virtqueue *vq,
uint64_t iova, uint64_t size, uint8_t perm)
{
uint64_t vva, tmp_size;
@@ -63,9 +63,6 @@ uint64_t vhost_iova_to_vva(struct virtio_net *dev,
struct vhost_virtqueue *vq,
if (unlikely(!size))
return 0;

- if (!(dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
- return rte_vhost_gpa_to_vva(dev->mem, iova);
-
tmp_size = size;

vva = vhost_user_iotlb_cache_find(vq, iova, &tmp_size, perm);
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 191e6c5f1..969f1108b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -355,8 +355,18 @@ struct vhost_device_ops const
*vhost_driver_callback_get(const char *path);
*/
void vhost_backend_cleanup(struct virtio_net *dev);

-uint64_t vhost_iova_to_vva(struct virtio_net *dev, struct
vhost_virtqueue *vq,
+uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct
vhost_virtqueue *vq,
uint64_t iova, uint64_t size, uint8_t perm);
+
+static __rte_always_inline uint64_t
+vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size, uint8_t perm)
+{
+ if (!(dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
+ return rte_vhost_gpa_to_vva(dev->mem, iova);
+
+ return __vhost_iova_to_vva(dev, vq, iova, size, perm);
+}
int vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq);
void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue *vq);
--
2.13.3
Tiwei Bie
2017-09-07 05:08:06 UTC
Post by Maxime Coquelin
[...]
Just to let you know that I have updated my branch to remove another
regression with iommu=off by inlining the noiommu part of
vhost_iova_to_vva call (See below for the patch, that is squashed into
my branch).
Without this, when running microbenchmarks (txonly, rxonly, ...) I
noticed a 4% perf degradation.
Nice work!

Best regards,
Tiwei Bie
Yuanhan Liu
2017-09-07 13:44:34 UTC
Post by Maxime Coquelin
virtio_net device might be accessed while being reallocated
in case of NUMA awareness.
From the data path? The data path won't be enabled until everything is
ready, which happens after numa_realloc(). Or am I missing something?

--yliu
Post by Maxime Coquelin
This case might be theoretical,
but it will be needed anyway to protect vrings pages against
invalidation.
The virtio_net devs are now protected with a readers/writers
lock, so that before reallocating the device, it is ensured
that it is not being referenced by the processing threads.
Maxime Coquelin
2017-09-07 14:01:21 UTC
Post by Yuanhan Liu
Post by Maxime Coquelin
virtio_net device might be accessed while being reallocated
in case of NUMA awareness.
From data path? data path won't be enabled until all are ready, which is
at a stage after numa_realloc(). Or, am I miss something?
Right, I just thought that Qemu could add queues after enabling the
first ones.

Anyway, I removed this patch from the v2 I'm preparing.

Maxime
Post by Yuanhan Liu
--yliu
Post by Maxime Coquelin
This case might be theoretical,
but it will be needed anyway to protect vrings pages against
invalidation.
The virtio_net devs are now protected with a readers/writers
lock, so that before reallocating the device, it is ensured
that it is not being referenced by the processing threads.
Maxime Coquelin
2017-08-31 09:50:06 UTC
send_vhost_message() is currently only used to send
replies, so it modifies the message flags to prepare the
reply.

With the upcoming channel for backend-initiated requests,
this function can also be used to send requests.

This patch introduces a new send_vhost_reply() that
performs the message flag modifications, making
send_vhost_message() generic.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost_user.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 5b3b8812a..8984dcb6a 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -877,8 +877,16 @@ read_vhost_message(int sockfd, struct VhostUserMsg *msg)
static int
send_vhost_message(int sockfd, struct VhostUserMsg *msg)
{
- int ret;
+ if (!msg)
+ return 0;
+
+ return send_fd_message(sockfd, (char *)msg,
+ VHOST_USER_HDR_SIZE + msg->size, NULL, 0);
+}

+static int
+send_vhost_reply(int sockfd, struct VhostUserMsg *msg)
+{
if (!msg)
return 0;

@@ -887,10 +895,7 @@ send_vhost_message(int sockfd, struct VhostUserMsg *msg)
msg->flags |= VHOST_USER_VERSION;
msg->flags |= VHOST_USER_REPLY_MASK;

- ret = send_fd_message(sockfd, (char *)msg,
- VHOST_USER_HDR_SIZE + msg->size, NULL, 0);
-
- return ret;
+ return send_vhost_message(sockfd, msg);
}

/*
@@ -984,7 +989,7 @@ vhost_user_msg_handler(int vid, int fd)
case VHOST_USER_GET_FEATURES:
msg.payload.u64 = vhost_user_get_features(dev);
msg.size = sizeof(msg.payload.u64);
- send_vhost_message(fd, &msg);
+ send_vhost_reply(fd, &msg);
break;
case VHOST_USER_SET_FEATURES:
vhost_user_set_features(dev, msg.payload.u64);
@@ -993,7 +998,7 @@ vhost_user_msg_handler(int vid, int fd)
case VHOST_USER_GET_PROTOCOL_FEATURES:
msg.payload.u64 = VHOST_USER_PROTOCOL_FEATURES;
msg.size = sizeof(msg.payload.u64);
- send_vhost_message(fd, &msg);
+ send_vhost_reply(fd, &msg);
break;
case VHOST_USER_SET_PROTOCOL_FEATURES:
vhost_user_set_protocol_features(dev, msg.payload.u64);
@@ -1015,7 +1020,7 @@ vhost_user_msg_handler(int vid, int fd)

/* it needs a reply */
msg.size = sizeof(msg.payload.u64);
- send_vhost_message(fd, &msg);
+ send_vhost_reply(fd, &msg);
break;
case VHOST_USER_SET_LOG_FD:
close(msg.fds[0]);
@@ -1035,7 +1040,7 @@ vhost_user_msg_handler(int vid, int fd)
case VHOST_USER_GET_VRING_BASE:
vhost_user_get_vring_base(dev, &msg);
msg.size = sizeof(msg.payload.state);
- send_vhost_message(fd, &msg);
+ send_vhost_reply(fd, &msg);
break;

case VHOST_USER_SET_VRING_KICK:
@@ -1054,7 +1059,7 @@ vhost_user_msg_handler(int vid, int fd)
case VHOST_USER_GET_QUEUE_NUM:
msg.payload.u64 = VHOST_MAX_QUEUE_PAIRS;
msg.size = sizeof(msg.payload.u64);
- send_vhost_message(fd, &msg);
+ send_vhost_reply(fd, &msg);
break;

case VHOST_USER_SET_VRING_ENABLE:
@@ -1077,7 +1082,7 @@ vhost_user_msg_handler(int vid, int fd)
if (msg.flags & VHOST_USER_NEED_REPLY) {
msg.payload.u64 = !!ret;
msg.size = sizeof(msg.payload.u64);
- send_vhost_message(fd, &msg);
+ send_vhost_reply(fd, &msg);
}

if (!(dev->flags & VIRTIO_DEV_RUNNING) && virtio_is_ready(dev)) {
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:07 UTC
Currently, only QEMU sends requests and the backend sends
replies. In some cases, the backend may need to send
requests to QEMU, such as IOTLB miss events when IOMMU is
supported.

This patch introduces a new channel for such requests.
QEMU sends a file descriptor of a new socket using
VHOST_USER_SET_SLAVE_REQ_FD.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.h | 2 ++
lib/librte_vhost/vhost_user.c | 27 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user.h | 10 +++++++++-
3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 18ad69c85..2340b0c2a 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -196,6 +196,8 @@ struct virtio_net {
uint32_t nr_guest_pages;
uint32_t max_guest_pages;
struct guest_page *guest_pages;
+
+ int slave_req_fd;
} __rte_cache_aligned;


diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 8984dcb6a..7b3c2f32a 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -76,6 +76,7 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
[VHOST_USER_SET_VRING_ENABLE] = "VHOST_USER_SET_VRING_ENABLE",
[VHOST_USER_SEND_RARP] = "VHOST_USER_SEND_RARP",
[VHOST_USER_NET_SET_MTU] = "VHOST_USER_NET_SET_MTU",
+ [VHOST_USER_SET_SLAVE_REQ_FD] = "VHOST_USER_SET_SLAVE_REQ_FD",
};

static uint64_t
@@ -122,6 +123,11 @@ vhost_backend_cleanup(struct virtio_net *dev)
munmap((void *)(uintptr_t)dev->log_addr, dev->log_size);
dev->log_addr = 0;
}
+
+ if (dev->slave_req_fd >= 0) {
+ close(dev->slave_req_fd);
+ dev->slave_req_fd = -1;
+ }
}

/*
@@ -844,6 +850,23 @@ vhost_user_net_set_mtu(struct virtio_net *dev, struct VhostUserMsg *msg)
return 0;
}

+static int
+vhost_user_set_req_fd(struct virtio_net *dev, struct VhostUserMsg *msg)
+{
+ int fd = msg->fds[0];
+
+ if (fd < 0) {
+ RTE_LOG(ERR, VHOST_CONFIG,
+ "Invalid file descriptor for slave channel (%d)\n",
+ fd);
+ return -1;
+ }
+
+ dev->slave_req_fd = fd;
+
+ return 0;
+}
+
/* return bytes# of read on success or negative val on failure. */
static int
read_vhost_message(int sockfd, struct VhostUserMsg *msg)
@@ -1073,6 +1096,10 @@ vhost_user_msg_handler(int vid, int fd)
ret = vhost_user_net_set_mtu(dev, &msg);
break;

+ case VHOST_USER_SET_SLAVE_REQ_FD:
+ ret = vhost_user_set_req_fd(dev, &msg);
+ break;
+
default:
ret = -1;
break;
diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_user.h
index 2ba22dbb0..98f6e6f37 100644
--- a/lib/librte_vhost/vhost_user.h
+++ b/lib/librte_vhost/vhost_user.h
@@ -48,12 +48,14 @@
#define VHOST_USER_PROTOCOL_F_RARP 2
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
#define VHOST_USER_PROTOCOL_F_NET_MTU 4
+#define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5

#define VHOST_USER_PROTOCOL_FEATURES ((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
(1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD) |\
(1ULL << VHOST_USER_PROTOCOL_F_RARP) | \
(1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK) | \
- (1ULL << VHOST_USER_PROTOCOL_F_NET_MTU))
+ (1ULL << VHOST_USER_PROTOCOL_F_NET_MTU) | \
+ (1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ))

typedef enum VhostUserRequest {
VHOST_USER_NONE = 0,
@@ -77,9 +79,15 @@ typedef enum VhostUserRequest {
VHOST_USER_SET_VRING_ENABLE = 18,
VHOST_USER_SEND_RARP = 19,
VHOST_USER_NET_SET_MTU = 20,
+ VHOST_USER_SET_SLAVE_REQ_FD = 21,
VHOST_USER_MAX
} VhostUserRequest;

+typedef enum VhostUserSlaveRequest {
+ VHOST_USER_SLAVE_NONE = 0,
+ VHOST_USER_SLAVE_MAX
+} VhostUserSlaveRequest;
+
typedef struct VhostUserMemoryRegion {
uint64_t guest_phys_addr;
uint64_t memory_size;
--
2.13.3
Tiwei Bie
2017-09-05 04:19:40 UTC
Post by Maxime Coquelin
Currently, only QEMU sends requests and the backend sends
replies. In some cases, the backend may need to send
requests to QEMU, like IOTLB miss events when IOMMU is
supported.
This patch introduces a new channel for such requests.
QEMU sends a file descriptor of a new socket using
VHOST_USER_SET_SLAVE_REQ_FD.
---
lib/librte_vhost/vhost.h | 2 ++
lib/librte_vhost/vhost_user.c | 27 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user.h | 10 +++++++++-
3 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 18ad69c85..2340b0c2a 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -196,6 +196,8 @@ struct virtio_net {
uint32_t nr_guest_pages;
uint32_t max_guest_pages;
struct guest_page *guest_pages;
+
+ int slave_req_fd;
} __rte_cache_aligned;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 8984dcb6a..7b3c2f32a 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -76,6 +76,7 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
[VHOST_USER_SET_VRING_ENABLE] = "VHOST_USER_SET_VRING_ENABLE",
[VHOST_USER_SEND_RARP] = "VHOST_USER_SEND_RARP",
[VHOST_USER_NET_SET_MTU] = "VHOST_USER_NET_SET_MTU",
+ [VHOST_USER_SET_SLAVE_REQ_FD] = "VHOST_USER_SET_SLAVE_REQ_FD",
};
static uint64_t
@@ -122,6 +123,11 @@ vhost_backend_cleanup(struct virtio_net *dev)
munmap((void *)(uintptr_t)dev->log_addr, dev->log_size);
dev->log_addr = 0;
}
+
+ if (dev->slave_req_fd >= 0) {
+ close(dev->slave_req_fd);
+ dev->slave_req_fd = -1;
The slave_req_fd should also be initialized to -1 when allocating
the virtio_net structure. Currently, it's missing.

Best regards,
Tiwei Bie
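The one-line fix Tiwei suggests would look roughly like the sketch below; placing it in vhost_new_device() (lib/librte_vhost/vhost.c) is an assumption, any spot right after the virtio_net structure is allocated would do:

```c
/* Sketch of the suggested fix: mark the slave channel as unused at
 * allocation time, so that vhost_backend_cleanup() does not close
 * fd 0 on a zero-initialized structure. Placement is an assumption. */
dev->slave_req_fd = -1;
```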
Maxime Coquelin
2017-09-05 08:18:33 UTC
Post by Tiwei Bie
Post by Maxime Coquelin
Currently, only QEMU sends requests, the backend sends
replies. In some cases, the backend may need to send
requests to QEMU, like IOTLB miss events when IOMMU is
supported.
This patch introduces a new channel for such requests.
QEMU sends a file descriptor of a new socket using
VHOST_USER_SET_SLAVE_REQ_FD.
---
lib/librte_vhost/vhost.h | 2 ++
lib/librte_vhost/vhost_user.c | 27 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user.h | 10 +++++++++-
3 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 18ad69c85..2340b0c2a 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -196,6 +196,8 @@ struct virtio_net {
uint32_t nr_guest_pages;
uint32_t max_guest_pages;
struct guest_page *guest_pages;
+
+ int slave_req_fd;
} __rte_cache_aligned;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 8984dcb6a..7b3c2f32a 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -76,6 +76,7 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
[VHOST_USER_SET_VRING_ENABLE] = "VHOST_USER_SET_VRING_ENABLE",
[VHOST_USER_SEND_RARP] = "VHOST_USER_SEND_RARP",
[VHOST_USER_NET_SET_MTU] = "VHOST_USER_NET_SET_MTU",
+ [VHOST_USER_SET_SLAVE_REQ_FD] = "VHOST_USER_SET_SLAVE_REQ_FD",
};
static uint64_t
@@ -122,6 +123,11 @@ vhost_backend_cleanup(struct virtio_net *dev)
munmap((void *)(uintptr_t)dev->log_addr, dev->log_size);
dev->log_addr = 0;
}
+
+ if (dev->slave_req_fd >= 0) {
+ close(dev->slave_req_fd);
+ dev->slave_req_fd = -1;
The slave_req_fd should also be initialized to -1 when allocating
the virtio_net structure. Currently, it's missing.
Good catch, thanks for spotting this.

Maxime
Post by Tiwei Bie
Best regards,
Tiwei Bie
Maxime Coquelin
2017-08-31 09:50:08 UTC
These defines and enums have been introduced in upstream kernel v4.8,
and backported to RHEL 7.4.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.h | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2340b0c2a..5c9d9316a 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -132,6 +132,37 @@ struct vhost_virtqueue {
#define VIRTIO_NET_F_MTU 3
#endif

+/* Declare IOMMU related bits for older kernels */
+#ifndef VIRTIO_F_IOMMU_PLATFORM
+
+#define VIRTIO_F_IOMMU_PLATFORM 33
+
+struct vhost_iotlb_msg {
+ __u64 iova;
+ __u64 size;
+ __u64 uaddr;
+#define VHOST_ACCESS_RO 0x1
+#define VHOST_ACCESS_WO 0x2
+#define VHOST_ACCESS_RW 0x3
+ __u8 perm;
+#define VHOST_IOTLB_MISS 1
+#define VHOST_IOTLB_UPDATE 2
+#define VHOST_IOTLB_INVALIDATE 3
+#define VHOST_IOTLB_ACCESS_FAIL 4
+ __u8 type;
+};
+
+#define VHOST_IOTLB_MSG 0x1
+
+struct vhost_msg {
+ int type;
+ union {
+ struct vhost_iotlb_msg iotlb;
+ __u8 padding[64];
+ };
+};
+#endif
+
/*
* Define virtio 1.0 for older kernels
*/
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:09 UTC
Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/Makefile | 4 +-
lib/librte_vhost/iotlb.c | 231 ++++++++++++++++++++++++++++++++++++++++++++++
lib/librte_vhost/iotlb.h | 46 +++++++++
lib/librte_vhost/vhost.c | 1 +
lib/librte_vhost/vhost.h | 5 +
5 files changed, 285 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_vhost/iotlb.c
create mode 100644 lib/librte_vhost/iotlb.h

diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index 4a116fe31..e1084aba5 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -47,8 +47,8 @@ LDLIBS += -lnuma
endif

# all source are stored in SRCS-y
-SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c socket.c vhost.c vhost_user.c \
- virtio_net.c
+SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
+ vhost_user.c virtio_net.c

# install includes
SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
new file mode 100644
index 000000000..1b739dae5
--- /dev/null
+++ b/lib/librte_vhost/iotlb.c
@@ -0,0 +1,231 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) 2017 Red Hat, Inc.
+ * Copyright (c) 2017 Maxime Coquelin <***@redhat.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifdef RTE_LIBRTE_VHOST_NUMA
+#include <numaif.h>
+#endif
+
+#include <rte_tailq.h>
+
+#include "iotlb.h"
+#include "vhost.h"
+
+struct vhost_iotlb_entry {
+ TAILQ_ENTRY(vhost_iotlb_entry) next;
+
+ uint64_t iova;
+ uint64_t uaddr;
+ uint64_t size;
+ uint8_t perm;
+};
+
+#define IOTLB_CACHE_SIZE 1024
+
+static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
+
+ new_node->iova = iova;
+ new_node->uaddr = uaddr;
+ new_node->size = size;
+ new_node->perm = perm;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH(node, &vq->iotlb_list, next) {
+ /*
+ * Entries must be invalidated before being updated.
+ * So if iova already in list, assume identical.
+ */
+ if (node->iova == new_node->iova) {
+ rte_mempool_put(vq->iotlb_pool, new_node);
+ goto unlock;
+ } else if (node->iova > new_node->iova) {
+ TAILQ_INSERT_BEFORE(node, new_node, next);
+ goto unlock;
+ }
+ }
+
+ TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
+
+unlock:
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ if (unlikely(!size))
+ return;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ /* Sorted list */
+ if (unlikely(node->iova >= iova + size)) {
+ break;
+ } else if ((node->iova < iova + size) &&
+ (iova < node->iova + node->size)) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ continue;
+ }
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+
+uint64_t vhost_user_iotlb_cache_find(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t *size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node;
+ uint64_t offset, vva = 0, mapped = 0;
+
+ if (unlikely(!*size))
+ goto out;
+
+ rte_rwlock_read_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH(node, &vq->iotlb_list, next) {
+ /* List sorted by iova */
+ if (unlikely(iova < node->iova))
+ break;
+
+ if (iova >= node->iova + node->size)
+ continue;
+
+ if (unlikely((perm & node->perm) != perm)) {
+ vva = 0;
+ break;
+ }
+
+ offset = iova - node->iova;
+ if (!vva)
+ vva = node->uaddr + offset;
+
+ mapped += node->size - offset;
+ iova = node->iova + node->size;
+
+ if (mapped >= *size)
+ break;
+ }
+
+ rte_rwlock_read_unlock(&vq->iotlb_lock);
+
+out:
+ /* Only part of the requested chunk is mapped */
+ if (unlikely(mapped < *size))
+ *size = mapped;
+
+ return vva;
+}
+
+int vhost_user_iotlb_init(struct virtio_net *dev, int vq_index)
+{
+ char pool_name[RTE_MEMPOOL_NAMESIZE];
+ struct vhost_virtqueue *vq = dev->virtqueue[vq_index];
+ int ret = -1, socket;
+
+ if (vq->iotlb_pool) {
+ /*
+ * The cache has already been initialized,
+ * just drop all entries
+ */
+ vhost_user_iotlb_cache_remove_all(vq);
+ return 0;
+ }
+
+#ifdef RTE_LIBRTE_VHOST_NUMA
+ ret = get_mempolicy(&socket, NULL, 0, vq, MPOL_F_NODE | MPOL_F_ADDR);
+#endif
+ if (ret)
+ socket = 0;
+
+ rte_rwlock_init(&vq->iotlb_lock);
+
+ TAILQ_INIT(&vq->iotlb_list);
+
+ snprintf(pool_name, sizeof(pool_name), "iotlb_cache_%d_%d",
+ dev->vid, vq_index);
+
+ /* If already created, free it and recreate */
+ vq->iotlb_pool = rte_mempool_lookup(pool_name);
+ if (vq->iotlb_pool)
+ rte_mempool_free(vq->iotlb_pool);
+
+ vq->iotlb_pool = rte_mempool_create(pool_name,
+ IOTLB_CACHE_SIZE, sizeof(struct vhost_iotlb_entry), 0,
+ 0, 0, NULL, NULL, NULL, socket,
+ MEMPOOL_F_NO_CACHE_ALIGN |
+ MEMPOOL_F_SP_PUT |
+ MEMPOOL_F_SC_GET);
+ if (!vq->iotlb_pool) {
+ RTE_LOG(ERR, VHOST_CONFIG,
+ "Failed to create IOTLB cache pool (%s)\n",
+ pool_name);
+ return -1;
+ }
+
+ return 0;
+}
+
diff --git a/lib/librte_vhost/iotlb.h b/lib/librte_vhost/iotlb.h
new file mode 100644
index 000000000..459820762
--- /dev/null
+++ b/lib/librte_vhost/iotlb.h
@@ -0,0 +1,46 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) 2017 Red Hat, Inc.
+ * Copyright (c) 2017 Maxime Coquelin <***@redhat.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#ifndef _VHOST_IOTLB_H_
+#define _VHOST_IOTLB_H_
+
+#include "vhost.h"
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size,
+ uint8_t perm);
+void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size);
+uint64_t vhost_user_iotlb_cache_find(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t *size, uint8_t perm);
+int vhost_user_iotlb_init(struct virtio_net *dev, int vq_index);
+
+#endif /* _VHOST_IOTLB_H_ */
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 429983858..f1099753c 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -188,6 +188,7 @@ free_device(struct virtio_net *dev)
vq = dev->virtqueue[i];

rte_free(vq->shadow_used_ring);
+ rte_mempool_free(vq->iotlb_pool);

rte_free(vq);
}
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 5c9d9316a..7816a92b5 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -45,6 +45,7 @@

#include <rte_log.h>
#include <rte_ether.h>
+#include <rte_rwlock.h>

#include "rte_vhost.h"

@@ -114,6 +115,10 @@ struct vhost_virtqueue {

struct vring_used_elem *shadow_used_ring;
uint16_t shadow_used_idx;
+
+ rte_rwlock_t iotlb_lock;
+ struct rte_mempool *iotlb_pool;
+ TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
} __rte_cache_aligned;

/* Old kernels have no such macros defined */
--
2.13.3
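For readers skimming the thread, the lookup semantics of vhost_user_iotlb_cache_find() above can be modelled with a standalone sketch: the list is sorted by iova, contiguous entries may together satisfy a request larger than any single entry, and on a partial hit the requested size is shrunk to the mapped length. The DPDK mempool, locking and permission check are left out, and a plain sorted array stands in for the TAILQ:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct entry { uint64_t iova, uaddr, size; };

/*
 * Simplified model of vhost_user_iotlb_cache_find(): walk the sorted
 * entries, accumulate the mapped length across contiguous entries,
 * and shrink *size on a partial hit. Returns 0 on a miss.
 */
static uint64_t cache_find(const struct entry *e, size_t n,
			   uint64_t iova, uint64_t *size)
{
	uint64_t offset, vva = 0, mapped = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* List sorted by iova: no later entry can match */
		if (iova < e[i].iova)
			break;

		if (iova >= e[i].iova + e[i].size)
			continue;

		offset = iova - e[i].iova;
		if (!vva)
			vva = e[i].uaddr + offset;

		mapped += e[i].size - offset;
		iova = e[i].iova + e[i].size;	/* continue with next chunk */

		if (mapped >= *size)
			break;
	}

	/* Only part of the requested chunk is mapped */
	if (mapped < *size)
		*size = mapped;

	return vva;
}
```

Note that, like the real code, this model does not check that contiguous iova entries are also contiguous in user virtual addresses; it simply returns the address of the first chunk and reports how many bytes are covered.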
Tiwei Bie
2017-09-05 06:02:00 UTC
On Thu, Aug 31, 2017 at 11:50:09AM +0200, Maxime Coquelin wrote:
[...]
Post by Maxime Coquelin
+
+#define IOTLB_CACHE_SIZE 1024
+
+static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
I think the log level should be DEBUG or INFO or the like, not ERR.
Post by Maxime Coquelin
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
+
[...]
Post by Maxime Coquelin
+
+void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ if (unlikely(!size))
+ return;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ /* Sorted list */
+ if (unlikely(node->iova >= iova + size)) {
+ break;
+ } else if ((node->iova < iova + size) &&
+ (iova < node->iova + node->size)) {
The `else' can be removed.
And the check of (node->iova < iova + size) can also be removed.
Post by Maxime Coquelin
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ continue;
+ }
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+
Only one empty line is needed here.
Post by Maxime Coquelin
+uint64_t vhost_user_iotlb_cache_find(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t *size, uint8_t perm)
+{
[...]
Post by Maxime Coquelin
+#ifndef _VHOST_IOTLB_H_
+#define _VHOST_IOTLB_H_
+
+#include "vhost.h"
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size,
+ uint8_t perm);
An empty line should be added after #include "vhost.h".

Best regards,
Tiwei Bie
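Taken together, Tiwei's two simplifications reduce the overlap test in the removal loop to a single comparison: once the break on (node->iova >= iova + size) has not fired, (node->iova < iova + size) is guaranteed, so only the second half of the condition remains. A standalone sketch of the resulting logic, with a plain sorted array standing in for the TAILQ and a valid flag standing in for list removal:

```c
#include <assert.h>
#include <stdint.h>

struct range { uint64_t iova, size; int valid; };

/*
 * Invalidate all entries overlapping [iova, iova + size) in an array
 * sorted by iova. Because the array is sorted, we can stop as soon as
 * an entry starts at or beyond the end of the invalidated range, and
 * the remaining overlap test is a single comparison.
 */
static void remove_range(struct range *e, int n, uint64_t iova, uint64_t size)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!e[i].valid)
			continue;

		/* Sorted: nothing further can overlap */
		if (e[i].iova >= iova + size)
			break;

		/* Here e[i].iova < iova + size holds, so overlap
		 * reduces to checking the entry's end. */
		if (iova < e[i].iova + e[i].size)
			e[i].valid = 0;
	}
}
```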
Maxime Coquelin
2017-09-05 15:16:29 UTC
Post by Tiwei Bie
[...]
Post by Maxime Coquelin
+
+#define IOTLB_CACHE_SIZE 1024
+
+static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
I think the log level should be DEBUG or INFO or the likes, not ERR.
I set it to ERR because failing in this case would mean either that the
IOTLB cache has not been sized large enough (and we would like that
reported, since it impacts performance and drop rate), or that we have
a malicious or buggy guest.

I'm fine with changing it to INFO though.
Post by Tiwei Bie
Post by Maxime Coquelin
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
+
[...]
Post by Maxime Coquelin
+
+void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ if (unlikely(!size))
+ return;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ /* Sorted list */
+ if (unlikely(node->iova >= iova + size)) {
+ break;
+ } else if ((node->iova < iova + size) &&
+ (iova < node->iova + node->size)) {
The `else' can be removed.
And the check of (node->iova < iova + size) can also be removed.
Right! Fixed.
Post by Tiwei Bie
Post by Maxime Coquelin
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ continue;
+ }
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+
Only one empty line is needed here.
Fixed.
Post by Tiwei Bie
Post by Maxime Coquelin
+uint64_t vhost_user_iotlb_cache_find(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t *size, uint8_t perm)
+{
[...]
Post by Maxime Coquelin
+#ifndef _VHOST_IOTLB_H_
+#define _VHOST_IOTLB_H_
+
+#include "vhost.h"
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size,
+ uint8_t perm);
An empty line should be added after #include "vhost.h".
Fixed.

Thanks,
Maxime
Post by Tiwei Bie
Best regards,
Tiwei Bie
Yuanhan Liu
2017-09-08 08:08:55 UTC
Post by Maxime Coquelin
diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
new file mode 100644
index 000000000..1b739dae5
--- /dev/null
+++ b/lib/librte_vhost/iotlb.c
@@ -0,0 +1,231 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) 2017 Red Hat, Inc.
I'm not a lawyer, but I was told many years ago that you don't
hold the copyright for the code you write for an open source project; the
company you work for does.

Thus, it's more common to see something like the following:
Copyright , ... the company ...
Author: Some One <***@...>

However, as you may have noticed, it's not common to put the authorship
in the source files. Though I don't object to it.

[...]
Post by Maxime Coquelin
+#define IOTLB_CACHE_SIZE 1024
+
+static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
^^^^^^^^^^^^
Note that this is not the DPDK coding style for defining a function.
Post by Maxime Coquelin
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
It's a cache; why not consider removing one entry to make space for the new one?
Post by Maxime Coquelin
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
+
+ new_node->iova = iova;
+ new_node->uaddr = uaddr;
+ new_node->size = size;
+ new_node->perm = perm;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH(node, &vq->iotlb_list, next) {
+ /*
+ * Entries must be invalidated before being updated.
+ * So if iova already in list, assume identical.
+ */
+ if (node->iova == new_node->iova) {
+ rte_mempool_put(vq->iotlb_pool, new_node);
+ goto unlock;
+ } else if (node->iova > new_node->iova) {
+ TAILQ_INSERT_BEFORE(node, new_node, next);
+ goto unlock;
+ }
+ }
+
+ TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ if (unlikely(!size))
+ return;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ /* Sorted list */
I'd like to put such comments at the struct declaration, so that you don't
have to mention many times that it's a sorted list.
Post by Maxime Coquelin
+ if (unlikely(node->iova >= iova + size)) {
+ break;
+ } else if ((node->iova < iova + size) &&
+ (iova < node->iova + node->size)) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ continue;
+ }
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
[...]
Post by Maxime Coquelin
+int vhost_user_iotlb_init(struct virtio_net *dev, int vq_index)
+{
+ char pool_name[RTE_MEMPOOL_NAMESIZE];
+ struct vhost_virtqueue *vq = dev->virtqueue[vq_index];
+ int ret = -1, socket;
+
+ if (vq->iotlb_pool) {
+ /*
+ * The cache has already been initialized,
+ * just drop all entries
+ */
+ vhost_user_iotlb_cache_remove_all(vq);
+ return 0;
+ }
+
+#ifdef RTE_LIBRTE_VHOST_NUMA
+ ret = get_mempolicy(&socket, NULL, 0, vq, MPOL_F_NODE | MPOL_F_ADDR);
+#endif
+ if (ret)
+ socket = 0;
+
+ rte_rwlock_init(&vq->iotlb_lock);
+
+ TAILQ_INIT(&vq->iotlb_list);
+
+ snprintf(pool_name, sizeof(pool_name), "iotlb_cache_%d_%d",
+ dev->vid, vq_index);
iotlb_cache is too generic. Adding a "vhost" prefix?

--yliu
Maxime Coquelin
2017-09-08 08:24:58 UTC
Post by Yuanhan Liu
Post by Maxime Coquelin
diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
new file mode 100644
index 000000000..1b739dae5
--- /dev/null
+++ b/lib/librte_vhost/iotlb.c
@@ -0,0 +1,231 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) 2017 Red Hat, Inc.
I'm not a lawer, but I have been told many years before, that you don't
have the copyright for the code you write for open source project, the
company you work for does.
Copyright , ... the commany ...
However, as you may have noticed, it's not common to put the authorship
in the source files. Though I don't object it.
I'm not a lawyer either. At least in other projects, it seems common for the
author to put his name as copyright owner.

I have no issue to change it to only keep Red Hat's one though.
Post by Yuanhan Liu
[...]
Post by Maxime Coquelin
+#define IOTLB_CACHE_SIZE 1024
+
+static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
^^^^^^^^^^^^
Note that it's not the DPDK coding style to define a function.
Ok, I guess you mean:
static void
vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq) ?
Post by Yuanhan Liu
Post by Maxime Coquelin
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
It's a cache, why not considering remove one to get space for new one?
It would mean having to track every lookup so as not to evict hot entries,
which would have an impact on performance.

Moreover, the idea is to have the cache large enough; otherwise you could
face packet drops due to random cache misses.

We might consider improving it, but I see it as an optimization that
could be implemented later if needed.
Post by Yuanhan Liu
Post by Maxime Coquelin
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
+
+ new_node->iova = iova;
+ new_node->uaddr = uaddr;
+ new_node->size = size;
+ new_node->perm = perm;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH(node, &vq->iotlb_list, next) {
+ /*
+ * Entries must be invalidated before being updated.
+ * So if iova already in list, assume identical.
+ */
+ if (node->iova == new_node->iova) {
+ rte_mempool_put(vq->iotlb_pool, new_node);
+ goto unlock;
+ } else if (node->iova > new_node->iova) {
+ TAILQ_INSERT_BEFORE(node, new_node, next);
+ goto unlock;
+ }
+ }
+
+ TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ if (unlikely(!size))
+ return;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ /* Sorted list */
I'd like to put such comments at the struct declartion, so that you don't
have to mention it many times that it's a sorted list.
Ok, I'll comment directly in struct declaration.
Post by Yuanhan Liu
Post by Maxime Coquelin
+ if (unlikely(node->iova >= iova + size)) {
+ break;
+ } else if ((node->iova < iova + size) &&
+ (iova < node->iova + node->size)) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ continue;
+ }
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
[...]
Post by Maxime Coquelin
+int vhost_user_iotlb_init(struct virtio_net *dev, int vq_index)
+{
+ char pool_name[RTE_MEMPOOL_NAMESIZE];
+ struct vhost_virtqueue *vq = dev->virtqueue[vq_index];
+ int ret = -1, socket;
+
+ if (vq->iotlb_pool) {
+ /*
+ * The cache has already been initialized,
+ * just drop all entries
+ */
+ vhost_user_iotlb_cache_remove_all(vq);
+ return 0;
+ }
+
+#ifdef RTE_LIBRTE_VHOST_NUMA
+ ret = get_mempolicy(&socket, NULL, 0, vq, MPOL_F_NODE | MPOL_F_ADDR);
+#endif
+ if (ret)
+ socket = 0;
+
+ rte_rwlock_init(&vq->iotlb_lock);
+
+ TAILQ_INIT(&vq->iotlb_list);
+
+ snprintf(pool_name, sizeof(pool_name), "iotlb_cache_%d_%d",
+ dev->vid, vq_index);
iotlb_cache is too generic. Adding a "vhost" prefix?
Sure, that would be better.

Thanks,
Maxime
Post by Yuanhan Liu
--yliu
Yuanhan Liu
2017-09-08 08:36:33 UTC
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
new file mode 100644
index 000000000..1b739dae5
--- /dev/null
+++ b/lib/librte_vhost/iotlb.c
@@ -0,0 +1,231 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) 2017 Red Hat, Inc.
I'm not a lawer, but I have been told many years before, that you don't
have the copyright for the code you write for open source project, the
company you work for does.
Copyright , ... the commany ...
However, as you may have noticed, it's not common to put the authorship
in the source files. Though I don't object it.
I'm not a lawyer too. At least in other projects, it seems common the
author puts his name as copyright owner.
I have no issue to change it to only keep Red Hat's one though.
That's up to you. What I said before was JFYI :)
Post by Maxime Coquelin
Post by Yuanhan Liu
[...]
Post by Maxime Coquelin
+#define IOTLB_CACHE_SIZE 1024
+
+static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
^^^^^^^^^^^^
Note that it's not the DPDK coding style to define a function.
static void
vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq) ?
Yep.
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
It's a cache; why not consider removing one entry to make space for the new one?
It would mean having to track every lookup so as not to remove hot
entries, which would have a performance impact.
You were removing all cache entries; how can we do worse than that? Even
a random eviction would be better. Or, more simply, just remove the head
or the tail?

--yliu
Post by Maxime Coquelin
Moreover, the idea is to size the cache large enough; otherwise you could
face packet drops due to random cache misses.
We might consider improving it, but I see it as an optimization that
could be implemented later if needed.
Post by Yuanhan Liu
Post by Maxime Coquelin
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
Maxime Coquelin
2017-09-08 08:50:49 UTC
Post by Yuanhan Liu
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
new file mode 100644
index 000000000..1b739dae5
--- /dev/null
+++ b/lib/librte_vhost/iotlb.c
@@ -0,0 +1,231 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) 2017 Red Hat, Inc.
I'm not a lawyer, but I was told many years ago that you don't hold the
copyright for code you write for an open source project; the company you
work for does.
Copyright ... the company ...
However, as you may have noticed, it's not common to put the authorship
in the source files. Though I don't object to it.
I'm not a lawyer either. At least in other projects, it seems common for
the author to put his name as copyright owner.
I have no issue with changing it to keep only Red Hat's, though.
That's up to you. What I said before was JFYI :)
Post by Maxime Coquelin
Post by Yuanhan Liu
[...]
Post by Maxime Coquelin
+#define IOTLB_CACHE_SIZE 1024
+
+static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
^^^^^^^^^^^^
Note that this is not the DPDK coding style for defining a function. How about:
static void
vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq) ?
Yep.
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
It's a cache; why not consider removing one entry to make space for the new one?
It would mean having to track every lookup so as not to remove hot
entries, which would have a performance impact.
You were removing all cache entries; how can we do worse than that? Even
a random eviction would be better. Or, more simply, just remove the head
or the tail?
I think removing the head or tail could cause deadlocks.
For example, suppose we need to translate the range 0x0 to 0x2000, with a
page size of 0x1000.

If the cache is full, inserting 0x1000-0x1fff will evict 0x0-0xfff, so a
miss will be sent for 0x0-0xfff, and 0x1000-0x1fff will in turn be
evicted at insert time, and so on...
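That thrashing can be reproduced with a minimal standalone simulation (all names here are illustrative, not the actual vhost-user IOTLB code): a single-entry FIFO cache can never hold both pages of the 0x0-0x2000 range at once, so the translation loop never terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Minimal sketch of the head-eviction livelock described above.
 * Hypothetical names; this is not the real vhost IOTLB code.
 */
#define PAGE_SIZE 0x1000ULL
#define CACHE_CAPACITY 1	/* degenerate "full" cache */

static uint64_t cache[CACHE_CAPACITY];
static int cache_len;

static bool cache_contains(uint64_t iova)
{
	for (int i = 0; i < cache_len; i++)
		if (cache[i] == iova)
			return true;
	return false;
}

static void cache_insert(uint64_t iova)
{
	if (cache_len == CACHE_CAPACITY)
		cache[0] = iova;	/* full: evict the head entry */
	else
		cache[cache_len++] = iova;
}

/* Translating [0x0, 0x2000) needs both pages cached simultaneously. */
static bool translate_buffer(int max_rounds)
{
	for (int round = 0; round < max_rounds; round++) {
		if (cache_contains(0x0) && cache_contains(PAGE_SIZE))
			return true;
		/* IOTLB miss: cache the missing page, evicting the other */
		if (!cache_contains(0x0))
			cache_insert(0x0);
		else
			cache_insert(PAGE_SIZE);
	}
	return false;	/* never both pages cached at once: livelock */
}
```

With a capacity of 2 the loop converges after two misses; with 1 it loops forever, which is the pathological case a full cache plus head/tail eviction creates.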


Note that we really need to size the cache large enough for performance
purposes, so a full cache could only be caused by either a buggy or a
malicious guest.
Post by Yuanhan Liu
--yliu
Post by Maxime Coquelin
Moreover, the idea is to size the cache large enough; otherwise you could
face packet drops due to random cache misses.
We might consider improving it, but I see it as an optimization that
could be implemented later if needed.
Post by Yuanhan Liu
Post by Maxime Coquelin
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
Yuanhan Liu
2017-09-08 09:21:21 UTC
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
It's a cache; why not consider removing one entry to make space for the new one?
It would mean having to track every lookup so as not to remove hot
entries, which would have a performance impact.
You were removing all cache entries; how can we do worse than that? Even
a random eviction would be better. Or, more simply, just remove the head
or the tail?
I think removing the head or tail could cause deadlocks.
For example, suppose we need to translate the range 0x0 to 0x2000, with a
page size of 0x1000.
If the cache is full, inserting 0x1000-0x1fff will evict 0x0-0xfff, so a
miss will be sent for 0x0-0xfff, and 0x1000-0x1fff will in turn be
evicted at insert time, and so on...
Okay, that means we can't simply remove the head or tail.
Post by Maxime Coquelin
Note that we really need to size the cache large enough for performance
purposes, so a full cache could only be caused by either a buggy or a
malicious guest.
I agree. But for a malicious guest, it could lead to a DoS attack:
assume it keeps making vhost run out of cache, so that vhost keeps
printing the above message.

What I suggested was to evict one entry (by some policy) to make space
for the new one we want to insert. Note that it won't be a performance
issue, IMO, since we only do that when we run out of cache entries,
which, as you said, should not happen in normal cases.

--yliu
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
Moreover, the idea is to size the cache large enough; otherwise you could
face packet drops due to random cache misses.
We might consider improving it, but I see it as an optimization that
could be implemented later if needed.
Post by Yuanhan Liu
Post by Maxime Coquelin
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
Maxime Coquelin
2017-09-08 09:28:57 UTC
Post by Yuanhan Liu
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint64_t uaddr, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *new_node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
It's a cache; why not consider removing one entry to make space for the new one?
It would mean having to track every lookup so as not to remove hot
entries, which would have a performance impact.
You were removing all cache entries; how can we do worse than that? Even
a random eviction would be better. Or, more simply, just remove the head
or the tail?
I think removing the head or tail could cause deadlocks.
For example, suppose we need to translate the range 0x0 to 0x2000, with a
page size of 0x1000.
If the cache is full, inserting 0x1000-0x1fff will evict 0x0-0xfff, so a
miss will be sent for 0x0-0xfff, and 0x1000-0x1fff will in turn be
evicted at insert time, and so on...
Okay, that means we can't simply remove the head or tail.
Post by Maxime Coquelin
Note that we really need to size the cache large enough for performance
purposes, so a full cache could only be caused by either a buggy or a
malicious guest.
assume it keeps making vhost run out of cache, so that vhost keeps
printing the above message.
What I suggested was to evict one entry (by some policy) to make space
for the new one we want to insert. Note that it won't be a performance
issue, IMO, since we only do that when we run out of cache entries,
which, as you said, should not happen in normal cases.
Ok, so let's randomly remove one entry. What about using something like:
rte_rand() % IOTLB_CACHE_SIZE
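A sketch of that random-eviction fallback, using the standard rand() in place of rte_rand() and a flat array in place of the patch's TAILQ (illustrative only, not the actual vhost-user data structures):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical fixed-size cache with random eviction on overflow. */
#define IOTLB_CACHE_SIZE 8

static int entries[IOTLB_CACHE_SIZE];
static int nb_entries;

static void cache_evict_random(void)
{
	if (nb_entries == 0)
		return;

	/* rand() stands in for rte_rand() in this sketch */
	int victim = rand() % nb_entries;

	/* Fill the hole with the last entry to keep the array dense. */
	entries[victim] = entries[--nb_entries];
}

static void cache_insert(int entry)
{
	if (nb_entries == IOTLB_CACHE_SIZE)
		cache_evict_random();	/* make room for exactly one */
	entries[nb_entries++] = entry;
}
```

Evicting a single victim keeps the remaining hot entries cached, instead of flushing the whole list as the original patch does when the mempool runs dry.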
Post by Yuanhan Liu
--yliu
Post by Maxime Coquelin
Post by Yuanhan Liu
Post by Maxime Coquelin
Moreover, the idea is to size the cache large enough; otherwise you could
face packet drops due to random cache misses.
We might consider improving it, but I see it as an optimization that
could be implemented later if needed.
Post by Yuanhan Liu
Post by Maxime Coquelin
+ vhost_user_iotlb_cache_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&new_node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
Maxime Coquelin
2017-08-31 09:50:10 UTC
In order to be able to handle other ports or queues while waiting
for an IOTLB miss reply, a pending list is created so that the waiter
can return and retry later by sending the miss request again.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/iotlb.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
lib/librte_vhost/iotlb.h | 4 +++
lib/librte_vhost/vhost.h | 1 +
3 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
index 1b739dae5..d014bfe98 100644
--- a/lib/librte_vhost/iotlb.c
+++ b/lib/librte_vhost/iotlb.c
@@ -49,7 +49,86 @@ struct vhost_iotlb_entry {
uint8_t perm;
};

-#define IOTLB_CACHE_SIZE 1024
+#define IOTLB_CACHE_SIZE 2048
+
+static void vhost_user_iotlb_pending_remove_all(struct vhost_virtqueue *vq)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_pending_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_pending_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+int vhost_user_iotlb_pending_miss(struct vhost_virtqueue *vq, uint64_t iova,
+ uint8_t perm)
+{
+ struct vhost_iotlb_entry *node;
+ int found = 0;
+
+ rte_rwlock_read_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH(node, &vq->iotlb_pending_list, next) {
+ if ((node->iova == iova) && (node->perm == perm)) {
+ found = 1;
+ break;
+ }
+ }
+
+ rte_rwlock_read_unlock(&vq->iotlb_lock);
+
+ return found;
+}
+
+void vhost_user_iotlb_pending_insert(struct vhost_virtqueue *vq,
+ uint64_t iova, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
+ vhost_user_iotlb_pending_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
+
+ node->iova = iova;
+ node->perm = perm;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_INSERT_TAIL(&vq->iotlb_pending_list, node, next);
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+static void vhost_user_iotlb_pending_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ /* .iotlb_lock already locked by the caller */
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_pending_list, next, temp_node) {
+ if (node->iova < iova)
+ continue;
+ if (node->iova >= iova + size)
+ continue;
+ if ((node->perm & perm) != node->perm)
+ continue;
+ TAILQ_REMOVE(&vq->iotlb_pending_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+}

static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
{
@@ -106,7 +185,10 @@ void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);

unlock:
+ vhost_user_iotlb_pending_remove(vq, iova, size, perm);
+
rte_rwlock_write_unlock(&vq->iotlb_lock);
+
}

void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
@@ -189,9 +271,10 @@ int vhost_user_iotlb_init(struct virtio_net *dev, int vq_index)
if (vq->iotlb_pool) {
/*
* The cache has already been initialized,
- * just drop all entries
+ * just drop all cached and pending entries.
*/
vhost_user_iotlb_cache_remove_all(vq);
+ vhost_user_iotlb_pending_remove_all(vq);
return 0;
}

@@ -204,6 +287,7 @@ int vhost_user_iotlb_init(struct virtio_net *dev, int vq_index)
rte_rwlock_init(&vq->iotlb_lock);

TAILQ_INIT(&vq->iotlb_list);
+ TAILQ_INIT(&vq->iotlb_pending_list);

snprintf(pool_name, sizeof(pool_name), "iotlb_cache_%d_%d",
dev->vid, vq_index);
diff --git a/lib/librte_vhost/iotlb.h b/lib/librte_vhost/iotlb.h
index 459820762..4be1f7e85 100644
--- a/lib/librte_vhost/iotlb.h
+++ b/lib/librte_vhost/iotlb.h
@@ -41,6 +41,10 @@ void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
uint64_t iova, uint64_t size);
uint64_t vhost_user_iotlb_cache_find(struct vhost_virtqueue *vq, uint64_t iova,
uint64_t *size, uint8_t perm);
+int vhost_user_iotlb_pending_miss(struct vhost_virtqueue *vq, uint64_t iova,
+ uint8_t perm);
+void vhost_user_iotlb_pending_insert(struct vhost_virtqueue *vq, uint64_t iova,
+ uint8_t perm);
int vhost_user_iotlb_init(struct virtio_net *dev, int vq_index);

#endif /* _VHOST_IOTLB_H_ */
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 7816a92b5..a41bacea7 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -119,6 +119,7 @@ struct vhost_virtqueue {
rte_rwlock_t iotlb_lock;
struct rte_mempool *iotlb_pool;
TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
+ TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
} __rte_cache_aligned;

/* Old kernels have no such macros defined */
--
2.13.3
Tiwei Bie
2017-09-05 07:11:06 UTC
Post by Maxime Coquelin
In order to be able to handle other ports or queues while waiting
for an IOTLB miss reply, a pending list is created so that the waiter
can return and retry later by sending the miss request again.
---
lib/librte_vhost/iotlb.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
lib/librte_vhost/iotlb.h | 4 +++
lib/librte_vhost/vhost.h | 1 +
3 files changed, 91 insertions(+), 2 deletions(-)
diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
index 1b739dae5..d014bfe98 100644
--- a/lib/librte_vhost/iotlb.c
+++ b/lib/librte_vhost/iotlb.c
@@ -49,7 +49,86 @@ struct vhost_iotlb_entry {
uint8_t perm;
};
-#define IOTLB_CACHE_SIZE 1024
+#define IOTLB_CACHE_SIZE 2048
+
+static void vhost_user_iotlb_pending_remove_all(struct vhost_virtqueue *vq)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_pending_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_pending_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+int vhost_user_iotlb_pending_miss(struct vhost_virtqueue *vq, uint64_t iova,
+ uint8_t perm)
+{
+ struct vhost_iotlb_entry *node;
+ int found = 0;
+
The return value of this function is a boolean, so it would be better
to return bool instead of int.
Post by Maxime Coquelin
+ rte_rwlock_read_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH(node, &vq->iotlb_pending_list, next) {
+ if ((node->iova == iova) && (node->perm == perm)) {
+ found = 1;
+ break;
+ }
+ }
+
+ rte_rwlock_read_unlock(&vq->iotlb_lock);
+
+ return found;
+}
+
+void vhost_user_iotlb_pending_insert(struct vhost_virtqueue *vq,
+ uint64_t iova, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
I think the log level should be INFO or the like, not ERR.
Post by Maxime Coquelin
+ vhost_user_iotlb_pending_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
+
+ node->iova = iova;
+ node->perm = perm;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_INSERT_TAIL(&vq->iotlb_pending_list, node, next);
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+static void vhost_user_iotlb_pending_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ /* .iotlb_lock already locked by the caller */
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_pending_list, next, temp_node) {
+ if (node->iova < iova)
+ continue;
+ if (node->iova >= iova + size)
+ continue;
+ if ((node->perm & perm) != node->perm)
+ continue;
+ TAILQ_REMOVE(&vq->iotlb_pending_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+}
static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
{
@@ -106,7 +185,10 @@ void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
+ vhost_user_iotlb_pending_remove(vq, iova, size, perm);
+
rte_rwlock_write_unlock(&vq->iotlb_lock);
+
This empty line should be removed.

Best regards,
Tiwei Bie
Post by Maxime Coquelin
}
void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
Maxime Coquelin
2017-09-05 15:18:37 UTC
Post by Tiwei Bie
Post by Maxime Coquelin
In order to be able to handle other ports or queues while waiting
for an IOTLB miss reply, a pending list is created so that the waiter
can return and retry later by sending the miss request again.
---
lib/librte_vhost/iotlb.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++--
lib/librte_vhost/iotlb.h | 4 +++
lib/librte_vhost/vhost.h | 1 +
3 files changed, 91 insertions(+), 2 deletions(-)
diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
index 1b739dae5..d014bfe98 100644
--- a/lib/librte_vhost/iotlb.c
+++ b/lib/librte_vhost/iotlb.c
@@ -49,7 +49,86 @@ struct vhost_iotlb_entry {
uint8_t perm;
};
-#define IOTLB_CACHE_SIZE 1024
+#define IOTLB_CACHE_SIZE 2048
+
+static void vhost_user_iotlb_pending_remove_all(struct vhost_virtqueue *vq)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_pending_list, next, temp_node) {
+ TAILQ_REMOVE(&vq->iotlb_pending_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+int vhost_user_iotlb_pending_miss(struct vhost_virtqueue *vq, uint64_t iova,
+ uint8_t perm)
+{
+ struct vhost_iotlb_entry *node;
+ int found = 0;
+
The return value of this function is a boolean, so it would be better
to return bool instead of int.
Fixed.
Post by Tiwei Bie
Post by Maxime Coquelin
+ rte_rwlock_read_lock(&vq->iotlb_lock);
+
+ TAILQ_FOREACH(node, &vq->iotlb_pending_list, next) {
+ if ((node->iova == iova) && (node->perm == perm)) {
+ found = 1;
+ break;
+ }
+ }
+
+ rte_rwlock_read_unlock(&vq->iotlb_lock);
+
+ return found;
+}
+
+void vhost_user_iotlb_pending_insert(struct vhost_virtqueue *vq,
+ uint64_t iova, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node;
+ int ret;
+
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool empty, invalidate cache\n");
I think the log level should be INFO or the like, not ERR.
Fixed.
Post by Tiwei Bie
Post by Maxime Coquelin
+ vhost_user_iotlb_pending_remove_all(vq);
+ ret = rte_mempool_get(vq->iotlb_pool, (void **)&node);
+ if (ret) {
+ RTE_LOG(ERR, VHOST_CONFIG, "IOTLB pool still empty, failure\n");
+ return;
+ }
+ }
+
+ node->iova = iova;
+ node->perm = perm;
+
+ rte_rwlock_write_lock(&vq->iotlb_lock);
+
+ TAILQ_INSERT_TAIL(&vq->iotlb_pending_list, node, next);
+
+ rte_rwlock_write_unlock(&vq->iotlb_lock);
+}
+
+static void vhost_user_iotlb_pending_remove(struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size, uint8_t perm)
+{
+ struct vhost_iotlb_entry *node, *temp_node;
+
+ /* .iotlb_lock already locked by the caller */
+ TAILQ_FOREACH_SAFE(node, &vq->iotlb_pending_list, next, temp_node) {
+ if (node->iova < iova)
+ continue;
+ if (node->iova >= iova + size)
+ continue;
+ if ((node->perm & perm) != node->perm)
+ continue;
+ TAILQ_REMOVE(&vq->iotlb_pending_list, node, next);
+ rte_mempool_put(vq->iotlb_pool, node);
+ }
+}
static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
{
@@ -106,7 +185,10 @@ void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
TAILQ_INSERT_TAIL(&vq->iotlb_list, new_node, next);
+ vhost_user_iotlb_pending_remove(vq, iova, size, perm);
+
rte_rwlock_write_unlock(&vq->iotlb_lock);
+
This empty line should be removed.
Yes, this part disappears in the next version, as I squashed patch 21
into patches 7 & 8.

Thanks,
Maxime
Post by Tiwei Bie
Best regards,
Tiwei Bie
Post by Maxime Coquelin
}
void vhost_user_iotlb_cache_remove(struct vhost_virtqueue *vq,
Maxime Coquelin
2017-08-31 09:50:11 UTC
Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost_user.c | 25 +++++++++++++++++++++++++
lib/librte_vhost/vhost_user.h | 3 +++
2 files changed, 28 insertions(+)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 7b3c2f32a..8b4a2b358 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -1131,3 +1131,28 @@ vhost_user_msg_handler(int vid, int fd)

return ret;
}
+
+int vhost_user_iotlb_miss(struct virtio_net *dev, uint64_t iova, uint8_t perm)
+{
+ int ret;
+ struct VhostUserMsg msg = {
+ .request = VHOST_USER_SLAVE_IOTLB_MSG,
+ .flags = VHOST_USER_VERSION,
+ .size = sizeof(msg.payload.iotlb),
+ .payload.iotlb = {
+ .iova = iova,
+ .perm = perm,
+ .type = VHOST_IOTLB_MISS,
+ },
+ };
+
+ ret = send_vhost_message(dev->slave_req_fd, &msg);
+ if (ret < 0) {
+ RTE_LOG(ERR, VHOST_CONFIG,
+ "Failed to send IOTLB miss message (%d)\n",
+ ret);
+ return ret;
+ }
+
+ return 0;
+}
diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_user.h
index 98f6e6f37..0b2aff14e 100644
--- a/lib/librte_vhost/vhost_user.h
+++ b/lib/librte_vhost/vhost_user.h
@@ -85,6 +85,7 @@ typedef enum VhostUserRequest {

typedef enum VhostUserSlaveRequest {
VHOST_USER_SLAVE_NONE = 0,
+ VHOST_USER_SLAVE_IOTLB_MSG = 1,
VHOST_USER_SLAVE_MAX
} VhostUserSlaveRequest;

@@ -122,6 +123,7 @@ typedef struct VhostUserMsg {
struct vhost_vring_addr addr;
VhostUserMemory memory;
VhostUserLog log;
+ struct vhost_iotlb_msg iotlb;
} payload;
int fds[VHOST_MEMORY_MAX_NREGIONS];
} __attribute((packed)) VhostUserMsg;
@@ -134,6 +136,7 @@ typedef struct VhostUserMsg {

/* vhost_user.c */
int vhost_user_msg_handler(int vid, int fd);
+int vhost_user_iotlb_miss(struct virtio_net *dev, uint64_t iova, uint8_t perm);

/* socket.c */
int read_fd_message(int sockfd, char *buf, int buflen, int *fds, int fd_num);
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:12 UTC
The per-virtqueue IOTLB cache init is done at virtqueue
init time. init_vring_queue() now takes the vring id as a parameter,
so that the IOTLB cache mempool name can be generated.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index f1099753c..bae98b02d 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -49,6 +49,7 @@
#include <rte_rwlock.h>
#include <rte_vhost.h>

+#include "iotlb.h"
#include "vhost.h"

struct vhost_device {
@@ -197,13 +198,16 @@ free_device(struct virtio_net *dev)
}

static void
-init_vring_queue(struct vhost_virtqueue *vq)
+init_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
{
+ struct vhost_virtqueue *vq = dev->virtqueue[vring_idx];
+
memset(vq, 0, sizeof(struct vhost_virtqueue));

vq->kickfd = VIRTIO_UNINITIALIZED_EVENTFD;
vq->callfd = VIRTIO_UNINITIALIZED_EVENTFD;

+ vhost_user_iotlb_init(dev, vring_idx);
/* Backends are set to -1 indicating an inactive device. */
vq->backend = -1;

@@ -217,12 +221,13 @@ init_vring_queue(struct vhost_virtqueue *vq)
}

static void
-reset_vring_queue(struct vhost_virtqueue *vq)
+reset_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
{
+ struct vhost_virtqueue *vq = dev->virtqueue[vring_idx];
int callfd;

callfd = vq->callfd;
- init_vring_queue(vq);
+ init_vring_queue(dev, vring_idx);
vq->callfd = callfd;
}

@@ -239,7 +244,7 @@ alloc_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
}

dev->virtqueue[vring_idx] = vq;
- init_vring_queue(vq);
+ init_vring_queue(dev, vring_idx);

dev->nr_vring += 1;

@@ -261,7 +266,7 @@ reset_device(struct virtio_net *dev)
dev->flags = 0;

for (i = 0; i < dev->nr_vring; i++)
- reset_vring_queue(dev->virtqueue[i]);
+ reset_vring_queue(dev, i);
}

/*
--
2.13.3
Remy Horton
2017-09-04 13:57:59 UTC
On 31/08/2017 10:50, Maxime Coquelin wrote:
[..]
Post by Maxime Coquelin
+reset_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
{
+ struct vhost_virtqueue *vq = dev->virtqueue[vring_idx];
int callfd;
Probably ought to have a bounds check on vring_idx..
Maxime Coquelin
2017-09-04 15:45:01 UTC
Hi Remy,
Post by Remy Horton
[..]
Post by Maxime Coquelin
+reset_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
{
+ struct vhost_virtqueue *vq = dev->virtqueue[vring_idx];
int callfd;
Probably ought to have a bounds check on vring_idx..
Indeed, you are right. I'll fix it in next round.

Thanks,
Maxime
Maxime Coquelin
2017-08-31 09:50:13 UTC
The vhost-user device IOTLB protocol extension introduces the
VHOST_USER_IOTLB message type. The associated payload is the
vhost_iotlb_msg struct defined in the kernel, which in this case can
be either an IOTLB update or an invalidate message.

On IOTLB update, the virtqueues get notified of a new entry.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost_user.c | 43 +++++++++++++++++++++++++++++++++++++++++++
lib/librte_vhost/vhost_user.h | 1 +
2 files changed, 44 insertions(+)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 8b4a2b358..1cabba73f 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -48,6 +48,7 @@
#include <rte_malloc.h>
#include <rte_log.h>

+#include "iotlb.h"
#include "vhost.h"
#include "vhost_user.h"

@@ -77,6 +78,7 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
[VHOST_USER_SEND_RARP] = "VHOST_USER_SEND_RARP",
[VHOST_USER_NET_SET_MTU] = "VHOST_USER_NET_SET_MTU",
[VHOST_USER_SET_SLAVE_REQ_FD] = "VHOST_USER_SET_SLAVE_REQ_FD",
+ [VHOST_USER_IOTLB_MSG] = "VHOST_USER_IOTLB_MSG",
};

static uint64_t
@@ -867,6 +869,43 @@ vhost_user_set_req_fd(struct virtio_net *dev, struct VhostUserMsg *msg)
return 0;
}

+static int
+vhost_user_iotlb_msg(struct virtio_net *dev, struct VhostUserMsg *msg)
+{
+ struct vhost_iotlb_msg *imsg = &msg->payload.iotlb;
+ uint16_t i;
+ uint64_t vva;
+
+ switch (imsg->type) {
+ case VHOST_IOTLB_UPDATE:
+ vva = qva_to_vva(dev, imsg->uaddr);
+ if (!vva)
+ return -1;
+
+ for (i = 0; i < dev->nr_vring; i++) {
+ struct vhost_virtqueue *vq = dev->virtqueue[i];
+
+ vhost_user_iotlb_cache_insert(vq, imsg->iova, vva,
+ imsg->size, imsg->perm);
+ }
+ break;
+ case VHOST_IOTLB_INVALIDATE:
+ for (i = 0; i < dev->nr_vring; i++) {
+ struct vhost_virtqueue *vq = dev->virtqueue[i];
+
+ vhost_user_iotlb_cache_remove(vq, imsg->iova,
+ imsg->size);
+ }
+ break;
+ default:
+ RTE_LOG(ERR, VHOST_CONFIG, "Invalid IOTLB message type (%d)\n",
+ imsg->type);
+ return -1;
+ }
+
+ return 0;
+}
+
/* return bytes# of read on success or negative val on failure. */
static int
read_vhost_message(int sockfd, struct VhostUserMsg *msg)
@@ -1100,6 +1139,10 @@ vhost_user_msg_handler(int vid, int fd)
ret = vhost_user_set_req_fd(dev, &msg);
break;

+ case VHOST_USER_IOTLB_MSG:
+ ret = vhost_user_iotlb_msg(dev, &msg);
+ break;
+
default:
ret = -1;
break;
diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_user.h
index 0b2aff14e..46c6ff956 100644
--- a/lib/librte_vhost/vhost_user.h
+++ b/lib/librte_vhost/vhost_user.h
@@ -80,6 +80,7 @@ typedef enum VhostUserRequest {
VHOST_USER_SEND_RARP = 19,
VHOST_USER_NET_SET_MTU = 20,
VHOST_USER_SET_SLAVE_REQ_FD = 21,
+ VHOST_USER_IOTLB_MSG = 22,
VHOST_USER_MAX
} VhostUserRequest;
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:14 UTC
This patch introduces the vhost_iova_to_vva() function to translate
guest IO virtual addresses to backend virtual addresses.

When IOMMU is enabled, the IOTLB cache is queried to get the
translation. If missing from the IOTLB cache, an IOTLB_MISS request
is sent to Qemu, and IOTLB cache is queried again on IOTLB event
notification.

When IOMMU is disabled, the passed address is a guest's physical
address, so the legacy rte_vhost_gpa_to_vva() API is used.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.c | 27 +++++++++++++++++++++++++++
lib/librte_vhost/vhost.h | 3 +++
2 files changed, 30 insertions(+)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index bae98b02d..0e8c0386a 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -48,9 +48,11 @@
#include <rte_malloc.h>
#include <rte_rwlock.h>
#include <rte_vhost.h>
+#include <rte_rwlock.h>

#include "iotlb.h"
#include "vhost.h"
+#include "vhost_user.h"

struct vhost_device {
struct virtio_net *dev;
@@ -60,6 +62,31 @@ struct vhost_device {
/* Declared as static so that .lock is initialized */
static struct vhost_device vhost_devices[MAX_VHOST_DEVICE];

+uint64_t vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size, uint8_t perm)
+{
+ uint64_t vva, tmp_size;
+
+ if (unlikely(!size))
+ return 0;
+
+ if (!(dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
+ return rte_vhost_gpa_to_vva(dev->mem, iova);
+
+ tmp_size = size;
+
+ vva = vhost_user_iotlb_cache_find(vq, iova, &tmp_size, perm);
+ if (tmp_size == size)
+ return vva;
+
+ if (!vhost_user_iotlb_pending_miss(vq, iova + tmp_size, perm)) {
+ vhost_user_iotlb_pending_insert(vq, iova + tmp_size, perm);
+ vhost_user_iotlb_miss(dev, iova + tmp_size, perm);
+ }
+
+ return 0;
+}
+
static inline struct virtio_net *
__get_device(int vid)
{
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index a41bacea7..2653ec123 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -352,4 +352,7 @@ struct vhost_device_ops const *vhost_driver_callback_get(const char *path);
*/
void vhost_backend_cleanup(struct virtio_net *dev);

+uint64_t vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t iova, uint64_t size, uint8_t perm);
+
#endif /* _VHOST_NET_CDEV_H_ */
--
2.13.3
Tiwei Bie
2017-09-05 04:14:36 UTC
Post by Maxime Coquelin
This patch introduces the vhost_iova_to_vva() function to translate
guest IO virtual addresses to backend virtual addresses.
When IOMMU is enabled, the IOTLB cache is queried to get the
translation. If missing from the IOTLB cache, an IOTLB_MISS request
is sent to Qemu, and IOTLB cache is queried again on IOTLB event
notification.
When IOMMU is disabled, the passed address is a guest's physical
address, so the legacy rte_vhost_gpa_to_vva() API is used.
---
lib/librte_vhost/vhost.c | 27 +++++++++++++++++++++++++++
lib/librte_vhost/vhost.h | 3 +++
2 files changed, 30 insertions(+)
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index bae98b02d..0e8c0386a 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -48,9 +48,11 @@
#include <rte_malloc.h>
#include <rte_rwlock.h>
#include <rte_vhost.h>
+#include <rte_rwlock.h>
This header isn't needed.

Best regards,
Tiwei Bie
Maxime Coquelin
2017-09-05 07:05:40 UTC
Post by Tiwei Bie
Post by Maxime Coquelin
This patch introduces vhost_iova_to_vva() function to translate
guest's IO virtual addresses to backend's virtual addresses.
When IOMMU is enabled, the IOTLB cache is queried to get the
translation. If missing from the IOTLB cache, an IOTLB_MISS request
is sent to Qemu, and IOTLB cache is queried again on IOTLB event
notification.
When IOMMU is disabled, the passed address is a guest's physical
address, so the legacy rte_vhost_gpa_to_vva() API is used.
---
lib/librte_vhost/vhost.c | 27 +++++++++++++++++++++++++++
lib/librte_vhost/vhost.h | 3 +++
2 files changed, 30 insertions(+)
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index bae98b02d..0e8c0386a 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -48,9 +48,11 @@
#include <rte_malloc.h>
#include <rte_rwlock.h>
#include <rte_vhost.h>
+#include <rte_rwlock.h>
This header isn't needed.
Right, I'll remove it.

Thanks,
Maxime
Post by Tiwei Bie
Best regards,
Tiwei Bie
Maxime Coquelin
2017-08-31 09:50:15 UTC
Replace rte_vhost_gpa_to_vva() calls with vhost_iova_to_vva(), which
also requires passing the mapped length and the needed access permissions.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/virtio_net.c | 71 +++++++++++++++++++++++++++++--------------
1 file changed, 49 insertions(+), 22 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 04255dc85..18531c97d 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -168,8 +168,9 @@ virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr *net_hdr)
}

static __rte_always_inline int
-copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
- struct rte_mbuf *m, uint16_t desc_idx, uint32_t size)
+copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ struct vring_desc *descs, struct rte_mbuf *m,
+ uint16_t desc_idx, uint32_t size)
{
uint32_t desc_avail, desc_offset;
uint32_t mbuf_avail, mbuf_offset;
@@ -180,7 +181,8 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
uint16_t nr_desc = 1;

desc = &descs[desc_idx];
- desc_addr = rte_vhost_gpa_to_vva(dev->mem, desc->addr);
+ desc_addr = vhost_iova_to_vva(dev, vq, desc->addr,
+ desc->len, VHOST_ACCESS_RW);
/*
* Checking of 'desc_addr' placed outside of 'unlikely' macro to avoid
* performance issue with some versions of gcc (4.8.4 and 5.3.0) which
@@ -219,7 +221,9 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
return -1;

desc = &descs[desc->next];
- desc_addr = rte_vhost_gpa_to_vva(dev->mem, desc->addr);
+ desc_addr = vhost_iova_to_vva(dev, vq, desc->addr,
+ desc->len,
+ VHOST_ACCESS_RW);
if (unlikely(!desc_addr))
return -1;

@@ -304,8 +308,10 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,

if (vq->desc[desc_idx].flags & VRING_DESC_F_INDIRECT) {
descs = (struct vring_desc *)(uintptr_t)
- rte_vhost_gpa_to_vva(dev->mem,
- vq->desc[desc_idx].addr);
+ vhost_iova_to_vva(dev,
+ vq, vq->desc[desc_idx].addr,
+ vq->desc[desc_idx].len,
+ VHOST_ACCESS_RO);
if (unlikely(!descs)) {
count = i;
break;
@@ -318,7 +324,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
sz = vq->size;
}

- err = copy_mbuf_to_desc(dev, descs, pkts[i], desc_idx, sz);
+ err = copy_mbuf_to_desc(dev, vq, descs, pkts[i], desc_idx, sz);
if (unlikely(err)) {
count = i;
break;
@@ -361,7 +367,9 @@ fill_vec_buf(struct virtio_net *dev, struct vhost_virtqueue *vq,

if (vq->desc[idx].flags & VRING_DESC_F_INDIRECT) {
descs = (struct vring_desc *)(uintptr_t)
- rte_vhost_gpa_to_vva(dev->mem, vq->desc[idx].addr);
+ vhost_iova_to_vva(dev, vq, vq->desc[idx].addr,
+ vq->desc[idx].len,
+ VHOST_ACCESS_RO);
if (unlikely(!descs))
return -1;

@@ -436,8 +444,9 @@ reserve_avail_buf_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
}

static __rte_always_inline int
-copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
- struct buf_vector *buf_vec, uint16_t num_buffers)
+copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ struct rte_mbuf *m, struct buf_vector *buf_vec,
+ uint16_t num_buffers)
{
uint32_t vec_idx = 0;
uint64_t desc_addr;
@@ -450,7 +459,9 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
if (unlikely(m == NULL))
return -1;

- desc_addr = rte_vhost_gpa_to_vva(dev->mem, buf_vec[vec_idx].buf_addr);
+ desc_addr = vhost_iova_to_vva(dev, vq, buf_vec[vec_idx].buf_addr,
+ buf_vec[vec_idx].buf_len,
+ VHOST_ACCESS_RW);
if (buf_vec[vec_idx].buf_len < dev->vhost_hlen || !desc_addr)
return -1;

@@ -471,8 +482,11 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
/* done with current desc buf, get the next one */
if (desc_avail == 0) {
vec_idx++;
- desc_addr = rte_vhost_gpa_to_vva(dev->mem,
- buf_vec[vec_idx].buf_addr);
+ desc_addr =
+ vhost_iova_to_vva(dev, vq,
+ buf_vec[vec_idx].buf_addr,
+ buf_vec[vec_idx].buf_len,
+ VHOST_ACCESS_RW);
if (unlikely(!desc_addr))
return -1;

@@ -569,7 +583,7 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
dev->vid, vq->last_avail_idx,
vq->last_avail_idx + num_buffers);

- if (copy_mbuf_to_desc_mergeable(dev, pkts[pkt_idx],
+ if (copy_mbuf_to_desc_mergeable(dev, vq, pkts[pkt_idx],
buf_vec, num_buffers) < 0) {
vq->shadow_used_idx -= num_buffers;
break;
@@ -768,8 +782,9 @@ put_zmbuf(struct zcopy_mbuf *zmbuf)
}

static __rte_always_inline int
-copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
- uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
+copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ struct vring_desc *descs, uint16_t max_desc,
+ struct rte_mbuf *m, uint16_t desc_idx,
struct rte_mempool *mbuf_pool)
{
struct vring_desc *desc;
@@ -787,7 +802,10 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
(desc->flags & VRING_DESC_F_INDIRECT))
return -1;

- desc_addr = rte_vhost_gpa_to_vva(dev->mem, desc->addr);
+ desc_addr = vhost_iova_to_vva(dev,
+ vq, desc->addr,
+ desc->len,
+ VHOST_ACCESS_RO);
if (unlikely(!desc_addr))
return -1;

@@ -807,7 +825,10 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
return -1;

- desc_addr = rte_vhost_gpa_to_vva(dev->mem, desc->addr);
+ desc_addr = vhost_iova_to_vva(dev,
+ vq, desc->addr,
+ desc->len,
+ VHOST_ACCESS_RO);
if (unlikely(!desc_addr))
return -1;

@@ -871,7 +892,10 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
return -1;

- desc_addr = rte_vhost_gpa_to_vva(dev->mem, desc->addr);
+ desc_addr = vhost_iova_to_vva(dev,
+ vq, desc->addr,
+ desc->len,
+ VHOST_ACCESS_RO);
if (unlikely(!desc_addr))
return -1;

@@ -1117,8 +1141,10 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,

if (vq->desc[desc_indexes[i]].flags & VRING_DESC_F_INDIRECT) {
desc = (struct vring_desc *)(uintptr_t)
- rte_vhost_gpa_to_vva(dev->mem,
- vq->desc[desc_indexes[i]].addr);
+ vhost_iova_to_vva(dev, vq,
+ vq->desc[desc_indexes[i]].addr,
+ sizeof(*desc),
+ VHOST_ACCESS_RO);
if (unlikely(!desc))
break;

@@ -1138,7 +1164,8 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
break;
}

- err = copy_desc_to_mbuf(dev, desc, sz, pkts[i], idx, mbuf_pool);
+ err = copy_desc_to_mbuf(dev, vq, desc, sz,
+ pkts[i], idx, mbuf_pool);
if (unlikely(err)) {
rte_pktmbuf_free(pkts[i]);
break;
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:16 UTC
When VHOST_USER_F_PROTOCOL_FEATURES is negotiated, the ring is not
enabled when started, but enabled through dedicated
VHOST_USER_SET_VRING_ENABLE request.

When not negotiated, the ring is started in enabled state, at
VHOST_USER_SET_VRING_KICK request time.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.c | 6 ------
lib/librte_vhost/vhost_user.c | 9 +++++++++
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 0e8c0386a..ccfc17022 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -238,12 +238,6 @@ init_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
/* Backends are set to -1 indicating an inactive device. */
vq->backend = -1;

- /*
- * always set the vq to enabled; this is to keep compatibility
- * with the old QEMU, whereas there is no SET_VRING_ENABLE message.
- */
- vq->enabled = 1;
-
TAILQ_INIT(&vq->zmbuf_list);
}

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 1cabba73f..4f9273fe7 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -660,6 +660,15 @@ vhost_user_set_vring_kick(struct virtio_net *dev, struct VhostUserMsg *pmsg)
"vring kick idx:%d file:%d\n", file.index, file.fd);

vq = dev->virtqueue[file.index];
+
+ /*
+ * When VHOST_USER_F_PROTOCOL_FEATURES is not negotiated,
+ * the ring starts already enabled. Otherwise, it is enabled via
+ * the SET_VRING_ENABLE message.
+ */
+ if (!(dev->features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)))
+ vq->enabled = 1;
+
if (vq->kickfd >= 0)
close(vq->kickfd);
vq->kickfd = file.fd;
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:17 UTC
numa_realloc() reallocates the virtio_net device structure and
updates the vhost_devices[] table with the new pointer if the rings
are allocated on a different NUMA node.

Problem is that vhost_user_msg_handler() still derenferences old
pointer afterward.

This patch prevents this by fetching the dev pointer again from
vhost_devices[] after the messages have been handled.

Cc: ***@dpdk.org
Fixes: af295ad4698c ("vhost: realloc device and queues to same numa node as vring desc")
Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost_user.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 4f9273fe7..a0c7c2f86 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -1158,6 +1158,12 @@ vhost_user_msg_handler(int vid, int fd)

}

+ /*
+ * The virtio_net struct might have been reallocated on a different
+ * NUMA node, so dev pointer might no more be valid.
+ */
+ dev = get_device(vid);
+
if (msg.flags & VHOST_USER_NEED_REPLY) {
msg.payload.u64 = !!ret;
msg.size = sizeof(msg.payload.u64);
--
2.13.3
Remy Horton
2017-09-04 13:58:05 UTC
On 31/08/2017 10:50, Maxime Coquelin wrote:
[..]
Post by Maxime Coquelin
Problem is that vhost_user_msg_handler() still derenferences old
pointer afterward.
Spelling.. :)

Code itself looks OK.
Maxime Coquelin
2017-08-31 09:50:18 UTC
This patch postpones ring address translations and checks, as
addresses sent by the master should not be interpreted as long as
the ring is not started and enabled[0].

When protocol features aren't negotiated, the ring is started in
enabled state, so the address translations are postponed to
vhost_user_set_vring_kick().
Otherwise, they are postponed to when the ring is enabled, in
vhost_user_set_vring_enable().

[0]: http://lists.nongnu.org/archive/html/qemu-devel/2017-05/msg04355.html

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.h | 1 +
lib/librte_vhost/vhost_user.c | 74 +++++++++++++++++++++++++++++++++----------
2 files changed, 58 insertions(+), 17 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2653ec123..c90e73268 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -115,6 +115,7 @@ struct vhost_virtqueue {

struct vring_used_elem *shadow_used_ring;
uint16_t shadow_used_idx;
+ struct vhost_vring_addr ring_addrs;

rte_rwlock_t iotlb_lock;
struct rte_mempool *iotlb_pool;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index a0c7c2f86..a2e030308 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -313,10 +313,10 @@ qva_to_vva(struct virtio_net *dev, uint64_t qva)
* This function then converts these to our address space.
*/
static int
-vhost_user_set_vring_addr(struct virtio_net **pdev, VhostUserMsg *msg)
+vhost_user_set_vring_addr(struct virtio_net *dev, VhostUserMsg *msg)
{
struct vhost_virtqueue *vq;
- struct virtio_net *dev = *pdev;
+ struct vhost_vring_addr *addr = &msg->payload.addr;

if (dev->mem == NULL)
return -1;
@@ -324,35 +324,50 @@ vhost_user_set_vring_addr(struct virtio_net **pdev, VhostUserMsg *msg)
/* addr->index refers to the queue index. The txq 1, rxq is 0. */
vq = dev->virtqueue[msg->payload.addr.index];

+ /*
+ * Rings addresses should not be interpreted as long as the ring is not
+ * started and enabled
+ */
+ memcpy(&vq->ring_addrs, addr, sizeof(*addr));
+
+ return 0;
+}
+
+static struct virtio_net *translate_ring_addresses(struct virtio_net *dev,
+ int vq_index)
+{
+ struct vhost_virtqueue *vq = dev->virtqueue[vq_index];
+ struct vhost_vring_addr *addr = &vq->ring_addrs;
+
/* The addresses are converted from QEMU virtual to Vhost virtual. */
vq->desc = (struct vring_desc *)(uintptr_t)qva_to_vva(dev,
- msg->payload.addr.desc_user_addr);
+ addr->desc_user_addr);
if (vq->desc == 0) {
RTE_LOG(ERR, VHOST_CONFIG,
"(%d) failed to find desc ring address.\n",
dev->vid);
- return -1;
+ return NULL;
}

- *pdev = dev = numa_realloc(dev, msg->payload.addr.index);
- vq = dev->virtqueue[msg->payload.addr.index];
+ dev = numa_realloc(dev, vq_index);
+ vq = dev->virtqueue[vq_index];

vq->avail = (struct vring_avail *)(uintptr_t)qva_to_vva(dev,
- msg->payload.addr.avail_user_addr);
+ addr->avail_user_addr);
if (vq->avail == 0) {
RTE_LOG(ERR, VHOST_CONFIG,
"(%d) failed to find avail ring address.\n",
dev->vid);
- return -1;
+ return NULL;
}

vq->used = (struct vring_used *)(uintptr_t)qva_to_vva(dev,
- msg->payload.addr.used_user_addr);
+ addr->used_user_addr);
if (vq->used == 0) {
RTE_LOG(ERR, VHOST_CONFIG,
"(%d) failed to find used ring address.\n",
dev->vid);
- return -1;
+ return NULL;
}

if (vq->last_used_idx != vq->used->idx) {
@@ -364,7 +379,7 @@ vhost_user_set_vring_addr(struct virtio_net **pdev, VhostUserMsg *msg)
vq->last_avail_idx = vq->used->idx;
}

- vq->log_guest_addr = msg->payload.addr.log_guest_addr;
+ vq->log_guest_addr = addr->log_guest_addr;

LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address desc: %p\n",
dev->vid, vq->desc);
@@ -375,7 +390,7 @@ vhost_user_set_vring_addr(struct virtio_net **pdev, VhostUserMsg *msg)
LOG_DEBUG(VHOST_CONFIG, "(%d) log_guest_addr: %" PRIx64 "\n",
dev->vid, vq->log_guest_addr);

- return 0;
+ return dev;
}

/*
@@ -646,10 +661,11 @@ vhost_user_set_vring_call(struct virtio_net *dev, struct VhostUserMsg *pmsg)
}

static void
-vhost_user_set_vring_kick(struct virtio_net *dev, struct VhostUserMsg *pmsg)
+vhost_user_set_vring_kick(struct virtio_net **pdev, struct VhostUserMsg *pmsg)
{
struct vhost_vring_file file;
struct vhost_virtqueue *vq;
+ struct virtio_net *dev = *pdev;

file.index = pmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
if (pmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK)
@@ -659,6 +675,16 @@ vhost_user_set_vring_kick(struct virtio_net *dev, struct VhostUserMsg *pmsg)
RTE_LOG(INFO, VHOST_CONFIG,
"vring kick idx:%d file:%d\n", file.index, file.fd);

+ /*
+ * Interpret ring addresses only when ring is started and enabled.
+ * This is now if protocol features aren't supported.
+ */
+ if (!(dev->features & (1ULL << VHOST_USER_F_PROTOCOL_FEATURES))) {
+ *pdev = dev = translate_ring_addresses(dev, file.index);
+ if (!dev)
+ return;
+ }
+
vq = dev->virtqueue[file.index];

/*
@@ -736,15 +762,29 @@ vhost_user_get_vring_base(struct virtio_net *dev,
* enable the virtio queue pair.
*/
static int
-vhost_user_set_vring_enable(struct virtio_net *dev,
+vhost_user_set_vring_enable(struct virtio_net **pdev,
VhostUserMsg *msg)
{
+ struct virtio_net *dev = *pdev;
int enable = (int)msg->payload.state.num;

RTE_LOG(INFO, VHOST_CONFIG,
"set queue enable: %d to qp idx: %d\n",
enable, msg->payload.state.index);

+ /*
+ * Interpret ring addresses only when ring is started and enabled.
+ * This is now if protocol features are supported.
+ */
+ if (enable && (dev->features &
+ (1ULL << VHOST_USER_F_PROTOCOL_FEATURES))) {
+ dev = translate_ring_addresses(dev, msg->payload.state.index);
+ if (!dev)
+ return -1;
+
+ *pdev = dev;
+ }
+
if (dev->notify_ops->vring_state_changed)
dev->notify_ops->vring_state_changed(dev->vid,
msg->payload.state.index, enable);
@@ -1102,7 +1142,7 @@ vhost_user_msg_handler(int vid, int fd)
vhost_user_set_vring_num(dev, &msg);
break;
case VHOST_USER_SET_VRING_ADDR:
- vhost_user_set_vring_addr(&dev, &msg);
+ vhost_user_set_vring_addr(dev, &msg);
break;
case VHOST_USER_SET_VRING_BASE:
vhost_user_set_vring_base(dev, &msg);
@@ -1115,7 +1155,7 @@ vhost_user_msg_handler(int vid, int fd)
break;

case VHOST_USER_SET_VRING_KICK:
- vhost_user_set_vring_kick(dev, &msg);
+ vhost_user_set_vring_kick(&dev, &msg);
break;
case VHOST_USER_SET_VRING_CALL:
vhost_user_set_vring_call(dev, &msg);
@@ -1134,7 +1174,7 @@ vhost_user_msg_handler(int vid, int fd)
break;

case VHOST_USER_SET_VRING_ENABLE:
- vhost_user_set_vring_enable(dev, &msg);
+ vhost_user_set_vring_enable(&dev, &msg);
break;
case VHOST_USER_SEND_RARP:
vhost_user_send_rarp(dev, &msg);
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:19 UTC
When IOMMU is enabled, the ring addresses set by the
VHOST_USER_SET_VRING_ADDR requests are guest IO virtual addresses,
whereas they are QEMU virtual addresses when IOMMU is disabled.

When IOMMU is enabled and the required translation is not in the
IOTLB cache, an IOTLB miss request is sent; but as it runs in the
vhost-user socket handling thread, the function does not wait for
the requested IOTLB update.

The function will be called again on the next IOTLB update message
reception, if it matches the vring addresses.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost_user.c | 44 +++++++++++++++++++++++++++++++++----------
1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index a2e030308..a81b338ad 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -284,10 +284,7 @@ numa_realloc(struct virtio_net *dev, int index __rte_unused)
}
#endif

-/*
- * Converts QEMU virtual address to Vhost virtual address. This function is
- * used to convert the ring addresses to our address space.
- */
+/* Converts QEMU virtual address to Vhost virtual address. */
static uint64_t
qva_to_vva(struct virtio_net *dev, uint64_t qva)
{
@@ -308,6 +305,30 @@ qva_to_vva(struct virtio_net *dev, uint64_t qva)
return 0;
}

+
+/*
+ * Converts ring address to Vhost virtual address.
+ * If IOMMU is enabled, the ring address is a guest IO virtual address,
+ * else it is a QEMU virtual address.
+ */
+static uint64_t
+ring_addr_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t ra, uint64_t size)
+{
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+ uint64_t vva;
+
+ vva = vhost_user_iotlb_cache_find(vq, ra,
+ &size, VHOST_ACCESS_RW);
+ if (!vva)
+ vhost_user_iotlb_miss(dev, ra, VHOST_ACCESS_RW);
+
+ return vva;
+ }
+
+ return qva_to_vva(dev, ra);
+}
+
/*
* The virtio device sends us the desc, used and avail ring addresses.
* This function then converts these to our address space.
@@ -340,8 +361,11 @@ static struct virtio_net *translate_ring_addresses(struct virtio_net *dev,
struct vhost_vring_addr *addr = &vq->ring_addrs;

/* The addresses are converted from QEMU virtual to Vhost virtual. */
- vq->desc = (struct vring_desc *)(uintptr_t)qva_to_vva(dev,
- addr->desc_user_addr);
+ if (vq->desc && vq->avail && vq->used)
+ return dev;
+
+ vq->desc = (struct vring_desc *)(uintptr_t)ring_addr_to_vva(dev,
+ vq, addr->desc_user_addr, sizeof(struct vring_desc));
if (vq->desc == 0) {
RTE_LOG(ERR, VHOST_CONFIG,
"(%d) failed to find desc ring address.\n",
@@ -352,8 +376,8 @@ static struct virtio_net *translate_ring_addresses(struct virtio_net *dev,
dev = numa_realloc(dev, vq_index);
vq = dev->virtqueue[vq_index];

- vq->avail = (struct vring_avail *)(uintptr_t)qva_to_vva(dev,
- addr->avail_user_addr);
+ vq->avail = (struct vring_avail *)(uintptr_t)ring_addr_to_vva(dev,
+ vq, addr->avail_user_addr, sizeof(struct vring_avail));
if (vq->avail == 0) {
RTE_LOG(ERR, VHOST_CONFIG,
"(%d) failed to find avail ring address.\n",
@@ -361,8 +385,8 @@ static struct virtio_net *translate_ring_addresses(struct virtio_net *dev,
return NULL;
}

- vq->used = (struct vring_used *)(uintptr_t)qva_to_vva(dev,
- addr->used_user_addr);
+ vq->used = (struct vring_used *)(uintptr_t)ring_addr_to_vva(dev,
+ vq, addr->used_user_addr, sizeof(struct vring_used));
if (vq->used == 0) {
RTE_LOG(ERR, VHOST_CONFIG,
"(%d) failed to find used ring address.\n",
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:20 UTC
Translating the start addresses of the rings is not enough, we need
to be sure the whole rings are made available by the guest.

This depends on the size of the rings, which is not known on
SET_VRING_ADDR reception. Furthermore, we need to be safe against
vring page invalidations.

This patch introduces a new access_ok flag per virtqueue, which is set
when all the rings are mapped, and cleared as soon as a page used by a
ring is invalidated. The invalidation part is implemented in a following
patch.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.c | 37 ++++++++++++++++++++++++++
lib/librte_vhost/vhost.h | 2 ++
lib/librte_vhost/vhost_user.c | 62 +++++++++++++++++++++++++++++++------------
lib/librte_vhost/virtio_net.c | 12 +++++++++
4 files changed, 96 insertions(+), 17 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index ccfc17022..ed9292313 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -224,6 +224,43 @@ free_device(struct virtio_net *dev)
rte_free(dev);
}

+int
+vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq)
+{
+ uint64_t size;
+
+ if (!(dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
+ goto out;
+
+ size = sizeof(struct vring_desc) * vq->size;
+ vq->desc = (struct vring_desc *)vhost_iova_to_vva(dev, vq,
+ vq->ring_addrs.desc_user_addr,
+ size, VHOST_ACCESS_RW);
+ if (!vq->desc)
+ return -1;
+
+ size = sizeof(struct vring_avail);
+ size += sizeof(uint16_t) * vq->size;
+ vq->avail = (struct vring_avail *)vhost_iova_to_vva(dev, vq,
+ vq->ring_addrs.avail_user_addr,
+ size, VHOST_ACCESS_RW);
+ if (!vq->avail)
+ return -1;
+
+ size = sizeof(struct vring_used);
+ size += sizeof(struct vring_used_elem) * vq->size;
+ vq->used = (struct vring_used *)vhost_iova_to_vva(dev, vq,
+ vq->ring_addrs.used_user_addr,
+ size, VHOST_ACCESS_RW);
+ if (!vq->used)
+ return -1;
+
+out:
+ vq->access_ok = 1;
+
+ return 0;
+}
+
static void
init_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
{
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index c90e73268..f84718ed1 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -103,6 +103,7 @@ struct vhost_virtqueue {
/* Currently unused as polling mode is enabled */
int kickfd;
int enabled;
+ int access_ok;

/* Physical address of used ring, for logging */
uint64_t log_guest_addr;
@@ -355,5 +356,6 @@ void vhost_backend_cleanup(struct virtio_net *dev);

uint64_t vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
uint64_t iova, uint64_t size, uint8_t perm);
+int vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq);

#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index a81b338ad..650981e2d 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -351,6 +351,12 @@ vhost_user_set_vring_addr(struct virtio_net *dev, VhostUserMsg *msg)
*/
memcpy(&vq->ring_addrs, addr, sizeof(*addr));

+ vq->desc = NULL;
+ vq->avail = NULL;
+ vq->used = NULL;
+
+ vq->access_ok = 0;
+
return 0;
}

@@ -367,10 +373,10 @@ static struct virtio_net *translate_ring_addresses(struct virtio_net *dev,
vq->desc = (struct vring_desc *)(uintptr_t)ring_addr_to_vva(dev,
vq, addr->desc_user_addr, sizeof(struct vring_desc));
if (vq->desc == 0) {
- RTE_LOG(ERR, VHOST_CONFIG,
+ RTE_LOG(DEBUG, VHOST_CONFIG,
"(%d) failed to find desc ring address.\n",
dev->vid);
- return NULL;
+ return dev;
}

dev = numa_realloc(dev, vq_index);
@@ -379,19 +385,19 @@ static struct virtio_net *translate_ring_addresses(struct virtio_net *dev,
vq->avail = (struct vring_avail *)(uintptr_t)ring_addr_to_vva(dev,
vq, addr->avail_user_addr, sizeof(struct vring_avail));
if (vq->avail == 0) {
- RTE_LOG(ERR, VHOST_CONFIG,
+ RTE_LOG(DEBUG, VHOST_CONFIG,
"(%d) failed to find avail ring address.\n",
dev->vid);
- return NULL;
+ return dev;
}

vq->used = (struct vring_used *)(uintptr_t)ring_addr_to_vva(dev,
vq, addr->used_user_addr, sizeof(struct vring_used));
if (vq->used == 0) {
- RTE_LOG(ERR, VHOST_CONFIG,
+ RTE_LOG(DEBUG, VHOST_CONFIG,
"(%d) failed to find used ring address.\n",
dev->vid);
- return NULL;
+ return dev;
}

if (vq->last_used_idx != vq->used->idx) {
@@ -637,7 +643,7 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
static int
vq_is_ready(struct vhost_virtqueue *vq)
{
- return vq && vq->desc &&
+ return vq && vq->desc && vq->avail && vq->used &&
vq->kickfd != VIRTIO_UNINITIALIZED_EVENTFD &&
vq->callfd != VIRTIO_UNINITIALIZED_EVENTFD;
}
@@ -943,8 +949,29 @@ vhost_user_set_req_fd(struct virtio_net *dev, struct VhostUserMsg *msg)
}

static int
-vhost_user_iotlb_msg(struct virtio_net *dev, struct VhostUserMsg *msg)
+is_vring_iotlb_update(struct vhost_virtqueue *vq, struct vhost_iotlb_msg *imsg)
{
+ struct vhost_vring_addr *ra;
+ uint64_t start, end;
+
+ start = imsg->iova;
+ end = start + imsg->size;
+
+ ra = &vq->ring_addrs;
+ if (ra->desc_user_addr >= start && ra->desc_user_addr < end)
+ return 1;
+ if (ra->avail_user_addr >= start && ra->avail_user_addr < end)
+ return 1;
+ if (ra->used_user_addr >= start && ra->used_user_addr < end)
+ return 1;
+
+ return -1;
+}
+
+static int
+vhost_user_iotlb_msg(struct virtio_net **pdev, struct VhostUserMsg *msg)
+{
+ struct virtio_net *dev = *pdev;
struct vhost_iotlb_msg *imsg = &msg->payload.iotlb;
uint16_t i;
uint64_t vva;
@@ -960,6 +987,9 @@ vhost_user_iotlb_msg(struct virtio_net *dev, struct VhostUserMsg *msg)

vhost_user_iotlb_cache_insert(vq, imsg->iova, vva,
imsg->size, imsg->perm);
+
+ if (is_vring_iotlb_update(vq, imsg))
+ *pdev = dev = translate_ring_addresses(dev, i);
}
break;
case VHOST_IOTLB_INVALIDATE:
@@ -1109,8 +1139,12 @@ vhost_user_msg_handler(int vid, int fd)
goto out;
}

- RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
- vhost_message_str[msg.request]);
+ if (msg.request != VHOST_USER_IOTLB_MSG)
+ RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
+ vhost_message_str[msg.request]);
+ else
+ RTE_LOG(DEBUG, VHOST_CONFIG, "read message %s\n",
+ vhost_message_str[msg.request]);

ret = vhost_user_check_and_alloc_queue_pair(dev, &msg);
if (ret < 0) {
@@ -1213,7 +1247,7 @@ vhost_user_msg_handler(int vid, int fd)
break;

case VHOST_USER_IOTLB_MSG:
- ret = vhost_user_iotlb_msg(dev, &msg);
+ ret = vhost_user_iotlb_msg(&dev, &msg);
break;

default:
@@ -1222,12 +1256,6 @@ vhost_user_msg_handler(int vid, int fd)

}

- /*
- * The virtio_net struct might have been reallocated on a different
- * NUMA node, so dev pointer might no more be valid.
- */
- dev = get_device(vid);
-
if (msg.flags & VHOST_USER_NEED_REPLY) {
msg.payload.u64 = !!ret;
msg.size = sizeof(msg.payload.u64);
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 18531c97d..1bd21330e 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -277,6 +277,10 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
if (unlikely(vq->enabled == 0))
return 0;

+ if (unlikely(vq->access_ok == 0))
+ if (unlikely(vring_translate(dev, vq) < 0))
+ return 0;
+
avail_idx = *((volatile uint16_t *)&vq->avail->idx);
start_idx = vq->last_used_idx;
free_entries = avail_idx - start_idx;
@@ -558,6 +562,10 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
if (unlikely(vq->enabled == 0))
return 0;

+ if (unlikely(vq->access_ok == 0))
+ if (unlikely(vring_translate(dev, vq) < 0))
+ return 0;
+
count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
if (count == 0)
return 0;
@@ -1042,6 +1050,10 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
if (unlikely(vq->enabled == 0))
goto out;

+ if (unlikely(vq->access_ok == 0))
+ if (unlikely(vring_translate(dev, vq) < 0))
+ goto out;
+
if (unlikely(dev->dequeue_zero_copy)) {
struct zcopy_mbuf *zmbuf, *next;
int nr_updated = 0;
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:21 UTC
As soon as a page used by a ring is invalidated, the access_ok flag
is cleared, so that processing threads try to map the rings again.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.c | 17 +++++++++++++++++
lib/librte_vhost/vhost.h | 1 +
lib/librte_vhost/vhost_user.c | 38 +++++++++++++++++++++++++++++++++-----
3 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index ed9292313..4bf22d6e2 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -261,6 +261,23 @@ vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq)
return 0;
}

+void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue *vq)
+{
+ int vid = dev->vid;
+
+ /* Ensure no thread is using the vrings */
+ put_device(vid);
+ dev = get_device_wr(vid);
+
+ vq->access_ok = 0;
+ vq->desc = NULL;
+ vq->avail = NULL;
+ vq->used = NULL;
+
+ put_device_wr(vid);
+ get_device(vid);
+}
+
static void
init_vring_queue(struct virtio_net *dev, uint32_t vring_idx)
{
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index f84718ed1..98f9ccaf0 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -357,5 +357,6 @@ void vhost_backend_cleanup(struct virtio_net *dev);
uint64_t vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
uint64_t iova, uint64_t size, uint8_t perm);
int vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq);
+void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue *vq);

#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 650981e2d..4df24ef82 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -351,11 +351,7 @@ vhost_user_set_vring_addr(struct virtio_net *dev, VhostUserMsg *msg)
*/
memcpy(&vq->ring_addrs, addr, sizeof(*addr));

- vq->desc = NULL;
- vq->avail = NULL;
- vq->used = NULL;
-
- vq->access_ok = 0;
+ vring_invalidate(dev, vq);

return 0;
}
@@ -969,6 +965,35 @@ is_vring_iotlb_update(struct vhost_virtqueue *vq, struct vhost_iotlb_msg *imsg)
}

static int
+is_vring_iotlb_invalidate(struct vhost_virtqueue *vq,
+ struct vhost_iotlb_msg *imsg)
+{
+ uint64_t istart, iend, vstart, vend;
+
+ istart = imsg->iova;
+ iend = istart + imsg->size - 1;
+
+ vstart = (uint64_t)vq->desc;
+ vend = vstart + sizeof(struct vring_desc) * vq->size - 1;
+ if (vstart <= iend && istart <= vend)
+ return 1;
+
+ vstart = (uint64_t)vq->avail;
+ vend = vstart + sizeof(struct vring_avail);
+ vend += sizeof(uint16_t) * vq->size - 1;
+ if (vstart <= iend && istart <= vend)
+ return 1;
+
+ vstart = (uint64_t)vq->used;
+ vend = vstart + sizeof(struct vring_used);
+ vend += sizeof(struct vring_used_elem) * vq->size - 1;
+ if (vstart <= iend && istart <= vend)
+ return 1;
+
+ return 0;
+}
+
+static int
vhost_user_iotlb_msg(struct virtio_net **pdev, struct VhostUserMsg *msg)
{
struct virtio_net *dev = *pdev;
@@ -998,6 +1023,9 @@ vhost_user_iotlb_msg(struct virtio_net **pdev, struct VhostUserMsg *msg)

vhost_user_iotlb_cache_remove(vq, imsg->iova,
imsg->size);
+
+ if (is_vring_iotlb_invalidate(vq, imsg))
+ vring_invalidate(dev, vq);
}
break;
default:
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:22 UTC
Permalink
Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/vhost.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 98f9ccaf0..52bbc9a1c 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -196,7 +196,8 @@ struct vhost_msg {
(1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
(1ULL << VIRTIO_NET_F_GUEST_TSO6) | \
(1ULL << VIRTIO_RING_F_INDIRECT_DESC) | \
- (1ULL << VIRTIO_NET_F_MTU))
+ (1ULL << VIRTIO_NET_F_MTU) | \
+ (1ULL << VIRTIO_F_IOMMU_PLATFORM))


struct guest_page {
--
2.13.3
Maxime Coquelin
2017-08-31 09:50:23 UTC
Permalink
Prior to this patch, iotlb cache's read/write lock was
read-locked at every guest IOVA to app VA translation,
i.e. at least once per packet with indirect off and twice
with indirect on.

The problem is that rte_rwlock_read_lock() makes use of atomic
operation, which is costly.

This patch introduces iotlb lock helpers, so that a full burst
can be protected by taking the lock once, which reduces the
number of atomic operations by up to 64 with indirect
descriptors.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/iotlb.c | 22 +++++++++++-----------
lib/librte_vhost/iotlb.h | 14 ++++++++++++++
lib/librte_vhost/vhost.h | 1 +
lib/librte_vhost/virtio_net.c | 22 ++++++++++++++++++++++
4 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/lib/librte_vhost/iotlb.c b/lib/librte_vhost/iotlb.c
index d014bfe98..7dca95281 100644
--- a/lib/librte_vhost/iotlb.c
+++ b/lib/librte_vhost/iotlb.c
@@ -55,14 +55,14 @@ static void vhost_user_iotlb_pending_remove_all(struct vhost_virtqueue *vq)
{
struct vhost_iotlb_entry *node, *temp_node;

- rte_rwlock_write_lock(&vq->iotlb_lock);
+ rte_rwlock_write_lock(&vq->iotlb_pending_lock);

TAILQ_FOREACH_SAFE(node, &vq->iotlb_pending_list, next, temp_node) {
TAILQ_REMOVE(&vq->iotlb_pending_list, node, next);
rte_mempool_put(vq->iotlb_pool, node);
}

- rte_rwlock_write_unlock(&vq->iotlb_lock);
+ rte_rwlock_write_unlock(&vq->iotlb_pending_lock);
}

int vhost_user_iotlb_pending_miss(struct vhost_virtqueue *vq, uint64_t iova,
@@ -71,7 +71,7 @@ int vhost_user_iotlb_pending_miss(struct vhost_virtqueue *vq, uint64_t iova,
struct vhost_iotlb_entry *node;
int found = 0;

- rte_rwlock_read_lock(&vq->iotlb_lock);
+ rte_rwlock_read_lock(&vq->iotlb_pending_lock);

TAILQ_FOREACH(node, &vq->iotlb_pending_list, next) {
if ((node->iova == iova) && (node->perm == perm)) {
@@ -80,7 +80,7 @@ int vhost_user_iotlb_pending_miss(struct vhost_virtqueue *vq, uint64_t iova,
}
}

- rte_rwlock_read_unlock(&vq->iotlb_lock);
+ rte_rwlock_read_unlock(&vq->iotlb_pending_lock);

return found;
}
@@ -105,11 +105,11 @@ void vhost_user_iotlb_pending_insert(struct vhost_virtqueue *vq,
node->iova = iova;
node->perm = perm;

- rte_rwlock_write_lock(&vq->iotlb_lock);
+ rte_rwlock_write_lock(&vq->iotlb_pending_lock);

TAILQ_INSERT_TAIL(&vq->iotlb_pending_list, node, next);

- rte_rwlock_write_unlock(&vq->iotlb_lock);
+ rte_rwlock_write_unlock(&vq->iotlb_pending_lock);
}

static void vhost_user_iotlb_pending_remove(struct vhost_virtqueue *vq,
@@ -117,7 +117,8 @@ static void vhost_user_iotlb_pending_remove(struct vhost_virtqueue *vq,
{
struct vhost_iotlb_entry *node, *temp_node;

- /* .iotlb_lock already locked by the caller */
+ rte_rwlock_write_lock(&vq->iotlb_pending_lock);
+
TAILQ_FOREACH_SAFE(node, &vq->iotlb_pending_list, next, temp_node) {
if (node->iova < iova)
continue;
@@ -128,6 +129,8 @@ static void vhost_user_iotlb_pending_remove(struct vhost_virtqueue *vq,
TAILQ_REMOVE(&vq->iotlb_pending_list, node, next);
rte_mempool_put(vq->iotlb_pool, node);
}
+
+ rte_rwlock_write_unlock(&vq->iotlb_pending_lock);
}

static void vhost_user_iotlb_cache_remove_all(struct vhost_virtqueue *vq)
@@ -226,8 +229,6 @@ uint64_t vhost_user_iotlb_cache_find(struct vhost_virtqueue *vq, uint64_t iova,
if (unlikely(!*size))
goto out;

- rte_rwlock_read_lock(&vq->iotlb_lock);
-
TAILQ_FOREACH(node, &vq->iotlb_list, next) {
/* List sorted by iova */
if (unlikely(iova < node->iova))
@@ -252,8 +253,6 @@ uint64_t vhost_user_iotlb_cache_find(struct vhost_virtqueue *vq, uint64_t iova,
break;
}

- rte_rwlock_read_unlock(&vq->iotlb_lock);
-
out:
/* Only part of the requested chunk is mapped */
if (unlikely(mapped < *size))
@@ -285,6 +284,7 @@ int vhost_user_iotlb_init(struct virtio_net *dev, int vq_index)
socket = 0;

rte_rwlock_init(&vq->iotlb_lock);
+ rte_rwlock_init(&vq->iotlb_pending_lock);

TAILQ_INIT(&vq->iotlb_list);
TAILQ_INIT(&vq->iotlb_pending_list);
diff --git a/lib/librte_vhost/iotlb.h b/lib/librte_vhost/iotlb.h
index 4be1f7e85..d70c05a70 100644
--- a/lib/librte_vhost/iotlb.h
+++ b/lib/librte_vhost/iotlb.h
@@ -34,6 +34,20 @@
#define _VHOST_IOTLB_H_

#include "vhost.h"
+
+
+static __rte_always_inline void
+vhost_user_iotlb_rd_lock(struct vhost_virtqueue *vq)
+{
+ rte_rwlock_read_lock(&vq->iotlb_lock);
+}
+
+static __rte_always_inline void
+vhost_user_iotlb_rd_unlock(struct vhost_virtqueue *vq)
+{
+ rte_rwlock_read_unlock(&vq->iotlb_lock);
+}
+
void vhost_user_iotlb_cache_insert(struct vhost_virtqueue *vq, uint64_t iova,
uint64_t uaddr, uint64_t size,
uint8_t perm);
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 52bbc9a1c..008fc2ada 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -119,6 +119,7 @@ struct vhost_virtqueue {
struct vhost_vring_addr ring_addrs;

rte_rwlock_t iotlb_lock;
+ rte_rwlock_t iotlb_pending_lock;
struct rte_mempool *iotlb_pool;
TAILQ_HEAD(, vhost_iotlb_entry) iotlb_list;
TAILQ_HEAD(, vhost_iotlb_entry) iotlb_pending_list;
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 1bd21330e..799e12d2c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -45,6 +45,7 @@
#include <rte_sctp.h>
#include <rte_arp.h>

+#include "iotlb.h"
#include "vhost.h"

#define MAX_PKT_BURST 32
@@ -306,6 +307,10 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
}

rte_prefetch0(&vq->desc[desc_indexes[0]]);
+
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+ vhost_user_iotlb_rd_lock(vq);
+
for (i = 0; i < count; i++) {
uint16_t desc_idx = desc_indexes[i];
int err;
@@ -338,6 +343,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
rte_prefetch0(&vq->desc[desc_indexes[i+1]]);
}

+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+ vhost_user_iotlb_rd_unlock(vq);
+
rte_smp_wmb();

*(volatile uint16_t *)&vq->used->idx += count;
@@ -574,6 +582,10 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,

vq->shadow_used_idx = 0;
avail_head = *((volatile uint16_t *)&vq->avail->idx);
+
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+ vhost_user_iotlb_rd_lock(vq);
+
for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;

@@ -600,6 +612,9 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
vq->last_avail_idx += num_buffers;
}

+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+ vhost_user_iotlb_rd_unlock(vq);
+
if (likely(vq->shadow_used_idx)) {
flush_shadow_used_ring(dev, vq);

@@ -1143,6 +1158,10 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,

/* Prefetch descriptor index. */
rte_prefetch0(&vq->desc[desc_indexes[0]]);
+
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+ vhost_user_iotlb_rd_lock(vq);
+
for (i = 0; i < count; i++) {
struct vring_desc *desc;
uint16_t sz, idx;
@@ -1206,6 +1225,9 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
}
}
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
+ vhost_user_iotlb_rd_unlock(vq);
+
vq->last_avail_idx += i;

if (likely(dev->dequeue_zero_copy == 0)) {
--
2.13.3
Yuanhan Liu
2017-09-11 04:18:21 UTC
Permalink
Post by Maxime Coquelin
Prior to this patch, iotlb cache's read/write lock was
read-locked at every guest IOVA to app VA translation,
i.e. at least once per packet with indirect off and twice
with indirect on.
The problem is that rte_rwlock_read_lock() makes use of atomic
operation, which is costly.
This patch introduces iotlb lock helpers, so that a full burst
can be protected by taking the lock once, which reduces the
number of atomic operations by up to 64 with indirect
descriptors.
You were assuming there is no single miss during a burst. If a miss
happens, it requires 2 locks: one for _pending_miss and another one
for _pending_insert. From this point of view, it's actually more
expensive.

However, I won't call it's a bad assumption (for the case of virtio
PMD). And if you take this assumption, why not just deleting the
pending list and moving the lock outside the _iotlb_find() function
like what you did in this patch?

I don't really see the point of introducing the pending list.

--yliu
Maxime Coquelin
2017-09-11 07:34:30 UTC
Permalink
Hi Yuanhan,
Post by Yuanhan Liu
Post by Maxime Coquelin
Prior to this patch, iotlb cache's read/write lock was
read-locked at every guest IOVA to app VA translation,
i.e. at least once per packet with indirect off and twice
with indirect on.
The problem is that rte_rwlock_read_lock() makes use of atomic
operation, which is costly.
This patch introduces iotlb lock helpers, so that a full burst
can be protected by taking the lock once, which reduces the
number of atomic operations by up to 64 with indirect
descriptors.
You were assuming there is no single miss during a burst. If a miss
happens, it requries 2 locks: one for _pending_miss and another one
for _pending_insert. From this point of view, it's actually more
expensive.
It's actually more expensive only when a miss happens. And in that case,
the cost of taking the lock is negligible compared to the miss itself.
Post by Yuanhan Liu
However, I won't call it's a bad assumption (for the case of virtio
PMD). And if you take this assumption, why not just deleting the
pending list and moving the lock outside the _iotlb_find function()
like what you did in this patch?
Because we need the pending list. When there is no matching entry in the
IOTLB cache, we have to send a miss request through the slave channel.

On miss request reception, Qemu performs the translation, and in case of
success, sends it through the main channel using an update request.

While all this is done, the backend could wait for it, blocking
processing on the PMD thread. But it would be really inefficient in case
other queues are being processed on the same lcore. Moreover, if the
iova is invalid, no update request is sent, so the lcore would be blocked
forever.

To overcome this, in case of a miss, the backend exits the burst and
tries again later, giving other virtqueues a chance to be processed
while the update arrives.

And here comes the pending list. On the next try, the update may not
have arrived yet, so we need to check whether a miss has already been
sent for the same address & perm. Otherwise, we would flood Qemu with
miss requests for the same address.
Post by Yuanhan Liu
I don't really see the point of introducing the pending list.
Hope the above clarifies.

I will see if I can improve the pending list protection, but honestly,
its cost is negligible.

Cheers,
Maxime
Post by Yuanhan Liu
--yliu
Yuanhan Liu
2017-09-11 09:39:52 UTC
Permalink
Post by Maxime Coquelin
Hi Yuanhan,
Post by Yuanhan Liu
Post by Maxime Coquelin
Prior to this patch, iotlb cache's read/write lock was
read-locked at every guest IOVA to app VA translation,
i.e. at least once per packet with indirect off and twice
with indirect on.
The problem is that rte_rwlock_read_lock() makes use of atomic
operation, which is costly.
This patch introduces iotlb lock helpers, so that a full burst
can be protected by taking the lock once, which reduces the
number of atomic operations by up to 64 with indirect
descriptors.
You were assuming there is no single miss during a burst. If a miss
happens, it requires 2 locks: one for _pending_miss and another one
for _pending_insert. From this point of view, it's actually more
expensive.
It's actually more expensive only when a miss happens.
Yes, that's what I meant.
Post by Maxime Coquelin
And in that case,
the cost of taking the lock is negligible compared to the miss itself.
Yes, I'm aware of it.
Post by Maxime Coquelin
Post by Yuanhan Liu
However, I won't call it's a bad assumption (for the case of virtio
PMD). And if you take this assumption, why not just deleting the
pending list and moving the lock outside the _iotlb_find() function
like what you did in this patch?
Because we need the pending list. When there is no matching entry in the
IOTLB cache, we have to send a miss request through the slave channel.
You don't need the pending list to send a MISS.
Post by Maxime Coquelin
On miss request reception, Qemu performs the translation, and in case of
success, sends it through the main channel using an update request.
While all this is done, the backend could wait for it, blocking
processing on the PMD thread. But it would be really inefficient in case
other queues are being processed on the same lcore. Moreover, if the
iova is invalid, no update request is sent, so the lcore would be blocked
forever.
To overcome this, in case of a miss, the backend exits the burst and
tries again later, giving other virtqueues a chance to be processed
while the update arrives.
You can also quit earlier without the pending list.
Post by Maxime Coquelin
And here comes the pending list. On the next try, the update may not
have arrived yet, so we need to check whether a miss has already been
sent for the same address & perm. Otherwise, we would flood Qemu with
miss requests for the same address.
Okay, that's the reason we need a pending list: to record the misses
we have already sent.
Post by Maxime Coquelin
Post by Yuanhan Liu
I don't really see the point of introducing the pending list.
Hope the above clarifies.
Thanks, it indeed helps!
Post by Maxime Coquelin
I will see if I can improve the pending list protection, but honestly,
its cost is negligible.
That's not my point :)

--yliu
Remy Horton
2017-09-04 13:58:11 UTC
Permalink
On 31/08/2017 10:50, Maxime Coquelin wrote:
[..]
Post by Maxime Coquelin
Revert "vhost: workaround MQ fails to startup"
vhost: make error handling consistent in rx path
vhost: protect virtio_net device struct
vhost: prepare send_vhost_message() to slave requests
vhost: add support to slave requests channel
vhost: declare missing IOMMU-related definitions for old kernels
vhost: add iotlb helper functions
vhost: iotlb: add pending miss request list and helpers
vhost-user: add support to IOTLB miss slave requests
vhost: initialize vrings IOTLB caches
vhost-user: handle IOTLB update and invalidate requests
vhost: introduce guest IOVA to backend VA helper
vhost: use the guest IOVA to host VA helper
vhost: enable rings at the right time
vhost: don't dereference invalid dev pointer after its reallocation
vhost: postpone rings addresses translation
vhost-user: translate ring addresses when IOMMU enabled
vhost-user: iommu: postpone device creation until ring are mapped
vhost: iommu: Invalidate vring in case of matching IOTLB invalidate
vhost: enable IOMMU support
vhost: iotlb: reduce iotlb read lock usage
Reviewed-by: Remy Horton <***@intel.com>