Discussion:
[dpdk-dev] [PATCH] vhost: batch used descriptor chains write-back with packed ring
Maxime Coquelin
2018-11-28 09:47:00 UTC
Instead of writing back descriptor chains in order, let's
write the first chain's flags last in order to improve batching.

With the kernel's pktgen benchmark, a ~3% performance gain is measured.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/virtio_net.c | 37 ++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f54642c2d 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -135,19 +135,10 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
struct vhost_virtqueue *vq)
{
int i;
- uint16_t used_idx = vq->last_used_idx;
+ uint16_t head_flags, head_idx = vq->last_used_idx;

- /* Split loop in two to save memory barriers */
- for (i = 0; i < vq->shadow_used_idx; i++) {
- vq->desc_packed[used_idx].id = vq->shadow_used_packed[i].id;
- vq->desc_packed[used_idx].len = vq->shadow_used_packed[i].len;
-
- used_idx += vq->shadow_used_packed[i].count;
- if (used_idx >= vq->size)
- used_idx -= vq->size;
- }
-
- rte_smp_wmb();
+ if (unlikely(vq->shadow_used_idx == 0))
+ return;

for (i = 0; i < vq->shadow_used_idx; i++) {
uint16_t flags;
@@ -165,12 +156,22 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
flags &= ~VRING_DESC_F_AVAIL;
}

- vq->desc_packed[vq->last_used_idx].flags = flags;
+ vq->desc_packed[vq->last_used_idx].id =
+ vq->shadow_used_packed[i].id;
+ vq->desc_packed[vq->last_used_idx].len =
+ vq->shadow_used_packed[i].len;
+
+ if (i > 0) {
+ vq->desc_packed[vq->last_used_idx].flags = flags;

- vhost_log_cache_used_vring(dev, vq,
+ vhost_log_cache_used_vring(dev, vq,
vq->last_used_idx *
sizeof(struct vring_packed_desc),
sizeof(struct vring_packed_desc));
+ } else {
+ head_idx = vq->last_used_idx;
+ head_flags = flags;
+ }

vq->last_used_idx += vq->shadow_used_packed[i].count;
if (vq->last_used_idx >= vq->size) {
@@ -180,7 +181,15 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
}

rte_smp_wmb();
+
+ vq->desc_packed[head_idx].flags = head_flags;
vq->shadow_used_idx = 0;
+
+ vhost_log_cache_used_vring(dev, vq,
+ head_idx *
+ sizeof(struct vring_packed_desc),
+ sizeof(struct vring_packed_desc));
+
vhost_log_cache_sync(dev, vq);
}
--
2.17.2
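
In short, the patch writes the id/len (and non-head flags) of every used
chain first, issues a single write barrier, and only then writes the head
descriptor's flags, so a driver polling the head cannot observe a partially
written batch. Below is a condensed, self-contained sketch of that pattern;
the toy types, the flush_batch() name and the __atomic_thread_fence()
stand-in for rte_smp_wmb() are illustrative, not the actual DPDK definitions.

#include <stdint.h>

/* Toy stand-ins for the DPDK types and barrier; illustrative only. */
struct desc {
	uint16_t id;
	uint32_t len;
	uint16_t flags;
};

struct shadow {
	uint16_t id;
	uint32_t len;
	uint16_t flags;	/* AVAIL/USED state precomputed for this chain */
	uint16_t count;	/* ring slots consumed by this chain */
};

#define wmb() __atomic_thread_fence(__ATOMIC_RELEASE) /* stand-in for rte_smp_wmb() */

static void
flush_batch(struct desc *ring, uint16_t size, uint16_t *last_used,
	    const struct shadow *sh, uint16_t n)
{
	uint16_t head_idx = *last_used;
	uint16_t head_flags = 0;
	uint16_t i;

	if (n == 0)
		return;

	for (i = 0; i < n; i++) {
		uint16_t idx = *last_used;

		ring[idx].id = sh[i].id;
		ring[idx].len = sh[i].len;

		if (i == 0) {
			/* Defer the head flags write until the very end. */
			head_idx = idx;
			head_flags = sh[i].flags;
		} else {
			ring[idx].flags = sh[i].flags;
		}

		*last_used = (uint16_t)((idx + sh[i].count) % size);
	}

	/* One barrier for the whole batch, then publish it by writing
	 * the head descriptor's flags last: a driver polling only the
	 * head cannot observe a partially written batch. */
	wmb();
	ring[head_idx].flags = head_flags;
}
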
Jens Freimann
2018-11-28 10:05:34 UTC
Post by Maxime Coquelin
Instead of writing back descriptor chains in order, let's
write the first chain's flags last in order to improve batching.
With the kernel's pktgen benchmark, a ~3% performance gain is measured.
---
lib/librte_vhost/virtio_net.c | 37 ++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 14 deletions(-)
Tested-by: Jens Freimann <***@redhat.com>
Reviewed-by: Jens Freimann <***@redhat.com>
Ilya Maximets
2018-12-05 16:01:23 UTC
Post by Maxime Coquelin
Instead of writing back descriptor chains in order, let's
write the first chain's flags last in order to improve batching.
I'm not sure this is fully compliant with the virtio spec.
It says that 'each side (driver and device) are only required to poll
(or test) a single location in memory', but it does not forbid testing
other descriptors. So, if the driver tries to check not only 'the next
device descriptor after the one they processed previously, in circular
order' but a few descriptors ahead, it could read inconsistent memory,
because there is no longer a write barrier between the id/len and
flags updates for those descriptors.

What do you think?
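
To illustrate the concern, here is a hypothetical driver-side look-ahead
of the kind described above; the struct, flag bits and peek_used_ahead()
are simplified, illustrative definitions, not code from any real driver:

#include <stdbool.h>
#include <stdint.h>

/* Simplified packed-ring descriptor and flag bits (illustrative). */
struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};
#define VRING_DESC_F_AVAIL	(1 << 7)
#define VRING_DESC_F_USED	(1 << 15)

/*
 * Besides polling the next descriptor, this driver also tests one a few
 * slots ahead.  With the patch above, the device no longer issues a
 * write barrier between its id/len stores and its flags store for
 * non-head descriptors, so the id/len reads below may return stale
 * values even though flags already reads back as "used".
 */
static bool
peek_used_ahead(const struct vring_packed_desc *ring, uint16_t size,
		uint16_t next, uint16_t ahead, bool used_wrap_counter,
		uint16_t *id, uint32_t *len)
{
	const struct vring_packed_desc *d = &ring[(next + ahead) % size];
	uint16_t flags = d->flags;
	bool avail = !!(flags & VRING_DESC_F_AVAIL);
	bool used = !!(flags & VRING_DESC_F_USED);

	if (avail != used || used != used_wrap_counter)
		return false;	/* not marked used yet */

	/* Racy without a device-side barrier before the flags store. */
	*id = d->id;
	*len = d->len;
	return true;
}
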
Michael S. Tsirkin
2018-12-06 00:56:43 UTC
Post by Ilya Maximets
Post by Maxime Coquelin
Instead of writing back descriptor chains in order, let's
write the first chain's flags last in order to improve batching.
I'm not sure this is fully compliant with the virtio spec.
It says that 'each side (driver and device) are only required to poll
(or test) a single location in memory', but it does not forbid testing
other descriptors. So, if the driver tries to check not only 'the next
device descriptor after the one they processed previously, in circular
order' but a few descriptors ahead, it could read inconsistent memory,
because there is no longer a write barrier between the id/len and
flags updates for those descriptors.
What do you think?
Write barriers for SMP effects are quite cheap on most architectures.
So adding them before each flag write is probably not a big deal.
Post by Ilya Maximets
Post by Maxime Coquelin
With the kernel's pktgen benchmark, a ~3% performance gain is measured.
---
lib/librte_vhost/virtio_net.c | 37 ++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 14 deletions(-)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f54642c2d 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -135,19 +135,10 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
struct vhost_virtqueue *vq)
{
int i;
- uint16_t used_idx = vq->last_used_idx;
+ uint16_t head_flags, head_idx = vq->last_used_idx;
- /* Split loop in two to save memory barriers */
- for (i = 0; i < vq->shadow_used_idx; i++) {
- vq->desc_packed[used_idx].id = vq->shadow_used_packed[i].id;
- vq->desc_packed[used_idx].len = vq->shadow_used_packed[i].len;
-
- used_idx += vq->shadow_used_packed[i].count;
- if (used_idx >= vq->size)
- used_idx -= vq->size;
- }
-
- rte_smp_wmb();
+ if (unlikely(vq->shadow_used_idx == 0))
+ return;
for (i = 0; i < vq->shadow_used_idx; i++) {
uint16_t flags;
@@ -165,12 +156,22 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
flags &= ~VRING_DESC_F_AVAIL;
}
- vq->desc_packed[vq->last_used_idx].flags = flags;
+ vq->desc_packed[vq->last_used_idx].id =
+ vq->shadow_used_packed[i].id;
+ vq->desc_packed[vq->last_used_idx].len =
+ vq->shadow_used_packed[i].len;
+
+ if (i > 0) {
Specifically here?
Maxime Coquelin
2018-12-06 17:10:58 UTC
Post by Ilya Maximets
Post by Maxime Coquelin
Instead of writing back descriptor chains in order, let's
write the first chain's flags last in order to improve batching.
I'm not sure this is fully compliant with the virtio spec.
It says that 'each side (driver and device) are only required to poll
(or test) a single location in memory', but it does not forbid testing
other descriptors. So, if the driver tries to check not only 'the next
device descriptor after the one they processed previously, in circular
order' but a few descriptors ahead, it could read inconsistent memory,
because there is no longer a write barrier between the id/len and
flags updates for those descriptors.
What do you think?
Yes, that makes sense.
Moreover, it should have no cost on x86.

I'll fix it in v2.
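
For reference, a minimal sketch of one way the barrier could look in the
loop body (hypothetical, not the actual v2):

	if (i > 0) {
		/* Hypothetical v2: order this chain's id/len stores
		 * before its flags store, for drivers that test
		 * descriptors ahead of the head. */
		rte_smp_wmb();

		vq->desc_packed[vq->last_used_idx].flags = flags;

		vhost_log_cache_used_vring(dev, vq,
				vq->last_used_idx *
				sizeof(struct vring_packed_desc),
				sizeof(struct vring_packed_desc));
	} else {
		head_idx = vq->last_used_idx;
		head_flags = flags;
	}
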
Thanks,
Maxime