Discussion:
[dpdk-dev] [PATCH 0/5] vhost: add missing barriers, remove useless volatiles
Maxime Coquelin
2018-12-05 09:49:52 UTC
This series adds missing read barriers after reading avail index
for split ring and desc flags for packed ring.

Once that is fixed, the casts to volatile become useless and are removed.

Also, it turns out that some descriptor prefetches are either
badly placed or useless; the last part of the series fixes that.

With the series applied, I get between 0 and 4% gain depending
on the benchmark (testpmd txonly/rxonly/io).

Thanks to Jason for reporting the missing read barriers.

Maxime Coquelin (5):
vhost: enforce avail index and desc read ordering
vhost: enforce desc flags and content read ordering
vhost: prefetch descriptor after the read barrier
vhost: remove useless prefetch for packed ring descriptor
vhost: remove useless casts to volatile

lib/librte_vhost/virtio_net.c | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)
--
2.17.2
Maxime Coquelin
2018-12-05 09:49:53 UTC
A read barrier is required to ensure the ordering between
available index and the descriptor reads is enforced.

Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
Cc: ***@dpdk.org

Reported-by: Jason Wang <***@redhat.com>
Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/virtio_net.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f11ebb54f 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
avail_head = *((volatile uint16_t *)&vq->avail->idx);

+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
uint16_t nr_vec = 0;
@@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (free_entries == 0)
return 0;

+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);

count = RTE_MIN(count, MAX_PKT_BURST);
--
2.17.2
Ilya Maximets
2018-12-05 11:30:36 UTC
Post by Maxime Coquelin
A read barrier is required to ensure the ordering between
available index and the descriptor reads is enforced.
Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
---
lib/librte_vhost/virtio_net.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f11ebb54f 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
avail_head = *((volatile uint16_t *)&vq->avail->idx);
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
Hmm. This looks weird to me.
Could you please describe the bad scenario here? (It'll be good to have it
in commit message too)

As I understand, you're enforcing the read of avail->idx to happen before
reading the avail->ring[avail_idx]. Is it correct?

But we have following code sequence:

1. read avail->idx (avail_head).
2. check that last_avail_idx != avail_head.
3. read from the ring using last_avail_idx.

So, there is a strict dependency between all 3 steps and the memory
transaction will be finished at the step #2 in any case. There is no
way to read the ring before reading the avail->idx.

Am I missing something?
Post by Maxime Coquelin
for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
uint16_t nr_vec = 0;
@@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (free_entries == 0)
return 0;
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
This one is strange too.

free_entries = *((volatile uint16_t *)&vq->avail->idx) -
vq->last_avail_idx;
if (free_entries == 0)
return 0;

The code reads the value of avail->idx and uses the value on the next
line regardless of compiler optimizations. There is no way for the CPU
to postpone the actual read.
Post by Maxime Coquelin
VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
count = RTE_MIN(count, MAX_PKT_BURST);
Jason Wang
2018-12-06 04:17:38 UTC
Post by Ilya Maximets
Post by Maxime Coquelin
A read barrier is required to ensure the ordering between
available index and the descriptor reads is enforced.
Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
---
lib/librte_vhost/virtio_net.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f11ebb54f 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
avail_head = *((volatile uint16_t *)&vq->avail->idx);
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
Hmm. This looks weird to me.
Could you please describe the bad scenario here? (It'll be good to have it
in commit message too)
As I understand, you're enforcing the read of avail->idx to happen before
reading the avail->ring[avail_idx]. Is it correct?
1. read avail->idx (avail_head).
2. check that last_avail_idx != avail_head.
3. read from the ring using last_avail_idx.
So, there is a strict dependency between all 3 steps and the memory
transaction will be finished at the step #2 in any case. There is no
way to read the ring before reading the avail->idx.
Am I missing something?
Nope, I kind of get what you mean now. And even if we add:

4. read descriptor from descriptor ring using the id read from 3

5. read descriptor content according to the address from 4

They still have dependent memory access. So there's no need for rmb.
Post by Ilya Maximets
Post by Maxime Coquelin
for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
uint16_t nr_vec = 0;
@@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (free_entries == 0)
return 0;
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
This one is strange too.
free_entries = *((volatile uint16_t *)&vq->avail->idx) -
vq->last_avail_idx;
if (free_entries == 0)
return 0;
The code reads the value of avail->idx and uses the value on the next
line regardless of compiler optimizations. There is no way for the CPU
to postpone the actual read.
Yes.

Thanks
Post by Ilya Maximets
Post by Maxime Coquelin
VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
count = RTE_MIN(count, MAX_PKT_BURST);
Ilya Maximets
2018-12-06 12:48:31 UTC
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
A read barrier is required to ensure the ordering between
available index and the descriptor reads is enforced.
Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
---
  lib/librte_vhost/virtio_net.c | 12 ++++++++++++
  1 file changed, 12 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f11ebb54f 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
      rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
      avail_head = *((volatile uint16_t *)&vq->avail->idx);
  +    /*
+     * The ordering between avail index and
+     * desc reads needs to be enforced.
+     */
+    rte_smp_rmb();
+
Hmm. This looks weird to me.
Could you please describe the bad scenario here? (It'll be good to have it
in commit message too)
As I understand, you're enforcing the read of avail->idx to happen before
reading the avail->ring[avail_idx]. Is it correct?
1. read avail->idx (avail_head).
2. check that last_avail_idx != avail_head.
3. read from the ring using last_avail_idx.
So, there is a strict dependency between all 3 steps and the memory
transaction will be finished at the step #2 in any case. There is no
way to read the ring before reading the avail->idx.
Am I missing something?
Nope, I kind of get what you mean now. And even if we add:
4. read descriptor from descriptor ring using the id read from 3
5. read descriptor content according to the address from 4
They still have dependent memory access. So there's no need for rmb.
On second glance, I changed my mind.
The code looks like this:

1. read avail_head = avail->idx
2. read cur_idx = last_avail_idx
if (cur_idx != avail_head) {
3. read idx = avail->ring[cur_idx]
4. read desc[idx]
}

There is an address (data) dependency: 2 -> 3 -> 4.
These reads could not be reordered.

But there is only a control dependency between 1 and (3, 4), because
'avail_head' is not used to calculate 'cur_idx'. In case of aggressive
speculative execution, 1 could be reordered with 3, resulting in reading
a not-yet-updated 'idx'.

Not sure if speculative execution could go so far while 'avail_head' is not
read yet, but it should be possible in theory.

Thoughts ?
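
To make the window concrete, below is a minimal self-contained sketch of
the pattern (simplified, hypothetical ring structures, not the actual
vhost code; the loads are numbered as in the list above):

#include <stdint.h>

/* Hypothetical, simplified split-ring layout, for illustration only. */
struct avail_ring { uint16_t idx; uint16_t ring[256]; };
struct ring_desc  { uint64_t addr; uint32_t len; };

static uint16_t last_avail_idx;

static int
consume_one(struct avail_ring *avail, struct ring_desc *desc_table)
{
	uint16_t avail_head = avail->idx;      /* 1: load avail->idx */
	uint16_t cur_idx = last_avail_idx;     /* 2: private index   */

	if (cur_idx == avail_head)             /* control dependency only */
		return 0;

	/*
	 * Loads 3 and 4 are addressed through cur_idx, not avail_head,
	 * so a speculating CPU may issue them before load 1 completes
	 * and observe a ring entry the guest has not published yet.
	 * An rte_smp_rmb() between 1 and 3 closes that window.
	 */
	uint16_t idx = avail->ring[cur_idx & 255];  /* 3: ring entry */
	struct ring_desc desc = desc_table[idx];    /* 4: descriptor */

	(void)desc;
	return 1;
}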
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
      for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
          uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
          uint16_t nr_vec = 0;
@@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
      if (free_entries == 0)
          return 0;
  +    /*
+     * The ordering between avail index and
+     * desc reads needs to be enforced.
+     */
+    rte_smp_rmb();
+
This one is strange too.
    free_entries = *((volatile uint16_t *)&vq->avail->idx) -
            vq->last_avail_idx;
    if (free_entries == 0)
        return 0;
The code reads the value of avail->idx and uses the value on the next
line regardless of compiler optimizations. There is no way for the CPU
to postpone the actual read.
Yes.
It's a similar situation here, but 'avail_head' is somehow involved in
the 'cur_idx' calculation because of
fill_vec_buf_split(..., vq->last_avail_idx + i, ...),
and 'i' depends on 'free_entries'. But we would need to look at the exact
asm code to be sure. I think we may add a barrier here to avoid possible issues.
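
To illustrate (a hedged sketch with hypothetical names, not the actual
code): 'free_entries' does data-depend on the avail->idx load, but it
only bounds the loop, so the ring loads are ordered after that load by
nothing stronger than the branch:

#include <stdint.h>

static uint16_t
drain_ring(volatile uint16_t *avail_idx, const uint16_t *ring,
	   uint16_t last_avail_idx, uint16_t mask)
{
	uint16_t free_entries = *avail_idx - last_avail_idx;
	uint16_t i, id = 0;

	if (free_entries == 0)
		return 0;

	/*
	 * An rte_smp_rmb() would go here: the address computed below
	 * never involves the avail_idx load, so only the loop bound
	 * (a control dependency) orders the ring reads after it.
	 */
	for (i = 0; i < free_entries; i++)
		id = ring[(last_avail_idx + i) & mask];

	return id;
}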
Post by Jason Wang
Thanks
Post by Ilya Maximets
Post by Maxime Coquelin
      VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
        count = RTE_MIN(count, MAX_PKT_BURST);
Jason Wang
2018-12-06 13:25:56 UTC
Post by Ilya Maximets
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
A read barrier is required to ensure the ordering between
available index and the descriptor reads is enforced.
Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
---
  lib/librte_vhost/virtio_net.c | 12 ++++++++++++
  1 file changed, 12 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f11ebb54f 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
      rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
      avail_head = *((volatile uint16_t *)&vq->avail->idx);
  +    /*
+     * The ordering between avail index and
+     * desc reads needs to be enforced.
+     */
+    rte_smp_rmb();
+
Hmm. This looks weird to me.
Could you please describe the bad scenario here? (It'll be good to have it
in commit message too)
As I understand, you're enforcing the read of avail->idx to happen before
reading the avail->ring[avail_idx]. Is it correct?
1. read avail->idx (avail_head).
2. check that last_avail_idx != avail_head.
3. read from the ring using last_avail_idx.
So, there is a strict dependency between all 3 steps and the memory
transaction will be finished at the step #2 in any case. There is no
way to read the ring before reading the avail->idx.
Am I missing something?
Nope, I kind of get what you mean now. And even if we add:
4. read descriptor from descriptor ring using the id read from 3
5. read descriptor content according to the address from 4
They still have dependent memory access. So there's no need for rmb.
On second glance, I changed my mind.
1. read avail_head = avail->idx
2. read cur_idx = last_avail_idx
if (cur_idx != avail_head) {
3. read idx = avail->ring[cur_idx]
4. read desc[idx]
}
There is an address (data) dependency: 2 -> 3 -> 4.
These reads could not be reordered.
But there is only a control dependency between 1 and (3, 4), because
'avail_head' is not used to calculate 'cur_idx'. In case of aggressive
speculative execution, 1 could be reordered with 3, resulting in reading
a not-yet-updated 'idx'.
Not sure if speculative execution could go so far while 'avail_head' is not
read yet, but it should be possible in theory.
Thoughts ?
I think I changed my mind as well; this is similar to the discussion of
desc_is_avail(). So I think it's possible.
Post by Ilya Maximets
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
      for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
          uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
          uint16_t nr_vec = 0;
@@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
      if (free_entries == 0)
          return 0;
  +    /*
+     * The ordering between avail index and
+     * desc reads needs to be enforced.
+     */
+    rte_smp_rmb();
+
This one is strange too.
    free_entries = *((volatile uint16_t *)&vq->avail->idx) -
            vq->last_avail_idx;
    if (free_entries == 0)
        return 0;
The code reads the value of avail->idx and uses the value on the next
line regardless of compiler optimizations. There is no way for the CPU
to postpone the actual read.
Yes.
It's a similar situation here, but 'avail_head' is somehow involved in
the 'cur_idx' calculation because of
fill_vec_buf_split(..., vq->last_avail_idx + i, ...),
and 'i' depends on 'free_entries'.
I agree it depends on the compiler; it can choose to remove such a data
dependency.
Post by Ilya Maximets
But we need to look at the exact asm
code to be sure.
I think it's probably hard to reach a conclusion by checking the asm code
generated by one specific version or kind of compiler.
Post by Ilya Maximets
I think we may add a barrier here to avoid possible issues.
Yes.


Thanks.
Post by Ilya Maximets
Post by Jason Wang
Thanks
Post by Ilya Maximets
Post by Maxime Coquelin
      VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
        count = RTE_MIN(count, MAX_PKT_BURST);
Michael S. Tsirkin
2018-12-06 13:48:14 UTC
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
A read barrier is required to ensure the ordering between
available index and the descriptor reads is enforced.
Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
---
lib/librte_vhost/virtio_net.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f11ebb54f 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
avail_head = *((volatile uint16_t *)&vq->avail->idx);
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
Hmm. This looks weird to me.
Could you please describe the bad scenario here? (It'll be good to have it
in commit message too)
As I understand, you're enforcing the read of avail->idx to happen before
reading the avail->ring[avail_idx]. Is it correct?
1. read avail->idx (avail_head).
2. check that last_avail_idx != avail_head.
3. read from the ring using last_avail_idx.
So, there is a strict dependency between all 3 steps and the memory
transaction will be finished at the step #2 in any case. There is no
way to read the ring before reading the avail->idx.
Am I missing something?
Nope, I kind of get what you mean now. And even if we add:
4. read descriptor from descriptor ring using the id read from 3
5. read descriptor content according to the address from 4
They still have dependent memory access. So there's no need for rmb.
I am pretty sure on some architectures there is a need for a barrier
here. This is an execution dependency since avail_head is not used as an
index. And reads can be speculated. So the read from the ring can be
speculated and executed before the read of avail_head and the check.

However SMP rmb is/should be free on x86. So unless someone on this
thread is actually testing performance on non-x86, you are both wasting
cycles discussing removal of nop macros and also risk pushing untested
software on users.
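
For reference, this is roughly what rte_smp_rmb() expands to in DPDK's
per-architecture rte_atomic headers (simplified; the exact definitions
may differ by version):

/* x86: loads are already ordered against loads by the memory model,
 * so only the compiler has to be restrained. */
#define rte_smp_rmb() asm volatile("" : : : "memory")

/* ARMv8: a real load/load barrier instruction is emitted. */
#define rte_smp_rmb() asm volatile("dmb ishld" : : : "memory")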
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
uint16_t nr_vec = 0;
@@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (free_entries == 0)
return 0;
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
This one is strange too.
free_entries = *((volatile uint16_t *)&vq->avail->idx) -
vq->last_avail_idx;
if (free_entries == 0)
return 0;
The code reads the value of avail->idx and uses the value on the next
line regardless of compiler optimizations. There is no way for the CPU
to postpone the actual read.
Yes.
Thanks
Post by Ilya Maximets
Post by Maxime Coquelin
VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
count = RTE_MIN(count, MAX_PKT_BURST);
Ilya Maximets
2018-12-07 14:58:24 UTC
Post by Michael S. Tsirkin
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
A read barrier is required to ensure the ordering between
available index and the descriptor reads is enforced.
Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
---
lib/librte_vhost/virtio_net.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f11ebb54f 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
avail_head = *((volatile uint16_t *)&vq->avail->idx);
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
Hmm. This looks weird to me.
Could you please describe the bad scenario here? (It'll be good to have it
in commit message too)
As I understand, you're enforcing the read of avail->idx to happen before
reading the avail->ring[avail_idx]. Is it correct?
1. read avail->idx (avail_head).
2. check that last_avail_idx != avail_head.
3. read from the ring using last_avail_idx.
So, there is a strict dependency between all 3 steps and the memory
transaction will be finished at the step #2 in any case. There is no
way to read the ring before reading the avail->idx.
Am I missing something?
Nope, I kind of get what you mean now. And even if we add:
4. read descriptor from descriptor ring using the id read from 3
5. read descriptor content according to the address from 4
They still have dependent memory access. So there's no need for rmb.
I am pretty sure on some architectures there is a need for a barrier
here. This is an execution dependency since avail_head is not used as an
index. And reads can be speculated. So the read from the ring can be
speculated and executed before the read of avail_head and the check.
However SMP rmb is/should be free on x86.
rte_smp_rmb() turns into a compiler barrier on x86. And compiler barriers
can be harmful too in some cases.
Post by Michael S. Tsirkin
So unless someone on this
thread is actually testing performance on non-x86, you are both wasting
cycles discussing removal of nop macros and also risk pushing untested
software on users.
Since DPDK supports more than just x86, we have to consider possible
performance issues on different architectures. Given that this patch makes
no difference on x86, the only thing we need to consider is the stability
and performance on non-x86 architectures. If we don't pay attention to
things like this, vhost-user could become completely unusable on non-x86
architectures someday.

It'll be cool if someone could test patches (autotest would be nice too) on
ARM at least. But, unfortunately, testing of DPDK is still far from being
ideal. And the lack of hardware is the main issue. I'm running vhost with
qemu on my ARMv8 platform from time to time, but it's definitely not enough.
And I cannot test every patch on the list.

However, I ran a few tests on ARMv8 and this patch shows no significant
performance difference. But it makes the performance a bit more stable
between runs, which is nice.
Post by Michael S. Tsirkin
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
uint16_t nr_vec = 0;
@@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (free_entries == 0)
return 0;
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
This one is strange too.
free_entries = *((volatile uint16_t *)&vq->avail->idx) -
vq->last_avail_idx;
if (free_entries == 0)
return 0;
The code reads the value of avail->idx and uses the value on the next
line regardless of compiler optimizations. There is no way for the CPU
to postpone the actual read.
Yes.
Thanks
Post by Ilya Maximets
Post by Maxime Coquelin
VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
count = RTE_MIN(count, MAX_PKT_BURST);
Michael S. Tsirkin
2018-12-07 15:44:53 UTC
Post by Ilya Maximets
Post by Michael S. Tsirkin
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
A read barrier is required to ensure the ordering between
available index and the descriptor reads is enforced.
Fixes: 4796ad63ba1f ("examples/vhost: import userspace vhost application")
---
lib/librte_vhost/virtio_net.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f11ebb54f 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -791,6 +791,12 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
avail_head = *((volatile uint16_t *)&vq->avail->idx);
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
Hmm. This looks weird to me.
Could you please describe the bad scenario here? (It'll be good to have it
in commit message too)
As I understand, you're enforcing the read of avail->idx to happen before
reading the avail->ring[avail_idx]. Is it correct?
1. read avail->idx (avail_head).
2. check that last_avail_idx != avail_head.
3. read from the ring using last_avail_idx.
So, there is a strict dependency between all 3 steps and the memory
transaction will be finished at the step #2 in any case. There is no
way to read the ring before reading the avail->idx.
Am I missing something?
Nope, I kind of get what you meaning now. And even if we will
4. read descriptor from descriptor ring using the id read from 3
5. read descriptor content according to the address from 4
They still have dependent memory access. So there's no need for rmb.
I am pretty sure on some architectures there is a need for a barrier
here. This is an execution dependency since avail_head is not used as an
index. And reads can be speculated. So the read from the ring can be
speculated and execute before the read of avail_head and the check.
However SMP rmb is/should be free on x86.
rte_smp_rmb() turns into a compiler barrier on x86. And compiler barriers
can be harmful too in some cases.
Post by Michael S. Tsirkin
So unless someone on this
thread is actually testing performance on non-x86, you are both wasting
cycles discussing removal of nop macros and also risk pushing untested
software on users.
Since DPDK supports more than just x86, we have to consider possible
performance issues on different architectures. Given that this patch makes
no difference on x86, the only thing we need to consider is the stability
and performance on non-x86 architectures. If we don't pay attention to
things like this, vhost-user could become completely unusable on non-x86
architectures someday.
It'll be cool if someone could test patches (autotest would be nice too) on
ARM at least. But, unfortunately, testing of DPDK is still far from being
ideal. And the lack of hardware is the main issue. I'm running vhost with
qemu on my ARMv8 platform from time to time, but it's definitely not enough.
And I cannot test every patch on the list.
However, I ran a few tests on ARMv8 and this patch shows no significant
performance difference. But it makes the performance a bit more stable
between runs, which is nice.
I'm sorry about being unclear. I think a barrier is required, so this
patch is good. I was trying to say that splitting hairs trying to prove
that the barrier can be omitted without testing that omitting it gives a
performance benefit doesn't make sense. Since you observed that adding a
barrier actually helps performance stability, it's all good.
Post by Ilya Maximets
Post by Michael S. Tsirkin
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
uint16_t nr_vec = 0;
@@ -1373,6 +1379,12 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (free_entries == 0)
return 0;
+ /*
+ * The ordering between avail index and
+ * desc reads needs to be enforced.
+ */
+ rte_smp_rmb();
+
This one is strange too.
free_entries = *((volatile uint16_t *)&vq->avail->idx) -
vq->last_avail_idx;
if (free_entries == 0)
return 0;
The code reads the value of avail->idx and uses the value on the next
line even with any compiler optimizations. There is no way for CPU to
postpone the actual read.
Yes.
Thanks
Post by Ilya Maximets
Post by Maxime Coquelin
VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
count = RTE_MIN(count, MAX_PKT_BURST);
Maxime Coquelin
2018-12-05 09:49:54 UTC
A read barrier is required to ensure that the ordering between
descriptor's flags and content reads is enforced.

Fixes: 2f3225a7d69b ("vhost: add vector filling support for packed ring")
Cc: ***@dpdk.org

Reported-by: Jason Wang <***@redhat.com>
Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/virtio_net.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index f11ebb54f..68b72e7a5 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -520,6 +520,12 @@ fill_vec_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (unlikely(!desc_is_avail(&descs[avail_idx], wrap_counter)))
return -1;

+ /*
+ * The ordering between desc flags and desc
+ * content reads need to be enforced.
+ */
+ rte_smp_rmb();
+
*desc_count = 0;
*len = 0;
--
2.17.2
Ilya Maximets
2018-12-05 13:33:31 UTC
Post by Maxime Coquelin
A read barrier is required to ensure that the ordering between
descriptor's flags and content reads is enforced.
Fixes: 2f3225a7d69b ("vhost: add vector filling support for packed ring")
---
lib/librte_vhost/virtio_net.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index f11ebb54f..68b72e7a5 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -520,6 +520,12 @@ fill_vec_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (unlikely(!desc_is_avail(&descs[avail_idx], wrap_counter)))
return -1;
+ /*
+ * The ordering between desc flags and desc
+ * content reads need to be enforced.
+ */
+ rte_smp_rmb();
+
Same here. 'desc_is_avail' reads and uses the flags, i.e. there is
no way for reordering.
Writes must be ordered on the virtio side by the write barrier.
This means that if the flags are updated (desc_is_avail() == true),
then the whole descriptor is already updated and the data is written.
No need to have any read barriers here.
Post by Maxime Coquelin
*desc_count = 0;
*len = 0;
Jason Wang
2018-12-06 04:24:48 UTC
Post by Ilya Maximets
Post by Maxime Coquelin
A read barrier is required to ensure that the ordering between
descriptor's flags and content reads is enforced.
Fixes: 2f3225a7d69b ("vhost: add vector filling support for packed ring")
---
lib/librte_vhost/virtio_net.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index f11ebb54f..68b72e7a5 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -520,6 +520,12 @@ fill_vec_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
if (unlikely(!desc_is_avail(&descs[avail_idx], wrap_counter)))
return -1;
+ /*
+ * The ordering between desc flags and desc
+ * content reads need to be enforced.
+ */
+ rte_smp_rmb();
+
Same here. 'desc_is_avail' reads and uses the flags, i.e. there is
no way for reordering.
Writes must be ordered on the virtio side by the write barrier.
This means that if the flags are updated (desc_is_avail() == true),
then the whole descriptor is already updated and the data is written.
No need to have any read barriers here.
In fact, the sequence might be:


flag = read desc[avail_idx].flag [1]

if(flag is not avail) {

    read desc[avail_idx].id [2]

}


There's no data dependency here, only a control dependency, so 2 could be
done before 1 without an rmb.
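
A sketch of the fixed ordering (desc_is_avail() paraphrased and
simplified from the vhost code; the flag values follow the packed
ring layout, and read_desc() is a hypothetical caller):

#include <stdbool.h>
#include <stdint.h>
#include <rte_atomic.h>

#define VRING_DESC_F_AVAIL (1 << 7)
#define VRING_DESC_F_USED  (1 << 15)

struct vring_packed_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

static inline bool
desc_is_avail(struct vring_packed_desc *desc, bool wrap_counter)
{
	return wrap_counter == !!(desc->flags & VRING_DESC_F_AVAIL) &&
		wrap_counter != !!(desc->flags & VRING_DESC_F_USED);
}

static int
read_desc(struct vring_packed_desc *descs, uint16_t avail_idx,
	  bool wrap_counter)
{
	if (!desc_is_avail(&descs[avail_idx], wrap_counter)) /* 1: flags */
		return -1;

	/*
	 * Without this barrier, load 2 below has only a control
	 * dependency on load 1 and may be speculated ahead of it.
	 */
	rte_smp_rmb();

	return descs[avail_idx].id;                          /* 2: content */
}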

Thanks
Post by Ilya Maximets
Post by Maxime Coquelin
*desc_count = 0;
*len = 0;
Ilya Maximets
2018-12-06 11:34:50 UTC
Post by Jason Wang
Post by Ilya Maximets
Post by Maxime Coquelin
A read barrier is required to ensure that the ordering between
descriptor's flags and content reads is enforced.
Fixes: 2f3225a7d69b ("vhost: add vector filling support for packed ring")
---
  lib/librte_vhost/virtio_net.c | 6 ++++++
  1 file changed, 6 insertions(+)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index f11ebb54f..68b72e7a5 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -520,6 +520,12 @@ fill_vec_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
      if (unlikely(!desc_is_avail(&descs[avail_idx], wrap_counter)))
          return -1;
  +    /*
+     * The ordering between desc flags and desc
+     * content reads need to be enforced.
+     */
+    rte_smp_rmb();
+
Same here. 'desc_is_avail' reads and uses the flags, i.e. there is
no way for reordering.
Writes must be ordered on the virtio side by the write barrier.
This means that if the flags are updated (desc_is_avail() == true),
then the whole descriptor is already updated and the data is written.
No need to have any read barriers here.
flag = read desc[avail_idx].flag [1]
if(flag is not avail) {
    read desc[avail_idx].id [2]
}
There's no data dependency here, only a control dependency, so 2 could be done before 1 without an rmb.
OK. Thanks. I agree. Missed that speculative load.
Post by Jason Wang
Thanks
Post by Ilya Maximets
Post by Maxime Coquelin
      *desc_count = 0;
      *len = 0;
 
Maxime Coquelin
2018-12-05 09:49:55 UTC
This patch moves the prefetch to after the available index
is read, to avoid prefetching a descriptor that is not yet available.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/virtio_net.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 68b72e7a5..0a860ca72 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -794,7 +794,6 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
struct buf_vector buf_vec[BUF_VECTOR_MAX];
uint16_t avail_head;

- rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
avail_head = *((volatile uint16_t *)&vq->avail->idx);

/*
@@ -803,6 +802,8 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
*/
rte_smp_rmb();

+ rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
+
for (pkt_idx = 0; pkt_idx < count; pkt_idx++) {
uint32_t pkt_len = pkts[pkt_idx]->pkt_len + dev->vhost_hlen;
uint16_t nr_vec = 0;
@@ -1378,8 +1379,6 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
}
}

- rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
-
free_entries = *((volatile uint16_t *)&vq->avail->idx) -
vq->last_avail_idx;
if (free_entries == 0)
@@ -1391,6 +1390,8 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
*/
rte_smp_rmb();

+ rte_prefetch0(&vq->avail->ring[vq->last_avail_idx & (vq->size - 1)]);
+
VHOST_LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);

count = RTE_MIN(count, MAX_PKT_BURST);
--
2.17.2
Maxime Coquelin
2018-12-05 09:49:56 UTC
This prefetch does not show any performance improvement.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/virtio_net.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 0a860ca72..679ce388b 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1476,8 +1476,6 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
{
uint16_t i;

- rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
-
if (unlikely(dev->dequeue_zero_copy)) {
struct zcopy_mbuf *zmbuf, *next;
--
2.17.2
Maxime Coquelin
2018-12-05 09:49:57 UTC
Casts to volatile are done when reading the avail index and writing
the used index. This would not be necessary if proper barriers
were used.

Now that the read barrier has been added, we can remove these
casts to volatile.

Signed-off-by: Maxime Coquelin <***@redhat.com>
---
lib/librte_vhost/virtio_net.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 679ce388b..eab1a5b4c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -114,7 +114,7 @@ flush_shadow_used_ring_split(struct virtio_net *dev, struct vhost_virtqueue *vq)

vhost_log_cache_sync(dev, vq);

- *(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
+ vq->used->idx += vq->shadow_used_idx;
vq->shadow_used_idx = 0;
vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
sizeof(vq->used->idx));
@@ -794,7 +794,7 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
struct buf_vector buf_vec[BUF_VECTOR_MAX];
uint16_t avail_head;

- avail_head = *((volatile uint16_t *)&vq->avail->idx);
+ avail_head = vq->avail->idx;

/*
* The ordering between avail index and
@@ -1379,8 +1379,7 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
}
}

- free_entries = *((volatile uint16_t *)&vq->avail->idx) -
- vq->last_avail_idx;
+ free_entries = vq->avail->idx - vq->last_avail_idx;
if (free_entries == 0)
return 0;
--
2.17.2
Ilya Maximets
2018-12-05 13:52:30 UTC
Post by Maxime Coquelin
Casts to volatile are done when reading the avail index and writing
the used index. This would not be necessary if proper barriers
were used.
'volatile' and barriers are not really connected. 'volatile' disables
compiler optimizations, while barriers constrain runtime CPU-level
reordering. In general, the casts here are made to force the compiler
to actually read the value rather than cache it somehow. Given that the
vhost library never writes to the avail index, a "very smart" compiler
could drop the read entirely. No modern compiler will do that for a
single operation within a function, so the volatiles are not really
necessary in the current code, but they could save some nerves in case
of code/compiler changes.

OTOH, IMHO, the main purpose of the casts in the current code is
self-documentation. The casts force the reader to pay special attention
to these variables and remind us that they could be updated by another
process. They make it clear which variables are local and which are
shared. I don't think we should remove them anyway.
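
As an illustration of what the casts guard against (a contrived sketch,
not vhost code; with the plain read a compiler may legally hoist the
load out of the loop):

#include <stdint.h>

struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };

/* Plain read: the compiler may cache avail->idx in a register,
 * so the loop could spin forever without seeing the update. */
static uint16_t
wait_plain(struct vring_avail *avail, uint16_t last)
{
	while (avail->idx == last)
		;
	return avail->idx;
}

/* Cast to volatile: every iteration must perform a real load. */
static uint16_t
wait_volatile(struct vring_avail *avail, uint16_t last)
{
	while (*(volatile uint16_t *)&avail->idx == last)
		;
	return *(volatile uint16_t *)&avail->idx;
}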
Post by Maxime Coquelin
Now that the read barrier has been added, we can remove these
casts to volatile.
---
lib/librte_vhost/virtio_net.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 679ce388b..eab1a5b4c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -114,7 +114,7 @@ flush_shadow_used_ring_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
vhost_log_cache_sync(dev, vq);
- *(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
+ vq->used->idx += vq->shadow_used_idx;
vq->shadow_used_idx = 0;
vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
sizeof(vq->used->idx));
@@ -794,7 +794,7 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
struct buf_vector buf_vec[BUF_VECTOR_MAX];
uint16_t avail_head;
- avail_head = *((volatile uint16_t *)&vq->avail->idx);
+ avail_head = vq->avail->idx;
/*
* The ordering between avail index and
@@ -1379,8 +1379,7 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
}
}
- free_entries = *((volatile uint16_t *)&vq->avail->idx) -
- vq->last_avail_idx;
+ free_entries = vq->avail->idx - vq->last_avail_idx;
if (free_entries == 0)
return 0;
Maxime Coquelin
2018-12-06 16:59:00 UTC
Hi Ilya,
Post by Ilya Maximets
Post by Maxime Coquelin
Casts to volatile are done when reading the avail index and writing
the used index. This would not be necessary if proper barriers
were used.
'volatile' and barriers are not really connected. 'volatile' disables
compiler optimizations, while barriers constrain runtime CPU-level
reordering. In general, the casts here are made to force the compiler
to actually read the value rather than cache it somehow. Given that the
vhost library never writes to the avail index, a "very smart" compiler
could drop the read entirely. No modern compiler will do that for a
single operation within a function, so the volatiles are not really
necessary in the current code, but they could save some nerves in case
of code/compiler changes.
Ok, thanks for the explanation.
Why don't we do the same in Virtio PMD?
Post by Ilya Maximets
OTOH, IMHO, the main purpose of the casts in the current code is
self-documentation. The casts force the reader to pay special attention
to these variables and remind us that they could be updated by another
process. They make it clear which variables are local and which are
shared. I don't think we should remove them anyway.
Post by Maxime Coquelin
Now that the read barrier has been added, we can remove these
casts to volatile.
---
lib/librte_vhost/virtio_net.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 679ce388b..eab1a5b4c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -114,7 +114,7 @@ flush_shadow_used_ring_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
vhost_log_cache_sync(dev, vq);
- *(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
+ vq->used->idx += vq->shadow_used_idx;
With cast to volatile:
*(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
35f8: 49 8b 53 10 mov 0x10(%r11),%rdx
vq->shadow_used_idx = 0;
35fc: 31 db xor %ebx,%ebx
*(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
35fe: 0f b7 42 02 movzwl 0x2(%rdx),%eax
3602: 66 41 03 43 70 add 0x70(%r11),%ax
3607: 66 89 42 02 mov %ax,0x2(%rdx)
vq->shadow_used_idx = 0;

Without it:
vq->used->idx += vq->shadow_used_idx;
35f8: 49 8b 43 10 mov 0x10(%r11),%rax
35fc: 41 0f b7 53 70 movzwl 0x70(%r11),%edx
vq->shadow_used_idx = 0;
3601: 31 db xor %ebx,%ebx
vq->used->idx += vq->shadow_used_idx;
3603: 66 01 50 02 add %dx,0x2(%rax)
vq->shadow_used_idx = 0;

If my understanding is correct there is no functional change, but we
save one instruction by removing the cast to volatile.

Thanks,
Maxime
Ilya Maximets
2018-12-07 11:16:53 UTC
Post by Maxime Coquelin
Hi Ilya,
Post by Ilya Maximets
Post by Maxime Coquelin
Casts to volatile are done when reading the avail index and writing
the used index. This would not be necessary if proper barriers
were used.
'volatile' and barriers are not really connected. 'volatile' disables
compiler optimizations, while barriers constrain runtime CPU-level
reordering. In general, the casts here are made to force the compiler
to actually read the value rather than cache it somehow. Given that the
vhost library never writes to the avail index, a "very smart" compiler
could drop the read entirely. No modern compiler will do that for a
single operation within a function, so the volatiles are not really
necessary in the current code, but they could save some nerves in case
of code/compiler changes.
Ok, thanks for the explanation.
Why don't we do the same in Virtio PMD?
Maybe we should. It works because in virtio all the accesses are wrapped
in short access functions like 'vq_update_avail_idx', and we never
actually read the same value twice in the same function. Today's
compilers do not optimize such memory accesses.
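
For reference, the kind of accessor meant here (paraphrased from the
virtio PMD's virtqueue code; treat the exact body and field names as an
assumption):

static inline void
vq_update_avail_idx(struct virtqueue *vq)
{
	virtio_wmb();               /* publish the ring entries first */
	vq->vq_ring.avail->idx = vq->vq_avail_idx;  /* single plain store */
}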
Post by Maxime Coquelin
Post by Ilya Maximets
OTOH, IMHO, the main purpose of the casts in the current code is
self-documentation. The casts force the reader to pay special attention
to these variables and remind us that they could be updated by another
process. They make it clear which variables are local and which are
shared. I don't think we should remove them anyway.
Post by Maxime Coquelin
Now that the read barrier has been added, we can remove these
casts to volatile.
---
  lib/librte_vhost/virtio_net.c | 7 +++----
  1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 679ce388b..eab1a5b4c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -114,7 +114,7 @@ flush_shadow_used_ring_split(struct virtio_net *dev, struct vhost_virtqueue *vq)
        vhost_log_cache_sync(dev, vq);
  -    *(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
+    vq->used->idx += vq->shadow_used_idx;
    *(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
    35f8:    49 8b 53 10              mov    0x10(%r11),%rdx
    vq->shadow_used_idx = 0;
    35fc:    31 db                    xor    %ebx,%ebx
    *(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
    35fe:    0f b7 42 02              movzwl 0x2(%rdx),%eax
    3602:    66 41 03 43 70           add    0x70(%r11),%ax
    3607:    66 89 42 02              mov    %ax,0x2(%rdx)
    vq->shadow_used_idx = 0;
    vq->used->idx += vq->shadow_used_idx;
    35f8:    49 8b 43 10              mov    0x10(%r11),%rax
    35fc:    41 0f b7 53 70           movzwl 0x70(%r11),%edx
    vq->shadow_used_idx = 0;
    3601:    31 db                    xor    %ebx,%ebx
    vq->used->idx += vq->shadow_used_idx;
    3603:    66 01 50 02              add    %dx,0x2(%rax)
    vq->shadow_used_idx = 0;
If my understanding is correct there is no functional change, but we save one instruction by removing the cast to volatile.
IMHO, it's a GCC issue that it cannot see that the cast and dereference
could be dropped. For example, clang on my Ubuntu machine generates
identical code:

With cast to volatile:

*(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
32550: 41 0f b7 42 70 movzwl 0x70(%r10),%eax
32555: 49 8b 4a 10 mov 0x10(%r10),%rcx
32559: 66 01 41 02 add %ax,0x2(%rcx)
vq->shadow_used_idx = 0;
3255d: 66 41 c7 42 70 00 00 movw $0x0,0x70(%r10)

Without it:

vq->used->idx += vq->shadow_used_idx;
32550: 41 0f b7 42 70 movzwl 0x70(%r10),%eax
32555: 49 8b 4a 10 mov 0x10(%r10),%rcx
32559: 66 01 41 02 add %ax,0x2(%rcx)
vq->shadow_used_idx = 0;
3255d: 66 41 c7 42 70 00 00 movw $0x0,0x70(%r10)


However, the generated code differs only in the '+=' case.
Why do we have this increment at all? The following change would
eliminate the difference:

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..5776975ca 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -114,7 +114,7 @@ flush_shadow_used_ring_split(struct virtio_net *dev, struct vhost_virtqueue *vq)

vhost_log_cache_sync(dev, vq);

- *(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
+ *(volatile uint16_t *)&vq->used->idx = vq->last_used_idx;
vq->shadow_used_idx = 0;
vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
sizeof(vq->used->idx));
---

What do you think?


Best regards, Ilya Maximets.