Discussion:
[dpdk-dev] [PATCH 0/4 for 2.3] vhost-user live migration support
Yuanhan Liu
2015-12-02 03:43:09 UTC
Permalink
This patch set adds the initial vhost-user live migration support.

The major task behind that is to log pages we touched during
live migration. So, this patch is basically about adding vhost
log support, and using it.

Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
the dirty memory bitmap is.

- Patch 2 introduces a vhost_log_write() helper function to log
pages we are gonna change.

- Patch 3 logs changes we made to used vring.

- Patch 4 sets log_fhmfd protocol feature bit, which actually
enables the vhost-user live migration support.

A simple test guide (on same host)
==================================

The following test is based on OVS + DPDK. And here is guide
to setup OVS + DPDK:

http://wiki.qemu.org/Features/vhost-user-ovs-dpdk

1. start ovs-vswitchd

2. Add two ovs vhost-user port, say vhost0 and vhost1

3. Start a VM1 to connect to vhost0. Here is my example:

$QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost0 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3333,server,nowait -curses

4. run "ping $host" inside VM1

5. Start VM2 to connect to vhost0, and marking it as the target
of live migration (by adding -incoming tcp:0:4444 option)

$QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost1 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3334,server,nowait -curses \
-incoming tcp:0:4444
migrate tcp:0:4444
7. After a while, you will find that VM1 has been migrated to VM2,
and the "ping" command continues running, perfectly.


Note: this patch set has mostly been based on Victor Kaplansk's demo
work (vhost-user-bridge) at QEMU project. I was thinking to add Victor
as the co-author. Victor, what do you think of that? :)

Comments are welcome!

---
Yuanhan Liu (4):
vhost: handle VHOST_USER_SET_LOG_BASE request
vhost: introduce vhost_log_write
vhost: log vring changes
vhost: enable log_shmfd protocol feature

lib/librte_vhost/rte_virtio_net.h | 35 ++++++++++++++
lib/librte_vhost/vhost_rxtx.c | 70 ++++++++++++++++++---------
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++-
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 +++
lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 5 +-
lib/librte_vhost/virtio-net.c | 4 ++
7 files changed, 145 insertions(+), 26 deletions(-)
--
1.9.0
Yuanhan Liu
2015-12-02 03:43:10 UTC
Permalink
VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.

This request introduces a new payload:

typedef struct VhostUserLog {
uint64_t mmap_size;
uint64_t mmap_offset;
} VhostUserLog;

Also, a fd is delivered from QEMU by ancillary data.

With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
---
lib/librte_vhost/rte_virtio_net.h | 2 ++
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++++-
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 ++++
lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 1 +
5 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
+ uint64_t log_size; /**< Size of log area */
+ uint8_t *log_base; /**< Where dirty pages are logged */
void *priv; /**< private context */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 2dc0547..76bcac2 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,7 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
break;

case VHOST_USER_SET_LOG_BASE:
- RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
+ user_set_log_base(ctx, &msg);
+
+ /* it needs a reply */
+ msg.size = sizeof(msg.payload.u64);
+ send_vhost_message(connfd, &msg);
+ break;
case VHOST_USER_SET_LOG_FD:
close(msg.fds[0]);
RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
} VhostUserMemory;

+typedef struct VhostUserLog {
+ uint64_t mmap_size;
+ uint64_t mmap_offset;
+} VhostUserLog;
+
typedef struct VhostUserMsg {
VhostUserRequest request;

@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
struct vhost_vring_state state;
struct vhost_vring_addr addr;
VhostUserMemory memory;
+ VhostUserLog log;
} payload;
int fds[VHOST_MEMORY_MAX_NREGIONS];
} __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..1d705fd 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,47 @@ user_set_protocol_features(struct vhost_device_ctx ctx,

dev->protocol_features = protocol_features;
}
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+ struct VhostUserMsg *msg)
+{
+ struct virtio_net *dev;
+ int fd = msg->fds[0];
+ uint64_t size, off;
+ void *addr;
+
+ dev = get_device(ctx);
+ if (!dev)
+ return -1;
+
+ if (fd < 0) {
+ RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+ return -1;
+ }
+
+ if (msg->size != sizeof(VhostUserLog)) {
+ RTE_LOG(ERR, VHOST_CONFIG,
+ "invalid log base msg size: %"PRId32" != %d\n",
+ msg->size, (int)sizeof(VhostUserLog));
+ return -1;
+ }
+
+ size = msg->payload.log.mmap_size;
+ off = msg->payload.log.mmap_offset;
+ RTE_LOG(INFO, VHOST_CONFIG,
+ "log mmap size: %"PRId64", offset: %"PRId64"\n",
+ size, off);
+
+ addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
+ if (addr == MAP_FAILED) {
+ RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+ return -1;
+ }
+
+ /* TODO: unmap on stop */
+ dev->log_base = addr;
+ dev->log_size = size;
+
+ return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);

void user_set_protocol_features(struct vhost_device_ctx ctx,
uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);

int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
--
1.9.0
Panu Matilainen
2015-12-02 13:53:45 UTC
Permalink
Post by Yuanhan Liu
VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.
typedef struct VhostUserLog {
uint64_t mmap_size;
uint64_t mmap_offset;
} VhostUserLog;
Also, a fd is delivered from QEMU by ancillary data.
With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.
---
lib/librte_vhost/rte_virtio_net.h | 2 ++
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++++-
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 ++++
lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 1 +
5 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
+ uint64_t log_size; /**< Size of log area */
+ uint8_t *log_base; /**< Where dirty pages are logged */
void *priv; /**< private context */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.

Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.

- Panu -
Yuanhan Liu
2015-12-02 14:31:01 UTC
Permalink
Post by Panu Matilainen
Post by Yuanhan Liu
VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.
typedef struct VhostUserLog {
uint64_t mmap_size;
uint64_t mmap_offset;
} VhostUserLog;
Also, a fd is delivered from QEMU by ancillary data.
With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.
---
lib/librte_vhost/rte_virtio_net.h | 2 ++
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++++-
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 ++++
lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 1 +
5 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
+ uint64_t log_size; /**< Size of log area */
+ uint8_t *log_base; /**< Where dirty pages are logged */
void *priv; /**< private context */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;
This (and other changes in patch 2 breaks the librte_vhost ABI
again, so you'd need to at least add a deprecation note to 2.2 to be
able to do it in 2.3 at all according to the ABI policy.
I was thinking that adding a new field (instead of renaming it or
removing it) isn't an ABI break. So, I was wrong?
Post by Panu Matilainen
Perhaps a better option would be adding some padding to the structs
now for 2.2 since the vhost ABI is broken there anyway. That would
at least give a chance to keep it compatible from 2.2 to 2.3.
It will not be compatible, unless we add exact same fields (not
something like uint8_t pad[xx]). Otherwise, the pad field renaming
is also an ABI break, right?

Thomas, should I write an ABI deprecation note? Can I make it for
v2.2 release If I make one tomorrow? (Sorry that I'm not awared
of that it would be an ABI break).

--yliu
Panu Matilainen
2015-12-02 14:48:14 UTC
Permalink
Post by Yuanhan Liu
Post by Panu Matilainen
Post by Yuanhan Liu
VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.
typedef struct VhostUserLog {
uint64_t mmap_size;
uint64_t mmap_offset;
} VhostUserLog;
Also, a fd is delivered from QEMU by ancillary data.
With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.
---
lib/librte_vhost/rte_virtio_net.h | 2 ++
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++++-
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 ++++
lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 1 +
5 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
+ uint64_t log_size; /**< Size of log area */
+ uint8_t *log_base; /**< Where dirty pages are logged */
void *priv; /**< private context */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;
This (and other changes in patch 2 breaks the librte_vhost ABI
again, so you'd need to at least add a deprecation note to 2.2 to be
able to do it in 2.3 at all according to the ABI policy.
I was thinking that adding a new field (instead of renaming it or
removing it) isn't an ABI break. So, I was wrong?
Adding or removing a field in the middle of a public struct is always an
ABI break. Adding to the end often is too, but not always.
Renaming a field is an API break but not an ABI break - the compiler
cares but the cpu does not.
Post by Yuanhan Liu
Post by Panu Matilainen
Perhaps a better option would be adding some padding to the structs
now for 2.2 since the vhost ABI is broken there anyway. That would
at least give a chance to keep it compatible from 2.2 to 2.3.
It will not be compatible, unless we add exact same fields (not
something like uint8_t pad[xx]). Otherwise, the pad field renaming
is also an ABI break, right?
There's no ABI (or API) break in changing reserved unused fields to
something else, as long as care is taken with sizes and alignment. In
any case padding is best added to the end of a struct to minimize risks
and keep things simple.

- Panu -
Post by Yuanhan Liu
Thomas, should I write an ABI deprecation note? Can I make it for
v2.2 release If I make one tomorrow? (Sorry that I'm not awared
of that it would be an ABI break).
--yliu
Yuanhan Liu
2015-12-02 15:09:23 UTC
Permalink
On Wed, Dec 02, 2015 at 04:48:14PM +0200, Panu Matilainen wrote:
...
Post by Panu Matilainen
Post by Yuanhan Liu
Post by Panu Matilainen
Post by Yuanhan Liu
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
+ uint64_t log_size; /**< Size of log area */
+ uint8_t *log_base; /**< Where dirty pages are logged */
void *priv; /**< private context */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;
This (and other changes in patch 2 breaks the librte_vhost ABI
again, so you'd need to at least add a deprecation note to 2.2 to be
able to do it in 2.3 at all according to the ABI policy.
I was thinking that adding a new field (instead of renaming it or
removing it) isn't an ABI break. So, I was wrong?
Adding or removing a field in the middle of a public struct is
always an ABI break. Adding to the end often is too, but not always.
Renaming a field is an API break but not an ABI break - the compiler
cares but the cpu does not.
Good to know. Thanks.
Post by Panu Matilainen
Post by Yuanhan Liu
Post by Panu Matilainen
Perhaps a better option would be adding some padding to the structs
now for 2.2 since the vhost ABI is broken there anyway. That would
at least give a chance to keep it compatible from 2.2 to 2.3.
It will not be compatible, unless we add exact same fields (not
something like uint8_t pad[xx]). Otherwise, the pad field renaming
is also an ABI break, right?
There's no ABI (or API) break in changing reserved unused fields to
something else, as long as care is taken with sizes and alignment.
as long as we don't reference the reserved unused fields?
Post by Panu Matilainen
In any case padding is best added to the end of a struct to minimize
risks and keep things simple.
The thing is that isn't it a bit aweful to (always) add pads to
the end of a struct, especially when you don't know how many
need to be padded?

--yliu
Panu Matilainen
2015-12-02 16:58:03 UTC
Permalink
Post by Yuanhan Liu
...
Post by Panu Matilainen
Post by Yuanhan Liu
Post by Panu Matilainen
Post by Yuanhan Liu
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
+ uint64_t log_size; /**< Size of log area */
+ uint8_t *log_base; /**< Where dirty pages are logged */
void *priv; /**< private context */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;
This (and other changes in patch 2 breaks the librte_vhost ABI
again, so you'd need to at least add a deprecation note to 2.2 to be
able to do it in 2.3 at all according to the ABI policy.
I was thinking that adding a new field (instead of renaming it or
removing it) isn't an ABI break. So, I was wrong?
Adding or removing a field in the middle of a public struct is
always an ABI break. Adding to the end often is too, but not always.
Renaming a field is an API break but not an ABI break - the compiler
cares but the cpu does not.
Good to know. Thanks.
Post by Panu Matilainen
Post by Yuanhan Liu
Post by Panu Matilainen
Perhaps a better option would be adding some padding to the structs
now for 2.2 since the vhost ABI is broken there anyway. That would
at least give a chance to keep it compatible from 2.2 to 2.3.
It will not be compatible, unless we add exact same fields (not
something like uint8_t pad[xx]). Otherwise, the pad field renaming
is also an ABI break, right?
There's no ABI (or API) break in changing reserved unused fields to
something else, as long as care is taken with sizes and alignment.
as long as we don't reference the reserved unused fields?
That would be the definition of an unused field I think :)
Call it "reserved" if you want, it doesn't really matter as long as its
clear its something you shouldn't be using.
Post by Yuanhan Liu
Post by Panu Matilainen
In any case padding is best added to the end of a struct to minimize
risks and keep things simple.
The thing is that isn't it a bit aweful to (always) add pads to
the end of a struct, especially when you don't know how many
need to be padded?
Then you pad for what you think you need, plus a bit extra, and maybe
some more for others who might want to extend it. What is a reasonable
amount needs deciding case by case - if a struct is alloced in the
millions then be (very) conservative, but if there are one or 50 such
structs within an app lifetime then who cares if its bit larger?

And yeah padding may be annoying, but that's pretty much the only option
in a project where most of the structs are out in the open.

- Panu -
Post by Yuanhan Liu
--yliu
Michael S. Tsirkin
2015-12-02 17:24:53 UTC
Permalink
Post by Panu Matilainen
Post by Yuanhan Liu
...
Post by Panu Matilainen
Post by Yuanhan Liu
Post by Panu Matilainen
Post by Yuanhan Liu
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
+ uint64_t log_size; /**< Size of log area */
+ uint8_t *log_base; /**< Where dirty pages are logged */
void *priv; /**< private context */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;
This (and other changes in patch 2 breaks the librte_vhost ABI
again, so you'd need to at least add a deprecation note to 2.2 to be
able to do it in 2.3 at all according to the ABI policy.
I was thinking that adding a new field (instead of renaming it or
removing it) isn't an ABI break. So, I was wrong?
Adding or removing a field in the middle of a public struct is
always an ABI break. Adding to the end often is too, but not always.
Renaming a field is an API break but not an ABI break - the compiler
cares but the cpu does not.
Good to know. Thanks.
Post by Panu Matilainen
Post by Yuanhan Liu
Post by Panu Matilainen
Perhaps a better option would be adding some padding to the structs
now for 2.2 since the vhost ABI is broken there anyway. That would
at least give a chance to keep it compatible from 2.2 to 2.3.
It will not be compatible, unless we add exact same fields (not
something like uint8_t pad[xx]). Otherwise, the pad field renaming
is also an ABI break, right?
There's no ABI (or API) break in changing reserved unused fields to
something else, as long as care is taken with sizes and alignment.
as long as we don't reference the reserved unused fields?
That would be the definition of an unused field I think :)
Call it "reserved" if you want, it doesn't really matter as long as its
clear its something you shouldn't be using.
Post by Yuanhan Liu
Post by Panu Matilainen
In any case padding is best added to the end of a struct to minimize
risks and keep things simple.
The thing is that isn't it a bit aweful to (always) add pads to
the end of a struct, especially when you don't know how many
need to be padded?
Then you pad for what you think you need, plus a bit extra, and maybe some
more for others who might want to extend it. What is a reasonable amount
needs deciding case by case - if a struct is alloced in the millions then be
(very) conservative, but if there are one or 50 such structs within an app
lifetime then who cares if its bit larger?
And yeah padding may be annoying, but that's pretty much the only option in
a project where most of the structs are out in the open.
- Panu -
Functions versioning is another option.
For a sufficiently widely used struct, it's a lot of work, padding
is easier. But it might be better than breaking ABI
if e.g. you didn't pad enough.
Post by Panu Matilainen
Post by Yuanhan Liu
--yliu
Thomas Monjalon
2015-12-02 16:38:10 UTC
Permalink
Post by Yuanhan Liu
Thomas, should I write an ABI deprecation note? Can I make it for
v2.2 release If I make one tomorrow? (Sorry that I'm not awared
of that it would be an ABI break).
As Panu suggested, it would be better to reserve some room now
in 2.2 which already breaks vhost ABI.
Maybe we have a chance to keep the same vhost ABI in 2.3.

The 2.2 release will probably be closed in less than 2 weeks.
Yuanhan Liu
2015-12-03 01:49:37 UTC
Permalink
Post by Thomas Monjalon
Post by Yuanhan Liu
Thomas, should I write an ABI deprecation note? Can I make it for
v2.2 release If I make one tomorrow? (Sorry that I'm not awared
of that it would be an ABI break).
As Panu suggested, it would be better to reserve some room now
in 2.2 which already breaks vhost ABI.
Maybe we have a chance to keep the same vhost ABI in 2.3.
Got it. I will cook up one now.

--yliu
Post by Thomas Monjalon
The 2.2 release will probably be closed in less than 2 weeks.
Thomas Monjalon
2015-12-06 23:07:28 UTC
Permalink
Post by Panu Matilainen
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.
Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.
Please could you point where the vhost ABI is broken in 2.2?
Yuanhan Liu
2015-12-07 02:00:35 UTC
Permalink
Post by Thomas Monjalon
Post by Panu Matilainen
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.
Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.
Please could you point where the vhost ABI is broken in 2.2?
Thomas, here are the changes to rte_virtio_net.h:


$ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
-- lib/librte_vhost/rte_virtio_net.h >
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index e3a21e5..426a70d 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -89,6 +89,7 @@ struct vhost_virtqueue {
volatile uint16_t last_used_idx_res; /**< Used for multiple devices reserving buffers. */
int callfd; /**< Used to notify the guest (trigger interrupt). */
int kickfd; /**< Currently unused as polling mode is enabled. */
+ int enabled;
struct buf_vector buf_vec[BUF_VECTOR_MAX]; /**< for scatter RX. */
} __rte_cache_aligned;

@@ -96,7 +97,6 @@ struct vhost_virtqueue {
* Device structure contains all configuration information relating to the device.
*/
struct virtio_net {
- struct vhost_virtqueue *virtqueue[VIRTIO_QNUM]; /**< Contains all virtqueue information. */
struct virtio_memory *mem; /**< QEMU memory and memory region information. */
uint64_t features; /**< Negotiated feature set. */
uint64_t protocol_features; /**< Negotiated protocol feature set. */
@@ -104,7 +104,9 @@ struct virtio_net {
uint32_t flags; /**< Device flags. Only used to check if device is running on data core. */
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
+ uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
void *priv; /**< private context */
+ struct vhost_virtqueue *virtqueue[VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MAX]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;

/**
@@ -131,7 +133,7 @@ struct virtio_memory {
};

/**
- * Device operations to add/remove device.
+ * Device and vring operations.
*
* Make sure to set VIRTIO_DEV_RUNNING to the device flags in new_device and
* remove it in destroy_device.
@@ -140,12 +142,18 @@ struct virtio_memory {
struct virtio_net_device_ops {
int (*new_device)(struct virtio_net *); /**< Add device. */
void (*destroy_device)(volatile struct virtio_net *); /**< Remove device. */
+
+ int (*vring_state_changed)(struct virtio_net *dev, uint16_t queue_id, int enable); /**< triggered when a vring is enabled or disabled */
};

static inline uint16_t __attribute__((always_inline))
rte_vring_available_entries(struct virtio_net *dev, uint16_t queue_id)
{
struct vhost_virtqueue *vq = dev->virtqueue[queue_id];
+
+ if (!vq->enabled)
+ return 0;
+
return *(volatile uint16_t *)&vq->avail->idx - vq->last_used_idx_res;
}


--yliu
Thomas Monjalon
2015-12-07 02:03:24 UTC
Permalink
Post by Yuanhan Liu
Post by Thomas Monjalon
Post by Panu Matilainen
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.
Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.
Please could you point where the vhost ABI is broken in 2.2?
$ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
-- lib/librte_vhost/rte_virtio_net.h >
[...]

The problem is that the changes are not noticed in the release notes
and the LIBABIVER is still 1.
Yuanhan Liu
2015-12-07 02:18:43 UTC
Permalink
Post by Thomas Monjalon
Post by Yuanhan Liu
Post by Thomas Monjalon
Post by Panu Matilainen
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.
Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.
Please could you point where the vhost ABI is broken in 2.2?
$ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
-- lib/librte_vhost/rte_virtio_net.h >
[...]
The problem is that the changes are not noticed in the release notes
and the LIBABIVER is still 1.
Yeah, my bad. Firstly, I was not aware of it's an ABI change. Secondly,
I was landed to this team in the middle of v2.2 release, so that I have
limited experience of how those works in DPDK community.

Anyway, it's my fault. I should have realized that in the first time.
Should I send a patch to update LIBABIVER to 2 and update release note
now?

--yliu
Thomas Monjalon
2015-12-07 02:49:03 UTC
Permalink
Post by Yuanhan Liu
Post by Thomas Monjalon
Post by Yuanhan Liu
Post by Thomas Monjalon
Post by Panu Matilainen
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.
Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.
Please could you point where the vhost ABI is broken in 2.2?
$ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
-- lib/librte_vhost/rte_virtio_net.h >
[...]
The problem is that the changes are not noticed in the release notes
and the LIBABIVER is still 1.
Yeah, my bad. Firstly, I was not aware of it's an ABI change. Secondly,
I was landed to this team in the middle of v2.2 release, so that I have
limited experience of how those works in DPDK community.
Anyway, it's my fault. I should have realized that in the first time.
No it's not your fault, and it does not matter who is responsible.
Post by Yuanhan Liu
Should I send a patch to update LIBABIVER to 2 and update release note
now?
Yes today or tomorrow please.
Panu Matilainen
2015-12-07 06:29:46 UTC
Permalink
Post by Thomas Monjalon
Post by Panu Matilainen
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.
Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.
Please could you point where the vhost ABI is broken in 2.2?
The vhost ABI break was announced for DPDK 2.2 in commit
Post by Thomas Monjalon
commit 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d
Date: Tue Jun 16 09:38:43 2015 +0800
doc: announce ABI changes for vhost-user multiple queues
It announces the planned ABI changes for vhost-user multiple
queues feature on v2.2.
So the ABI process was properly followed, except for actually bumping
LIBABIVER. Bumping LIBABIVER is mentioned in
doc/guides/contributing/versioning.rst but it doesn't specify *when*
this should be done, eg should the first patch breaking the ABI bump it
or should it done be shortly before the next stable release, or
something else. As it is, it seems a bit too easy to simply forget.

- Panu -
Thomas Monjalon
2015-12-07 11:28:57 UTC
Permalink
Post by Panu Matilainen
Post by Thomas Monjalon
Post by Panu Matilainen
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.
Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.
Please could you point where the vhost ABI is broken in 2.2?
The vhost ABI break was announced for DPDK 2.2 in commit
[...]
Post by Panu Matilainen
So the ABI process was properly followed, except for actually bumping
LIBABIVER. Bumping LIBABIVER is mentioned in
doc/guides/contributing/versioning.rst but it doesn't specify *when*
this should be done, eg should the first patch breaking the ABI bump it
or should it done be shortly before the next stable release, or
something else. As it is, it seems a bit too easy to simply forget.
I thought it was not needed to explicitly say that commits must be atomic
and we do not have to wait to do the required changes.
In this case, I've missed it when reviewing the vhost patches breaking the
ABI.
Panu Matilainen
2015-12-07 11:41:40 UTC
Permalink
Post by Thomas Monjalon
Post by Panu Matilainen
Post by Thomas Monjalon
Post by Panu Matilainen
This (and other changes in patch 2 breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.
Perhaps a better option would be adding some padding to the structs now
for 2.2 since the vhost ABI is broken there anyway. That would at least
give a chance to keep it compatible from 2.2 to 2.3.
Please could you point where the vhost ABI is broken in 2.2?
The vhost ABI break was announced for DPDK 2.2 in commit
[...]
Post by Panu Matilainen
So the ABI process was properly followed, except for actually bumping
LIBABIVER. Bumping LIBABIVER is mentioned in
doc/guides/contributing/versioning.rst but it doesn't specify *when*
this should be done, eg should the first patch breaking the ABI bump it
or should it done be shortly before the next stable release, or
something else. As it is, it seems a bit too easy to simply forget.
I thought it was not needed to explicitly say that commits must be atomic
and we do not have to wait to do the required changes.
The "problem" is that during a development cycle, an ABI could be broken
several times but LIBABIVER should only be bumped once. So ABI breaking
commits will often not be atomic wrt LIBABIVER, no matter which way its
done.

For example libtool recommendation is that library versions are updated
only just before public releases:
https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info

- Panu -
Post by Thomas Monjalon
In this case, I've missed it when reviewing the vhost patches breaking the
ABI.
Thomas Monjalon
2015-12-07 13:55:56 UTC
Permalink
Post by Panu Matilainen
Post by Thomas Monjalon
Post by Panu Matilainen
The vhost ABI break was announced for DPDK 2.2 in commit
[...]
Post by Panu Matilainen
So the ABI process was properly followed, except for actually bumping
LIBABIVER. Bumping LIBABIVER is mentioned in
doc/guides/contributing/versioning.rst but it doesn't specify *when*
this should be done, eg should the first patch breaking the ABI bump it
or should it done be shortly before the next stable release, or
something else. As it is, it seems a bit too easy to simply forget.
I thought it was not needed to explicitly say that commits must be atomic
and we do not have to wait to do the required changes.
The "problem" is that during a development cycle, an ABI could be broken
several times but LIBABIVER should only be bumped once. So ABI breaking
commits will often not be atomic wrt LIBABIVER, no matter which way its
done.
If the ABI version has already been changed, there should be a merge conflict.
I think it's better to manage a conflict than forget to update the version.
Post by Panu Matilainen
For example libtool recommendation is that library versions are updated
https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info
Interesting link. It makes me think that we do not manage ABI break when
downgrading the library (case of only new API keeping the ABI number).
Post by Panu Matilainen
Post by Thomas Monjalon
In this case, I've missed it when reviewing the vhost patches breaking the
ABI.
Panu Matilainen
2015-12-07 16:48:11 UTC
Permalink
Post by Panu Matilainen
Post by Thomas Monjalon
Post by Panu Matilainen
The vhost ABI break was announced for DPDK 2.2 in commit
[...]
Post by Panu Matilainen
So the ABI process was properly followed, except for actually bumping
LIBABIVER. Bumping LIBABIVER is mentioned in
doc/guides/contributing/versioning.rst but it doesn't specify *when*
this should be done, eg should the first patch breaking the ABI bump it
or should it done be shortly before the next stable release, or
something else. As it is, it seems a bit too easy to simply forget.
I thought it was not needed to explicitly say that commits must be atomic
and we do not have to wait to do the required changes.
Heh, now that I look more carefully... it IS documented, line 38 of
ABI versions are set at the time of major release labeling, and the
ABI may change multiple times, without warning, between the last
release label and the HEAD label of the git tree.
Post by Panu Matilainen
The "problem" is that during a development cycle, an ABI could be broken
several times but LIBABIVER should only be bumped once. So ABI breaking
commits will often not be atomic wrt LIBABIVER, no matter which way its
done.
If the ABI version has already been changed, there should be a merge conflict.
I think it's better to manage a conflict than forget to update the version.
What I'm thinking of is something that would tie LIBABIVER to the
deprecation announcement in a way that could be easily checked
(programmatically and manually). As it is now, its quite non-trivial to
figure what LIBABIVER *should* be for a given library at a given point -
you need to dig up deprecation.rst history and Makefile history and
whatnot, and its all quite error-prone.
Post by Panu Matilainen
For example libtool recommendation is that library versions are updated
https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info
Interesting link. It makes me think that we do not manage ABI break when
downgrading the library (case of only new API keeping the ABI number).
Hmm, not quite sure what you mean here, but full libtool-style
versioning is not really needed with symbol versioning.

- Panu -
Post by Panu Matilainen
Post by Thomas Monjalon
In this case, I've missed it when reviewing the vhost patches breaking the
ABI.
Thomas Monjalon
2015-12-07 17:47:09 UTC
Permalink
Post by Panu Matilainen
Post by Panu Matilainen
Post by Thomas Monjalon
Post by Panu Matilainen
The vhost ABI break was announced for DPDK 2.2 in commit
[...]
Post by Panu Matilainen
So the ABI process was properly followed, except for actually bumping
LIBABIVER. Bumping LIBABIVER is mentioned in
doc/guides/contributing/versioning.rst but it doesn't specify *when*
this should be done, eg should the first patch breaking the ABI bump it
or should it done be shortly before the next stable release, or
something else. As it is, it seems a bit too easy to simply forget.
I thought it was not needed to explicitly say that commits must be atomic
and we do not have to wait to do the required changes.
Heh, now that I look more carefully... it IS documented, line 38 of
ABI versions are set at the time of major release labeling, and the
ABI may change multiple times, without warning, between the last
release label and the HEAD label of the git tree.
Interesting :)
Post by Panu Matilainen
Post by Panu Matilainen
The "problem" is that during a development cycle, an ABI could be broken
several times but LIBABIVER should only be bumped once. So ABI breaking
commits will often not be atomic wrt LIBABIVER, no matter which way its
done.
If the ABI version has already been changed, there should be a merge conflict.
I think it's better to manage a conflict than forget to update the version.
What I'm thinking of is something that would tie LIBABIVER to the
deprecation announcement in a way that could be easily checked
(programmatically and manually). As it is now, its quite non-trivial to
figure what LIBABIVER *should* be for a given library at a given point -
you need to dig up deprecation.rst history and Makefile history and
whatnot, and its all quite error-prone.
Yes clearly we need a safer process.
You are welcome to suggest one.
I like the idea of being based on some "parse-able" RST changes.
Xie, Huawei
2015-12-08 05:57:54 UTC
Permalink
On 12/2/2015 11:40 AM, Yuanhan Liu wrote:
[...]
Post by Yuanhan Liu
+
+ addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
+ if (addr == MAP_FAILED) {
+ RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+ return -1;
+ }
Yuanhan:
mmap could fail with non-zero offset for huge page based mapping. Check
our workaround in user_set_mem_table.
I think you have done the validation, but i guess off is zero here.
Post by Yuanhan Liu
+
+ /* TODO: unmap on stop */
+ dev->log_base = addr;
+ dev->log_size = size;
+
+ return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
void user_set_protocol_features(struct vhost_device_ctx ctx,
uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
Yuanhan Liu
2015-12-08 07:25:02 UTC
Permalink
Post by Xie, Huawei
[...]
Post by Yuanhan Liu
+
+ addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
+ if (addr == MAP_FAILED) {
+ RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+ return -1;
+ }
mmap could fail with non-zero offset for huge page based mapping. Check
our workaround in user_set_mem_table.
I think you have done the validation, but i guess off is zero here.
Yes, off is zero. And thanks for the remind; will fix it in next version.

--yliu
Yuanhan Liu
2015-12-02 03:43:11 UTC
Permalink
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.

Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.

The page address is biased by log_guest_addr, which is derived from
SET_VRING_ADDR request as part of the vring related addresses.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
---
lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
lib/librte_vhost/virtio-net.c | 4 ++++
2 files changed, 38 insertions(+)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 416dac2..191c1be 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
*/

#include <stdint.h>
+#include <linux/vhost.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>
#include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
/* Backend value set by guest. */
#define VIRTIO_DEV_STOPPED -1

+#define VHOST_LOG_PAGE 4096
+

/* Enum for virtqueue management. */
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -82,6 +85,7 @@ struct vhost_virtqueue {
struct vring_desc *desc; /**< Virtqueue descriptor ring. */
struct vring_avail *avail; /**< Virtqueue available ring. */
struct vring_used *used; /**< Virtqueue used ring. */
+ uint64_t log_guest_addr; /**< Physical address of used ring, for logging */
uint32_t size; /**< Size of descriptor ring. */
uint32_t backend; /**< Backend value to determine if device should started/stopped. */
uint16_t vhost_hlen; /**< Vhost header length (varies depending on RX merge buffers. */
@@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
return vhost_va;
}

+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+ /* TODO: to make it atomic? */
+ log_base[page / 8] |= 1 << (page % 8);
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ uint64_t addr = vq->log_guest_addr;
+ uint64_t page;
+
+ if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
+
+ addr += offset;
+ if (dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8))
+ return;
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
+ vhost_log_page(dev->log_base, page);
+ page += VHOST_LOG_PAGE;
+ }
+}
+
+
/**
* Disable features in feature_mask. Returns 0 on success.
*/
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 8364938..4481827 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
return -1;
}

+ vq->log_guest_addr = addr->log_guest_addr;
+
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
dev->device_fh, vq->desc);
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
dev->device_fh, vq->avail);
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
dev->device_fh, vq->used);
+ LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %p\n",
+ dev->device_fh, (void *)(uintptr_t)vq->log_guest_addr);

return 0;
}
--
1.9.0
Victor Kaplansky
2015-12-02 13:53:01 UTC
Permalink
Post by Yuanhan Liu
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.
Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.
The page address is biased by log_guest_addr, which is derived from
SET_VRING_ADDR request as part of the vring related addresses.
---
lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
lib/librte_vhost/virtio-net.c | 4 ++++
2 files changed, 38 insertions(+)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 416dac2..191c1be 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
*/
#include <stdint.h>
+#include <linux/vhost.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>
#include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
/* Backend value set by guest. */
#define VIRTIO_DEV_STOPPED -1
+#define VHOST_LOG_PAGE 4096
+
/* Enum for virtqueue management. */
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -82,6 +85,7 @@ struct vhost_virtqueue {
struct vring_desc *desc; /**< Virtqueue descriptor ring. */
struct vring_avail *avail; /**< Virtqueue available ring. */
struct vring_used *used; /**< Virtqueue used ring. */
+ uint64_t log_guest_addr; /**< Physical address of used ring, for logging */
uint32_t size; /**< Size of descriptor ring. */
uint32_t backend; /**< Backend value to determine if device should started/stopped. */
uint16_t vhost_hlen; /**< Vhost header length (varies depending on RX merge buffers. */
@@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
return vhost_va;
}
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+ /* TODO: to make it atomic? */
+ log_base[page / 8] |= 1 << (page % 8);
I think the atomic OR operation is necessary only if there can be
more than one vhost-user back-end updating the guest's memory
simultaneously. However probably it is pretty safe to perform
regular OR operation, since rings are not shared between
back-end. What about buffers pointed by descriptors? To be on
the safe side, I would use a GCC built-in function
__sync_fetch_and_or().
Post by Yuanhan Liu
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ uint64_t addr = vq->log_guest_addr;
+ uint64_t page;
+
+ if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Isn't "likely" more appropriate in above, since the whole
expression is expected to be true most of the time?
Post by Yuanhan Liu
+
+ addr += offset;
+ if (dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8))
+ return;
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
+ vhost_log_page(dev->log_base, page);
+ page += VHOST_LOG_PAGE;
+ }
+}
+
+
/**
* Disable features in feature_mask. Returns 0 on success.
*/
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 8364938..4481827 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
return -1;
}
+ vq->log_guest_addr = addr->log_guest_addr;
+
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
dev->device_fh, vq->desc);
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
dev->device_fh, vq->avail);
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
dev->device_fh, vq->used);
+ LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %p\n",
+ dev->device_fh, (void *)(uintptr_t)vq->log_guest_addr);
return 0;
}
--
1.9.0
Yuanhan Liu
2015-12-02 14:39:57 UTC
Permalink
Post by Victor Kaplansky
Post by Yuanhan Liu
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.
Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.
The page address is biased by log_guest_addr, which is derived from
SET_VRING_ADDR request as part of the vring related addresses.
---
lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
lib/librte_vhost/virtio-net.c | 4 ++++
2 files changed, 38 insertions(+)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 416dac2..191c1be 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
*/
#include <stdint.h>
+#include <linux/vhost.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>
#include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
/* Backend value set by guest. */
#define VIRTIO_DEV_STOPPED -1
+#define VHOST_LOG_PAGE 4096
+
/* Enum for virtqueue management. */
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -82,6 +85,7 @@ struct vhost_virtqueue {
struct vring_desc *desc; /**< Virtqueue descriptor ring. */
struct vring_avail *avail; /**< Virtqueue available ring. */
struct vring_used *used; /**< Virtqueue used ring. */
+ uint64_t log_guest_addr; /**< Physical address of used ring, for logging */
uint32_t size; /**< Size of descriptor ring. */
uint32_t backend; /**< Backend value to determine if device should started/stopped. */
uint16_t vhost_hlen; /**< Vhost header length (varies depending on RX merge buffers. */
@@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
return vhost_va;
}
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+ /* TODO: to make it atomic? */
+ log_base[page / 8] |= 1 << (page % 8);
I think the atomic OR operation is necessary only if there can be
more than one vhost-user back-end updating the guest's memory
simultaneously. However probably it is pretty safe to perform
regular OR operation, since rings are not shared between
back-end. What about buffers pointed by descriptors? To be on
the safe side, I would use a GCC built-in function
__sync_fetch_and_or().
The build has to be passed not only for gcc, but for icc and clang as
well.
Post by Victor Kaplansky
Post by Yuanhan Liu
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ uint64_t addr = vq->log_guest_addr;
+ uint64_t page;
+
+ if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Isn't "likely" more appropriate in above, since the whole
expression is expected to be true most of the time?
Sorry, it's a typo, and thanks for the catching.

--yliu
Xie, Huawei
2015-12-09 03:33:16 UTC
Permalink
Post by Victor Kaplansky
Post by Yuanhan Liu
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.
Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.
The page address is biased by log_guest_addr, which is derived from
SET_VRING_ADDR request as part of the vring related addresses.
---
lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
lib/librte_vhost/virtio-net.c | 4 ++++
2 files changed, 38 insertions(+)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 416dac2..191c1be 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
*/
#include <stdint.h>
+#include <linux/vhost.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>
#include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
/* Backend value set by guest. */
#define VIRTIO_DEV_STOPPED -1
+#define VHOST_LOG_PAGE 4096
+
/* Enum for virtqueue management. */
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -82,6 +85,7 @@ struct vhost_virtqueue {
struct vring_desc *desc; /**< Virtqueue descriptor ring. */
struct vring_avail *avail; /**< Virtqueue available ring. */
struct vring_used *used; /**< Virtqueue used ring. */
+ uint64_t log_guest_addr; /**< Physical address of used ring, for logging */
uint32_t size; /**< Size of descriptor ring. */
uint32_t backend; /**< Backend value to determine if device should started/stopped. */
uint16_t vhost_hlen; /**< Vhost header length (varies depending on RX merge buffers. */
@@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
return vhost_va;
}
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+ /* TODO: to make it atomic? */
+ log_base[page / 8] |= 1 << (page % 8);
I think the atomic OR operation is necessary only if there can be
more than one vhost-user back-end updating the guest's memory
simultaneously. However probably it is pretty safe to perform
regular OR operation, since rings are not shared between
back-end. What about buffers pointed by descriptors? To be on
the safe side, I would use a GCC built-in function
__sync_fetch_and_or().
Post by Yuanhan Liu
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ uint64_t addr = vq->log_guest_addr;
+ uint64_t page;
+
+ if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Isn't "likely" more appropriate in above, since the whole
expression is expected to be true most of the time?
Victor:
So we are not always logging, what is the message that tells the backend
the migration is started?
[...]
Yuanhan Liu
2015-12-09 03:42:51 UTC
Permalink
On Wed, Dec 09, 2015 at 03:33:16AM +0000, Xie, Huawei wrote:
...
Post by Xie, Huawei
Post by Victor Kaplansky
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ uint64_t addr = vq->log_guest_addr;
+ uint64_t page;
+
+ if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Isn't "likely" more appropriate in above, since the whole
expression is expected to be true most of the time?
So we are not always logging, what is the message that tells the backend
the migration is started?
When log starts, VHOST_USER_SET_FEATURES request will be sent again,
with VHOST_F_LOG_ALL feature bit set.

--yliu
Xie, Huawei
2015-12-09 05:44:11 UTC
Permalink
Post by Yuanhan Liu
...
Post by Xie, Huawei
Post by Victor Kaplansky
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ uint64_t addr = vq->log_guest_addr;
+ uint64_t page;
+
+ if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Isn't "likely" more appropriate in above, since the whole
expression is expected to be true most of the time?
So we are not always logging, what is the message that tells the backend
the migration is started?
When log starts, VHOST_USER_SET_FEATURES request will be sent again,
with VHOST_F_LOG_ALL feature bit set.
As the VHOST_USER_SET_FEATURES handling and rx/tx runs asynchronously,
we have to make sure we don't miss logging anything when this feature is
set. For example, I doubt like in virtio_dev_rx, is the dev->features
volatile?
Post by Yuanhan Liu
--yliu
Yuanhan Liu
2015-12-09 08:41:36 UTC
Permalink
Post by Xie, Huawei
Post by Yuanhan Liu
...
Post by Xie, Huawei
Post by Victor Kaplansky
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ uint64_t addr = vq->log_guest_addr;
+ uint64_t page;
+
+ if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Isn't "likely" more appropriate in above, since the whole
expression is expected to be true most of the time?
So we are not always logging, what is the message that tells the backend
the migration is started?
When log starts, VHOST_USER_SET_FEATURES request will be sent again,
with VHOST_F_LOG_ALL feature bit set.
As the VHOST_USER_SET_FEATURES handling and rx/tx runs asynchronously,
we have to make sure we don't miss logging anything when this feature is
set.
That's a good remind. Thanks.
Post by Xie, Huawei
For example, I doubt like in virtio_dev_rx, is the dev->features
volatile?
No, it is not volatile.

--yliu
Yuanhan Liu
2015-12-02 03:43:12 UTC
Permalink
Invoking vhost_log_write() to mark corresponding page as dirty while
updating used vring.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
---
lib/librte_vhost/vhost_rxtx.c | 74 +++++++++++++++++++++++++++++--------------
1 file changed, 50 insertions(+), 24 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 9322ce6..d4805d8 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -129,6 +129,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
uint32_t offset = 0, vb_offset = 0;
uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
uint8_t hdr = 0, uncompleted_pkt = 0;
+ uint16_t idx;

/* Get descriptor from available ring */
desc = &vq->desc[head[packet_success]];
@@ -200,16 +201,22 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
}

/* Update used ring with desc information */
- vq->used->ring[res_cur_idx & (vq->size - 1)].id =
- head[packet_success];
+ idx = res_cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id = head[packet_success];

/* Drop the packet if it is uncompleted */
if (unlikely(uncompleted_pkt == 1))
- vq->used->ring[res_cur_idx & (vq->size - 1)].len =
- vq->vhost_hlen;
+ vq->used->ring[idx].len = vq->vhost_hlen;
else
- vq->used->ring[res_cur_idx & (vq->size - 1)].len =
- pkt_len + vq->vhost_hlen;
+ vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+ /*
+ * to defer the update to when updating used->idx,
+ * and batch them?
+ */
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));

res_cur_idx++;
packet_success++;
@@ -236,6 +243,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,

*(volatile uint16_t *)&vq->used->idx += count;
vq->last_used_idx = res_end_idx;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));

/* flush used->idx update before we read avail->flags. */
rte_mb();
@@ -265,6 +275,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
uint32_t seg_avail;
uint32_t vb_avail;
uint32_t cpy_len, entry_len;
+ uint16_t idx;

if (pkt == NULL)
return 0;
@@ -302,16 +313,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
entry_len = vq->vhost_hlen;

if (vb_avail == 0) {
- uint32_t desc_idx =
- vq->buf_vec[vec_idx].desc_idx;
+ uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+ if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+ idx = cur_idx & (vq->size - 1);

- if ((vq->desc[desc_idx].flags
- & VRING_DESC_F_NEXT) == 0) {
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
- = vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+ vq->used->ring[idx].len = entry_len;
+
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));

entry_len = 0;
cur_idx++;
@@ -354,10 +367,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
VRING_DESC_F_NEXT) == 0) {
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
+ idx = cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id
= vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].len = entry_len;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_len = 0;
cur_idx++;
entry_success++;
@@ -390,16 +406,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,

if ((vq->desc[desc_idx].flags &
VRING_DESC_F_NEXT) == 0) {
- uint16_t wrapped_idx =
- cur_idx & (vq->size - 1);
+ idx = cur_idx & (vq->size - 1);
/*
* Update used ring with the
* descriptor information
*/
- vq->used->ring[wrapped_idx].id
+ vq->used->ring[idx].id
= desc_idx;
- vq->used->ring[wrapped_idx].len
+ vq->used->ring[idx].len
= entry_len;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_success++;
entry_len = 0;
cur_idx++;
@@ -422,10 +440,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
* This whole packet completes.
*/
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
+ idx = cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id
= vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].len = entry_len;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_success++;
break;
}
@@ -658,6 +679,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
/* Update used index buffer information. */
vq->used->ring[used_idx].id = head[entry_success];
vq->used->ring[used_idx].len = 0;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[used_idx]),
+ sizeof(vq->used->ring[used_idx]));

/* Allocate an mbuf and populate the structure. */
m = rte_pktmbuf_alloc(mbuf_pool);
@@ -778,6 +802,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,

rte_compiler_barrier();
vq->used->idx += entry_success;
+ vhost_log_write(dev, vq, offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));
/* Kick guest if required. */
if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
eventfd_write(vq->callfd, (eventfd_t)1);
--
1.9.0
Victor Kaplansky
2015-12-02 14:07:02 UTC
Permalink
Post by Yuanhan Liu
Invoking vhost_log_write() to mark corresponding page as dirty while
updating used vring.
Looks good, thanks!

I didn't find where you log the dirty pages in result of data
written to the buffers pointed by the descriptors in RX vring.
AFAIU, the buffers of RX queue reside in guest's memory and have
to be marked as dirty if they are written. What do you say?

-- Victor
Post by Yuanhan Liu
---
lib/librte_vhost/vhost_rxtx.c | 74 +++++++++++++++++++++++++++++--------------
1 file changed, 50 insertions(+), 24 deletions(-)
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 9322ce6..d4805d8 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -129,6 +129,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
uint32_t offset = 0, vb_offset = 0;
uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
uint8_t hdr = 0, uncompleted_pkt = 0;
+ uint16_t idx;
/* Get descriptor from available ring */
desc = &vq->desc[head[packet_success]];
@@ -200,16 +201,22 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
}
/* Update used ring with desc information */
- vq->used->ring[res_cur_idx & (vq->size - 1)].id =
- head[packet_success];
+ idx = res_cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id = head[packet_success];
/* Drop the packet if it is uncompleted */
if (unlikely(uncompleted_pkt == 1))
- vq->used->ring[res_cur_idx & (vq->size - 1)].len =
- vq->vhost_hlen;
+ vq->used->ring[idx].len = vq->vhost_hlen;
else
- vq->used->ring[res_cur_idx & (vq->size - 1)].len =
- pkt_len + vq->vhost_hlen;
+ vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+ /*
+ * to defer the update to when updating used->idx,
+ * and batch them?
+ */
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
res_cur_idx++;
packet_success++;
@@ -236,6 +243,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
*(volatile uint16_t *)&vq->used->idx += count;
vq->last_used_idx = res_end_idx;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));
/* flush used->idx update before we read avail->flags. */
rte_mb();
@@ -265,6 +275,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
uint32_t seg_avail;
uint32_t vb_avail;
uint32_t cpy_len, entry_len;
+ uint16_t idx;
if (pkt == NULL)
return 0;
@@ -302,16 +313,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
entry_len = vq->vhost_hlen;
if (vb_avail == 0) {
- uint32_t desc_idx =
- vq->buf_vec[vec_idx].desc_idx;
+ uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+ if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+ idx = cur_idx & (vq->size - 1);
- if ((vq->desc[desc_idx].flags
- & VRING_DESC_F_NEXT) == 0) {
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
- = vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+ vq->used->ring[idx].len = entry_len;
+
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_len = 0;
cur_idx++;
@@ -354,10 +367,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
VRING_DESC_F_NEXT) == 0) {
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
+ idx = cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id
= vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].len = entry_len;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_len = 0;
cur_idx++;
entry_success++;
@@ -390,16 +406,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
if ((vq->desc[desc_idx].flags &
VRING_DESC_F_NEXT) == 0) {
- uint16_t wrapped_idx =
- cur_idx & (vq->size - 1);
+ idx = cur_idx & (vq->size - 1);
/*
* Update used ring with the
* descriptor information
*/
- vq->used->ring[wrapped_idx].id
+ vq->used->ring[idx].id
= desc_idx;
- vq->used->ring[wrapped_idx].len
+ vq->used->ring[idx].len
= entry_len;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_success++;
entry_len = 0;
cur_idx++;
@@ -422,10 +440,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
* This whole packet completes.
*/
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
+ idx = cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id
= vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].len = entry_len;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_success++;
break;
}
@@ -658,6 +679,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
/* Update used index buffer information. */
vq->used->ring[used_idx].id = head[entry_success];
vq->used->ring[used_idx].len = 0;
+ vhost_log_write(dev, vq,
+ offsetof(struct vring_used, ring[used_idx]),
+ sizeof(vq->used->ring[used_idx]));
/* Allocate an mbuf and populate the structure. */
m = rte_pktmbuf_alloc(mbuf_pool);
@@ -778,6 +802,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
rte_compiler_barrier();
vq->used->idx += entry_success;
+ vhost_log_write(dev, vq, offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));
/* Kick guest if required. */
if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
eventfd_write(vq->callfd, (eventfd_t)1);
--
1.9.0
Yuanhan Liu
2015-12-02 14:38:02 UTC
Permalink
Post by Victor Kaplansky
Post by Yuanhan Liu
Invoking vhost_log_write() to mark corresponding page as dirty while
updating used vring.
Looks good, thanks!
I didn't find where you log the dirty pages in result of data
written to the buffers pointed by the descriptors in RX vring.
AFAIU, the buffers of RX queue reside in guest's memory and have
to be marked as dirty if they are written. What do you say?
Yeah, we should. I got a question then: why log_guest_addr is set
to the physical address of used vring in guest? I mean, apparently,
we need log more changes other than used vring only.

--yliu
Victor Kaplansky
2015-12-02 15:58:24 UTC
Permalink
Post by Yuanhan Liu
Post by Victor Kaplansky
Post by Yuanhan Liu
Invoking vhost_log_write() to mark corresponding page as dirty while
updating used vring.
Looks good, thanks!
I didn't find where you log the dirty pages in result of data
written to the buffers pointed by the descriptors in RX vring.
AFAIU, the buffers of RX queue reside in guest's memory and have
to be marked as dirty if they are written. What do you say?
Yeah, we should. I got a question then: why log_guest_addr is set
to the physical address of used vring in guest? I mean, apparently,
we need log more changes other than used vring only.
The physical address of used vring sent to the back-end, since
otherwise back-end has to perform virtual to physical
translation, and we want to avoid this. The dirty buffers has to
be marked as well, but their guest's physical address is known
directly from the descriptors.
Post by Yuanhan Liu
--yliu
Michael S. Tsirkin
2015-12-02 16:26:37 UTC
Permalink
Post by Victor Kaplansky
Post by Yuanhan Liu
Post by Victor Kaplansky
Post by Yuanhan Liu
Invoking vhost_log_write() to mark corresponding page as dirty while
updating used vring.
Looks good, thanks!
I didn't find where you log the dirty pages in result of data
written to the buffers pointed by the descriptors in RX vring.
AFAIU, the buffers of RX queue reside in guest's memory and have
to be marked as dirty if they are written. What do you say?
Yeah, we should. I got a question then: why log_guest_addr is set
to the physical address of used vring in guest? I mean, apparently,
we need log more changes other than used vring only.
The physical address of used vring sent to the back-end, since
otherwise back-end has to perform virtual to physical
translation, and we want to avoid this. The dirty buffers has to
be marked as well, but their guest's physical address is known
directly from the descriptors.
Yes, people wanted to be able to do multiple physical
addresses to one virtual so you do not want to translate
virt to phys.
Post by Victor Kaplansky
Post by Yuanhan Liu
--yliu
Yuanhan Liu
2015-12-03 02:31:19 UTC
Permalink
Post by Michael S. Tsirkin
Post by Victor Kaplansky
Post by Yuanhan Liu
Post by Victor Kaplansky
Post by Yuanhan Liu
Invoking vhost_log_write() to mark corresponding page as dirty while
updating used vring.
Looks good, thanks!
I didn't find where you log the dirty pages in result of data
written to the buffers pointed by the descriptors in RX vring.
AFAIU, the buffers of RX queue reside in guest's memory and have
to be marked as dirty if they are written. What do you say?
Yeah, we should. I got a question then: why log_guest_addr is set
to the physical address of used vring in guest? I mean, apparently,
we need log more changes other than used vring only.
The physical address of used vring sent to the back-end, since
otherwise back-end has to perform virtual to physical
translation, and we want to avoid this. The dirty buffers has to
be marked as well, but their guest's physical address is known
directly from the descriptors.
Yes, people wanted to be able to do multiple physical
addresses to one virtual so you do not want to translate
virt to phys.
Thank you both, it's clear to me then.

--yliu
Xie, Huawei
2015-12-09 02:45:28 UTC
Permalink
Post by Victor Kaplansky
Post by Yuanhan Liu
Invoking vhost_log_write() to mark corresponding page as dirty while
updating used vring.
Looks good, thanks!
I didn't find where you log the dirty pages in result of data
written to the buffers pointed by the descriptors in RX vring.
AFAIU, the buffers of RX queue reside in guest's memory and have
to be marked as dirty if they are written. What do you say?
Yes, that is actually the majority of the work.
Besides, in the first version, we could temporarily ignore zero copy,
which is much more complicated, as we have no idea when the page has
been accessed.
-- Huawei
Post by Victor Kaplansky
-- Victor
Yuanhan Liu
2015-12-02 03:43:13 UTC
Permalink
To claim that we support vhost-user live migration support:
SET_LOG_BASE request will be send only when this feature flag
is set.

Besides this flag, we actually need another feature flag set
to make vhost-user live migration work: VHOST_F_LOG_ALL.
Which, however, has been enabled long time ago.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
---
lib/librte_vhost/vhost_user/virtio-net-user.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 013cf38..a3a889d 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,8 +38,10 @@
#include "vhost-net-user.h"

#define VHOST_USER_PROTOCOL_F_MQ 0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1

-#define VHOST_USER_PROTOCOL_FEATURES (1ULL << VHOST_USER_PROTOCOL_F_MQ)
+#define VHOST_USER_PROTOCOL_FEATURES ((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+ (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD))

int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
--
1.9.0
Victor Kaplansky
2015-12-02 14:10:56 UTC
Permalink
Post by Yuanhan Liu
This patch set adds the initial vhost-user live migration support.
The major task behind that is to log pages we touched during
live migration. So, this patch is basically about adding vhost
log support, and using it.
Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
the dirty memory bitmap is.
- Patch 2 introduces a vhost_log_write() helper function to log
pages we are gonna change.
- Patch 3 logs changes we made to used vring.
- Patch 4 sets log_fhmfd protocol feature bit, which actually
enables the vhost-user live migration support.
A simple test guide (on same host)
==================================
The following test is based on OVS + DPDK. And here is guide
http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
1. start ovs-vswitchd
2. Add two ovs vhost-user port, say vhost0 and vhost1
$QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost0 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3333,server,nowait -curses
4. run "ping $host" inside VM1
5. Start VM2 to connect to vhost0, and marking it as the target
of live migration (by adding -incoming tcp:0:4444 option)
$QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost1 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3334,server,nowait -curses \
-incoming tcp:0:4444
migrate tcp:0:4444
7. After a while, you will find that VM1 has been migrated to VM2,
and the "ping" command continues running, perfectly.
Note: this patch set has mostly been based on Victor Kaplansk's demo
work (vhost-user-bridge) at QEMU project. I was thinking to add Victor
as the co-author. Victor, what do you think of that? :)
Thanks for adding me to credits list!
-- Victor
Post by Yuanhan Liu
Comments are welcome!
---
vhost: handle VHOST_USER_SET_LOG_BASE request
vhost: introduce vhost_log_write
vhost: log vring changes
vhost: enable log_shmfd protocol feature
lib/librte_vhost/rte_virtio_net.h | 35 ++++++++++++++
lib/librte_vhost/vhost_rxtx.c | 70 ++++++++++++++++++---------
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++-
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 +++
lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 5 +-
lib/librte_vhost/virtio-net.c | 4 ++
7 files changed, 145 insertions(+), 26 deletions(-)
--
1.9.0
Yuanhan Liu
2015-12-02 14:33:11 UTC
Permalink
On Wed, Dec 02, 2015 at 04:10:56PM +0200, Victor Kaplansky wrote:
...
Post by Victor Kaplansky
Post by Yuanhan Liu
Note: this patch set has mostly been based on Victor Kaplansk's demo
work (vhost-user-bridge) at QEMU project. I was thinking to add Victor
as the co-author. Victor, what do you think of that? :)
Thanks for adding me to credits list!
Great, I will add your signed-off-by since v2. Will that be okay to you?

--yliu
Xie, Huawei
2015-12-09 03:41:36 UTC
Permalink
Post by Yuanhan Liu
This patch set adds the initial vhost-user live migration support.
The major task behind that is to log pages we touched during
live migration. So, this patch is basically about adding vhost
log support, and using it.
Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
the dirty memory bitmap is.
- Patch 2 introduces a vhost_log_write() helper function to log
pages we are gonna change.
- Patch 3 logs changes we made to used vring.
- Patch 4 sets log_fhmfd protocol feature bit, which actually
enables the vhost-user live migration support.
A simple test guide (on same host)
==================================
The following test is based on OVS + DPDK. And here is guide
http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
1. start ovs-vswitchd
2. Add two ovs vhost-user port, say vhost0 and vhost1
$QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost0 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3333,server,nowait -curses
4. run "ping $host" inside VM1
5. Start VM2 to connect to vhost0, and marking it as the target
of live migration (by adding -incoming tcp:0:4444 option)
$QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost1 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3334,server,nowait -curses \
-incoming tcp:0:4444
migrate tcp:0:4444
7. After a while, you will find that VM1 has been migrated to VM2,
and the "ping" command continues running, perfectly.
Is there some formal verification that migration is truly successful? At
least that the memory we care in our vhost-user case has been migrated
successfully?
For instance, we miss logging guest RX buffers in this patch set, but we
have no idea.

[...]
Pavel Fedin
2015-12-11 08:26:55 UTC
Permalink
Hello!

I am currently testing this patchset with qemu and have problems.

The guest migrates correctly, but after the migration it cries in the log:

Vhost user backend fails to broadcast fake RARP

and pinging the (new) host doesn't work. When i migrate it back to the old host, the network resumes working.

I have analyzed the code, and this problem happens because we support neither VHOST_USER_PROTOCOL_F_RARP, nor
VIRTIO_NET_F_GUEST_ANNOUNCE. Since the latter seems to be related only to guest, i simply enabled it in qemu by force, and after
this the network doesn't work at all.
Can anybody help me and explain how the thing works? I expected that gratuitous ARP packets are harmless, but they seem to break
things somehow. And what was used for testing the implementation?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-11 09:49:34 UTC
Permalink
Post by Pavel Fedin
Hello!
I am currently testing this patchset with qemu and have problems.
Hi,

Thanks for testing!
Post by Pavel Fedin
Vhost user backend fails to broadcast fake RARP
Yes, because I have enabled it yet on DPDK side, and I was intended to
try it in the v2 patchset, which is likely to be sent out next week.

--yliu
Post by Pavel Fedin
and pinging the (new) host doesn't work. When i migrate it back to the old host, the network resumes working.
I have analyzed the code, and this problem happens because we support neither VHOST_USER_PROTOCOL_F_RARP, nor
VIRTIO_NET_F_GUEST_ANNOUNCE. Since the latter seems to be related only to guest, i simply enabled it in qemu by force, and after
this the network doesn't work at all.
Can anybody help me and explain how the thing works? I expected that gratuitous ARP packets are harmless, but they seem to break
things somehow. And what was used for testing the implementation?
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2015-12-11 10:22:23 UTC
Permalink
Hello!
Post by Yuanhan Liu
Post by Pavel Fedin
Hello!
I am currently testing this patchset with qemu and have problems.
Hi,
Thanks for testing!
Not at all :)

BTW, it works, and it was my bad. openvswitch was configured incorrectly on the other side, vhost port number was different for
some reason, while ruleset was the same. I reconfigured it and now everything migrates correctly, except increased downtime because
of missing GARP (the guest misses some PINGs, then it retries ARP, which brings the link back up).

Tested-by: Pavel Fedin <***@samsung.com>

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Peter Xu
2015-12-14 03:58:42 UTC
Permalink
This post might be inappropriate. Click to display it.
Pavel Fedin
2015-12-14 07:30:54 UTC
Permalink
Hello!
Post by Peter Xu
When doing the ping, was it from the guest (to another host) or to
the guest (from another host)?
In any case, I still could not understand why the ping loss happened
in this test.
If ping from guest, no ARP refresh is required at all?
ping from guest to host.

Ok, my setup was:

Host<------->openVSwitch<----------->guest
LOCAL vhostuser

So, in order to migrate the guest, i simply replicated this setup on both hosts, with the same IPs on host side. And on both hosts i set up the following ruleset for openvswitch:

ovs-ofctl add-flow ovs-br0 in_port=1,actions=output:LOCAL
ovs-ofctl add-flow ovs-br0 in_port=LOCAL,actions=output:1

And on the second host, for some reason, vhostuser port got no 2 in the database instead of 1. Probably because first i added wrong port, then added correct one, then removed the wrong one. So, as i wrote before - please don't worry, the patch works fine, it was totally my lame fault.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Peter Xu
2015-12-14 09:04:06 UTC
Permalink
Post by Pavel Fedin
Hello!
Hi, Pavel!
Post by Pavel Fedin
Post by Peter Xu
When doing the ping, was it from the guest (to another host) or to
the guest (from another host)?
In any case, I still could not understand why the ping loss happened
in this test.
If ping from guest, no ARP refresh is required at all?
ping from guest to host.
Host<------->openVSwitch<----------->guest
LOCAL vhostuser
Regarding to "with the same IPs on host side": do you mean that you
configured the same IP on two hosts in the intranet? I think this
does not matter if we are testing it functionally (whether live
migration could work), However I would still perfer to try ping
another host (say, host3) inside the intranet. What do you think?

When pinging host3, I assume there should have no ping loss. Also,
should have no loss too in the revert direction (reason as in
previous mail).
Post by Pavel Fedin
ovs-ofctl add-flow ovs-br0 in_port=1,actions=output:LOCAL
ovs-ofctl add-flow ovs-br0 in_port=LOCAL,actions=output:1
And on the second host, for some reason, vhostuser port got no 2 in the database instead of 1. Probably because first i added wrong port, then added correct one, then removed the wrong one. So, as i wrote before - please don't worry, the patch works fine, it was totally my lame fault.
Yes, thanks to let me know that the patch is working. Actually what
I am interested in is the down time that when host3 ping guest from
outside during migration. Would you please let me know the result if
you are doing such tests in the future? And please just ignore this
if there is no requirement on your side.

Thanks!
Peter
Pavel Fedin
2015-12-14 09:46:57 UTC
Permalink
Hello!
Post by Peter Xu
Post by Pavel Fedin
Host<------->openVSwitch<----------->guest
LOCAL vhostuser
So, in order to migrate the guest, i simply replicated this setup on both hosts, with the
Regarding to "with the same IPs on host side": do you mean that you
configured the same IP on two hosts in the intranet?
No intranet. You can think of it as an isolated network between the host and guest, and that's all. I just assigned an IP to ovs' LOCAL interface on both hosts, and these ovs instances knew nothing about each other, neither they forwarded packets between each other. I didn't want to make things overcomplicated and decided not to mess with host's own connection to the intranet, just something that sits on the other side of vhost-user and replies to PINGs was perfectly OK for me.
Post by Peter Xu
I think this
does not matter if we are testing it functionally (whether live
migration could work), However I would still perfer to try ping
another host (say, host3) inside the intranet. What do you think?
Yes, perhaps this would be better test, may be next time i'll do it. Anyway, IIRC, PATCH v2 is coming.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Peter Xu
2015-12-14 10:09:38 UTC
Permalink
Post by Pavel Fedin
Post by Peter Xu
Regarding to "with the same IPs on host side": do you mean that you
configured the same IP on two hosts in the intranet?
No intranet. You can think of it as an isolated network between the host and guest, and that's all. I just assigned an IP to ovs' LOCAL interface on both hosts, and these ovs instances knew nothing about each other, neither they forwarded packets between each other. I didn't want to make things overcomplicated and decided not to mess with host's own connection to the intranet, just something that sits on the other side of vhost-user and replies to PINGs was perfectly OK for me.
I see.
Post by Pavel Fedin
Post by Peter Xu
I think this
does not matter if we are testing it functionally (whether live
migration could work), However I would still perfer to try ping
another host (say, host3) inside the intranet. What do you think?
Yes, perhaps this would be better test, may be next time i'll do it. Anyway, IIRC, PATCH v2 is coming.
Agreed.

Thanks!
Peter
Post by Pavel Fedin
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-14 12:09:37 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Peter Xu
Post by Pavel Fedin
Host<------->openVSwitch<----------->guest
LOCAL vhostuser
So, in order to migrate the guest, i simply replicated this setup on both hosts, with the
Regarding to "with the same IPs on host side": do you mean that you
configured the same IP on two hosts in the intranet?
No intranet. You can think of it as an isolated network between the host and guest, and that's all. I just assigned an IP to ovs' LOCAL interface on both hosts, and these ovs instances knew nothing about each other, neither they forwarded packets between each other. I didn't want to make things overcomplicated and decided not to mess with host's own connection to the intranet, just something that sits on the other side of vhost-user and replies to PINGs was perfectly OK for me.
Pavel,

It seems that we have exactly the same test environment set up: I have
one server (where I normally do vhost test there) and one desktop (my
dev box),

On both hosts, there is an ovs bridge, with IP address 192.168.100.1
assigned manually. Later, I started a VM on the server, and manually
assigned IP to 192.168.100.10. I then run "ping 192.168.100.1" for
live migration testing.

The migration to my desktop somehow works (even though there are some
bugs in this patch set), however, I did see what Pavel saw: about 12
packets has been lost, which means about 12 seconds the network is not
working well.

Besides that, there was always an error message from the target host
after the migration:

KVM: injection failed, MSI lost (Operation not permitted)

Firstly, I've very limited knowledge of OVS, therefore I'm not sure
this kind of live migration test env is setup rightly or not. I'd
appreciate if anyone could shine some lights on it. Anyway, I'm digging
the code to see if I can find something abnormal there.
Post by Pavel Fedin
Post by Peter Xu
I think this
does not matter if we are testing it functionally (whether live
migration could work), However I would still perfer to try ping
another host (say, host3) inside the intranet. What do you think?
Yes, perhaps this would be better test, may be next time i'll do it.
Again, appreciate your testing!
Post by Pavel Fedin
Anyway, IIRC, PATCH v2 is coming.
Hopefully, I could fix this gap this week and send out v2.

--yliu
Peter Xu
2015-12-14 13:00:22 UTC
Permalink
Post by Yuanhan Liu
It seems that we have exactly the same test environment set up: I have
one server (where I normally do vhost test there) and one desktop (my
dev box),
On both hosts, there is an ovs bridge, with IP address 192.168.100.1
assigned manually. Later, I started a VM on the server, and manually
assigned IP to 192.168.100.10. I then run "ping 192.168.100.1" for
live migration testing.
The migration to my desktop somehow works (even though there are some
bugs in this patch set), however, I did see what Pavel saw: about 12
packets has been lost, which means about 12 seconds the network is not
working well.
Hi, Yuanhan,

I _guess_ the problem for ping might be: guest ARP entry for
192.168.100.1 is not updated. Or say, after guest migrated to host2
from host1, guest is still trying to send packet to host1's NIC (no
one is telling it to update, right?), so no one is responding the
ping. When the entry is expired, guest will resend the ARP request,
and host2 will respond this time, with mac address on host2 provided
this time. After that, ping works again.

(not familiar with OVS too, so am just taking it as a "vritual"
switch)

Thanks.
Peter
Yuanhan Liu
2015-12-14 13:21:15 UTC
Permalink
Post by Peter Xu
Post by Yuanhan Liu
It seems that we have exactly the same test environment set up: I have
one server (where I normally do vhost test there) and one desktop (my
dev box),
On both hosts, there is an ovs bridge, with IP address 192.168.100.1
assigned manually. Later, I started a VM on the server, and manually
assigned IP to 192.168.100.10. I then run "ping 192.168.100.1" for
live migration testing.
The migration to my desktop somehow works (even though there are some
bugs in this patch set), however, I did see what Pavel saw: about 12
packets has been lost, which means about 12 seconds the network is not
working well.
Hi, Yuanhan,
I _guess_ the problem for ping might be: guest ARP entry for
192.168.100.1 is not updated. Or say, after guest migrated to host2
from host1, guest is still trying to send packet to host1's NIC (no
one is telling it to update, right?), so no one is responding the
ping. When the entry is expired, guest will resend the ARP request,
and host2 will respond this time, with mac address on host2 provided
this time. After that, ping works again.
Peter,

Thanks for your input, and that sounds reasonable. You just reminded
me that the host1's NIC is indeed different with host2's NIC: the ovs
bridge mac address is different.

I then had a quick try, setting the two ovs bridge with same mac
address, and it works like a charm: the gap is gone :)

--yliu
Post by Peter Xu
(not familiar with OVS too, so am just taking it as a "vritual"
switch)
Thanks.
Peter
Peter Xu
2015-12-14 13:28:08 UTC
Permalink
Post by Yuanhan Liu
Peter,
Thanks for your input, and that sounds reasonable. You just reminded
me that the host1's NIC is indeed different with host2's NIC: the ovs
bridge mac address is different.
I then had a quick try, setting the two ovs bridge with same mac
address, and it works like a charm: the gap is gone :)
Good to know that. :)

I will try to do some tests too using the patchset. Not sure whether
I will encounter the same KVM warning (seems related to APIC,
however still could not tell more than that). Will update you if
there is anything helpful.

Peter
Post by Yuanhan Liu
--yliu
Yuanhan Liu
2015-12-14 13:51:53 UTC
Permalink
Post by Peter Xu
Post by Yuanhan Liu
Peter,
Thanks for your input, and that sounds reasonable. You just reminded
me that the host1's NIC is indeed different with host2's NIC: the ovs
bridge mac address is different.
I then had a quick try, setting the two ovs bridge with same mac
address, and it works like a charm: the gap is gone :)
Good to know that. :)
I will try to do some tests too using the patchset. Not sure whether
I will encounter the same KVM warning (seems related to APIC,
however still could not tell more than that). Will update you if
there is anything helpful.
Appreciate that! It'd be good if you could test with my v2 patch set:
hopefully I could cook it up by the end of tomorrow.

--yliu
Pavel Fedin
2015-12-14 14:54:45 UTC
Permalink
Hello!
Post by Yuanhan Liu
Post by Peter Xu
I _guess_ the problem for ping might be: guest ARP entry for
192.168.100.1 is not updated. Or say, after guest migrated to host2
from host1, guest is still trying to send packet to host1's NIC (no
one is telling it to update, right?), so no one is responding the
ping. When the entry is expired, guest will resend the ARP request,
and host2 will respond this time, with mac address on host2 provided
this time. After that, ping works again.
Peter,
Thanks for your input, and that sounds reasonable. You just reminded
me that the host1's NIC is indeed different with host2's NIC: the ovs
bridge mac address is different.
Yes, this is indeed what is happening, and actually i already wrote about it. In wireshark it looks exactly like that: the some
PINGs are sent without replies, then the guest redoes ARP, PING replies resume.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-15 08:23:24 UTC
Permalink
Post by Peter Xu
Post by Pavel Fedin
BTW, it works, and it was my bad. openvswitch was configured incorrectly on the other side, vhost port number was different for
some reason, while ruleset was the same. I reconfigured it and now everything migrates correctly, except increased downtime because
of missing GARP (the guest misses some PINGs, then it retries ARP, which brings the link back up).
Hi,
When doing the ping, was it from the guest (to another host) or to
the guest (from another host)?
In any case, I still could not understand why the ping loss happened
in this test.
If ping from guest, no ARP refresh is required at all?
If ping to guest from outside, when the migration finishes on the
target side of qemu, qemu_self_announce() will be called.
It's supposed to see some ARP requests if I run tcpdump against
with the ovs bridge, right? However, in fact, I saw nothing from
tcpdump on the target host when the migration is done.

I mean I do find that qemu_annouce_self composes an ARP
broadcoast request, but it seems that I didn't catch it on
the target host.

Something wrong, or someting I missed?

--yliu
Post by Peter Xu
Although
we might see a warning like "Vhost user backend fails to broadcast
fake RARP" (notify is done by hacking vhost_user_receive(), even if
notify fails, things will still move on), QEMU should still send a
RARP onto the link.
Not sure whether I missed anything.
Thanks.
Peter
Pavel Fedin
2015-12-15 08:45:56 UTC
Permalink
Hello!
Post by Yuanhan Liu
I mean I do find that qemu_annouce_self composes an ARP
broadcoast request, but it seems that I didn't catch it on
the target host.
Something wrong, or someting I missed?
To tell the truth, i don't know. I am also learning qemu internals on the fly. Indeed, i see that it should announce itself. But
this brings up a question: why do we need special announce procedure in vhost-user then?
I think you can add some debug output and see how it works in realtime. This is what i normally do when i don't understand in which
sequence things happen.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-15 08:56:43 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Yuanhan Liu
I mean I do find that qemu_annouce_self composes an ARP
broadcoast request, but it seems that I didn't catch it on
the target host.
Something wrong, or someting I missed?
To tell the truth, i don't know. I am also learning qemu internals on the fly. Indeed, i see that it should announce itself.
I was acutally asking Peter. Sorry for not making it clear and
thanks for your reply, anyway :)
Post by Pavel Fedin
But
this brings up a question: why do we need special announce procedure in vhost-user then?
Note quite sure. I found Thibaut submitted a patch to send
VHOST_USER_SEND_RARP request after migration is done months
ago. Thibaut, would you please elaborate it a bit more what
should be done on vhost-user backend? To construct a gratuitous
ARP request and broadcast it?
Post by Pavel Fedin
I think you can add some debug output and see how it works in realtime. This is what i normally do when i don't understand in which
sequence things happen.
Thanks.

--yliu
Post by Pavel Fedin
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2015-12-15 09:04:38 UTC
Permalink
Hello!
Post by Yuanhan Liu
Note quite sure. I found Thibaut submitted a patch to send
VHOST_USER_SEND_RARP request after migration is done months
ago. Thibaut, would you please elaborate it a bit more what
should be done on vhost-user backend? To construct a gratuitous
ARP request and broadcast it?
By the way, some more info for you all.
1. I've just examined qemu_announce_self() and i see that IPs are all set to 0 in the packet it generates. It's quite logical
because qemu has no idea what address is used by the guest, even more, theoretically it could be not IPv4 at all. But then - how can
it work at all, and what's the use for this packet?
2. I tried to work around if by adding VIRTIO_NET_F_GUEST_ANNOUNCE. I expected that the guest will see it and make announcement by
itself. But result was quite the opposite - PING stopped working at all, right from the beginning, even without migration.

Can local qemu/DPDK/etc gurus give some explanation?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Peter Xu
2015-12-15 10:05:48 UTC
Permalink
Post by Pavel Fedin
To tell the truth, i don't know. I am also learning qemu internals on the fly. Indeed, i see that it should announce itself. But
this brings up a question: why do we need special announce procedure in vhost-user then?
I have the same question. Here is my guess...

In customized networks, maybe people are not using ARP at all? When
we use DPDK, we directly pass through the network logic inside
kernel itself. So logically all the network protocols could be
customized by the user of it. In the customized network, maybe there
is some other protocol (rather than RARP) that would do the same
thing as what ARP/RARP does. So, this SEND_RARP request could give
the vhost-user backend a chance to format its own announce packet
and broadcast (in the SEND_RARP request, the guest's mac address
will be appended).

CCing Victor to better know the truth...

Peter
Thibaut Collet
2015-12-15 11:43:15 UTC
Permalink
Post by Pavel Fedin
Post by Pavel Fedin
To tell the truth, i don't know. I am also learning qemu internals on
the fly. Indeed, i see that it should announce itself. But
Post by Pavel Fedin
this brings up a question: why do we need special announce procedure in
vhost-user then?
I have the same question. Here is my guess...
In customized networks, maybe people are not using ARP at all? When
we use DPDK, we directly pass through the network logic inside
kernel itself. So logically all the network protocols could be
customized by the user of it. In the customized network, maybe there
is some other protocol (rather than RARP) that would do the same
thing as what ARP/RARP does. So, this SEND_RARP request could give
the vhost-user backend a chance to format its own announce packet
and broadcast (in the SEND_RARP request, the guest's mac address
will be appended).
CCing Victor to better know the truth...
Peter
Hi,

After a migration, to avoid network outage, the guest must announce its new
location to the L2 layer, typically with a GARP. Otherwise requests sent to
the guest arrive to the old host until a ARP request is sent (after 30
seconds) or the guest sends some data.

QEMU implementation of self announce after a migration with a vhost backend
is the following:
- If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest sends
automatically a GARP.
- Else if the vhost backend implements VHOST_USER_SEND_RARP this request
is sent to the vhost backend. When this message is received the vhost
backend must act as it receives a RARP from the guest (purpose of this RARP
is to update switches' MAC->port maaping as a GARP). This RARP is a false
one, created by the vhost backend,
- Else nothing is done and we have a network outage until a ARP is sent or
the guest sends some data.


VIRTIO_GUEST_ANNOUNCE feature is negotiated if:
- the vhost backend announces the support of this feature. Maybe QEMU can
be updated to support unconditionnaly this feature
- the virtio driver of the guest implements this feature. It is not the
case for old kernel or dpdk virtio pmd.

Regarding dpdk to have a migration of vhost interface with limited network
outage we have to:

- Implement management VHOST_USER_SEND_RARP request to emulate a fake
RARP for guest

To do that we have to consider two kinds of guest:
1. Guest with virtio driver implementing VIRTIO_GUEST_ANNOUNCE feature
2. Guest with virtio driver that does not have the VIRTIO_GUEST_ANNOUNCE
feature. This is the case with old kernel or guest running a dpdk (virtio
pmd of dpdk does not have this feature)

Guest with VIRTIO_GUEST_ANNOUNCE feature sends automatically some GARP
after a migration if this feature has been negotiated. So the only thing to
do it is to negotiate the VIRTIO_GUEST_ANNOUNCE feature between QEMU, vhost
backend and the guest.
For this kind of guest the vhost-backend must announce the support of
VIRTIO_GUEST_ANNOUNCE feature. As vhost-backend has no particular action to
do in this case the support of VIRTIO_GUEST_ANNOUNCE feature can be
unconditionally set in QEMU in the future.

For guest without VIRTIO_GUEST_ANNOUNCE feature we have to send a fake
RARP: QEMU knows the MAC address of the guest and can create and broadcast
a RARP. But in case of vhost-backend QEMU is not able to broadcast this
fake RARP and must ask to the vhost backend to do it through the
VHOST_USER_SEND_RARP request. When the vhost backend receives this message
it must create a fake RARP message (as done by QEMU) and do the appropriate
operation as this message has been sent by the guest through the virtio
rings.


To solve this point 2 solutions are implemented:
- After the migration the guest automatically sends GARP. This solution
occurs if VIRTIO_GUEST_ANNOUNCE feature has been negotiated between QEMU
and the guest.
* VIRTIO_GUEST_ANNOUNCE
Thibaut Collet
2015-12-15 11:47:47 UTC
Permalink
Post by Thibaut Collet
Post by Pavel Fedin
Post by Pavel Fedin
To tell the truth, i don't know. I am also learning qemu internals on
the fly. Indeed, i see that it should announce itself. But
Post by Pavel Fedin
this brings up a question: why do we need special announce procedure in
vhost-user then?
I have the same question. Here is my guess...
In customized networks, maybe people are not using ARP at all? When
we use DPDK, we directly pass through the network logic inside
kernel itself. So logically all the network protocols could be
customized by the user of it. In the customized network, maybe there
is some other protocol (rather than RARP) that would do the same
thing as what ARP/RARP does. So, this SEND_RARP request could give
the vhost-user backend a chance to format its own announce packet
and broadcast (in the SEND_RARP request, the guest's mac address
will be appended).
CCing Victor to better know the truth...
Peter
Hi,
After a migration, to avoid network outage, the guest must announce its
new location to the L2 layer, typically with a GARP. Otherwise requests
sent to the guest arrive to the old host until a ARP request is sent (after
30 seconds) or the guest sends some data.
QEMU implementation of self announce after a migration with a vhost
- If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest
sends automatically a GARP.
- Else if the vhost backend implements VHOST_USER_SEND_RARP this request
is sent to the vhost backend. When this message is received the vhost
backend must act as it receives a RARP from the guest (purpose of this RARP
is to update switches' MAC->port maaping as a GARP). This RARP is a false
one, created by the vhost backend,
- Else nothing is done and we have a network outage until a ARP is sent
or the guest sends some data.
- the vhost backend announces the support of this feature. Maybe QEMU
can be updated to support unconditionnaly this feature
- the virtio driver of the guest implements this feature. It is not the
case for old kernel or dpdk virtio pmd.
Regarding dpdk to have a migration of vhost interface with limited network
- Implement management VHOST_USER_SEND_RARP request to emulate a fake
RARP for guest
1. Guest with virtio driver implementing VIRTIO_GUEST_ANNOUNCE feature
2. Guest with virtio driver that does not have the VIRTIO_GUEST_ANNOUNCE
feature. This is the case with old kernel or guest running a dpdk (virtio
pmd of dpdk does not have this feature)
Guest with VIRTIO_GUEST_ANNOUNCE feature sends automatically some GARP
after a migration if this feature has been negotiated. So the only thing to
do it is to negotiate the VIRTIO_GUEST_ANNOUNCE feature between QEMU, vhost
backend and the guest.
For this kind of guest the vhost-backend must announce the support of
VIRTIO_GUEST_ANNOUNCE feature. As vhost-backend has no particular action to
do in this case the support of VIRTIO_GUEST_ANNOUNCE feature can be
unconditionally set in QEMU in the future.
For guest without VIRTIO_GUEST_ANNOUNCE feature we have to send a fake
RARP: QEMU knows the MAC address of the guest and can create and broadcast
a RARP. But in case of vhost-backend QEMU is not able to broadcast this
fake RARP and must ask to the vhost backend to do it through the
VHOST_USER_SEND_RARP request. When the vhost backend receives this message
it must create a fake RARP message (as done by QEMU) and do the appropriate
operation as this message has been sent by the guest through the virtio
rings.
- After the migration the guest automatically sends GARP. This solution
occurs if VIRTIO_GUEST_ANNOUNCE feature has been negotiated between QEMU
and the guest.
* VIRTIO_GUEST_ANNOUNCE
Sorry my previous message will be sent by error (it is a draft with rework
in progress)

The full explanation are:

Hi,

After a migration, to avoid network outage, the guest must announce its new
location to the L2 layer, typically with a GARP. Otherwise requests sent to
the guest arrive to the old host until a ARP request is sent (after 30
seconds) or the guest sends some data.

QEMU implementation of self announce after a migration with a vhost backend
is the following:
- If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest sends
automatically a GARP.
- Else if the vhost backend implements VHOST_USER_SEND_RARP this request
is sent to the vhost backend. When this message is received the vhost
backend must act as it receives a RARP from the guest (purpose of this RARP
is to update switches' MAC->port maaping as a GARP). This RARP is a false
one, created by the vhost backend,
- Else nothing is done and we have a network outage until a ARP is sent or
the guest sends some data.


VIRTIO_GUEST_ANNOUNCE feature is negotiated if:
- the vhost backend announces the support of this feature. Maybe QEMU can
be updated to support unconditionnaly this feature
- the virtio driver of the guest implements this feature. It is not the
case for old kernel or dpdk virtio pmd.

Regarding dpdk to have a migration of vhost interface with limited network
outage we have to:
- In the vhost pmd
* Announce supports of VIRTIO_GUEST_ANNOUNCE feature
* Implement management of VHOST_USER_SEND_RARP request to emulate a
fake RARP if the VIRTIO_GUEST_ANNOUNCE feature is not implemented by the
guest

- In the virtio pmd
* Support VIRTIO_GUEST_ANNOUNCE feature to avoid RARP emission by
the host after a migration.


Hope this explanation will help

Regards.

Thibaut.
Pavel Fedin
2015-12-15 12:24:48 UTC
Permalink
Hello!
After a migration, to avoid network outage, the guest must announce its new location to the L2 layer, typically with a GARP. Otherwise requests sent to
the guest arrive to the old host until a ARP request is sent (after 30 seconds) or the guest sends some data.
- If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest sends automatically a GARP.
- Else if the vhost backend implements VHOST_USER_SEND_RARP this request is sent to the vhost backend. When this message is received the vhost backend
must act as it receives a RARP from the guest (purpose of this RARP is to update switches' MAC->port maaping as a GARP). This RARP is a false one,
created by the vhost backend,
- Else nothing is done and we have a network outage until a ARP is sent or the guest sends some data.
But what is qemu_announce_self() then? It's just unconditionally triggered after migration, but indeed sends some strange thing.
- the vhost backend announces the support of this feature. Maybe QEMU can be updated to support unconditionnaly this feature
Wrong. I tried to unconditionally enforce it in qemu (my guest does support it), and the link stopped working at all. I don't understand why.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-15 13:36:12 UTC
Permalink
Post by Pavel Fedin
Hello!
After a migration, to avoid network outage, the guest must announce its new location to the L2 layer, typically with a GARP. Otherwise requests sent to
the guest arrive to the old host until a ARP request is sent (after 30 seconds) or the guest sends some data.
- If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest sends automatically a GARP.
- Else if the vhost backend implements VHOST_USER_SEND_RARP this request is sent to the vhost backend. When this message is received the vhost backend
must act as it receives a RARP from the guest (purpose of this RARP is to update switches' MAC->port maaping as a GARP). This RARP is a false one,
created by the vhost backend,
- Else nothing is done and we have a network outage until a ARP is sent or the guest sends some data.
But what is qemu_announce_self() then? It's just unconditionally triggered after migration, but indeed sends some strange thing.
- the vhost backend announces the support of this feature. Maybe QEMU can be updated to support unconditionnaly this feature
Wrong. I tried to unconditionally enforce it in qemu (my guest does support it), and the link stopped working at all. I don't understand why.
I'm wondering how did you do that? Why do you need enforece it in QEMU?
Isn't it already supported so far?

Actually, what's we need to do is to add such feature bit in vhost
library, to claim we support it so that the the guest will send a
gratuitous ARP when migration is done (check virtio_net_load()).

----
diff --git a/lib/librte_vhost/virtio-net.c
b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
#define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
(1ULL << VIRTIO_NET_F_CTRL_RX) | \
+ (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
(VHOST_SUPPORTS_MQ) | \
(1ULL << VIRTIO_F_VERSION_1) | \
(1ULL << VHOST_F_LOG_ALL) | \

However, I found the GARP is not sent out at all, due to an error
I met and reported before:

KVM: injection failed, MSI lost (Operation not permitted)

Which happened at the time QEMU is about to send the interrupt to the
guest for announce event. However, it failed, hence no GARP was received.

One thing worth noting is that it happened only when I did live migration
on two different hosts (the two hosts happened to be using a same old
kernel: v3.11.10). It works pretty well on same host. So, seems like
a KVM bug then?

--yliu
Pavel Fedin
2015-12-15 13:48:12 UTC
Permalink
Hello!
Post by Pavel Fedin
Post by Pavel Fedin
Wrong. I tried to unconditionally enforce it in qemu (my guest does support it), and the
link stopped working at all. I don't understand why.
I'm wondering how did you do that? Why do you need enforece it in QEMU?
Isn't it already supported so far?
I mean - qemu first asks vhost-user server (ovs/DPDK in our case) about capabilities, then negotiates them with the guest. And DPDK
doesn't report VIRTIO_NET_F_GUEST_ANNOUNCE, so i just ORed this flag in qemu before the negotiation with guest (because indeed my
logic says that the host should not do anything special about it). So the overall effect is the same as in your patch
Post by Pavel Fedin
diff --git a/lib/librte_vhost/virtio-net.c
b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
#define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
(1ULL << VIRTIO_NET_F_CTRL_RX) | \
+ (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
(VHOST_SUPPORTS_MQ) | \
(1ULL << VIRTIO_F_VERSION_1) | \
(1ULL << VHOST_F_LOG_ALL) | \
But i was somehow wrong and this causes the whole thing to stop working instead. Even after just booting up the network doesn't
work and PINGs do not pass.
Post by Pavel Fedin
However, I found the GARP is not sent out at all, due to an error
KVM: injection failed, MSI lost (Operation not permitted)
Interesting, i don't have this problem here. Some bug in your kernel/hardware?
Post by Pavel Fedin
One thing worth noting is that it happened only when I did live migration
on two different hosts (the two hosts happened to be using a same old
kernel: v3.11.10). It works pretty well on same host. So, seems like
a KVM bug then?
3.18.9 here and no this problem.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-15 13:59:07 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Pavel Fedin
Post by Pavel Fedin
Wrong. I tried to unconditionally enforce it in qemu (my guest does support it), and the
link stopped working at all. I don't understand why.
I'm wondering how did you do that? Why do you need enforece it in QEMU?
Isn't it already supported so far?
I mean - qemu first asks vhost-user server (ovs/DPDK in our case) about capabilities, then negotiates them with the guest. And DPDK
doesn't report VIRTIO_NET_F_GUEST_ANNOUNCE, so i just ORed this flag in qemu before the negotiation with guest (because indeed my
logic says that the host should not do anything special about it). So the overall effect is the same as in your patch
I see.
Post by Pavel Fedin
Post by Pavel Fedin
diff --git a/lib/librte_vhost/virtio-net.c
b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
#define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
(1ULL << VIRTIO_NET_F_CTRL_RX) | \
+ (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
(VHOST_SUPPORTS_MQ) | \
(1ULL << VIRTIO_F_VERSION_1) | \
(1ULL << VHOST_F_LOG_ALL) | \
But i was somehow wrong and this causes the whole thing to stop working instead. Even after just booting up the network doesn't
work and PINGs do not pass.
No idea. Maybe you have changed some other configures (such as of ovs)
without notice? Or, the ovs bridge interface resets?

BTW, would you please try my v1 patch set with above diff applied to
see if the ping loss is still there. You might also want to run tcpdump
with the dest host ovs bridge, to see if GARP is actually sent.
Post by Pavel Fedin
Post by Pavel Fedin
However, I found the GARP is not sent out at all, due to an error
KVM: injection failed, MSI lost (Operation not permitted)
I was thinking that may be caused by the difference of my two hosts (a
desktop and a server). I will try to find two similar hosts tomorrow
to do more tests. Besides that, it'd be great if you could do a more
test with above diff applied.

--yliu
Post by Pavel Fedin
Interesting, i don't have this problem here. Some bug in your kernel/hardware?
Post by Pavel Fedin
One thing worth noting is that it happened only when I did live migration
on two different hosts (the two hosts happened to be using a same old
kernel: v3.11.10). It works pretty well on same host. So, seems like
a KVM bug then?
3.18.9 here and no this problem.
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2015-12-15 14:58:28 UTC
Permalink
Hello!
Post by Yuanhan Liu
No idea. Maybe you have changed some other configures (such as of ovs)
without notice? Or, the ovs bridge interface resets?
I don't touch the ovs at all. Just shut down the guest, rebuild the qemu, reinstall it, run the guest.
Post by Yuanhan Liu
BTW, would you please try my v1 patch set with above diff applied to
see if the ping loss is still there. You might also want to run tcpdump
with the dest host ovs bridge, to see if GARP is actually sent.
Retested with wireshark running on the host. I used my qemu patch instead, but it should not matter at all:
--- cut ---
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 1b6c5ac..5ca2987 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -480,7 +480,12 @@ static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)

static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
{
- return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
+ int ret = vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
+
+ if (!ret) {
+ virtio_add_feature(features, VIRTIO_NET_F_GUEST_ANNOUNCE);
+ }
+ return ret;
}

static int vhost_user_set_owner(struct vhost_dev *dev)
--- cut ---

So, here are both wireshark captures on the host side:

1. Without the patch

***@nfv_test_x86_64 / # tshark -i ovs-br0
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ovs-br0'
1 0.000000 :: -> ff02::16 ICMPv6 90 Multicast Listener Report Message v2
2 0.003304 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Reply)
3 0.669957 :: -> ff02::16 ICMPv6 90 Multicast Listener Report Message v2
4 0.858957 :: -> ff02::1:ff3b:831a ICMPv6 78 Neighbor Solicitation for fe80::5054:ff:fe3b:831a
5 1.858968 fe80::5054:ff:fe3b:831a -> ff02::16 ICMPv6 90 Multicast Listener Report Message v2
6 2.300948 fe80::5054:ff:fe3b:831a -> ff02::16 ICMPv6 90 Multicast Listener Report Message v2
7 2.527088 fe80::5054:ff:fe3b:831a -> ff02::2 ICMPv6 62 Router Solicitation
8 2.527800 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
9 6.526814 fe80::5054:ff:fe3b:831a -> ff02::2 ICMPv6 62 Router Solicitation
10 10.526993 fe80::5054:ff:fe3b:831a -> ff02::2 ICMPv6 62 Router Solicitation
11 15.984632 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
12 15.984643 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
13 15.984772 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x0477, seq=1/256, ttl=64
14 15.984798 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x0477, seq=1/256, ttl=64 (request in 13)
15 16.984970 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x0477, seq=2/512, ttl=64
16 16.984991 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x0477, seq=2/512, ttl=64 (request in 15)
17 17.984956 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x0477, seq=3/768, ttl=64
18 17.984975 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x0477, seq=3/768, ttl=64 (request in 17)
19 20.994535 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 Who has 192.168.6.2? Tell 192.168.6.1
20 20.994637 RealtekU_3b:83:1a -> be:e1:71:c1:47:4d ARP 42 192.168.6.2 is at 52:54:00:3b:83:1a
^C20 packets captured

2. With the patch

***@nfv_test_x86_64 / # tshark -i ovs-br0
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ovs-br0'
1 0.000000 :: -> ff02::16 ICMPv6 90 Multicast Listener Report Message v2
2 0.000969 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Reply)
3 0.156966 :: -> ff02::1:ff3b:831a ICMPv6 78 Neighbor Solicitation for fe80::5054:ff:fe3b:831a
4 0.536948 :: -> ff02::16 ICMPv6 90 Multicast Listener Report Message v2
5 1.156968 fe80::5054:ff:fe3b:831a -> ff02::16 ICMPv6 90 Multicast Listener Report Message v2
6 1.312708 fe80::5054:ff:fe3b:831a -> ff02::2 ICMPv6 62 Router Solicitation
7 1.629960 fe80::5054:ff:fe3b:831a -> ff02::16 ICMPv6 90 Multicast Listener Report Message v2
8 2.314713 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
9 5.313333 fe80::5054:ff:fe3b:831a -> ff02::2 ICMPv6 62 Router Solicitation
10 9.315486 fe80::5054:ff:fe3b:831a -> ff02::2 ICMPv6 62 Router Solicitation
11 21.536450 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
12 21.536461 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
13 22.538937 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
14 22.538943 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
15 23.540937 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
16 23.540942 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
17 25.537519 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
18 25.537525 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
19 26.538939 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
20 26.538944 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
21 27.540937 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
22 27.540942 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
23 29.538475 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
24 29.538482 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
25 30.538935 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
26 30.538941 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
27 31.540935 RealtekU_3b:83:1a -> Broadcast ARP 42 Who has 192.168.6.1? Tell 192.168.6.2
28 31.540941 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 192.168.6.1 is at be:e1:71:c1:47:4d
^C28 packets captured

Obviously, the guest simply doesn't read incoming packets. ifconfig for the interface on guest side shows:

RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 9 overruns 0 frame 9

BTW, number 9 exactly matches the number of ARP replies from the host. The question is - why? Looks like guest behavior changes
somehow. Is it a bug in guest? It's very strange, because in these sessions i see only one difference in IPv6 packets:

4 0.858957 :: -> ff02::1:ff3b:831a ICMPv6 78 Neighbor Solicitation for fe80::5054:ff:fe3b:831a

This is present in session #1 and missing from session #2. Can it affect the whole thing somehow? But i don't even use IPv6.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-16 07:28:18 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Yuanhan Liu
No idea. Maybe you have changed some other configures (such as of ovs)
without notice? Or, the ovs bridge interface resets?
I don't touch the ovs at all. Just shut down the guest, rebuild the qemu, reinstall it, run the guest.
Post by Yuanhan Liu
BTW, would you please try my v1 patch set with above diff applied to
see if the ping loss is still there. You might also want to run tcpdump
with the dest host ovs bridge, to see if GARP is actually sent.
--- cut ---
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 1b6c5ac..5ca2987 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -480,7 +480,12 @@ static int vhost_user_get_u64(struct vhost_dev *dev, int request, uint64_t *u64)
static int vhost_user_get_features(struct vhost_dev *dev, uint64_t *features)
{
- return vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
+ int ret = vhost_user_get_u64(dev, VHOST_USER_GET_FEATURES, features);
+
+ if (!ret) {
+ virtio_add_feature(features, VIRTIO_NET_F_GUEST_ANNOUNCE);
+ }
+ return ret;
}
static int vhost_user_set_owner(struct vhost_dev *dev)
--- cut ---
Pavel,

I can reproduce your issue on my side with above patch (and only when
F_GUEST_ANNOUNCE is not set at DPDK vhost lib). TBH, I don't know
why that happened, the cause could be subtle, and I don't think it's
worthwhile to dig it, especially it's not the right way to do it.

So, would you please try to set the F_GUEST_ANNOUNCE flag on DPDK vhost
lib side, as my early diff showed and have another test?

On the other hand, I failed to find two identical server, the two closet
I found are E5-2695 and E5-2699, However, the MSI lost fatal bug still
occurred. I'm out of thoughts what could be the root cause. I'm asking
help from som KVM gurus; hopefully they could shine some lights on.
Meanwhile, I may need try to debug it.

Since you don't meet such issue, I'd hope you could have a test and
tell me how it works :)

Thanks.

--yliu
Pavel Fedin
2015-12-16 11:57:15 UTC
Permalink
Hello!
Post by Yuanhan Liu
I can reproduce your issue on my side with above patch (and only when
F_GUEST_ANNOUNCE is not set at DPDK vhost lib). TBH, I don't know
why that happened, the cause could be subtle, and I don't think it's
worthwhile to dig it, especially it's not the right way to do it.
May be not right, may be it can be done... Actually, i found what was wrong. qemu tries to feed features back to vhost-user via
VHOST_USER_SET_FEATURES, and DPDK barfs on the unknown bit. More tweaking is needed for qemu to do the trick correctly.
Post by Yuanhan Liu
So, would you please try to set the F_GUEST_ANNOUNCE flag on DPDK vhost
lib side, as my early diff showed and have another test?
Tried it, works fine, thank you.
I have almost implemented the workaround in qemu... However now i start to think that you are right. Theoretically, the application
may want to suppress GUEST_ANNOUNCE for some reason. So, let it stay this way. Please include this bit into your v2.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-16 12:08:17 UTC
Permalink
Post by Pavel Fedin
Hello!
Post by Yuanhan Liu
I can reproduce your issue on my side with above patch (and only when
F_GUEST_ANNOUNCE is not set at DPDK vhost lib). TBH, I don't know
why that happened, the cause could be subtle, and I don't think it's
worthwhile to dig it, especially it's not the right way to do it.
May be not right, may be it can be done... Actually, i found what was wrong. qemu tries to feed features back to vhost-user via
VHOST_USER_SET_FEATURES, and DPDK barfs on the unknown bit. More tweaking is needed for qemu to do the trick correctly.
Post by Yuanhan Liu
So, would you please try to set the F_GUEST_ANNOUNCE flag on DPDK vhost
lib side, as my early diff showed and have another test?
Tried it, works fine, thank you.
Thanks for the test.

However, I'm more curious about the ping loss? Did you still see
that? And to be more specific, have the wireshark captured the
GRAP from the guest? And what's the output of 'grep virtio /proc/interrupts'
inside guest?

--yliu
Post by Pavel Fedin
I have almost implemented the workaround in qemu... However now i start to think that you are right. Theoretically, the application
may want to suppress GUEST_ANNOUNCE for some reason. So, let it stay this way. Please include this bit into your v2.
Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Pavel Fedin
2015-12-16 12:43:06 UTC
Permalink
Hello!
Post by Yuanhan Liu
However, I'm more curious about the ping loss? Did you still see
that? And to be more specific, have the wireshark captured the
GRAP from the guest?
Yes, everything is fine.

***@nfv_test_x86_64 /var/log/libvirt/qemu # tshark -i ovs-br0
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ovs-br0'
1 0.000000 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
2 0.000024 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
3 0.049490 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
4 0.049497 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
5 0.199485 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
6 0.199492 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
7 0.449500 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
8 0.449508 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
9 0.517229 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=70/17920, ttl=64
10 0.517277 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=70/17920, ttl=64 (request in 9)
11 0.799521 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
12 0.799553 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
13 1.517210 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=71/18176, ttl=64
14 1.517238 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=71/18176, ttl=64 (request in 13)
15 2.517219 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=72/18432, ttl=64
16 2.517256 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=72/18432, ttl=64 (request in 15)
17 3.517497 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=73/18688, ttl=64
18 3.517518 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=73/18688, ttl=64 (request in 17)
19 4.517219 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=74/18944, ttl=64
20 4.517237 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=74/18944, ttl=64 (request in 19)
21 5.517222 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=75/19200, ttl=64
22 5.517242 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=75/19200, ttl=64 (request in 21)
23 6.517235 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=76/19456, ttl=64
24 6.517256 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=76/19456, ttl=64 (request in 23)
25 6.531466 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 Who has 192.168.6.2? Tell 192.168.6.1
26 6.531619 RealtekU_3b:83:1a -> be:e1:71:c1:47:4d ARP 42 192.168.6.2 is at 52:54:00:3b:83:1a
27 7.517212 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=77/19712, ttl=64
28 7.517229 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=77/19712, ttl=64 (request in 27)

But there's one important detail here. Any replicated network interfaces (LOCAL port in my example) should be fully cloned on both
hosts, including MAC addresses. Otherwise after the migration the guest continues to send packets to old MAC, and, obvious, there's
still ping loss until it redoes the ARP for its ping target.
Post by Yuanhan Liu
And what's the output of 'grep virtio /proc/interrupts' inside guest?
11: 0 0 0 0 IO-APIC 11-fasteoi uhci_hcd:usb1, virtio3
24: 0 0 0 0 PCI-MSI 114688-edge virtio2-config
25: 3544 0 0 0 PCI-MSI 114689-edge virtio2-req.0
26: 10 0 0 0 PCI-MSI 49152-edge virtio0-config
27: 852 0 0 0 PCI-MSI 49153-edge virtio0-input.0
28: 3 0 0 0 PCI-MSI 49154-edge virtio0-output.0
29: 10 0 0 0 PCI-MSI 65536-edge virtio1-config
30: 172 0 0 0 PCI-MSI 65537-edge virtio1-input.0
31: 1 0 0 0 PCI-MSI 65538-edge virtio1-output.0

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Yuanhan Liu
2015-12-16 13:00:27 UTC
Permalink
rYR8N8f/ookveMRL7BfPnj5lw+EJZd+uG+v/lZnBuWidyQ4r
g586/P1rPsQw8p6wT+M7LnqvMLZM9eWq2ht53Bd5liqxFGckGmoxFxUnAgC5sFKthAIAAA==
Status: O
Content-Length: 4853
Lines: 66
Hello!
Post by Yuanhan Liu
However, I'm more curious about the ping loss? Did you still see
that? And to be more specific, have the wireshark captured the
GRAP from the guest?
Yes, everything is fine.
Great!
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ovs-br0'
1 0.000000 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
2 0.000024 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
3 0.049490 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
4 0.049497 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
5 0.199485 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
6 0.199492 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
7 0.449500 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
8 0.449508 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
9 0.517229 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=70/17920, ttl=64
10 0.517277 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=70/17920, ttl=64 (request in 9)
11 0.799521 RealtekU_3b:83:1a -> Broadcast ARP 42 Gratuitous ARP for 192.168.6.2 (Request)
12 0.799553 fe80::5054:ff:fe3b:831a -> ff02::1 ICMPv6 86 Neighbor Advertisement fe80::5054:ff:fe3b:831a (ovr) is at
52:54:00:3b:83:1a
13 1.517210 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=71/18176, ttl=64
14 1.517238 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=71/18176, ttl=64 (request in 13)
15 2.517219 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=72/18432, ttl=64
16 2.517256 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=72/18432, ttl=64 (request in 15)
17 3.517497 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=73/18688, ttl=64
18 3.517518 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=73/18688, ttl=64 (request in 17)
19 4.517219 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=74/18944, ttl=64
20 4.517237 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=74/18944, ttl=64 (request in 19)
21 5.517222 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=75/19200, ttl=64
22 5.517242 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=75/19200, ttl=64 (request in 21)
23 6.517235 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=76/19456, ttl=64
24 6.517256 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=76/19456, ttl=64 (request in 23)
25 6.531466 be:e1:71:c1:47:4d -> RealtekU_3b:83:1a ARP 42 Who has 192.168.6.2? Tell 192.168.6.1
26 6.531619 RealtekU_3b:83:1a -> be:e1:71:c1:47:4d ARP 42 192.168.6.2 is at 52:54:00:3b:83:1a
27 7.517212 192.168.6.2 -> 192.168.6.1 ICMP 98 Echo (ping) request id=0x04af, seq=77/19712, ttl=64
28 7.517229 192.168.6.1 -> 192.168.6.2 ICMP 98 Echo (ping) reply id=0x04af, seq=77/19712, ttl=64 (request in 27)
But there's one important detail here. Any replicated network interfaces (LOCAL port in my example) should be fully cloned on both
hosts, including MAC addresses. Otherwise after the migration the guest continues to send packets to old MAC, and, obvious, there's
still ping loss until it redoes the ARP for its ping target.
I see. And here I care more about whether we can get the GARP from the
target guest just after the migration. If you can, everything should
be fine.
Post by Yuanhan Liu
And what's the output of 'grep virtio /proc/interrupts' inside guest?
11: 0 0 0 0 IO-APIC 11-fasteoi uhci_hcd:usb1, virtio3
24: 0 0 0 0 PCI-MSI 114688-edge virtio2-config
25: 3544 0 0 0 PCI-MSI 114689-edge virtio2-req.0
26: 10 0 0 0 PCI-MSI 49152-edge virtio0-config
The GUEST_ANNOUNCE has indeed been triggered. That's great! I just have
no idea why I can't get any config IRQ from the guest after the migration.
(I can for migratin inside one same host, but not on two hosts).

In my first tries, I just got an error message telling me that the MSI is
just lost. I then found it may because I'm using a customized guest kernel.
I then switched to the kernel shipped by Fedora 22, I no longer see such
error, but I still don't see such interrupt generated inside the guest, either.

It might still be an issue on my side. Even it's not, it's likely a KVM
bug, but not from vhost-user. And glad it works on your side :)

So, I will send v2 tomorow.

--yliu
Yuanhan Liu
2015-12-15 13:18:12 UTC
Permalink
Post by Thibaut Collet
Post by Thibaut Collet
Post by Pavel Fedin
Post by Pavel Fedin
To tell the truth, i don't know. I am also learning qemu internals on
the fly. Indeed, i see that it should announce itself. But
Post by Pavel Fedin
this brings up a question: why do we need special announce procedure in
vhost-user then?
I have the same question. Here is my guess...
In customized networks, maybe people are not using ARP at all? When
we use DPDK, we directly pass through the network logic inside
kernel itself. So logically all the network protocols could be
customized by the user of it. In the customized network, maybe there
is some other protocol (rather than RARP) that would do the same
thing as what ARP/RARP does. So, this SEND_RARP request could give
the vhost-user backend a chance to format its own announce packet
and broadcast (in the SEND_RARP request, the guest's mac address
will be appended).
CCing Victor to better know the truth...
Peter
Hey Thibaut,

First of all, thanks a lot for your lengthy explanation.
Post by Thibaut Collet
Post by Thibaut Collet
Hi,
After a migration, to avoid network outage, the guest must announce its
new location to the L2 layer, typically with a GARP. Otherwise requests
sent to the guest arrive to the old host until a ARP request is sent (after
30 seconds) or the guest sends some data.
QEMU implementation of self announce after a migration with a vhost
- If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest
sends automatically a GARP.
I'm kind of clear how VIRTIO_GUEST_ANNOUNCE works so far, except that I
met a bug, which I will describe in another email.
Post by Thibaut Collet
Post by Thibaut Collet
- Else if the vhost backend implements VHOST_USER_SEND_RARP this request
is sent to the vhost backend. When this message is received the vhost
backend must act as it receives a RARP from the guest (purpose of this RARP
Can you be more specific about this? Say, what kind of acts the vhost
backend should do exactly?
Post by Thibaut Collet
Post by Thibaut Collet
is to update switches' MAC->port maaping as a GARP).
Isn't it vhost library is not aware of swtich at all? How could we
update switches's MAC-port mapping inside vhost library?
Post by Thibaut Collet
This RARP is a false
Post by Thibaut Collet
one, created by the vhost backend,
I'm a bit confused now. You were just saying "vhost backend must act
as it __recevives__ a RARP from the guest", and you are now saying
"the RARP is a false one __created__ by the vhost backend".

Thanks.

--yliu
Thibaut Collet
2015-12-15 15:07:57 UTC
Permalink
Post by Pavel Fedin
On Tue, Dec 15, 2015 at 12:43 PM, Thibaut Collet <
Post by Thibaut Collet
Post by Pavel Fedin
Post by Pavel Fedin
To tell the truth, i don't know. I am also learning qemu internals
on
Post by Thibaut Collet
Post by Pavel Fedin
the fly. Indeed, i see that it should announce itself. But
Post by Pavel Fedin
this brings up a question: why do we need special announce
procedure in
Post by Thibaut Collet
Post by Pavel Fedin
vhost-user then?
I have the same question. Here is my guess...
In customized networks, maybe people are not using ARP at all? When
we use DPDK, we directly pass through the network logic inside
kernel itself. So logically all the network protocols could be
customized by the user of it. In the customized network, maybe there
is some other protocol (rather than RARP) that would do the same
thing as what ARP/RARP does. So, this SEND_RARP request could give
the vhost-user backend a chance to format its own announce packet
and broadcast (in the SEND_RARP request, the guest's mac address
will be appended).
CCing Victor to better know the truth...
Peter
Hey Thibaut,
First of all, thanks a lot for your lengthy explanation.
Post by Thibaut Collet
Hi,
After a migration, to avoid network outage, the guest must announce its
new location to the L2 layer, typically with a GARP. Otherwise requests
sent to the guest arrive to the old host until a ARP request is sent
(after
Post by Thibaut Collet
30 seconds) or the guest sends some data.
QEMU implementation of self announce after a migration with a vhost
- If the VIRTIO_GUEST_ANNOUNCE feature has been negotiated the guest
sends automatically a GARP.
I'm kind of clear how VIRTIO_GUEST_ANNOUNCE works so far, except that I
met a bug, which I will describe in another email.
Post by Thibaut Collet
- Else if the vhost backend implements VHOST_USER_SEND_RARP this
request
Post by Thibaut Collet
is sent to the vhost backend. When this message is received the vhost
backend must act as it receives a RARP from the guest (purpose of this
RARP
Can you be more specific about this? Say, what kind of acts the vhost
backend should do exactly?
Post by Thibaut Collet
is to update switches' MAC->port maaping as a GARP).
Isn't it vhost library is not aware of swtich at all? How could we
update switches's MAC-port mapping inside vhost library?
This RARP is a false
Post by Thibaut Collet
one, created by the vhost backend,
I'm a bit confused now. You were just saying "vhost backend must act
as it __recevives__ a RARP from the guest", and you are now saying
"the RARP is a false one __created__ by the vhost backend".
Thanks.
--yliu
After a migration, to avoid netwotk outage, all interfaces of the guest
must send a packet to update switches mapping (ideally a GARP).
As some interfaces do not do it QEMU does it in behalf of the guest by
sending a RARP (his RARP is not forged by the guest but by QEMU). This is
the qemu_self_announce purpose that "spoofs" a RARP to all backend of guest
ethernet interfaces. For vhost-user backend, QEMU can not do it directly
and asks to the vhost-user backend to do it with the VHOST_USER_SEND_RARP
request that contains the MAC address of the guest interface.

Thibaut.
Pavel Fedin
2015-12-15 15:36:43 UTC
Permalink
Hello!
After a migration, to avoid netwotk outage, all interfaces of the guest must send a packet to update switches mapping (ideally a GARP).
As some interfaces do not do it QEMU does it in behalf of the guest by sending a RARP (his RARP is not forged by the guest but by QEMU). This is the
qemu_self_announce purpose that "spoofs" a RARP to all backend of guest ethernet interfaces. For vhost-user backend, QEMU can not do it directly
Aha, see it now. qemu_announce_self() uses qemu_foreach_nic(), which actually iterates only over NET_CLIENT_OPTIONS_KIND_NIC interfaces. I expect these are fully emulated hardware controllers. virtio uses another type (see enum NetClientOptionsKind).
So, we can happily ignore qemu_announce_self(), it does not do anything for us. Thanks for pointing it out.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Peter Xu
2015-12-16 02:38:03 UTC
Permalink
Post by Thibaut Collet
After a migration, to avoid netwotk outage, all interfaces of the guest
must send a packet to update switches mapping (ideally a GARP).
As some interfaces do not do it QEMU does it in behalf of the guest by
sending a RARP (his RARP is not forged by the guest but by QEMU). This is
the qemu_self_announce purpose that "spoofs" a RARP to all backend of guest
ethernet interfaces. For vhost-user backend, QEMU can not do it directly
and asks to the vhost-user backend to do it with the VHOST_USER_SEND_RARP
request that contains the MAC address of the guest interface.
Thibaut.
Hi, Thibaut,

Thanks for the explaination.

Two more questions:

1. if vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
send another RARP (or say, GARP, I will use RARP as example),
then there will be two RARP later on the line, right? (since the
QEMU one is sent unconditionally from qemu_announce_self).

2. if the only thing vhost-user backend is to send another same RARP
when got SEND_RARP request, why would it bother if QEMU will
unconditionally send one? (or say, I still do not know why we
need this SEND_RARP request, if the vhost-user backend is going
to do the same thing again as QEMU already does)

Thanks in advance.
Peter
Yuanhan Liu
2015-12-16 02:50:19 UTC
Permalink
Post by Peter Xu
Post by Thibaut Collet
After a migration, to avoid netwotk outage, all interfaces of the guest
must send a packet to update switches mapping (ideally a GARP).
As some interfaces do not do it QEMU does it in behalf of the guest by
sending a RARP (his RARP is not forged by the guest but by QEMU). This is
the qemu_self_announce purpose that "spoofs" a RARP to all backend of guest
ethernet interfaces. For vhost-user backend, QEMU can not do it directly
and asks to the vhost-user backend to do it with the VHOST_USER_SEND_RARP
request that contains the MAC address of the guest interface.
Thibaut.
Hi, Thibaut,
Thanks for the explaination.
1. if vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
send another RARP (or say, GARP, I will use RARP as example),
then there will be two RARP later on the line, right? (since the
QEMU one is sent unconditionally from qemu_announce_self).
The one send by qemu_announce_self() will be caught by
vhost_user_receive(), which ends up invoking vhost_user_migration_done().
And it will be dropped when VIRTIO_NET_F_GUEST_ANNOUNCE is negotiated
there.
Post by Peter Xu
2. if the only thing vhost-user backend is to send another same RARP
when got SEND_RARP request, why would it bother if QEMU will
unconditionally send one? (or say, I still do not know why we
need this SEND_RARP request, if the vhost-user backend is going
to do the same thing again as QEMU already does)
Because that one is caught by vhost-user, and vhost-user just relays
it to the backend when necessary (say when GUEST_ANNOUNCE is not
supported)?

--yliu
Pavel Fedin
2015-12-16 07:05:31 UTC
Permalink
Hello!
Post by Peter Xu
1. if vhost-user backend (or say, DPDK) supports GUEST_ANNOUNCE, and
send another RARP (or say, GARP, I will use RARP as example),
then there will be two RARP later on the line, right? (since the
QEMU one is sent unconditionally from qemu_announce_self).
qemu_announce_self() is NOT unconditional. It applies only to emulated physical NICs and bypasses virtio/vhost. So it will not send anything at all for vhost-user.
Post by Peter Xu
2. if the only thing vhost-user backend is to send another same RARP
when got SEND_RARP request, why would it bother if QEMU will
unconditionally send one?
See above, it won't send one.
It looks to me like qemu_announce_self() is just a poor man's solution which even doesn't always work (because GARP should reassociate an existing IP with new MAC, shouldn't it? and qemu doesn't know the IP and just sets both src and dst to 0.0.0.0).

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Peter Xu
2015-12-15 09:42:54 UTC
Permalink
Post by Yuanhan Liu
Post by Peter Xu
If ping to guest from outside, when the migration finishes on the
target side of qemu, qemu_self_announce() will be called.
It's supposed to see some ARP requests if I run tcpdump against
with the ovs bridge, right? However, in fact, I saw nothing from
tcpdump on the target host when the migration is done.
I mean I do find that qemu_annouce_self composes an ARP
broadcoast request, but it seems that I didn't catch it on
the target host.
Something wrong, or someting I missed?
AFAIK, it should be RARP rather than ARP request. However, sorry
that I do not know the reason for its lossing either.

Btw, I did a very basic live migration using v1 patchset locally
today (one host, two QEMU instances attach to the same vhost-user
socket), it's working too on my host. :)

Thanks.
Peter
Post by Yuanhan Liu
--yliu
Post by Peter Xu
Although
we might see a warning like "Vhost user backend fails to broadcast
fake RARP" (notify is done by hacking vhost_user_receive(), even if
notify fails, things will still move on), QEMU should still send a
RARP onto the link.
Not sure whether I missed anything.
Thanks.
Peter
Yuanhan Liu
2015-12-17 03:11:56 UTC
Permalink
VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.

This request introduces a new payload:

typedef struct VhostUserLog {
uint64_t mmap_size;
uint64_t mmap_offset;
} VhostUserLog;

Also, a fd is delivered from QEMU by ancillary data.

With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
Signed-off-by: Victor Kaplansky <***@redhat.com
---

v2: workaround mmap issue when offset is not zero
---
lib/librte_vhost/rte_virtio_net.h | 4 ++-
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++--
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 ++++
lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 1 +
5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 10dcb90..8acee02 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -129,7 +129,9 @@ struct virtio_net {
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
void *priv; /**< private context */
- uint64_t reserved[64]; /**< Reserve some spaces for future extension. */
+ uint64_t log_size; /**< Size of log area */
+ uint64_t log_base; /**< Where dirty pages are logged */
+ uint64_t reserved[62]; /**< Reserve some spaces for future extension. */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8b7a448..32ad6f6 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
break;

case VHOST_USER_SET_LOG_BASE:
- RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
- break;
+ user_set_log_base(ctx, &msg);

+ /* it needs a reply */
+ msg.size = sizeof(msg.payload.u64);
+ send_vhost_message(connfd, &msg);
+ break;
case VHOST_USER_SET_LOG_FD:
close(msg.fds[0]);
RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
} VhostUserMemory;

+typedef struct VhostUserLog {
+ uint64_t mmap_size;
+ uint64_t mmap_offset;
+} VhostUserLog;
+
typedef struct VhostUserMsg {
VhostUserRequest request;

@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
struct vhost_vring_state state;
struct vhost_vring_addr addr;
VhostUserMemory memory;
+ VhostUserLog log;
} payload;
int fds[VHOST_MEMORY_MAX_NREGIONS];
} __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..b77c9b3 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,

dev->protocol_features = protocol_features;
}
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+ struct VhostUserMsg *msg)
+{
+ struct virtio_net *dev;
+ int fd = msg->fds[0];
+ uint64_t size, off;
+ void *addr;
+
+ dev = get_device(ctx);
+ if (!dev)
+ return -1;
+
+ if (fd < 0) {
+ RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+ return -1;
+ }
+
+ if (msg->size != sizeof(VhostUserLog)) {
+ RTE_LOG(ERR, VHOST_CONFIG,
+ "invalid log base msg size: %"PRId32" != %d\n",
+ msg->size, (int)sizeof(VhostUserLog));
+ return -1;
+ }
+
+ size = msg->payload.log.mmap_size;
+ off = msg->payload.log.mmap_offset;
+ RTE_LOG(INFO, VHOST_CONFIG,
+ "log mmap size: %"PRId64", offset: %"PRId64"\n",
+ size, off);
+
+ /*
+ * mmap from 0 to workaround a hugepage mmap bug: mmap will be
+ * failed when offset is not page size aligned.
+ */
+ addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED) {
+ RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+ return -1;
+ }
+
+ /* TODO: unmap on stop */
+ dev->log_base = (uint64_t)(uintptr_t)addr + off;
+ dev->log_size = size;
+
+ return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);

void user_set_protocol_features(struct vhost_device_ctx ctx,
uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);

int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
--
1.9.0
Xie, Huawei
2015-12-21 15:32:53 UTC
Permalink
Post by Yuanhan Liu
VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.
typedef struct VhostUserLog {
uint64_t mmap_size;
uint64_t mmap_offset;
} VhostUserLog;
Also, a fd is delivered from QEMU by ancillary data.
With those info given, an area of memory is mmaped, assigned
to dev->log_base, for logging dirty pages.
---
v2: workaround mmap issue when offset is not zero
---
lib/librte_vhost/rte_virtio_net.h | 4 ++-
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++--
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 ++++
lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 1 +
5 files changed, 63 insertions(+), 3 deletions(-)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 10dcb90..8acee02 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -129,7 +129,9 @@ struct virtio_net {
char ifname[IF_NAME_SZ]; /**< Name of the tap device or socket path. */
uint32_t virt_qp_nb; /**< number of queue pair we have allocated */
void *priv; /**< private context */
- uint64_t reserved[64]; /**< Reserve some spaces for future extension. */
+ uint64_t log_size; /**< Size of log area */
+ uint64_t log_base; /**< Where dirty pages are logged */
+ uint64_t reserved[62]; /**< Reserve some spaces for future extension. */
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
} __rte_cache_aligned;
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8b7a448..32ad6f6 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
break;
- RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
- break;
+ user_set_log_base(ctx, &msg);
+ /* it needs a reply */
+ msg.size = sizeof(msg.payload.u64);
+ send_vhost_message(connfd, &msg);
+ break;
close(msg.fds[0]);
RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
} VhostUserMemory;
+typedef struct VhostUserLog {
+ uint64_t mmap_size;
+ uint64_t mmap_offset;
+} VhostUserLog;
+
typedef struct VhostUserMsg {
VhostUserRequest request;
@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
struct vhost_vring_state state;
struct vhost_vring_addr addr;
VhostUserMemory memory;
+ VhostUserLog log;
} payload;
int fds[VHOST_MEMORY_MAX_NREGIONS];
} __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..b77c9b3 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
dev->protocol_features = protocol_features;
}
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+ struct VhostUserMsg *msg)
+{
+ struct virtio_net *dev;
+ int fd = msg->fds[0];
+ uint64_t size, off;
+ void *addr;
+
+ dev = get_device(ctx);
+ if (!dev)
+ return -1;
+
+ if (fd < 0) {
+ RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+ return -1;
+ }
+
+ if (msg->size != sizeof(VhostUserLog)) {
+ RTE_LOG(ERR, VHOST_CONFIG,
+ "invalid log base msg size: %"PRId32" != %d\n",
+ msg->size, (int)sizeof(VhostUserLog));
+ return -1;
+ }
+
+ size = msg->payload.log.mmap_size;
+ off = msg->payload.log.mmap_offset;
+ RTE_LOG(INFO, VHOST_CONFIG,
+ "log mmap size: %"PRId64", offset: %"PRId64"\n",
+ size, off);
+
+ /*
+ * mmap from 0 to workaround a hugepage mmap bug: mmap will be
+ * failed when offset is not page size aligned.
+ */
s /will be failed/will fail/
mmap will fail when offset is not zero.
Also we only know this workaround is for hugetlbfs. Not sure of other
tmpfs, so mention hugetlbfs here.
Post by Yuanhan Liu
+ addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED) {
+ RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+ return -1;
+ }
+
+ /* TODO: unmap on stop */
+ dev->log_base = (uint64_t)(uintptr_t)addr + off;
(uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
Post by Yuanhan Liu
+ dev->log_size = size;
+
+ return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
void user_set_protocol_features(struct vhost_device_ctx ctx,
uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
Yuanhan Liu
2015-12-22 02:25:41 UTC
Permalink
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ /*
+ * mmap from 0 to workaround a hugepage mmap bug: mmap will be
+ * failed when offset is not page size aligned.
+ */
s /will be failed/will fail/
mmap will fail when offset is not zero.
Also we only know this workaround is for hugetlbfs. Not sure of other
tmpfs, so mention hugetlbfs here.
I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
not enough?
Post by Xie, Huawei
Post by Yuanhan Liu
+ addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED) {
+ RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+ return -1;
+ }
+
+ /* TODO: unmap on stop */
+ dev->log_base = (uint64_t)(uintptr_t)addr + off;
(uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
No, addr is of (void *) type, we should cast it to uint64_t type first,
before adding it with "off".

--yliu
Post by Xie, Huawei
Post by Yuanhan Liu
+ dev->log_size = size;
+
+ return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
void user_set_protocol_features(struct vhost_device_ctx ctx,
uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
Xie, Huawei
2015-12-22 02:41:43 UTC
Permalink
-----Original Message-----
Sent: Tuesday, December 22, 2015 10:26 AM
To: Xie, Huawei
Bernard; Pavel Fedin; Peter Xu
Subject: Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE
request
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ /*
+ * mmap from 0 to workaround a hugepage mmap bug: mmap will be
+ * failed when offset is not page size aligned.
+ */
s /will be failed/will fail/
mmap will fail when offset is not zero.
I mistake for 4KB page size. Please check if huge page size align is enough.
Post by Xie, Huawei
Also we only know this workaround is for hugetlbfs. Not sure of
other
Post by Xie, Huawei
tmpfs, so mention hugetlbfs here.
I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
not enough?
Yes.
Post by Xie, Huawei
Post by Yuanhan Liu
+ addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
0);
Post by Xie, Huawei
Post by Yuanhan Liu
+ if (addr == MAP_FAILED) {
+ RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+ return -1;
+ }
+
+ /* TODO: unmap on stop */
+ dev->log_base = (uint64_t)(uintptr_t)addr + off;
(uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
No, addr is of (void *) type, we should cast it to uint64_t type first,
before adding it with "off".
--yliu
RTE_PTR_ADD is the DPDK interface for pointer arithmetic operation.
Post by Xie, Huawei
Post by Yuanhan Liu
+ dev->log_size = size;
+
+ return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h
b/lib/librte_vhost/vhost_user/virtio-net-user.h
Post by Xie, Huawei
Post by Yuanhan Liu
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx,
struct VhostUserMsg *);
Post by Xie, Huawei
Post by Yuanhan Liu
void user_set_protocol_features(struct vhost_device_ctx ctx,
uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct
VhostUserMsg *);
Post by Xie, Huawei
Post by Yuanhan Liu
int user_get_vring_base(struct vhost_device_ctx, struct
vhost_vring_state *);
Yuanhan Liu
2015-12-22 02:55:23 UTC
Permalink
Post by Xie, Huawei
-----Original Message-----
Sent: Tuesday, December 22, 2015 10:26 AM
To: Xie, Huawei
Bernard; Pavel Fedin; Peter Xu
Subject: Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE
request
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ /*
+ * mmap from 0 to workaround a hugepage mmap bug: mmap will be
+ * failed when offset is not page size aligned.
+ */
s /will be failed/will fail/
mmap will fail when offset is not zero.
I mistake for 4KB page size.
Didn't follow you.
Post by Xie, Huawei
Please check if huge page size align is enough.
It should be. However, I don't think we need bother to do that:
first of all, it happened on few specific old kernels. And, "off"
here is kind of guaranteed to be 0. Last, even it's not, mmaping
it from 0 will resolve that.
Post by Xie, Huawei
Post by Xie, Huawei
Also we only know this workaround is for hugetlbfs. Not sure of
other
Post by Xie, Huawei
tmpfs, so mention hugetlbfs here.
I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
not enough?
Yes.
Post by Xie, Huawei
Post by Yuanhan Liu
+ addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
0);
Post by Xie, Huawei
Post by Yuanhan Liu
+ if (addr == MAP_FAILED) {
+ RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+ return -1;
+ }
+
+ /* TODO: unmap on stop */
+ dev->log_base = (uint64_t)(uintptr_t)addr + off;
(uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
No, addr is of (void *) type, we should cast it to uint64_t type first,
before adding it with "off".
--yliu
RTE_PTR_ADD is the DPDK interface for pointer arithmetic operation.
log_base is with "uint64_t" type, RTE_PTR_ADD() returns (void*), so it
won't work here.

--yliu
Yuanhan Liu
2015-12-17 03:11:57 UTC
Permalink
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.

Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
Signed-off-by: Victor Kaplansky <***@redhat.com
---
lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 8acee02..5726683 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
*/

#include <stdint.h>
+#include <linux/vhost.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>
#include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
/* Backend value set by guest. */
#define VIRTIO_DEV_STOPPED -1

+#define VHOST_LOG_PAGE 4096
+

/* Enum for virtqueue management. */
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
return vhost_va;
}

+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+ log_base[page / 8] |= 1 << (page % 8);
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
+
+ if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+ return;
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
+ vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+ page += VHOST_LOG_PAGE;
+ }
+}
+
+
/**
* Disable features in feature_mask. Returns 0 on success.
*/
--
1.9.0
Xie, Huawei
2015-12-21 15:06:43 UTC
Permalink
Post by Yuanhan Liu
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.
Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.
---
lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 8acee02..5726683 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
*/
#include <stdint.h>
+#include <linux/vhost.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>
#include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
/* Backend value set by guest. */
#define VIRTIO_DEV_STOPPED -1
+#define VHOST_LOG_PAGE 4096
+
/* Enum for virtqueue management. */
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
return vhost_va;
}
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+ log_base[page / 8] |= 1 << (page % 8);
+}
+
Those logging functions are not supposed to be API. Could we move them
into an internal header file?
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
Before we log, we need memory barrier to make sure updates are in place.
Post by Yuanhan Liu
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
+
+ if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+ return;
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
Let us have a page_end var to make the code simpler?
Post by Yuanhan Liu
+ vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+ page += VHOST_LOG_PAGE;
page += 1?
Post by Yuanhan Liu
+ }
+}
+
+
/**
* Disable features in feature_mask. Returns 0 on success.
*/
Yuanhan Liu
2015-12-22 02:40:58 UTC
Permalink
Post by Xie, Huawei
Post by Yuanhan Liu
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.
Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.
---
lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 8acee02..5726683 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
*/
#include <stdint.h>
+#include <linux/vhost.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>
#include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
/* Backend value set by guest. */
#define VIRTIO_DEV_STOPPED -1
+#define VHOST_LOG_PAGE 4096
+
/* Enum for virtqueue management. */
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
return vhost_va;
}
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+ log_base[page / 8] |= 1 << (page % 8);
+}
+
Those logging functions are not supposed to be API. Could we move them
into an internal header file?
Agreed. I should have put them into vhost_rxtx.c
Post by Xie, Huawei
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
Before we log, we need memory barrier to make sure updates are in place.
Post by Yuanhan Liu
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Put a memory barrier inside set_features()?

I see no var dependence here, why putting a barrier then? We are
accessing and modifying same var, doesn't the cache MESI protocol
will get rid of your concerns?
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+ return;
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
Let us have a page_end var to make the code simpler?
Could do that.
Post by Xie, Huawei
Post by Yuanhan Liu
+ vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+ page += VHOST_LOG_PAGE;
page += 1?
Oops, right.

--yliu
Post by Xie, Huawei
Post by Yuanhan Liu
+ }
+}
+
+
/**
* Disable features in feature_mask. Returns 0 on success.
*/
Xie, Huawei
2015-12-22 02:45:52 UTC
Permalink
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
Introduce vhost_log_write() helper function to log the dirty pages we
touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
log is presented by 1 bit.
Therefore, vhost_log_write() simply finds the right bit for related
page we are gonna change, and set it to 1. dev->log_base denotes the
start of the dirty page bitmap.
---
lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 8acee02..5726683 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
*/
#include <stdint.h>
+#include <linux/vhost.h>
#include <linux/virtio_ring.h>
#include <linux/virtio_net.h>
#include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
/* Backend value set by guest. */
#define VIRTIO_DEV_STOPPED -1
+#define VHOST_LOG_PAGE 4096
+
/* Enum for virtqueue management. */
enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
return vhost_va;
}
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+ log_base[page / 8] |= 1 << (page % 8);
+}
+
Those logging functions are not supposed to be API. Could we move them
into an internal header file?
Agreed. I should have put them into vhost_rxtx.c
Post by Xie, Huawei
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
Before we log, we need memory barrier to make sure updates are in place.
Post by Yuanhan Liu
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Put a memory barrier inside set_features()?
I see no var dependence here, why putting a barrier then? We are
accessing and modifying same var, doesn't the cache MESI protocol
will get rid of your concerns?
This fence isn't about feature var. It is to ensure that updates to the
guest buffer are committed before the logging.
For IA strong memory model, compiler barrier is enough. For other weak
memory model, fence is required.
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+ return;
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
Let us have a page_end var to make the code simpler?
Could do that.
Post by Xie, Huawei
Post by Yuanhan Liu
+ vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+ page += VHOST_LOG_PAGE;
page += 1?
Oops, right.
--yliu
Post by Xie, Huawei
Post by Yuanhan Liu
+ }
+}
+
+
/**
* Disable features in feature_mask. Returns 0 on success.
*/
Yuanhan Liu
2015-12-22 03:04:32 UTC
Permalink
Post by Xie, Huawei
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
Before we log, we need memory barrier to make sure updates are in place.
Post by Yuanhan Liu
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Put a memory barrier inside set_features()?
I see no var dependence here, why putting a barrier then? We are
accessing and modifying same var, doesn't the cache MESI protocol
will get rid of your concerns?
This fence isn't about feature var. It is to ensure that updates to the
guest buffer are committed before the logging.
Oh.., I was thinking you were talking about the "dev->features" field
concurrent access and modify you mentioned from V1.
Post by Xie, Huawei
For IA strong memory model, compiler barrier is enough. For other weak
memory model, fence is required.
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+ return;
So that I should put a "rte_mb()" __here__?

--yliu
Post by Xie, Huawei
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
Let us have a page_end var to make the code simpler?
Could do that.
Post by Xie, Huawei
Post by Yuanhan Liu
+ vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+ page += VHOST_LOG_PAGE;
page += 1?
Oops, right.
--yliu
Post by Xie, Huawei
Post by Yuanhan Liu
+ }
+}
+
+
/**
* Disable features in feature_mask. Returns 0 on success.
*/
Xie, Huawei
2015-12-22 07:02:10 UTC
Permalink
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
Before we log, we need memory barrier to make sure updates are in place.
Post by Yuanhan Liu
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
Put a memory barrier inside set_features()?
I see no var dependence here, why putting a barrier then? We are
accessing and modifying same var, doesn't the cache MESI protocol
will get rid of your concerns?
This fence isn't about feature var. It is to ensure that updates to the
guest buffer are committed before the logging.
Oh.., I was thinking you were talking about the "dev->features" field
concurrent access and modify you mentioned from V1.
Post by Xie, Huawei
For IA strong memory model, compiler barrier is enough. For other weak
memory model, fence is required.
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+ return;
So that I should put a "rte_mb()" __here__?
--yliu
I find that we already have the arch dependent version of rte_smp_wmb()
--huawei
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
Post by Xie, Huawei
Post by Yuanhan Liu
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
Let us have a page_end var to make the code simpler?
Could do that.
Post by Xie, Huawei
Post by Yuanhan Liu
+ vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+ page += VHOST_LOG_PAGE;
page += 1?
Oops, right.
--yliu
Post by Xie, Huawei
Post by Yuanhan Liu
+ }
+}
+
+
/**
* Disable features in feature_mask. Returns 0 on success.
*/
Peter Xu
2015-12-22 05:11:02 UTC
Permalink
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
+
+ if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
Should it be "<="?

Peter
Yuanhan Liu
2015-12-22 06:09:35 UTC
Permalink
Post by Peter Xu
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
+
+ if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
Should it be "<="?
Right, thanks for catching it.

--yliu
Yuanhan Liu
2015-12-17 03:11:58 UTC
Permalink
Introducing a vhost_log_write() wrapper, vhost_log_used_vring, to
log used vring changes.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
Signed-off-by: Victor Kaplansky <***@redhat.com
---
lib/librte_vhost/rte_virtio_net.h | 3 +-
lib/librte_vhost/vhost_rxtx.c | 80 +++++++++++++++++++++++++++------------
lib/librte_vhost/virtio-net.c | 4 ++
3 files changed, 62 insertions(+), 25 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5726683..0f83719 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -93,7 +93,8 @@ struct vhost_virtqueue {
int callfd; /**< Used to notify the guest (trigger interrupt). */
int kickfd; /**< Currently unused as polling mode is enabled. */
int enabled;
- uint64_t reserved[16]; /**< Reserve some spaces for future extension. */
+ uint64_t log_guest_addr; /**< Physical address of used ring, for logging */
+ uint64_t reserved[15]; /**< Reserve some spaces for future extension. */
struct buf_vector buf_vec[BUF_VECTOR_MAX]; /**< for scatter RX. */
} __rte_cache_aligned;

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..f305acd 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -49,6 +49,16 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
}

+static inline void __attribute__((always_inline))
+vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ uint64_t addr;
+
+ addr = vq->log_guest_addr + offset;
+ vhost_log_write(dev, addr, len);
+}
+
/**
* This function adds buffers to the virtio devices RX virtqueue. Buffers can
* be received from the physical port or from another virtio device. A packet
@@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
uint32_t offset = 0, vb_offset = 0;
uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
uint8_t hdr = 0, uncompleted_pkt = 0;
+ uint16_t idx;

/* Get descriptor from available ring */
desc = &vq->desc[head[packet_success]];
@@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
}

/* Update used ring with desc information */
- vq->used->ring[res_cur_idx & (vq->size - 1)].id =
- head[packet_success];
+ idx = res_cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id = head[packet_success];

/* Drop the packet if it is uncompleted */
if (unlikely(uncompleted_pkt == 1))
- vq->used->ring[res_cur_idx & (vq->size - 1)].len =
- vq->vhost_hlen;
+ vq->used->ring[idx].len = vq->vhost_hlen;
else
- vq->used->ring[res_cur_idx & (vq->size - 1)].len =
- pkt_len + vq->vhost_hlen;
+ vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));

res_cur_idx++;
packet_success++;
@@ -236,6 +249,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,

*(volatile uint16_t *)&vq->used->idx += count;
vq->last_used_idx = res_end_idx;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));

/* flush used->idx update before we read avail->flags. */
rte_mb();
@@ -265,6 +281,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
uint32_t seg_avail;
uint32_t vb_avail;
uint32_t cpy_len, entry_len;
+ uint16_t idx;

if (pkt == NULL)
return 0;
@@ -302,16 +319,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
entry_len = vq->vhost_hlen;

if (vb_avail == 0) {
- uint32_t desc_idx =
- vq->buf_vec[vec_idx].desc_idx;
+ uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+ if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+ idx = cur_idx & (vq->size - 1);

- if ((vq->desc[desc_idx].flags
- & VRING_DESC_F_NEXT) == 0) {
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
- = vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+ vq->used->ring[idx].len = entry_len;
+
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));

entry_len = 0;
cur_idx++;
@@ -354,10 +373,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
VRING_DESC_F_NEXT) == 0) {
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
+ idx = cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id
= vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].len = entry_len;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_len = 0;
cur_idx++;
entry_success++;
@@ -390,16 +412,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,

if ((vq->desc[desc_idx].flags &
VRING_DESC_F_NEXT) == 0) {
- uint16_t wrapped_idx =
- cur_idx & (vq->size - 1);
+ idx = cur_idx & (vq->size - 1);
/*
* Update used ring with the
* descriptor information
*/
- vq->used->ring[wrapped_idx].id
+ vq->used->ring[idx].id
= desc_idx;
- vq->used->ring[wrapped_idx].len
+ vq->used->ring[idx].len
= entry_len;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_success++;
entry_len = 0;
cur_idx++;
@@ -422,10 +446,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
* This whole packet completes.
*/
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
+ idx = cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id
= vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].len = entry_len;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_success++;
break;
}
@@ -653,6 +680,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
/* Update used index buffer information. */
vq->used->ring[used_idx].id = head[entry_success];
vq->used->ring[used_idx].len = 0;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[used_idx]),
+ sizeof(vq->used->ring[used_idx]));

/* Allocate an mbuf and populate the structure. */
m = rte_pktmbuf_alloc(mbuf_pool);
@@ -773,6 +803,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,

rte_compiler_barrier();
vq->used->idx += entry_success;
+ vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));
/* Kick guest if required. */
if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
eventfd_write(vq->callfd, (eventfd_t)1);
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index de78a0f..03044f6 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
return -1;
}

+ vq->log_guest_addr = addr->log_guest_addr;
+
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
dev->device_fh, vq->desc);
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
dev->device_fh, vq->avail);
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
dev->device_fh, vq->used);
+ LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %"PRIx64"\n",
+ dev->device_fh, vq->log_guest_addr);

return 0;
}
--
1.9.0
Peter Xu
2015-12-22 06:55:52 UTC
Permalink
Post by Yuanhan Liu
+static inline void __attribute__((always_inline))
+vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
One thing optional: I feel it a little bit confusing regarding to
the helper function name here, for the reasons:

1. It more sounds like "logging all the vrings we used", however,
what I understand is that, here we are logging dirty pages for
guest memory. Or say, there is merely nothing to do directly with
vring (although many vring ops might call this function, we are
only passing [buf, len] pairs).

2. It may also let people think of "vring_used", which is part of
virtio protocol. However, it does not mean it too.

I would suggest a better name without confusion, like
vhost_log_dirty_range() or anything else to avoid those keywords.
Post by Yuanhan Liu
+ uint64_t addr;
+
+ addr = vq->log_guest_addr + offset;
+ vhost_log_write(dev, addr, len);
Optional too: since addr is only used once, would it cleaner using
one line? Like:

vhost_log_write(dev, vq->log_guest_addr + offset, len);
Post by Yuanhan Liu
+}
+
/**
* This function adds buffers to the virtio devices RX virtqueue. Buffers can
* be received from the physical port or from another virtio device. A packet
@@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
uint32_t offset = 0, vb_offset = 0;
uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
uint8_t hdr = 0, uncompleted_pkt = 0;
+ uint16_t idx;
/* Get descriptor from available ring */
desc = &vq->desc[head[packet_success]];
@@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
}
/* Update used ring with desc information */
- vq->used->ring[res_cur_idx & (vq->size - 1)].id =
- head[packet_success];
+ idx = res_cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id = head[packet_success];
/* Drop the packet if it is uncompleted */
if (unlikely(uncompleted_pkt == 1))
- vq->used->ring[res_cur_idx & (vq->size - 1)].len =
- vq->vhost_hlen;
+ vq->used->ring[idx].len = vq->vhost_hlen;
else
- vq->used->ring[res_cur_idx & (vq->size - 1)].len =
- pkt_len + vq->vhost_hlen;
+ vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
Got a question here:

I see that we are logging down changes when we are marking
used_vring. Do we need to log down buffer copy in rte_memcpy() too?
I am not sure whether I understand it correctly, it seems that this
is part of DPDK API ops to deliver data to the guest (from, e.g.,
OVS?), when we do rte_memcpy(), we seems to be modifying guest
memory too. Am I wrong?

Peter
Post by Yuanhan Liu
res_cur_idx++;
packet_success++;
@@ -236,6 +249,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
*(volatile uint16_t *)&vq->used->idx += count;
vq->last_used_idx = res_end_idx;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));
/* flush used->idx update before we read avail->flags. */
rte_mb();
@@ -265,6 +281,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
uint32_t seg_avail;
uint32_t vb_avail;
uint32_t cpy_len, entry_len;
+ uint16_t idx;
if (pkt == NULL)
return 0;
@@ -302,16 +319,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
entry_len = vq->vhost_hlen;
if (vb_avail == 0) {
- uint32_t desc_idx =
- vq->buf_vec[vec_idx].desc_idx;
+ uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+ if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+ idx = cur_idx & (vq->size - 1);
- if ((vq->desc[desc_idx].flags
- & VRING_DESC_F_NEXT) == 0) {
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
- = vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+ vq->used->ring[idx].len = entry_len;
+
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_len = 0;
cur_idx++;
@@ -354,10 +373,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
VRING_DESC_F_NEXT) == 0) {
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
+ idx = cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id
= vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].len = entry_len;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_len = 0;
cur_idx++;
entry_success++;
@@ -390,16 +412,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
if ((vq->desc[desc_idx].flags &
VRING_DESC_F_NEXT) == 0) {
- uint16_t wrapped_idx =
- cur_idx & (vq->size - 1);
+ idx = cur_idx & (vq->size - 1);
/*
* Update used ring with the
* descriptor information
*/
- vq->used->ring[wrapped_idx].id
+ vq->used->ring[idx].id
= desc_idx;
- vq->used->ring[wrapped_idx].len
+ vq->used->ring[idx].len
= entry_len;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_success++;
entry_len = 0;
cur_idx++;
@@ -422,10 +446,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
* This whole packet completes.
*/
/* Update used ring with desc information */
- vq->used->ring[cur_idx & (vq->size - 1)].id
+ idx = cur_idx & (vq->size - 1);
+ vq->used->ring[idx].id
= vq->buf_vec[vec_idx].desc_idx;
- vq->used->ring[cur_idx & (vq->size - 1)].len
- = entry_len;
+ vq->used->ring[idx].len = entry_len;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[idx]),
+ sizeof(vq->used->ring[idx]));
entry_success++;
break;
}
@@ -653,6 +680,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
/* Update used index buffer information. */
vq->used->ring[used_idx].id = head[entry_success];
vq->used->ring[used_idx].len = 0;
+ vhost_log_used_vring(dev, vq,
+ offsetof(struct vring_used, ring[used_idx]),
+ sizeof(vq->used->ring[used_idx]));
/* Allocate an mbuf and populate the structure. */
m = rte_pktmbuf_alloc(mbuf_pool);
@@ -773,6 +803,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
rte_compiler_barrier();
vq->used->idx += entry_success;
+ vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+ sizeof(vq->used->idx));
/* Kick guest if required. */
if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
eventfd_write(vq->callfd, (eventfd_t)1);
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index de78a0f..03044f6 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
return -1;
}
+ vq->log_guest_addr = addr->log_guest_addr;
+
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
dev->device_fh, vq->desc);
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
dev->device_fh, vq->avail);
LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
dev->device_fh, vq->used);
+ LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %"PRIx64"\n",
+ dev->device_fh, vq->log_guest_addr);
return 0;
}
--
1.9.0
Yuanhan Liu
2015-12-17 03:12:00 UTC
Permalink
It's actually a feature already enabled in Linux kernel. What we need to
do is simply to claim that we support such feature, and nothing else.

With that, the guest will send GARP messages after live migration.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
---
lib/librte_vhost/virtio-net.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
#define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
(1ULL << VIRTIO_NET_F_CTRL_RX) | \
+ (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
(VHOST_SUPPORTS_MQ) | \
(1ULL << VIRTIO_F_VERSION_1) | \
(1ULL << VHOST_F_LOG_ALL) | \
--
1.9.0
Yuanhan Liu
2015-12-17 03:11:59 UTC
Permalink
Every time we copy a buf to vring desc, we need to log it.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
Signed-off-by: Victor Kaplansky <***@redhat.com
---
lib/librte_vhost/vhost_rxtx.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index f305acd..c2d514b 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -71,7 +71,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
struct rte_mbuf **pkts, uint32_t count)
{
struct vhost_virtqueue *vq;
- struct vring_desc *desc;
+ struct vring_desc *desc, *hdr_desc;
struct rte_mbuf *buff;
/* The virtio_hdr is initialised to 0. */
struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
@@ -153,6 +153,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,

/* Copy virtio_hdr to packet and increment buffer address */
buff_hdr_addr = buff_addr;
+ hdr_desc = desc;

/*
* If the descriptors are chained the header and data are
@@ -177,6 +178,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
rte_memcpy((void *)(uintptr_t)(buff_addr + vb_offset),
rte_pktmbuf_mtod_offset(buff, const void *, offset),
len_to_cpy);
+ vhost_log_write(dev, desc->addr + vb_offset, len_to_cpy);
PRINT_PACKET(dev, (uintptr_t)(buff_addr + vb_offset),
len_to_cpy, 0);

@@ -232,6 +234,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,

rte_memcpy((void *)(uintptr_t)buff_hdr_addr,
(const void *)&virtio_hdr, vq->vhost_hlen);
+ vhost_log_write(dev, hdr_desc->addr, vq->vhost_hlen);

PRINT_PACKET(dev, (uintptr_t)buff_hdr_addr, vq->vhost_hlen, 1);

@@ -309,6 +312,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,

rte_memcpy((void *)(uintptr_t)vb_hdr_addr,
(const void *)&virtio_hdr, vq->vhost_hlen);
+ vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr, vq->vhost_hlen);

PRINT_PACKET(dev, (uintptr_t)vb_hdr_addr, vq->vhost_hlen, 1);

@@ -353,6 +357,8 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
rte_memcpy((void *)(uintptr_t)(vb_addr + vb_offset),
rte_pktmbuf_mtod_offset(pkt, const void *, seg_offset),
cpy_len);
+ vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr + vb_offset,
+ cpy_len);

PRINT_PACKET(dev,
(uintptr_t)(vb_addr + vb_offset),
--
1.9.0
Yuanhan Liu
2015-12-17 03:12:01 UTC
Permalink
To claim that we support vhost-user live migration support:
SET_LOG_BASE request will be send only when this feature flag
is set.

Besides this flag, we actually need another feature flag set
to make vhost-user live migration work: VHOST_F_LOG_ALL.
Which, however, has been enabled long time ago.

Signed-off-by: Yuanhan Liu <***@linux.intel.com>
---
lib/librte_vhost/vhost_user/virtio-net-user.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 013cf38..a3a889d 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,8 +38,10 @@
#include "vhost-net-user.h"

#define VHOST_USER_PROTOCOL_F_MQ 0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1

-#define VHOST_USER_PROTOCOL_FEATURES (1ULL << VHOST_USER_PROTOCOL_F_MQ)
+#define VHOST_USER_PROTOCOL_FEATURES ((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+ (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD))

int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
--
1.9.0
Iremonger, Bernard
2015-12-17 12:08:13 UTC
Permalink
Hi Yuanhan,
-----Original Message-----
Sent: Thursday, December 17, 2015 3:12 AM
Subject: [PATCH v2 0/6] vhost-user live migration support
This patch set adds the vhost-user live migration support.
The major task behind that is to log pages we touched during live migration,
including used vring and desc buffer. So, this patch set is basically about
adding vhost log support, and using it.
Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
the dirty memory bitmap is.
- Patch 2 introduces a vhost_log_write() helper function to log
pages we are gonna change.
- Patch 3 logs changes we made to used vring.
- Patch 4 logs changes we made to vring desc buffer.
- Patch 5 and 6 add some feature bits related to live migration.
The follow test guide should probably be added the DPDK doc files.
It could be added to the sample app guide or the programmers guide.
There is already a Vhost Library section in the programmers guide and
A Vhost Sample Application section in the sample app guide.
A simple test guide (on same host)
==================================
The following test is based on OVS + DPDK (check [0] for how to setup OVS +
[0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
1. start ovs-vswitchd
2. Add two ovs vhost-user port, say vhost0 and vhost1
$ $QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost0 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-
path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3333,server,nowait -curses
4. run "ping $host" inside VM1
5. Start VM2 to connect to vhost0, and marking it as the target
of live migration (by adding -incoming tcp:0:4444 option)
$ $QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost1 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-
path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3334,server,nowait -curses \
-incoming tcp:0:4444
migrate tcp:0:4444
7. After a while, you will find that VM1 has been migrated to VM2,
and the "ping" command continues running, perfectly.
---
vhost: handle VHOST_USER_SET_LOG_BASE request
vhost: introduce vhost_log_write
vhost: log used vring changes
vhost: log vring desc buffer changes
vhost: claim that we support GUEST_ANNOUNCE feature
vhost: enable log_shmfd protocol feature
lib/librte_vhost/rte_virtio_net.h | 36 ++++++++++-
lib/librte_vhost/vhost_rxtx.c | 88 +++++++++++++++++++--------
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++-
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 ++
lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 5 +-
lib/librte_vhost/virtio-net.c | 5 ++
7 files changed, 165 insertions(+), 30 deletions(-)
--
1.9.0
Regards,

Bernard.
Yuanhan Liu
2015-12-17 12:45:31 UTC
Permalink
Post by Iremonger, Bernard
Hi Yuanhan,
-----Original Message-----
Sent: Thursday, December 17, 2015 3:12 AM
Subject: [PATCH v2 0/6] vhost-user live migration support
This patch set adds the vhost-user live migration support.
The major task behind that is to log pages we touched during live migration,
including used vring and desc buffer. So, this patch set is basically about
adding vhost log support, and using it.
Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
the dirty memory bitmap is.
- Patch 2 introduces a vhost_log_write() helper function to log
pages we are gonna change.
- Patch 3 logs changes we made to used vring.
- Patch 4 logs changes we made to vring desc buffer.
- Patch 5 and 6 add some feature bits related to live migration.
The follow test guide should probably be added the DPDK doc files.
Yes, but not this one, which is a fare rough one. The official one
should do live migration between two hosts.
Post by Iremonger, Bernard
It could be added to the sample app guide or the programmers guide.
There is already a Vhost Library section in the programmers guide and
A Vhost Sample Application section in the sample app guide.
We may do it after the validation from validation team.

--yliu
Pavel Fedin
2015-12-21 08:17:03 UTC
Permalink
Works fine.

Tested-by: Pavel Fedin <***@samsung.com>

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
-----Original Message-----
Sent: Thursday, December 17, 2015 6:12 AM
Fedin; Peter Xu; Yuanhan Liu; Chen Zhihui; Yang Maggie
Subject: [PATCH v2 0/6] vhost-user live migration support
This patch set adds the vhost-user live migration support.
The major task behind that is to log pages we touched during
live migration, including used vring and desc buffer. So, this
patch set is basically about adding vhost log support, and
using it.
Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
the dirty memory bitmap is.
- Patch 2 introduces a vhost_log_write() helper function to log
pages we are gonna change.
- Patch 3 logs changes we made to used vring.
- Patch 4 logs changes we made to vring desc buffer.
- Patch 5 and 6 add some feature bits related to live migration.
A simple test guide (on same host)
==================================
The following test is based on OVS + DPDK (check [0] for
[0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
1. start ovs-vswitchd
2. Add two ovs vhost-user port, say vhost0 and vhost1
$ $QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost0 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3333,server,nowait -curses
4. run "ping $host" inside VM1
5. Start VM2 to connect to vhost0, and marking it as the target
of live migration (by adding -incoming tcp:0:4444 option)
$ $QEMU -enable-kvm -m 1024 -smp 4 \
-chardev socket,id=char0,path=/var/run/openvswitch/vhost1 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
-device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
-object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
-numa node,memdev=mem -mem-prealloc \
-kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
-hda fc-19-i386.img \
-monitor telnet::3334,server,nowait -curses \
-incoming tcp:0:4444
migrate tcp:0:4444
7. After a while, you will find that VM1 has been migrated to VM2,
and the "ping" command continues running, perfectly.
---
vhost: handle VHOST_USER_SET_LOG_BASE request
vhost: introduce vhost_log_write
vhost: log used vring changes
vhost: log vring desc buffer changes
vhost: claim that we support GUEST_ANNOUNCE feature
vhost: enable log_shmfd protocol feature
lib/librte_vhost/rte_virtio_net.h | 36 ++++++++++-
lib/librte_vhost/vhost_rxtx.c | 88 +++++++++++++++++++--------
lib/librte_vhost/vhost_user/vhost-net-user.c | 7 ++-
lib/librte_vhost/vhost_user/vhost-net-user.h | 6 ++
lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++
lib/librte_vhost/vhost_user/virtio-net-user.h | 5 +-
lib/librte_vhost/virtio-net.c | 5 ++
7 files changed, 165 insertions(+), 30 deletions(-)
--
1.9.0
Continue reading on narkive:
Loading...