Discussion:
[dpdk-dev] vhost compliant virtio based networking interface in container
Xie, Huawei
2015-08-20 10:14:55 UTC
I read your mail; it seems what we did is quite similar. Here I wrote a
quick mail to describe our design. Let me know if it is the same thing.
We don't have a high-performance networking interface in containers for
NFV. The current veth-pair based interface can't be easily accelerated.
The design consists of:
1. A DPDK-based virtio PMD driver in the container.
2. A device simulation framework in the container.
3. DPDK (or kernel) vhost running on the host.
How is virtio created?
A: There is no "real" virtio-pci device in the container environment.
1) The host maintains pools of memory and shares memory with the container.
This could be accomplished by the host sharing a huge page file with the
container.
2) The container creates virtio rings on the shared memory.
3) The container creates mbuf memory pools on the shared memory.
4) The container sends the memory and vring information to vhost through
vhost messages. This could be done either through an ioctl call or a
vhost-user message. A rough sketch of this shared-memory setup is shown
below.
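To make the shared-memory step concrete, here is a minimal sketch,
assuming a host-shared huge page file; the path, size, and error handling
are illustrative only, not our actual implementation:

/* Minimal sketch: map a huge page file shared by the host into the
 * container's address space. Path and size are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_HUGEFILE "/dev/hugepages/container1_mem"  /* assumed path */
#define SHARED_MEM_SIZE (1UL << 30)                      /* assumed 1 GB */

static void *map_shared_region(void)
{
    int fd = open(SHARED_HUGEFILE, O_RDWR);
    if (fd < 0) {
        perror("open shared huge page file");
        return NULL;
    }

    /* The container and the host each mmap the same file; the difference
     * between the two virtual base addresses is the "fixed offset". */
    void *base = mmap(NULL, SHARED_MEM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);
    if (base == MAP_FAILED) {
        perror("mmap shared huge page file");
        return NULL;
    }

    /* The virtio rings and mbuf pools are then carved out of
     * [base, base + SHARED_MEM_SIZE) so vhost can reach every buffer. */
    return base;
}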
How are vhost messages sent?
A: There are two alternative ways to do this.
1) The customized virtio PMD is responsible for all the vring creation
and for sending the vhost messages.
2) We could do this through a lightweight device simulation framework.
The device simulation creates a simple PCI bus. On the PCI bus,
virtio-net PCI devices are created. The device simulation provides an
IOAPI for MMIO/IO access.
2.1 The virtio PMD configures the pseudo virtio device just as it does in
a KVM guest environment.
2.2 Rather than using IO instructions, the virtio PMD uses the IOAPI for
IO operations on the virtio-net PCI device.
2.3 The device simulation is responsible for simulating the device state
machine.
2.4 The device simulation is responsible for talking to vhost.
With this approach, we could minimize the virtio PMD modifications; to
the virtio PMD it is like configuring a real virtio-net PCI device. A
rough sketch of such an IOAPI is shown below.
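As an illustration of what such an IOAPI could look like (the structure
and function names below are assumptions made for the example, not the
actual framework interface):

/* Hypothetical sketch of the IOAPI the device simulation could expose.
 * None of these names come from the actual framework. */
#include <stdint.h>

struct pseudo_pci_dev;  /* opaque handle created by the device simulation */

struct ioapi_ops {
    /* Replacements for inb/outb-style port IO on the virtio-net device. */
    uint32_t (*io_read)(struct pseudo_pci_dev *dev,
                        uint64_t offset, int len);
    void     (*io_write)(struct pseudo_pci_dev *dev,
                         uint64_t offset, uint64_t value, int len);
};

/* The virtio PMD would then program the device exactly as it would a real
 * virtio-net PCI device, only going through these callbacks, e.g.: */
static inline void virtio_set_status(struct pseudo_pci_dev *dev,
                                     const struct ioapi_ops *ops,
                                     uint8_t status)
{
    /* 18 is the legacy VIRTIO_PCI_STATUS register offset. */
    ops->io_write(dev, 18, status, 1);
}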
Memory mapping?
A: QEMU can access the whole guest memory in the KVM environment. We need
to fill that gap.
The container maps the shared memory into the container's virtual address
space and the host maps it into the host's virtual address space; there
is a fixed offset between the two mappings.
The container creates the shared vrings on that memory, and also creates
the mbuf memory pools on the shared memory.
In the VHOST_SET_MEM_TABLE message, we send the memory mapping
information for the shared memory. As we require the mbuf pools to be
created on the shared memory, and buffers are allocated from those mbuf
pools, DPDK vhost can translate the GPA in a vring desc to a host virtual
address. (A sketch of the region description follows.)
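For illustration, the region description sent for the shared memory might
look roughly like this; the struct below is a sketch that mirrors the
usual vhost memory-region fields, and in this design the "guest physical"
base is simply the container virtual base of the shared area:

/* Sketch of the single region we would describe in the
 * VHOST_SET_MEM_TABLE message for the shared huge page area.
 * Written out here only for illustration. */
#include <stdint.h>

struct shared_mem_region {
    uint64_t guest_phys_addr;  /* base address as seen in vring entries */
    uint64_t memory_size;      /* size of the shared huge page area     */
    uint64_t userspace_addr;   /* container virtual base of the mapping */
};

static void fill_region(struct shared_mem_region *r,
                        void *container_base, uint64_t size)
{
    /* Because descriptors carry container virtual addresses (CVAs), the
     * "guest physical" base is simply the CVA base of the shared area. */
    r->guest_phys_addr = (uint64_t)(uintptr_t)container_base;
    r->memory_size     = size;
    r->userspace_addr  = (uint64_t)(uintptr_t)container_base;
}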
GPA or CVA in the vring desc?
To ease the memory translation, rather than using the GPA, here we use the
CVA (container virtual address). This is the tricky part.
1) The virtio PMD writes the vring's VFN rather than its PFN to the PFN
register through the IOAPI.
2) The device simulation framework uses the VFN as the PFN.
3) The device simulation sends SET_VRING_ADDR with the CVA.
4) The virtio PMD fills the vring desc with the CVA of the mbuf data
pointer rather than the GPA.
So when the host sees a CVA, it can translate it to an HVA (host virtual
address); a sketch of that translation is shown below.
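A minimal sketch of what that translation amounts to on the host side
(illustrative only; DPDK vhost already performs the equivalent lookup,
which is exactly why no modification is needed):

/* Sketch of the address translation the host ends up doing. Because the
 * vring descriptors hold CVAs and the memory table maps the CVA range to
 * the host mapping of the same shared file, the usual "GPA to host
 * virtual" lookup turns a CVA into an HVA with a fixed offset. */
#include <stdint.h>

struct host_mapped_region {
    uint64_t cva_base;   /* base advertised by the container       */
    uint64_t size;       /* region size                            */
    uint64_t hva_base;   /* where the host mmapped the shared file */
};

static void *cva_to_hva(const struct host_mapped_region *regions,
                        int nregions, uint64_t cva)
{
    for (int i = 0; i < nregions; i++) {
        const struct host_mapped_region *r = &regions[i];
        if (cva >= r->cva_base && cva < r->cva_base + r->size)
            return (void *)(uintptr_t)(cva - r->cva_base + r->hva_base);
    }
    return NULL;  /* address not covered by the shared memory */
}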
The virtio interface in the container follows the vhost message format
and is compliant with the DPDK vhost implementation, i.e., no DPDK vhost
modification is needed.
vhost isn't aware of whether the incoming virtio comes from a KVM guest
or from a container.
That pretty much covers the high-level design. There are quite a few
low-level issues. For example, a 32-bit PFN is enough for a KVM guest,
but since we use a 64-bit VFN (virtual page frame number), a trick is
needed here, done through a special IOAPI.
/huawei
Tetsuya Mukawa
2015-08-25 02:58:41 UTC
Hi Xie and Yanping,


May I ask you some questions?
It seems we are also developing an almost identical one.
I read your mail, seems what we did are quite similar. Here i wrote a
quick mail to describe our design. Let me know if it is the same thing.
We don't have a high performance networking interface in container for
NFV. Current veth pair based interface couldn't be easily accelerated.
1. DPDK based virtio PMD driver in container.
2. device simulation framework in container.
3. dpdk(or kernel) vhost running in host.
How virtio is created?
A: There is no "real" virtio-pci device in container environment.
1). Host maintains pools of memories, and shares memory to container.
This could be accomplished through host share a huge page file to container.
2). Containers creates virtio rings based on the shared memory.
3). Container creates mbuf memory pools on the shared memory.
4) Container send the memory and vring information to vhost through
vhost message. This could be done either through ioctl call or vhost
user message.
How vhost message is sent?
A: There are two alternative ways to do this.
1) The customized virtio PMD is responsible for all the vring creation,
and vhost message sending.
The above is our approach so far.
It seems Yanping also takes this kind of approach.
We are using the vhost-user functionality instead of the vhost-net
kernel module.
Probably this is the difference between Yanping and us.

BTW, we are going to submit a vhost PMD for DPDK 2.2.
This PMD is implemented on top of librte_vhost.
It allows a DPDK application to handle a vhost-user (cuse) backend as a
normal NIC port.
This PMD should work with both Xie's and Yanping's approaches.
(In the case of Yanping's approach, we may need vhost-cuse.)
2) We could do this through a lightweight device simulation framework.
The device simulation creates simple PCI bus. On the PCI bus,
virtio-net PCI devices are created. The device simulations provides
IOAPI for MMIO/IO access.
Does it mean you implemented a kernel module?
If so, do you still need the vhost-cuse functionality to handle vhost
messages in userspace?
2.1 virtio PMD configures the pseudo virtio device as how it does in
KVM guest enviroment.
2.2 Rather than using io instruction, virtio PMD uses IOAPI for IO
operation on the virtio-net PCI device.
2.3 The device simulation is responsible for device state machine
simulation.
2.4 The device simulation is responsbile for talking to vhost.
With this approach, we could minimize the virtio PMD modifications.
The virtio PMD is like configuring a real virtio-net PCI device.
Memory mapping?
A: QEMU could access the whole guest memory in KVM enviroment. We need
to fill the gap.
container maps the shared memory to container's virtual address space
and host maps it to host's virtual address space. There is a fixed
offset mapping.
Container creates shared vring based on the memory. Container also
creates mbuf memory pool based on the shared memroy.
In VHOST_SET_MEMORY_TABLE message, we send the memory mapping
information for the shared memory. As we require mbuf pool created on
the shared memory, and buffers are allcoated from the mbuf pools, dpdk
vhost could translate the GPA in vring desc to host virtual.
GPA or CVA in vring desc?
To ease the memory translation, rather than using GPA, here we use
CVA(container virtual address). This the tricky thing here.
1) virtio PMD writes vring's VFN rather than PFN to PFN register through
IOAPI.
2) device simulation framework will use VFN as PFN.
3) device simulation sends SET_VRING_ADDR with CVA.
4) virtio PMD fills vring desc with CVA of the mbuf data pointer rather
than GPA.
So when host sees the CVA, it could translates it to HVA(host virtual
address).
The virtio interface in container follows the vhost message format, and
is compliant with dpdk vhost implmentation, i.e, no dpdk vhost
modification is needed.
vHost isn't aware whether the incoming virtio comes from KVM guest or
container.
The pretty much covers the high level design. There are quite some low
level issues. For example, 32bit PFN is enough for KVM guest, since we
use 64bit VFN(virtual page frame number), trick is done here through a
special IOAPI.
In addition to the above, we might consider the kernel "namespace"
functionality. Technically it would not be a big problem, but it is
related to security, so it would be nice to take it into account.

Regards,
Tetsuya
/huawei
Xie, Huawei
2015-08-25 09:56:59 UTC
Post by Tetsuya Mukawa
Hi Xie and Yanping,
May I ask you some questions?
It seems we are also developing an almost same one.
Good to know that we are tackling the same problem and have a similar
idea.
What is your status now? We have had the POC running, and it is compliant
with DPDK vhost.
Interrupt-like notification isn't supported yet.
Post by Tetsuya Mukawa
I read your mail, seems what we did are quite similar. Here i wrote a
quick mail to describe our design. Let me know if it is the same thing.
We don't have a high performance networking interface in container for
NFV. Current veth pair based interface couldn't be easily accelerated.
1. DPDK based virtio PMD driver in container.
2. device simulation framework in container.
3. dpdk(or kernel) vhost running in host.
How virtio is created?
A: There is no "real" virtio-pci device in container environment.
1). Host maintains pools of memories, and shares memory to container.
This could be accomplished through host share a huge page file to container.
2). Containers creates virtio rings based on the shared memory.
3). Container creates mbuf memory pools on the shared memory.
4) Container send the memory and vring information to vhost through
vhost message. This could be done either through ioctl call or vhost
user message.
How vhost message is sent?
A: There are two alternative ways to do this.
1) The customized virtio PMD is responsible for all the vring creation,
and vhost message sending.
Above is our approach so far.
It seems Yanping also takes this kind of approach.
We are using vhost-user functionality instead of using the vhost-net
kernel module.
Probably this is the difference between Yanping and us.
In my current implementation, the device simulation layer talks to "user
space" vhost through the cuse interface. It could also be done through
the vhost-user socket. This isn't the key point.
The term vhost-user is kind of confusing here; maybe "user space vhost"
is more accurate, covering either cuse or a unix domain socket. :)

As for Yanping, they are now connecting to the vhost-net kernel module,
but they are also trying to connect to "user space" vhost. Correct me if
I am wrong.
Yes, there is some difference between these two. The vhost-net kernel
module can directly access another process's memory, while with user
space vhost (cuse/socket) we need to do the memory mapping.
Post by Tetsuya Mukawa
BTW, we are going to submit a vhost PMD for DPDK-2.2.
This PMD is implemented on librte_vhost.
It allows DPDK application to handle a vhost-user(cuse) backend as a
normal NIC port.
This PMD should work with both Xie and Yanping approach.
(In the case of Yanping approach, we may need vhost-cuse)
2) We could do this through a lightweight device simulation framework.
The device simulation creates simple PCI bus. On the PCI bus,
virtio-net PCI devices are created. The device simulations provides
IOAPI for MMIO/IO access.
Does it mean you implemented a kernel module?
If so, do you still need vhost-cuse functionality to handle vhost
messages n userspace?
The device simulation is a library running in user space in the
container. It is linked with the DPDK app. It creates pseudo buses and
virtio-net PCI devices.
The virtio-container PMD configures the virtio-net pseudo devices
through the IOAPI provided by the device simulation rather than through
IO instructions as in KVM.
Why do we use device simulation?
We could create other virtio devices in the container, and provide a
common way to talk to the vhost-xx modules.
Post by Tetsuya Mukawa
2.1 virtio PMD configures the pseudo virtio device as how it does in
KVM guest enviroment.
2.2 Rather than using io instruction, virtio PMD uses IOAPI for IO
operation on the virtio-net PCI device.
2.3 The device simulation is responsible for device state machine
simulation.
2.4 The device simulation is responsbile for talking to vhost.
With this approach, we could minimize the virtio PMD modifications.
The virtio PMD is like configuring a real virtio-net PCI device.
Memory mapping?
A: QEMU could access the whole guest memory in KVM enviroment. We need
to fill the gap.
container maps the shared memory to container's virtual address space
and host maps it to host's virtual address space. There is a fixed
offset mapping.
Container creates shared vring based on the memory. Container also
creates mbuf memory pool based on the shared memroy.
In VHOST_SET_MEMORY_TABLE message, we send the memory mapping
information for the shared memory. As we require mbuf pool created on
the shared memory, and buffers are allcoated from the mbuf pools, dpdk
vhost could translate the GPA in vring desc to host virtual.
GPA or CVA in vring desc?
To ease the memory translation, rather than using GPA, here we use
CVA(container virtual address). This the tricky thing here.
1) virtio PMD writes vring's VFN rather than PFN to PFN register through
IOAPI.
2) device simulation framework will use VFN as PFN.
3) device simulation sends SET_VRING_ADDR with CVA.
4) virtio PMD fills vring desc with CVA of the mbuf data pointer rather
than GPA.
So when host sees the CVA, it could translates it to HVA(host virtual
address).
The virtio interface in container follows the vhost message format, and
is compliant with dpdk vhost implmentation, i.e, no dpdk vhost
modification is needed.
vHost isn't aware whether the incoming virtio comes from KVM guest or
container.
The pretty much covers the high level design. There are quite some low
level issues. For example, 32bit PFN is enough for KVM guest, since we
use 64bit VFN(virtual page frame number), trick is done here through a
special IOAPI.
In addition above, we might consider "namespace" kernel functionality.
Technically, it would not be a big problem, but related with security.
So it would be nice to take account.
There is no namespace concept here because we don't create kernel
netdev devices. It might be useful if we could extend our work to
support a kernel netdev interface and assign it to the container's
namespace.
Post by Tetsuya Mukawa
Regards,
Tetsuya
/huawei
Tetsuya Mukawa
2015-08-26 09:23:04 UTC
Post by Xie, Huawei
Post by Tetsuya Mukawa
Hi Xie and Yanping,
May I ask you some questions?
It seems we are also developing an almost same one.
Good to know that we are tackling the same problem and have the similar
idea.
What is your status now? We had the POC running, and compliant with
dpdkvhost.
Interrupt like notification isn't supported.
We implemented the vhost PMD first, so we have just started implementing it.
Post by Xie, Huawei
Post by Tetsuya Mukawa
I read your mail, seems what we did are quite similar. Here i wrote a
quick mail to describe our design. Let me know if it is the same thing.
We don't have a high performance networking interface in container for
NFV. Current veth pair based interface couldn't be easily accelerated.
1. DPDK based virtio PMD driver in container.
2. device simulation framework in container.
3. dpdk(or kernel) vhost running in host.
How virtio is created?
A: There is no "real" virtio-pci device in container environment.
1). Host maintains pools of memories, and shares memory to container.
This could be accomplished through host share a huge page file to container.
2). Containers creates virtio rings based on the shared memory.
3). Container creates mbuf memory pools on the shared memory.
4) Container send the memory and vring information to vhost through
vhost message. This could be done either through ioctl call or vhost
user message.
How vhost message is sent?
A: There are two alternative ways to do this.
1) The customized virtio PMD is responsible for all the vring creation,
and vhost message sending.
Above is our approach so far.
It seems Yanping also takes this kind of approach.
We are using vhost-user functionality instead of using the vhost-net
kernel module.
Probably this is the difference between Yanping and us.
In my current implementation, the device simulation layer talks to "user
space" vhost through cuse interface. It could also be done through vhost
user socket. This isn't the key point.
Here vhost-user is kind of confusing, maybe user space vhost is more
accurate, either cuse or unix domain socket. :).
As for yanping, they are now connecting to vhost-net kernel module, but
they are also trying to connect to "user space" vhost. Correct me if wrong.
Yes, there is some difference between these two. Vhost-net kernel module
could directly access other process's memory, while using
vhost-user(cuse/user), we need do the memory mapping.
Post by Tetsuya Mukawa
BTW, we are going to submit a vhost PMD for DPDK-2.2.
This PMD is implemented on librte_vhost.
It allows DPDK application to handle a vhost-user(cuse) backend as a
normal NIC port.
This PMD should work with both Xie and Yanping approach.
(In the case of Yanping approach, we may need vhost-cuse)
2) We could do this through a lightweight device simulation framework.
The device simulation creates simple PCI bus. On the PCI bus,
virtio-net PCI devices are created. The device simulations provides
IOAPI for MMIO/IO access.
Does it mean you implemented a kernel module?
If so, do you still need vhost-cuse functionality to handle vhost
messages n userspace?
The device simulation is a library running in user space in container.
It is linked with DPDK app. It creates pseudo buses and virtio-net PCI
devices.
The virtio-container-PMD configures the virtio-net pseudo devices
through IOAPI provided by the device simulation rather than IO
instructions as in KVM.
Why we use device simulation?
We could create other virtio devices in container, and provide an common
way to talk to vhost-xx module.
Thanks for the explanation.
At first reading, I thought the difference between approach 1 and
approach 2 was whether we need to implement a new kernel module or not.
But now I understand how you implemented it.

Please let me explain our design a bit more.
We might use a somewhat similar approach to handle a pseudo virtio-net
device in DPDK.
(Anyway, we haven't finished implementing it yet, so this overview might
have some technical problems.)

Step 1. Separate the virtio-net and vhost-user socket related code from
QEMU, then implement it as a separate program.
The program also has the features below.
- Create a directory that contains almost the same files as
/sys/bus/pci/devices/<pci address>/*
(To scan these files located outside sysfs, we need to fix EAL.)
- This dummy device is driven by 'dummy-virtio-net-driver'. This name is
specified in the '<pci addr>/driver' file.
- Create a shared file that represents the PCI configuration space, then
mmap it, and also specify its path in '<pci addr>/resource_path'.

The program will be GPL, but it will only act as a bridge over the shared
memory between the virtio-net PMD and the DPDK vhost backend.
Actually, it will work underneath the virtio-net PMD, but we don't need
to link against it, so I guess we don't have a GPL license issue.

Step 2. Fix the PCI scan code of EAL to scan the dummy devices.
- To scan the above files, extend pci_scan() of EAL.

Step 3. Add a new kdrv type to EAL.
- To handle 'dummy-virtio-net-driver', add a new kdrv type to EAL.

Step 4. Implement pci_dummy_virtio_net_map()/unmap().
- They will have almost the same functionality as pci_uio_map(), but for
the dummy virtio-net device.
- The dummy device will be mmapped using the path specified in '<pci
addr>/resource_path'.

Step 5. Add a new compile option for the virtio-net device to replace
the IO functions.
- The IO functions of the virtio-net PMD will be replaced by read() and
write() accesses to the shared memory.
- Add a notification mechanism to the IO functions. This will be used
when a write() to the shared memory is done.
(Not sure exactly, but probably we need it. A rough sketch of such
replaced accessors is shown below.)
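To illustrate Step 5, here is a rough sketch under assumptions: the
structure, accessor names, and notification hook below are placeholders,
not the eventual implementation:

/* Sketch of what the replaced virtio-net IO accessors could look like:
 * instead of port IO, read/write the mmapped pseudo configuration space
 * and kick the pseudo device process after a write. All names here are
 * placeholders. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct dummy_virtio_hw {
    uint8_t *cfg_base;   /* mmap of '<pci addr>/resource_path'          */
    int      notify_fd;  /* e.g. an eventfd shared with the pseudo      */
                         /* virtio-net device process                   */
};

static uint32_t dummy_io_read32(struct dummy_virtio_hw *hw, uint64_t off)
{
    uint32_t val;

    memcpy(&val, hw->cfg_base + off, sizeof(val));
    return val;
}

static void dummy_io_write32(struct dummy_virtio_hw *hw,
                             uint64_t off, uint32_t val)
{
    memcpy(hw->cfg_base + off, &val, sizeof(val));

    /* Notify the pseudo device process that the register changed. */
    uint64_t kick = 1;
    (void)write(hw->notify_fd, &kick, sizeof(kick));
}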

Does it make sense?
I guess Steps 1 and 2 are different from your approach, but the rest
might be similar.

Actually, we just need sysfs entries for a virtio-net dummy device, but
so far I don't have a good way to register them from user space without
loading a kernel module.
This is because I would also need to change pci_scan().

It seems you have implemented a virtio-net pseudo device under the BSD
license.
If so, this kind of PMD would be nice to use.
In case it takes much time to implement some missing functionalities
like interrupt mode, using the QEMU code might be one of the options.

Anyway, we just need a good virtual NIC between containers and the host,
so we don't insist on our particular approach and implementation.

Thanks,
Tetsuya
Post by Xie, Huawei
Post by Tetsuya Mukawa
2.1 virtio PMD configures the pseudo virtio device as how it does in
KVM guest enviroment.
2.2 Rather than using io instruction, virtio PMD uses IOAPI for IO
operation on the virtio-net PCI device.
2.3 The device simulation is responsible for device state machine
simulation.
2.4 The device simulation is responsbile for talking to vhost.
With this approach, we could minimize the virtio PMD modifications.
The virtio PMD is like configuring a real virtio-net PCI device.
Memory mapping?
A: QEMU could access the whole guest memory in KVM enviroment. We need
to fill the gap.
container maps the shared memory to container's virtual address space
and host maps it to host's virtual address space. There is a fixed
offset mapping.
Container creates shared vring based on the memory. Container also
creates mbuf memory pool based on the shared memroy.
In VHOST_SET_MEMORY_TABLE message, we send the memory mapping
information for the shared memory. As we require mbuf pool created on
the shared memory, and buffers are allcoated from the mbuf pools, dpdk
vhost could translate the GPA in vring desc to host virtual.
GPA or CVA in vring desc?
To ease the memory translation, rather than using GPA, here we use
CVA(container virtual address). This the tricky thing here.
1) virtio PMD writes vring's VFN rather than PFN to PFN register through
IOAPI.
2) device simulation framework will use VFN as PFN.
3) device simulation sends SET_VRING_ADDR with CVA.
4) virtio PMD fills vring desc with CVA of the mbuf data pointer rather
than GPA.
So when host sees the CVA, it could translates it to HVA(host virtual
address).
The virtio interface in container follows the vhost message format, and
is compliant with dpdk vhost implmentation, i.e, no dpdk vhost
modification is needed.
vHost isn't aware whether the incoming virtio comes from KVM guest or
container.
The pretty much covers the high level design. There are quite some low
level issues. For example, 32bit PFN is enough for KVM guest, since we
use 64bit VFN(virtual page frame number), trick is done here through a
special IOAPI.
In addition above, we might consider "namespace" kernel functionality.
Technically, it would not be a big problem, but related with security.
So it would be nice to take account.
There is no namespace concept here because we don't generate kernel
netdev devices. It might be usefull if we could extend our work to
support kernel netdev interface and assign to container's namespace.
Post by Tetsuya Mukawa
Regards,
Tetsuya
/huawei
Xie, Huawei
2015-09-07 05:54:13 UTC
Post by Tetsuya Mukawa
Post by Xie, Huawei
Post by Tetsuya Mukawa
Hi Xie and Yanping,
May I ask you some questions?
It seems we are also developing an almost same one.
Good to know that we are tackling the same problem and have the similar
idea.
What is your status now? We had the POC running, and compliant with
dpdkvhost.
Interrupt like notification isn't supported.
We implemented vhost PMD first, so we just start implementing it.
Post by Xie, Huawei
Post by Tetsuya Mukawa
I read your mail, seems what we did are quite similar. Here i wrote a
quick mail to describe our design. Let me know if it is the same thing.
We don't have a high performance networking interface in container for
NFV. Current veth pair based interface couldn't be easily accelerated.
1. DPDK based virtio PMD driver in container.
2. device simulation framework in container.
3. dpdk(or kernel) vhost running in host.
How virtio is created?
A: There is no "real" virtio-pci device in container environment.
1). Host maintains pools of memories, and shares memory to container.
This could be accomplished through host share a huge page file to container.
2). Containers creates virtio rings based on the shared memory.
3). Container creates mbuf memory pools on the shared memory.
4) Container send the memory and vring information to vhost through
vhost message. This could be done either through ioctl call or vhost
user message.
How vhost message is sent?
A: There are two alternative ways to do this.
1) The customized virtio PMD is responsible for all the vring creation,
and vhost message sending.
Above is our approach so far.
It seems Yanping also takes this kind of approach.
We are using vhost-user functionality instead of using the vhost-net
kernel module.
Probably this is the difference between Yanping and us.
In my current implementation, the device simulation layer talks to "user
space" vhost through cuse interface. It could also be done through vhost
user socket. This isn't the key point.
Here vhost-user is kind of confusing, maybe user space vhost is more
accurate, either cuse or unix domain socket. :).
As for yanping, they are now connecting to vhost-net kernel module, but
they are also trying to connect to "user space" vhost. Correct me if wrong.
Yes, there is some difference between these two. Vhost-net kernel module
could directly access other process's memory, while using
vhost-user(cuse/user), we need do the memory mapping.
Post by Tetsuya Mukawa
BTW, we are going to submit a vhost PMD for DPDK-2.2.
This PMD is implemented on librte_vhost.
It allows DPDK application to handle a vhost-user(cuse) backend as a
normal NIC port.
This PMD should work with both Xie and Yanping approach.
(In the case of Yanping approach, we may need vhost-cuse)
2) We could do this through a lightweight device simulation framework.
The device simulation creates simple PCI bus. On the PCI bus,
virtio-net PCI devices are created. The device simulations provides
IOAPI for MMIO/IO access.
Does it mean you implemented a kernel module?
If so, do you still need vhost-cuse functionality to handle vhost
messages n userspace?
The device simulation is a library running in user space in container.
It is linked with DPDK app. It creates pseudo buses and virtio-net PCI
devices.
The virtio-container-PMD configures the virtio-net pseudo devices
through IOAPI provided by the device simulation rather than IO
instructions as in KVM.
Why we use device simulation?
We could create other virtio devices in container, and provide an common
way to talk to vhost-xx module.
Thanks for explanation.
At first reading, I thought the difference between approach1 and
approach2 is whether we need to implement a new kernel module, or not.
But I understand how you implemented.
Please let me explain our design more.
We might use a kind of similar approach to handle a pseudo virtio-net
device in DPDK.
(Anyway, we haven't finished implementing yet, this overview might have
some technical problems)
Step1. Separate virtio-net and vhost-user socket related code from QEMU,
then implement it as a separated program.
The program also has below features.
- Create a directory that contains almost same files like
/sys/bus/pci/device/<pci address>/*
(To scan these file located on outside sysfs, we need to fix EAL)
- This dummy device is driven by dummy-virtio-net-driver. This name is
specified by '<pci addr>/driver' file.
- Create a shared file that represents pci configuration space, then
mmap it, also specify the path in '<pci addr>/resource_path'
The program will be GPL, but it will be like a bridge on the shared
memory between virtio-net PMD and DPDK vhost backend.
Actually, It will work under virtio-net PMD, but we don't need to link it.
So I guess we don't have GPL license issue.
Step2. Fix pci scan code of EAL to scan dummy devices.
- To scan above files, extend pci_scan() of EAL.
Step3. Add a new kdrv type to EAL.
- To handle the 'dummy-virtio-net-driver', add a new kdrv type to EAL.
Step4. Implement pci_dummy_virtio_net_map/unmap().
- It will have almost same functionality like pci_uio_map(), but for
dummy virtio-net device.
- The dummy device will be mmaped using a path specified in '<pci
addr>/resource_path'.
Step5. Add a new compile option for virtio-net device to replace IO
functions.
- The IO functions of virtio-net PMD will be replaced by read() and
write() to access to the shared memory.
- Add notification mechanism to IO functions. This will be used when
write() to the shared memory is done.
(Not sure exactly, but probably we need it)
Does it make sense?
I guess Step1&2 is different from your approach, but the rest might be
similar.
Actually, we just need sysfs entries for a virtio-net dummy device, but
so far, I don't have a fine way to register them from user space without
loading a kernel module.
Tetsuya:
I don't quite get the details. Who will create those sysfs entries? A
kernel module, right?
The virtio-net device is configured through reads/writes to the shared
memory (between host and guest), right?
Where are the shared vrings and the shared memory created, on huge pages
shared between host and guest?
Who will talk to DPDK vhost?
Post by Tetsuya Mukawa
This is because I need to change pci_scan() also.
It seems you have implemented a virtio-net pseudo device as BSD license.
If so, this kind of PMD would be nice to use it.
Currently it is based on the Native Linux KVM tool.
Post by Tetsuya Mukawa
In the case that it takes much time to implement some lost
functionalities like interrupt mode, using QEMU code might be an one of
options.
For interrupt mode, I plan to use an eventfd for sleep/wake; I have not
tried it yet. A minimal sketch of that idea follows.
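A minimal sketch of that eventfd idea, just to illustrate the sleep/wake
mechanism (not tested against the PMD):

/* Minimal eventfd sleep/wake illustration: the receiving side blocks in
 * read() until the other side write()s to the same eventfd. */
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    int efd = eventfd(0, 0);      /* shared between notifier and waiter */
    if (efd < 0) {
        perror("eventfd");
        return 1;
    }

    uint64_t v = 1;
    write(efd, &v, sizeof(v));    /* "interrupt": wake the waiter */

    uint64_t got;
    read(efd, &got, sizeof(got)); /* waiter sleeps here until notified */
    printf("woken with counter %llu\n", (unsigned long long)got);

    close(efd);
    return 0;
}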
Post by Tetsuya Mukawa
Anyway, we just need a fine virtual NIC between containers and host.
So we don't hold to our approach and implementation.
Do you have comments on my implementation?
We could publish the version without the device framework first, for
reference.
Post by Tetsuya Mukawa
Thanks,
Tetsuya
Post by Xie, Huawei
Post by Tetsuya Mukawa
2.1 virtio PMD configures the pseudo virtio device as how it does in
KVM guest enviroment.
2.2 Rather than using io instruction, virtio PMD uses IOAPI for IO
operation on the virtio-net PCI device.
2.3 The device simulation is responsible for device state machine
simulation.
2.4 The device simulation is responsbile for talking to vhost.
With this approach, we could minimize the virtio PMD modifications.
The virtio PMD is like configuring a real virtio-net PCI device.
Memory mapping?
A: QEMU could access the whole guest memory in KVM enviroment. We need
to fill the gap.
container maps the shared memory to container's virtual address space
and host maps it to host's virtual address space. There is a fixed
offset mapping.
Container creates shared vring based on the memory. Container also
creates mbuf memory pool based on the shared memroy.
In VHOST_SET_MEMORY_TABLE message, we send the memory mapping
information for the shared memory. As we require mbuf pool created on
the shared memory, and buffers are allcoated from the mbuf pools, dpdk
vhost could translate the GPA in vring desc to host virtual.
GPA or CVA in vring desc?
To ease the memory translation, rather than using GPA, here we use
CVA(container virtual address). This the tricky thing here.
1) virtio PMD writes vring's VFN rather than PFN to PFN register through
IOAPI.
2) device simulation framework will use VFN as PFN.
3) device simulation sends SET_VRING_ADDR with CVA.
4) virtio PMD fills vring desc with CVA of the mbuf data pointer rather
than GPA.
So when host sees the CVA, it could translates it to HVA(host virtual
address).
The virtio interface in container follows the vhost message format, and
is compliant with dpdk vhost implmentation, i.e, no dpdk vhost
modification is needed.
vHost isn't aware whether the incoming virtio comes from KVM guest or
container.
The pretty much covers the high level design. There are quite some low
level issues. For example, 32bit PFN is enough for KVM guest, since we
use 64bit VFN(virtual page frame number), trick is done here through a
special IOAPI.
In addition above, we might consider "namespace" kernel functionality.
Technically, it would not be a big problem, but related with security.
So it would be nice to take account.
There is no namespace concept here because we don't generate kernel
netdev devices. It might be usefull if we could extend our work to
support kernel netdev interface and assign to container's namespace.
Yes, it would be great if we could extend this to support both kernel
networking and user space networking.
No progress so far.
Post by Tetsuya Mukawa
Post by Xie, Huawei
Post by Tetsuya Mukawa
Regards,
Tetsuya
/huawei
Tetsuya Mukawa
2015-09-08 04:44:50 UTC
Post by Xie, Huawei
Post by Tetsuya Mukawa
Post by Xie, Huawei
Post by Tetsuya Mukawa
Hi Xie and Yanping,
May I ask you some questions?
It seems we are also developing an almost same one.
Good to know that we are tackling the same problem and have the similar
idea.
What is your status now? We had the POC running, and compliant with
dpdkvhost.
Interrupt like notification isn't supported.
We implemented vhost PMD first, so we just start implementing it.
Post by Xie, Huawei
Post by Tetsuya Mukawa
I read your mail, seems what we did are quite similar. Here i wrote a
quick mail to describe our design. Let me know if it is the same thing.
We don't have a high performance networking interface in container for
NFV. Current veth pair based interface couldn't be easily accelerated.
1. DPDK based virtio PMD driver in container.
2. device simulation framework in container.
3. dpdk(or kernel) vhost running in host.
How virtio is created?
A: There is no "real" virtio-pci device in container environment.
1). Host maintains pools of memories, and shares memory to container.
This could be accomplished through host share a huge page file to container.
2). Containers creates virtio rings based on the shared memory.
3). Container creates mbuf memory pools on the shared memory.
4) Container send the memory and vring information to vhost through
vhost message. This could be done either through ioctl call or vhost
user message.
How vhost message is sent?
A: There are two alternative ways to do this.
1) The customized virtio PMD is responsible for all the vring creation,
and vhost message sending.
Above is our approach so far.
It seems Yanping also takes this kind of approach.
We are using vhost-user functionality instead of using the vhost-net
kernel module.
Probably this is the difference between Yanping and us.
In my current implementation, the device simulation layer talks to "user
space" vhost through cuse interface. It could also be done through vhost
user socket. This isn't the key point.
Here vhost-user is kind of confusing, maybe user space vhost is more
accurate, either cuse or unix domain socket. :).
As for yanping, they are now connecting to vhost-net kernel module, but
they are also trying to connect to "user space" vhost. Correct me if wrong.
Yes, there is some difference between these two. Vhost-net kernel module
could directly access other process's memory, while using
vhost-user(cuse/user), we need do the memory mapping.
Post by Tetsuya Mukawa
BTW, we are going to submit a vhost PMD for DPDK-2.2.
This PMD is implemented on librte_vhost.
It allows DPDK application to handle a vhost-user(cuse) backend as a
normal NIC port.
This PMD should work with both Xie and Yanping approach.
(In the case of Yanping approach, we may need vhost-cuse)
2) We could do this through a lightweight device simulation framework.
The device simulation creates simple PCI bus. On the PCI bus,
virtio-net PCI devices are created. The device simulations provides
IOAPI for MMIO/IO access.
Does it mean you implemented a kernel module?
If so, do you still need vhost-cuse functionality to handle vhost
messages n userspace?
The device simulation is a library running in user space in container.
It is linked with DPDK app. It creates pseudo buses and virtio-net PCI
devices.
The virtio-container-PMD configures the virtio-net pseudo devices
through IOAPI provided by the device simulation rather than IO
instructions as in KVM.
Why we use device simulation?
We could create other virtio devices in container, and provide an common
way to talk to vhost-xx module.
Thanks for explanation.
At first reading, I thought the difference between approach1 and
approach2 is whether we need to implement a new kernel module, or not.
But I understand how you implemented.
Please let me explain our design more.
We might use a kind of similar approach to handle a pseudo virtio-net
device in DPDK.
(Anyway, we haven't finished implementing yet, this overview might have
some technical problems)
Step1. Separate virtio-net and vhost-user socket related code from QEMU,
then implement it as a separated program.
The program also has below features.
- Create a directory that contains almost same files like
/sys/bus/pci/device/<pci address>/*
(To scan these file located on outside sysfs, we need to fix EAL)
- This dummy device is driven by dummy-virtio-net-driver. This name is
specified by '<pci addr>/driver' file.
- Create a shared file that represents pci configuration space, then
mmap it, also specify the path in '<pci addr>/resource_path'
The program will be GPL, but it will be like a bridge on the shared
memory between virtio-net PMD and DPDK vhost backend.
Actually, It will work under virtio-net PMD, but we don't need to link it.
So I guess we don't have GPL license issue.
Step2. Fix pci scan code of EAL to scan dummy devices.
- To scan above files, extend pci_scan() of EAL.
Step3. Add a new kdrv type to EAL.
- To handle the 'dummy-virtio-net-driver', add a new kdrv type to EAL.
Step4. Implement pci_dummy_virtio_net_map/unmap().
- It will have almost same functionality like pci_uio_map(), but for
dummy virtio-net device.
- The dummy device will be mmaped using a path specified in '<pci
addr>/resource_path'.
Step5. Add a new compile option for virtio-net device to replace IO
functions.
- The IO functions of virtio-net PMD will be replaced by read() and
write() to access to the shared memory.
- Add notification mechanism to IO functions. This will be used when
write() to the shared memory is done.
(Not sure exactly, but probably we need it)
Does it make sense?
I guess Step1&2 is different from your approach, but the rest might be
similar.
Actually, we just need sysfs entries for a virtio-net dummy device, but
so far, I don't have a fine way to register them from user space without
loading a kernel module.
I don't quite get the details. Who will create those sysfs entries? A
kernel module right?
Hi Xie,

I don't create sysfs entries. I just create a directory that contains
files that look like sysfs entries,
and initialize EAL with not only sysfs but also the above directory.

In the quoted last sentence, I wanted to say that we just need files that
look like sysfs entries.
But I don't know a good way to create files under sysfs without loading
a kernel module.
This is why I try to create the additional directory.
Post by Xie, Huawei
The virtio-net is configured through read/write to sharing
memory(between host and guest), right?
Yes, I agree.
Post by Xie, Huawei
Where is shared vring created and shared memory created, on shared huge
page between host and guest?
The virtqueues (vrings) are on the guest hugepages.

Let me explain.
The guest container should have read/write access to a part of the
hugepage directory on the host.
(For example, /mnt/huge/container1/ is shared between host and guest.)
Also, the host and guest need to communicate through a unix domain socket.
(For example, host and guest can communicate using
"/tmp/container1/sock".)

If we can do the above, a virtio-net PMD on the guest can create
virtqueues (vrings) on its hugepages, and write this information to a
pseudo virtio-net device, which is a process created in the guest
container.
Then the pseudo virtio-net device sends it to the vhost-user backend (a
host DPDK application) through a unix domain socket.

So with my plan, there are 3 processes:
the DPDK applications on the host and the guest, plus a process that
works like the virtio-net device. (A rough sketch of that relay is shown
below.)
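A rough sketch of that relay, purely as an illustration: the message
layout and socket path are assumptions, and a real vhost-user backend
would expect the actual vhost-user protocol, including the hugepage file
descriptor passed via SCM_RIGHTS:

/* Sketch: the pseudo virtio-net device process forwards the vring layout
 * it received from the guest PMD to the host backend over a unix domain
 * socket. Illustrative only. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

struct vring_info {              /* assumed message layout */
    uint64_t desc_addr;
    uint64_t avail_addr;
    uint64_t used_addr;
    uint32_t num;
};

static int send_vring_info(const struct vring_info *info)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, "/tmp/container1/sock",
            sizeof(addr.sun_path) - 1);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    ssize_t n = write(fd, info, sizeof(*info));
    close(fd);
    return n == sizeof(*info) ? 0 : -1;
}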
Post by Xie, Huawei
Who will talk to dpdkvhost?
If we need to talk to a cuse device or the vhost-net kernel module, the
above pseudo virtio-net device could talk to it.
(But, so far, my target is only vhost-user.)
Post by Xie, Huawei
Post by Tetsuya Mukawa
This is because I need to change pci_scan() also.
It seems you have implemented a virtio-net pseudo device as BSD license.
If so, this kind of PMD would be nice to use it.
Currently it is based on native linux kvm tool.
Great, I hadn't noticed this option.
Post by Xie, Huawei
Post by Tetsuya Mukawa
In the case that it takes much time to implement some lost
functionalities like interrupt mode, using QEMU code might be an one of
options.
For interrupt mode, i plan to use eventfd for sleep/wake, have not tried
yet.
Post by Tetsuya Mukawa
Anyway, we just need a fine virtual NIC between containers and host.
So we don't hold to our approach and implementation.
Do you have comments to my implementation?
We could publish the version without the device framework first for
reference.
No, I don't. Could you please share it?
I am looking forward to seeing it.

Tetsuya
Xie, Huawei
2015-09-14 03:15:52 UTC
Post by Tetsuya Mukawa
Post by Xie, Huawei
Post by Tetsuya Mukawa
Post by Xie, Huawei
Post by Tetsuya Mukawa
Hi Xie and Yanping,
May I ask you some questions?
It seems we are also developing an almost same one.
Good to know that we are tackling the same problem and have the similar
idea.
What is your status now? We had the POC running, and compliant with
dpdkvhost.
Interrupt like notification isn't supported.
We implemented vhost PMD first, so we just start implementing it.
Post by Xie, Huawei
Post by Tetsuya Mukawa
I read your mail, seems what we did are quite similar. Here i wrote a
quick mail to describe our design. Let me know if it is the same thing.
We don't have a high performance networking interface in container for
NFV. Current veth pair based interface couldn't be easily accelerated.
1. DPDK based virtio PMD driver in container.
2. device simulation framework in container.
3. dpdk(or kernel) vhost running in host.
How virtio is created?
A: There is no "real" virtio-pci device in container environment.
1). Host maintains pools of memories, and shares memory to container.
This could be accomplished through host share a huge page file to container.
2). Containers creates virtio rings based on the shared memory.
3). Container creates mbuf memory pools on the shared memory.
4) Container send the memory and vring information to vhost through
vhost message. This could be done either through ioctl call or vhost
user message.
How vhost message is sent?
A: There are two alternative ways to do this.
1) The customized virtio PMD is responsible for all the vring creation,
and vhost message sending.
Above is our approach so far.
It seems Yanping also takes this kind of approach.
We are using vhost-user functionality instead of using the vhost-net
kernel module.
Probably this is the difference between Yanping and us.
In my current implementation, the device simulation layer talks to "user
space" vhost through cuse interface. It could also be done through vhost
user socket. This isn't the key point.
Here vhost-user is kind of confusing, maybe user space vhost is more
accurate, either cuse or unix domain socket. :).
As for yanping, they are now connecting to vhost-net kernel module, but
they are also trying to connect to "user space" vhost. Correct me if wrong.
Yes, there is some difference between these two. Vhost-net kernel module
could directly access other process's memory, while using
vhost-user(cuse/user), we need do the memory mapping.
Post by Tetsuya Mukawa
BTW, we are going to submit a vhost PMD for DPDK-2.2.
This PMD is implemented on librte_vhost.
It allows DPDK application to handle a vhost-user(cuse) backend as a
normal NIC port.
This PMD should work with both Xie and Yanping approach.
(In the case of Yanping approach, we may need vhost-cuse)
2) We could do this through a lightweight device simulation framework.
The device simulation creates simple PCI bus. On the PCI bus,
virtio-net PCI devices are created. The device simulations provides
IOAPI for MMIO/IO access.
Does it mean you implemented a kernel module?
If so, do you still need vhost-cuse functionality to handle vhost
messages n userspace?
The device simulation is a library running in user space in container.
It is linked with DPDK app. It creates pseudo buses and virtio-net PCI
devices.
The virtio-container-PMD configures the virtio-net pseudo devices
through IOAPI provided by the device simulation rather than IO
instructions as in KVM.
Why we use device simulation?
We could create other virtio devices in container, and provide an common
way to talk to vhost-xx module.
Thanks for explanation.
At first reading, I thought the difference between approach1 and
approach2 is whether we need to implement a new kernel module, or not.
But I understand how you implemented.
Please let me explain our design more.
We might use a kind of similar approach to handle a pseudo virtio-net
device in DPDK.
(Anyway, we haven't finished implementing yet, this overview might have
some technical problems)
Step1. Separate virtio-net and vhost-user socket related code from QEMU,
then implement it as a separated program.
The program also has below features.
- Create a directory that contains almost same files like
/sys/bus/pci/device/<pci address>/*
(To scan these file located on outside sysfs, we need to fix EAL)
- This dummy device is driven by dummy-virtio-net-driver. This name is
specified by '<pci addr>/driver' file.
- Create a shared file that represents pci configuration space, then
mmap it, also specify the path in '<pci addr>/resource_path'
The program will be GPL, but it will be like a bridge on the shared
memory between virtio-net PMD and DPDK vhost backend.
Actually, It will work under virtio-net PMD, but we don't need to link it.
So I guess we don't have GPL license issue.
Step2. Fix pci scan code of EAL to scan dummy devices.
- To scan above files, extend pci_scan() of EAL.
Step3. Add a new kdrv type to EAL.
- To handle the 'dummy-virtio-net-driver', add a new kdrv type to EAL.
Step4. Implement pci_dummy_virtio_net_map/unmap().
- It will have almost same functionality like pci_uio_map(), but for
dummy virtio-net device.
- The dummy device will be mmaped using a path specified in '<pci
addr>/resource_path'.
Step5. Add a new compile option for virtio-net device to replace IO
functions.
- The IO functions of virtio-net PMD will be replaced by read() and
write() to access to the shared memory.
- Add notification mechanism to IO functions. This will be used when
write() to the shared memory is done.
(Not sure exactly, but probably we need it)
Does it make sense?
I guess Step1&2 is different from your approach, but the rest might be
similar.
Actually, we just need sysfs entries for a virtio-net dummy device, but
so far, I don't have a fine way to register them from user space without
loading a kernel module.
I don't quite get the details. Who will create those sysfs entries? A
kernel module right?
Hi Xie,
I don't create sysfs entries. Just create a directory that contains
files looks like sysfs entries.
And initialize EAL with not only sysfs but also the above directory.
In quoted last sentence, I wanted to say we just needed files looks like
sysfs entries.
But I don't know a good way to create files under sysfs without loading
kernel module.
This is because I try to create the additional directory.
Post by Xie, Huawei
The virtio-net is configured through read/write to sharing
memory(between host and guest), right?
Yes, I agree.
Post by Xie, Huawei
Where is shared vring created and shared memory created, on shared huge
page between host and guest?
The vritqueues(vrings) are on guest hugepage.
Let me explain.
Guest container should have read/write access to a part of hugepage
directory on host.
(For example, /mnt/huge/conainer1/ is shared between host and guest.)
Also host and guest needs to communicate through a unix domain socket.
(For example, host and guest can communicate with using
"/tmp/container1/sock")
If we can do like above, a virtio-net PMD on guest can creates
virtqueues(vrings) on it's hugepage, and writes these information to a
pseudo virtio-net device that is a process created in guest container.
Then the pseudo virtio-net device sends it to vhost-user backend(host
DPDK application) through a unix domain socket.
So with my plan, there are 3 processes.
DPDK applications on host and guest, also a process that works like
virtio-net device.
Post by Xie, Huawei
Who will talk to dpdkvhost?
If we need to talk to a cuse device or the vhost-net kernel module, an
above pseudo virtio-net device could talk to.
(But, so far, my target is only vhost-user.)
Post by Xie, Huawei
Post by Tetsuya Mukawa
This is because I need to change pci_scan() also.
It seems you have implemented a virtio-net pseudo device as BSD license.
If so, this kind of PMD would be nice to use it.
Currently it is based on native linux kvm tool.
Great, I hadn't noticed this option.
Post by Xie, Huawei
Post by Tetsuya Mukawa
In the case that it takes much time to implement some lost
functionalities like interrupt mode, using QEMU code might be an one of
options.
For interrupt mode, i plan to use eventfd for sleep/wake, have not tried
yet.
Post by Tetsuya Mukawa
Anyway, we just need a fine virtual NIC between containers and host.
So we don't hold to our approach and implementation.
Do you have comments to my implementation?
We could publish the version without the device framework first for
reference.
No I don't have. Could you please share it?
I am looking forward to seeing it.
OK, we are removing the device framework. We hope to publish it within
one month.
Post by Tetsuya Mukawa
Tetsuya