Discussion:
[dpdk-dev] Best example for showing throughput?
Patrick Mahan
2013-05-24 14:11:02 UTC
Permalink
Good morning,

I have been playing with this code now for about 2 weeks. I posted
earlier about being unable to get it to work on Fedora 14, but have
it working on CentOS 6.4. Here is my hardware -

Intel Xeon E5-2690 (8 physical, 16 virtual)
64 Gbyte DDR3 memory
Intel 82599EB-SPF dual port 10GE interface
CentOS 6.4 (2.6.32-358.6.1.el6.x86_64)

The 82599 is in a 16x PCI-e slot.

I have it attached to an IXIA box. I have been running the app 'testpmd'
in iofwd mode with 2K rx/tx descriptors and 512 burst/mbcache. I have been
varying the # of queues and unfortunately, I am not seeing full line rate.

I have CentOS booted to runlevel 3 (no X windows) and have turned off (I think)
all of the background processes I can.

I am seeing about 20-24% droppage on the receive side. The number of
queues doesn't seem to matter.

Question 1: Is 'testpmd' the best application for this type of testing? If not,
which program? Or do I need to roll my own?

Question 2: I have blacklisted the Intel i350 ports on the motherboard and am
using ssh to access the platform. Could this be affecting the test?

Thoughts?


Thanks,

Patrick
Thomas Monjalon
2013-05-24 14:41:38 UTC
Permalink
Hello,
Post by Patrick Mahan
Intel Xeon E5-2690 (8 physical, 16 virtual)
How many CPU sockets have you ?
Post by Patrick Mahan
64 Gbyte DDR3 memory
Intel 82599EB-SPF dual port 10GE interface
CentOS 6.4 (2.6.32-358.6.1.el6.x86_64)
The 82599 is in a 16x PCI-e slot.
Check the datasheet of your motherboard.
Are you sure it is wired as a 16x PCI-e ?
Is it connected to the right NUMA node ?
Post by Patrick Mahan
I have it attached to an IXIA box. I have been running the app 'testpmd'
in iofwd mode with 2K rx/tx descriptors and 512 burst/mbcache. I have been
varying the # of queues and unfortunately, I am not seeing full line rate.
What is your command line ?
Post by Patrick Mahan
I am seeing about 20-24% droppage on the receive side. The number of
queues doesn't seem to matter.
If queues are polled by different cores, it should matter.
Post by Patrick Mahan
Question 1: Is 'testpmd' the best application for this type of testing? If
not, which program? Or do I need to roll my own?
testpmd is the right application for performance benchmark.
It is also possible to use examples l2fwd/l3fwd but you should keep testpmd.
Post by Patrick Mahan
Question 2: I have blacklisted the Intel i350 ports on the motherboard and
am using ssh to access the platform. Could this be affecting the test?
You mean i350 is used for ssh ? It shouldn't significantly affect your test.
--
Thomas
Thomas Monjalon
2013-05-24 15:45:25 UTC
Permalink
Post by Thomas Monjalon
Post by Patrick Mahan
Intel Xeon E5-2690 (8 physical, 16 virtual)
How many CPU sockets have you ?
Post by Patrick Mahan
64 Gbyte DDR3 memory
Intel 82599EB-SPF dual port 10GE interface
CentOS 6.4 (2.6.32-358.6.1.el6.x86_64)
The 82599 is in a 16x PCI-e slot.
Check the datasheet of your motherboard.
Are you sure it is wired as a 16x PCI-e ?
Is it connected to the right NUMA node ?
Post by Patrick Mahan
I have it attached to an IXIA box.
Which packet size are you sending with your packet generator ?
In the case of 64-byte packets (including the Ethernet CRC), each frame
occupies (64+20)*8 = 672 bits on the wire, the extra 20 bytes being the
preamble/SFD and the inter-frame gap.
So line rate is 10000/672 = 14.88 Mpps.
This bandwidth should be supported by your 82599 NIC.
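If you want the same arithmetic for other frame sizes, here is a quick
standalone sketch (nothing DPDK-specific; the frame sizes include the
4-byte CRC, and the extra 20 bytes per frame are preamble/SFD plus the
inter-frame gap):

/* Back-of-the-envelope 10GbE line-rate calculator. Frame sizes include
 * the 4-byte Ethernet CRC; the wire additionally carries 8 bytes of
 * preamble/SFD and a 12-byte inter-frame gap per frame. */
#include <stdio.h>

int main(void)
{
    const double link_bps = 10e9;        /* 10 Gbit/s */
    const int wire_overhead = 8 + 12;    /* preamble/SFD + IFG, in bytes */
    const int sizes[] = { 64, 128, 256, 512, 1024, 1518 };

    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        double bits_on_wire = (sizes[i] + wire_overhead) * 8.0;
        printf("%4d-byte frames: %.2f Mpps line rate\n",
               sizes[i], link_bps / bits_on_wire / 1e6);
    }
    return 0;
}

For 64-byte frames this prints 14.88 Mpps, the number above.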

Are you sending and receiving on the 2 ports at the same time ?
Forwarding in the 2 directions is equivalent to double the bandwidth.
Maybe that 14.88*2 = 29.76 Mpps is too much for your hardware.

You could also try with 2 ports on 2 different NICs.
Post by Thomas Monjalon
Post by Patrick Mahan
I have been running the app 'testpmd'
in iofwd mode with 2K rx/tx descriptors and 512 burst/mbcache. I have
been varying the # of queues and unfortunately, I am not seeing full
line rate.
What is your command line ?
Post by Patrick Mahan
I am seeing about 20-24% droppage on the receive side. The number of
queues doesn't seem to matter.
If queues are polled by different cores, it should matter.
Post by Patrick Mahan
Question 1: Is 'testpmd' the best application for this type of testing?
If not, which program? Or do I need to roll my own?
testpmd is the right application for performance benchmark.
It is also possible to use examples l2fwd/l3fwd but you should keep testpmd.
Post by Patrick Mahan
Question 2: I have blacklisted the Intel i350 ports on the motherboard
and am using ssh to access the platform. Could this be affecting the
test?
You mean i350 is used for ssh ? It shouldn't significantly affect your test.
--
Thomas
Patrick Mahan
2013-05-24 18:51:09 UTC
Permalink
Post by Thomas Monjalon
Post by Thomas Monjalon
Post by Patrick Mahan
Intel Xeon E5-2690 (8 physical, 16 virtual)
How many CPU sockets have you ?
Post by Patrick Mahan
64 Gbyte DDR3 memory
Intel 82599EB-SPF dual port 10GE interface
CentOS 6.4 (2.6.32-358.6.1.el6.x86_64)
The 82599 is in a 16x PCI-e slot.
Check the datasheet of your motherboard.
Are you sure it is wired as a 16x PCI-e ?
Is it connected to the right NUMA node ?
Post by Patrick Mahan
I have it attached to an IXIA box.
Which packet size are you sending with your packet generator ?
In the case of 64-byte packets (including the Ethernet CRC), each frame
occupies (64+20)*8 = 672 bits on the wire, the extra 20 bytes being the
preamble/SFD and the inter-frame gap.
So line rate is 10000/672 = 14.88 Mpps.
This bandwidth should be supported by your 82599 NIC.
Yes, the Ixia is sending the standard 64 byte packet. The stats show a send rate of 14.880 Mpps.
Post by Thomas Monjalon
Are you sending and receiving on the 2 ports at the same time ?
Forwarding in the 2 directions is equivalent to double the bandwidth.
Maybe that 14.88*2 = 29.76 Mpps is too much for your hardware.
Yes, I am running traffic both ways. Interestingly, the amount of drops seems consistent in both directions. This makes sense since testpmd is spinning off a thread to read from each input queue.
Post by Thomas Monjalon
You could also try with 2 ports on 2 different NICs.
Hmmm, not sure if I can lay hands on another 82599 card. This one is a loaner.

Thanks,

Patrick
Post by Thomas Monjalon
Post by Thomas Monjalon
Post by Patrick Mahan
I have been running the app 'testpmd'
in iofwd mode with 2K rx/tx descriptors and 512 burst/mbcache. I have
been varying the # of queues and unfortunately, I am not seeing full
line rate.
What is your command line ?
Post by Patrick Mahan
I am seeing about 20-24% droppage on the receive side. The number of
queues doesn't seem to matter.
If queues are polled by different cores, it should matter.
Post by Patrick Mahan
Question 1: Is 'testpmd' the best application for this type of testing?
If not, which program? Or do I need to roll my own?
testpmd is the right application for performance benchmark.
It is also possible to use examples l2fwd/l3fwd but you should keep testpmd.
Post by Patrick Mahan
Question 2: I have blacklisted the Intel i350 ports on the motherboard
and am using ssh to access the platform. Could this be affecting the
test?
You mean i350 is used for ssh ? It shouldn't significantly affect your test.
--
Thomas
Damien Millescamps
2013-05-25 19:23:47 UTC
Permalink
Post by Patrick Mahan
Post by Thomas Monjalon
Are you sending and receiving on the 2 ports at the same time ?
Forwarding in the 2 directions is equivalent to double the bandwidth.
Maybe that 14.88*2 = 29.76 Mpps is too much for your hardware.
Yes, I am running traffic both ways. Interestingly, the amount of drops seems consistent in both directions. This makes sense since testpmd is spinning off a thread to read from each input queue.
Hi Patrick,

If you are using both ports of the same Niantic (82599) at the same
time, then you won't be able to reach line-rate on both ports. It is a
limitation of the PLX bridge on the board. The expected performance on
port 1 when port 0 is used at line-rate should be around 75% of
line-rate, from what I know.
Since you are forwarding packets from one port to the other, the
packets dropped because of this will impact performance in both
directions: what is dropped on port 1 won't be forwarded, for obvious
reasons, and some of the packets from port 0 will be lost while trying
to transmit them on port 1. Eventually you will end up with the lower
performance of the two ports, which seems consistent with the numbers
you report.

Hope this helps,
--
Damien
Damien Millescamps
2013-05-25 20:59:04 UTC
Permalink
Post by Damien Millescamps
Hi Patrick,
If you are using both ports of the same Niantic (82599) at the same
time, then you won't be able to reach line-rate on both ports.
For a better explanation, you can refer to this post from Alexander
Duyck from Intel on the linux network mailing list:

http://permalink.gmane.org/gmane.linux.network/207295

Regards,
--
Damien Millescamps
Patrick Mahan
2013-05-28 19:15:35 UTC
Permalink
Post by Damien Millescamps
Post by Damien Millescamps
Hi Patrick,
If you are using both ports of the same Niantic (82599) at the same
time, then you won't be able to reach line-rate on both ports.
For a better explanation, you can refer to this post from Alexander
http://permalink.gmane.org/gmane.linux.network/207295
Interesting article.

Okay, I attempted to find the bus rate of this card this morning (the output of lspci is below).

This shows me the card is capable of 5 GT/s raw, which works out to 4 Gbps/lane after encoding; with 8 lanes enabled that should, theoretically, give 32 Gbps. Reading the article, the suggestion is that with the smaller packets the overhead contributes more than 50% of the PCIe bus traffic. Meanwhile, I'm seeing a forwarding rate of only about 12.282 Mpps with ~17% drops in only one direction (RSS enabled, 7 queues enabled with 6 queues used).

So the overhead cost is almost 70%?

Can this ever do line rate? Under what conditions? It has been my experience that the industry standard is testing throughput using these 64 byte packets.
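For reference, here is the back-of-the-envelope arithmetic I am basing
those numbers on (my own rough figures, not anything from the DPDK tree):

/* Rough PCIe Gen2 x8 sanity check: 5 GT/s per lane with 8b/10b encoding
 * leaves 4 Gbit/s of usable bits per lane, before any TLP/DLLP protocol
 * overhead; the traffic figures below count Ethernet frame data only. */
#include <stdio.h>

int main(void)
{
    const double raw_gbps = 5.0 * (8.0 / 10.0) * 8;  /* ~32 Gbit/s per direction */

    /* Packet data alone (no descriptors, no TLP headers) crossing the bus. */
    double at_line_rate = 14.88e6 * 64 * 8 / 1e9;    /* ~7.6 Gbit/s */
    double observed     = 12.282e6 * 64 * 8 / 1e9;   /* ~6.3 Gbit/s */

    printf("x8 Gen2 usable: %.1f Gbit/s per direction\n", raw_gbps);
    printf("64B data at 14.88 Mpps: %.1f Gbit/s, at 12.282 Mpps: %.1f Gbit/s\n",
           at_line_rate, observed);
    return 0;
}

So the raw packet data is nowhere near the 32 Gbps figure; the question is how much of the bus the per-packet overhead eats.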

Output of 'lspci -vvv -s 03:00.0':

03:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 48
Region 0: Memory at d9080000 (64-bit, prefetchable) [size=512K]
Region 2: I/O ports at ecc0 [size=32]
Region 4: Memory at d91f8000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <1us, L1 <8us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Device Serial Number 00-1b-21-ff-ff-6b-8d-d4
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 64, Total VFs: 64, Number of VFs: 64, Function Dependency Link: 00
VF offset: 128, stride: 2, Device ID: 10ed
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 00000000d0000000 (64-bit, non-prefetchable)
Region 3: Memory at 00000000d0100000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: igb_uio
Kernel modules: ixgbe

Thanks,

Patrick
Post by Damien Millescamps
Regards,
--
Damien Millescamps
Damien Millescamps
2013-05-29 14:07:28 UTC
Permalink
Post by Patrick Mahan
So the overhead cost is almost 70%?
Can this ever do line rate? Under what conditions? It has been my experience that the industry standard is testing throughput using these 64 byte packets.
This overhead can actually be explained considering the PCIe 2.1[1]
standard and 82599 Specifications[2].

To sum up, for each packet the adapter first needs to send a read
request for a 16-byte packet descriptor (cf. [2]), to which it will
receive a read completion. Then the adapter must issue either a read or
a write request to the packet's physical address for the size of the
packet. The frame format of a PCIe read or write request is composed of
a Start of frame, a Sequence Number, a Header, the Data, an LCRC and an
End of frame (cf. [1]). The overhead we are talking about here is more
than 16 bytes per PCIe message. In addition to that, the PCIe physical
layer uses a 10-bits-per-byte (8b/10b) encoding, which adds further to
the overhead.
Now if you apply this to a 64-byte packet, you should notice that the
overhead is way above 70% (4 messages plus descriptor and data size,
times the 10b/8b encoding, which should come out around 83% if I didn't
miss anything).

However, if we end up with a limited overhead it is because the 82599
implements thresholds in order to be able to batch the packet descriptor
reading / writing back (cf. [2] WTHRESH for example) thus reducing the
overhead to a little more than 70% with the default DPDK parameters.

You can achieve line-rate for 64-byte packets on each port
independently. When using both ports simultaneously, you can achieve
line-rate with packet sizes above 64 bytes. In the post to which I
redirected you, Alexander talked about 256-byte packets. But if you take
the time to compute the total throughput needed on the PCIe bus as a
function of the packet size, you'll probably end up with a minimum
packet size lower than 256B to achieve line-rate simultaneously on both
ports.
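As a rough illustration of that last computation, here is a ballpark
model (the 24-byte per-TLP overhead, the descriptor batching factor and
the usable-bus margin are my own approximations, not values taken from
the datasheet):

/* Very rough model of the PCIe traffic per forwarded packet and per PCIe
 * direction: one data TLP (the RX write one way, the TX read completion
 * the other) plus amortized descriptor fetch / write-back traffic.
 * All the constants below are approximations. */
#include <stdio.h>

int main(void)
{
    const double bus_gbps = 32.0 * 0.90; /* x8 Gen2 per direction, minus margin */
    const int tlp_ovh = 24;              /* framing + sequence + header + LCRC, approx. */
    const int desc = 16;                 /* descriptor size (cf. [2]) */
    const int batch = 8;                 /* descriptors per batched TLP (guess) */

    for (int pkt = 64; pkt <= 256; pkt *= 2) {
        double pps = 1e10 / ((pkt + 20) * 8.0);                /* line rate, one port */
        double bytes = pkt + tlp_ovh                           /* packet data TLP */
                     + 2.0 * (desc + (double)tlp_ovh / batch); /* descriptor traffic */
        double need = 2.0 * pps * bytes * 8.0 / 1e9;           /* both ports at line rate */
        printf("%3dB packets: ~%.1f Gbit/s per PCIe direction (bus ~%.1f)\n",
               pkt, need, bus_gbps);
    }
    return 0;
}

With these assumptions, 64-byte packets on both ports slightly exceed the
bus while 128-byte packets already fit, so the real minimum is indeed
somewhere well below 256B.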

[1]
http://www.pcisig.com/members/downloads/specifications/pciexpress/PCI_Express_Base_r2_1_04Mar09.pdf
[2]
http://www.intel.com/content/dam/doc/datasheet/82599-10-gbe-controller-datasheet.pdf
--
Damien Millescamps
Patrick Mahan
2013-05-29 18:24:51 UTC
Permalink
Post by Damien Millescamps
Post by Patrick Mahan
So the overhead cost is almost 70%?
Can this ever do line rate? Under what conditions? It has been my experience that the industry standard is testing throughput using these 64 byte packets.
This overhead can actually be explained considering the PCIe 2.1[1]
standard and 82599 Specifications[2].
Damien,

Thanks very much for this explanation of the overhead costs associated with the 64-byte packet size (and the references). I have just recently started looking at what it takes to do 10GE using off-the-shelf components, and having the PCIe overhead explained so clearly helps hugely!

Patrick
Post by Damien Millescamps
To sum up, for each packet the adapter first needs to send a read
request for a 16-byte packet descriptor (cf. [2]), to which it will
receive a read completion. Then the adapter must issue either a read or
a write request to the packet's physical address for the size of the
packet. The frame format of a PCIe read or write request is composed of
a Start of frame, a Sequence Number, a Header, the Data, an LCRC and an
End of frame (cf. [1]). The overhead we are talking about here is more
than 16 bytes per PCIe message. In addition to that, the PCIe physical
layer uses a 10-bits-per-byte (8b/10b) encoding, which adds further to
the overhead.
Now if you apply this to a 64-byte packet, you should notice that the
overhead is way above 70% (4 messages plus descriptor and data size,
times the 10b/8b encoding, which should come out around 83% if I didn't
miss anything).
However, if we end up with a limited overhead it is because the 82599
implements thresholds in order to be able to batch the packet descriptor
reading / writing back (cf. [2] WTHRESH for example) thus reducing the
overhead to a little more than 70% with the default DPDK parameters.
You can achieve line-rate for 64-byte packets on each port
independently. When using both ports simultaneously, you can achieve
line-rate with packet sizes above 64 bytes. In the post to which I
redirected you, Alexander talked about 256-byte packets. But if you take
the time to compute the total throughput needed on the PCIe bus as a
function of the packet size, you'll probably end up with a minimum
packet size lower than 256B to achieve line-rate simultaneously on both
ports.
[1]
http://www.pcisig.com/members/downloads/specifications/pciexpress/PCI_Express_Base_r2_1_04Mar09.pdf
[2]
http://www.intel.com/content/dam/doc/datasheet/82599-10-gbe-controller-datasheet.pdf
--
Damien Millescamps
Patrick Mahan
2013-05-24 18:32:43 UTC
Permalink
Post by Thomas Monjalon
Hello,
Post by Patrick Mahan
Intel Xeon E5-2690 (8 physical, 16 virtual)
How many CPU sockets have you ?
This is a Dell PowerEdge T620; it has two sockets, but only one has a CPU in it.
Post by Thomas Monjalon
Post by Patrick Mahan
64 Gbyte DDR3 memory
Intel 82599EB-SPF dual port 10GE interface
CentOS 6.4 (2.6.32-358.6.1.el6.x86_64)
The 82599 is in a 16x PCI-e slot.
Check the datasheet of your motherboard.
Are you sure it is wired as a 16x PCI-e ?
As far as I can tell from the specs on the Dell site - www.dell.com/us/business/p/poweredge-t620/pd
Post by Thomas Monjalon
Is it connected to the right NUMA node ?
Yes, it's in the slot labeled PCIE_G3_x16 (cpu1). The interfaces show up as p2p1 and p2p2.
Post by Thomas Monjalon
Post by Patrick Mahan
I have it attached to an IXIA box. I have been running the app 'testpmd'
in iofwd mode with 2K rx/tx descriptors and 512 burst/mbcache. I have been
varying the # of queues and unfortunately, I am not seeing full line rate.
What is your command line ?
sudo build/app/testpmd -b 0000:03:00.0 -b 0000:03:00.1 -c<coremask> -n3 -- --nb-cores=<ncores> --nb-ports=2 --rxd=2048 --rxd=2048 --mbcache=512 --burst=512 --rxd=<nqueues> --txq=<nqueues>

Where I am using the following to determine cores, coremask and nqueues:

ncores = nqueues * 2 // the 2 here is actually the number of ports being tested
coremask = (1 << (ncores + 1)) - 1

So for say, 3 rx/tx queues -

ncores = 3 * 2 = 6
coremask = (1 << (6 + 1)) - 1 = 127 (0x7f)
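Spelled out as a quick standalone helper (my own sketch, not anything
from testpmd itself), the calculation is:

/* Mirror of the coremask arithmetic above: one forwarding lcore per queue
 * per port, plus one extra bit in the mask for testpmd's main lcore. */
#include <stdio.h>

int main(void)
{
    const int nports = 2;

    for (int nqueues = 1; nqueues <= 7; nqueues++) {
        int ncores = nqueues * nports;
        unsigned long coremask = (1UL << (ncores + 1)) - 1;
        printf("nqueues=%d -> nb-cores=%d, coremask=0x%lx\n",
               nqueues, ncores, coremask);
    }
    return 0;
}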

Now that I remember it I had to fix testpmd to allocate enough mbufs.
Post by Thomas Monjalon
Post by Patrick Mahan
I am seeing about 20-24% droppage on the receive side. The number of
queues doesn't seem to matter.
If queues are polled by different cores, it should matter.
I assume you mean different physical cores, yes?

There is only one physical CPU, but each 'forwarding' thread is on a separate core.
Post by Thomas Monjalon
Post by Patrick Mahan
Question 1: Is 'testpmd' the best application for this type of testing? If
not, which program? Or do I need to roll my own?
testpmd is the right application for performance benchmark.
It is also possible to use examples l2fwd/l3fwd but you should keep testpmd.
I am just starting with testpmd to get a feel for raw throughput. I want to test l2 and l3 soon but I may lose access to the Ixia.
Post by Thomas Monjalon
Post by Patrick Mahan
Question 2: I have blacklisted the Intel i350 ports on the motherboard and
am using ssh to access the platform. Could this be affecting the test?
You mean i350 is used for ssh ? It shouldn't significantly affect your test.
Okay, I noticed that they get scanned by the DPDK PCI layer.

Thanks,

Patrick
Post by Thomas Monjalon
--
Thomas
Olivier MATZ
2013-05-24 20:03:19 UTC
Permalink
Hello Patrick,
Post by Patrick Mahan
sudo build/app/testpmd -b 0000:03:00.0 -b 0000:03:00.1 -c<coremask> -n3
-- --nb-cores=<ncores> --nb-ports=2 --rxd=2048 --rxd=2048 --mbcache=512
--burst=512 --rxd=<nqueues> --txq=<nqueues>
I guess it's a typo, but just in case, I think you mean "rxq" instead
of "rxd" at the end of the command line?

You can check that all is properly configured by using the interactive
mode of testpmd, and display the configuration with the following
commands:

show config rxtx
show config cores
show config fwd
...

Regards,
Olivier
Patrick Mahan
2013-05-24 20:44:51 UTC
Permalink
Yes it is a typo. I'm away from the box and was referring to my notes.

Thanks,

Patrick

Sent from my iPad
Post by Olivier MATZ
Hello Patrick,
Post by Patrick Mahan
sudo build/app/testpmd -b 0000:03:00.0 -b 0000:03:00.1 -c<coremask> -n3
-- --nb-cores=<ncores> --nb-ports=2 --rxd=2048 --rxd=2048 --mbcache=512
--burst=512 --rxd=<nqueues> --txq=<nqueues>
I guess it's a typo, but just in case, I think you mean "rxq" instead
of "rxd" at the end of the command line?
You can check that all is properly configured by using the interactive
mode of testpmd, and display the configuration with the following commands:
show config rxtx
show config cores
show config fwd
...
Regards,
Olivier