Discussion:
[dpdk-dev] rte_prefetch0() performance info
Parikshith Chowdaiah
2015-03-05 08:46:23 UTC
Hi all,
I have a question about the usage of the rte_prefetch0() function. In one of
the sample files, we have an implementation like:

/* Prefetch first packets */
for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
    rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
}

/* Prefetch and forward already prefetched packets */
for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
    rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j + PREFETCH_OFFSET], void *));
    l3fwd_simple_forward(pkts_burst[j], portid, qconf);
}

/* Forward remaining prefetched packets */
for (; j < nb_rx; j++) {
    l3fwd_simple_forward(pkts_burst[j], portid, qconf);
}


Here rte_prefetch0() is called across several split loops. I would like some
insight into whether this performs better than a single prefetch loop over
the whole burst, such as:



for (j = 0; j < nb_rx; j++) {
    rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j], void *));
}


Also, how often does rte_prefetch0() need to be called for the same packet?
And is there any mechanism to prefetch in bulk, e.g. for 64 packets at once?


thanks

Parikshith
Anuj Kalia
2015-03-05 08:51:01 UTC
Hi Parikshith.

A CPU core can have only a limited number of prefetches in flight (around 10).
So if you issue 64 (or nb_rx > 10) prefetches in quick succession, you'll
stall on memory access. The main idea here is to overlap the prefetches for
some packets with computation on other packets.

This paper explains it in the context of hash tables, but the idea is
similar: https://www.cs.cmu.edu/~binfan/papers/conext13_cuckooswitch.pdf
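
To make the overlap concrete, here is a minimal standalone sketch of the same three-phase loop. It is not DPDK code: `struct pkt`, `prefetch0()`, `process_packet()` and `process_burst()` are hypothetical stand-ins for the mbuf, rte_prefetch0() and l3fwd_simple_forward(), the GCC/Clang builtin `__builtin_prefetch` is assumed to be available, and the value 3 for PREFETCH_OFFSET is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same name as in the quoted sample; the value 3 is an assumption. */
#define PREFETCH_OFFSET 3

/* Hypothetical stand-in for an mbuf: a pointer to packet data and its length. */
struct pkt {
    const uint8_t *data;
    size_t len;
};

/* Stand-in for rte_prefetch0(): prefetch for reading, high temporal locality. */
static inline void prefetch0(const void *p)
{
    __builtin_prefetch(p, 0, 3);
}

/* Hypothetical per-packet work standing in for l3fwd_simple_forward(). */
static uint32_t process_packet(const struct pkt *p)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < p->len; i++)
        sum += p->data[i];
    return sum;
}

/* Process a burst while staying at most PREFETCH_OFFSET packets ahead. */
static uint32_t process_burst(struct pkt *const *pkts, int nb_rx)
{
    uint32_t total = 0;
    int j;

    assert(pkts != NULL || nb_rx == 0);

    /* Warm-up: start fetching the first few packets. */
    for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
        prefetch0(pkts[j]->data);

    /* Steady state: one new prefetch is issued per packet processed,
     * so the work on packet j overlaps the fetch of packet j + offset. */
    for (j = 0; j < nb_rx - PREFETCH_OFFSET; j++) {
        prefetch0(pkts[j + PREFETCH_OFFSET]->data);
        total += process_packet(pkts[j]);
    }

    /* Drain: the remaining packets were already prefetched above. */
    for (; j < nb_rx; j++)
        total += process_packet(pkts[j]);

    return total;
}
```

The result is the same as a plain loop over the burst; only the timing of the memory accesses changes. The offset is a tuning knob: too small and the data may not have arrived by the time it is needed, too large and you run into the in-flight limit mentioned above.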

--Anuj