Discussion:
[dpdk-dev] [Bug] Static constructors considered evil
Jean Tourrilhes
2016-09-22 20:46:37 UTC
Permalink
Hi,

Expecting static constructors to match between the primary and
the secondary is ill advised and putting yourself at the mercy of the
linker magic. In this case, both primary and secondary were compiled
using the same DPDK directory and exact same libdpdk.a.

Config :
------
CONFIG_RTE_LIBRTE_HASH=y
CONFIG_RTE_LIBRTE_LPM=y
CONFIG_RTE_LIBRTE_ACL=y
CONFIG_RTE_LIBRTE_DISTRIBUTOR=y
CONFIG_RTE_LIBRTE_REORDER=y

Primary :
-------
EAL: Registered tailq: RTE_ACL
EAL: Registered tailq: RTE_HASH
EAL: Registered tailq: RTE_FBK_HASH
EAL: Registered tailq: RTE_MEMPOOL
EAL: Registered tailq: RTE_RING
EAL: Registered tailq: VFIO_RESOURCE_LIST
EAL: Registered tailq: UIO_RESOURCE_LIST
Tailq 0: qname:<RTE_ACL>, tqh_first:(nil), tqh_last:0x7feffffff41c
Tailq 1: qname:<RTE_HASH>, tqh_first:(nil), tqh_last:0x7feffffff44c
Tailq 2: qname:<RTE_FBK_HASH>, tqh_first:(nil), tqh_last:0x7feffffff47c
Tailq 3: qname:<RTE_MEMPOOL>, tqh_first:(nil), tqh_last:0x7feffffff4ac
Tailq 4: qname:<RTE_RING>, tqh_first:(nil), tqh_last:0x7feffffff4dc
Tailq 5: qname:<VFIO_RESOURCE_LIST>, tqh_first:(nil), tqh_last:0x7feffffff50c
Tailq 6: qname:<UIO_RESOURCE_LIST>, tqh_first:(nil), tqh_last:0x7feffffff53c
Tailq 7: qname:<>, tqh_first:(nil), tqh_last:(nil)
Tailq 8: qname:<>, tqh_first:(nil), tqh_last:(nil)
Tailq 9: qname:<>, tqh_first:(nil), tqh_last:(nil)

Secondary :
---------
EAL: Registered tailq: RTE_LPM
EAL: Registered tailq: RTE_LPM6
EAL: Registered tailq: RTE_HASH
EAL: Registered tailq: RTE_FBK_HASH
EAL: Registered tailq: RTE_ACL
EAL: Registered tailq: RTE_REORDER
EAL: Registered tailq: VFIO_RESOURCE_LIST
EAL: Registered tailq: UIO_RESOURCE_LIST
EAL: Registered tailq: RTE_DISTRIBUTOR
EAL: Registered tailq: RTE_MEMPOOL
EAL: Registered tailq: RTE_RING
EAL: Cannot initialize tailq: RTE_LPM
Tailq 0: qname:<RTE_ACL>, tqh_first:(nil), tqh_last:0x7feffffff41c
Tailq 1: qname:<RTE_HASH>, tqh_first:0x7ff0004c62c0, tqh_last:0x7ff0004c62c0
Tailq 2: qname:<RTE_FBK_HASH>, tqh_first:(nil), tqh_last:0x7feffffff47c
Tailq 3: qname:<RTE_MEMPOOL>, tqh_first:0x7ff0004a81c0, tqh_last:0x7ff0004a8040
Tailq 4: qname:<RTE_RING>, tqh_first:0x7ff0004a8140, tqh_last:0x7ff0004c6340
Tailq 5: qname:<VFIO_RESOURCE_LIST>, tqh_first:0x7ff0005fee80, tqh_last:0x7ff00052be80
Tailq 6: qname:<UIO_RESOURCE_LIST>, tqh_first:(nil), tqh_last:0x7feffffff53c
Tailq 7: qname:<>, tqh_first:(nil), tqh_last:(nil)
Tailq 8: qname:<>, tqh_first:(nil), tqh_last:(nil)
Tailq 9: qname:<>, tqh_first:(nil), tqh_last:(nil)
[...]
PANIC in rte_eal_init():
Cannot init tail queues for objects

Obviously not happy...
Any advices ?

Jean
Jean Tourrilhes
2016-09-22 21:17:28 UTC
Permalink
lib/librte_eal/common/eal_common_tailqs.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_tailqs.c b/lib/librte_eal/common/eal_common_tailqs.c
index bb08ec8..6960d06 100644
--- a/lib/librte_eal/common/eal_common_tailqs.c
+++ b/lib/librte_eal/common/eal_common_tailqs.c
@@ -143,6 +143,8 @@ rte_eal_tailq_update(struct rte_tailq_elem *t)
t->head = rte_eal_tailq_create(t->name);
} else {
t->head = rte_eal_tailq_lookup(t->name);
+ if (t->head != NULL)
+ rte_tailqs_count++;
}
}

@@ -188,9 +190,16 @@ rte_eal_tailqs_init(void)
if (t->head == NULL) {
RTE_LOG(ERR, EAL,
"Cannot initialize tailq: %s\n", t->name);
- /* no need to TAILQ_REMOVE, we are going to panic in
- * rte_eal_init() */
- goto fail;
+ if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+ /* no need to TAILQ_REMOVE, we are going
+ * to panic in rte_eal_init() */
+ goto fail;
+ } else {
+ /* This means our list of constructor is
+ * no the same as primary. Just remove
+ * that missing tailq and continue */
+ TAILQ_REMOVE(&rte_tailq_elem_head, t, next);
+ }
}
}
Sergio Gonzalez Monroy
2016-10-04 13:11:39 UTC
Permalink
Hi Jean,

As with your other patch, commit title needs fixing and also missing
Signed-off-by line.
Post by Jean Tourrilhes
lib/librte_eal/common/eal_common_tailqs.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/lib/librte_eal/common/eal_common_tailqs.c b/lib/librte_eal/common/eal_common_tailqs.c
index bb08ec8..6960d06 100644
--- a/lib/librte_eal/common/eal_common_tailqs.c
+++ b/lib/librte_eal/common/eal_common_tailqs.c
@@ -143,6 +143,8 @@ rte_eal_tailq_update(struct rte_tailq_elem *t)
t->head = rte_eal_tailq_create(t->name);
} else {
t->head = rte_eal_tailq_lookup(t->name);
+ if (t->head != NULL)
+ rte_tailqs_count++;
}
}
@@ -188,9 +190,16 @@ rte_eal_tailqs_init(void)
if (t->head == NULL) {
RTE_LOG(ERR, EAL,
"Cannot initialize tailq: %s\n", t->name);
- /* no need to TAILQ_REMOVE, we are going to panic in
- * rte_eal_init() */
- goto fail;
+ if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+ /* no need to TAILQ_REMOVE, we are going
+ * to panic in rte_eal_init() */
+ goto fail;
+ } else {
+ /* This means our list of constructor is
+ * no the same as primary. Just remove
+ * that missing tailq and continue */
+ TAILQ_REMOVE(&rte_tailq_elem_head, t, next);
+ }
}
}
I might be missing something here so bear with me.
The case you are trying to fix is, as an example, when your secondary
app is using LPM but your primary is not.
So basically with this patch, you are removing the tailq for LPM on
secondary and continuing as normal, is that the case?

I am not convinced about this approach.
I guess the assumption here is that all the TAILQ infrastructure works
even when the tailq list itself is NULL?
I say assumption because I don't have an easy way to trigger this use
case with default DPDK sample/test apps.

What about letting the secondary process create a tailq if it doesn't
exists?
We would need to lock protect at least the register function to avoid
race conditions when having multiple secondary processes.

David, what do you think?

Sergio
Jean Tourrilhes
2016-10-04 16:59:30 UTC
Permalink
Post by Sergio Gonzalez Monroy
Hi Jean,
As with your other patch, commit title needs fixing and also missing
Signed-off-by line.
I'll do that, no worries...
Post by Sergio Gonzalez Monroy
I might be missing something here so bear with me.
Yes, I know I was terse. Sorry.
Post by Sergio Gonzalez Monroy
The case you are trying to fix is, as an example, when your secondary app is
using LPM but your primary is not.
So basically with this patch, you are removing the tailq for LPM on
secondary and continuing as normal, is that the case?
The secondary can't use tailq types that the primary does not
have, because they are shared across the shared memory.

What happens is that the primary and secondary did not compile
in the same list of tailq. See previous e-mail :
http://dpdk.org/ml/archives/dev/2016-September/047329.html
The reason it's happening is that the secondary was not
compiled with the DPDK build system, but with the build system of the
application (in this case, Snort). Oubviously, porting the application
to the DPDK build system is not practical, so we need to live with
this case.
The build system of the application does not have all the
subtelties of the DPDK build system, and ends up including *all* the
constructors, wether they are used or not in the code. Moreover, they
are included in a different order. Actually, by default the builds
include no constructors at all (which is a big fail), so the library
needs to be included with --whole-archive (see Snort DPDK
instructions).
Post by Sergio Gonzalez Monroy
I am not convinced about this approach.
I agree that the whole constructor approach is flaky and my
patch is only a band aid. Constructors should be entirely removed
IMHO, and a more deterministic init method should be used instead of
depending on linker magic.
Note that the other constructors happen to work right in my
case, but that's probably pure luck. The list of mempool constructors
happen to be the same and in the same order (order matters for mempool
constructors). The app is not using spinlock, hash, crc and acl, so
I did not look if the lists did match.
Post by Sergio Gonzalez Monroy
I guess the assumption here is that all the TAILQ infrastructure works even
when the tailq list itself is NULL?
If tailq are not used in the primary, I assume it would work.
Post by Sergio Gonzalez Monroy
I say assumption because I don't have an easy way to trigger this use case
with default DPDK sample/test apps.
I know. I'll see what I can do to release the code.
Post by Sergio Gonzalez Monroy
What about letting the secondary process create a tailq if it doesn't
exists?
Primary owns the shared memory, and I don't see how primary
could handle an unknown tailq. Secondary can't do much without
primary. So, I don't see how adding the missing tailqs helps.
Post by Sergio Gonzalez Monroy
We would need to lock protect at least the register function to avoid race
conditions when having multiple secondary processes.
David, what do you think?
Sergio
Regards,

Jean
David Marchand
2016-10-05 07:58:01 UTC
Permalink
Hello,
Post by Jean Tourrilhes
Post by Sergio Gonzalez Monroy
The case you are trying to fix is, as an example, when your secondary app is
using LPM but your primary is not.
So basically with this patch, you are removing the tailq for LPM on
secondary and continuing as normal, is that the case?
The secondary can't use tailq types that the primary does not
have, because they are shared across the shared memory.
I am not a "multi process" user but afaik the primary process is
responsible for filling the shared memory.
The secondary processes look at it.
So having unaligned processes can't work.
Post by Jean Tourrilhes
What happens is that the primary and secondary did not compile
http://dpdk.org/ml/archives/dev/2016-September/047329.html
The reason it's happening is that the secondary was not
compiled with the DPDK build system, but with the build system of the
application (in this case, Snort). Oubviously, porting the application
to the DPDK build system is not practical, so we need to live with
this case.
The build system of the application does not have all the
subtelties of the DPDK build system, and ends up including *all* the
constructors, wether they are used or not in the code. Moreover, they
are included in a different order. Actually, by default the builds
include no constructors at all (which is a big fail), so the library
needs to be included with --whole-archive (see Snort DPDK
instructions).
I thought you had unaligned binaries.
You are compiling only one binary ?
Post by Jean Tourrilhes
Post by Sergio Gonzalez Monroy
I am not convinced about this approach.
I agree that the whole constructor approach is flaky and my
patch is only a band aid. Constructors should be entirely removed
IMHO, and a more deterministic init method should be used instead of
depending on linker magic.
Note that the other constructors happen to work right in my
case, but that's probably pure luck. The list of mempool constructors
happen to be the same and in the same order (order matters for mempool
constructors). The app is not using spinlock, hash, crc and acl, so
I did not look if the lists did match.
I am not sure Sergio is talking about the constructor approach.

Anyway, the constructors invocation order should not matter.
Primary and secondary processes build their local tailq entries list
in constructors (so far, I can't see how this is wrong).
"Later", each process updates this list with the actual pointer to the
lists by looking at the shared memory in rte_eal_init (calling
rte_eal_tailqs_init).

What matters is that secondary tailqs are a subset of the primary tailqs.


I still have some trouble understanding what you are trying to do.
As Sergio asked, can you come up with a simplified example/use case ?

Thanks.
--
David Marchand
Jean Tourrilhes
2016-10-05 16:49:06 UTC
Permalink
Post by David Marchand
Hello,
Hi there,
Post by David Marchand
I thought you had unaligned binaries.
You are compiling only one binary ?
Primary is compiled using the DPDK build process.
Secondary is build using the Snort build process.
Both are pointing to the exact same libdpdk.a.
Post by David Marchand
I am not sure Sergio is talking about the constructor approach.
But, this is exactly the cause of the problem.
Post by David Marchand
Anyway, the constructors invocation order should not matter.
For tailq, I agree. For mempool constructors, order do matter.
Post by David Marchand
Primary and secondary processes build their local tailq entries list
in constructors (so far, I can't see how this is wrong).
"Later", each process updates this list with the actual pointer to the
lists by looking at the shared memory in rte_eal_init (calling
rte_eal_tailqs_init).
What matters is that secondary tailqs are a subset of the primary tailqs.
Which is not the case for me, I have secondary including all
tailqs, and primary only having a subset.
Check here :
http://dpdk.org/ml/archives/dev/2016-September/047329.html
Post by David Marchand
I still have some trouble understanding what you are trying to do.
Having things work ;-)
Post by David Marchand
As Sergio asked, can you come up with a simplified example/use case ?
Not trivial. I'll see what I can do.
Post by David Marchand
Thanks.
--
David Marchand
Regards,

Jean
Thomas Monjalon
2016-10-05 17:09:14 UTC
Permalink
Post by Jean Tourrilhes
Post by David Marchand
I thought you had unaligned binaries.
You are compiling only one binary ?
Primary is compiled using the DPDK build process.
Secondary is build using the Snort build process.
Both are pointing to the exact same libdpdk.a.
Probably that you would have some aligned builds if Snort was using
a pkg-config approach to link DPDK.
I cannot commit but I would like to generate some pkg-config files
in the DPDK build system to ease linking from external applications.
Post by Jean Tourrilhes
Post by David Marchand
I am not sure Sergio is talking about the constructor approach.
But, this is exactly the cause of the problem.
Post by David Marchand
Anyway, the constructors invocation order should not matter.
For tailq, I agree. For mempool constructors, order do matter.
I don't know why such a complex function (rte_mempool_register_ops) is
called inside a constructor. Maybe that's the main problem.
Jean Tourrilhes
2016-10-05 17:34:55 UTC
Permalink
Post by Thomas Monjalon
Probably that you would have some aligned builds if Snort was using
a pkg-config approach to link DPDK.
I seriously doubt it, but maybe there is some deep linker
magic that would pick the appropriate set of constructor.
Post by Thomas Monjalon
Post by Jean Tourrilhes
For tailq, I agree. For mempool constructors, order do matter.
I don't know why such a complex function (rte_mempool_register_ops) is
called inside a constructor. Maybe that's the main problem.
No. The problem is that the list of constructors linked by the
linker in each binary is different, whereas DPDK expect them to be the
same.
Regards,

Jean
Jean Tourrilhes
2016-10-05 17:47:34 UTC
Permalink
If the primary and secondary process were build using different build
systems, the list of constructors included by the linker in each
binary might be different. Tailqs are registered via constructors, so
the linker magic will directly impact which tailqs are registered with
the primary and the secondary.

DPDK currently assumes that the secondary has a subset of the tailqs
registered at the primary. In some build scenario, the secondary might
register a tailq that the primary did not register. In this case,
instead of exiting with a panic, just unregister the offending tailq
and allow the secondary to run.

Signed-off-by: Jean Tourrilhes <***@labs.hpe.com>
---
lib/librte_eal/common/eal_common_tailqs.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_tailqs.c b/lib/librte_eal/common/eal_common_tailqs.c
index bb08ec8..cf5a771 100644
--- a/lib/librte_eal/common/eal_common_tailqs.c
+++ b/lib/librte_eal/common/eal_common_tailqs.c
@@ -143,6 +143,8 @@ rte_eal_tailq_update(struct rte_tailq_elem *t)
t->head = rte_eal_tailq_create(t->name);
} else {
t->head = rte_eal_tailq_lookup(t->name);
+ if (t->head != NULL)
+ rte_tailqs_count++;
}
}

@@ -178,19 +180,27 @@ int
rte_eal_tailqs_init(void)
{
struct rte_tailq_elem *t;
+ void *tmp_te;

rte_tailqs_count = 0;

- TAILQ_FOREACH(t, &rte_tailq_elem_head, next) {
+ TAILQ_FOREACH_SAFE(t, &rte_tailq_elem_head, next, tmp_te) {
/* second part of register job for "early" tailqs, see
* rte_eal_tailq_register and EAL_REGISTER_TAILQ */
rte_eal_tailq_update(t);
if (t->head == NULL) {
RTE_LOG(ERR, EAL,
"Cannot initialize tailq: %s\n", t->name);
- /* no need to TAILQ_REMOVE, we are going to panic in
- * rte_eal_init() */
- goto fail;
+ if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+ /* no need to TAILQ_REMOVE, we are going
+ * to panic in rte_eal_init() */
+ goto fail;
+ } else {
+ /* This means our list of constructor is
+ * no the same as primary. Just remove
+ * that missing tailq and continue */
+ TAILQ_REMOVE(&rte_tailq_elem_head, t, next);
+ }
}
}

Loading...