Discussion:
[dpdk-dev] [RFC 0/3] tqs: add thread quiescent state library
(too old to reply)
Honnappa Nagarahalli
2018-11-22 03:30:52 UTC
Permalink
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).

In the following paras, the term 'memory' refers to memory allocated
by typical APIs like malloc or anything that is representative of
memory, for ex: an index of a free element array.

Since these data structures are lock less, the writers and readers
are accessing the data structures simultaneously. Hence, while removing
an element from a data structure, the writers cannot return the memory
to the allocator, without knowing that the readers are not
referencing that element/memory anymore. Hence, it is required to
separate the operation of removing an element into 2 steps:

Delete: in this step, the writer removes the element from the
data structure but does not return the associated memory to the allocator.
This will ensure that new readers will not get a reference to the removed
element. Removing the reference is an atomic operation.

Free: in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the removed element.

This library helps the writer determine when it is safe to free the
memory.

This library makes use of Thread Quiescent State (TQS). TQS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is upto the application to
determine its quiescent state. Let us consider the following diagram:

Time -------------------------------------------------->

| |
RT1 $++++****D1****+++***D2*|**+++|+++**D3*****++++$
| |
RT2 $++++****D1****++|+**D2|***++++++**D3*****++++$
| |
RT3 $++++****D1****+++***|D2***|++++++**D2*****++++$
| |
|<--->|
Del | Free
|
Cannot free memory
during this period

RTx - Reader thread
< and > - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
atomic operation.
Free - Point in time when the writer can free the entry.

As shown thread RT1 acesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot return the memory associated with that element to the
allocator. The writer can return the memory to the allocator only after
the reader stops referencng D2. In other words, reader thread RT1
has to enter a quiescent state.

Similarly, since thread RT3 is also accessing D2, writer has to wait till
RT3 enters quiescent state as well.

However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.

It can be noted that, the critical sections for D2 and D3 are quiescent states
for D1. i.e. for a given data structure Dx, any point in the thread execution
that does not reference Dx is a quiescent state.

For DPDK applications, the start and end of while(1) loop (where no shared
data structures are getting accessed) act as perfect quiescent states. This
will combine all the shared data structure accesses into a single critical
section and keeps the over head introduced by this library to the minimum.

However, the length of the critical section and the number of reader threads
is proportional to the time taken to identify the end of critical section.
So, if the application desires, it should be possible to identify the end
of critical section for each data structure.

To provide the required flexibility, this library has a concept of TQS
variable. The application can create one or more TQS variables to help it
track the end of one or more critical sections.

The application can create a TQS variable using the API rte_tqs_alloc.
It takes a mask of lcore IDs that will report their quiescent states
using this variable. This mask can be empty to start with.

rte_tqs_register_lcore API will register a reader thread to report its
quiescent state. This can be called from any control plane thread or from
the reader thread. The application can create a TQS variable with no reader
threads and add the threads dynamically using this API.

The application can trigger the reader threads to report their quiescent
state status by calling the API rte_tqs_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_tqs_start returns a token to each caller.

The application has to call rte_tqs_check API with the token to get the
current status. Option to block till all the threads enter the quiescent
state is provided. If this API indicates that all the threads have entered
the quiescent state, the application can free the deleted entry.

The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of waiting for the
reader threads to enter the quiescent state.

rte_tqs_unregister_lcore API will remove a reader thread from reporting its
quiescent state using a TQS variable. The rte_tqs_check API will not wait
for this reader thread to report the quiescent state status anymore.

Finally, a TQS variable can be deleted by calling rte_tqs_free API.
Application must make sure that the reader threads are not referencing the
TQS variable anymore before deleting it.

The reader threads should call rte_tqs_update API to indicate that they
entered a quiescent state. This API checks if a writer has triggered a
quiescent state query and update the state accordingly.

Next Steps:
1) Add more test cases
2) Convert to patch
3) Incorporate feedback from community
4) Add documentation

Dharmik Thakkar (1):
test/tqs: Add API and functional tests

Honnappa Nagarahalli (2):
log: add TQS log type
tqs: add thread quiescent state library

config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_eal/common/include/rte_log.h | 1 +
lib/librte_tqs/Makefile | 23 +
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 +++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 +
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 ++++++++++++++++++++++++
14 files changed, 1208 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map
create mode 100644 test/test/test_tqs.c
--
2.17.1
Honnappa Nagarahalli
2018-11-22 03:30:53 UTC
Permalink
Signed-off-by: Honnappa Nagarahalli <***@arm.com>
Reviewed-by: Steve Capper <***@arm.com>
Reviewed-by: Gavin Hu <***@arm.com>
---
lib/librte_eal/common/include/rte_log.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/lib/librte_eal/common/include/rte_log.h b/lib/librte_eal/common/include/rte_log.h
index 2f789cb90..b4e91a4a5 100644
--- a/lib/librte_eal/common/include/rte_log.h
+++ b/lib/librte_eal/common/include/rte_log.h
@@ -61,6 +61,7 @@ extern struct rte_logs rte_logs;
#define RTE_LOGTYPE_EFD 18 /**< Log related to EFD. */
#define RTE_LOGTYPE_EVENTDEV 19 /**< Log related to eventdev. */
#define RTE_LOGTYPE_GSO 20 /**< Log related to GSO. */
+#define RTE_LOGTYPE_TQS 21 /**< Log related to Thread Quiescent State. */

/* these log types can be used in an application */
#define RTE_LOGTYPE_USER1 24 /**< User-defined log type 1. */
--
2.17.1
Stephen Hemminger
2018-11-27 22:24:15 UTC
Permalink
On Wed, 21 Nov 2018 21:30:53 -0600
Post by Honnappa Nagarahalli
---
lib/librte_eal/common/include/rte_log.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/lib/librte_eal/common/include/rte_log.h b/lib/librte_eal/common/include/rte_log.h
index 2f789cb90..b4e91a4a5 100644
--- a/lib/librte_eal/common/include/rte_log.h
+++ b/lib/librte_eal/common/include/rte_log.h
@@ -61,6 +61,7 @@ extern struct rte_logs rte_logs;
#define RTE_LOGTYPE_EFD 18 /**< Log related to EFD. */
#define RTE_LOGTYPE_EVENTDEV 19 /**< Log related to eventdev. */
#define RTE_LOGTYPE_GSO 20 /**< Log related to GSO. */
+#define RTE_LOGTYPE_TQS 21 /**< Log related to Thread Quiescent State. */
/* these log types can be used in an application */
#define RTE_LOGTYPE_USER1 24 /**< User-defined log type 1. */
Sorry, I don't think this is the right way now.

All new logging should be using dynamic log types.
We should work on getting rid of others (EFD, EVENTDEV, GSO).
Honnappa Nagarahalli
2018-11-28 05:58:58 UTC
Permalink
Post by Stephen Hemminger
On Wed, 21 Nov 2018 21:30:53 -0600
Post by Honnappa Nagarahalli
---
lib/librte_eal/common/include/rte_log.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/lib/librte_eal/common/include/rte_log.h
b/lib/librte_eal/common/include/rte_log.h
Post by Honnappa Nagarahalli
index 2f789cb90..b4e91a4a5 100644
--- a/lib/librte_eal/common/include/rte_log.h
+++ b/lib/librte_eal/common/include/rte_log.h
@@ -61,6 +61,7 @@ extern struct rte_logs rte_logs;
#define RTE_LOGTYPE_EFD 18 /**< Log related to EFD. */
#define RTE_LOGTYPE_EVENTDEV 19 /**< Log related to eventdev. */
#define RTE_LOGTYPE_GSO 20 /**< Log related to GSO. */
+#define RTE_LOGTYPE_TQS 21 /**< Log related to Thread Quiescent State.
*/
Post by Honnappa Nagarahalli
/* these log types can be used in an application */
#define RTE_LOGTYPE_USER1 24 /**< User-defined log type 1. */
Sorry, I don't think this is the right way now.
Ok. I see some examples for the libraries already. I will change it in next version.
Post by Stephen Hemminger
All new logging should be using dynamic log types.
We should work on getting rid of others (EFD, EVENTDEV, GSO).
Honnappa Nagarahalli
2018-11-22 03:30:54 UTC
Permalink
Add Thread Quiescent State (TQS) library. This library helps identify
the quiescent state of the reader threads so that the writers
can free the memory associated with the lock less data structures.

Signed-off-by: Honnappa Nagarahalli <***@arm.com>
Reviewed-by: Steve Capper <***@arm.com>
Reviewed-by: Gavin Hu <***@arm.com>
---
config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_tqs/Makefile | 23 ++
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 ++++++++++++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 ++
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
9 files changed, 655 insertions(+), 1 deletion(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map

diff --git a/config/common_base b/config/common_base
index d12ae98bc..af40a9f81 100644
--- a/config/common_base
+++ b/config/common_base
@@ -792,6 +792,12 @@ CONFIG_RTE_LIBRTE_LATENCY_STATS=y
#
CONFIG_RTE_LIBRTE_TELEMETRY=n

+#
+# Compile librte_tqs
+#
+CONFIG_RTE_LIBRTE_TQS=y
+CONFIG_RTE_LIBRTE_TQS_DEBUG=n
+
#
# Compile librte_lpm
#
diff --git a/lib/Makefile b/lib/Makefile
index b7370ef97..7095eac88 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -108,6 +108,8 @@ DIRS-$(CONFIG_RTE_LIBRTE_BPF) += librte_bpf
DEPDIRS-librte_bpf := librte_eal librte_mempool librte_mbuf librte_ethdev
DIRS-$(CONFIG_RTE_LIBRTE_TELEMETRY) += librte_telemetry
DEPDIRS-librte_telemetry := librte_eal librte_metrics librte_ethdev
+DIRS-$(CONFIG_RTE_LIBRTE_TQS) += librte_tqs
+DEPDIRS-librte_tqs := librte_eal

ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni
diff --git a/lib/librte_tqs/Makefile b/lib/librte_tqs/Makefile
new file mode 100644
index 000000000..059de53e2
--- /dev/null
+++ b/lib/librte_tqs/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Arm Limited
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# library name
+LIB = librte_tqs.a
+
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+LDLIBS += -lrte_eal
+
+EXPORT_MAP := rte_tqs_version.map
+
+LIBABIVER := 1
+
+# all source are stored in SRCS-y
+SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_tqs.c
+
+# install includes
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_tqs.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_tqs/meson.build b/lib/librte_tqs/meson.build
new file mode 100644
index 000000000..dd696ab07
--- /dev/null
+++ b/lib/librte_tqs/meson.build
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Arm Limited
+
+sources = files('rte_tqs.c')
+headers = files('rte_tqs.h')
diff --git a/lib/librte_tqs/rte_tqs.c b/lib/librte_tqs/rte_tqs.c
new file mode 100644
index 000000000..cbc36864e
--- /dev/null
+++ b/lib/librte_tqs/rte_tqs.c
@@ -0,0 +1,249 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2018 Arm Limited
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <errno.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_malloc.h>
+#include <rte_eal.h>
+#include <rte_eal_memconfig.h>
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_errno.h>
+
+#include "rte_tqs.h"
+
+TAILQ_HEAD(rte_tqs_list, rte_tailq_entry);
+
+static struct rte_tailq_elem rte_tqs_tailq = {
+ .name = RTE_TAILQ_TQS_NAME,
+};
+EAL_REGISTER_TAILQ(rte_tqs_tailq)
+
+/* Allocate a new TQS variable with the name *name* in memory. */
+struct rte_tqs * __rte_experimental
+rte_tqs_alloc(const char *name, int socket_id, uint64_t lcore_mask)
+{
+ char tqs_name[RTE_TQS_NAMESIZE];
+ struct rte_tailq_entry *te, *tmp_te;
+ struct rte_tqs_list *tqs_list;
+ struct rte_tqs *v, *tmp_v;
+ int ret;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
+ if (te == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
+ rte_errno = -ENOMEM;
+ return NULL;
+ }
+
+ snprintf(tqs_name, sizeof(tqs_name), "%s", name);
+ v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (v == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS variable\n");
+ rte_errno = -ENOMEM;
+ goto alloc_error;
+ }
+
+ ret = snprintf(v->name, sizeof(v->name), "%s", name);
+ if (ret < 0 || ret >= (int)sizeof(v->name)) {
+ rte_errno = -ENAMETOOLONG;
+ goto alloc_error;
+ }
+
+ te->data = (void *) v;
+ v->lcore_mask = lcore_mask;
+
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ /* Search if a TQS variable with the same name exists already */
+ TAILQ_FOREACH(tmp_te, tqs_list, next) {
+ tmp_v = (struct rte_tqs *) tmp_te->data;
+ if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (tmp_te != NULL) {
+ rte_errno = -EEXIST;
+ goto tqs_exist;
+ }
+
+ TAILQ_INSERT_TAIL(tqs_list, te, next);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+
+tqs_exist:
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+alloc_error:
+ rte_free(te);
+ rte_free(v);
+ return NULL;
+}
+
+/* De-allocate all the memory used by a TQS variable. */
+void __rte_experimental
+rte_tqs_free(struct rte_tqs *v)
+{
+ struct rte_tqs_list *tqs_list;
+ struct rte_tailq_entry *te;
+
+ /* Search for the TQS variable in tailq */
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ TAILQ_FOREACH(te, tqs_list, next) {
+ if (te->data == (void *) v)
+ break;
+ }
+
+ if (te != NULL)
+ TAILQ_REMOVE(tqs_list, te, next);
+ else
+ RTE_LOG(ERR, TQS, "TQS variable %s not found\n", v->name);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ rte_free(te);
+ rte_free(v);
+}
+
+/* Add a reader thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)
+{
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
+ -EINVAL);
+
+ /* Worker thread has to count the quiescent states
+ * only from the current value of token.
+ */
+ v->w[lcore_id].cnt = v->token;
+
+ /* Release the store to initial TQS count so that workers
+ * can use it immediately after this function returns.
+ */
+ __atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id), __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/* Remove a reader thread, running on an lcore, from the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_unregister_lcore(struct rte_tqs *v, unsigned int lcore_id)
+{
+ TQS_RETURN_IF_TRUE((v == NULL ||
+ lcore_id >= RTE_TQS_MAX_LCORE), -EINVAL);
+
+ /* This can be a relaxed store. Since this is an API, make sure
+ * the store is not reordered with other memory operations.
+ */
+ __atomic_fetch_and(&v->lcore_mask,
+ ~(1UL << lcore_id), __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/* Search a TQS variable, given its name. */
+struct rte_tqs * __rte_experimental
+rte_tqs_lookup(const char *name)
+{
+ struct rte_tqs_list *tqs_list;
+ struct rte_tailq_entry *te;
+ struct rte_tqs *v;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ v = NULL;
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ rte_rwlock_read_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ /* find out tailq entry */
+ TAILQ_FOREACH(te, tqs_list, next) {
+ v = (struct rte_tqs *) te->data;
+ if (strncmp(name, v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (te == NULL) {
+ rte_errno = -ENOENT;
+ v = NULL;
+ }
+
+ rte_rwlock_read_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+}
+
+/* Dump the details of a single TQS variables to a file. */
+void __rte_experimental
+rte_tqs_dump(FILE *f, struct rte_tqs *v)
+{
+ uint64_t tmp_mask;
+ uint32_t i;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || f == NULL);
+
+ fprintf(f, "\nTQS <%s>@%p\n", v->name, v);
+ fprintf(f, " lcore mask = 0x%lx\n", v->lcore_mask);
+ fprintf(f, " token = %u\n", v->token);
+
+ if (v->lcore_mask != 0) {
+ fprintf(f, "Quiescent State Counts for readers:\n");
+ tmp_mask = v->lcore_mask;
+ while (tmp_mask) {
+ i = __builtin_ctz(tmp_mask);
+ fprintf(f, "lcore # = %d, count = %u\n",
+ i, v->w[i].cnt);
+ tmp_mask &= ~(1UL << i);
+ }
+ }
+}
+
+/* Dump the details of all the TQS variables to a file. */
+void __rte_experimental
+rte_tqs_list_dump(FILE *f)
+{
+ const struct rte_tailq_entry *te;
+ struct rte_tqs_list *tqs_list;
+
+ TQS_ERR_LOG_IF_TRUE(f == NULL);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ rte_rwlock_read_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ TAILQ_FOREACH(te, tqs_list, next) {
+ rte_tqs_dump(f, (struct rte_tqs *) te->data);
+ }
+
+ rte_rwlock_read_unlock(RTE_EAL_TAILQ_RWLOCK);
+}
diff --git a/lib/librte_tqs/rte_tqs.h b/lib/librte_tqs/rte_tqs.h
new file mode 100644
index 000000000..9136418d2
--- /dev/null
+++ b/lib/librte_tqs/rte_tqs.h
@@ -0,0 +1,352 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2018 Arm Limited
+ */
+
+#ifndef _RTE_TQS_H_
+#define _RTE_TQS_H_
+
+/**
+ * @file
+ * RTE Thread Quiescent State
+ *
+ * Thread Quiescent State (TQS) is any point in the thread execution
+ * where the thread does not hold a reference to shared memory, i.e.
+ * a non-critical section. A critical section for a data structure can
+ * be a quiescent state for another data structure.
+ *
+ * An application can identify the quiescent state according to its
+ * needs. It can identify 1 quiescent state for each data structure or
+ * 1 quiescent state for a group/all of data structures.
+ *
+ * This library provides the flexibility for these use cases.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+
+#define RTE_TAILQ_TQS_NAME "RTE_TQS"
+
+/**< The maximum length of a TQS variable name. */
+#define RTE_TQS_NAMESIZE 32
+
+/**< Maximum number of lcores supported. */
+#if (RTE_MAX_LCORE > 64)
+#define RTE_TQS_MAX_LCORE 64
+#else
+#define RTE_TQS_MAX_LCORE RTE_MAX_LCORE
+#endif
+
+/* Macro for run-time checking of function parameters */
+#if defined(RTE_LIBRTE_TQS_DEBUG)
+#define TQS_RETURN_IF_TRUE(cond, retval) do { \
+ if ((cond)) \
+ return retval; \
+} while (0)
+#else
+#define TQS_RETURN_IF_TRUE(cond, retval)
+#endif
+
+/* Macro to log error message */
+#define TQS_ERR_LOG_IF_TRUE(cond) do { \
+ if ((cond)) { \
+ RTE_LOG(ERR, TQS, "Invalid parameters\n"); \
+ return; \
+ } \
+} while (0)
+
+/* Worker thread counter */
+struct rte_tqs_cnt {
+ volatile uint32_t cnt; /**< Quiescent state counter. */
+} __rte_cache_aligned;
+
+/**
+ * RTE Thread Quiescent State structure.
+ */
+struct rte_tqs {
+ char name[RTE_TQS_NAMESIZE];
+ /**< Name of the ring. */
+ uint64_t lcore_mask;
+ /**< Worker lcores reporting on this TQS */
+
+ uint32_t token __rte_cache_aligned;
+ /**< Counter to allow for multiple simultaneous TQS queries */
+
+ struct rte_tqs_cnt w[RTE_TQS_MAX_LCORE] __rte_cache_aligned;
+ /**< TQS counter for each worker thread, counts upto
+ * current value of token.
+ */
+} __rte_cache_aligned;
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Allocate a new TQS variable with the name *name* in memory.
+ *
+ * The TQS variable is added in RTE_TAILQ_TQS list.
+ *
+ * @param name
+ * The name of the TQS variable.
+ * @param socket_id
+ * The *socket_id* argument is the socket identifier in case of
+ * NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
+ * constraint for the reserved zone.
+ * @param lcore_mask
+ * Data plane reader threads in this mask will report their quiescent
+ * state on this TQS variable.
+ * @return
+ * On success, the pointer to the new allocated TQS variable. NULL on
+ * error with rte_errno set appropriately. Possible errno values include:
+ * - EINVAL - invalid input parameters
+ * - ENAMETOOLONG - TQS variable name is longer than RTE_TQS_NAMESIZE
+ * - EEXIST - a TQS variable with the same name already exists
+ * - ENOMEM - no appropriate memory area found in which to create memzone
+ */
+struct rte_tqs * __rte_experimental
+rte_tqs_alloc(const char *name, int socket_id, uint64_t lcore_mask);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * De-allocate all the memory used by a TQS variable. It is the
+ * application's responsibility to make sure that no other thread
+ * is using the TQS variable.
+ *
+ * The TQS variable is removed from RTE_TAILQ_TQS list.
+ *
+ * @param v
+ * TQS variable to free.
+ */
+void __rte_experimental rte_tqs_free(struct rte_tqs *v);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add a worker thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe. This API can be called from the worker threads during
+ * initialization. Any ongoing TQS queries may wait for the
+ * status from this registered worker thread.
+ *
+ * @param v
+ * TQS variable
+ * @param lcore_id
+ * Worker thread on this lcore will report its quiescent state on
+ * this TQS variable.
+ * @return
+ * - 0 if registered successfully.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Remove a worker thread, running on an lcore, from the list of threads
+ * reporting their quiescent state on a TQS variable.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * This API can be called from the worker threads during shutdown.
+ * Any ongoing TQS queries may stop waiting for the status from this
+ * unregistered worker thread.
+ *
+ * @param v
+ * TQS variable
+ * @param lcore_id
+ * Worker thread on this lcore will stop reporting its quiescent state
+ * on this TQS variable.
+ * @return
+ * - 0 if un-registered successfully.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+int __rte_experimental
+rte_tqs_unregister_lcore(struct rte_tqs *v, unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Trigger the worker threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * @param v
+ * TQS variable
+ * @param n
+ * Expected number of times the quiescent state is entered
+ * @param t
+ * - If successful, this is the token for this call of the API.
+ * This should be passed to rte_tqs_check API.
+ * @return
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ * - 0 Otherwise and always (non-debug mode compilation).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t)
+{
+ TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
+
+ /* This store release will ensure that changes to any data
+ * structure are visible to the workers before the token
+ * update is visible.
+ */
+ *t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Update quiescent state for the worker thread on a lcore.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the worker threads registered to report their quiescent state
+ * on the TQS variable must call this API.
+ *
+ * @param v
+ * TQS variable
+ */
+static __rte_always_inline void __rte_experimental
+rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)
+{
+ uint32_t t;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >= RTE_TQS_MAX_LCORE);
+
+ /* Load the token before the worker thread loads any other
+ * (lock-free) data structure. This ensures that updates
+ * to the data structures are visible if the update
+ * to token is visible.
+ */
+ t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
+ if (v->w[lcore_id].cnt != t)
+ v->w[lcore_id].cnt++;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Checks if all the worker threads have entered the quiescent state
+ * 'n' number of times. 'n' is provided in rte_tqs_start API.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * @param v
+ * TQS variable
+ * @param t
+ * Token returned by rte_tqs_start API
+ * @param wait
+ * If true, block till all the worker threads have completed entering
+ * the quiescent state 'n' number of times
+ * @return
+ * - 0 if all worker threads have NOT passed through specified number
+ * of quiescent states.
+ * - 1 if all worker threads have passed through specified number
+ * of quiescent states.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait)
+{
+ uint64_t l;
+ uint64_t lcore_mask;
+
+ TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
+
+ do {
+ /* Load the current lcore_mask before loading the
+ * worker thread quiescent state counters.
+ */
+ lcore_mask = __atomic_load_n(&v->lcore_mask, __ATOMIC_ACQUIRE);
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
+
+ lcore_mask &= ~(1UL << l);
+ }
+
+ if (lcore_mask == 0)
+ return 1;
+
+ rte_pause();
+ } while (wait);
+
+ return 0;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Search a TQS variable, given its name.
+ *
+ * It is multi-thread safe.
+ *
+ * @param name
+ * The name of the TQS variable.
+ * @return
+ * On success, the pointer to the TQS variable. NULL on
+ * error with rte_errno set appropriately. Possible errno values include:
+ * - EINVAL - invalid input parameters.
+ * - ENOENT - entry not found.
+ */
+struct rte_tqs * __rte_experimental
+rte_tqs_lookup(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Dump the details of a single TQS variables to a file.
+ *
+ * It is NOT multi-thread safe.
+ *
+ * @param f
+ * A pointer to a file for output
+ * @param tqs
+ * TQS variable
+ */
+void __rte_experimental
+rte_tqs_dump(FILE *f, struct rte_tqs *v);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Dump the details of all the TQS variables to a file.
+ *
+ * It is multi-thread safe.
+ *
+ * @param f
+ * A pointer to a file for output
+ */
+void __rte_experimental
+rte_tqs_list_dump(FILE *f);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_TQS_H_ */
diff --git a/lib/librte_tqs/rte_tqs_version.map b/lib/librte_tqs/rte_tqs_version.map
new file mode 100644
index 000000000..2e4d5c094
--- /dev/null
+++ b/lib/librte_tqs/rte_tqs_version.map
@@ -0,0 +1,16 @@
+EXPERIMENTAL {
+ global:
+
+ rte_tqs_alloc;
+ rte_tqs_free;
+ rte_tqs_register_lcore;
+ rte_tqs_unregister_lcore;
+ rte_tqs_start;
+ rte_tqs_update;
+ rte_tqs_check;
+ rte_tqs_lookup;
+ rte_tqs_list_dump;
+ rte_tqs_dump;
+
+ local: *;
+};
diff --git a/lib/meson.build b/lib/meson.build
index bb7f443f9..ee19c483e 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -21,7 +21,7 @@ libraries = [ 'compat', # just a header, used for versioning
'gro', 'gso', 'ip_frag', 'jobstats',
'kni', 'latencystats', 'lpm', 'member',
'meter', 'power', 'pdump', 'rawdev',
- 'reorder', 'sched', 'security', 'vhost',
+ 'reorder', 'sched', 'security', 'tqs', 'vhost',
# add pkt framework libs which use other libs from above
'port', 'table', 'pipeline',
# flow_classify lib depends on pkt framework table lib
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3ebc4e64c..6e19e669a 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -92,6 +92,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_EAL) += -lrte_eal
_LDLIBS-$(CONFIG_RTE_LIBRTE_CMDLINE) += -lrte_cmdline
_LDLIBS-$(CONFIG_RTE_LIBRTE_REORDER) += -lrte_reorder
_LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED) += -lrte_sched
+_LDLIBS-$(CONFIG_RTE_LIBRTE_TQS) += -lrte_tqs

ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
_LDLIBS-$(CONFIG_RTE_LIBRTE_KNI) += -lrte_kni
--
2.17.1
Ananyev, Konstantin
2018-11-24 12:18:00 UTC
Permalink
Hi Honnappa,
Post by Honnappa Nagarahalli
+
+/* Allocate a new TQS variable with the name *name* in memory. */
+struct rte_tqs * __rte_experimental
+rte_tqs_alloc(const char *name, int socket_id, uint64_t lcore_mask)
+{
+ char tqs_name[RTE_TQS_NAMESIZE];
+ struct rte_tailq_entry *te, *tmp_te;
+ struct rte_tqs_list *tqs_list;
+ struct rte_tqs *v, *tmp_v;
+ int ret;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
+ if (te == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
+ rte_errno = -ENOMEM;
+ return NULL;
+ }
+
+ snprintf(tqs_name, sizeof(tqs_name), "%s", name);
+ v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (v == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS variable\n");
+ rte_errno = -ENOMEM;
+ goto alloc_error;
+ }
+
+ ret = snprintf(v->name, sizeof(v->name), "%s", name);
+ if (ret < 0 || ret >= (int)sizeof(v->name)) {
+ rte_errno = -ENAMETOOLONG;
+ goto alloc_error;
+ }
+
+ te->data = (void *) v;
+ v->lcore_mask = lcore_mask;
+
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ /* Search if a TQS variable with the same name exists already */
+ TAILQ_FOREACH(tmp_te, tqs_list, next) {
+ tmp_v = (struct rte_tqs *) tmp_te->data;
+ if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (tmp_te != NULL) {
+ rte_errno = -EEXIST;
+ goto tqs_exist;
+ }
+
+ TAILQ_INSERT_TAIL(tqs_list, te, next);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ rte_free(te);
+ rte_free(v);
+ return NULL;
+}
That seems quite heavy-weight function just to allocate sync variable.
As size of struct rte_tqs is constant and known to the user, might be better
just provide rte_tqs_init(struct rte_tqs *tqs, ...) and let user allocate/free
memory for it by himself.
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id)
+{
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >= RTE_TQS_MAX_LCORE),
+ -EINVAL);
It is not very good practice to make function return different values and behave
in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to check input parameters.
For fast-path (functions in .h) we sometimes skip such checking,
but debug mode can probably use RTE_ASSERT() or so.


lcore_id >= RTE_TQS_MAX_LCORE

Is this limitation really necessary?
First it means that only lcores can use that API (at least data-path part), second
even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of requiring lcore_id as
input parameter, you'll just make it return index of next available entry in w[].
Then tqs_update() can take that index as input parameter.
Post by Honnappa Nagarahalli
+
+ /* Worker thread has to count the quiescent states
+ * only from the current value of token.
+ */
+ v->w[lcore_id].cnt = v->token;
Wonder what would happen, if new reader will
call register(), after writer calls start()?
Looks like a race-condition.
Or such pattern is not supported?
Post by Honnappa Nagarahalli
+
+ /* Release the store to initial TQS count so that workers
+ * can use it immediately after this function returns.
+ */
+ __atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id), __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Trigger the worker threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Expected number of times the quiescent state is entered
+ * - If successful, this is the token for this call of the API.
+ * This should be passed to rte_tqs_check API.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ * - 0 Otherwise and always (non-debug mode compilation).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t)
+{
+ TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
+
+ /* This store release will ensure that changes to any data
+ * structure are visible to the workers before the token
+ * update is visible.
+ */
+ *t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Update quiescent state for the worker thread on a lcore.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the worker threads registered to report their quiescent state
+ * on the TQS variable must call this API.
+ *
+ * TQS variable
+ */
+static __rte_always_inline void __rte_experimental
+rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id)
+{
+ uint32_t t;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >= RTE_TQS_MAX_LCORE);
+
+ /* Load the token before the worker thread loads any other
+ * (lock-free) data structure. This ensures that updates
+ * to the data structures are visible if the update
+ * to token is visible.
+ */
+ t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
Hmm, I am not very familiar with C11 model, but it looks like a race
condition to me:
as I understand, update() supposed be called at the end of reader's
critical section, correct?
But ACQUIRE is only a hoist barrier, which means compiler and cpu
are free to move earlier reads (and writes) after it.
It probably needs to be a full ACQ_REL here.
Post by Honnappa Nagarahalli
+ if (v->w[lcore_id].cnt != t)
+ v->w[lcore_id].cnt++;
+}
+
+/**
+ *
+ * Checks if all the worker threads have entered the quiescent state
+ * 'n' number of times. 'n' is provided in rte_tqs_start API.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Token returned by rte_tqs_start API
+ * If true, block till all the worker threads have completed entering
+ * the quiescent state 'n' number of times
+ * - 0 if all worker threads have NOT passed through specified number
+ * of quiescent states.
+ * - 1 if all worker threads have passed through specified number
+ * of quiescent states.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait)
+{
+ uint64_t l;
+ uint64_t lcore_mask;
+
+ TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
+
+ do {
+ /* Load the current lcore_mask before loading the
+ * worker thread quiescent state counters.
+ */
+ lcore_mask = __atomic_load_n(&v->lcore_mask, __ATOMIC_ACQUIRE);
What would happen if reader will call unregister() simultaneously
with check() and will update lcore_mask straight after that load?
As I understand check() might hang in such case.
Post by Honnappa Nagarahalli
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
As I understand, that makes control-path function progress dependent
on simultaneous invocation of data-path functions.
In some cases that might cause control-path to hang.
Let say if data-path function wouldn't be called, or user invokes
control-path and data-path functions from the same thread.
Post by Honnappa Nagarahalli
+
+ lcore_mask &= ~(1UL << l);
+ }
+
+ if (lcore_mask == 0)
+ return 1;
+
+ rte_pause();
+ } while (wait);
+
+ return 0;
+}
+
Honnappa Nagarahalli
2018-11-27 21:32:04 UTC
Permalink
Post by Ananyev, Konstantin
Hi Honnappa,
Thank you for reviewing the patch, appreciate your comments.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+/* Allocate a new TQS variable with the name *name* in memory. */
+struct rte_tqs * __rte_experimental rte_tqs_alloc(const char *name,
+int socket_id, uint64_t lcore_mask) {
+ char tqs_name[RTE_TQS_NAMESIZE];
+ struct rte_tailq_entry *te, *tmp_te;
+ struct rte_tqs_list *tqs_list;
+ struct rte_tqs *v, *tmp_v;
+ int ret;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
+ if (te == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
+ rte_errno = -ENOMEM;
+ return NULL;
+ }
+
+ snprintf(tqs_name, sizeof(tqs_name), "%s", name);
+ v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (v == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS
variable\n");
Post by Honnappa Nagarahalli
+ rte_errno = -ENOMEM;
+ goto alloc_error;
+ }
+
+ ret = snprintf(v->name, sizeof(v->name), "%s", name);
+ if (ret < 0 || ret >= (int)sizeof(v->name)) {
+ rte_errno = -ENAMETOOLONG;
+ goto alloc_error;
+ }
+
+ te->data = (void *) v;
+ v->lcore_mask = lcore_mask;
+
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ /* Search if a TQS variable with the same name exists already */
+ TAILQ_FOREACH(tmp_te, tqs_list, next) {
+ tmp_v = (struct rte_tqs *) tmp_te->data;
+ if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (tmp_te != NULL) {
+ rte_errno = -EEXIST;
+ goto tqs_exist;
+ }
+
+ TAILQ_INSERT_TAIL(tqs_list, te, next);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ rte_free(te);
+ rte_free(v);
+ return NULL;
+}
That seems quite heavy-weight function just to allocate sync variable.
As size of struct rte_tqs is constant and known to the user, might be better just
provide rte_tqs_init(struct rte_tqs *tqs, ...) and let user allocate/free memory
for it by himself.
I believe, when you say heavy-weight, you are referring to adding tqs variable to the TAILQ and allocating the memory for it. Agree. I also do not expect that there are a whole lot of tqs variables used in an application. Even in rte_tqs_free, there is similar overhead.

The extra part is due to the way the TQS variable will get identified by data plane threads. I am thinking that a data plane thread will use the rte_tqs_lookup API to identify a TQS variable. However, it is possible to share this with data plane threads via a simple shared structure as well.

Along with not allocating the memory, are you suggesting that we could skip maintaining a list of TQS variables in the TAILQ? This will remove rte_tqs_lookup, rte_tqs_free, rte_tqs_list_dump APIs. I am fine with this approach.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE),
Post by Honnappa Nagarahalli
+ -EINVAL);
It is not very good practice to make function return different values and behave
in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to check input parameters.
For fast-path (functions in .h) we sometimes skip such checking, but debug
mode can probably use RTE_ASSERT() or so.
Makes sense, I will change this in the next version.
Post by Ananyev, Konstantin
lcore_id >= RTE_TQS_MAX_LCORE
Is this limitation really necessary?
I added this limitation because currently DPDK application cannot take a mask more than 64bit wide. Otherwise, this should be as big as RTE_MAX_LCORE.
I see that in the case of '-lcores' option, the number of lcores can be more than the number of PEs. In this case, we still need a MAX limit (but can be bigger than 64).
Post by Ananyev, Konstantin
First it means that only lcores can use that API (at least data-path part), second
even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of requiring lcore_id as
input parameter, you'll just make it return index of next available entry in w[].
Then tqs_update() can take that index as input parameter.
I had thought about a similar approach based on IDs. I was concerned that ID will be one more thing to manage for the application. But, I see the limitations of the current approach now. I will change it to allocation based. This will support even non-EAL pthreads as well.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ /* Worker thread has to count the quiescent states
+ * only from the current value of token.
+ */
+ v->w[lcore_id].cnt = v->token;
Wonder what would happen, if new reader will call register(), after writer calls
start()?
Looks like a race-condition.
Or such pattern is not supported?
The start should be called only after the reference to the entry in the data structure is 'deleted'. Hence the new reader will not get the reference to the deleted entry and does not have to increment its counter. When rte_tqs_check is called, it will see that the counter is already up to date. (I am missing a load-acquire on the token, I will correct that in the next version).
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ /* Release the store to initial TQS count so that workers
+ * can use it immediately after this function returns.
+ */
+ __atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),
+__ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Trigger the worker threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Expected number of times the quiescent state is entered
+ * - If successful, this is the token for this call of the API.
+ * This should be passed to rte_tqs_check API.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ * - 0 Otherwise and always (non-debug mode compilation).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t) {
+ TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
+
+ /* This store release will ensure that changes to any data
+ * structure are visible to the workers before the token
+ * update is visible.
+ */
+ *t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Update quiescent state for the worker thread on a lcore.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the worker threads registered to report their quiescent state
+ * on the TQS variable must call this API.
+ *
+ * TQS variable
+ */
+static __rte_always_inline void __rte_experimental
+rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id) {
+ uint32_t t;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE);
Post by Honnappa Nagarahalli
+
+ /* Load the token before the worker thread loads any other
+ * (lock-free) data structure. This ensures that updates
+ * to the data structures are visible if the update
+ * to token is visible.
+ */
+ t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
Hmm, I am not very familiar with C11 model, but it looks like a race condition
as I understand, update() supposed be called at the end of reader's critical
section, correct?
Yes, the understanding is correct.
Post by Ananyev, Konstantin
But ACQUIRE is only a hoist barrier, which means compiler and cpu are free to
move earlier reads (and writes) after it.
Yes, your understanding is correct.
Post by Ananyev, Konstantin
It probably needs to be a full ACQ_REL here.
The sequence of operations is as follows:
1) Writer 'deletes' an entry from a lock-free data structure
2) Writer calls rte_tqs_start - This API increments the 'token' and does a store-release. So, any earlier stores would be visible if the store to 'token' is visible (to the data plane threads).
3) Reader calls rte_tqs_update - This API load-acquires the 'token'.
a) If this 'token' is the updated value from 2) then the entry deleted from 1) will not be available for the reader to reference (even if that reference is due to earlier reads being moved after load-acquire of 'token').
b) If this 'token' is not the updated value from 2) then the entry deleted from 1) may or may not be available for the reader to reference. In this case the w[lcore_id].cnt is not updated, hence the writer will wait to 'free' the deleted entry from 1)
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+ if (v->w[lcore_id].cnt != t)
+ v->w[lcore_id].cnt++;
+}
+
+/**
+ *
+ * Checks if all the worker threads have entered the quiescent state
+ * 'n' number of times. 'n' is provided in rte_tqs_start API.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Token returned by rte_tqs_start API
+ * If true, block till all the worker threads have completed entering
+ * the quiescent state 'n' number of times
+ * - 0 if all worker threads have NOT passed through specified number
+ * of quiescent states.
+ * - 1 if all worker threads have passed through specified number
+ * of quiescent states.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait) {
+ uint64_t l;
+ uint64_t lcore_mask;
+
+ TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
+
+ do {
+ /* Load the current lcore_mask before loading the
+ * worker thread quiescent state counters.
+ */
+ lcore_mask = __atomic_load_n(&v->lcore_mask,
__ATOMIC_ACQUIRE);
What would happen if reader will call unregister() simultaneously with check()
and will update lcore_mask straight after that load?
As I understand check() might hang in such case.
If the 'lcore_mask' is updated after this load, it will affect only the current iteration of the while loop below. In the next iteration the 'lcore_mask' is loaded again.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
As I understand, that makes control-path function progress dependent on
simultaneous invocation of data-path functions.
I agree that the control-path function progress (for ex: how long to wait for freeing the memory) depends on invocation of the data-path functions. The separation of 'start', 'check' and the option not to block in 'check' provide the flexibility for control-path to do some other work if it chooses to.
Post by Ananyev, Konstantin
In some cases that might cause control-path to hang.
Let say if data-path function wouldn't be called, or user invokes control-path
and data-path functions from the same thread.
I agree with the case of data-path function not getting called. I would consider that as programming error. I can document that warning in the rte_tqs_check API.

In the case of same thread calling both control-path and data-path functions, it would depend on the sequence of the calls. The following sequence should not cause any hangs:
Worker thread
1) 'deletes' an entry from a lock-free data structure
2) rte_tqs_start
3) rte_tqs_update
4) rte_tqs_check (wait == 1 or wait == 0)
5) 'free' the entry deleted in 1)

If 3) and 4) are interchanged, then there will be a hang if wait is set to 1. If wait is set to 0, there should not be a hang.
I can document this as part of the documentation (I do not think API documentation is required for this).
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ lcore_mask &= ~(1UL << l);
+ }
+
+ if (lcore_mask == 0)
+ return 1;
+
+ rte_pause();
+ } while (wait);
+
+ return 0;
+}
+
Ananyev, Konstantin
2018-11-28 15:25:21 UTC
Permalink
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Hi Honnappa,
Thank you for reviewing the patch, appreciate your comments.
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+/* Allocate a new TQS variable with the name *name* in memory. */
+struct rte_tqs * __rte_experimental rte_tqs_alloc(const char *name,
+int socket_id, uint64_t lcore_mask) {
+ char tqs_name[RTE_TQS_NAMESIZE];
+ struct rte_tailq_entry *te, *tmp_te;
+ struct rte_tqs_list *tqs_list;
+ struct rte_tqs *v, *tmp_v;
+ int ret;
+
+ if (name == NULL) {
+ RTE_LOG(ERR, TQS, "Invalid input parameters\n");
+ rte_errno = -EINVAL;
+ return NULL;
+ }
+
+ te = rte_zmalloc("TQS_TAILQ_ENTRY", sizeof(*te), 0);
+ if (te == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for tailq\n");
+ rte_errno = -ENOMEM;
+ return NULL;
+ }
+
+ snprintf(tqs_name, sizeof(tqs_name), "%s", name);
+ v = rte_zmalloc_socket(tqs_name, sizeof(struct rte_tqs),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (v == NULL) {
+ RTE_LOG(ERR, TQS, "Cannot reserve memory for TQS
variable\n");
Post by Honnappa Nagarahalli
+ rte_errno = -ENOMEM;
+ goto alloc_error;
+ }
+
+ ret = snprintf(v->name, sizeof(v->name), "%s", name);
+ if (ret < 0 || ret >= (int)sizeof(v->name)) {
+ rte_errno = -ENAMETOOLONG;
+ goto alloc_error;
+ }
+
+ te->data = (void *) v;
+ v->lcore_mask = lcore_mask;
+
+ rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+ tqs_list = RTE_TAILQ_CAST(rte_tqs_tailq.head, rte_tqs_list);
+
+ /* Search if a TQS variable with the same name exists already */
+ TAILQ_FOREACH(tmp_te, tqs_list, next) {
+ tmp_v = (struct rte_tqs *) tmp_te->data;
+ if (strncmp(name, tmp_v->name, RTE_TQS_NAMESIZE) == 0)
+ break;
+ }
+
+ if (tmp_te != NULL) {
+ rte_errno = -EEXIST;
+ goto tqs_exist;
+ }
+
+ TAILQ_INSERT_TAIL(tqs_list, te, next);
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ return v;
+
+ rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+ rte_free(te);
+ rte_free(v);
+ return NULL;
+}
That seems quite heavy-weight function just to allocate sync variable.
As size of struct rte_tqs is constant and known to the user, might be better just
provide rte_tqs_init(struct rte_tqs *tqs, ...) and let user allocate/free memory
for it by himself.
I believe, when you say heavy-weight, you are referring to adding tqs variable to the TAILQ and allocating the memory for it.
Yes.
Post by Honnappa Nagarahalli
Agree. I also
do not expect that there are a whole lot of tqs variables used in an application. Even in rte_tqs_free, there is similar overhead.
The extra part is due to the way the TQS variable will get identified by data plane threads. I am thinking that a data plane thread will use the
rte_tqs_lookup API to identify a TQS variable. However, it is possible to share this with data plane threads via a simple shared structure as
well.
Along with not allocating the memory, are you suggesting that we could skip maintaining a list of TQS variables in the TAILQ? This will
remove rte_tqs_lookup, rte_tqs_free, rte_tqs_list_dump APIs. I am fine with this approach.
Yes, that's what I suggest.
My thought was - it is just another data structure used for synchronization (as spinlock, rwlock, etc.).
So should be possible to allocate it statically and we probably don't need to have an ability to lookup
such variable by name via tailq.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE),
Post by Honnappa Nagarahalli
+ -EINVAL);
It is not very good practice to make function return different values and behave
in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to check input
parameters.
For fast-path (functions in .h) we sometimes skip such checking, but debug
mode can probably use RTE_ASSERT() or so.
Makes sense, I will change this in the next version.
Post by Ananyev, Konstantin
lcore_id >= RTE_TQS_MAX_LCORE
Is this limitation really necessary?
I added this limitation because currently DPDK application cannot take a mask more than 64bit wide. Otherwise, this should be as big as
RTE_MAX_LCORE.
I see that in the case of '-lcores' option, the number of lcores can be more than the number of PEs. In this case, we still need a MAX limit
(but can be bigger than 64).
Post by Ananyev, Konstantin
First it means that only lcores can use that API (at least data-path part), second
even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of requiring lcore_id as
input parameter, you'll just make it return index of next available entry in w[].
Then tqs_update() can take that index as input parameter.
I had thought about a similar approach based on IDs. I was concerned that ID will be one more thing to manage for the application. But, I
see the limitations of the current approach now. I will change it to allocation based. This will support even non-EAL pthreads as well.
Yes, with such approach non-lcore threads will be able to use it also.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ /* Worker thread has to count the quiescent states
+ * only from the current value of token.
+ */
+ v->w[lcore_id].cnt = v->token;
Wonder what would happen, if new reader will call register(), after writer calls
start()?
Looks like a race-condition.
Or such pattern is not supported?
The start should be called only after the reference to the entry in the data structure is 'deleted'. Hence the new reader will not get the
reference to the deleted entry and does not have to increment its counter. When rte_tqs_check is called, it will see that the counter is
already up to date. (I am missing a load-acquire on the token, I will correct that in the next version).
Yes, with _acquire_ in place it seems to be good here.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ /* Release the store to initial TQS count so that workers
+ * can use it immediately after this function returns.
+ */
+ __atomic_fetch_or(&v->lcore_mask, (1UL << lcore_id),
+__ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Trigger the worker threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Expected number of times the quiescent state is entered
+ * - If successful, this is the token for this call of the API.
+ * This should be passed to rte_tqs_check API.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ * - 0 Otherwise and always (non-debug mode compilation).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_start(struct rte_tqs *v, unsigned int n, uint32_t *t) {
+ TQS_RETURN_IF_TRUE((v == NULL || t == NULL), -EINVAL);
+
+ /* This store release will ensure that changes to any data
+ * structure are visible to the workers before the token
+ * update is visible.
+ */
+ *t = __atomic_add_fetch(&v->token, n, __ATOMIC_RELEASE);
+
+ return 0;
+}
+
+/**
+ *
+ * Update quiescent state for the worker thread on a lcore.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the worker threads registered to report their quiescent state
+ * on the TQS variable must call this API.
+ *
+ * TQS variable
+ */
+static __rte_always_inline void __rte_experimental
+rte_tqs_update(struct rte_tqs *v, unsigned int lcore_id) {
+ uint32_t t;
+
+ TQS_ERR_LOG_IF_TRUE(v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE);
Post by Honnappa Nagarahalli
+
+ /* Load the token before the worker thread loads any other
+ * (lock-free) data structure. This ensures that updates
+ * to the data structures are visible if the update
+ * to token is visible.
+ */
+ t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
Hmm, I am not very familiar with C11 model, but it looks like a race condition
as I understand, update() supposed be called at the end of reader's critical
section, correct?
Yes, the understanding is correct.
Post by Ananyev, Konstantin
But ACQUIRE is only a hoist barrier, which means compiler and cpu are free to
move earlier reads (and writes) after it.
Yes, your understanding is correct.
Post by Ananyev, Konstantin
It probably needs to be a full ACQ_REL here.
1) Writer 'deletes' an entry from a lock-free data structure
2) Writer calls rte_tqs_start - This API increments the 'token' and does a store-release. So, any earlier stores would be visible if the store to
'token' is visible (to the data plane threads).
3) Reader calls rte_tqs_update - This API load-acquires the 'token'.
a) If this 'token' is the updated value from 2) then the entry deleted from 1) will not be available for the reader to reference (even if
that reference is due to earlier reads being moved after load-acquire of 'token').
b) If this 'token' is not the updated value from 2) then the entry deleted from 1) may or may not be available for the reader to
reference. In this case the w[lcore_id].cnt is not updated, hence the writer will wait to 'free' the deleted entry from 1)
Yes, you right, it's me being confused.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+ if (v->w[lcore_id].cnt != t)
+ v->w[lcore_id].cnt++;
+}
+
+/**
+ *
+ * Checks if all the worker threads have entered the quiescent state
+ * 'n' number of times. 'n' is provided in rte_tqs_start API.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * TQS variable
+ * Token returned by rte_tqs_start API
+ * If true, block till all the worker threads have completed entering
+ * the quiescent state 'n' number of times
+ * - 0 if all worker threads have NOT passed through specified number
+ * of quiescent states.
+ * - 1 if all worker threads have passed through specified number
+ * of quiescent states.
+ * - -EINVAL if the parameters are invalid (debug mode compilation only).
+ */
+static __rte_always_inline int __rte_experimental
+rte_tqs_check(struct rte_tqs *v, uint32_t t, bool wait) {
+ uint64_t l;
+ uint64_t lcore_mask;
+
+ TQS_RETURN_IF_TRUE((v == NULL), -EINVAL);
+
+ do {
+ /* Load the current lcore_mask before loading the
+ * worker thread quiescent state counters.
+ */
+ lcore_mask = __atomic_load_n(&v->lcore_mask,
__ATOMIC_ACQUIRE);
What would happen if reader will call unregister() simultaneously with check()
and will update lcore_mask straight after that load?
As I understand check() might hang in such case.
If the 'lcore_mask' is updated after this load, it will affect only the current iteration of the while loop below. In the next iteration the
'lcore_mask' is loaded again.
True, my confusion again.
Post by Honnappa Nagarahalli
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
As I understand, that makes control-path function progress dependent on
simultaneous invocation of data-path functions.
I agree that the control-path function progress (for ex: how long to wait for freeing the memory) depends on invocation of the data-path
functions. The separation of 'start', 'check' and the option not to block in 'check' provide the flexibility for control-path to do some other
work if it chooses to.
Post by Ananyev, Konstantin
In some cases that might cause control-path to hang.
Let say if data-path function wouldn't be called, or user invokes control-path
and data-path functions from the same thread.
I agree with the case of data-path function not getting called. I would consider that as programming error. I can document that warning in
the rte_tqs_check API.
Sure, it can be documented.
Though that means, that each data-path thread would have to do explicit update() call
for every tqs it might use.
I just think that it would complicate things and might limit usage of the library quite significantly.
Post by Honnappa Nagarahalli
In the case of same thread calling both control-path and data-path functions, it would depend on the sequence of the calls. The following
Worker thread
1) 'deletes' an entry from a lock-free data structure
2) rte_tqs_start
3) rte_tqs_update
4) rte_tqs_check (wait == 1 or wait == 0)
5) 'free' the entry deleted in 1)
That an interesting idea, and that should help, I think.
Probably worth to have {2,3,4} sequence as a new high level function.
Post by Honnappa Nagarahalli
If 3) and 4) are interchanged, then there will be a hang if wait is set to 1. If wait is set to 0, there should not be a hang.
I can document this as part of the documentation (I do not think API documentation is required for this).
Post by Ananyev, Konstantin
Post by Honnappa Nagarahalli
+
+ lcore_mask &= ~(1UL << l);
+ }
+
+ if (lcore_mask == 0)
+ return 1;
+
+ rte_pause();
+ } while (wait);
+
+ return 0;
+}
+
Honnappa Nagarahalli
2018-12-07 07:27:16 UTC
Permalink
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of
+threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE),
Post by Honnappa Nagarahalli
+ -EINVAL);
It is not very good practice to make function return different
values and behave in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to
check input parameters.
For fast-path (functions in .h) we sometimes skip such checking, but
debug mode can probably use RTE_ASSERT() or so.
Makes sense, I will change this in the next version.
Post by Honnappa Nagarahalli
lcore_id >= RTE_TQS_MAX_LCORE
Is this limitation really necessary?
I added this limitation because currently DPDK application cannot take
a mask more than 64bit wide. Otherwise, this should be as big as
RTE_MAX_LCORE.
Post by Honnappa Nagarahalli
I see that in the case of '-lcores' option, the number of lcores can
be more than the number of PEs. In this case, we still need a MAX limit (but
can be bigger than 64).
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
First it means that only lcores can use that API (at least data-path
part), second even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of
requiring lcore_id as input parameter, you'll just make it return index of
next available entry in w[].
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Then tqs_update() can take that index as input parameter.
I had thought about a similar approach based on IDs. I was concerned
that ID will be one more thing to manage for the application. But, I see the
limitations of the current approach now. I will change it to allocation based.
This will support even non-EAL pthreads as well.
Yes, with such approach non-lcore threads will be able to use it also.
I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need to be efficient as they can be called from the worker's packet processing loop (rte_event_dequeue_burst allows blocking. So, the worker thread needs to call rte_tqs_unregister_lcore before calling rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet processing). Allocating the thread ID in these functions will make them more complex.

I suggest that we change the argument 'lcore_id' to 'thread_id'. The application could use 'lcore_id' as 'thread_id' if threads are mapped to physical cores 1:1.

If the threads are not mapped 1:1 to physical cores, the threads need to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see that DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation, but I think the thread ID allocation should not be part of this library. Such thread ID might be useful for other libraries.

<snip>
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
+
+ while (lcore_mask) {
+ l = __builtin_ctz(lcore_mask);
+ if (v->w[l].cnt != t)
+ break;
As I understand, that makes control-path function progress dependent
on simultaneous invocation of data-path functions.
I agree that the control-path function progress (for ex: how long to
wait for freeing the memory) depends on invocation of the data-path
functions. The separation of 'start', 'check' and the option not to block in
'check' provide the flexibility for control-path to do some other work if it
chooses to.
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
In some cases that might cause control-path to hang.
Let say if data-path function wouldn't be called, or user invokes
control-path and data-path functions from the same thread.
I agree with the case of data-path function not getting called. I
would consider that as programming error. I can document that warning in
the rte_tqs_check API.
Sure, it can be documented.
Though that means, that each data-path thread would have to do explicit
update() call for every tqs it might use.
I just think that it would complicate things and might limit usage of the library
quite significantly.
Each data path thread has to report its quiescent state. Hence, each data-path thread has to call update() (similar to how rte_timer_manage() has to be called periodically on the worker thread).
Do you have any particular use case in mind where this fails?
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
In the case of same thread calling both control-path and data-path
functions, it would depend on the sequence of the calls. The following
Worker thread
1) 'deletes' an entry from a lock-free data structure
2) rte_tqs_start
3) rte_tqs_update
4) rte_tqs_check (wait == 1 or wait == 0)
5) 'free' the entry deleted in 1)
That an interesting idea, and that should help, I think.
Probably worth to have {2,3,4} sequence as a new high level function.
Yes, this is a good idea. Such a function would be applicable only in the worker thread. I would prefer to leave it to the application to take care.
Stephen Hemminger
2018-12-07 17:29:36 UTC
Permalink
On Fri, 7 Dec 2018 07:27:16 +0000
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
+
+/* Add a reader thread, running on an lcore, to the list of
+threads
+ * reporting their quiescent state on a TQS variable.
+ */
+int __rte_experimental
+rte_tqs_register_lcore(struct rte_tqs *v, unsigned int lcore_id) {
+ TQS_RETURN_IF_TRUE((v == NULL || lcore_id >=
RTE_TQS_MAX_LCORE),
Post by Honnappa Nagarahalli
+ -EINVAL);
It is not very good practice to make function return different
values and behave in a different way in debug/non-debug mode.
I'd say that for slow-path (functions in .c) it is always good to
check input parameters.
For fast-path (functions in .h) we sometimes skip such checking, but
debug mode can probably use RTE_ASSERT() or so.
Makes sense, I will change this in the next version.
Post by Honnappa Nagarahalli
lcore_id >= RTE_TQS_MAX_LCORE
Is this limitation really necessary?
I added this limitation because currently DPDK application cannot take
a mask more than 64bit wide. Otherwise, this should be as big as
RTE_MAX_LCORE.
Post by Honnappa Nagarahalli
I see that in the case of '-lcores' option, the number of lcores can
be more than the number of PEs. In this case, we still need a MAX limit (but
can be bigger than 64).
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
First it means that only lcores can use that API (at least data-path
part), second even today many machines have more than 64 cores.
I think you can easily avoid such limitation, if instead of
requiring lcore_id as input parameter, you'll just make it return index of
next available entry in w[].
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Then tqs_update() can take that index as input parameter.
I had thought about a similar approach based on IDs. I was concerned
that ID will be one more thing to manage for the application. But, I see the
limitations of the current approach now. I will change it to allocation based.
This will support even non-EAL pthreads as well.
Yes, with such approach non-lcore threads will be able to use it also.
I realized that rte_tqs_register_lcore/ rte_tqs_unregister_lcore need to be efficient as they can be called from the worker's packet processing loop (rte_event_dequeue_burst allows blocking. So, the worker thread needs to call rte_tqs_unregister_lcore before calling rte_event_dequeue_burst and rte_tqs_register_lcore before starting packet processing). Allocating the thread ID in these functions will make them more complex.
I suggest that we change the argument 'lcore_id' to 'thread_id'. The application could use 'lcore_id' as 'thread_id' if threads are mapped to physical cores 1:1.
If the threads are not mapped 1:1 to physical cores, the threads need to use a thread_id in the range of 0 - RTE_TQS_MAX_THREADS. I do not see that DPDK has a thread_id concept. For TQS, the thread IDs are global (i.e. not per TQS variable). I could provide APIs to do the thread ID allocation, but I think the thread ID allocation should not be part of this library. Such thread ID might be useful for other libraries.
<snip
Thread id is problematic since Glibc doesn't want to give it out.
You have to roll your own function to do gettid().
It is not as easy as just that. Plus what about preemption?

Honnappa Nagarahalli
2018-11-22 03:30:55 UTC
Permalink
From: Dharmik Thakkar <***@arm.com>

Add API positive/negative test cases and functional tests.

Signed-off-by: Malvika Gupta <***@arm.com>
Signed-off-by: Dharmik Thakkar <***@arm.com>
Signed-off-by: Honnappa Nagarahalli <***@arm.com>
Reviewed-by: Gavin Hu <***@arm.com>
---
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 +++++++++++++++++++++++++++++++++++++
4 files changed, 552 insertions(+), 1 deletion(-)
create mode 100644 test/test/test_tqs.c

diff --git a/test/test/Makefile b/test/test/Makefile
index ab4fec34a..7a07039e7 100644
--- a/test/test/Makefile
+++ b/test/test/Makefile
@@ -207,6 +207,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_KVARGS) += test_kvargs.c

SRCS-$(CONFIG_RTE_LIBRTE_BPF) += test_bpf.c

+SRCS-$(CONFIG_RTE_LIBRTE_TQS) += test_tqs.c
+
CFLAGS += -DALLOW_EXPERIMENTAL_API

CFLAGS += -O3
diff --git a/test/test/autotest_data.py b/test/test/autotest_data.py
index 0fb7866db..e676757cd 100644
--- a/test/test/autotest_data.py
+++ b/test/test/autotest_data.py
@@ -676,6 +676,12 @@
"Func": default_autotest,
"Report": None,
},
+ {
+ "Name": "Thread Quiescent State autotest",
+ "Command": "tqs_autotest",
+ "Func": default_autotest,
+ "Report": None,
+ },
#
# Please always make sure that ring_perf is the last test!
#
diff --git a/test/test/meson.build b/test/test/meson.build
index 554e9945f..b80a449ad 100644
--- a/test/test/meson.build
+++ b/test/test/meson.build
@@ -100,6 +100,7 @@ test_sources = files('commands.c',
'test_timer.c',
'test_timer_perf.c',
'test_timer_racecond.c',
+ 'tet_tqs.c',
'test_version.c',
'virtual_pmd.c'
)
@@ -122,7 +123,8 @@ test_deps = ['acl',
'port',
'reorder',
'ring',
- 'timer'
+ 'timer',
+ 'tqs'
]

test_names = [
@@ -228,6 +230,7 @@ test_names = [
'timer_autotest',
'timer_perf__autotest',
'timer_racecond_autotest',
+ 'tqs_autotest',
'user_delay_us',
'version_autotest',
]
diff --git a/test/test/test_tqs.c b/test/test/test_tqs.c
new file mode 100644
index 000000000..8633a80a6
--- /dev/null
+++ b/test/test/test_tqs.c
@@ -0,0 +1,540 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2018 Arm Limited
+ */
+
+#include <stdio.h>
+#include <stdbool.h>
+#include <rte_pause.h>
+#include <rte_tqs.h>
+#include <rte_hash.h>
+#include <rte_hash_crc.h>
+#include <rte_malloc.h>
+
+#include "test.h"
+
+/* Check condition and return an error if true. */
+#define TQS_RETURN_IF_ERROR(tqs, cond, str, ...) do { \
+ if (cond) { \
+ printf("ERROR file %s, line %d: " str "\n", __FILE__, \
+ __LINE__, ##__VA_ARGS__); \
+ if (tqs) \
+ rte_tqs_free(tqs); \
+ return -1; \
+ } \
+} while (0)
+
+#define RTE_TQS_MAX_LCORE 64
+uint16_t enabled_core_ids[RTE_TQS_MAX_LCORE];
+uint64_t core_mask;
+uint32_t sw_token;
+uint16_t num_1qs = 1; /* Number of quiescent states = 1 */
+
+struct node {
+ int data;
+ struct node *next;
+};
+
+struct node *head = NULL, *lastNode;
+
+static inline int
+get_enabled_cores_mask(void)
+{
+ uint32_t i;
+ uint16_t core_id;
+ uint32_t max_cores = rte_lcore_count();
+
+ if (max_cores > RTE_TQS_MAX_LCORE) {
+ printf("Number of cores exceed %d\n", RTE_TQS_MAX_LCORE);
+ return -1;
+ }
+
+ core_id = 0;
+ core_mask = 0;
+ i = 0;
+ RTE_LCORE_FOREACH_SLAVE(core_id) {
+ enabled_core_ids[i] = core_id;
+ i++;
+ core_mask |= 1UL << core_id;
+ }
+
+ return 0;
+}
+
+/*
+ * rte_tqs_alloc: Allocate a TQS variable
+ */
+static int
+test_tqs_alloc(void)
+{
+ const char name32[] = "xyzxyzxyzxyzxyzxyzxyzxyzxyzx123";
+ const char name33[] = "xyzxyizxyzxyzxyzxyzxyzxyzxyzxyzxyz";
+ const char name3[] = "xyz";
+ struct rte_tqs *t;
+
+ printf("Test rte_tqs_alloc()\n");
+
+ t = rte_tqs_alloc(NULL, SOCKET_ID_ANY, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t != NULL), "NULL TQS variable");
+
+ t = rte_tqs_alloc(name3, SOCKET_ID_ANY, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Variable name < %d",
+ RTE_TQS_NAMESIZE);
+ rte_tqs_free(t);
+
+ t = rte_tqs_alloc(name32, SOCKET_ID_ANY, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Variable name < %d",
+ RTE_TQS_NAMESIZE);
+ rte_tqs_free(t);
+
+ t = rte_tqs_alloc(name33, SOCKET_ID_ANY, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t != NULL), "Variable name > %d",
+ RTE_TQS_NAMESIZE);
+
+ t = rte_tqs_alloc(name3, 0, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Valid socket ID");
+ rte_tqs_free(t);
+
+ t = rte_tqs_alloc(name3, 10000, core_mask);
+ TQS_RETURN_IF_ERROR(t, (t != NULL), "Invalid socket ID");
+
+ t = rte_tqs_alloc(name3, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "0 core mask");
+ rte_tqs_free(t);
+
+ return 0;
+}
+
+/*
+ * rte_tqs_register_lcore: Register threads
+ */
+static int
+test_tqs_register_lcore(void)
+{
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ int ret;
+
+ printf("\nTest rte_tqs_register_lcore()\n");
+
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[0]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL),
+ "lcore_id < RTE_TQS_MAX_LCORE");
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_unregister_lcore: Unregister threads
+ */
+static int
+test_tqs_unregister_lcore(void)
+{
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ int i, ret;
+
+ printf("\nTest rte_tqs_unregister_lcore()\n");
+
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[0]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Register lcore failed");
+
+ /* Find first disabled core */
+ for (i = 0; i < RTE_TQS_MAX_LCORE; i++) {
+ if (enabled_core_ids[i] == 0)
+ break;
+ }
+ ret = rte_tqs_unregister_lcore(t, i);
+ TQS_RETURN_IF_ERROR(t, (ret != 0), "lcore disabled");
+
+ ret = rte_tqs_unregister_lcore(t, enabled_core_ids[0]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Valid lcore_id");
+
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_start: Trigger reader threads to count the number of times they enter
+ * the quiescent state
+ */
+static int
+test_tqs_start(void)
+{
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ uint32_t token;
+ int i, ret;
+
+ printf("\nTest rte_tqs_start()\n");
+
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ for (i = 0; i < 3; i++) {
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[i]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL),
+ "Register lcore failed");
+ }
+
+ ret = rte_tqs_start(t, 1, &token);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "1 quiescent states");
+
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_check: Check if all reader threads have completed entering the
+ * quiescent state 'n' times
+ */
+static int
+test_tqs_check(void)
+{
+
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ int i, ret;
+ uint32_t token;
+
+ printf("\nTest rte_tqs_check()\n");
+
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_start(t, 1, &token);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "TQS Start failed");
+
+ ret = rte_tqs_check(t, 0, true);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Token = 0");
+
+ ret = rte_tqs_check(t, token, true);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Blocking TQS check");
+
+ for (i = 0; i < 3; i++) {
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[i]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL),
+ "Register lcore failed");
+ }
+
+ ret = rte_tqs_check(t, token, false);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL), "Non-blocking TQS check");
+
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_lookup: Lookup a TQS variable by its name
+ */
+static int
+test_tqs_lookup(void)
+{
+ struct rte_tqs *t, *ret;
+ const char *name1 = "TQS";
+ const char *name2 = "NO_TQS";
+
+ printf("\nTest rte_tqs_lookup()\n");
+
+ t = rte_tqs_alloc(name1, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_lookup(name1);
+ TQS_RETURN_IF_ERROR(t, (ret != t), "Allocated TQS variable name");
+
+ ret = rte_tqs_lookup(name2);
+ TQS_RETURN_IF_ERROR(t, (ret != NULL), "Unallocated TQS variable name");
+
+ rte_tqs_free(t);
+ return 0;
+}
+
+/*
+ * rte_tqs_list_dump: Dump the status of all TQS variables to a file
+ */
+static int
+test_tqs_list_dump(void)
+{
+ struct rte_tqs *t1, *t2;
+ const char name1[] = "TQS_1";
+ const char name2[] = "TQS_2";
+ int i, ret;
+
+ printf("\nTest rte_tqs_list_dump()\n");
+
+ /* File pointer NULL */
+ rte_tqs_list_dump(NULL);
+
+ /* Dump an empty list */
+ rte_tqs_list_dump(stdout);
+
+ t1 = rte_tqs_alloc(name1, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t1, (t1 == NULL), "Failed to alloc TQS variable");
+
+ /* Dump a list with TQS variable that has no readers */
+ rte_tqs_list_dump(stdout);
+
+ t2 = rte_tqs_alloc(name2, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t2, (t2 == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_register_lcore(t1, enabled_core_ids[0]);
+ if (ret != 0) {
+ printf("ERROR file %s, line %d: Failed to register lcore\n",
+ __FILE__, __LINE__);
+ return -1;
+ }
+
+ for (i = 1; i < 3; i++) {
+ ret = rte_tqs_register_lcore(t2, enabled_core_ids[i]);
+ if (ret != 0) {
+ printf("ERROR file %s, line %d: Failed to register lcore\n",
+ __FILE__, __LINE__);
+ return -1;
+ }
+ }
+
+ rte_tqs_list_dump(stdout);
+
+ rte_tqs_free(t1);
+ rte_tqs_free(t2);
+
+ return 0;
+}
+
+/*
+ * rte_tqs_dump: Dump status of a single TQS variable to a file
+ */
+static int
+test_tqs_dump(void)
+{
+ struct rte_tqs *t1, *t2;
+ const char name1[] = "TQS_1";
+ const char name2[] = "TQS_2";
+ int i, ret;
+
+ printf("\nTest rte_tqs_dump()\n");
+
+ /* NULL TQS variable */
+ rte_tqs_dump(stdout, NULL);
+
+ t1 = rte_tqs_alloc(name1, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t1, (t1 == NULL), "Failed to alloc TQS variable");
+
+ /* NULL file pointer */
+ rte_tqs_dump(NULL, t1);
+
+ /* TQS variable with 0 core mask */
+ rte_tqs_dump(stdout, t1);
+
+ t2 = rte_tqs_alloc(name2, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t2, (t2 == NULL), "Failed to alloc TQS variable");
+
+ ret = rte_tqs_register_lcore(t1, enabled_core_ids[0]);
+ if (ret != 0) {
+ printf("ERROR file %s, line %d: Failed to register lcore\n",
+ __FILE__, __LINE__);
+ return -1;
+ }
+
+ for (i = 1; i < 3; i++) {
+ ret = rte_tqs_register_lcore(t2, enabled_core_ids[i]);
+ if (ret != 0) {
+ printf("ERROR file %s, line %d: Failed to register lcore\n",
+ __FILE__, __LINE__);
+ return -1;
+ }
+ }
+
+ rte_tqs_dump(stdout, t1);
+ rte_tqs_dump(stdout, t2);
+
+ return 0;
+}
+
+struct rte_hash *handle;
+static uint32_t *keys;
+#define TOTAL_ENTRY (1 * 8)
+#define COUNTER_VALUE 4096
+uint32_t *hash_data[TOTAL_ENTRY];
+uint8_t writer_done;
+static int
+
+test_tqs_reader(__attribute__((unused)) void *arg)
+{
+ struct rte_tqs *t;
+ int i, ret;
+ uint32_t lcore_id = rte_lcore_id();
+
+ t = rte_tqs_lookup("TQS");
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "TQS variable lookup failed");
+
+ do {
+ for (i = 0; i < TOTAL_ENTRY; i++) {
+ uint32_t *pdata;
+ ret = rte_hash_lookup_data(handle, keys+i,
+ (void **)&pdata);
+ if (ret != -ENOENT)
+ while (*pdata < COUNTER_VALUE)
+ ++*pdata;
+ }
+
+ /* Update quiescent state counter */
+ rte_tqs_update(t, lcore_id);
+ } while (!writer_done);
+
+ return 0;
+}
+
+static int
+init_hash(void)
+{
+ int i, ret;
+ struct rte_hash_parameters hash_params = {
+ .entries = TOTAL_ENTRY,
+ .key_len = sizeof(uint32_t),
+ .hash_func_init_val = 0,
+ .socket_id = rte_socket_id(),
+ .hash_func = rte_hash_crc,
+ .extra_flag =
+ RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF,
+ .name = "tests",
+ };
+
+ handle = rte_hash_create(&hash_params);
+ TQS_RETURN_IF_ERROR(NULL, (handle == NULL), "Hash create Failed");
+
+ for (i = 0; i < TOTAL_ENTRY; i++) {
+ hash_data[i] = rte_zmalloc(NULL, sizeof(uint32_t), 0);
+ TQS_RETURN_IF_ERROR(NULL, (hash_data[i] == NULL), "No memory");
+ }
+ keys = rte_malloc(NULL, sizeof(uint32_t) * TOTAL_ENTRY, 0);
+ TQS_RETURN_IF_ERROR(NULL, (keys == NULL), "No memory");
+
+ for (i = 0; i < TOTAL_ENTRY; i++)
+ keys[i] = i;
+
+ for (i = 0; i < TOTAL_ENTRY; i++) {
+ ret = rte_hash_add_key_data(handle, keys + i,
+ (void *)((uintptr_t)hash_data[i]));
+ TQS_RETURN_IF_ERROR(NULL, (ret < 0),
+ "Hash key add Failed #%d\n", i);
+ }
+ return 0;
+}
+
+/*
+ * Single writer, 1 TQS variable, 1 quiescent state
+ */
+static int
+test_tqs_sw_sv_1qs(void)
+{
+ struct rte_tqs *t;
+ const char *name = "TQS";
+ static uint32_t token;
+ int i, ret;
+ int32_t pos;
+ writer_done = 0;
+
+ printf("\nTest Single writer, 1 TQS variable, pass 1 quiescent state\n");
+
+ /* TQS variable is allocated */
+ t = rte_tqs_alloc(name, SOCKET_ID_ANY, 0);
+ TQS_RETURN_IF_ERROR(t, (t == NULL), "Failed to alloc TQS variable");
+
+ /* Register worker threads on 4 cores */
+ for (i = 0; i < 4; i++) {
+ ret = rte_tqs_register_lcore(t, enabled_core_ids[i]);
+ TQS_RETURN_IF_ERROR(t, (ret == -EINVAL),
+ "Register lcore failed");
+ }
+
+ /* Shared data structure created */
+ ret = init_hash();
+ TQS_RETURN_IF_ERROR(t, (ret != 0), "Hash init failed");
+
+ /* Reader threads are launched */
+ for (i = 0; i < 4; i++)
+ rte_eal_remote_launch(test_tqs_reader, NULL,
+ enabled_core_ids[i]);
+
+ for (i = 0; i < TOTAL_ENTRY; i++) {
+ /* Delete elements from the shared data structure */
+ pos = rte_hash_del_key(handle, keys + i);
+ TQS_RETURN_IF_ERROR(t, (pos < 0), "Delete key failed #%d",
+ keys[i]);
+ /* Start the quiescent state query process */
+ ret = rte_tqs_start(t, num_1qs, &token);
+ TQS_RETURN_IF_ERROR(t, (ret != 0), "TQS Start Failed");
+
+ /* Check the quiescent state status */
+ rte_tqs_check(t, token, true);
+ TQS_RETURN_IF_ERROR(t, (*hash_data[i] != COUNTER_VALUE),
+ "Reader did not complete %d",
+ *hash_data[i]);
+
+ rte_hash_free_key_with_position(handle, pos);
+ TQS_RETURN_IF_ERROR(t, (ret < 0),
+ "Failed to free the key #%d", keys[i]);
+ rte_free(hash_data[i]);
+ }
+ writer_done = 1;
+ /* Wait until all readers have exited */
+ rte_eal_mp_wait_lcore();
+
+ /* Free TQS variable */
+ rte_tqs_free(t);
+ rte_hash_free(handle);
+ rte_free(keys);
+
+ return 0;
+}
+
+static int
+test_tqs_main(void)
+{
+ if (get_enabled_cores_mask() != 0)
+ return -1;
+
+ /* Error-checking test cases */
+ if (test_tqs_alloc() < 0)
+ goto test_fail;
+
+ if (test_tqs_register_lcore() < 0)
+ goto test_fail;
+
+ if (test_tqs_unregister_lcore() < 0)
+ goto test_fail;
+
+ if (test_tqs_start() < 0)
+ goto test_fail;
+
+ if (test_tqs_check() < 0)
+ goto test_fail;
+
+ if (test_tqs_lookup() < 0)
+ goto test_fail;
+
+ if (test_tqs_list_dump() < 0)
+ goto test_fail;
+
+ if (test_tqs_dump() < 0)
+ goto test_fail;
+
+
+ /* Functional test cases */
+ if (test_tqs_sw_sv_1qs() < 0)
+ goto test_fail;
+
+ printf("\n");
+ return 0;
+
+ test_fail:
+ return -1;
+}
+
+REGISTER_TEST_COMMAND(tqs_autotest, test_tqs_main);
--
2.17.1
Ilya Maximets
2018-11-22 07:31:08 UTC
Permalink
Hi.
Is the any differentiation points with liburcu [1] ?
Is there any profit having own implementation inside DPDK ?

[1] http://liburcu.org/
https://lwn.net/Articles/573424/

Best regards, Ilya Maximets.
Post by Honnappa Nagarahalli
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).
In the following paras, the term 'memory' refers to memory allocated
by typical APIs like malloc or anything that is representative of
memory, for ex: an index of a free element array.
Since these data structures are lock less, the writers and readers
are accessing the data structures simultaneously. Hence, while removing
an element from a data structure, the writers cannot return the memory
to the allocator, without knowing that the readers are not
referencing that element/memory anymore. Hence, it is required to
Delete: in this step, the writer removes the element from the
data structure but does not return the associated memory to the allocator.
This will ensure that new readers will not get a reference to the removed
element. Removing the reference is an atomic operation.
Free: in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the removed element.
This library helps the writer determine when it is safe to free the
memory.
This library makes use of Thread Quiescent State (TQS). TQS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is upto the application to
Time -------------------------------------------------->
| |
RT1 $++++****D1****+++***D2*|**+++|+++**D3*****++++$
| |
RT2 $++++****D1****++|+**D2|***++++++**D3*****++++$
| |
RT3 $++++****D1****+++***|D2***|++++++**D2*****++++$
| |
|<--->|
Del | Free
|
Cannot free memory
during this period
RTx - Reader thread
< and > - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
atomic operation.
Free - Point in time when the writer can free the entry.
As shown thread RT1 acesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot return the memory associated with that element to the
allocator. The writer can return the memory to the allocator only after
the reader stops referencng D2. In other words, reader thread RT1
has to enter a quiescent state.
Similarly, since thread RT3 is also accessing D2, writer has to wait till
RT3 enters quiescent state as well.
However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.
It can be noted that, the critical sections for D2 and D3 are quiescent states
for D1. i.e. for a given data structure Dx, any point in the thread execution
that does not reference Dx is a quiescent state.
For DPDK applications, the start and end of while(1) loop (where no shared
data structures are getting accessed) act as perfect quiescent states. This
will combine all the shared data structure accesses into a single critical
section and keeps the over head introduced by this library to the minimum.
However, the length of the critical section and the number of reader threads
is proportional to the time taken to identify the end of critical section.
So, if the application desires, it should be possible to identify the end
of critical section for each data structure.
To provide the required flexibility, this library has a concept of TQS
variable. The application can create one or more TQS variables to help it
track the end of one or more critical sections.
The application can create a TQS variable using the API rte_tqs_alloc.
It takes a mask of lcore IDs that will report their quiescent states
using this variable. This mask can be empty to start with.
rte_tqs_register_lcore API will register a reader thread to report its
quiescent state. This can be called from any control plane thread or from
the reader thread. The application can create a TQS variable with no reader
threads and add the threads dynamically using this API.
The application can trigger the reader threads to report their quiescent
state status by calling the API rte_tqs_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_tqs_start returns a token to each caller.
The application has to call rte_tqs_check API with the token to get the
current status. Option to block till all the threads enter the quiescent
state is provided. If this API indicates that all the threads have entered
the quiescent state, the application can free the deleted entry.
The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of waiting for the
reader threads to enter the quiescent state.
rte_tqs_unregister_lcore API will remove a reader thread from reporting its
quiescent state using a TQS variable. The rte_tqs_check API will not wait
for this reader thread to report the quiescent state status anymore.
Finally, a TQS variable can be deleted by calling rte_tqs_free API.
Application must make sure that the reader threads are not referencing the
TQS variable anymore before deleting it.
The reader threads should call rte_tqs_update API to indicate that they
entered a quiescent state. This API checks if a writer has triggered a
quiescent state query and update the state accordingly.
1) Add more test cases
2) Convert to patch
3) Incorporate feedback from community
4) Add documentation
test/tqs: Add API and functional tests
log: add TQS log type
tqs: add thread quiescent state library
config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_eal/common/include/rte_log.h | 1 +
lib/librte_tqs/Makefile | 23 +
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 +++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 +
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 ++++++++++++++++++++++++
14 files changed, 1208 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map
create mode 100644 test/test/test_tqs.c
--
2.17.1
Stephen Hemminger
2018-11-27 22:28:03 UTC
Permalink
On Wed, 21 Nov 2018 21:30:52 -0600
Post by Honnappa Nagarahalli
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).
In the following paras, the term 'memory' refers to memory allocated
by typical APIs like malloc or anything that is representative of
memory, for ex: an index of a free element array.
Since these data structures are lock less, the writers and readers
are accessing the data structures simultaneously. Hence, while removing
an element from a data structure, the writers cannot return the memory
to the allocator, without knowing that the readers are not
referencing that element/memory anymore. Hence, it is required to
Delete: in this step, the writer removes the element from the
data structure but does not return the associated memory to the allocator.
This will ensure that new readers will not get a reference to the removed
element. Removing the reference is an atomic operation.
Free: in this step, the writer returns the memory to the
memory allocator, only after knowing that all the readers have stopped
referencing the removed element.
This library helps the writer determine when it is safe to free the
memory.
This library makes use of Thread Quiescent State (TQS). TQS can be
defined as 'any point in the thread execution where the thread does
not hold a reference to shared memory'. It is upto the application to
Time -------------------------------------------------->
| |
RT1 $++++****D1****+++***D2*|**+++|+++**D3*****++++$
| |
RT2 $++++****D1****++|+**D2|***++++++**D3*****++++$
| |
RT3 $++++****D1****+++***|D2***|++++++**D2*****++++$
| |
|<--->|
Del | Free
|
Cannot free memory
during this period
RTx - Reader thread
< and > - Start and end of while(1) loop
***Dx*** - Reader thread is accessing the shared data structure Dx.
i.e. critical section.
+++ - Reader thread is not accessing any shared data structure.
i.e. non critical section or quiescent state.
Del - Point in time when the reference to the entry is removed using
atomic operation.
Free - Point in time when the writer can free the entry.
As shown thread RT1 acesses data structures D1, D2 and D3. When it is
accessing D2, if the writer has to remove an element from D2, the
writer cannot return the memory associated with that element to the
allocator. The writer can return the memory to the allocator only after
the reader stops referencng D2. In other words, reader thread RT1
has to enter a quiescent state.
Similarly, since thread RT3 is also accessing D2, writer has to wait till
RT3 enters quiescent state as well.
However, the writer does not need to wait for RT2 to enter quiescent state.
Thread RT2 was not accessing D2 when the delete operation happened.
So, RT2 will not get a reference to the deleted entry.
It can be noted that, the critical sections for D2 and D3 are quiescent states
for D1. i.e. for a given data structure Dx, any point in the thread execution
that does not reference Dx is a quiescent state.
For DPDK applications, the start and end of while(1) loop (where no shared
data structures are getting accessed) act as perfect quiescent states. This
will combine all the shared data structure accesses into a single critical
section and keeps the over head introduced by this library to the minimum.
However, the length of the critical section and the number of reader threads
is proportional to the time taken to identify the end of critical section.
So, if the application desires, it should be possible to identify the end
of critical section for each data structure.
To provide the required flexibility, this library has a concept of TQS
variable. The application can create one or more TQS variables to help it
track the end of one or more critical sections.
The application can create a TQS variable using the API rte_tqs_alloc.
It takes a mask of lcore IDs that will report their quiescent states
using this variable. This mask can be empty to start with.
rte_tqs_register_lcore API will register a reader thread to report its
quiescent state. This can be called from any control plane thread or from
the reader thread. The application can create a TQS variable with no reader
threads and add the threads dynamically using this API.
The application can trigger the reader threads to report their quiescent
state status by calling the API rte_tqs_start. It is possible for multiple
writer threads to query the quiescent state status simultaneously. Hence,
rte_tqs_start returns a token to each caller.
The application has to call rte_tqs_check API with the token to get the
current status. Option to block till all the threads enter the quiescent
state is provided. If this API indicates that all the threads have entered
the quiescent state, the application can free the deleted entry.
The separation of triggering the reporting from querying the status provides
the writer threads flexibility to do useful work instead of waiting for the
reader threads to enter the quiescent state.
rte_tqs_unregister_lcore API will remove a reader thread from reporting its
quiescent state using a TQS variable. The rte_tqs_check API will not wait
for this reader thread to report the quiescent state status anymore.
Finally, a TQS variable can be deleted by calling rte_tqs_free API.
Application must make sure that the reader threads are not referencing the
TQS variable anymore before deleting it.
The reader threads should call rte_tqs_update API to indicate that they
entered a quiescent state. This API checks if a writer has triggered a
quiescent state query and update the state accordingly.
1) Add more test cases
2) Convert to patch
3) Incorporate feedback from community
4) Add documentation
test/tqs: Add API and functional tests
log: add TQS log type
tqs: add thread quiescent state library
config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_eal/common/include/rte_log.h | 1 +
lib/librte_tqs/Makefile | 23 +
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 +++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 +
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 ++++++++++++++++++++++++
14 files changed, 1208 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map
create mode 100644 test/test/test_tqs.c
Mixed feelings about this one.

Love to see RCU used for more things since it is much better than reader/writer
locks for many applications. But hate to see DPDK reinventing every other library
and not reusing code. Userspace RCU https://liburcu.org/ is widely supported by
distro's, more throughly tested and documented, and more flexiple.

The issue with many of these reinventions is a tradeoff of DPDK growing
another dependency on external library versus using common code.

For RCU, the big issue for me is the testing and documentation of how to do RCU
safely. Many people get it wrong!
Van Haaren, Harry
2018-11-27 22:49:21 UTC
Permalink
-----Original Message-----
Sent: Tuesday, November 27, 2018 2:28 PM
Subject: Re: [dpdk-dev] [RFC 0/3] tqs: add thread quiescent state library
On Wed, 21 Nov 2018 21:30:52 -0600
Post by Honnappa Nagarahalli
Lock-less data structures provide scalability and determinism.
They enable use cases where locking may not be allowed
(for ex: real-time applications).
<snip detailed RFC commit messag>
Post by Honnappa Nagarahalli
test/tqs: Add API and functional tests
log: add TQS log type
tqs: add thread quiescent state library
config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_eal/common/include/rte_log.h | 1 +
lib/librte_tqs/Makefile | 23 +
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 +++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 +
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 ++++++++++++++++++++++++
14 files changed, 1208 insertions(+), 2 deletions(-)
create mode 100644 lib/librte_tqs/Makefile
create mode 100644 lib/librte_tqs/meson.build
create mode 100644 lib/librte_tqs/rte_tqs.c
create mode 100644 lib/librte_tqs/rte_tqs.h
create mode 100644 lib/librte_tqs/rte_tqs_version.map
create mode 100644 test/test/test_tqs.c
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than reader/writer
locks for many applications. But hate to see DPDK reinventing every other library
and not reusing code. Userspace RCU https://liburcu.org/ is widely supported by
distro's, more throughly tested and documented, and more flexiple.
The issue with many of these reinventions is a tradeoff of DPDK growing
another dependency on external library versus using common code.
For RCU, the big issue for me is the testing and documentation of how to do RCU
safely. Many people get it wrong!
Some notes on liburcu (and my amateur understanding of LGPL, I'm not a license lawyer :)

Liburcu is LGPL, which AFAIK means we must dynamically link applications if the application code is BSD or other permissive licenses.

The side effect of this is that urcu function calls must be "real" function calls and inlining them is not possible. Therefore using liburcu in LGPL mode could have a performance impact in this case. I expect estimating the performance cost would be
difficult as its pretty much a case-by-case depending on what you're doing in the surrounding code.

Generally I'm in favour of using established libraries (particularly for complex functionality like RCU) but in this case I think there's a tradeoff with raw performance.
Honnappa Nagarahalli
2018-11-28 05:31:56 UTC
Permalink
Post by Van Haaren, Harry
Post by Stephen Hemminger
On Wed, 21 Nov 2018 21:30:52 -0600
Post by Honnappa Nagarahalli
Lock-less data structures provide scalability and determinism.
real-time applications).
<snip detailed RFC commit messag>
Post by Stephen Hemminger
Post by Honnappa Nagarahalli
test/tqs: Add API and functional tests
log: add TQS log type
tqs: add thread quiescent state library
config/common_base | 6 +
lib/Makefile | 2 +
lib/librte_eal/common/include/rte_log.h | 1 +
lib/librte_tqs/Makefile | 23 +
lib/librte_tqs/meson.build | 5 +
lib/librte_tqs/rte_tqs.c | 249 +++++++++++
lib/librte_tqs/rte_tqs.h | 352 +++++++++++++++
lib/librte_tqs/rte_tqs_version.map | 16 +
lib/meson.build | 2 +-
mk/rte.app.mk | 1 +
test/test/Makefile | 2 +
test/test/autotest_data.py | 6 +
test/test/meson.build | 5 +-
test/test/test_tqs.c | 540 ++++++++++++++++++++++++
14 files changed, 1208 insertions(+), 2 deletions(-) create mode
100644 lib/librte_tqs/Makefile create mode 100644
lib/librte_tqs/meson.build create mode 100644
lib/librte_tqs/rte_tqs.c create mode 100644
lib/librte_tqs/rte_tqs.h create mode 100644
lib/librte_tqs/rte_tqs_version.map
create mode 100644 test/test/test_tqs.c
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than
reader/writer locks for many applications. But hate to see DPDK
reinventing every other library and not reusing code. Userspace RCU
https://liburcu.org/ is widely supported by distro's, more throughly
tested and documented, and more flexiple.
The issue with many of these reinventions is a tradeoff of DPDK
growing another dependency on external library versus using common code.
Agree with the dependency issues. Sometimes flexibility also causes confusion and features that are not necessarily required for a targeted use case. I have seen that much of the functionality that can be left to the application is implemented as part of the library.
I think having it in DPDK will give us control over the amount of capability this library will have and freedom over changes we would like to make to such a library. I also view DPDK as one package where all things required for data plane development are available.
Post by Van Haaren, Harry
Post by Stephen Hemminger
For RCU, the big issue for me is the testing and documentation of how
to do RCU safely. Many people get it wrong!
Hopefully, we all will do a better job collectively :)
Post by Van Haaren, Harry
Some notes on liburcu (and my amateur understanding of LGPL, I'm not a license lawyer :)
Liburcu is LGPL, which AFAIK means we must dynamically link applications if
the application code is BSD or other permissive licenses.
The side effect of this is that urcu function calls must be "real" function calls
and inlining them is not possible. Therefore using liburcu in LGPL mode could
have a performance impact in this case. I expect estimating the performance
cost would be
difficult as its pretty much a case-by-case depending on what you're doing in
the surrounding code.
Generally I'm in favour of using established libraries (particularly for complex
functionality like RCU) but in this case I think there's a tradeoff with raw
performance.
Stephen Hemminger
2018-11-28 23:23:51 UTC
Permalink
On Wed, 28 Nov 2018 05:31:56 +0000
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than
reader/writer locks for many applications. But hate to see DPDK
reinventing every other library and not reusing code. Userspace RCU
https://liburcu.org/ is widely supported by distro's, more throughly
tested and documented, and more flexiple.
The issue with many of these reinventions is a tradeoff of DPDK
growing another dependency on external library versus using common code.
Agree with the dependency issues. Sometimes flexibility also causes confusion and features that are not necessarily required for a targeted use case. I have seen that much of the functionality that can be left to the application is implemented as part of the library.
I think having it in DPDK will give us control over the amount of capability this library will have and freedom over changes we would like to make to such a library. I also view DPDK as one package where all things required for data plane development are available.
Post by Stephen Hemminger
For RCU, the big issue for me is the testing and documentation of how
to do RCU safely. Many people get it wrong!
Hopefully, we all will do a better job collectively :)
Reinventing RCU is not helping anyone.


DPDK needs to fix its dependency model, and just admit that it is ok
to build off of more than glibc.

Having used liburcu, it can be done in a small manner and really isn't that
confusing.

Is your real issue the LGPL license of liburcu for your skittish customers?
Honnappa Nagarahalli
2018-11-30 02:13:32 UTC
Permalink
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than
reader/writer locks for many applications. But hate to see DPDK
reinventing every other library and not reusing code. Userspace
RCU https://liburcu.org/ is widely supported by distro's, more
throughly tested and documented, and more flexiple.
The issue with many of these reinventions is a tradeoff of DPDK
growing another dependency on external library versus using common
code.
Post by Honnappa Nagarahalli
Agree with the dependency issues. Sometimes flexibility also causes confusion
and features that are not necessarily required for a targeted use case. I have
seen that much of the functionality that can be left to the application is
implemented as part of the library.
Post by Honnappa Nagarahalli
I think having it in DPDK will give us control over the amount of capability this
library will have and freedom over changes we would like to make to such a
library. I also view DPDK as one package where all things required for data
plane development are available.
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
For RCU, the big issue for me is the testing and documentation of
how to do RCU safely. Many people get it wrong!
Hopefully, we all will do a better job collectively :)
Reinventing RCU is not helping anyone.
IMO, this depends on what the rte_tqs has to offer and what the requirements are. Before starting this patch, I looked at the liburcu APIs. I have to say, fairly quickly (no offense) I concluded that this does not address DPDK's needs. I took a deeper look at the APIs/code in the past day and I still concluded the same. My partial analysis (analysis of more APIs can be done, I do not have cycles at this point) is as follows:

The reader threads' information is maintained in a linked list[1]. This linked list is protected by a mutex lock[2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.

The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to get over.

'synchronize_rcu' also has grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim the memory in the worker threads as it will block other worker threads. This means, there is an extra thread required (on the control plane?) which does garbage collection and a method to push the pointers from worker threads to the garbage collection thread. This also means the time duration from delete to free increases putting pressure on amount of memory held up.
Since this API cannot be called concurrently by multiple writers, each writer has to wait for other writer's grace period to get over (i.e. multiple writer threads cannot overlap their grace periods).

This API also has to traverse the linked list which is not very well suited for calling on data plane.

I have not gone too much into rcu_thread_offline[5] API. This again needs to be used in worker cores and does not look to be very optimal.

I have glanced at rcu_quiescent_state [6], it wakes up the thread calling 'synchronize_rcu' which seems good amount of code for the data plane.

[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L85
[2] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L68
[3] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L344
[4] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L58
[5] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L211
[6] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L193

Coming to what is provided in rte_tqs:

The synchronize_rcu functionality is split in to 2 APIs: rte_tqs_start and rte_tqs_check. The reader data is maintained as an array.

Both the APIs are lock-free, allowing them to be called from multiple threads concurrently. This allows multiple writers to wait for their grace periods concurrently as well as overlap their grace periods. rte_tqs_start API returns a token which provides the ability to separate the quiescent state waiting of different writers. Hence, no writer waits for other writer's grace period to get over.
Since these 2 APIs are lock-free, they can be called from writers running on worker cores as well without the need for a separate thread to do garbage collection.

The separation into 2 APIs provides the ability for writers to not spin cycles waiting for the grace period to get over. This enables different ways of doing garbage collection. For ex: a data structure delete API could remove the entry from the data structure, call rte_tqs_start and return back to the caller. On the invocation of next API call of the library, the API can call rte_tqs_check (which will mostly indicate that the grace period is complete) and free the previously deleted entry.

rte_tqs_update (mapping to rcu_quiescent_state) is pretty small and simple.

rte_tqs_register and rte_tqs_unregister APIs are lock free. Hence additional APIs like rcu_thread_online and rcu_thread_offline are not required. The rte_tqs_unregister API (when compared to rcu_thread_offline) is much simple and conducive to be used in worker threads.
Post by Honnappa Nagarahalli
DPDK needs to fix its dependency model, and just admit that it is ok to build off
of more than glibc.
Having used liburcu, it can be done in a small manner and really isn't that
confusing.
Is your real issue the LGPL license of liburcu for your skittish customers?
I have not had any discussions on this. Customers are mainly focused on having a solution on which they have meaningful control. They want to be able to submit a patch and change things if required. For ex: barriers for Arm [7] are not optimal. How easy is it to change this and get it into distros (there are both internal and external factors here)?

[7] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/arch/arm.h#L44
Luca Boccassi
2018-11-30 16:26:25 UTC
Permalink
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than
reader/writer locks for many applications. But hate to see DPDK
reinventing every other library and not reusing code.
Userspace
RCU https://liburcu.org/ is widely supported by distro's, more
throughly tested and documented, and more flexiple.
The issue with many of these reinventions is a tradeoff of DPDK
growing another dependency on external library versus using common
code.
Post by Honnappa Nagarahalli
Agree with the dependency issues. Sometimes flexibility also causes confusion
and features that are not necessarily required for a targeted use case. I have
seen that much of the functionality that can be left to the
application is
implemented as part of the library.
Post by Honnappa Nagarahalli
I think having it in DPDK will give us control over the amount of capability this
library will have and freedom over changes we would like to make to such a
library. I also view DPDK as one package where all things required for data
plane development are available.
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
For RCU, the big issue for me is the testing and
documentation of
how to do RCU safely. Many people get it wrong!
Hopefully, we all will do a better job collectively :)
Reinventing RCU is not helping anyone.
IMO, this depends on what the rte_tqs has to offer and what the
requirements are. Before starting this patch, I looked at the liburcu
APIs. I have to say, fairly quickly (no offense) I concluded that
this does not address DPDK's needs. I took a deeper look at the
APIs/code in the past day and I still concluded the same. My partial
analysis (analysis of more APIs can be done, I do not have cycles at
The reader threads' information is maintained in a linked list[1].
This linked list is protected by a mutex lock[2]. Any
additions/deletions/traversals of this list are blocking and cannot
happen in parallel.
The API, 'synchronize_rcu' [3] (similar functionality to
rte_tqs_check call) is a blocking call. There is no option provided
to make it non-blocking. The writer spins cycles while waiting for
the grace period to get over.
'synchronize_rcu' also has grace period lock [4]. If I have multiple
writers running on data plane threads, I cannot call this API to
reclaim the memory in the worker threads as it will block other
worker threads. This means, there is an extra thread required (on the
control plane?) which does garbage collection and a method to push
the pointers from worker threads to the garbage collection thread.
This also means the time duration from delete to free increases
putting pressure on amount of memory held up.
Since this API cannot be called concurrently by multiple writers,
each writer has to wait for other writer's grace period to get over
(i.e. multiple writer threads cannot overlap their grace periods).
This API also has to traverse the linked list which is not very well
suited for calling on data plane.
I have not gone too much into rcu_thread_offline[5] API. This again
needs to be used in worker cores and does not look to be very
optimal.
I have glanced at rcu_quiescent_state [6], it wakes up the thread
calling 'synchronize_rcu' which seems good amount of code for the
data plane.
[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/st
atic/urcu-qsbr.h#L85
[2] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c
#L68
[3] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c
#L344
[4] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c
#L58
[5] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/st
atic/urcu-qsbr.h#L211
[6] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/st
atic/urcu-qsbr.h#L193
rte_tqs_start and rte_tqs_check. The reader data is maintained as an
array.
Both the APIs are lock-free, allowing them to be called from multiple
threads concurrently. This allows multiple writers to wait for their
grace periods concurrently as well as overlap their grace periods.
rte_tqs_start API returns a token which provides the ability to
separate the quiescent state waiting of different writers. Hence, no
writer waits for other writer's grace period to get over. 
Since these 2 APIs are lock-free, they can be called from writers
running on worker cores as well without the need for a separate
thread to do garbage collection.
The separation into 2 APIs provides the ability for writers to not
spin cycles waiting for the grace period to get over. This enables
different ways of doing garbage collection. For ex: a data structure
delete API could remove the entry from the data structure, call
rte_tqs_start and return back to the caller. On the invocation of
next API call of the library, the API can call rte_tqs_check (which
will mostly indicate that the grace period is complete) and free the
previously deleted entry.
rte_tqs_update (mapping to rcu_quiescent_state) is pretty small and simple.
rte_tqs_register and rte_tqs_unregister APIs are lock free. Hence
additional APIs like rcu_thread_online and rcu_thread_offline are not
required. The rte_tqs_unregister API (when compared to
rcu_thread_offline) is much simple and conducive to be used in worker
threads.
liburcu has many flavours already, qsbr being one of them. If none of
those are optimal for this use case, why not work with upstream to
either improve the existing flavours, or add a new one?

You have the specific knowledge and expertise about the requirements
needs for the implementation for this use case, and they have the long-
time and extensive experience maintaining the library on a wide range
of systems and use cases. Why not combine both?
I might be wrong, but to me, nothing described above seems to be
particularly or uniquely tied to implementing a software dataplane or
to DPDK. IMHO we should pool and share wherever possible, rather than
build an ecosystem closed onto itself.

Just my 2c of course!
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
DPDK needs to fix its dependency model, and just admit that it is ok to build off
of more than glibc.
Having used liburcu, it can be done in a small manner and really isn't that
confusing.
Is your real issue the LGPL license of liburcu for your skittish customers?
I have not had any discussions on this. Customers are mainly focused
on having a solution on which they have meaningful control. They want
barriers for Arm [7] are not optimal. How easy is it to change this
and get it into distros (there are both internal and external factors
here)?
It's just as easy (or as hard) as it is with DPDK. So it's either wait
for the distros to update, or rebuild locally.
--
Kind regards,
Luca Boccassi
Stephen Hemminger
2018-11-30 18:32:07 UTC
Permalink
On Fri, 30 Nov 2018 16:26:25 +0000
Post by Luca Boccassi
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than
reader/writer locks for many applications. But hate to see DPDK
reinventing every other library and not reusing code. Userspace
RCU https://liburcu.org/ is widely supported by distro's, more
throughly tested and documented, and more flexiple.
The issue with many of these reinventions is a tradeoff of DPDK
growing another dependency on external library versus using common
code.
Post by Honnappa Nagarahalli
Agree with the dependency issues. Sometimes flexibility also causes confusion
and features that are not necessarily required for a targeted use case. I have
seen that much of the functionality that can be left to the application is
implemented as part of the library.
Post by Honnappa Nagarahalli
I think having it in DPDK will give us control over the amount of
capability this
library will have and freedom over changes we would like to make to such a
library. I also view DPDK as one package where all things required for data
plane development are available.
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
For RCU, the big issue for me is the testing and
documentation of
how to do RCU safely. Many people get it wrong!
Hopefully, we all will do a better job collectively :)
Reinventing RCU is not helping anyone.
IMO, this depends on what the rte_tqs has to offer and what the
requirements are. Before starting this patch, I looked at the liburcu
APIs. I have to say, fairly quickly (no offense) I concluded that
this does not address DPDK's needs. I took a deeper look at the
APIs/code in the past day and I still concluded the same. My partial
analysis (analysis of more APIs can be done, I do not have cycles at
The reader threads' information is maintained in a linked list[1].
This linked list is protected by a mutex lock[2]. Any
additions/deletions/traversals of this list are blocking and cannot
happen in parallel.
The API, 'synchronize_rcu' [3] (similar functionality to
rte_tqs_check call) is a blocking call. There is no option provided
to make it non-blocking. The writer spins cycles while waiting for
the grace period to get over.
'synchronize_rcu' also has grace period lock [4]. If I have multiple
writers running on data plane threads, I cannot call this API to
reclaim the memory in the worker threads as it will block other
worker threads. This means, there is an extra thread required (on the
control plane?) which does garbage collection and a method to push
the pointers from worker threads to the garbage collection thread.
This also means the time duration from delete to free increases
putting pressure on amount of memory held up.
Since this API cannot be called concurrently by multiple writers,
each writer has to wait for other writer's grace period to get over
(i.e. multiple writer threads cannot overlap their grace periods).
This API also has to traverse the linked list which is not very well
suited for calling on data plane.
I have not gone too much into rcu_thread_offline[5] API. This again
needs to be used in worker cores and does not look to be very
optimal.
I have glanced at rcu_quiescent_state [6], it wakes up the thread
calling 'synchronize_rcu' which seems good amount of code for the
data plane.
[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/st
atic/urcu-qsbr.h#L85
[2] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c
#L68
[3] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c
#L344
[4] https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c
#L58
[5] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/st
atic/urcu-qsbr.h#L211
[6] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/st
atic/urcu-qsbr.h#L193
rte_tqs_start and rte_tqs_check. The reader data is maintained as an
array.
Both the APIs are lock-free, allowing them to be called from multiple
threads concurrently. This allows multiple writers to wait for their
grace periods concurrently as well as overlap their grace periods.
rte_tqs_start API returns a token which provides the ability to
separate the quiescent state waiting of different writers. Hence, no
writer waits for other writer's grace period to get over. 
Since these 2 APIs are lock-free, they can be called from writers
running on worker cores as well without the need for a separate
thread to do garbage collection.
The separation into 2 APIs provides the ability for writers to not
spin cycles waiting for the grace period to get over. This enables
different ways of doing garbage collection. For ex: a data structure
delete API could remove the entry from the data structure, call
rte_tqs_start and return back to the caller. On the invocation of
next API call of the library, the API can call rte_tqs_check (which
will mostly indicate that the grace period is complete) and free the
previously deleted entry.
rte_tqs_update (mapping to rcu_quiescent_state) is pretty small and simple.
rte_tqs_register and rte_tqs_unregister APIs are lock free. Hence
additional APIs like rcu_thread_online and rcu_thread_offline are not
required. The rte_tqs_unregister API (when compared to
rcu_thread_offline) is much simple and conducive to be used in worker
threads.
liburcu has many flavours already, qsbr being one of them. If none of
those are optimal for this use case, why not work with upstream to
either improve the existing flavours, or add a new one?
You have the specific knowledge and expertise about the requirements
needs for the implementation for this use case, and they have the long-
time and extensive experience maintaining the library on a wide range
of systems and use cases. Why not combine both?
I might be wrong, but to me, nothing described above seems to be
particularly or uniquely tied to implementing a software dataplane or
to DPDK. IMHO we should pool and share wherever possible, rather than
build an ecosystem closed onto itself.
Just my 2c of course!
Post by Honnappa Nagarahalli
Post by Honnappa Nagarahalli
DPDK needs to fix its dependency model, and just admit that it is ok to build off
of more than glibc.
Having used liburcu, it can be done in a small manner and really isn't that
confusing.
Is your real issue the LGPL license of liburcu for your skittish customers?
I have not had any discussions on this. Customers are mainly focused
on having a solution on which they have meaningful control. They want
barriers for Arm [7] are not optimal. How easy is it to change this
and get it into distros (there are both internal and external factors
here)?
It's just as easy (or as hard) as it is with DPDK. So it's either wait
for the distros to update, or rebuild locally.
Either way it would be useful to have the broader RCU community
in on the discussion.
Honnappa Nagarahalli
2018-11-30 20:20:34 UTC
Permalink
Hi Luca,
Appreciate your comments.
liburcu has many flavours already, qsbr being one of them. If none of those
are optimal for this use case, why not work with upstream to either improve
the existing flavours, or add a new one?
You have the specific knowledge and expertise about the requirements
needs for the implementation for this use case, and they have the long-
time and extensive experience maintaining the library on a wide range of
systems and use cases. Why not combine both?
I might be wrong, but to me, nothing described above seems to be
particularly or uniquely tied to implementing a software dataplane or to
DPDK.
This comment does not help much, I would prefer a more concrete comment. If possible, can you please mention where else these features are required? IMO, if these were required for other use cases, they would have been present in liburcu already.

IMHO we should pool and share wherever possible, rather than build
an ecosystem closed onto itself.
Just my 2c of course!
I would say this is true for some of the other libraries in DPDK as well [1] or in fact for the whole of DPDK. Linux Kernel has introduced XDP. I would say Linux Kernel has much vibrant history and community. If XDP lacks some of the features we want, why not work with Linux community to improve it and not do DPDK?

[1] https://github.com/urcu/userspace-rcu/blob/master/doc/uatomic-api.md

IMO, we should focus our discussion on how relevant rte_tqs is to DPDK, whether it solves the problems people face while using DPDK and the value add it brings. This should be the basis for the acceptance rather than what is available elsewhere.
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
DPDK needs to fix its dependency model, and just admit that it is ok
to build off of more than glibc.
Having used liburcu, it can be done in a small manner and really
isn't that confusing.
Is your real issue the LGPL license of liburcu for your skittish customers?
I have not had any discussions on this. Customers are mainly focused
on having a solution on which they have meaningful control. They want
barriers for Arm [7] are not optimal. How easy is it to change this
and get it into distros (there are both internal and external factors
here)?
It's just as easy (or as hard) as it is with DPDK. So it's either wait for the
distros to update, or rebuild locally.
I have not worked with that community, hence I cannot comment on that. Apologies for asking some hard questions, but, do you know how easy it is to change the license for liburcu? Can the DPDK governing board take a decision to change the license
Mattias Rönnblom
2018-11-30 20:56:30 UTC
Permalink
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Reinventing RCU is not helping anyone.
The reader threads' information is maintained in a linked list[1]. This linked list is protected by a mutex lock[2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to get over.
Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu()
which never? Why would the average application want to wait for the
grace period to be over anyway?
Post by Honnappa Nagarahalli
'synchronize_rcu' also has grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim the memory in the worker threads as it will block other worker threads. This means, there is an extra thread required (on the control plane?) which does garbage collection and a method to push the pointers from worker threads to the garbage collection thread. This also means the time duration from delete to free increases putting pressure on amount of memory held up.
Since this API cannot be called concurrently by multiple writers, each writer has to wait for other writer's grace period to get over (i.e. multiple writer threads cannot overlap their grace periods).
"Real" DPDK applications typically have to interact with the outside
world using interfaces beyond DPDK packet I/O, and this is best done via
an intermediate "control plane" thread running in the DPDK application.
Typically, this thread would also be the RCU writer and "garbage
collector", I would say.
Post by Honnappa Nagarahalli
This API also has to traverse the linked list which is not very well suited for calling on data plane.
I have not gone too much into rcu_thread_offline[5] API. This again needs to be used in worker cores and does not look to be very optimal.
I have glanced at rcu_quiescent_state [6], it wakes up the thread calling 'synchronize_rcu' which seems good amount of code for the data plane.
Wouldn't the typical DPDK lcore worker call rcu_quiescent_state() after
processing a burst of packets? If so, I would more lean toward
"negligible overhead", than "a good amount of code".

I must admit I didn't look at your library in detail, but I must still
ask: if TQS is basically RCU, why isn't it called RCU? And why isn't the
API calls named in a similar manner?
Stephen Hemminger
2018-11-30 23:44:17 UTC
Permalink
On Fri, 30 Nov 2018 21:56:30 +0100
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Reinventing RCU is not helping anyone.
The reader threads' information is maintained in a linked list[1]. This linked list is protected by a mutex lock[2]. Any additions/deletions/traversals of this list are blocking and cannot happen in parallel.
The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call) is a blocking call. There is no option provided to make it non-blocking. The writer spins cycles while waiting for the grace period to get over.
Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu()
which never? Why would the average application want to wait for the
grace period to be over anyway?
Post by Honnappa Nagarahalli
'synchronize_rcu' also has grace period lock [4]. If I have multiple writers running on data plane threads, I cannot call this API to reclaim the memory in the worker threads as it will block other worker threads. This means, there is an extra thread required (on the control plane?) which does garbage collection and a method to push the pointers from worker threads to the garbage collection thread. This also means the time duration from delete to free increases putting pressure on amount of memory held up.
Since this API cannot be called concurrently by multiple writers, each writer has to wait for other writer's grace period to get over (i.e. multiple writer threads cannot overlap their grace periods).
"Real" DPDK applications typically have to interact with the outside
world using interfaces beyond DPDK packet I/O, and this is best done via
an intermediate "control plane" thread running in the DPDK application.
Typically, this thread would also be the RCU writer and "garbage
collector", I would say.
Post by Honnappa Nagarahalli
This API also has to traverse the linked list which is not very well suited for calling on data plane.
I have not gone too much into rcu_thread_offline[5] API. This again needs to be used in worker cores and does not look to be very optimal.
I have glanced at rcu_quiescent_state [6], it wakes up the thread calling 'synchronize_rcu' which seems good amount of code for the data plane.
Wouldn't the typical DPDK lcore worker call rcu_quiescent_state() after
processing a burst of packets? If so, I would more lean toward
"negligible overhead", than "a good amount of code".
I must admit I didn't look at your library in detail, but I must still
ask: if TQS is basically RCU, why isn't it called RCU? And why isn't the
API calls named in a similar manner?
We used liburcu at Brocade with DPDK. It was just a case of putting rcu_quiescent_state in the packet handling
loop. There were a bunch more cases where control thread needed to register/unregister as part of RCU.
I think any library would have that issue with user supplied threads. You need a "worry about me" and
a "don't worry about me" API in the library.

There is also a tradeoff between call_rcu and defer_rcu about what context the RCU callback happens.
You really need a control thread to handle the RCU cleanup.

The point is that RCU steps into the application design, and liburcu seems to be flexible enough
and well documented enough to allow for more options.
Honnappa Nagarahalli
2018-12-01 18:37:02 UTC
Permalink
Post by Stephen Hemminger
On Fri, 30 Nov 2018 21:56:30 +0100
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
Post by Stephen Hemminger
Reinventing RCU is not helping anyone.
IMO, this depends on what the rte_tqs has to offer and what the
requirements are. Before starting this patch, I looked at the liburcu APIs. I
have to say, fairly quickly (no offense) I concluded that this does not address
DPDK's needs. I took a deeper look at the APIs/code in the past day and I still
concluded the same. My partial analysis (analysis of more APIs can be done, I
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
The reader threads' information is maintained in a linked list[1]. This
linked list is protected by a mutex lock[2]. Any additions/deletions/traversals
of this list are blocking and cannot happen in parallel.
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
The API, 'synchronize_rcu' [3] (similar functionality to rte_tqs_check call)
is a blocking call. There is no option provided to make it non-blocking. The
writer spins cycles while waiting for the grace period to get over.
Post by Mattias Rönnblom
Wouldn't the options be call_rcu, which rarely blocks, or defer_rcu()
which never?
call_rcu (I do not know about defer_rcu, have you looked at the implementation to verify your claim?) requires a separate thread that does garbage collection (this forces a programming model, the thread is even launched by the library). call_rcu() allows you to batch and defer the work to the garbage collector thread. In the garbage collector thread, when 'synchronize_rcu' is called, it spins for at least 1 grace period. Deferring and batching also have the side effect that memory is being held up for longer time.

Why would the average application want to wait for the
Post by Stephen Hemminger
Post by Mattias Rönnblom
grace period to be over anyway?
I assume when you say 'average application', you mean the writer(s) are on control plane.
It has been agreed (in the context of rte_hash) that writer(s) can be on data plane. In this case, 'synchronize_rcu' cannot be called from data plane. If call_rcu has to be called, it adds additional cycles to push the pointers (or any data) to the garbage collector thread to the data plane. I kindly suggest you to take a look for more details in liburcu code and the rte_tqs code.
Additionally, call_rcu function is more than 10 lines.
Post by Stephen Hemminger
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
'synchronize_rcu' also has grace period lock [4]. If I have multiple writers
running on data plane threads, I cannot call this API to reclaim the memory in
the worker threads as it will block other worker threads. This means, there is
an extra thread required (on the control plane?) which does garbage
collection and a method to push the pointers from worker threads to the
garbage collection thread. This also means the time duration from delete to
free increases putting pressure on amount of memory held up.
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
Since this API cannot be called concurrently by multiple writers, each
writer has to wait for other writer's grace period to get over (i.e. multiple
writer threads cannot overlap their grace periods).
Post by Mattias Rönnblom
"Real" DPDK applications typically have to interact with the outside
world using interfaces beyond DPDK packet I/O, and this is best done
via an intermediate "control plane" thread running in the DPDK application.
Typically, this thread would also be the RCU writer and "garbage
collector", I would say.
Agree, that is one way to do it and it comes with its own issues as I described above.
Post by Stephen Hemminger
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
This API also has to traverse the linked list which is not very well suited for
calling on data plane.
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
I have not gone too much into rcu_thread_offline[5] API. This again needs
to be used in worker cores and does not look to be very optimal.
Post by Mattias Rönnblom
Post by Honnappa Nagarahalli
I have glanced at rcu_quiescent_state [6], it wakes up the thread calling
'synchronize_rcu' which seems good amount of code for the data plane.
Post by Mattias Rönnblom
Wouldn't the typical DPDK lcore worker call rcu_quiescent_state()
after processing a burst of packets? If so, I would more lean toward
"negligible overhead", than "a good amount of code".
DPDK is being used in embedded and real time applications as well. There, processing a burst of packets is not possible due to low latency requirements. Hence it is not possible to amortize the cost.
Post by Stephen Hemminger
Post by Mattias Rönnblom
I must admit I didn't look at your library in detail, but I must still
ask: if TQS is basically RCU, why isn't it called RCU? And why isn't
the API calls named in a similar manner?
I kindly request you to take a look at the patch. More than that, if you have not done already, please take a look at the liburcu implementation as well.
TQS is not RCU (Read-Copy-Update). TQS helps implement RCU. TQS helps to understand when the threads have passed through the quiescent state.
I am also not sure why the name liburcu has RCU in it. It does not do any Read-Copy-Update.
Post by Stephen Hemminger
We used liburcu at Brocade with DPDK. It was just a case of putting
rcu_quiescent_state in the packet handling
loop. There were a bunch more cases where control thread needed to
register/unregister as part of RCU.
I assume that the packet handling loop was a polling loop (correct me if I am wrong). With the support of event dev, we have rte_event_dequeue_burst API which supports blocking till the packets are available (or blocking for an extended period of time). This means that, before calling this API, the thread needs to inform "don't worry about me". Once, this API returns, it needs to inform "worry about me". So, these two APIs need to be efficient. Please look at rte_tqs_register/unregister APIs.
Post by Stephen Hemminger
I think any library would have that issue with user supplied threads. You need
a "worry about me" and
a "don't worry about me" API in the library.
There is also a tradeoff between call_rcu and defer_rcu about what context
the RCU callback happens.
You really need a control thread to handle the RCU cleanup.
That is if you choose to use liburcu. rte_tqs provides the ability to do cleanup efficiently without the need for a control plane thread in DPDK use cases.
Post by Stephen Hemminger
The point is that RCU steps into the application design, and liburcu seems to
be flexible enough
and well documented enough to allow for more options.
Agree that RCU steps into application design. That is the reason rte_tqs just does enough and provides the flexibility to the application to implement the RCU however it feels like. DPDK has also stepped into application design by providing libraries like hash, LPM etc.

I do not understand why you think liburcu is flexible enough for DPDK use cases. I mentioned the specific use cases where liburcu is not useful. I did not find anything in the documentation to help me solve these use cases. Appreciate if you could help me understa
Honnappa Nagarahalli
2018-11-30 02:25:45 UTC
Permalink
Post by Van Haaren, Harry
Post by Stephen Hemminger
Mixed feelings about this one.
Love to see RCU used for more things since it is much better than
reader/writer locks for many applications. But hate to see DPDK
reinventing every other library and not reusing code. Userspace RCU
https://liburcu.org/ is widely supported by distro's, more throughly
tested and documented, and more flexiple.
The issue with many of these reinventions is a tradeoff of DPDK
growing another dependency on external library versus using common code.
For RCU, the big issue for me is the testing and documentation of how
to do RCU safely. Many people get it wrong!
Some notes on liburcu (and my amateur understanding of LGPL, I'm not a license lawyer :)
Liburcu is LGPL, which AFAIK means we must dynamically link applications if
the application code is BSD or other permissive licenses.
The side effect of this is that urcu function calls must be "real" function calls
and inlining them is not possible. Therefore using liburcu in LGPL mode could
have a performance impact in this case. I expect estimating the performance
cost would be
difficult as its pretty much a case-by-case depending on what you're doing in
the surrounding code.
Generally I'm in favour of using established libraries (particularly for complex
functionality like RCU) but in this case I think there's a tradeoff with raw
performance.
The licensing info [1] is very interesting. Again I am no lawyer :)

[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L184
Mattias Rönnblom
2018-11-30 21:03:37 UTC
Permalink
Post by Honnappa Nagarahalli
Post by Van Haaren, Harry
Generally I'm in favour of using established libraries (particularly for complex
functionality like RCU) but in this case I think there's a tradeoff with raw
performance.
The licensing info [1] is very interesting. Again I am no lawyer :)
[1] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h#L184
If you don't know the macro/inline function exception of LGPL 2.1, maybe
it's time to read the license text. Lawyer or not.
Loading...