Discussion:
[dpdk-dev] [RFC v1] doc compression API for DPDK
Verma, Shally
2017-10-31 11:39:22 UTC
Permalink
HI Fiona

This is an RFC document to brief our understanding of and requirements on the compression API proposal in DPDK. It is based on "[RFC] Compression API in DPDK http://dpdk.org/ml/archives/dev/2017-October/079377.html".
The intention of this document is to align on the concepts built into the compression API and its usage, and to identify further requirements.

Going further, it could be a base for the Compression Module Programmer Guide.

The current scope is limited to:
- definition of the terminology which forms the foundation of the compression API
- the typical API flow expected to be used by applications
 
Overview
~~~~~~~~
A. Notion of a session in compression API
================================== 
A Session is a per-device logical entity which is set up with chained xforms to be performed on burst operations, where each individual entry contains an operation type (decompress/compress) and its related parameters.
Typical Session parameters include:
- compress / decompress
- dev_id
- compression algorithm and other related parameters
- mempool - for use by the session for runtime requirements
- and any other associated private data maintained by the session

An application can set up multiple sessions on a device, as dictated by dev_info.nb_sessions or nb_session_per_qp.
 
B. Notion of burst operations in compression API
=======================================
struct rte_comp_op defines the compression/decompression operational parameters and makes up a single element of a burst. It is both an input and an output parameter:
the PMD gets source, destination and checksum information as input and updates the op with bytes consumed and produced as output.
Once enqueued for processing, an rte_comp_op *cannot be reused* until its status is set to RTE_COMP_OP_FAILURE or RTE_COMP_OP_STATUS_SUCCESS.
 
C. Session and rte_comp_op
=======================
Every operation in a burst is tied to a Session. More on this under the Stateless Vs Stateful section.
 
D. Stateless Vs Stateful
===================
The compression API provides the RTE_COMP_FF_STATEFUL feature flag for a PMD to reflect its support for stateful operation.
 
D.1 Compression API Stateless operation
------------------------------------------------------ 
A stateless operation means all enqueued packets are independent of each other, i.e. each packet has:
- its flush value set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression side),
- all of the required input and a sufficiently large output buffer, i.e. OUT_OF_SPACE can never occur (required during both compression and decompression)

In such a case, the PMD initiates stateless processing and releases acquired resources once processing of the current operation is complete, i.e. full input consumed and full output written.
An application can attach the same or a different session to each packet and can make consecutive enque_burst() calls, i.e. the following usage is valid:
 
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops1, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops3, nb_ops);
 
*Note - Every call has a different ops array, i.e. the same rte_comp_op array *cannot be reused* to queue the next batch of data until the previous ones are completely processed.

Also, if multiple threads call enqueue_burst() on the same queue pair, then it is the application's onus to use a proper locking mechanism to ensure serialized enqueuing of operations.

Please note that any time the output buffer runs out of space during a write, the operation turns "Stateful". See more on stateful operation under the respective section.

Typical API flow to set up a stateless operation:
1. rte_comp_session *sess = rte_comp_session_create(rte_mempool *pool);
2. rte_comp_session_init(int dev_id, rte_comp_session *sess, rte_comp_xform *xform, rte_mempool *sess_pool);
3. rte_comp_op_pool_create(rte_mempool ..)
4. rte_comp_op_bulk_alloc(struct rte_mempool *mempool, struct rte_comp_op **ops, uint16_t nb_ops);
5. for every rte_comp_op in ops[],
    5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
    5.2 set up with src/dst buffer
6. enq = rte_compdev_enqueue_burst(uint8_t dev_id, uint16_t qp_id, struct rte_comp_op **ops, uint16_t nb_ops);
7. dqu = rte_compdev_dequeue_burst(dev_id, qp_id, ops, enq);
8. repeat 7 while (dqu < enq) // wait till all enqueued ops are dequeued
9. repeat 5.2 for the next batch of data
10. rte_comp_session_clear() // only resets the private data memory area and *not* the xform and dev_id information, in case you want to reuse the session
11. rte_comp_session_free(rte_comp_session *session)
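The loop in steps 6-8 above can be exercised with a minimal, self-contained sketch. The enqueue/dequeue bodies below are toy software-queue simulations standing in for the proposed rte_compdev_enqueue_burst()/rte_compdev_dequeue_burst(); only the calling pattern (enqueue once, then dequeue until all ops come back) reflects the flow in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the proposed rte_comp_op: just a status field. */
enum op_status { OP_NOT_PROCESSED, OP_SUCCESS };

struct comp_op {
	enum op_status status;
	void *user;
};

#define NB_OPS 4
static struct comp_op *queue[NB_OPS];
static uint16_t q_head, q_tail;

/* Stub enqueue (step 6): accepts ops into a software queue. */
static uint16_t enqueue_burst(struct comp_op **ops, uint16_t nb)
{
	uint16_t i;
	for (i = 0; i < nb && q_tail < NB_OPS; i++)
		queue[q_tail++] = ops[i];
	return i;
}

/* Stub dequeue (step 7): "processes" queued ops and returns them. */
static uint16_t dequeue_burst(struct comp_op **ops, uint16_t nb)
{
	uint16_t i;
	for (i = 0; i < nb && q_head < q_tail; i++) {
		queue[q_head]->status = OP_SUCCESS;
		ops[i] = queue[q_head++];
	}
	return i;
}

uint16_t run_stateless_flow(void)
{
	struct comp_op ops[NB_OPS] = {0};
	struct comp_op *enq_ops[NB_OPS], *deq_ops[NB_OPS];
	uint16_t enq, dqu = 0;

	for (int i = 0; i < NB_OPS; i++)
		enq_ops[i] = &ops[i];	/* steps 4-5: allocate and set up */

	enq = enqueue_burst(enq_ops, NB_OPS);
	while (dqu < enq)		/* step 8: wait for all enqueued */
		dqu += dequeue_burst(deq_ops + dqu, (uint16_t)(enq - dqu));
	return dqu;
}
```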

D.1.2 Requirement for Stateless
-------------------------------------------
Since operations can complete out-of-order, there should be one (void *user) per rte_comp_op to enable the application to map a dequeued op back to the enqueued op.
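The intent of that (void *user) field can be sketched as follows. The struct and names here are hypothetical illustrations, not the proposed layout; the point is only that the PMD leaves the pointer untouched, so the application can recover its own record regardless of completion order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op shape with an application back-pointer. */
struct comp_op {
	void *user;	/* set by the app at enqueue, untouched by the PMD */
};

struct app_record { int id; };

/* Recover the application record a dequeued op belongs to, even though
 * ops may come back in a different order than they were enqueued. */
static struct app_record *record_of(struct comp_op *op)
{
	return (struct app_record *)op->user;
}

int match_out_of_order(void)
{
	struct app_record rec[3] = { {10}, {11}, {12} };
	struct comp_op ops[3];

	for (int i = 0; i < 3; i++)
		ops[i].user = &rec[i];	/* tag each op at enqueue time */

	/* simulate out-of-order completion: ops 2, 0, 1 */
	struct comp_op *deq[3] = { &ops[2], &ops[0], &ops[1] };

	return record_of(deq[0])->id;	/* first completion maps to rec[2] */
}
```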

D.2 Compression API Stateful operation
----------------------------------------------------------
A stateful operation means any of the following conditions:
- the API ran into an out_of_space situation during processing of input. Example: a stateless compressed stream fed fully to the decompressor, but the output buffer is not large enough to hold the output.
- the API is waiting for more input to produce output. Example: a stateless compressed stream fed partially to the decompressor.
- the API depends on a previous operation for further compression/decompression

If any of the above conditions hold, the PMD is required to maintain the context of operations across enque_burst() calls, until a packet with RTE_FLUSH_FULL/FINAL and sufficient input/output buffers is received and processed.
 
D.2.1 Compression API requirement for Stateful
---------------------------------------------------------------

D.2.1.1 Sliding Window Size
------------------------------------
The maximum length of the sliding window in bytes. Previous data lookup will be performed up to this length. To be added as an algorithm capability parameter and set by the PMD.
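Window sizes are commonly advertised as a log2 value (deflate's "window bits"), with deflate's classic window being 32 KB at 15 bits. The helper below is purely illustrative of that relationship and is not part of the proposed capability structure.

```c
#include <assert.h>
#include <stdint.h>

/* Byte length of the look-back window for a given log2 window size.
 * Illustrative helper, not a proposed API. */
static uint32_t window_bytes(uint8_t window_bits)
{
	return (uint32_t)1 << window_bits;
}
```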
 
D.2.1.2 Stateful operation state maintenance
 -------------------------------------------------------------
This section starts with a description of our understanding of compression API support for stateful operation. Based on these concepts, we will identify the data structures/parameters the PMD requires to maintain in-progress operation context.

For stateful compression, a batch of dependent packets starts at a packet having the RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having RTE_FULL_FLUSH/FINAL_FLUSH, i.e. the array of operations will carry a structure like this:

------------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------------
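The flush-value pattern above can be sketched as a small helper that finds where a stream ends. The enum names and values below are toy stand-ins for the RTE_* flush flags, not the real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy flush values mirroring the RFC's names (values are arbitrary). */
enum flush { NO_FLUSH, SYNC_FLUSH, FULL_FLUSH, FINAL_FLUSH };

/* Return the index one past the end of the first stream in f[0..n),
 * i.e. just after the first op carrying FULL_FLUSH or FINAL_FLUSH;
 * returns n if the stream is still open (more input expected). */
static size_t first_stream_end(const enum flush *f, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (f[i] == FULL_FLUSH || f[i] == FINAL_FLUSH)
			return i + 1;
	return n;
}
```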
 
For the sake of simplicity, we will use the term "stream" to identify such a related set of operations in the following description.

Stream processing imposes the following limitations on usage of the enque_burst() API:
- all dependent packets in a stream should carry the same session
- if a stream is broken into multiple enqueue_burst() calls, then the next enqueue_burst() cannot be called until the previous one has been fully processed, i.e.

               Consider, for example, a stream with ops1..ops7. This is *not* allowed:

                                       ----------------------------------------------------------------------------------
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                                       ----------------------------------------------------------------------------------
 
                                       ----------------------------------------------------------------
               enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final |)
                                        ----------------------------------------------------------------
 
              This *is* allowed
                                       ----------------------------------------------------------------------------------
               enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                                       ----------------------------------------------------------------------------------
 
                deque_burst(ops1 ..ops4)
 
                                       ----------------------------------------------------------------
               enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final |)
                                        ----------------------------------------------------------------

- A single enque_burst() can carry only one stream, i.e. this is *not* allowed:
                                      ---------------------------------------------------------------------------------------------------------
              enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
                                       ---------------------------------------------------------------------------------------------------------
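The one-stream-per-burst rule above can be expressed as a validity check: no op may follow the end of a stream within the same call. As before, the enum is a toy stand-in for the RTE_* flush flags.

```c
#include <assert.h>
#include <stddef.h>

/* Toy flush values mirroring the RFC's names (values are arbitrary). */
enum flush { NO_FLUSH, SYNC_FLUSH, FULL_FLUSH, FINAL_FLUSH };

/* A burst submitted via the proposed stream enqueue may contain at most
 * one stream: an op carrying FULL_FLUSH or FINAL_FLUSH must be the last
 * op of the burst. This mirrors the restriction in D.2.1.2. */
static int burst_is_single_stream(const enum flush *f, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++)
		if (f[i] == FULL_FLUSH || f[i] == FINAL_FLUSH)
			return 0;	/* ops follow a stream end: invalid */
	return 1;
}
```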

If a stream is broken into several enqueue_burst() calls, then the compression API needs to maintain operational state between calls. For this, the concept of rte_comp_stream is introduced into the compression API.
Here are the proposed changes to the existing design:

1. Add rte_comp_op_type
........................................
enum rte_comp_op_type {
	RTE_COMP_OP_STATELESS,
	RTE_COMP_OP_STATEFUL
};

2. Add new data type rte_comp_stream to maintain stream state
........................................................................................................
rte_comp_stream is a data structure opaque to the application, which is exchanged back and forth between the application and the PMD during stateful compression/decompression.
It should be allocated per stream AND before the beginning of the stateful operation. If a stream is broken into multiple enqueue_burst() calls, then each
respective enqueue_burst() must carry the same rte_comp_stream pointer. It is a mandatory input for stateful operations.
rte_comp_stream can be cleared and reused via the compression API rte_comp_stream_clear() and freed via rte_comp_stream_free(). Clear/free should not be called while the stream is in use.

This enables sharing of a session by multiple threads handling different streams, as each bulk of ops carries its own context. This can also be used by the PMD to handle an OUT_OF_SPACE situation.

3. Add stream allocate, clear and free API
...................................................................
3.1. rte_comp_op_stream_alloc(rte_mempool *pool, rte_comp_op_type type, rte_comp_stream **stream);
3.2. rte_comp_op_stream_clear(rte_comp_stream *stream); // the stream becomes usable again for a new stateful batch
3.3. rte_comp_op_stream_free(rte_comp_stream *stream); // to free the context
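The intended alloc/clear/free lifecycle can be modelled with a toy context object. The struct contents and heap allocation below are illustrative assumptions; the real rte_comp_stream would be opaque and would live in a PMD-managed mempool.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the proposed opaque stream context. */
struct comp_stream { size_t bytes_seen; };

/* Allocate a zeroed per-stream context (cf. rte_comp_op_stream_alloc). */
static int stream_alloc(struct comp_stream **s)
{
	*s = calloc(1, sizeof(**s));
	return *s ? 0 : -1;
}

/* Reset the context for a new stateful batch (cf. ..._stream_clear). */
static void stream_clear(struct comp_stream *s)
{
	memset(s, 0, sizeof(*s));
}

/* Release the context (cf. ..._stream_free). */
static void stream_free(struct comp_stream *s)
{
	free(s);
}

size_t stream_lifecycle_demo(void)
{
	struct comp_stream *s;
	size_t leftover;

	if (stream_alloc(&s) != 0)
		return (size_t)-1;
	s->bytes_seen = 100;	/* some in-progress state */
	stream_clear(s);	/* finished one stream, start another */
	leftover = s->bytes_seen;	/* cleared state reads back as 0 */
	stream_free(s);
	return leftover;
}
```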

4. Add new API rte_compdev_enqueue_stream()
...............................................................................
static inline uint16_t rte_compdev_enqueue_stream(uint8_t dev_id,
		uint16_t qp_id,
		struct rte_comp_op **ops,
		uint16_t nb_ops,
		rte_comp_stream *stream); // to be passed with each call

The application should call this API to process a dependent set of data OR when the output buffer size is unknown.

rte_comp_op_pool_create() should create a mempool large enough to accommodate the operational state (maintained by rte_comp_stream) based on rte_comp_op_type. Since rte_comp_stream would be maintained by the PMD, allocating it from a PMD-managed pool offers performance gains.

API flow: rte_comp_op_pool_create() -> rte_comp_op_bulk_alloc() -> rte_comp_op_stream_alloc() -> enque_stream(..ops, .., stream)

D.2.1.3 History buffer
-----------------------------
Will be maintained by the PMD within rte_comp_stream.
Verma, Shally
2017-11-20 05:11:03 UTC
Permalink
Ping. Awaiting feedback/comments.

Thanks
Shally
-----Original Message-----
Sent: 31 October 2017 17:09
Subject: [dpdk-dev] [RFC v1] doc compression API for DPDK
Trahe, Fiona
2017-11-27 18:54:30 UTC
Permalink
Hi Shally,
-----Original Message-----
Sent: Tuesday, October 31, 2017 11:39 AM
Subject: [RFC v1] doc compression API for DPDK
HI Fiona
This is an RFC document to brief our understanding and requirements on compression API proposal in
DPDK. It is based on "[RFC] Compression API in DPDK http://dpdk.org/ml/archives/dev/2017-
October/079377.html".
Intention of this document is to align on concepts built into compression API, its usage and identify further
requirements.
Going further it could be a base to Compression Module Programmer Guide.
Current scope is limited to
- definition of the terminology which makes up foundation of compression API
- typical API flow expected to use by applications
Overview
~~~~~~~~
A. Notion of a session in compression API
==================================
A Session is per device logical entity which is setup with chained-xforms to be performed on burst
operations where individual entry contains operation type (decompress/compress) and related parameter.
- compress / decompress
- dev_id
- compression algorithm and other related parameters
- mempool - for use by session for runtime requirement
- and any other associated private data maintained by session
Application can setup multiple sessions on a device as dictated by dev_info.nb_sessions or
nb_session_per_qp.
[Fiona] The session design is modelled on the cryptodev session design and so allows
to create a session which can be used on different driver types. E.g. a session could be set up and initialised
to run on a QuickAssist device and a Software device. This may be useful for stateless
requests, and enable load-balancing. For stateful flows the session should be set up for
only one specific driver-type as the state information will be stored in the private data specific to the driver-type
and not transferrable between driver-types.
So a session
- is not per-device
- has no dev_id
- has no mempool stored in it - the pool is created by the application, the lib can retrieve the pool from the object with rte_mempool_from_obj()
- does not have a limit number per device, just per qp, i.e. there is no dev_info.nb_sessions, just dev_info.max_nb_sessions_per_qp

Do you think any of this needs to be changed?
B. Notion of burst operations in compression API
 =======================================
struct rte_comp_op defines compression/decompression operational parameter and makes up one single
element of burst. This is both an input/output parameter.
PMD gets source, destination and checksum information at input and updated it with bytes consumed and
produced at output.
[Fiona] Agreed
Once enqueued for processing, rte_comp_op *cannot be reused* until its status is set to
RTE_COMP_OP_FAILURE or RTE_COMP_OP_STATUS_SUCCESS.
[Fiona] cannot be used until its status is set to any value other than RTE_COMP_OP_NOT_PROCESSED
C. Session and rte_comp_op
 =======================
Every operation in a burst is tied to a Session. More to cover on this under Stateless Vs Stateful section.
[Fiona] Agreed. I would add that each operation in a burst may be attached to a different session.
D. Stateless Vs Stateful
===================
Compression API provide RTE_COMP_FF_STATEFUL feature flag for PMD to reflect its support for Stateful
operation.
[Fiona] Agreed.
D.1 Compression API Stateless operation
------------------------------------------------------
A Stateless operation means all enqueued packets are independent of each other i.e. Each packet has
-              Their flush value is set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on compression
side),
-              All-of the required input and sufficient large buffer size to store output i.e. OUT_OF_SPACE can
never occur (required during both compression and decompression)
In such case, PMD initiates stateless processing and releases acquired resources after processing of current
operation is complete i.e. full input consumed and full output written.
Application can attach same or different session to each packet and can make consecutive enque_burst()
enqueued = rte_comp_enque_burst (dev_id, qp_id, ops1, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops3, nb_ops);
*Note – Every call has different ops array i.e.  same rte_comp_op array *cannot be reused* to queue next
batch of data until previous ones are completely processed.
Also if multiple threads calls enqueue_burst() on same queue pair then it’s application onus to use proper
locking mechanism to ensure serialized enqueuing of operations.
[Fiona] Agreed to above stateless description.
Please note any time output buffer ran out of space during write then operation will turn “Stateful”.  See
more on Stateful under respective section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE is returned and the application
must treat as a fail and resubmit the operation with a larger destination buffer.
1. rte_comp_session *sess = rte_comp_session_create(rte_mempool *pool);
2. rte_comp_session_init (int dev_id, rte_comp_session *sess, rte_comp_xform *xform, rte_mempool
*sess_pool);
3. rte_comp_op_pool_create(rte_mempool ..)
4. rte_comp_op_bulk_alloc (struct rte_mempool *mempool, struct rte_comp_op **ops, uint16_t
nb_ops);
5. for every rte_comp_op in ops[],
    5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
    5.2 set up with src/dst buffer
6. enq = rte_compdev_enqueue_burst(uint8_t dev_id, uint16_t qp_id, struct rte_comp_op **ops, uint16_t
nb_ops);
7. dqu = rte_compdev_dequeue_burst(dev_id, qp_id, ops, enq);
8. repeat 7 while (dqu < enq) // Wait till all of enqueued are dequeued
9. Repeat 5.2 for next batch of data
10. rte_comp_session_clear () // only reset private data memory area and *not* the xform and devid
information. In case, you want to re-use session.
11. rte_comp_session_free(ret_comp_sess *session)
[Fiona] ok. This is one possible flow. There are variations possible
- Above assumes all ops are using the same session, this is not necessarily the case. E.g. there
could be a compression session and a decompression session and a burst of ops may contain both.
In this case Step 9 would be Repeat 5.1 as well as 5.2
- Also it would not be necessary to wait until the full burst is dequeued before doing
another enqueue - though of course the ops would need to be managed so only
those finished with are reused, or multiple sets of ops could be allocated.
- What do you mean by Step 10 comment? The session only has private data. It's up to the PMD to
store whatever it needs from the xform. I think session_clear should mean all the data is zeroed in the session.
If the session is to be re-used then nothing needs to be cleared. Each op is already re-using the session in your
flow above without clearing between ops.
BUT this only applies to stateless - for stateful we may need a different behaviour - to clear state data but
keep algo, level, Huffman-type, checksum-type. Let's discuss under stateful - this may need a new API.
D.1.2 Requirement for Stateless
-------------------------------------------
Since operation can complete out-of-order. There should be one (void *user) per rte_comp_op to enable
application to map dequeued op to enqueued op.
[Fiona] In cryptodev there was an opaque_data field in the op - it was removed as the application can store any
private data it needs following the op, as it creates the op pool and dictates the size. Do you think we need an
explicit (void *user) or can we follow the same approach?

[Fiona] Out of time - I'll continue from here later in the week.
D.2 Compression API Stateful operation
----------------------------------------------------------
- API ran into out_of_space situation during processing of input. Example, stateless compressed stream
fed fully to decompressor but output buffer is not large enough to hold output.
- API waiting for more input to produce output. Example, stateless compressed stream fed partially to
decompressor.
- API is dependent on previous operation for further compression/decompression
In case of either one or all of the above conditions PMD is required to maintain context of operations
across enque_burst() calls, until a packet with  RTE_FLUSH_FULL/FINAL and sufficient input/output buffers
is received and processed.
D.2.1 Compression API requirement for Stateful
---------------------------------------------------------------
D.2.1.1 Sliding Window Size
------------------------------------
Maximum length of Sliding Window in bytes. Previous data lookup will be performed up to this length. To
be added as algorithm capability parameter and set by PMD.
D.2.1.2 Stateful operation state maintenance
-------------------------------------------------------------
This section starts with a description of our understanding of compression API support for stateful
operations. Building upon these concepts, we then identify the data structures/parameters required for
the PMD to maintain in-progress operation context.
For stateful compression, a batch of dependent packets starts at a packet having a
RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having RTE_FULL_FLUSH/FINAL_FLUSH.
------------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------------
For the sake of simplicity, we will use the term "stream" to identify such a related set of operations in the
following description.
Stream processing imposes the following limitations on usage of the enque_burst() API:
-              All dependent packets in a stream should carry the same session
-              If a stream is broken into multiple enqueue_burst() calls, then the next enqueue_burst() cannot be called
until the previous one has been fully processed. I.E.
               Consider for example, a stream with ops1 ..ops7, This is *not* allowed
                                       ----------------------------------------------------------------------------------
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                                       ----------------------------------------------------------------------------------
                                       ----------------------------------------------------------------
               enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final |)
                                        ----------------------------------------------------------------
              This *is* allowed
                                       ----------------------------------------------------------------------------------
               enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                                       ----------------------------------------------------------------------------------
                deque_burst(ops1 ..ops4)
                                       ----------------------------------------------------------------
               enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final |)
                                        ----------------------------------------------------------------
-              A single enque_burst() can carry only one stream. I.E. This is *not* allowed
                                      ---------------------------------------------------------------------------------------------------------
              enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
                                       ---------------------------------------------------------------------------------------------------------
If a stream is broken into several enqueue_burst() calls, then the compression API needs to maintain operational
state between calls. For this, the concept of rte_comp_stream is introduced into the compression API.
1. Add rte_comp_op_type
........................................
enum rte_comp_op_type {
	RTE_COMP_OP_STATELESS,
	RTE_COMP_OP_STATEFUL
};
2. Add new data type rte_comp_stream to maintain stream state
........................................................................................................
rte_comp_stream is a data structure opaque to the application which is exchanged back and forth between
the application and the PMD during stateful compression/decompression.
It should be allocated per stream AND before the beginning of the stateful operation. If a stream is broken
into multiple enqueue_burst() calls, then each respective enqueue_burst() must carry the same
rte_comp_stream pointer. It is a mandatory input for stateful operations.
rte_comp_stream can be cleared and reused via the compression API rte_comp_stream_clear() and freed via
rte_comp_stream_free(). Clear/free should not be called while it is in use.
This enables sharing of a session by multiple threads handling different streams, as each burst of ops carries
its own context. This can also be used by the PMD to handle the OUT_OF_SPACE situation.
3. Add stream allocate, clear and free API
...................................................................
3.1. rte_comp_op_stream_alloc(rte_mempool *pool, rte_comp_op_type type, rte_comp_stream
**stream);
3.2. rte_comp_op_stream_clear(rte_comp_stream *stream); // in this case stream will be useable for new
stateful batch
3.3. rte_comp_op_stream_free(rte_comp_stream *stream); // to free context
4. Add new API rte_compdev_enqueue_stream()
...............................................................................
static inline uint16_t rte_compdev_enqueue_stream(uint8_t dev_id,
uint16_t qp_id,
struct rte_comp_op **ops,
uint16_t nb_ops,
rte_comp_stream *stream); //to be passed with each call
The application should call this API to process a dependent set of data OR when the output buffer size is
unknown.
rte_comp_op_pool_create() should create a mempool large enough to accommodate the operational state
(maintained by rte_comp_stream) based on rte_comp_op_type. Since rte_comp_stream would be
maintained by the PMD, allocating it from a PMD-managed pool offers performance gains.
API flow: rte_comp_op_pool_create() -> rte_comp_op_bulk_alloc() -> rte_comp_op_stream_alloc() ->
enque_stream(..ops, .., stream)
D.2.1.3 History buffer
-----------------------------
Will be maintained by PMD
Verma, Shally
2017-11-30 11:13:00 UTC
Permalink
HI Fiona
-----Original Message-----
Sent: 28 November 2017 00:25
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Tuesday, October 31, 2017 11:39 AM
Narayana Prasad
Subject: [RFC v1] doc compression API for DPDK
HI Fiona
This is an RFC document to brief our understanding and requirements on
compression API proposal in
DPDK. It is based on "[RFC] Compression API in DPDK
http://dpdk.org/ml/archives/dev/2017-
October/079377.html".
Intention of this document is to align on concepts built into compression
API, its usage and identify further
requirements.
Going further it could be a base to Compression Module Programmer
Guide.
Current scope is limited to
- definition of the terminology which makes up foundation of compression
API
- typical API flow expected to use by applications
Overview
~~~~~~~~
A. Notion of a session in compression API
==================================
A Session is per device logical entity which is setup with chained-xforms to
be performed on burst
operations where individual entry contains operation type
(decompress/compress) and related parameter.
- compress / decompress
- dev_id
- compression algorithm and other related parameters
- mempool - for use by session for runtime requirement
- and any other associated private data maintained by session
Application can setup multiple sessions on a device as dictated by
dev_info.nb_sessions or
nb_session_per_qp.
[Fiona] The session design is modelled on the cryptodev session design and
so allows
to create a session which can be used on different driver types. E.g. a
session could be set up and initialised
to run on a QuickAssist device and a Software device. This may be useful for
stateless
requests, and enable load-balancing. For stateful flows the session should be
set up for
only one specific driver-type as the state information will be stored in the
private data specific to the driver-type
and not transferrable between driver-types.
So a session
- is not per-device
- has no dev_id
- has no mempool stored in it - the pool is created by the application, the lib
can retrieve the pool from the object with rte_mempool_from_obj()
- does not have a limit number per device, just per qp, i.e. there is no
dev_info.nb_sessions, just dev_info.max_nb_sessions_per_qp
Do you think any of this needs to be changed?
[Shally] Please help confirm following before I could answer this.

In cryptodev, a session holds a drivers array where each entry can be set up to perform the same/different operation in its private_data.
Mapping this to compression would mean a session:
- Will not retain any of the info mentioned above (xform, mempool, algos et al.). All such information is maintained as part of the associated device driver's private data.
- App can use the same session to set a compress xform and a decompress xform on devices, but if both devices map to the same driver_id then only either is effective (whichever is set first)?

Is this understanding correct?
B. Notion of burst operations in compression API
 =======================================
struct rte_comp_op defines compression/decompression operational
parameter and makes up one single
element of burst. This is both an input/output parameter.
PMD gets source, destination and checksum information at input and
updated it with bytes consumed and
produced at output.
[Fiona] Agreed
Once enqueued for processing, rte_comp_op *cannot be reused* until its
status is set to
RTE_COMP_OP_FAILURE or RTE_COMP_OP_STATUS_SUCCESS.
[Fiona] cannot be used until its status is set to any value other than
RTE_COMP_OP_NOT_PROCESSED
[Shally] How will the user know that status is NOT_PROCESSED after ops are enqueued?
I assume the only way to check enqueued ops' status is dequeue_burst(), and the PMD puts an op into the completion queue for dequeue *only when* it is completed with a Pass/Fail/Out_of_space condition, *not* while it is in progress (equivalent of RTE_COMP_OP_NOT_PROCESSED).

Am I missing anything here?
C. Session and rte_comp_op
 =======================
Every operation in a burst is tied to a Session. More to cover on this under
Stateless Vs Stateful section.
[Fiona] Agreed. I would add that each operation in a burst may be attached
to a different session.
D. Stateless Vs Stateful
===================
Compression API provide RTE_COMP_FF_STATEFUL feature flag for PMD
to reflect its support for Stateful
operation.
[Fiona] Agreed.
D.1 Compression API Stateless operation
------------------------------------------------------
A Stateless operation means all enqueued packets are independent of each other, i.e. each packet has:
-              its flush value set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL (required only on the compression
side),
-              all of the required input and a sufficiently large buffer size to store the
output, i.e. OUT_OF_SPACE can
never occur (required during both compression and decompression).
In such a case, the PMD initiates stateless processing and releases acquired
resources after processing of the current
operation is complete, i.e. full input consumed and full output written.
The application can attach the same or a different session to each packet and can
make consecutive enque_burst() calls:
enqueued = rte_comp_enque_burst (dev_id, qp_id, ops1, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops3, nb_ops);
*Note: every call has a different ops array, i.e. the same rte_comp_op array
*cannot be reused* to queue the next
batch of data until the previous ones are completely processed.
Also, if multiple threads call enqueue_burst() on the same queue pair, then it is the
application's onus to use a proper
locking mechanism to ensure serialized enqueuing of operations.
[Fiona] Agreed to above stateless description.
Please note: any time the output buffer runs out of space during a write, the
operation turns "Stateful". See
more on Stateful under the respective section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE is
returned and the application
must treat as a fail and resubmit the operation with a larger destination
buffer.
[Shally] Then I propose to add a feature flag "FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible PMD design,
as there are devices which treat it as an error on compression but not on decompression.
If it is not supported, then it should be treated as a failure condition and the app can resubmit the operation.
If supported, the behaviour is *To-be-Defined* under stateful.
1. rte_comp_session *sess = rte_comp_session_create(rte_mempool
*pool);
2. rte_comp_session_init (int dev_id, rte_comp_session *sess,
rte_comp_xform *xform, rte_mempool
*sess_pool);
3. rte_comp_op_pool_create(rte_mempool ..)
4. rte_comp_op_bulk_alloc (struct rte_mempool *mempool, struct
rte_comp_op **ops, uint16_t
nb_ops);
5. for every rte_comp_op in ops[],
    5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session
*sess);
    5.2 set up with src/dst buffer
6. enq = rte_compdev_enqueue_burst(uint8_t dev_id, uint16_t qp_id,
struct rte_comp_op **ops, uint16_t
nb_ops);
7. dqu = rte_compdev_dequeue_burst(dev_id, qp_id, ops, enq);
8. repeat 7 while (dqu < enq) // Wait till all of enqueued are dequeued
9. Repeat 5.2 for next batch of data
10. rte_comp_session_clear () // only reset private data memory area and
*not* the xform and devid
information. In case, you want to re-use session.
11. rte_comp_session_free(ret_comp_sess *session)
[Fiona] ok. This is one possible flow. There are variations possible
- Above assumes all ops are using the same session, this is not necessarily
the case. E.g. there
could be a compression session and a decompression session and a burst of
ops may contain both.
[Shally] Agree but assume applicable only for stateless until we cover stateful
In this case Step 9 would be Repeat 5.1 as well as 5.2
- Also it would not be necessary to wait until the full burst is dequeued
before doing
another enqueue - though of course the ops would need to be managed so
only
those finished with are reused, or multiple sets of ops could be allocated.
[Shally] Agree
- What do you mean by Step 10 comment? The session only has private data.
It's up to the PMD to
store whatever it needs from the xform. I think session_clear should mean
all the data is zeroed in the session.
If the session is to be re-used then nothing needs to be cleared. Each op is
already re-using the session in your
flow above without clearing between ops.
BUT this only applies to stateless
[Shally] This came from my previous notion of a session. Now that I see its analogy to cryptodev, its purpose is clear to me.
But then I propose renaming the API to rte_compdev_sess_term() to make it self-explanatory.

- for stateful we may need a different
behaviour - to clear state data but
keep algo, level, Huffman-type, checksum-type. Let's discuss under stateful
- this may need a new API.
[Shally] Or PMD can internally reset its state once it process an op with FULL_FLUSH/FINISH. Will revisit it under stateful.
D.1.2 Requirement for Stateless
-------------------------------------------
Since operation can complete out-of-order. There should be one (void
*user) per rte_comp_op to enable
application to map dequeued op to enqueued op.
[Fiona] In cryptodev there was an opaque_data field in the op - it was
removed as the application can store any
private data it needs following the op, as it creates the op pool and dictates
the size. Do you think we need an
explicit (void *user) or can we follow the same approach?
[Shally] if priv_data in crypto_op_pool_create() is user data, then we don't need explicit user *.
But then I propose variable name should be renamed to user_data indicating it is an app data opaque to PMD.
And, __rte_comp_op_get_priv_data_size () should be changed to __rte_comp_op_get_user_data_size().
[Fiona] Out of time - I'll continue from here later in the week.
[Shally] Sure. Look forward to that.

Thanks
Shally
D.2 Compression API Stateful operation
----------------------------------------------------------
- API ran into out_of_space situation during processing of input. Example,
stateless compressed stream
fed fully to decompressor but output buffer is not large enough to hold
output.
- API waiting for more input to produce output. Example, stateless
compressed stream fed partially to
decompressor.
- API is dependent on previous operation for further
compression/decompression
In case of either one or all of the above conditions PMD is required to
maintain context of operations
across enque_burst() calls, until a packet with  RTE_FLUSH_FULL/FINAL and
sufficient input/output buffers
is received and processed.
D.2.1 Compression API requirement for Stateful
---------------------------------------------------------------
D.2.1.1 Sliding Window Size
------------------------------------
Maximum length of Sliding Window in bytes. Previous data lookup will be
performed up to this length. To
be added as algorithm capability parameter and set by PMD.
D.2.1.2 Stateful operation state maintenance
 -------------------------------------------------------------
This section starts with description of our understanding about
compression API support for stateful.
Depending upon understanding build upon these concepts, we will identify
required data structure/param
to maintain in-progress operation context by PMD.
For stateful compression, batch of dependent packets starts at a packet
having
RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and end at packet having
RTE_FULL_FLUSH/FINAL_FLUSH.
------------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------------
For sake of simplicity, we will use term "stream" to identify such related set
of operation in following
description.
Stream processing impose following limitations on usage of enque_burst()
API
-              All dependent packets in a stream should carry same session
-              if stream is broken into multiple enqueue_burst() call, then next
enqueue_burst() cannot be called
until previous one has fully processed. I.E.
               Consider for example, a stream with ops1 ..ops7. This is *not* allowed:
                                       ----------------------------------------------------------------------------------
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                                       ----------------------------------------------------------------------------------
                                       ----------------------------------------------------------------
               enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final |)
                                        ----------------------------------------------------------------
              This *is* allowed:
                                       ----------------------------------------------------------------------------------
               enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                                       ----------------------------------------------------------------------------------
                deque_burst(ops1 ..ops4)
                                       ----------------------------------------------------------------
               enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final |)
                                        ----------------------------------------------------------------
-              A single enque_burst() can carry only one stream, i.e. this is *not* allowed:
                                      ---------------------------------------------------------------------------------------------------------
              enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
                                       ---------------------------------------------------------------------------------------------------------
If a stream is broken in to several enqueue_burst() calls, then compress
API need to maintain operational
state between calls. For this, concept of rte_comp_stream is enabled in to
compression API.
1. Add rte_comp_op_type
........................................
enum rte_comp_op_type {
RTE_COMP_OP_STATELESS,
RTE_COMP_OP_STATEFUL
}
2. Add new data type rte_comp_stream to maintain stream state
........................................................................................................
rte_comp_stream is an opaque data structure to application which is
exchanged back and forth between
application and PMD during stateful compression/decompression.
It should be allocated per stream AND before beginning of stateful
operation. If stream is broken into
multiple enqueue_burst() then each
respective enqueue_burst() must carry same rte_comp_stream pointer. It
is mandatory input for stateful
operations.
rte_comp_stream can be cleared and reused via compression API
rte_comp_stream_clear() and free via
rte_comp_stream_free(). Clear/free should not be called when it is in use.
This enables sharing of a session by multiple threads handling different
streams as each bulk ops carry its
own context. This can also be used by PMD to handle OUT_OF_SPACE
situation.
3. Add stream allocate, clear and free API
...................................................................
3.1. rte_comp_op_stream_alloc(rte_mempool *pool, rte_comp_op_type
type, rte_comp_stream
**stream);
3.2. rte_comp_op_stream_clear(rte_comp_stream *stream); // in this
case stream will be useable for new
stateful batch
3.3. rte_comp_op_stream_free(rte_comp_stream *stream); // to free
context
4. Add new API rte_compdev_enqueue_stream()
...............................................................................
static inline uint16_t rte_compdev_enqueue_stream(uint8_t dev_id,
uint16_t qp_id,
struct rte_comp_op **ops,
uint16_t nb_ops,
rte_comp_stream *stream); //to be passed with
each call
Application should call this API to process dependent set of data OR when
output buffer size is unknown.
rte_comp_op_pool_create() should create mempool large enough to
accommodate operational state
(maintained by rte_comp_stream) based on rte_comp_op_type. Since
rte_comp_stream would be
maintained by PMD, thus allocating it from PMD managed pool offers
performance gains.
API flow: rte_comp_op_pool_create() -> rte_comp_op_bulk_alloc() ->
rte_comp_op_stream_alloc() ->
enque_stream(..ops, .., stream)
D.2.1.3 History buffer
-----------------------------
Will be maintained by PMD
Trahe, Fiona
2017-12-01 19:12:10 UTC
Permalink
-----Original Message-----
Sent: Thursday, November 30, 2017 11:13 AM
Subject: Re: [RFC v1] doc compression API for DPDK
HI Fiona
-----Original Message-----
Sent: 28 November 2017 00:25
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Tuesday, October 31, 2017 11:39 AM
Narayana Prasad
Subject: [RFC v1] doc compression API for DPDK
HI Fiona
This is an RFC document to brief our understanding and requirements on
compression API proposal in
DPDK. It is based on "[RFC] Compression API in DPDK
http://dpdk.org/ml/archives/dev/2017-
October/079377.html".
Intention of this document is to align on concepts built into compression
API, its usage and identify further
requirements.
Going further it could be a base to Compression Module Programmer
Guide.
Current scope is limited to
- definition of the terminology which makes up foundation of compression
API
- typical API flow expected to use by applications
Overview
~~~~~~~~
A. Notion of a session in compression API
==================================
A Session is per device logical entity which is setup with chained-xforms to
be performed on burst
operations where individual entry contains operation type
(decompress/compress) and related parameter.
- compress / decompress
- dev_id
- compression algorithm and other related parameters
- mempool - for use by session for runtime requirement
- and any other associated private data maintained by session
Application can setup multiple sessions on a device as dictated by
dev_info.nb_sessions or
nb_session_per_qp.
[Fiona] The session design is modelled on the cryptodev session design and
so allows
to create a session which can be used on different driver types. E.g. a
session could be set up and initialised
to run on a QuickAssist device and a Software device. This may be useful for
stateless
requests, and enable load-balancing. For stateful flows the session should be
set up for
only one specific driver-type as the state information will be stored in the
private data specific to the driver-type
and not transferrable between driver-types.
So a session
- is not per-device
- has no dev_id
- has no mempool stored in it - the pool is created by the application, the lib
can retrieve the pool from the object with rte_mempool_from_obj()
- does not have a limit number per device, just per qp, i.e. there is no
dev_info.nb_sessions, just dev_info.max_nb_sessions_per_qp
Do you think any of this needs to be changed?
[Shally] Please help confirm following before I could answer this.
In cryptodev, session holds an drivers array initialized on it where each can be setup to perform
same/different operation in its private_data.
- Will not retain any of the info as mentioned above (xform, mempool, algos et el). All such information is
maintained as part of associated device driver private data.
[Fiona] exactly
- App can use same session to set compress xform and decompress xform on devices but if both devices
maps to same driver_id then only either is effective (whichever is set first)?
[Fiona] No, the intention is that the session is initialised for all drivers, and so for all devices, using the same xform. So it should only be initialised to either compress or decompress.
The intent being that an application can prepare an operation (stateless) independently of the driver & device it's targeting, and then choose where to send it, possibly based on which device is not busy.
Is this understanding correct?
B. Notion of burst operations in compression API
 =======================================
struct rte_comp_op defines compression/decompression operational
parameter and makes up one single
element of burst. This is both an input/output parameter.
PMD gets source, destination and checksum information at input and
updated it with bytes consumed and
produced at output.
[Fiona] Agreed
Once enqueued for processing, rte_comp_op *cannot be reused* until its
status is set to
RTE_COMP_OP_FAILURE or RTE_COMP_OP_STATUS_SUCCESS.
[Fiona] cannot be used until its status is set to any value other than
RTE_COMP_OP_NOT_PROCESSED
[Shally] How user will know that status is NOT_PROCESSED after ops are enqueued?
I assume only way to check enqueued ops status is dequeue_burst() and PMD put an op into completion
queue for dequeue *only when* it is completed with Pass/Fail/Out_of_space condition *not* when it's in
progress (equivalent of RTE_COMP_OP_NOT_PROCESSED).
Am I missing anything here?
[Fiona] Correct. PMD should only return an op in the dequeue once it's processed.
C. Session and rte_comp_op
 =======================
Every operation in a burst is tied to a Session. More to cover on this under
Stateless Vs Stateful section.
[Fiona] Agreed. I would add that each operation in a burst may be attached
to a different session.
D. Stateless Vs Stateful
===================
Compression API provide RTE_COMP_FF_STATEFUL feature flag for PMD
to reflect its support for Stateful
operation.
[Fiona] Agreed.
D.1 Compression API Stateless operation
------------------------------------------------------
A Stateless operation means all enqueued packets are independent of
each other i.e. Each packet has
-              Their flush value is set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL
(required only on compression
side),
-              All-of the required input and sufficient large buffer size to store
output i.e. OUT_OF_SPACE can
never occur (required during both compression and decompression)
In such case, PMD initiates stateless processing and releases acquired
resources after processing of current
operation is complete i.e. full input consumed and full output written.
Application can attach same or different session to each packet and can
make consecutive enque_burst()
enqueued = rte_comp_enque_burst (dev_id, qp_id, ops1, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops3, nb_ops);
*Note – Every call has different ops array i.e.  same rte_comp_op array
*cannot be reused* to queue next
batch of data until previous ones are completely processed.
Also if multiple threads calls enqueue_burst() on same queue pair then it’s
application onus to use proper
locking mechanism to ensure serialized enqueuing of operations.
[Fiona] Agreed to above stateless description.
Please note any time output buffer ran out of space during write then
operation will turn “Stateful”.  See
more on Stateful under respective section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE is
returned and the application
must treat as a fail and resubmit the operation with a larger destination
buffer.
[Shally] Then I propose to add a feature flag "FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible
PMD design.
As there're devices which treat it as error on compression but not on decompression.
If it is not supported, then it should be treated as failure condition and app can resubmit operation.
if supported, behaviour *To-be-Defined* under stateful.
[Fiona] Can you explain 'turn stateful' some more?
If compressor runs out of space during stateless operation, either comp or decomp, and turns stateful, how would the app know? And what would be in status, consumed and produced?
Could it return OUT_OF_SPACE, and if both consumed and produced == 0 then the whole op must be resubmitted with a bigger output buffer. But if consumed and produced > 0 then app could take the output and submit next op
continuing from consumed+1.
1. rte_comp_session *sess = rte_comp_session_create(rte_mempool
*pool);
2. rte_comp_session_init (int dev_id, rte_comp_session *sess,
rte_comp_xform *xform, rte_mempool
*sess_pool);
3. rte_comp_op_pool_create(rte_mempool ..)
4. rte_comp_op_bulk_alloc (struct rte_mempool *mempool, struct
rte_comp_op **ops, uint16_t
nb_ops);
5. for every rte_comp_op in ops[],
    5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session
*sess);
    5.2 set up with src/dst buffer
6. enq = rte_compdev_enqueue_burst(uint8_t dev_id, uint16_t qp_id,
struct rte_comp_op **ops, uint16_t
nb_ops);
7. dqu = rte_compdev_dequeue_burst(dev_id, qp_id, ops, enq);
8. repeat 7 while (dqu < enq) // Wait till all of enqueued are dequeued
9. Repeat 5.2 for next batch of data
10. rte_comp_session_clear () // only reset private data memory area and
*not* the xform and devid
information. In case, you want to re-use session.
11. rte_comp_session_free(ret_comp_sess *session)
[Fiona] ok. This is one possible flow. There are variations possible
- Above assumes all ops are using the same session, this is not necessarily
the case. E.g. there
could be a compression session and a decompression session and a burst of
ops may contain both.
[Shally] Agree but assume applicable only for stateless until we cover stateful
In this case Step 9 would be Repeat 5.1 as well as 5.2
- Also it would not be necessary to wait until the full burst is dequeued
before doing
another enqueue - though of course the ops would need to be managed so
only
those finished with are reused, or multiple sets of ops could be allocated.
[Shally] Agree
- What do you mean by Step 10 comment? The session only has private data.
It's up to the PMD to
store whatever it needs from the xform. I think session_clear should mean
all the data is zeroed in the session.
If the session is to be re-used then nothing needs to be cleared. Each op is
already re-using the session in your
flow above without clearing between ops.
BUT this only applies to stateless
[Shally] This came from my previous notion of session. Now when I see it analogy to cryptodev, its purpose
is clear to me.
But then I propose to rename API to rte_compdev_sess_term() to make it self-explanatory.
[Fiona] So I think term isn't very clear, I guess it's short for terminate?
If so I can change to rte_compdev_sess_terminate()
However I think we may not need this at all if we go with your stream suggestion below.
As there may be only immutable data in the session - like algo, level, checksum type, etc.
In cryptodev the session data is also immutable, but as it holds a key it's important to clear
it before allowing the session to be re-used. I'm not sure we'll have any data left in
session which needs to be cleared, all the state data may be in the stream.
- for stateful we may need a different behaviour - to clear state data but
keep algo, level, Huffman-type, checksum-type. Let's discuss under stateful
- this may need a new API.
[Shally] Or the PMD can internally reset its state once it processes an op
with FULL_FLUSH/FINISH. Will revisit it under stateful.
[Fiona] Yes, that could also be an option for stream.
D.1.2 Requirement for Stateless
-------------------------------------------
Since operations can complete out-of-order, there should be one (void
*user) per rte_comp_op to enable the application to map a dequeued op to
an enqueued op.
[Fiona] In cryptodev there was an opaque_data field in the op - it was
removed as the application can store any
private data it needs following the op, as it creates the op pool and dictates
the size. Do you think we need an
explicit (void *user) or can we follow the same approach?
[Shally] if priv_data in crypto_op_pool_create() is user data, then we don't need explicit user *.
But then I propose variable name should be renamed to user_data indicating it is an app data opaque to
PMD.
And, __rte_comp_op_get_priv_data_size () should be changed to __rte_comp_op_get_user_data_size().
[Fiona] Agreed.
[Fiona] Out of time - I'll continue from here later in the week.
[Shally] Sure. Look forward to that.
Thanks
Shally
D.2 Compression API Stateful operation
----------------------------------------------------------
- API ran into an out_of_space situation during processing of input. Example:
a stateless compressed stream fed fully to the decompressor, but the output
buffer is not large enough to hold the output.
- API is waiting for more input to produce output. Example: a stateless
compressed stream fed partially to the decompressor.
[Fiona] I agree this is stateful if more input is to follow. But don’t understand the
stateless ref. here. It's not relevant whether compression was done statefully
or statelessly, if only part of the data is fed to the decompressor then it should
be fed to a stateful session.
- API is dependent on a previous operation for further
compression/decompression
In case of any one or all of the above conditions, the PMD is required to
maintain the context of operations across enque_burst() calls, until a packet
with RTE_FLUSH_FULL/FINAL and sufficiently large input/output buffers is
received and processed.
D.2.1 Compression API requirement for Stateful
---------------------------------------------------------------
D.2.1.1 Sliding Window Size
------------------------------------
Maximum length of the sliding window in bytes; previous-data lookup will be
performed up to this length. To be added as an algorithm capability
parameter, set by the PMD.
[Fiona] Agreed. 32k default.
To be set on session xform for both compression and decompression?
Is there a subset of specific values you'd like supported on the API?
(rather than all to minimise test cases)
D.2.1.2 Stateful operation state maintenance
 -------------------------------------------------------------
This section starts with a description of our understanding of compression
API support for stateful operations. Building upon these concepts, we then
identify the data structures/parameters the PMD requires to maintain
in-progress operation context.
For stateful compression, a batch of dependent packets starts at a packet
having a RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having
RTE_FULL_FLUSH/FINAL_FLUSH.
------------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------------
[Fiona] I think it needs to be more constrained than your examples below.
Only 1 operation from a stream can be in a burst: each operation in a
stateful stream must complete before the next can be processed, as the next
operation needs the state and history of the previous one.
And if one failed, e.g. due to OUT_OF_SPACE, this should affect
the following operation in the same stream.
Worst case this means bursts of 1. Burst can be >1 if there are multiple
independent streams with available data for processing. Or if there is
data available which can be statelessly processed.

If there are multiple buffers available from a stream, then instead they can
be linked together in an mbuf chain sent in a single operation.

To handle the sequences below would mean the PMD
would need to store ops sending one at a time to be processed.

As this is significantly different from what you describe below, I'll wait for further feedback
before continuing.

But in principle I agree with the idea of a stream structure to hold the state data
needed for a stateful flow. Originally I'd assumed this data would be held in the session
but I think it's better to not have a 1:1 mapping between stream and session as this
would limit a session unnecessarily. So there can be many streams in a single session.
For the sake of simplicity, we will use the term "stream" to identify such a
related set of operations in the following description.
Stream processing imposes the following limitations on usage of the
enque_burst() API:
-              All dependent packets in a stream should carry the same session
-              If a stream is broken into multiple enqueue_burst() calls, then
the next enqueue_burst() cannot be called until the previous one has been
fully processed. I.e.
               Consider for example a stream with ops1..ops7. This is *not* allowed:
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final|)
               This *is* allowed:
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                deque_burst(ops1..ops4)
                enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final|)
-              A single enque_burst() can carry only one stream. I.e. this is *not* allowed:
               enque_burst(|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush|)
If a stream is broken into several enqueue_burst() calls, then the
compression API needs to maintain operational state between calls. For this,
the concept of rte_comp_stream is introduced into the compression API.
1. Add rte_comp_op_type
........................................
enum rte_comp_op_type {
    RTE_COMP_OP_STATELESS,
    RTE_COMP_OP_STATEFUL
};
2. Add new data type rte_comp_stream to maintain stream state
........................................................................................................
rte_comp_stream is a data structure opaque to the application, which is
exchanged back and forth between the application and the PMD during stateful
compression/decompression.
It should be allocated per stream AND before the beginning of the stateful
operation. If a stream is broken into multiple enqueue_burst() calls, then
each respective enqueue_burst() must carry the same rte_comp_stream pointer.
It is a mandatory input for stateful operations.
rte_comp_stream can be cleared and reused via the compression API
rte_comp_stream_clear() and freed via rte_comp_stream_free(). Clear/free
should not be called while it is in use.
This enables sharing of a session by multiple threads handling different
streams, as each bulk of ops carries its own context. This can also be used
by the PMD to handle the OUT_OF_SPACE situation.
3. Add stream allocate, clear and free API
...................................................................
3.1. rte_comp_op_stream_alloc(rte_mempool *pool, rte_comp_op_type type,
                              rte_comp_stream **stream);
3.2. rte_comp_op_stream_clear(rte_comp_stream *stream); // stream becomes
     usable for a new stateful batch
3.3. rte_comp_op_stream_free(rte_comp_stream *stream); // to free context
4. Add new API rte_compdev_enqueue_stream()
...............................................................................
static inline uint16_t rte_compdev_enqueue_stream(uint8_t dev_id,
                uint16_t qp_id,
                struct rte_comp_op **ops,
                uint16_t nb_ops,
                rte_comp_stream *stream); // to be passed with each call
Application should call this API to process a dependent set of data OR when
the output buffer size is unknown.
rte_comp_op_pool_create() should create a mempool large enough to
accommodate the operational state (maintained by rte_comp_stream) based on
rte_comp_op_type. Since rte_comp_stream would be maintained by the PMD,
allocating it from a PMD-managed pool offers performance gains.
API flow: rte_comp_op_pool_create() -> rte_comp_op_bulk_alloc() ->
rte_comp_op_stream_alloc() -> enque_stream(..ops, .., stream)
D.2.1.3 History buffer
-----------------------------
Will be maintained by the PMD within rte_comp_stream
Verma, Shally
2017-12-07 05:42:54 UTC
Permalink
Hi Fiona
Overview
~~~~~~~~
A. Notion of a session in compression API
==================================
A Session is a per-device logical entity which is set up with chained xforms
to be performed on burst operations, where each individual entry contains an
operation type (decompress/compress) and related parameters.
A typical Session parameter includes:
- compress / decompress
- dev_id
- compression algorithm and other related parameters
- mempool - for use by session for runtime requirement
- and any other associated private data maintained by session
Application can set up multiple sessions on a device as dictated by
dev_info.nb_sessions or nb_session_per_qp.
[Fiona] The session design is modelled on the cryptodev session design and
so allows a session to be created which can be used on different driver
types. E.g. a session could be set up and initialised to run on a QuickAssist
device and a software device. This may be useful for stateless requests, and
enables load-balancing. For stateful flows the session should be set up for
only one specific driver type, as the state information will be stored in the
private data specific to that driver type and is not transferrable between
driver types.
So a session
- is not per-device
- has no dev_id
- has no mempool stored in it - the pool is created by the application; the
lib can retrieve the pool from the object with rte_mempool_from_obj()
- does not have a limit number per device, just per qp, i.e. there is no
dev_info.nb_sessions, just dev_info.max_nb_sessions_per_qp
Do you think any of this needs to be changed?
[Shally] Please help confirm the following before I answer this.
In cryptodev, a session holds a driver array where each entry can be set up
to perform the same/different operation in its private data.
- A session will not retain any of the info mentioned above (xform, mempool,
algos et al.). All such information is maintained as part of the associated
device driver's private data.
[Fiona] exactly
- App can use the same session to set a compress xform and a decompress
xform on devices, but if both devices map to the same driver_id then only
one is effective (whichever is set first)?
[Fiona] No, the intention is that the session is initialised for all drivers,
and so for all devices, using the same xform. So it should only be
initialised to either compress or decompress.
[Shally] Ok. Then documentation can be updated to reflect this purpose.
The intent being that an application can prepare an operation (stateless)
independently of the driver & device it's targeting, and then choose where
to send it, possibly based on which device is not busy.
[Shally] Ok. Then for now we can continue with same design approach as in cryptodev. Will propose any changes later on a need basis.
Also, I may have few questions on session API implementation in rte_compressdev.c. Will ask them in API RFC.
Is this understanding correct?
B. Notion of burst operations in compression API
 =======================================
struct rte_comp_op defines compression/decompression operational parameters
and makes up one single element of a burst. This is both an input/output
parameter.
PMD gets source, destination and checksum information at input and updates
it with bytes consumed and produced at output.
[Fiona] Agreed
Once enqueued for processing, an rte_comp_op *cannot be reused* until its
status is set to RTE_COMP_OP_FAILURE or RTE_COMP_OP_STATUS_SUCCESS.
[Fiona] cannot be used until its status is set to any value other than
RTE_COMP_OP_NOT_PROCESSED
[Shally] How will the user know that the status is NOT_PROCESSED after ops
are enqueued?
I assume the only way to check the status of enqueued ops is
dequeue_burst(), and the PMD puts an op into the completion queue for
dequeue *only when* it has completed with a Pass/Fail/Out_of_space
condition, *not* while it is in progress (equivalent of
RTE_COMP_OP_NOT_PROCESSED).
Am I missing anything here?
[Fiona] Correct. PMD should only return an op in the dequeue once it's processed.
C. Session and rte_comp_op
 =======================
Every operation in a burst is tied to a Session. More to cover on this under
the Stateless Vs Stateful section.
[Fiona] Agreed. I would add that each operation in a burst may be attached
to a different session.
D. Stateless Vs Stateful
===================
Compression API provides the RTE_COMP_FF_STATEFUL feature flag for a PMD to
reflect its support for stateful operation.
[Fiona] Agreed.
D.1 Compression API Stateless operation
------------------------------------------------------
A stateless operation means all enqueued packets are independent of each
other, i.e. each packet has
-              its flush value set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL
(required only on the compression side),
-              all of the required input and a sufficiently large buffer to
store the output, i.e. OUT_OF_SPACE can never occur (required during both
compression and decompression)
In such a case, the PMD initiates stateless processing and releases acquired
resources after processing of the current operation is complete, i.e. full
input consumed and full output written.
Application can attach the same or a different session to each packet and
can make consecutive enque_burst() calls, i.e.:
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops1, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops3, nb_ops);
*Note - every call has a different ops array, i.e. the same rte_comp_op
array *cannot be reused* to queue the next batch of data until the previous
ones are completely processed.
Also, if multiple threads call enqueue_burst() on the same queue pair then
it's the application's onus to use a proper locking mechanism to ensure
serialized enqueuing of operations.
[Fiona] Agreed to above stateless description.
Please note that any time the output buffer runs out of space during a
write, the operation will turn "stateful". See more on stateful under the
respective section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE
is returned and the application must treat it as a fail and resubmit the
operation with a larger destination buffer.
[Shally] Then I propose to add a feature flag "FF_SUPPORT_OUT_OF_SPACE" per
xform type for flexible PMD design, as there are devices which treat it as
an error on compression but not on decompression.
If it is not supported, then it should be treated as a failure condition and
the app can resubmit the operation.
If it is supported, behaviour is *to-be-defined* under stateful.
[Fiona] Can you explain 'turn stateful' some more?
If the compressor runs out of space during a stateless operation, either
comp or decomp, and turns stateful, how would the app know? And what would
be in status, consumed and produced?
Could it return OUT_OF_SPACE, and if both consumed and produced == 0
[Shally] If consumed == produced == 0, then it's not an OUT_OF_SPACE condition.
then the whole op must be resubmitted with a bigger output buffer. But if
consumed and produced > 0 then the app could take the output and submit the
next op continuing from consumed+1.
[Shally] consumed and produced will *always* be > 0 in case of OUT_OF_SPACE.
OUT_OF_SPACE means the output buffer was exhausted while writing data into
it and the PMD may have more to write. So in such a case, the PMD should set:
produced = complete length of the output buffer
status = OUT_OF_SPACE
For consumed, there are the following possibilities:
1. consumed = complete length of the src mbuf, meaning the PMD has read the full input, OR
2. consumed = partial length of the src mbuf, meaning the PMD has read partial input
On seeing this status, the app should consume the output and re-enqueue the
same op with an empty output buffer and src = consumed+1.
Please note, as per the current proposal, the app should call the
rte_compdev_enqueue_stream() version of the API if it doesn't know the
output size beforehand.
1. rte_comp_session *sess = rte_comp_session_create(rte_mempool *pool);
2. rte_comp_session_init(int dev_id, rte_comp_session *sess,
                         rte_comp_xform *xform, rte_mempool *sess_pool);
3. rte_comp_op_pool_create(rte_mempool ..)
4. rte_comp_op_bulk_alloc(struct rte_mempool *mempool,
                          struct rte_comp_op **ops, uint16_t nb_ops);
5. for every rte_comp_op in ops[],
    5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
    5.2 set up with src/dst buffer
6. enq = rte_compdev_enqueue_burst(uint8_t dev_id, uint16_t qp_id,
                                   struct rte_comp_op **ops, uint16_t nb_ops);
7. dqu = rte_compdev_dequeue_burst(dev_id, qp_id, ops, enq);
8. repeat 7 while (dqu < enq) // wait till all enqueued ops are dequeued
9. repeat 5.2 for the next batch of data
10. rte_comp_session_clear() // resets only the private data memory area and
    *not* the xform and dev_id information, in case you want to re-use the
    session
11. rte_comp_session_free(rte_comp_session *session)
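Steps 1-11 above can be written out as pseudo-C against the proposed API. Names and signatures are the ones under discussion in this thread and may still change; this is a sketch, not compilable code:

```c
struct rte_comp_session *sess = rte_comp_session_create(sess_mempool);
rte_comp_session_init(dev_id, sess, &xform, sess_priv_pool);        /* 1, 2 */

struct rte_comp_op *ops[NB_OPS];
rte_comp_op_bulk_alloc(op_pool, ops, NB_OPS);                       /* 3, 4 */
for (i = 0; i < NB_OPS; i++) {
        rte_comp_op_attach_session(ops[i], sess);                   /* 5.1 */
        /* 5.2: point ops[i] at its src/dst mbufs, flush = RTE_FLUSH_FINAL */
}

enq = rte_compdev_enqueue_burst(dev_id, qp_id, ops, NB_OPS);        /* 6 */
dqu = 0;
while (dqu < enq)                                                   /* 7, 8 */
        dqu += rte_compdev_dequeue_burst(dev_id, qp_id, &ops[dqu], enq - dqu);
/* 9: refill src/dst (5.2) and enqueue again; 10, 11: clear/free session */
```

As noted in the replies that follow, a real application need not drain the whole burst before the next enqueue, and ops in one burst may use different sessions.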
[Fiona] ok. This is one possible flow. There are variations possible:
- Above assumes all ops are using the same session; this is not necessarily
the case. E.g. there could be a compression session and a decompression
session, and a burst of ops may contain both.
[Shally] Agree, but assume applicable only for stateless until we cover stateful.
In this case Step 9 would be Repeat 5.1 as well as 5.2
- Also it would not be necessary to wait until the full burst is dequeued
before doing another enqueue - though of course the ops would need to be
managed so only those finished with are reused, or multiple sets of ops
could be allocated.
[Shally] Agree
- What do you mean by the Step 10 comment? The session only has private
data. It's up to the PMD to store whatever it needs from the xform. I think
session_clear should mean all the data is zeroed in the session.
If the session is to be re-used then nothing needs to be cleared. Each op is
already re-using the session in your flow above without clearing between ops.
BUT this only applies to stateless.
[Shally] This came from my previous notion of session. Now that I see its
analogy to cryptodev, its purpose is clear to me.
But then I propose to rename the API to rte_compdev_sess_term() to make it
self-explanatory.
[Fiona] So I think term isn't very clear, I guess it's short for terminate?
If so I can change to rte_compdev_sess_terminate()
[Shally] Yes by term I meant terminate. So rte_compdev_sess_terminate() is fine.
However I think we may not need this at all if we go with your stream
suggestion below, as there may be only immutable data in the session - like
algo, level, checksum type, etc.
In cryptodev the session data is also immutable, but as it holds a key it's
important to clear it before allowing the session to be re-used. I'm not
sure we'll have any data left in the session which needs to be cleared; all
the state data may be in the stream.
[Shally] I would say we still need it, to reverse operations done in
sess_init(), like freeing up device private data. So, we should retain it.
- for stateful we may need a different behaviour - to clear state data but
keep algo, level, Huffman-type, checksum-type. Let's discuss under stateful
- this may need a new API.
[Shally] Or the PMD can internally reset its state once it processes an op
with FULL_FLUSH/FINISH. Will revisit it under stateful.
[Fiona] Yes, that could also be an option for stream.
D.1.2 Requirement for Stateless
-------------------------------------------
Since operations can complete out-of-order, there should be one (void
*user) per rte_comp_op to enable the application to map a dequeued op to
an enqueued op.
[Fiona] In cryptodev there was an opaque_data field in the op - it was
removed as the application can store any
private data it needs following the op, as it creates the op pool and
dictates
the size. Do you think we need an
explicit (void *user) or can we follow the same approach?
[Shally] if priv_data in crypto_op_pool_create() is user data, then we don't
need explicit user *.
But then I propose variable name should be renamed to user_data
indicating it is an app data opaque to
PMD.
And, __rte_comp_op_get_priv_data_size () should be changed to
__rte_comp_op_get_user_data_size().
[Fiona] Agreed.
[Fiona] Out of time - I'll continue from here later in the week.
[Shally] Sure. Look forward to that.
Thanks
Shally
D.2 Compression API Stateful operation
----------------------------------------------------------
- API ran into an out_of_space situation during processing of input. Example:
a stateless compressed stream fed fully to the decompressor, but the output
buffer is not large enough to hold the output.
- API is waiting for more input to produce output. Example: a stateless
compressed stream fed partially to the decompressor.
[Fiona] I agree this is stateful if more input is to follow. But don’t understand the
stateless ref. here. It's not relevant whether compression was done statefully
or statelessly, if only part of the data is fed to the decompressor then it should
be fed to a stateful session.
[Shally] Agree to what you said. This is regardless of stateful/stateless.
- API is dependent on a previous operation for further
compression/decompression
In case of any one or all of the above conditions, the PMD is required to
maintain the context of operations across enque_burst() calls, until a packet
with RTE_FLUSH_FULL/FINAL and sufficiently large input/output buffers is
received and processed.
D.2.1 Compression API requirement for Stateful
---------------------------------------------------------------
D.2.1.1 Sliding Window Size
------------------------------------
Maximum length of the sliding window in bytes; previous-data lookup will be
performed up to this length. To be added as an algorithm capability
parameter, set by the PMD.
[Fiona] Agreed. 32k default.
To be set on session xform for both compression and decompression?
[Shally] Yes for both compression and decompression.
Is there a subset of specific values you'd like supported on the API?
(rather than all to minimise test cases)
[Shally] I would suggest staying with the current approach of providing the max.
Though it may increase test effort, that way it will be more conformant to the zlib/deflate RFC and extensible to meet different app needs.
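Purely as an illustration of "algorithm capability parameter", a capability entry could carry the window size like this. The struct and field names are this document's assumption only; no such struct exists in the proposal yet:

```c
/* Hypothetical capability entry; names are illustrative, not proposed API. */
struct rte_comp_algo_capability {
        enum rte_comp_algorithm algo;   /* e.g. deflate */
        uint32_t window_size_max;       /* max sliding window in bytes;
                                         * 32768 matches the deflate
                                         * (RFC 1951) default discussed above */
};
```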
D.2.1.2 Stateful operation state maintenance
 -------------------------------------------------------------
This section starts with a description of our understanding of compression
API support for stateful operations. Building upon these concepts, we then
identify the data structures/parameters the PMD requires to maintain
in-progress operation context.
For stateful compression, a batch of dependent packets starts at a packet
having a RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having
RTE_FULL_FLUSH/FINAL_FLUSH.
------------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------------
[Fiona] I think it needs to be more constrained than your examples below.
Only 1 operation from a stream can be in a burst: each operation in a
stateful stream must complete before the next can be processed, as the next
operation needs the state and history of the previous one.
And if one failed, e.g. due to OUT_OF_SPACE, this should affect
the following operation in the same stream.
Worst case this means bursts of 1. Burst can be >1 if there are multiple
independent streams with available data for processing. Or if there is
data available which can be statelessly processed.
If there are multiple buffers available from a stream, then instead they can
be linked together in an mbuf chain sent in a single operation.
To handle the sequences below would mean the PMD
would need to store ops sending one at a time to be processed.
As this is significantly different from what you describe below, I'll wait for
further feedback
before continuing.
[Shally] I concur with your thoughts, and they are not significantly
different from the concept presented below.
Yes, as you mentioned, even for burst_size > 1 the PMD will have to
serialize each op internally, i.e. it has to wait for the previous one to
finish before putting the next up for processing. This is as good as the
application making serialized calls passing one op at a time, or, if the
stream consists of multiple buffers, making a scatter-gather list of them
and enqueuing the whole as one op, which is more efficient and the ideal
usage.
However, in order to allow extensibility, I didn't mention a limitation on
burst_size. If a PMD doesn't support burst_size > 1 it can always return
nb_enqueued = 1, in which case the app can enqueue the next op, with the
condition that it should wait for the previous one to complete before making
the next enqueue call.

So, if we take a simple example: compress 2k of data with src mbuf size = 1k.
Then with burst_size = 1, the expected call flow would be (this is just one
flow; other variations are also possible, such as making a chain of 1k
buffers and passing the whole data in one go):

1. fill the 1st 1k chunk of data in op.msrc
2. enqueue_stream(..., |op.flush = no_flush|, 1, ptr_stream);
3. dequeue_burst(|op|, 1);
4. refill the next 1k chunk in op.msrc
5. enqueue_stream(..., |op.flush = full_flush|, 1, ptr_stream);
6. dequeue_burst(|op|, 1);
7. end

So, I don't see much of a change in API call flow from here to the design
presented below, except nb_ops = 1 in each call.
However, I am assuming that the op structure would still be the same for
stateful processing, i.e. it would start with an op with flush value =
NO/SYNC_FLUSH and end at an op with flush value = FULL_FLUSH.
Are we on the same page here?

Thanks
Shally
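The 2k-in-1k-chunks flow above can be sketched as pseudo-C against the proposed stream API. enqueue_stream(), the stream type and the flush names are still under discussion in this thread, so this is only a sketch under those assumptions:

```c
struct rte_comp_stream *strm;
rte_comp_op_stream_alloc(op_pool, RTE_COMP_OP_STATEFUL, &strm);

/* steps 1-3: first 1k chunk, stream stays open */
op->flush = RTE_NO_FLUSH;                     /* 1st 1k of data in op's src mbuf */
rte_compdev_enqueue_stream(dev_id, qp_id, &op, 1, strm);
while (rte_compdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
        ;                                     /* ops of one stream serialize */

/* steps 4-6: last 1k chunk closes the stream */
op->flush = RTE_FULL_FLUSH;
rte_compdev_enqueue_stream(dev_id, qp_id, &op, 1, strm);
while (rte_compdev_dequeue_burst(dev_id, qp_id, &op, 1) == 0)
        ;

rte_comp_op_stream_free(strm);                /* step 7: stream state done */
```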
But in principle I agree with the idea of a stream structure to hold the state data
needed for a stateful flow. Originally I'd assumed this data would be held in the session
but I think it's better to not have a 1:1 mapping between stream and session as this
would limit a session unnecessarily. So there can be many streams in a single session.
For the sake of simplicity, we will use the term "stream" to identify such a
related set of operations in the following description.
Stream processing imposes the following limitations on usage of the
enque_burst() API:
-              All dependent packets in a stream should carry the same session
-              If a stream is broken into multiple enqueue_burst() calls, then
the next enqueue_burst() cannot be called until the previous one has been
fully processed. I.e.
               Consider for example a stream with ops1..ops7. This is *not* allowed:
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final|)
               This *is* allowed:
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                deque_burst(ops1..ops4)
                enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final|)
-              A single enque_burst() can carry only one stream. I.e. this is *not* allowed:
               enque_burst(|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush|)
If a stream is broken into several enqueue_burst() calls, then the
compression API needs to maintain operational state between calls. For this,
the concept of rte_comp_stream is introduced into the compression API.
1. Add rte_comp_op_type
........................................
enum rte_comp_op_type {
    RTE_COMP_OP_STATELESS,
    RTE_COMP_OP_STATEFUL
};
2. Add new data type rte_comp_stream to maintain stream state
........................................................................................................
rte_comp_stream is a data structure opaque to the application, which is
exchanged back and forth between the application and the PMD during stateful
compression/decompression.
It should be allocated per stream AND before the beginning of the stateful
operation. If a stream is broken into multiple enqueue_burst() calls, then
each respective enqueue_burst() must carry the same rte_comp_stream pointer.
It is a mandatory input for stateful operations.
rte_comp_stream can be cleared and reused via the compression API
rte_comp_stream_clear() and freed via rte_comp_stream_free(). Clear/free
should not be called while it is in use.
This enables sharing of a session by multiple threads handling different
streams, as each bulk of ops carries its own context. This can also be used
by the PMD to handle the OUT_OF_SPACE situation.
3. Add stream allocate, clear and free API
...................................................................
3.1. rte_comp_op_stream_alloc(rte_mempool *pool, rte_comp_op_type type,
                              rte_comp_stream **stream);
3.2. rte_comp_op_stream_clear(rte_comp_stream *stream); // stream becomes
                                                        // usable for a new
                                                        // stateful batch
3.3. rte_comp_op_stream_free(rte_comp_stream *stream);  // to free context
4. Add new API rte_compdev_enqueue_stream()
...............................................................................
static inline uint16_t rte_compdev_enqueue_stream(uint8_t dev_id,
                                                  uint16_t qp_id,
                                                  struct rte_comp_op **ops,
                                                  uint16_t nb_ops,
                                                  rte_comp_stream *stream);
                                                  // to be passed with each call
The application should call this API to process a dependent set of data, or
when the output buffer size is unknown.
rte_comp_op_pool_create() should create a mempool large enough to accommodate
the operational state (maintained by rte_comp_stream) based on
rte_comp_op_type. Since rte_comp_stream is maintained by the PMD, allocating it
from a PMD-managed pool offers performance gains.
API flow: rte_comp_op_pool_create() ---> rte_comp_op_bulk_alloc() --->
rte_comp_op_stream_alloc() ---> enque_stream(..ops, .., stream)
D.2.1.3 History buffer
-----------------------------
Will be maintained by the PMD within rte_comp_stream.
Trahe, Fiona
2017-12-15 17:40:49 UTC
Permalink
Hi Shally,
-----Original Message-----
Sent: Thursday, December 7, 2017 5:43 AM
Subject: RE: [RFC v1] doc compression API for DPDK
//snip....
Please note any time the output buffer runs out of space during a write, the
operation will turn "Stateful". See more on Stateful under the respective
section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE is
returned and the application must treat it as a fail and resubmit the operation
with a larger destination buffer.
[Shally] Then I propose to add a feature flag "FF_SUPPORT_OUT_OF_SPACE" per
xform type for flexible PMD design, as there are devices which treat it as an
error on compression but not on decompression.
If it is not supported, then it should be treated as a failure condition and
the app can resubmit the operation.
If supported, behaviour *to-be-defined* under stateful.
[Fiona] Can you explain 'turn stateful' some more?
If the compressor runs out of space during a stateless operation, either comp
or decomp, and turns stateful, how would the app know? And what would be in
status, consumed and produced?
Could it return OUT_OF_SPACE, and if both consumed and produced == 0
[Shally] If consumed = produced == 0, then it's not an OUT_OF_SPACE condition.
[Fiona] ...then the whole op must be resubmitted with a bigger output buffer.
But if consumed and produced > 0 then the app could take the output and submit
the next op continuing from consumed+1.
[Shally] consumed and produced will *always* be > 0 in case of OUT_OF_SPACE.
OUT_OF_SPACE means the output buffer was exhausted while writing data into it
and the PMD may have more to write. So in such a case, the PMD should set
produced = complete length of output buffer
status = OUT_OF_SPACE
and either
1. consumed = complete length of src mbuf, meaning the PMD has read the full
input, OR
2. consumed = partial length of src mbuf, meaning the PMD has read partial
input.
On seeing this status, the app should consume the output and re-enqueue the
same op with an empty output buffer and src = consumed+1.
[Fiona] As this was a stateless op, the PMD cannot be expected to have stored
the history and state, and so cannot be expected to continue from consumed+1.
This would be stateful behaviour.
But it seems you are saying that even in this stateless case you'd like the
PMDs which can store state to have the option of converting to stateful. So
a PMD which can support this could return OUT_OF_SPACE with produced/consumed
as you describe above;
a PMD which can't support it should return an error.
The appl can continue on from consumed+1 in the former case and resubmit the
full request with a bigger buffer in the latter case.
Is this the behaviour you're looking for?
If so, the error could be something like NEED_BIGGER_DST_BUF?
However, wouldn't OUT_OF_SPACE with produced=consumed=0 convey the same
information on the API? It may correspond to an error on the underlying PMD,
but would it be simpler on the compressdev API?
[Shally] Please note as per the current proposal, the app should call the
rte_compdev_enqueue_stream() version of the API if it doesn't know the output
size beforehand.
[Fiona] True. But the above is only trying to describe behaviour in the
stateless error case.

//snip.....
D.2.1.2 Stateful operation state maintenance
-------------------------------------------------------------
This section starts with a description of our understanding of compression API
support for stateful operation. Depending upon the understanding built upon
these concepts, we will identify the required data structures/params needed to
maintain in-progress operation context in the PMD.
For stateful compression, a batch of dependent packets starts at a packet
having a RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having
RTE_FULL_FLUSH/FINAL_FLUSH.
------------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------------
[Fiona] I think it needs to be more constrained than your examples below.
Only 1 operation from a stream can be in a burst. As each operation
in a stateful stream must complete, as next operation needs state and history
of previous operation to be complete before it can be processed.
And if one failed, e.g. due to OUT_OF_SPACE, this should affect
the following operation in the same stream.
Worst case this means bursts of 1. Burst can be >1 if there are multiple
independent streams with available data for processing. Or if there is
data available which can be statelessly processed.
If there are multiple buffers available from a stream , then instead they can
be linked together in an mbuf chain sent in a single operation.
To handle the sequences below would mean the PMD
would need to store ops sending one at a time to be processed.
As this is significantly different from what you describe below, I'll wait for
further feedback
before continuing.
[Shally] I concur with your thoughts, and they're not significantly different
from the concept presented below.
Yes, as you mentioned, even for burst_size > 1 the PMD will have to serialize
each op internally, i.e. it has to wait for the previous one to finish before
putting the next one up for processing. This is as good as the application
making serialised calls passing one op at a time, or, if the stream consists of
multiple buffers, making a scatter-gather list of them and enqueueing it as one
op, which is more efficient and the ideal usage.
However, in order to allow extensibility, I didn't mention a limitation on
burst_size, because if a PMD doesn't support burst_size > 1 it can always
return nb_enqueued = 1, in which case the app can enqueue the next op, with the
condition that it should wait for the previous one to complete before making
the next enqueue call.
So, if we take a simple example of compressing 2k of data with src mbuf size =
1k, then with burst_size = 1 the expected call flow would be (this is just one
flow; other variations are also possible):
1. fill 1st 1k chunk of data in op.msrc
2. enqueue_stream(..., |op.flush = no_flush|, 1, ptr_stream);
3. dequeue_burst(|op|, 1);
4. refill next 1k chunk in op.msrc
5. enqueue_stream(..., |op.flush = full_flush|, 1, ptr_stream);
6. dequeue_burst(|op|, 1);
7. end
So, I don't see much of a change in the API call flow from here to the design
presented below, except nb_ops = 1 in each call.
However, I am assuming that the op structure would still be the same for
stateful processing, i.e. it would start with op.flush value = NO/SYNC_FLUSH
and end at an op with flush value = FULL_FLUSH.
Are we on same page here?
Thanks
Shally
[Fiona] We still have a different understanding of the stateful flow needed on the API.
I’ll try to clarify and maybe we can set up a meeting to discuss.
My assumptions first:
• Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
• Ops from many streams can be enqueued on same qp.
• Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent.
• Stateless and stateful ops can be enqueued on the same qp

Submitting a burst of stateless ops to a qp is no problem.
Submitting more than 1 op at a time from the same stateful stream to a qp is a problem.
Example:
Appl submits 2 ops in same stream in a burst, each has src and dest mbufs, input length/offset and
requires checksum to be calculated.
The first op must be processed to completion before the second can be started as it needs the history and the checksum so far.
If each dest mbuf is big enough so no overflow, each dest mbuf will be partially filled. This is probably not
what’s desired, and will force an extra copy to make the output data contiguous.
If the dest mbuf in the first op is too small, then does the PMD alloc more memory in the dest mbuf?
Or alloc another mbuf? Or fail and the whole burst must be resubmitted?
Or store the 2nd op, wait, on seeing the OUT_OF_SPACE on the 1st op, overwrite the src, dest, len etc of the 2nd op
to include the unprocessed part of the 1st op?
In the meantime, are all other ops on the qp blocked behind these?
For hw accelerators it’s worse, as PMD would normally return once ops are offloaded and the dequeue would
pass processed ops straight back to the appl. Instead, the enqueue would need to kick off a thread to
dequeue ops and filter to find the stateful one, storing the others til the next application dequeue is called.

Above scenarios don’t lend themselves to accelerating a packet processing workload.
It pushes a workload down to all PMDs which I believe belongs above this API as
that work is not about offloading the compute intensive compression work but
about the sequencing of data and so is better coded once, above the API in an application layer
common to all PMDs. (See Note1 in http://dpdk.org/ml/archives/dev/2017-October/078944.html )
If an application has several packets with data from a stream that it needs to (de)compress statefully,
what it probably wants is for the output data to fill each output buffer completely before writing to the next buffer.
Chaining the src mbufs in these pkts into one chain and sending as one op allows the output
data to be packed into a dest mbuf or mbuf chain.
I think what’s needed is a layer above the API to accumulate incoming packets while waiting for the
previous set of packets to be compressed. Forwarding to the PMD to queue there is not the right place
to buffer them as the queue should be per stream rather than on the accelerator engine’s queue
which has lots of other independent packets.


Proposal:
• Ops from a qp may be fanned out to available hw or sw engines and
processed in parallel, so each op must be independent.
• Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
• Stateless and stateful ops can be enqueued on the same qp
• Stateless and stateful ops can be enqueued in the same burst
• Only 1 op at a time may be enqueued to the qp from any stateful stream.
• A burst can have multiple stateful ops, but each must be from a different stream.
• All ops will have a session attached – this will only contain immutable data which
can be used by many ops, devices and or drivers at the same time.
• All stateful ops will have a stream attached for maintaining state and
history, this can only be used by one op at a time.


Code artefacts:

enum rte_comp_op_type {
    RTE_COMP_OP_STATELESS,
    RTE_COMP_OP_STATEFUL
};

Add following to rte_comp_op:
enum rte_comp_op_type op_type;
void * stream_private;
/* location where PMD maintains stream state – only required if op_type is STATEFUL, else set to NULL */

As size of stream data will vary depending on PMD, each PMD or device should allocate & manage its own mempool. Associated APIs are:
rte_comp_stream_create(uint8_t dev_id, rte_comp_session *sess, void ** stream);
/* This should alloc a stream from the device’s mempool and initialise it. This handle will be passed to the PMD with every op in the stream. Q. Should qp_id also be added, with constraint that all ops in the same stream should be sent to the same qp? */
rte_comp_stream_free(uint8_t dev_id, void * stream);
/* This should clear the stream and return it to the device’s mempool */

All ops are enqueued/dequeued to device & qp using same rte_compressdev_enqueue_burst()/…dequeue_burst;

Re flush flags, stateful stream would start with op.flush = NONE or SYNC and end with FULL or FINAL
STATELESS ops would just use either FULL or FINAL


Let me know if you want to set up a meeting - it might be a more effective way to
arrive at an API that works for all PMDs.

I'll send out a v3 today with above plus updates ba
Verma, Shally
2017-11-30 15:46:09 UTC
Permalink
Resend with +Pablo

-----Original Message-----
From: Verma, Shally
Sent: 30 November 2017 16:43
To: 'Trahe, Fiona' <***@intel.com>; ***@dpdk.org
Cc: Athreya, Narayana Prasad <***@cavium.com>; Challa, Mahipal <***@cavium.com>
Subject: Re: [RFC v1] doc compression API for DPDK

HI Fiona
-----Original Message-----
Sent: 28 November 2017 00:25
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Tuesday, October 31, 2017 11:39 AM
Narayana Prasad
Subject: [RFC v1] doc compression API for DPDK
HI Fiona
This is an RFC document to brief our understanding and requirements on
compression API proposal in
DPDK. It is based on "[RFC] Compression API in DPDK
http://dpdk.org/ml/archives/dev/2017-
October/079377.html".
Intention of this document is to align on concepts built into compression
API, its usage and identify further
requirements.
Going further it could be a base to Compression Module Programmer
Guide.
Current scope is limited to
- definition of the terminology which makes up foundation of compression
API
- typical API flow expected to use by applications
Overview
~~~~~~~~
A. Notion of a session in compression API
==================================
A Session is per device logical entity which is setup with chained-xforms to
be performed on burst
operations where individual entry contains operation type
(decompress/compress) and related parameter.
- compress / decompress
- dev_id
- compression algorithm and other related parameters
- mempool - for use by session for runtime requirement
- and any other associated private data maintained by session
Application can setup multiple sessions on a device as dictated by
dev_info.nb_sessions or
nb_session_per_qp.
[Fiona] The session design is modelled on the cryptodev session design and
so allows
to create a session which can be used on different driver types. E.g. a
session could be set up and initialised
to run on a QuickAssist device and a Software device. This may be useful for
stateless
requests, and enable load-balancing. For stateful flows the session should be
set up for
only one specific driver-type as the state information will be stored in the
private data specific to the driver-type
and not transferrable between driver-types.
So a session
- is not per-device
- has no dev_id
- has no mempool stored in it - the pool is created by the application, the lib
can retrieve the pool from the object with rte_mempool_from_obj()
- does not have a limit number per device, just per qp, i.e. there is no
dev_info.nb_sessions, just dev_info.max_nb_sessions_per_qp
Do you think any of this needs to be changed?
[Shally] Please help confirm following before I could answer this.

In cryptodev, a session holds a drivers array initialized on it, where each
entry can be set up to perform the same/different operation in its
private_data. Mapping that onto compression would mean a session:
- Will not retain any of the info as mentioned above (xform, mempool, algos et
al). All such information is maintained as part of the associated device
driver's private data.
- The app can use the same session to set a compress xform and a decompress
xform on devices, but if both devices map to the same driver_id then only
either one is effective (whichever is set first)?
 
Is this understanding correct?
B. Notion of burst operations in compression API
 =======================================
struct rte_comp_op defines compression/decompression operational
parameter and makes up one single
element of burst. This is both an input/output parameter.
PMD gets source, destination and checksum information at input and
updated it with bytes consumed and
produced at output.
[Fiona] Agreed
Once enqueued for processing, rte_comp_op *cannot be reused* until its
status is set to
RTE_COMP_OP_FAILURE or RTE_COMP_OP_STATUS_SUCCESS.
[Fiona] cannot be used until its status is set to any value other than
RTE_COMP_OP_NOT_PROCESSED
[Shally] How will the user know that the status is NOT_PROCESSED after ops are
enqueued?
I assume the only way to check enqueued ops' status is dequeue_burst(), and the
PMD puts an op into the completion queue for dequeue *only when* it is
completed with a Pass/Fail/Out_of_space condition, *not* when it's in progress
(the equivalent of RTE_COMP_OP_NOT_PROCESSED).

Am I missing anything here?
C. Session and rte_comp_op
 =======================
Every operation in a burst is tied to a Session. More to cover on this under
Stateless Vs Stateful section.
[Fiona] Agreed. I would add that each operation in a burst may be attached
to a different session.
D. Stateless Vs Stateful
===================
Compression API provide RTE_COMP_FF_STATEFUL feature flag for PMD
to reflect its support for Stateful
operation.
[Fiona] Agreed.
D.1 Compression API Stateless operation
------------------------------------------------------
A Stateless operation means all enqueued packets are independent of each other,
i.e. each packet has
-              its flush value set to RTE_FLUSH_FULL or RTE_FLUSH_FINAL
(required only on the compression side),
-              all of the required input and a sufficiently large buffer to
store the output, i.e. OUT_OF_SPACE can never occur (required during both
compression and decompression).
In such a case, the PMD initiates stateless processing and releases the
acquired resources after processing of the current operation is complete, i.e.
full input consumed and full output written.
The application can attach the same or a different session to each packet and
can make consecutive enque_burst() calls:
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops1, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops2, nb_ops);
enqueued = rte_comp_enque_burst(dev_id, qp_id, ops3, nb_ops);
*Note - every call has a different ops array, i.e. the same rte_comp_op array
*cannot be reused* to queue the next batch of data until the previous ones are
completely processed.
Also, if multiple threads call enqueue_burst() on the same queue pair then it's
the application's onus to use a proper locking mechanism to ensure serialized
enqueuing of operations.
[Fiona] Agreed to above stateless description.
Please note any time the output buffer runs out of space during a write, the
operation will turn "Stateful". See more on Stateful under the respective
section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE is
returned and the application must treat it as a fail and resubmit the operation
with a larger destination buffer.
[Shally] Then I propose to add a feature flag "FF_SUPPORT_OUT_OF_SPACE" per
xform type for flexible PMD design, as there are devices which treat it as an
error on compression but not on decompression.
If it is not supported, then it should be treated as a failure condition and
the app can resubmit the operation.
If supported, behaviour *to-be-defined* under stateful.
1. rte_comp_session *sess = rte_comp_session_create(rte_mempool *pool);
2. rte_comp_session_init(int dev_id, rte_comp_session *sess,
   rte_comp_xform *xform, rte_mempool *sess_pool);
3. rte_comp_op_pool_create(rte_mempool ..)
4. rte_comp_op_bulk_alloc(struct rte_mempool *mempool,
   struct rte_comp_op **ops, uint16_t nb_ops);
5. for every rte_comp_op in ops[],
    5.1 rte_comp_op_attach_session(rte_comp_op *op, rte_comp_session *sess);
    5.2 set up with src/dst buffer
6. enq = rte_compdev_enqueue_burst(uint8_t dev_id, uint16_t qp_id,
   struct rte_comp_op **ops, uint16_t nb_ops);
7. dqu = rte_compdev_dequeue_burst(dev_id, qp_id, ops, enq);
8. repeat 7 while (dqu < enq) // wait till all of the enqueued ops are dequeued
9. repeat 5.2 for the next batch of data
10. rte_comp_session_clear() // only resets the private data memory area and
    *not* the xform and devid information, in case you want to re-use the
    session
11. rte_comp_session_free(rte_comp_session *session)
[Fiona] ok. This is one possible flow. There are variations possible
- Above assumes all ops are using the same session, this is not necessarily
the case. E.g. there
could be a compression session and a decompression session and a burst of
ops may contain both.
[Shally] Agree but assume applicable only for stateless until we cover stateful
In this case Step 9 would be Repeat 5.1 as well as 5.2
- Also it would not be necessary to wait until the full burst is dequeued
before doing
another enqueue - though of course the ops would need to be managed so
only
those finished with are reused, or multiple sets of ops could be allocated.
[Shally] Agree
- What do you mean by Step 10 comment? The session only has private data.
It's up to the PMD to
store whatever it needs from the xform. I think session_clear should mean
all the data is zeroed in the session.
If the session is to be re-used then nothing needs to be cleared. Each op is
already re-using the session in your
flow above without clearing between ops.
BUT this only applies to stateless
[Shally] This came from my previous notion of a session. Now when I see its
analogy to cryptodev, its purpose is clear to me.
But then I propose to rename the API to rte_compdev_sess_term() to make it
self-explanatory.

- for stateful we may need a different
behaviour - to clear state data but
keep algo, level, Huffman-type, checksum-type. Let's discuss under stateful
- this may need a new API.
[Shally] Or PMD can internally reset its state once it process an op with FULL_FLUSH/FINISH. Will revisit it under stateful.
D.1.2 Requirement for Stateless
-------------------------------------------
Since operations can complete out-of-order, there should be one (void *user)
per rte_comp_op to enable the application to map a dequeued op to an enqueued
op.
[Fiona] In cryptodev there was an opaque_data field in the op - it was
removed as the application can store any
private data it needs following the op, as it creates the op pool and dictates
the size. Do you think we need an
explicit (void *user) or can we follow the same approach?
[Shally] if priv_data in crypto_op_pool_create() is user data, then we don't need explicit user *.
But then I propose variable name should be renamed to user_data indicating it is an app data opaque to PMD.
And, __rte_comp_op_get_priv_data_size () should be changed to __rte_comp_op_get_user_data_size().
[Fiona] Out of time - I'll continue from here later in the week.
[Shally] Sure. Look forward to that.

Thanks
Shally
D.2 Compression API Stateful operation
----------------------------------------------------------
- API ran into an out_of_space situation during processing of input. Example: a
stateless compressed stream fed fully to the decompressor, but the output
buffer is not large enough to hold the output.
- API is waiting for more input to produce output. Example: a stateless
compressed stream fed partially to the decompressor.
- API is dependent on a previous operation for further
compression/decompression.
In case of any one or all of the above conditions, the PMD is required to
maintain context of operations across enque_burst() calls, until a packet with
RTE_FLUSH_FULL/FINAL and sufficient input/output buffers is received and
processed.
D.2.1 Compression API requirement for Stateful
---------------------------------------------------------------
D.2.1.1 Sliding Window Size
------------------------------------
Maximum length of the Sliding Window in bytes. Previous data lookup will be
performed up to this length. To be added as an algorithm capability parameter,
set by the PMD.
D.2.1.2 Stateful operation state maintenance
 -------------------------------------------------------------
This section starts with a description of our understanding of compression API
support for stateful operation. Depending upon the understanding built upon
these concepts, we will identify the required data structures/params needed to
maintain in-progress operation context in the PMD.
For stateful compression, a batch of dependent packets starts at a packet
having a RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having
RTE_FULL_FLUSH/FINAL_FLUSH.
------------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------------
For the sake of simplicity, we will use the term "stream" to identify such a
related set of operations in the following description.
Stream processing imposes the following limitations on usage of the
enque_burst() API:
-              All dependent packets in a stream should carry the same session.
-              If a stream is broken into multiple enqueue_burst() calls, then
the next enqueue_burst() cannot be called until the previous one has fully
processed. I.e., considering for example a stream with ops1 .. ops7, this is
*not* allowed:
                                       ----------------------------------------------------------------------------------
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                                       ----------------------------------------------------------------------------------
                                       ----------------------------------------------------------------
                enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final |)
                                        ----------------------------------------------------------------
              This *is* allowed:
                                       ----------------------------------------------------------------------------------
                enque_burst(|op1.no_flush | op2.no_flush | op3.no_flush | op4.no_flush|)
                                       ----------------------------------------------------------------------------------
                deque_burst(ops1 ..ops4)
                                       ----------------------------------------------------------------
                enque_burst(|op5.no_flush | op6.no_flush | op7.flush_final |)
                                        ----------------------------------------------------------------
-              A single enque_burst() can carry only one stream. I.e. this is *not* allowed:
                                      ------------------------------------------------------------------------------------------------------
              enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
                                       ------------------------------------------------------------------------------------------------------
If a stream is broken into several enqueue_burst() calls, then the compression
API needs to maintain operational state between calls. For this, the concept of
rte_comp_stream is introduced into the compression API.
1. Add rte_comp_op_type
........................................
enum rte_comp_op_type {
    RTE_COMP_OP_STATELESS,
    RTE_COMP_OP_STATEFUL
};
2. Add new data type rte_comp_stream to maintain stream state
........................................................................................................
rte_comp_stream is a data structure opaque to the application, exchanged back
and forth between the application and the PMD during stateful
compression/decompression.
It should be allocated per stream and before the beginning of the stateful
operation. If the stream is broken into multiple enqueue_burst() calls, then
each respective enqueue_burst() must carry the same rte_comp_stream pointer. It
is a mandatory input for stateful operations.
rte_comp_stream can be cleared and reused via the compression API
rte_comp_stream_clear() and freed via rte_comp_stream_free(). Clear/free should
not be called while it is in use.
This enables sharing of a session by multiple threads handling different
streams, as each burst of ops carries its own context. This can also be used by
the PMD to handle the OUT_OF_SPACE situation.
3. Add stream allocate, clear and free API
...................................................................
3.1. rte_comp_op_stream_alloc(rte_mempool *pool, rte_comp_op_type type,
                              rte_comp_stream **stream);
3.2. rte_comp_op_stream_clear(rte_comp_stream *stream); // stream becomes
                                                        // usable for a new
                                                        // stateful batch
3.3. rte_comp_op_stream_free(rte_comp_stream *stream);  // to free context
4. Add new API rte_compdev_enqueue_stream()
...............................................................................
static inline uint16_t rte_compdev_enqueue_stream(uint8_t dev_id,
                                                  uint16_t qp_id,
                                                  struct rte_comp_op **ops,
                                                  uint16_t nb_ops,
                                                  rte_comp_stream *stream);
                                                  // to be passed with each call
The application should call this API to process a dependent set of data, or
when the output buffer size is unknown.
rte_comp_op_pool_create() should create a mempool large enough to accommodate
the operational state (maintained by rte_comp_stream) based on
rte_comp_op_type. Since rte_comp_stream is maintained by the PMD, allocating it
from a PMD-managed pool offers performance gains.
API flow: rte_comp_op_pool_create() ---> rte_comp_op_bulk_alloc() --->
rte_comp_op_stream_alloc() ---> enque_stream(..ops, .., stream)
D.2.1.3 History buffer
-----------------------------
Will be maintained by the PMD within rte_comp_stream.
Verma, Shally
2017-12-20 07:15:00 UTC
Permalink
Hi Fiona

Please refer to my comments below with my understanding on two major points OUT_OF_SPACE and Stateful Design.
If you believe we still need a meeting to converge on same please share meeting details to me.
-----Original Message-----
Sent: 15 December 2017 23:11
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Thursday, December 7, 2017 5:43 AM
Challa, Mahipal
Subject: RE: [RFC v1] doc compression API for DPDK
//snip....
Please note any time the output buffer ran out of space during write, then the operation will turn “Stateful”. See more on Stateful under the respective section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE is returned and the application must treat it as a fail and resubmit the operation with a larger destination buffer.
[Shally] Then I propose to add a feature flag "FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible PMD design, as there're devices which treat it as an error on compression but not on decompression.
If it is not supported, then it should be treated as a failure condition and the app can resubmit the operation.
If supported, behaviour is *To-be-Defined* under stateful.
[Fiona] Can you explain 'turn stateful' some more?
If the compressor runs out of space during a stateless operation, either comp or decomp, and turns stateful, how would the app know? And what would be in status, consumed and produced?
Could it return OUT_OF_SPACE, and if both consumed and produced == 0
[Shally] If consumed = produced == 0, then it's not an OUT_OF_SPACE condition.
[Fiona] then the whole op must be resubmitted with a bigger output buffer. But if consumed and produced > 0 then the app could take the output and submit the next op continuing from consumed+1.
[Shally] consumed and produced will *always* be > 0 in case of OUT_OF_SPACE.
OUT_OF_SPACE means the output buffer was exhausted while writing data into it and the PMD may have more to write to it. So in such a case, the PMD should set
Produced = complete length of output buffer
Status = OUT_OF_SPACE
1. consumed = complete length of src mbuf means PMD has read full input, OR
2. consumed = partial length of src mbuf means PMD has read partial input
On seeing this status, the app should consume the output and re-enqueue the same op with an empty output buffer and src = consumed+1.
[Fiona] As this was a stateless op, the PMD cannot be expected to have stored the history and state and so cannot be expected to continue from consumed+1. This would be stateful behaviour.
[Shally] Exactly.
[Fiona] But it seems you are saying that even in this stateless case you'd like the PMDs which can store state to have the option of converting to stateful. So
- a PMD which can support this could return OUT_OF_SPACE with produced/consumed as you describe above.
- a PMD which can't support it should return an error.
The appl can continue on from consumed+1 in the former case and resubmit the full request with a bigger buffer in the latter case.
Is this the behaviour you're looking for?
If so the error could be something like NEED_BIGGER_DST_BUF?
However, wouldn't OUT_OF_SPACE with produced=consumed=0 convey the same information on the API?
It may correspond to an error on the underlying PMD, but would it be simpler on the compressdev API?
[Shally] Please note as per the current proposal, the app should call the rte_compdev_enqueue_stream() version of the API if it doesn't know the output size beforehand.
[Fiona] True. But the above is only trying to describe behaviour in the stateless error case.
[Shally] Ok. Now I get the point of confusion with the term 'turns stateful' here. No, it's not a stateless to stateful conversion.
A stateless operation is stateless only, and in stateless we don't expect an OUT_OF_SPACE error. So now I also understand what you're trying to imply with produced=consumed=0.

So, let me summarise redefinition of OUT_OF_SPACE based on RFC v3:

Interpreting OUT_OF_SPACE condition:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A. Stateless Operations:
----------------------------------
A.1 If operation is stateless i.e. rte_comp_op.op_type == RTE_COMP_OP_STATELESS, and PMD runs out of buffer during compression or decompression, then it is an error condition for PMD.
It will reset itself and return with produced=consumed=0 with status OUT_OF_SPACE. On seeing this, application should resubmit full request with bigger output buffer size.

B. Stateful Operations:
-------------------------------
B.1 If operation is stateful i.e. rte_comp_op.op_type == RTE_COMP_OP_STATEFUL, and PMD runs out of buffer during compression or decompression, then PMD will update produced=consumed (as mentioned above) and app should resubmit op with input from consumed+1 and output buffer with free space.
Please note for such case, application should allocate stream via call to rte_comp_stream_create() and attach it to op and pass it along every time pending op is enqueued until op processing is complete with status set to SUCCESS/FAILURE.
//snip.....
D.2.1.2 Stateful operation state maintenance
-------------------------------------------------------------
This section starts with a description of our understanding about compression API support for stateful.
Depending upon understanding built upon these concepts, we will identify required data structure/param to maintain in-progress operation context by PMD.
For stateful compression, a batch of dependent packets starts at a packet having RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having RTE_FULL_FLUSH/FINAL_FLUSH.
------------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------------
[Fiona] I think it needs to be more constrained than your examples below.
Only 1 operation from a stream can be in a burst, as each operation in a stateful stream must complete: the next operation needs the state and history of the previous operation to be complete before it can be processed.
And if one failed, e.g. due to OUT_OF_SPACE, this should affect the following operation in the same stream.
Worst case this means bursts of 1. Burst can be >1 if there are multiple independent streams with available data for processing, or if there is data available which can be statelessly processed.
If there are multiple buffers available from a stream, then instead they can be linked together in an mbuf chain sent in a single operation.
To handle the sequences below would mean the PMD would need to store ops, sending one at a time to be processed.
As this is significantly different from what you describe below, I'll wait for further feedback before continuing.
[Shally] I concur with your thoughts. And these are not significantly different from the concept presented below.
Yes, as you mentioned, even for burst_size>1 the PMD will have to serialize each op internally, i.e. it has to wait for the previous to finish before putting the next in for processing, which is as good as the application making a serialised call passing one op at a time or, if the stream consists of multiple buffers, making their scatter-gather list and then enqueuing it as one op at a time, which is more efficient and the ideal usage.
However, in order to allow extensibility, I didn't mention a limitation on burst_size, because if a PMD doesn't support burst_size > 1 it can always return nb_enqueued = 1, in which case the app can enqueue the next op, with the condition that it should wait for the previous to complete before making the next enqueue call.
So, if we take a simple example to compress 2k of data with src mbuf size = 1k, then with burst_size=1 the expected call flow would be (this is just one flow, other variations are also possible):
1. fill 1st 1k chunk of data in op.msrc
2. enqueue_stream (..., |op.flush = no_flush|, 1, ptr_stream);
3. dequeue_burst(|op|, 1);
4. refill next 1k chunk in op.msrc
5. enqueue_stream(..., |op.flush = full_flush|, 1, ptr_stream);
6. dequeue_burst(|op|, 1);
7. end
So, I don't see much of a change in the API call flow from here to the design presented below, except nb_ops = 1 in each call.
However, I am assuming that the op structure would still be the same for stateful processing, i.e. it would start with op.flush value = NO/SYNC_FLUSH and end at an op with flush value = FULL_FLUSH.
Are we on same page here?
Thanks
Shally
[Fiona] We still have a different understanding of the stateful flow needed on the API.
I’ll try to clarify and maybe we can set up a meeting to discuss.
• Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
• Ops from many streams can be enqueued on same qp.
• Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent.
• Stateless and stateful ops can be enqueued on the same qp
Submitting a burst of stateless ops to a qp is no problem. Submitting more than 1 op at a time from the same stateful stream to a qp is a problem.
Say the appl submits 2 ops in the same stream in a burst; each has src and dest mbufs, input length/offset, and requires a checksum to be calculated.
The first op must be processed to completion before the second can be started, as it needs the history and the checksum so far.
If each dest mbuf is big enough so there is no overflow, each dest mbuf will be partially filled. This is probably not what’s desired, and will force an extra copy to make the output data contiguous.
If the dest mbuf in the first op is too small, then does the PMD alloc more memory in the dest mbuf? Or alloc another mbuf? Or fail, so the whole burst must be resubmitted?
Or store the 2nd op, wait, and on seeing the OUT_OF_SPACE on the 1st op, overwrite the src, dest, len etc. of the 2nd op to include the unprocessed part of the 1st op?
In the meantime, are all other ops on the qp blocked behind these?
For hw accelerators it’s worse, as the PMD would normally return once ops are offloaded and the dequeue would pass processed ops straight back to the appl. Instead, the enqueue would need to kick off a thread to dequeue ops and filter to find the stateful one, storing the others till the next application dequeue is called.
The above scenarios don’t lend themselves to accelerating a packet processing workload.
It pushes a workload down to all PMDs which I believe belongs above this API, as that work is not about offloading the compute-intensive compression work but about the sequencing of data, and so is better coded once, above the API, in an application layer common to all PMDs. (See Note1 in http://dpdk.org/ml/archives/dev/2017-October/078944.html )
If an application has several packets with data from a stream that it needs to (de)compress statefully, what it probably wants is for the output data to fill each output buffer completely before writing to the next buffer.
Chaining the src mbufs in these pkts into one chain and sending as one op allows the output data to be packed into a dest mbuf or mbuf chain.
I think what’s needed is a layer above the API to accumulate incoming packets while waiting for the previous set of packets to be compressed. Forwarding to the PMD to queue there is not the right place to buffer them, as the queue should be per stream rather than on the accelerator engine’s queue, which has lots of other independent packets.
[Shally] Ok. I believe I get it.
In general I agree to this proposal. However have concern on 1 point here i.e. order maintenance. Please see further for more explanation.
• Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent.
[Shally] Possible only if the PMD supports a combination of SW and HW processing. Right?
• Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
[Shally] If each op is independent then why do we need to maintain ordering? Since they're independent and thus can be processed in parallel, they can well be quite out-of-order and available for dequeue as soon as completed.
Serializing them will limit HW throughput capability. And I can envision some apps may not care about ordering, just completion.
So I would suggest that an application which needs ordering should tag each op with some id or serial number in the op user_data area to identify enqueue order, OR we may add a flag in the enqueue_burst() API to enforce serialized dequeuing, if that's a hard requirement for anyone.
• Stateless and stateful ops can be enqueued on the same qp
• Stateless and stateful ops can be enqueued in the same burst
• Only 1 op at a time may be enqueued to the qp from any stateful stream.
• A burst can have multiple stateful ops, but each must be from a different stream.
• All ops will have a session attached – this will only contain immutable data which can be used by many ops, devices and or drivers at the same time.
• All stateful ops will have a stream attached for maintaining state and history; this can only be used by one op at a time.
[Shally] So, you mean:

A single enque_burst() *can* carry multiple streams. I.E. This is allowed both in burst or in qp (say, when multiple threads call enque_burst() on same qp)

---------------------------------------------------------------------------------------------------------
enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
---------------------------------------------------------------------------------------------------------
Where,
All of op1, op2...op5 belong to *different* streams. Op3 can be stateless/stateful depending upon op_type value, and each can have *same or different* sessions.

If I understand this right, then yes it looks good to me. However this also brings one minor point for discussion, but I would wait to initiate that until we close on current open points.

Thanks
Shally
enum rte_comp_op_type {
    RTE_COMP_OP_STATELESS,
    RTE_COMP_OP_STATEFUL
};
enum rte_comp_op_type op_type;
void *stream_private;
/* location where PMD maintains stream state – only required if op_type is STATEFUL, else set to NULL */
As size of stream data will vary depending on PMD, each PMD or device:
rte_comp_stream_create(uint8_t dev_id, rte_comp_session *sess, void **stream);
/* This should alloc a stream from the device’s mempool and initialise it. This handle will be passed to the PMD with every op in the stream. Q. Should qp_id also be added, with the constraint that all ops in the same stream should be sent to the same qp? */
rte_comp_stream_free(uint8_t dev_id, void *stream);
/* This should clear the stream and return it to the device’s mempool */
All ops are enqueued/dequeued to device & qp using the same rte_compressdev_enqueue_burst()/…dequeue_burst().
Re flush flags, a stateful stream would start with op.flush = NONE or SYNC and end with FULL or FINAL.
STATELESS ops would just use either FULL or FINAL.
Let me know if you want to set up a meeting - it might be a more effective way to arrive at an API that works for all PMDs.
I'll send out a v3 today with above plus updates based on all the other feedback.
Trahe, Fiona
2017-12-20 15:32:37 UTC
Hi Shally,

I think we are almost in sync now - a few comments below with just one open question which I suspect was a typo.
If this is ok then no need for a meeting I think.
In this case will you issue a v2 of this doc ?
-----Original Message-----
Sent: Wednesday, December 20, 2017 7:15 AM
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Fiona
Please refer to my comments below with my understanding on two major points OUT_OF_SPACE and Stateful Design.
If you believe we still need a meeting to converge on same please share meeting details to me.
-----Original Message-----
Sent: 15 December 2017 23:11
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Thursday, December 7, 2017 5:43 AM
Challa, Mahipal
Subject: RE: [RFC v1] doc compression API for DPDK
//snip....
[Shally] B.1 If operation is stateful i.e. rte_comp_op.op_type == RTE_COMP_OP_STATEFUL, and PMD runs out of buffer during compression or decompression, then PMD will update produced=consumed (as mentioned above)
[Fiona] ? Did you mean to say "will update produced & consumed" ?
I think
- consumed would be <= input length (typically <)
- produced would be <= output buffer len (typically =, but could be a few bytes less)
- status would be OUT_OF_SPACE
Do you agree?
and app should resubmit op with input from consumed+1 and output buffer with free space.
Please note for such case, application should allocate stream via call to rte_comp_stream_create() and attach it to op and pass it along every time pending op is enqueued until op processing is complete with status set to SUCCESS/FAILURE.
[Fiona] Agreed
//snip.....
• Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent.
[Shally] Possible only if PMD support combination of SW and HW processing. Right?
[Fiona] Not necessarily, Intel QuickAssist accelerators are HW and can process ops from same qp in parallel
• Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
[Shally] If each op is independent then why do we need to maintain ordering? Since they're independent and thus can be processed in parallel, they can well be quite out-of-order and available for dequeue as soon as completed.
Serializing them will limit HW throughput capability. And I can envision some apps may not care about ordering, just completion.
So I would suggest that an application which needs ordering should tag each op with some id or serial number in the op user_data area to identify enqueue order, OR we may add a flag in the enqueue_burst() API to enforce serialized dequeuing, if that's a hard requirement for anyone.
[Fiona] Ok, I think you're right, this requirement isn't needed.
In stateless ops it's not needed.
For stateful the appl should only have one op per stream inflight at any time so manages the ordering.
So we can specify on the API that ordering is not necessarily maintained on the qp and PMDs may return responses out-of-order.
The responsibility is on the application to maintain order if it's needed.
If later we find some argument for maintaining order I'd suggest a configuration param per qp or even per device rather than on the enqueue_burst()
A single enque_burst() *can* carry multiple streams. I.E. This is allowed both in burst or in qp (say, when multiple threads call enque_burst() on same qp)
---------------------------------------------------------------------------------------------------------
enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
---------------------------------------------------------------------------------------------------------
Where,
All of op1, op2...op5 belong to *different* streams. Op3 can be stateless/stateful depending upon op_type value, and each can have *same or different* sessions.
[Fiona] Exactly
Verma, Shally
2017-12-22 07:45:33 UTC
Hi Fiona
-----Original Message-----
Sent: 20 December 2017 21:03
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
I think we are almost in sync now - a few comments below with just one open question which I suspect was a typo.
If this is ok then no need for a meeting I think.
In this case will you issue a v2 of this doc ?
-----Original Message-----
Sent: Wednesday, December 20, 2017 7:15 AM
Gupta, Ashish
De Lara Guarch, Pablo
Ahmed Mansour
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Fiona
Please refer to my comments below with my understanding on two major
points OUT_OF_SPACE and
Stateful Design.
If you believe we still need a meeting to converge on same please share
meeting details to me.
-----Original Message-----
Sent: 15 December 2017 23:11
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Thursday, December 7, 2017 5:43 AM
Cc: Athreya, Narayana Prasad
Challa, Mahipal
Subject: RE: [RFC v1] doc compression API for DPDK
//snip....
Please note any time the output buffer runs out of space during a write, the operation will turn “Stateful”. See more on Stateful under the respective section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE is returned and the application must treat it as a fail and resubmit the operation with a larger destination buffer.
[Shally] Then I propose to add a feature flag "FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible PMD design, as there are devices which treat it as an error on compression but not on decompression.
If it is not supported, then it should be treated as a failure condition and the app can resubmit the operation.
If supported, the behaviour is *To-be-Defined* under stateful.
[Fiona] Can you explain 'turn stateful' some more?
If the compressor runs out of space during a stateless operation, either comp or decomp, and turns stateful, how would the app know? And what would be in status, consumed and produced?
Could it return OUT_OF_SPACE, and if both consumed and produced == 0
[Shally] If consumed = produced == 0, then it's not an OUT_OF_SPACE condition.
then the whole op must be resubmitted with a bigger output buffer. But if consumed and produced > 0 then the app could take the output and submit the next op continuing from consumed+1.
[Shally] consumed and produced will *always* be > 0 in case of OUT_OF_SPACE.
OUT_OF_SPACE means the output buffer was exhausted while writing data into it and the PMD may have more to write to it. So in such a case, the PMD should set:
produced = complete length of output buffer
status = OUT_OF_SPACE
1. consumed = complete length of src mbuf, meaning PMD has read full input, OR
2. consumed = partial length of src mbuf, meaning PMD has read partial input
On seeing this status, the app should consume the output and re-enqueue the same op with an empty output buffer and src = consumed+1.
[Fiona] As this was a stateless op, the PMD cannot be expected to have stored the history and state, and so cannot be expected to continue from consumed+1. This would be stateful behaviour.
[Shally] Exactly.
But it seems you are saying that even in this stateless case you'd like the PMDs who can store state to have the option of converting to stateful. So:
a PMD which can support this could return OUT_OF_SPACE with produced/consumed as you describe above.
a PMD which can't support it should return an error.
The appl can continue on from consumed+1 in the former case, and resubmit the full request with a bigger buffer in the latter case.
Is this the behaviour you're looking for?
If so the error could be something like NEED_BIGGER_DST_BUF?
However, wouldn't OUT_OF_SPACE with produced=consumed=0 convey the same information on the API?
It may correspond to an error on the underlying PMD, but would it be simpler on the compressdev API?
Please note as per the current proposal, the app should call the rte_compdev_enqueue_stream() version of the API if it doesn't know the output size beforehand.
[Fiona] True. But the above is only trying to describe behaviour in the stateless error case.
[Shally] Ok. Now I get the point of confusion with the term 'turns stateful' here. No, it's not a stateless-to-stateful conversion.
A stateless operation is stateless only, and in stateless we don't expect an OUT_OF_SPACE error. So now I also understand what you're trying to imply with produced=consumed=0.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----------------------------------
A.1 If the operation is stateless i.e. rte_comp_op.op_type == RTE_COMP_OP_STATELESS, and the PMD runs out of buffer during compression or decompression, then it is an error condition for the PMD.
It will reset itself and return produced=consumed=0 with status OUT_OF_SPACE. On seeing this, the application should resubmit the full request with a bigger output buffer size.
-------------------------------
B.1 If the operation is stateful i.e. rte_comp_op.op_type == RTE_COMP_OP_STATEFUL, and the PMD runs out of buffer during compression or decompression, then the PMD will update produced=consumed (as mentioned above)
[Fiona] ? Did you mean to say "will update produced & consumed" ?
[Shally] Yes, you're right, that was a typo. It should be produced & consumed.
I think:
- consumed would be <= input length (typically <)
- produced would be <= output buffer len (typically =, but could be a few bytes less)
- status would be OUT_OF_SPACE
Do you agree?
[Shally] Yes.
and the app should resubmit the op with input from consumed+1 and an output buffer with free space.
Please note for such a case, the application should allocate a stream via a call to rte_comp_stream_create(), attach it to the op, and pass it along every time the pending op is enqueued, until op processing is complete with status set to SUCCESS/FAILURE.
[Fiona] Agreed
//snip.....
D.2.1.2 Stateful operation state maintenance
-------------------------------------------------------------
This section starts with a description of our understanding about compression API support for stateful. Depending upon the understanding built upon these concepts, we will identify the required data structure/param to maintain in-progress operation context by the PMD.
For stateful compression, a batch of dependent packets starts at a packet having a RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having RTE_FULL_FLUSH/FINAL_FLUSH.
-----------------------------------------------------------------------------------
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
-----------------------------------------------------------------------------------
[Fiona] I think it needs to be more constrained than your examples below.
Only 1 operation from a stream can be in a burst, as each operation in a stateful stream must complete: the next operation needs the state and history of the previous operation to be complete before it can be processed. And if one failed, e.g. due to OUT_OF_SPACE, this should affect the following operation in the same stream.
Worst case this means bursts of 1. A burst can be >1 if there are multiple independent streams with available data for processing, or if there is data available which can be statelessly processed.
If there are multiple buffers available from a stream, then instead they can be linked together in an mbuf chain sent in a single operation.
To handle the sequences below would mean the PMD would need to store ops, sending one at a time to be processed.
As this is significantly different from what you describe below, I'll wait for further feedback before continuing.
[Shally] I concur with your thoughts. And these are not significantly different from the concept presented below.
Yes, as you mentioned, even for burst_size>1 the PMD will have to serialize each op internally, i.e. it has to wait for the previous op to finish before putting the next one up for processing. This is as good as the application making serialised calls passing one op at a time, or, if the stream consists of multiple buffers, making their scatter-gather list and then enqueuing it as one op at a time, which is more efficient and the ideal usage.
However, in order to allow extensibility, I didn't mention a limitation on burst_size.
Because if the PMD doesn't support burst_size > 1 it can always return nb_enqueued = 1, in which case the app can enqueue the next op, with the condition that it should wait for the previous one to complete before making the next enqueue call.
So, if we take a simple example compressing 2k of data with src mbuf size = 1k, then with burst_size=1 the expected call flow would be (this is just one flow, other variations are also possible):
1. fill 1st 1k chunk of data in op.msrc
2. enqueue_stream(..., |op.flush = no_flush|, 1, ptr_stream);
3. dequeue_burst(|op|, 1);
4. refill next 1k chunk in op.msrc
5. enqueue_stream(..., |op.flush = full_flush|, 1, ptr_stream);
6. dequeue_burst(|op|, 1);
7. end
So, I don’t see much of a change in API call flow from here to the design presented below except nb_ops = 1 in each call.
However, I am assuming that the op structure would still be the same for stateful processing, i.e. it would start with op.flush value = NO/SYNC_FLUSH and end at an op with flush value = FULL_FLUSH.
Are we on the same page here?
Thanks
Shally
[Fiona] We still have a different understanding of the stateful flow needed on the API.
I’ll try to clarify, and maybe we can set up a meeting to discuss.
• Order of ops on a qp must be maintained – ops should be dequeued in the same sequence they are enqueued.
• Ops from many streams can be enqueued on the same qp.
• Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent.
• Stateless and stateful ops can be enqueued on the same qp
Submitting a burst of stateless ops to a qp is no problem. Submitting more than 1 op at a time from the same stateful stream to a qp is a problem.
Say the appl submits 2 ops in the same stream in a burst; each has src and dest mbufs, input length/offset, and requires a checksum to be calculated.
The first op must be processed to completion before the second can be started, as it needs the history and the checksum so far.
If each dest mbuf is big enough so there is no overflow, each dest mbuf will be partially filled. This is probably not what’s desired, and will force an extra copy to make the output data contiguous.
If the dest mbuf in the first op is too small, then does the PMD alloc more memory in the dest mbuf? Or alloc another mbuf? Or fail, so the whole burst must be resubmitted? Or store the 2nd op, wait, and on seeing the OUT_OF_SPACE on the 1st op, overwrite the src, dest, len etc of the 2nd op to include the unprocessed part of the 1st op?
In the meantime, are all other ops on the qp blocked behind these?
For hw accelerators it’s worse, as the PMD would normally return once ops are offloaded and the dequeue would pass processed ops straight back to the appl. Instead, the enqueue would need to kick off a thread to dequeue ops and filter to find the stateful one, storing the others till the next application dequeue is called.
The above scenarios don’t lend themselves to accelerating a packet processing workload.
It pushes a workload down to all PMDs which I believe belongs above this API, as that work is not about offloading the compute intensive compression work but about the sequencing of data, and so is better coded once, above the API, in an application layer common to all PMDs. (See Note1 in http://dpdk.org/ml/archives/dev/2017-October/078944.html )
If an application has several packets with data from a stream that it needs to (de)compress statefully, what it probably wants is for the output data to fill each output buffer completely before writing to the next buffer.
Chaining the src mbufs in these pkts into one chain and sending them as one op allows the output data to be packed into a dest mbuf or mbuf chain.
I think what’s needed is a layer above the API to accumulate incoming packets while waiting for the previous set of packets to be compressed. Forwarding to the PMD to queue them there is not the right place to buffer them, as the queue should be per stream rather than on the accelerator engine’s queue, which has lots of other independent packets.
[Shally] Ok. I believe I get it.
In general I agree to this proposal. However I have a concern on 1 point here, i.e. order maintenance. Please see further for more explanation.
• Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent.
[Shally] Possible only if the PMD supports a combination of SW and HW processing. Right?
[Fiona] Not necessarily, Intel QuickAssist accelerators are HW and can process ops from the same qp in parallel
• Order of ops on a qp must be maintained – ops should be dequeued in the same sequence they are enqueued.
[Shally] If each op is independent, then why do we need to maintain ordering? Since they're independent and thus can be processed in parallel, they can well be quite out-of-order and available for dequeue as soon as completed.
Serializing them will limit HW throughput capability. And I can envision some apps may not care about ordering, just completion.
So I would suggest that an application which needs ordering should tag each op with some id or serial number in the op user_data area to identify enqueue order, OR we may add a flag in the enqueue_burst() API to enforce serialized dequeuing, if that's a hard requirement for anyone.
[Fiona] Ok, I think you're right, this requirement isn't needed.
In stateless ops it's not needed.
For stateful, the appl should only have one op per stream inflight at any time, so it manages the ordering.
So we can specify on the API that ordering is not necessarily maintained on the qp and PMDs may return responses out-of-order.
The responsibility is on the application to maintain order if it's needed.
If later we find some argument for maintaining order I'd suggest a configuration param per qp, or even per device, rather than on the enqueue_burst()
[Shally] Done.
• Stateless and stateful ops can be enqueued on the same qp
• Stateless and stateful ops can be enqueued in the same burst
• Only 1 op at a time may be enqueued to the qp from any stateful stream.
• A burst can have multiple stateful ops, but each must be from a different stream.
• All ops will have a session attached – this will only contain immutable data which can be used by many ops, devices and/or drivers at the same time.
• All stateful ops will have a stream attached for maintaining state and history; this can only be used by one op at a time.
A single enque_burst() *can* carry multiple streams, i.e. this is allowed both in a burst and in a qp (say, when multiple threads call enque_burst() on the same qp):
---------------------------------------------------------------------------------------------------------
enque_burst (|op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
---------------------------------------------------------------------------------------------------------
Where, all of op1, op2...op5 belong to *different* streams. Op3 can be stateless/stateful depending upon op_type value, and each can have *same or different* sessions.
[Fiona] Exactly
If I understand this right, then yes it looks good to me. However this also brings up one minor point for discussion, but I would wait to initiate that until we close on the current open points.
Thanks
Shally
[Shally] Since we are in sync now, I will bring up another point for discussion.

I'm thinking probably we should have a stream regardless of op_type, where it is marked *mandatory* for stateful but *optional* (or maybe mandatory) for stateless, as having it for stateless may help some PMDs gain performance in the data path. The reason is here:

Currently we see the stream as a resource which maintains states (etc.) for stateful processing, but this is also a placeholder where the PMD can choose to do one-time resource setup common to both op_types (such as allocating an instruction from its internal pool and initializing it with session params).
So for such PMD designs, it will be beneficial to use a stream for stateless as well, as it will help minimize instruction setup time on the data path since all one-time operations will be done in stream_create().
In the case of stateless, if a stream is present it would mean it is available for next use as soon as the last associated op is dequeued, as it holds no context of the last op, only common resources re-useable for the next op.

Apart from that, there's another point. We can enable the API spec to leverage the mempool object cache concept by allowing the PMD to allocate the stream from the op pool as per-op private data, i.e. each object elt_size = sizeof(rte_comp_op) + user_size + stream_size.
This would help the PMD reduce memory access time if caching is enabled, as each op's stream resides with it in cache rather than having them come from different pools with different policies.

If agreed, then it can be enabled in the API spec such as (this is just a proposal, there could be others):

- Modify the stream_create spec as follows:

struct rte_comp_op_pool_private {
    uint16_t user_size;
    /**< Size of private user data with each operation. */
    uint16_t dev_priv_data_size;
    /**< Size of device private data with each operation. */
};

int rte_comp_stream_create(uint32_t dev_id,
                           rte_comp_session *sess,
                           void **stream,
                           rte_mempool *op_pool /* optional */);

- This will map to the PMD op stream_create(). Here, if op_pool != NULL, then the PMD can re-allocate the op pool with a new elt_size = sizeof(rte_comp_op) + op_private->user_size + dev_private_stream_size so that the stream resides in the op private area.

Or, we may add another API altogether to do stream allocations from the op_pool, such as rte_comp_stream_create_from_pool(...);

I will issue RFC Doc v2 after your feedback on these.

Thanks
Shally
Trahe, Fiona
2017-12-22 15:13:00 UTC
Permalink
Hi Shally,
-----Original Message-----
Sent: Friday, December 22, 2017 7:46 AM
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Fiona
-----Original Message-----
Sent: 20 December 2017 21:03
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
I think we are almost in sync now - a few comments below with just one
open question which I suspect was a typo.
If this is ok then no need for a meeting I think.
In this case will you issue a v2 of this doc ?
-----Original Message-----
Sent: Wednesday, December 20, 2017 7:15 AM
Gupta, Ashish
De Lara Guarch, Pablo
Ahmed Mansour
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Fiona
Please refer to my comments below with my understanding on two major
points OUT_OF_SPACE and
Stateful Design.
If you believe we still need a meeting to converge on same please share
meeting details to me.
-----Original Message-----
Sent: 15 December 2017 23:11
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Thursday, December 7, 2017 5:43 AM
Cc: Athreya, Narayana Prasad
Challa, Mahipal
Subject: RE: [RFC v1] doc compression API for DPDK
//snip....
Post by Trahe, Fiona
Post by Verma, Shally
Post by Verma, Shally
Post by Verma, Shally
Please note any time output buffer ran out of space during write
then
Post by Trahe, Fiona
Post by Verma, Shally
Post by Verma, Shally
operation will turn “Stateful”.  See
Post by Verma, Shally
more on Stateful under respective section.
[Fiona] Let's come back to this later. An alternative is that
OUT_OF_SPACE is
Post by Verma, Shally
Post by Verma, Shally
returned and the application
must treat as a fail and resubmit the operation with a larger
destination
Post by Trahe, Fiona
Post by Verma, Shally
Post by Verma, Shally
buffer.
[Shally] Then I propose to add a feature flag
"FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible
Post by Verma, Shally
PMD design.
As there're devices which treat it as error on compression but not
on
Post by Trahe, Fiona
decompression.
Post by Verma, Shally
If it is not supported, then it should be treated as failure condition
and
app
Post by Trahe, Fiona
can resubmit operation.
Post by Verma, Shally
if supported, behaviour *To-be-Defined* under stateful.
[Fiona] Can you explain 'turn stateful' some more?
If compressor runs out of space during stateless operation, either
comp
or
Post by Trahe, Fiona
decomp, and turns stateful, how would the app know? And what
would
be in
Post by Trahe, Fiona
status, consumed and produced?
Could it return OUT_OF_SPACE, and if both consumed and produced
== 0
[Shally] If consumed = produced == 0, then it's not OUT_OF_SPACE
condition.
Post by Trahe, Fiona
then the whole op must be resubmitted with a bigger output buffer.
But
if
Post by Trahe, Fiona
consumed and produced > 0 then app could take the output and
submit
next
Post by Trahe, Fiona
op
continuing from consumed+1.
[Shally] consumed and produced will *always* be > 0 in case of
OUT_OF_SPACE.
OUT_OF_SPACE means output buffer exhausted while writing data into
it
and PMD may have more to
write to it. So in such case, PMD should set
Produced = complete length of output buffer
Status = OUT_OF_SPACE
1. consumed = complete length of src mbuf means PMD has read full
input,
OR
2. consumed = partial length of src mbuf means PMD has read partial
input
On seeing this status, app should consume output and re-enqueue
same
op with empty output buffer and
src = consumed+1.
[Fiona] As this was a stateless op, the PMD cannot be expected to have
stored the history and state and so
cannot be expected to continue from consumed+1. This would be
stateful
behaviour.
[Shally] Exactly.
But it seems you are saying that even on in this stateless case you'd like
the
PMDs who can store state
to have the option of converting to stateful. So
a PMD which can support this could return OUT_OF_SPACE with
produced/consumed as you describe above.
a PMD which can't support it should return an error.
The appl can continue on from consumed+1 in the former case and
resubmit
the full request
with a bigger buffer in the latter case.
Is this the behaviour you're looking for?
If so the error could be something like NEED_BIGGER_DST_BUF?
However, wouldn't OUT_OF_SPACE with produced=consumed=0 convey
the
same information on the API?
It may correspond to an error on the underlying PMD, but would it be
simpler
on the compressdev API
Please note as per current proposal, app should call
rte_compdev_enqueue_stream() version of API if it
doesn't know output size beforehand.
[Fiona] True. But above is only trying to describe behaviour in the
stateless
error case.
[Shally] Ok. Now I got point of confusion with term 'turns stateful' here. No
it's not like stateless to
stateful conversion.
Stateless operation is stateless only and in stateless we don't expect
OUT_OF_SPACE error. So, now I
also understand what you're trying to imply with produced=consumed=0.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----------------------------------
A.1 If operation is stateless i.e. rte_comp_op. op_type ==
RTE_COMP_OP_STATELESS, and PMD runs out
of buffer during compression or decompression then it is an error condition
for PMD.
It will reset itself and return with produced=consumed=0 with status
OUT_OF_SPACE. On seeing this,
application should resubmit full request with bigger output buffer size.
-------------------------------
B.1 If operation is stateful i.e. rte_comp_op.op_type ==
RTE_COMP_OP_STATEFUL, and PMD runs out
of buffer during compression or decompression, then PMD will update
produced=consumed (as mentioned above)
[Fiona] ? Did you mean to say "will update produced & consumed" ?
[Shally] Yes you right that was typo. It should be produced & consumed.
I think
- consumed would be <= input length (typically <)
- produced would be <= output buffer len (typically =, but could be a few bytes less)
- status would be OUT_OF_SPACE
Do you agree?
[Shally] Yes.
and app should resubmit op with input from consumed+1
and output buffer with free space.
Please note for such case, application should allocate stream via call to
rte_comp_stream_create() and
attach it to op and pass it along every time pending op is enqueued until op
processing is complete with
status set to SUCCESS/FAILURE.
[Fiona] Agreed
//snip.....
Post by Trahe, Fiona
Post by Verma, Shally
Post by Verma, Shally
Post by Verma, Shally
D.2.1.2 Stateful operation state maintenance
 -------------------------------------------------------------
This section starts with description of our understanding about
compression API support for stateful.
Post by Verma, Shally
Depending upon understanding build upon these concepts, we
will
Post by Trahe, Fiona
identify
Post by Verma, Shally
Post by Verma, Shally
required data structure/param
Post by Verma, Shally
to maintain in-progress operation context by PMD.
For stateful compression, batch of dependent packets starts at a
packet
Post by Trahe, Fiona
Post by Verma, Shally
Post by Verma, Shally
having
Post by Verma, Shally
RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and end at packet
having
Post by Trahe, Fiona
Post by Verma, Shally
Post by Verma, Shally
RTE_FULL_FLUSH/FINAL_FLUSH.
Post by Verma, Shally
------------------------------------------------------------------------------
-----
-
Post by Trahe, Fiona
Post by Verma, Shally
Post by Verma, Shally
Post by Verma, Shally
|op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush|
------------------------------------------------------------------------------
-----
-
Post by Trahe, Fiona
[Fiona] I think it needs to be more constrained than your examples
below.
Post by Trahe, Fiona
Only 1 operation from a stream can be in a burst. As each operation
in a stateful stream must complete, as next operation needs state
and
Post by Trahe, Fiona
history
of previous operation to be complete before it can be processed.
And if one failed, e.g. due to OUT_OF_SPACE, this should affect
the following operation in the same stream.
Worst case this means bursts of 1. Burst can be >1 if there are
multiple
Post by Trahe, Fiona
independent streams with available data for processing. Or if there is
data available which can be statelessly processed.
If there are multiple buffers available from a stream , then instead
they
can
Post by Trahe, Fiona
be linked together in an mbuf chain sent in a single operation.
To handle the sequences below would mean the PMD
would need to store ops sending one at a time to be processed.
As this is significantly different from what you describe below, I'll wait
for
Post by Trahe, Fiona
further feedback
before continuing.
[Shally] I concur with your thoughts. And these're are not significantly
different from the concept
presented below.
Yes as you mentioned, even for burst_size>1 PMD will have to serialize
each op internally i.e.
It has to wait for previous to finish before putting next for processing
which
is
as good as application making serialised call passing one op at-a-time or
if
stream consists of multiple buffers, making their scatter-gather list and
then enqueue it as one op at a time which is more efficient and ideal
usage.
However in order to allow extensibility, I didn't mention limitation on
burst_size.
Because If PMD doesn't support burst_size > 1 it can always return
nb_enqueued = 1, in which case
app can enqueue next however with condition it should wait for
previous
to complete
before making next enqueue call.
So, if we take simple example to compress 2k of data with src mbuf size
=
1k.
Then with burst_size=1, expected call flow would be(this is just one
flow,
other variations are also possible
1. fill 1st 1k chunk of data in op.msrc
2.enqueue_stream (..., |op.flush = no_flush|, 1, ptr_stream);
3.dequeue_burst(|op|,1);
4.refill next 1k chunk in op.msrc
5.enqueue_stream(...,|op.flush = full_flush|, 1 , ptr_stream);
6.dequeue_burst(|op|, 1);
7.end
So, I don’t see much of a change in API call flow from here to design
presented below except nb_ops = 1 in
each call.
However I am assuming that op structure would still be same for
stateful
processing i.e. it would start with
op.flush value = NO/SYNC_FLUSH and end at op with flush value = FULL
FLUSH.
Are we on same page here?
Thanks
Shally
[Fiona] We still have a different understanding of the stateful flow
needed
on the API.
I’ll try to clarify and maybe we can set up a meeting to discuss.
• Order of ops on a qp must be maintained – ops should be dequeued
in same sequence they are enqueued.
• Ops from many streams can be enqueued on same qp.
• Ops from a qp may be fanned out to available hw or sw engines and
processed in parallel, so each op must be independent.
• Stateless and stateful ops can be enqueued on the same qp
Submitting a burst of stateless ops to a qp is no problem.
Submitting more than 1 op at a time from the same stateful stream to a
qp is
a problem.
Appl submits 2 ops in same stream in a burst, each has src and dest
mbufs,
input length/offset and
requires checksum to be calculated.
The first op must be processed to completion before the second can be
started as it needs the history and the checksum so far.
If each dest mbuf is big enough so no overflow, each dest mbuf will be
partially filled. This is probably not
what’s desired, and will force an extra copy to make the output data
contiguous.
If the dest mbuf in the first op is too small, then does the PMD alloc more
memory in the dest mbuf?
Or alloc another mbuf? Or fail and the whole burst must be resubmitted?
Or store the 2nd op, wait, on seeing the OUT_OF_SPACE on the 1st op,
overwrite the src, dest, len etc of the 2nd op
to include the unprocessed part of the 1st op?
In the meantime, are all other ops on the qp blocked behind these?
For hw accelerators it’s worse, as PMD would normally return once ops
are
offloaded and the dequeue would
pass processed ops straight back to the appl. Instead, the enqueue would
need to kick off a thread to
dequeue ops and filter to find the stateful one, storing the others til the
next
application dequeue is called.
Above scenarios don’t lend themselves to accelerating a packet
processing
workload.
It pushes a workload down to all PMDs which I believe belongs above this
API
as
that work is not about offloading the compute intensive compression
work
but
about the sequencing of data and so is better coded once, above the API
in
an application layer
common to all PMDs. (See Note1 in
http://dpdk.org/ml/archives/dev/2017-
October/078944.html )
If an application has several packets with data from a stream that it needs to (de)compress statefully, what it probably wants is for the output data to fill each output buffer completely before writing to the next buffer. Chaining the src mbufs in these pkts into one chain and sending them as one op allows the output data to be packed into a dest mbuf or mbuf chain.
I think what's needed is a layer above the API to accumulate incoming packets while waiting for the previous set of packets to be compressed. Forwarding to the PMD to queue there is not the right place to buffer them, as the queue should be per stream rather than on the accelerator engine's queue, which has lots of other independent packets.
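The "chain the src mbufs and send as one op" idea above can be mocked up host-side without DPDK. The sketch below uses plain byte buffers in place of mbufs; `pkt_frag` and `chain_frags` are invented names for illustration, not DPDK APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf segment: just a pointer + length. */
struct pkt_frag {
    const unsigned char *data;
    size_t len;
};

/*
 * Accumulate the src "mbufs" of several packets into one contiguous
 * input, mimicking chaining them and submitting a single op so the
 * (de)compressed output packs dest buffers back-to-back.
 * Returns total bytes gathered, or 0 if dst is too small.
 */
size_t chain_frags(const struct pkt_frag *frags, int nb_frags,
                   unsigned char *dst, size_t dst_len)
{
    size_t off = 0;
    for (int i = 0; i < nb_frags; i++) {
        if (off + frags[i].len > dst_len)
            return 0; /* caller must supply a bigger buffer */
        memcpy(dst + off, frags[i].data, frags[i].len);
        off += frags[i].len;
    }
    return off;
}
```

In a real application the accumulation layer would of course chain mbufs rather than copy, but the buffering responsibility sits in the same place: above the API, per stream.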
[Shally] Ok. I believe I get it. In general I agree to this proposal. However I have a concern on one point, i.e. order maintenance. Please see further for more explanation.
• Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent.
[Shally] Possible only if PMD supports a combination of SW and HW processing. Right?
[Fiona] Not necessarily, Intel QuickAssist accelerators are HW and can process ops from same qp in parallel
• Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
[Shally] If each op is independent then why do we need to maintain ordering? Since they're independent and thus can be processed in parallel, they can well be quite out-of-order and available for dequeue as soon as completed. Serializing them will limit HW throughput capability. And I can envision some apps may not care about ordering, just completion.
So I would suggest that an application which needs ordering should tag each op with some id or serial number in the op user_data area to identify enqueue order, OR we may add a flag in the enqueue_burst() API to enforce serialized dequeuing, if that's a hard requirement of any.
[Fiona] Ok, I think you're right, this requirement isn't needed. In stateless ops it's not needed. For stateful, the appl should only have one op per stream inflight at any time, so it manages the ordering. So we can specify on the API that ordering is not necessarily maintained on the qp and PMDs may return responses out-of-order. The responsibility is on the application to maintain order if it's needed. If later we find some argument for maintaining order I'd suggest a configuration param per qp or even per device rather than on the enqueue_burst()
[Shally] Done.
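The tagging scheme just agreed on — the app stamps each op with a serial number in its user_data and restores order itself after dequeue — can be sketched without compressdev. `done_op` and `restore_order` are invented stand-ins; a real op would be an `rte_comp_op` carrying the tag in its user data area.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for a completed op carrying its enqueue-order tag. */
struct done_op {
    uint64_t seq;  /* serial number the app stored at enqueue time */
    int result;    /* whatever the op produced */
};

/*
 * Insertion-sort dequeued ops back into enqueue order by their seq tag.
 * Bursts are small (tens of ops), so O(n^2) is fine here.
 */
void restore_order(struct done_op *ops, int n)
{
    for (int i = 1; i < n; i++) {
        struct done_op key = ops[i];
        int j = i - 1;
        while (j >= 0 && ops[j].seq > key.seq) {
            ops[j + 1] = ops[j];
            j--;
        }
        ops[j + 1] = key;
    }
}
```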
• Stateless and stateful ops can be enqueued on the same qp
• Stateless and stateful ops can be enqueued in the same burst
• Only 1 op at a time may be enqueued to the qp from any stateful stream.
• A burst can have multiple stateful ops, but each must be from a different stream.
• All ops will have a session attached – this will only contain immutable data which can be used by many ops, devices and or drivers at the same time.
• All stateful ops will have a stream attached for maintaining state and history, this can only be used by one op at a time.
A single enque_burst() *can* carry multiple streams. I.E. This is allowed both in a burst or in a qp (say, when multiple threads call enque_burst() on the same qp)
-------------------------------------------------------------------------------------------
enque_burst (| op1.no_flush | op2.no_flush | op3.flush_final | op4.no_flush | op5.no_flush |)
-------------------------------------------------------------------------------------------
Where all op1, op2...op5 belong to *different* streams. Op3 can be stateless/stateful depending upon op_type value and each can have *same or different* sessions.
[Fiona] Exactly
If I understand this right, then yes it looks good to me. However this also brings up one minor point for discussion, but I would wait to initiate that until we close on the current open points.
Thanks
Shally
[Shally] Since we are in sync now, I will bring up another point for discussion.
I'm thinking probably we should have a stream regardless of op_type, where it should be marked *mandatory* for stateful but *optional* (or may be mandatory) for stateless, as having it for stateless may help too.
Currently we see the stream as a resource which maintains states (et al) for stateful processing, but this is also a placeholder where PMD can choose to do one-time resource setup common to both op_types (such as allocating an instruction from its internal pool and initializing it with session params).
So for such PMD designs, it will be beneficial to use a stream for stateless as well, as it will help minimize instruction setup time on the data path since all 1-time operations will be done in stream_create().
In case of stateless, if a stream is present it would mean it is available for next use as soon as the last associated op is dequeued, as it holds no context of the last op, only common resources re-useable for the next op.
[Fiona] We intend to use the session private area for similar. But I don't see a problem with what you suggest.
We can either add a capability which the appl must check to know if it should call stream_create() for STATELESS sessions, OR we could say stream_create() should be called for every session; if it returns non-NULL for a stateless session then it should be attached to every op sent to the session?
[Shally] Apart, there's another point. We can enable the API spec to leverage the mempool object cache concept by allowing PMD to allocate the stream from the op pool as per-op private data, i.e. each object elt_size = sizeof(rte_comp_op) + user_size + stream_size.
This would help PMD reduce memory access time if caching is enabled, as each op's stream resides with it in cache rather than having them come from different pools with different policies.
[Fiona] I'm not sure about this. The intention was that the op-pool would be device-independent. So the appl would have to retrieve the size of stream from all PMDs and size for the largest. So it could be wasteful of memory.
For stateful I can see, if memory was not an issue, this would be good, as one op could be re-used with the stream already attached.
But for stateless, I think you're suggesting there would be one stream used by many ops in a burst or even in different bursts. So how could that work, as each op would have a separate stream?
[Shally] If agreed, then it can be enabled in the API spec such as (this is just a proposal, there could be others..)

struct rte_comp_op_pool_private {
    uint16_t user_size;
    /**< Size of private user data with each operation. */
    uint16_t dev_priv_data_size;
    /**< Size of device private data with each operation. */
};

int rte_comp_stream_create(uint32_t dev_id,
                           rte_comp_session *sess,
                           void **stream,
                           rte_mempool *op_pool /* optional */);

- This will map to PMD ops stream_create(). Here if op_pool != NULL, then the PMD can re-allocate the op pool with new elt_size = rte_comp_op + op_private->user_size + dev_private_stream_size so that the stream resides in the op private area.
Or, we may add another API altogether to do stream allocations from the op_pool, such as rte_comp_stream_create_from_pool(...);
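The elt_size arithmetic in the proposal above is simple enough to pin down in code. This is a sketch of the intended layout, not part of the proposal itself; the cache-line alignment of each region is an assumption I've added, and `comp_op_elt_size` is an invented helper.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed alignment granularity */

/* Round x up to the next multiple of a (a must be a power of two). */
size_t align_up(size_t x, size_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/*
 * Mempool element size when the per-op user area and the PMD's
 * stream are carved out of the op object itself:
 *   | rte_comp_op | user data | device stream |
 * op_size stands in for sizeof(struct rte_comp_op).
 */
size_t comp_op_elt_size(size_t op_size, size_t user_size,
                        size_t dev_stream_size)
{
    size_t elt = align_up(op_size, CACHE_LINE);
    elt += align_up(user_size, CACHE_LINE);
    elt += align_up(dev_stream_size, CACHE_LINE);
    return elt;
}
```

This also makes Fiona's objection concrete: dev_stream_size differs per PMD, so a shared, device-independent op pool would have to be sized for the largest stream any PMD reports.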
[Shally] I will issue RFC Doc v2 after your feedback on these.
[Fiona] Would it be preferable to issue v2 even with the design so far? To give Ahmed and the community a better doc to review. Also I'm on holidays until after Christmas so will not get back to this til 8th January.
Thanks
Shally
enum rte_comp_op_type {
    RTE_COMP_OP_STATELESS,
    RTE_COMP_OP_STATEFUL
};

enum rte_comp_op_type op_type;
void *stream_private;
/* location where PMD maintains stream state – only required if op_type is STATEFUL, else set to NULL */

As the size of stream data will vary depending on the PMD, each PMD or device should provide:

rte_comp_stream_create(uint8_t dev_id, rte_comp_session *sess, void **stream);
/* This should alloc a stream from the device's mempool and initialise it. This handle will be passed to the PMD with every op in the stream. Q. Should qp_id also be added, with the constraint that all ops in the same stream should be sent to the same qp? */

rte_comp_stream_free(uint8_t dev_id, void *stream);
/* This should clear the stream and return it to the device's mempool */

All ops are enqueued/dequeued to device & qp using the same rte_compressdev_enqueue_burst()/...dequeue_burst();
Re flush flags, a stateful stream would start with op.flush = NONE or SYNC and end with FULL or FINAL. STATELESS ops would just use either FULL or FINAL.
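The flush rule just stated — a stateful stream opens with NONE or SYNC and only its last op carries FULL or FINAL, while a stateless op always carries FULL or FINAL — can be captured in a small checker. The enum values mirror the flush names used in this thread but are local stand-ins, not final DPDK identifiers.

```c
#include <assert.h>

enum flush_flag { FLUSH_NONE, FLUSH_SYNC, FLUSH_FULL, FLUSH_FINAL };

/* A stateless op must carry a stream-terminating flush value. */
int stateless_flush_ok(enum flush_flag f)
{
    return f == FLUSH_FULL || f == FLUSH_FINAL;
}

/*
 * Check the flush values of the ops forming one stateful stream:
 * every op but the last uses NONE or SYNC, the last uses FULL or FINAL.
 */
int stateful_stream_ok(const enum flush_flag *ops, int n)
{
    if (n <= 0)
        return 0;
    for (int i = 0; i < n - 1; i++)
        if (ops[i] != FLUSH_NONE && ops[i] != FLUSH_SYNC)
            return 0;
    return ops[n - 1] == FLUSH_FULL || ops[n - 1] == FLUSH_FINAL;
}
```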
Let me know if you want to set up a meeting - it might be a more effective way to arrive at an API that works for all PMDs.
I'll send out a v3 today with the above plus updates based on all the other feedback.
Verma, Shally
2017-12-26 11:15:40 UTC
HI Fiona
Post by Trahe, Fiona
[Fiona] Would it be preferable to issue v2 even with design so far?
Sure will do.
Please see inline for feedback on other points.
Post by Trahe, Fiona
-----Original Message-----
Sent: 22 December 2017 20:43
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Friday, December 22, 2017 7:46 AM
Gupta, Ashish
De Lara Guarch, Pablo
Ahmed Mansour
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Fiona
-----Original Message-----
Sent: 20 December 2017 21:03
Ahmed
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
I think we are almost in sync now - a few comments below with just one
open question which I suspect was a typo.
If this is ok then no need for a meeting I think.
In this case will you issue a v2 of this doc ?
-----Original Message-----
Sent: Wednesday, December 20, 2017 7:15 AM
Cc: Athreya, Narayana Prasad
Gupta, Ashish
De Lara Guarch, Pablo
Ahmed Mansour
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Fiona
Please refer to my comments below with my understanding on two
major
points OUT_OF_SPACE and
Stateful Design.
If you believe we still need a meeting to converge on same please
share
meeting details to me.
-----Original Message-----
Sent: 15 December 2017 23:11
Cc: Athreya, Narayana Prasad
Subject: RE: [RFC v1] doc compression API for DPDK
Hi Shally,
-----Original Message-----
Sent: Thursday, December 7, 2017 5:43 AM
Cc: Athreya, Narayana Prasad
Challa, Mahipal
Subject: RE: [RFC v1] doc compression API for DPDK
//snip....
[Shally] Please note any time the output buffer runs out of space during write, then the operation will turn "Stateful". See more on Stateful under the respective section.
[Fiona] Let's come back to this later. An alternative is that OUT_OF_SPACE is returned and the application must treat it as a fail and resubmit the operation with a larger destination buffer.
[Shally] Then I propose to add a feature flag "FF_SUPPORT_OUT_OF_SPACE" per xform type for flexible PMD design, as there're devices which treat it as an error on compression but not on decompression. If it is not supported, then it should be treated as a failure condition and the app can resubmit the operation. If supported, behaviour *To-be-Defined* under stateful.
[Fiona] Can you explain 'turn stateful' some more? If the compressor runs out of space during a stateless operation, either comp or decomp, and turns stateful, how would the app know? And what would be in status, consumed and produced? Could it return OUT_OF_SPACE, and if both consumed and produced == 0
[Shally] If consumed = produced == 0, then it's not an OUT_OF_SPACE condition.
[Fiona] then the whole op must be resubmitted with a bigger output buffer. But if consumed and produced > 0 then the app could take the output and submit the next op continuing from consumed+1.
[Shally] consumed and produced will *always* be > 0 in case of OUT_OF_SPACE. OUT_OF_SPACE means the output buffer was exhausted while writing data into it and the PMD may have more to write to it. So in such case, PMD should set:
Produced = complete length of output buffer
Status = OUT_OF_SPACE
and either:
1. consumed = complete length of src mbuf, meaning PMD has read full input, OR
2. consumed = partial length of src mbuf, meaning PMD has read partial input
On seeing this status, the app should consume the output and re-enqueue the same op with an empty output buffer and src = consumed+1.
[Fiona] As this was a stateless op, the PMD cannot be expected to have stored the history and state and so cannot be expected to continue from consumed+1. This would be stateful behaviour.
[Shally] Exactly.
[Fiona] But it seems you are saying that even in this stateless case you'd like the PMDs which can store state to have the option of converting to stateful. So a PMD which can support this could return OUT_OF_SPACE with produced/consumed as you describe above; a PMD which can't support it should return an error. The appl can continue on from consumed+1 in the former case, and resubmit the full request with a bigger buffer in the latter case. Is this the behaviour you're looking for? If so the error could be something like NEED_BIGGER_DST_BUF? However, wouldn't OUT_OF_SPACE with produced=consumed=0 convey the same information on the API? It may correspond to an error on the underlying PMD, but would it be simpler on the compressdev API
[Shally] Please note as per the current proposal, the app should call the rte_compdev_enqueue_stream() version of the API if it doesn't know the output size beforehand.
[Fiona] True. But the above is only trying to describe behaviour in the stateless error case.
[Shally] Ok. Now I got the point of confusion with the term 'turns stateful' here. No, it's not like stateless to stateful conversion. Stateless operation is stateless only, and in stateless we don't expect the OUT_OF_SPACE error. So, now I also understand what you're trying to imply with produced=consumed=0.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
----------------------------------
A.1 If the operation is stateless i.e. rte_comp_op.op_type == RTE_COMP_OP_STATELESS, and the PMD runs out of buffer during compression or decompression, then it is an error condition for the PMD. It will reset itself and return with produced=consumed=0 with status OUT_OF_SPACE. On seeing this, the application should resubmit the full request with a bigger output buffer size.
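A.1's contract — on OUT_OF_SPACE a stateless op reports produced=consumed=0 and the app retries the whole request with a larger dst — amounts to the retry loop below. The "device" is a mock (a plain copy stands in for the transform); only the app-side retry policy is the point, and all names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum op_status { OP_SUCCESS, OP_OUT_OF_SPACE };

/*
 * Mock stateless "compression": copies src to dst. On overflow it
 * behaves per A.1: produced = consumed = 0, status OUT_OF_SPACE.
 */
enum op_status mock_stateless_xform(const unsigned char *src, size_t src_len,
                                    unsigned char *dst, size_t dst_len,
                                    size_t *consumed, size_t *produced)
{
    if (src_len > dst_len) {
        *consumed = 0;
        *produced = 0;
        return OP_OUT_OF_SPACE;
    }
    memcpy(dst, src, src_len);
    *consumed = src_len;
    *produced = src_len;
    return OP_SUCCESS;
}

/* App-side retry: double the dst buffer and resubmit the full request. */
size_t xform_with_retry(const unsigned char *src, size_t src_len,
                        unsigned char **out)
{
    size_t cap = 16, consumed, produced;
    unsigned char *dst = malloc(cap);
    while (mock_stateless_xform(src, src_len, dst, cap,
                                &consumed, &produced) == OP_OUT_OF_SPACE) {
        cap *= 2;                 /* grow and resubmit from the start */
        dst = realloc(dst, cap);
    }
    *out = dst;
    return produced;
}
```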
-------------------------------
B.1 If the operation is stateful i.e. rte_comp_op.op_type == RTE_COMP_OP_STATEFUL, and the PMD runs out of buffer during compression or decompression, then the PMD will update produced=consumed (as mentioned above)
[Fiona] ? Did you mean to say "will update produced & consumed" ?
[Shally] Yes you're right, that was a typo. It should be produced & consumed.
[Fiona] I think
- consumed would be <= input length (typically <)
- produced would be <= output buffer len (typically =, but could be a few bytes less)
- status would be OUT_OF_SPACE
Do you agree?
[Shally] Yes. And the app should resubmit the op with input from consumed+1 and an output buffer with free space. Please note for such case, the application should allocate a stream via a call to rte_comp_stream_create(), attach it to the op and pass it along every time the pending op is enqueued, until op processing is complete with status set to SUCCESS/FAILURE.
[Fiona] Agreed
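B.1's stateful flow — take the partial output, then re-enqueue with src advanced past `consumed` and fresh dst space — is the loop below, again against a mock transform (a plain copy) so only the resubmission logic is illustrated. All names are invented stand-ins; in the real API the history would live in the attached stream.

```c
#include <assert.h>
#include <string.h>

enum st { ST_SUCCESS, ST_OUT_OF_SPACE };

/*
 * Mock stateful "compression" step: copy as much of src as fits in dst.
 * Per B.1, on running out of room it reports partial consumed/produced
 * with OUT_OF_SPACE.
 */
enum st mock_stateful_step(const unsigned char *src, size_t src_len,
                           unsigned char *dst, size_t dst_len,
                           size_t *consumed, size_t *produced)
{
    size_t n = src_len < dst_len ? src_len : dst_len;
    memcpy(dst, src, n);
    *consumed = n;
    *produced = n;
    return n == src_len ? ST_SUCCESS : ST_OUT_OF_SPACE;
}

/*
 * App-side loop: on OUT_OF_SPACE, consume the produced output and
 * resubmit the same op with src advanced by 'consumed'. Returns the
 * number of enqueues it took.
 */
int drain_stream(const unsigned char *src, size_t src_len,
                 unsigned char *out, size_t chunk)
{
    size_t in_off = 0, out_off = 0, consumed, produced;
    int enqueues = 0;
    enum st s;
    do {
        s = mock_stateful_step(src + in_off, src_len - in_off,
                               out + out_off, chunk,
                               &consumed, &produced);
        in_off += consumed;
        out_off += produced;
        enqueues++;
    } while (s == ST_OUT_OF_SPACE);
    return enqueues;
}
```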
//snip.....
[Shally] D.2.1.2 Stateful operation state maintenance
-------------------------------------------------------------
This section starts with a description of our understanding about compression API support for stateful. Depending upon the understanding built upon these concepts, we will identify the required data structure/params to maintain in-progress operation context by the PMD.
For stateful compression, a batch of dependent packets starts at a packet having RTE_NO_FLUSH/RTE_SYNC_FLUSH flush value and ends at a packet having RTE_FULL_FLUSH/FINAL_FLUSH.
---------------------------------------------------------------
| op1.no_flush | op2.no_flush | op3.no_flush | op4.full_flush |
---------------------------------------------------------------
[Fiona] I think it needs to be more constrained than your examples below. Only 1 operation from a stream can be in a burst, as each operation in a stateful stream must complete: the next operation needs the state and history of the previous operation to be complete before it can be processed. And if one failed, e.g. due to OUT_OF_SPACE, this should affect the following operation in the same stream.
Worst case this means bursts of 1. A burst can be >1 if there are multiple independent streams with available data for processing, or if there is data available which can be statelessly processed. If there are multiple buffers available from a stream, then instead they can be linked together in an mbuf chain sent in a single operation.
To handle the sequences below would mean the PMD would need to store ops, sending one at a time to be processed. As this is significantly different from what you describe below, I'll wait for further feedback before continuing.
[Shally] I concur with your thoughts. And they're not significantly different from the concept presented below.
Yes as you mentioned, even for burst_size>1 the PMD will have to serialize each op internally, i.e. it has to wait for the previous one to finish before putting the next up for processing, which is as good as the application making serialised calls passing one op at-a-time, or, if the stream consists of multiple buffers, making their scatter-gather list and then enqueuing it as one op at a time, which is more efficient and the ideal usage.
However in order to allow extensibility, I didn't mention a limitation on burst_size. Because if the PMD doesn't support burst_size > 1 it can always return nb_enqueued = 1, in which case the app can enqueue the next, with the condition that it should wait for the previous one to complete before making the next enqueue call.
So, if we take a simple example of compressing 2k of data with src mbuf size = 1k, then with burst_size=1 the expected call flow would be (this is just one flow, other variations are also possible):
1. fill 1st 1k chunk of data in op.msrc
2. enqueue_stream (..., |op.flush = no_flush|, 1, ptr_stream);
3. dequeue_burst(|op|, 1);
4. refill next 1k chunk in op.msrc
5. enqueue_stream(..., |op.flush = full_flush|, 1, ptr_stream);
6. dequeue_burst(|op|, 1);
7. end
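The 7-step flow can be mocked end-to-end: a stream object accumulates input across no_flush submissions and only completes on full_flush. `mock_enqueue_stream` is an invented stand-in for the proposed enqueue_stream()/dequeue_burst() pair, not a real DPDK call, and "compression" here is just a byte count.

```c
#include <assert.h>
#include <stddef.h>

enum mflush { MF_NO_FLUSH, MF_FULL_FLUSH };

/* Stand-in for the PMD-private stream state: bytes seen so far. */
struct mock_stream {
    size_t total_in;
};

/*
 * Stand-in for one enqueue_stream()+dequeue_burst() round trip: the
 * stream accumulates input; on full_flush the "compressed" length
 * (here just the running total) is reported and the stream resets,
 * becoming free for the next use.
 */
size_t mock_enqueue_stream(struct mock_stream *s,
                           const unsigned char *src, size_t len,
                           enum mflush flush)
{
    s->total_in += len;
    if (flush == MF_FULL_FLUSH) {
        size_t out = s->total_in;
        s->total_in = 0;  /* stream is free for reuse */
        return out;
    }
    return 0; /* nothing finalized yet */
}
```

Submitting two 1k chunks, the first with no_flush and the second with full_flush, mirrors steps 1-7 above: only the final call reports the completed stream.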
So, I don't see much of a change in API call flow from here to the design presented below except nb_ops = 1 in each call. However I am assuming that the op structure would still be the same for stateful processing, i.e. it would start with op.flush value = NO/SYNC_FLUSH and end at an op with flush value = FULL_FLUSH. Are we on the same page here?
Thanks
Shally
[Fiona] We still have a different understanding of the stateful flow needed on the API. I'll try to clarify and maybe we can set up a meeting to discuss.
• Order of ops on a qp must be maintained – ops should be dequeued in same sequence they are enqueued.
• Ops from many streams can be enqueued on same qp.
• Ops from a qp may be fanned out to available hw or sw engines and processed in parallel, so each op must be independent.
• Stateless and stateful ops can be enqueued on the same qp
[Fiona] We intend to use session private area for similar. But I don't see a problem with what you suggest. We can either add a capability which appl must check to know if it should call stream_create() for STATELESS sessions OR we could say stream_create should be called for every session; if it returns non-NULL for a stateless session then it should be attached to every op sent to the session?
[Shally] I would prefer the second option, i.e. "stream_create should be called for every session". It will give more flexibility to the PMD as then it can decide its support per session. Also, it keeps the spec simple.
Post by Trahe, Fiona
Apart, there's another point. We can enable API spec to leverage mempool object cache concept by allowing PMD to allocate stream from op pool as per-op private data i.e. each object elt_size = sizeof(rte_comp_op) + user_size + stream_size. This would help PMD reduce memory access time if caching is enabled as each op stream resides with it in cache rather than having them from different pool with different policies.
[Fiona] I'm not sure about this. The intention was the op-pool would be device-independent. So the appl would have to retrieve the size of stream from all PMDs and size for the largest. So could be wasteful for memory. For stateful I can see, if memory was not an issue this would be good, as one op could be re-used, stream would be already attached. But for stateless, I think you're suggesting there would be one stream used by many ops in a burst or even in different bursts. So how could that work as each op would have a separate stream?
[Shally] No, that's not what I intend. Each op would still have a separate stream. The only requirement I raised is that the stream could also be kept in the per-op private area to utilize object caching, because in any case each op uses a separate stream.
It can be implemented in multiple ways (maybe add another API, say alloc_stream_from_op_priv_area(), or simply add get_stream_size()). Surely, this makes the op pool device-dependent, so I see it as an additional feature to meet the specific requirement of apps using mempool caching. However I do not see it as a blocker to freezing the current spec and thus suggest to leave it open for discussion and add it later as an incremental patch if we see value in it.
Thanks
Shally