Using HPC instructions to accelerate DPDK and FD.io
In 2020 I wrote a series of white-papers describing how to use the Intel AVX-512 SIMD
instruction-set (AVX-512) to accelerate packet processing applications. AVX-512
is well known for its ability to accelerate AES cryptography with the AVX-512 Vector
AES (vAES) instructions. However, what may be counterintuitive to some is using an
instruction-set like AVX-512, which was primarily designed for HPC-type workloads,
to accelerate networking.
When looking at the kinds of optimizations we were using in DPDK and FD.io, I
pulled out a number of common threads and decided to describe them in a series
of papers. In broad strokes, the non-exhaustive list of optimizations described
is:
- Data-structure translation, that is, using shuffle and permute instructions
  to translate from one data structure to another. A good example of this is
  the DPDK i40e and ICE PMDs, which do this to translate NIC descriptors into
  DPDK mbufs. This also involves a bit of bitfield translation along the way.
  You can find these examples in the DPDK source code in the AVX-512 variants
  of the DPDK i40e and ICE Poll-Mode-Drivers’ RX functions (a minimal permute
  sketch follows this list).
- Parallelizing table lookups: the DPDK FIB and ACL libraries both use a
  similar set of optimizations to accelerate lookups (sketched after this
  list). This involves:
  - The batching of operations: both libraries support a batch interface where
    multiple lookups are passed in a single call to the library, e.g.
    rte_fib_lookup_bulk and rte_acl_classify.
  - The parallelizing of table offset calculations: SIMD adds, shifts, etc.,
    e.g. _mm512_add_* and _mm512_srli_*, calculate the offsets of multiple
    table entries in parallel.
  - The parallelizing of table entry loads: SIMD gather operations, e.g.
    _mm512_*gather_* and friends, load multiple table entries in parallel.
- Match and copy operations on large buffers: these operations all use the
  wider AVX-512 ZMM registers to operate on 64-byte (512-bit) buffers in one
  operation (sketched after this list). Some examples are:
  - DPDK and FD.io both use the wider AVX-512 registers to accelerate memory
    copy operations, e.g. rte_memcpy (DPDK), clib_memcpy (FD.io) and friends.
  - FD.io classifiers involve a mask-and-hash operation to calculate the hash
    of a buffer, and a mask-and-match operation to match a buffer to a
    classifier entry. Both have AVX-512 variants that SIMD AND large buffers
    with a mask and then SIMD XOR the result.
- Caching table lookups: the FD.io VPP Virtual Tunnel Endpoint (vTEP) code uses
  AVX-512 to look up VTEP IDs, e.g. VXLAN Tunnel IDs, in a simple cache. This
  enables checks for valid VTEP IDs to be accelerated (sketched after this
  list). This involves:
  - Splat’ing the VTEP ID for a given packet across the eight 64-bit lanes of
    an AVX-512 register.
  - Matching the splat’ed value against a cache of eight valid VTEP IDs in a
    single operation.
  - If any of the VTEP IDs matches, we get a non-zero result and processing of
    the packet can continue.
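As a flavour of the first technique, here is a minimal sketch of field reordering with a single AVX-512 permute. The 32-bit field order and the helper name translate_desc_fields are hypothetical, for illustration only; the real i40e and ICE RX paths use their own shuffle/permute control vectors matched to the actual descriptor and rte_mbuf layouts. All of the sketches below assume a compiler with AVX-512F support (e.g. gcc/clang with -mavx512f).

```c
#include <immintrin.h>

/* Minimal sketch only: the 32-bit field order below is hypothetical and is
 * not the real i40e/ICE descriptor or rte_mbuf layout. It illustrates the
 * idea of reordering fields from a "descriptor" layout into a "metadata"
 * layout with a single permute (vpermd). */
static inline __m512i
translate_desc_fields(__m512i raw_desc)
{
    /* Each index selects which 32-bit source lane lands in that output
     * lane. The real drivers build similar shuffle/permute control
     * vectors to place descriptor fields where the mbuf expects them. */
    const __m512i field_order = _mm512_set_epi32(15, 14, 13, 12,
                                                 11, 10,  9,  8,
                                                  6,  7,  4,  5,
                                                  2,  3,  0,  1);
    return _mm512_permutexvar_epi32(field_order, raw_desc);
}
```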
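The table-lookup pattern can be sketched with a deliberately simplified, direct-indexed next-hop table; the real rte_fib and rte_acl data structures are considerably more involved, and lookup16_nexthops below is a hypothetical helper, not a DPDK API. The point is only the shape of the computation: a SIMD shift produces sixteen indices, and one gather fetches sixteen entries.

```c
#include <immintrin.h>
#include <stdint.h>

/* Minimal sketch only: a flat table of 32-bit next hops indexed by the top
 * 16 bits of an IPv4 address. This is far simpler than the real rte_fib or
 * rte_acl structures; it only shows sixteen offsets being computed with a
 * SIMD shift and sixteen entries being fetched with a single gather. */
static inline void
lookup16_nexthops(const uint32_t *table, const uint32_t ips[16],
                  uint32_t nhops[16])
{
    /* Load sixteen destination addresses at once. */
    __m512i dst = _mm512_loadu_si512((const void *)ips);

    /* Compute sixteen table indices in parallel. */
    __m512i idx = _mm512_srli_epi32(dst, 16);

    /* Gather sixteen 4-byte table entries in parallel (scale = 4). */
    __m512i nh = _mm512_i32gather_epi32(idx, (const void *)table, 4);

    _mm512_storeu_si512((void *)nhops, nh);
}
```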
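For the match and copy category, the sketch below shows a 64-byte copy through a single ZMM register and the AND-then-XOR match pattern on a full 64-byte buffer. Again, copy_64b and match_64b are hypothetical helpers, not the actual rte_memcpy, clib_memcpy or FD.io classifier implementations.

```c
#include <immintrin.h>

/* Minimal sketches only: hypothetical helpers, not the actual rte_memcpy,
 * clib_memcpy or FD.io classifier implementations. */

/* Copy one 64-byte buffer through a single ZMM load/store pair; the real
 * memcpy fast paths build on this for arbitrary sizes and alignments. */
static inline void
copy_64b(void *dst, const void *src)
{
    _mm512_storeu_si512(dst, _mm512_loadu_si512(src));
}

/* Mask-and-match on a full 64-byte buffer: (pkt AND mask) XOR key is zero
 * in every lane if, and only if, the buffer matches the entry. */
static inline int
match_64b(const void *pkt, const void *mask, const void *key)
{
    __m512i p = _mm512_loadu_si512(pkt);
    __m512i m = _mm512_loadu_si512(mask);
    __m512i k = _mm512_loadu_si512(key);

    __m512i diff = _mm512_xor_si512(_mm512_and_si512(p, m), k);

    /* vptestmq sets a mask bit for every 64-bit lane with non-zero bits,
     * so an all-zero result means a match. */
    return _mm512_test_epi64_mask(diff, diff) == 0;
}
```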
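Finally, the vTEP check can be sketched as a broadcast plus a single eight-lane compare; vtep_cache_hit and the eight-entry cache layout are illustrative assumptions rather than the actual FD.io VPP code.

```c
#include <immintrin.h>
#include <stdint.h>

/* Minimal sketch only: vtep_cache_hit and the eight-entry cache layout are
 * illustrative assumptions, not the FD.io VPP implementation. */
static inline int
vtep_cache_hit(const uint64_t cache[8], uint64_t pkt_vtep_id)
{
    /* Splat the packet's VTEP ID across all eight 64-bit lanes. */
    __m512i needle = _mm512_set1_epi64((long long)pkt_vtep_id);

    /* Load the eight cached, known-valid VTEP IDs. */
    __m512i valid = _mm512_loadu_si512((const void *)cache);

    /* One eight-lane compare: any set bit in the mask means a hit, and
     * processing of the packet can continue. */
    __mmask8 hits = _mm512_cmpeq_epi64_mask(needle, valid);
    return hits != 0;
}
```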
All of these optimizations and more are detailed in the white-papers themselves.
- There is a Solution Brief, an executive summary for the manager-types who
  just want the broad strokes.
- There is an Instruction-set Overview, which provides a basic primer on the
  AVX-512 instruction-set, as well as detailing some of the micro-architectural
  features associated with the instruction-set, including core power
  utilization and the associated uop execution ports.
- There is a Technology Guide to writing packet processing software, which
  provides a basic primer on writing packet processing software with the
  AVX-512 instruction-set.