
GPU packet classification on Grace Blackwell: from 576us to 4us

I got access to a DGX Spark - NVIDIA's desktop supercomputer built around the GB10 Grace Blackwell Superchip - and decided to answer a question that had been bothering me for a while: can you usefully offload packet classification to a GPU when the CPU and GPU share the same physical memory?

The short answer is yes, but you have to think carefully about which algorithm you run on the GPU.

The hardware is the unusual part

Most GPU machines have the GPU on a PCIe card with its own GDDR memory. Copying data to the GPU costs you a DMA transfer and several microseconds of latency even before the kernel runs. That overhead dominates for packet-by-packet or frame-by-frame workloads, which is why GPU networking never quite happened as a mainstream thing.

Grace Blackwell is different. The CPU (NVIDIA Grace, ARM Neoverse V2) and the GPU (Blackwell GB10) are joined by NVLink-C2C, a die-to-die interconnect that makes them share the same physical DRAM with full cache coherency. CUDA's cudaMallocManaged() gives you a pointer that both sides can dereference natively - no cudaMemcpy, no page migration, no fuss. The GPU sees CPU writes with hardware coherency, not through a software flush.

That changes the calculus for latency-sensitive workloads.

The testbed: VPP

VPP (Vector Packet Processing) is the open-source dataplane that powers a large fraction of telco and cloud routing infrastructure. It processes packets in batches of up to 256 (a "frame") and organizes optional per-interface processing into ordered chains called "feature arcs". Plugging a new processing stage into an arc is a matter of registering a graph node plus one feature-registration macro.

The plugin I built - gpu_classify - hooks into the ip4-unicast and ip6-unicast arcs. For each frame, the CPU fills a descriptor array with the five-tuple and IP metadata, launches a CUDA kernel, waits for results, and routes packets accordingly. Three actions: PASS, DROP, MARK.

The full source lives at https://github.com/ayourtch-llm/vpp-spark-gpu, branch dgx-spark-plugin.

The naive version and why it was slow

The first kernel was a straightforward linear scan: 256 threads, one per packet, each walks all N rules and stops at the first match. With shared memory tiling and a block-wide early-exit vote it was reasonably fast for small rule sets. At 64 rules the GPU beat the VPP ACL plugin. At 1024 rules, with a no-match workload (traffic hits none of the deny rules), the GPU kernel time was 576 µs per 256-packet frame.

The VPP ACL plugin at 1024 rules, same workload: 36 µs.

That is a 16× beating. The GPU was the wrong tool.

The reason is architectural. 256 threads all stall simultaneously on each rule - there is no independent warp to schedule while waiting for global memory. The ACL plugin uses O(1) hash lookups grouped by mask combination: for K distinct (src_mask, dst_mask, port/proto/flags) patterns, it does K hash probes per packet and is done. At 1024 rules with one mask pattern, K=1 and the ACL finishes in ~36 µs regardless of N.

The fix: replicate the ACL's hash strategy on the GPU

Each distinct combination of address masks and match flags gets its own open-addressing hash table. Rules with the same mask combo hash into the same table. A lookup is: mask the packet key, hash it with FNV-1a, probe the table, done.
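The mask-then-hash lookup can be sketched in plain C. The key layout and field names below are hypothetical stand-ins; the plugin's actual structures live in the repository:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical masked five-tuple key (IPv4 shown for brevity). */
typedef struct {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
    uint8_t  pad[3];        /* explicit padding, zeroed before hashing */
} tuple_key_t;

/* FNV-1a, 32-bit, over the raw key bytes. */
static uint32_t fnv1a(const void *p, size_t n) {
    const uint8_t *b = p;
    uint32_t h = 2166136261u;           /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= b[i];
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}

/* Apply the table's masks to the packet fields, then hash the result.
   Two packets that agree on the masked bits hash identically, so all
   rules sharing this mask combo land in the same table. */
static uint32_t masked_hash(tuple_key_t k, tuple_key_t mask) {
    k.src   &= mask.src;
    k.dst   &= mask.dst;
    k.sport &= mask.sport;
    k.dport &= mask.dport;
    k.proto &= mask.proto;
    memset(k.pad, 0, sizeof k.pad);
    return fnv1a(&k, sizeof k);
}
```

The probe step is then an index into the table at `hash & (n_slots - 1)` followed by linear probing, as described below.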

At rule install time, on the CPU:

  1. Scan the rule list; find distinct (src_mask, dst_mask, match_proto, match_src_port, match_dst_port, match_ip_version, tcp_flags_mask) combos. Each combo gets a table descriptor.
  2. For each table, compute slot count = next power-of-two ≥ 2 × rules in that table (50% load factor). Assign a base offset into a flat 4096-slot entry array.
  3. Insert rules with FNV-1a and linear probing. Duplicate masked keys keep the lower rule_idx (higher priority).
  4. Sort descriptors by min_rule_idx ascending - this lets the GPU break early once a match is found.
  5. memcpy everything into cudaMallocManaged buffers. Bump a rule_version counter so the persistent kernel reloads its shared-memory cache on the next frame.
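Steps 2 and 3 above can be sketched like this (illustrative `entry_t` layout and a 64-bit stand-in key; the real plugin packs entries into the flat 4096-slot array):

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_EMPTY UINT32_MAX

typedef struct {
    uint64_t key;        /* masked packet key (stand-in: one 64-bit word) */
    uint32_t rule_idx;   /* SLOT_EMPTY when the slot is free */
} entry_t;

/* Step 2: slot count = next power of two >= 2 * n_rules,
   giving a maximum load factor of 50%. */
static uint32_t slot_count(uint32_t n_rules) {
    uint32_t s = 1;
    while (s < 2 * n_rules)
        s <<= 1;
    return s;
}

/* 32-bit FNV-1a over the 8 bytes of the key. */
static uint32_t fnv1a64(uint64_t key) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < 8; i++) {
        h ^= (uint8_t)(key >> (8 * i));
        h *= 16777619u;
    }
    return h;
}

/* Step 3: insert with linear probing. On a duplicate masked key,
   keep the lower rule_idx, i.e. the higher-priority rule wins. */
static void table_insert(entry_t *tab, uint32_t n_slots,
                         uint64_t key, uint32_t rule_idx) {
    uint32_t i = fnv1a64(key) & (n_slots - 1);
    for (;;) {
        if (tab[i].rule_idx == SLOT_EMPTY) {   /* free slot: claim it */
            tab[i].key = key;
            tab[i].rule_idx = rule_idx;
            return;
        }
        if (tab[i].key == key) {               /* duplicate masked key */
            if (rule_idx < tab[i].rule_idx)
                tab[i].rule_idx = rule_idx;
            return;
        }
        i = (i + 1) & (n_slots - 1);           /* linear probe */
    }
}
```

At a 50% load factor the expected probe count per lookup stays small, which is what keeps the GPU side at a handful of global-memory reads per table.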

On the GPU, per packet per frame:

best_action = PASS, best_idx = INT32_MAX
for each table descriptor (sorted by min_rule_idx):
    if table.min_rule_idx >= best_idx: break   // early exit
    masked_key = apply table masks to packet fields
    h = FNV-1a(masked_key)
    probe flat entry array at (h & (n_slots-1)) with linear probing
    on key match: if entry.rule_idx < best_idx → update best
results[thread_id] = best_action

The persistent kernel (which stays resident across frames to avoid the 30 µs cudaStreamSynchronize overhead) loads the table descriptors - 64 tables × 64 bytes = 4 KB - into shared memory on rule_version change. Entry probing goes to global memory, but at a handful of probes per table and full Blackwell L2 caching, it is fast.
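The control flow of that persistent-kernel handshake can be sketched as a polling loop over a shared control block. This is a plain-C stand-in: the real worker is a CUDA kernel spinning on managed memory, and the "classification" here is a placeholder:

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Shared control block: the CPU bumps frame_seq when a frame's
   descriptors are ready; the resident worker bumps done_seq when
   results are published. With NVLink-C2C coherency both sides can
   poll these words directly, with no kernel relaunch per frame. */
typedef struct {
    atomic_int frame_seq;
    atomic_int done_seq;
    int        result;     /* placeholder for the results array */
} ctrl_t;

/* One polling iteration of the resident worker.
   Returns 1 if a new frame was seen and processed, 0 otherwise. */
static int worker_poll(ctrl_t *c, int *seen) {
    int seq = atomic_load(&c->frame_seq);
    if (seq == *seen)
        return 0;                       /* no new frame yet */
    *seen = seq;
    c->result = seq * 2;                /* stand-in for classification */
    atomic_store(&c->done_seq, seq);    /* publish completion */
    return 1;
}
```

The CPU side mirrors this: store the frame, bump `frame_seq`, then spin on `done_seq` until it matches, which replaces the cudaStreamSynchronize round trip.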

The same 80 KB shared memory allocation that previously held the full 1024-rule table now holds either descriptors (4 KB, hash path) or the full rule table (80 KB, linear fallback). The fallback activates automatically when more than 64 distinct mask combinations are present.

Results

All measurements: 256-packet frames, 50 000 repetitions = 12.8 M packets per data point. "kern µs" is the CPU-observed GPU round-trip, measured with clock_gettime around the kernel dispatch; it excludes the VPP arc overhead (~30 µs, constant for both GPU and ACL). The GPU/ACL column is the ratio of end-to-end throughputs (Mpps), so it includes that overhead on both sides.
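The round-trip measurement boils down to a monotonic-clock delta around the dispatch, something like this sketch (the actual benchmark code lives in the repository):

```c
#include <assert.h>
#include <time.h>

/* Microseconds between two CLOCK_MONOTONIC samples taken around
   the kernel dispatch + completion poll. */
static double elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}
```

Usage is the obvious pair of calls: clock_gettime(CLOCK_MONOTONIC, &t0), dispatch and wait for results, clock_gettime(CLOCK_MONOTONIC, &t1), then elapsed_us(t0, t1).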

No-match, varying rule count - traffic hits none of the deny rules:

 Rules    GPU kern µs    ACL µs/fr    GPU/ACL
     1          3.7         37.6       1.06×
     8          3.7         37.4       1.06×
    64          4.0         37.6       1.06×
   256          3.8         37.8       1.07×
  1024          3.8         37.4       1.06×

3.8 µs flat at 1024 rules. Previously 576 µs. All 1024 rules share the same mask combo (proto + dport), so they hash into a single table: one O(1) probe per packet regardless of N. That is a 150× speedup.

Last-match, varying rule count - worst case for a linear scan, now a direct hash lookup:

 Rules    GPU kern µs    ACL µs/fr    GPU/ACL
     1          3.8         36.8       1.04×
  1024          5.4         36.7       0.99×

Previously also ~576 µs. The matching rule lives at index N-1, but the hash table finds it in one probe.

Diverse prefix lengths (Scenario 4) - 1024 dst-only rules spread across K distinct prefix lengths; both the GPU plugin and the ACL build one hash table per distinct mask combination, K tables in total:

Tables    GPU kern µs    ACL µs/fr    GPU/ACL
     1          5.6         37.4       1.03×
     8         11.9         39.9       0.96×
    16         16.7         50.7       1.01×
    21         22.1         57.1       1.06×

Previously ~518 µs kern at K=21 (linear scan). Now 22 µs - 23× faster. At K ≥ 16 the GPU pulls ahead of ACL; at K=8 it is roughly equal. The baseline (pure VPP forwarding, no feature) is 26.8 µs/frame, so at low K both plugins add only a few microseconds on top.

How to replicate

You need a DGX Spark (or any Grace Blackwell system with NVLink-C2C and CUDA 12.8+). The repository includes a runme script that installs CUDA and VPP build dependencies from a bare container.

git clone https://github.com/ayourtch-llm/vpp-spark-gpu.git
cd vpp-spark-gpu
git checkout dgx-spark-plugin
./runme          # installs cuda-toolkit-12-8, VPP build deps
make build-release

Run the functional tests (19 tests, IPv4 and IPv6, all actions):

make test TEST=test_gpu_classify

Run the GPU vs ACL benchmark:

make test TEST=test_gpu_classify_bench

The benchmark prints four scenario tables to stdout, with GPU Mpps, GPU µs/frame, kern µs, ACL Mpps, ACL µs/frame, and the GPU/ACL ratio side by side.

To try rules interactively from the VPP CLI:

gpu-classify enable GigabitEthernet0/0
gpu-classify rule add proto 6 dport 443 action drop
gpu-classify rule add src 10.0.0.0/8 action mark
show gpu-classify

show gpu-classify reports the number of hash tables built, total slot count, and the full latency histogram (p50/p99/p99.9).

What I learned

The NVLink-C2C coherency really does remove the DMA tax. The persistent kernel trick (launching once and polling a shared control block) removes the 30 µs cudaStreamSynchronize cost. But neither of those mattered until the kernel algorithm matched the ACL plugin's O(K) complexity - before that, the GPU was just a slow linear scanner that happened to avoid page faults.

The broader lesson is familiar: hardware throughput is not the bottleneck until the algorithm is right.

A few things deliberately left out: the persistent-kernel adaptive hysteresis (activates after 4 busy frames, deactivates after 16 idle), the IPv6 support, the dual-use shared memory trick, and the latency histogram. Maybe another post.

Developed and written in pair programming with Claude Sonnet 4.6 via Claude Code. The exchange that produced the hash-table design is in the commit history.

Comments welcome on LinkedIn or Bluesky.

(c) Andrew Yourtchenko