I got access to a DGX Spark - NVIDIA's desktop supercomputer built around the GB10 Grace Blackwell Superchip - and decided to answer a question that had been bothering me for a while: can you usefully offload packet classification to a GPU when the CPU and GPU share the same physical memory?
The short answer is yes, but you have to think carefully about which algorithm you run on the GPU.
Most GPU machines have the GPU on a PCIe card with its own GDDR memory. Copying data to the GPU costs you a DMA transfer and several microseconds of latency even before the kernel runs. That overhead dominates for packet-by-packet or frame-by-frame workloads, which is why GPU networking never quite happened as a mainstream thing.
Grace Blackwell is different. The CPU (NVIDIA Grace, ARM Neoverse V2) and the GPU (Blackwell
GB10) are joined by NVLink-C2C, a die-to-die interconnect that makes them share the same
physical DRAM with full cache coherency. CUDA's cudaMallocManaged() gives you a
pointer that both sides can dereference natively - no cudaMemcpy, no page
migration, no fuss. The GPU sees CPU writes with hardware coherency, not through a software
flush.
That changes the calculus for latency-sensitive workloads.
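To make the programming model concrete, here is a minimal standalone CUDA sketch (my own illustration, not code from the plugin) of what the coherent managed pointer buys you: the CPU writes through it, the kernel updates it, and neither side ever calls cudaMemcpy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bump(int *shared_counter) {
    // The GPU dereferences the very pointer the CPU wrote through.
    atomicAdd(shared_counter, 1);
}

int main() {
    int *ctr;
    // One allocation visible to both sides; on Grace Blackwell the
    // NVLink-C2C fabric keeps it cache-coherent in the shared DRAM.
    cudaMallocManaged(&ctr, sizeof(int));
    *ctr = 41;                    // CPU write, no explicit flush
    bump<<<1, 1>>>(ctr);
    cudaDeviceSynchronize();
    printf("%d\n", *ctr);         // CPU reads the GPU's update: 42
    cudaFree(ctr);
    return 0;
}
```

On a discrete PCIe GPU the same code works, but managed pages migrate on fault; on GB10 the loads and stores just hit the shared DRAM.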
VPP (Vector Packet Processing) is the open-source dataplane that powers a large fraction of telco and cloud routing infrastructure. It processes packets in batches of up to 256 (a "frame") and structures its graph as a chain of "feature arcs". Plugging a new processing stage into an arc is a matter of registering a graph node and adding one feature macro.
The plugin I built - gpu_classify - hooks into the ip4-unicast and
ip6-unicast arcs. For each frame, the CPU fills a descriptor array with the
five-tuple and IP metadata, launches a CUDA kernel, waits for results, and routes packets
accordingly. Three actions: PASS, DROP, MARK.
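The exact descriptor layout is not spelled out above; as a sketch (field names are hypothetical, not the plugin's), one fixed-size record per packet might look like this:

```c
#include <stdint.h>

/* Hypothetical per-packet descriptor the CPU fills for the kernel:
 * the five-tuple plus IP metadata. Field names are illustrative. */
typedef struct {
    uint8_t  src_addr[16];  /* IPv4 uses the first 4 bytes */
    uint8_t  dst_addr[16];
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
    uint8_t  ip_version;    /* 4 or 6 */
    uint8_t  tcp_flags;
    uint8_t  pad;           /* keep the record word-aligned */
} pkt_desc_t;

/* The three verdicts the kernel can return per packet. */
enum gpu_classify_action { ACTION_PASS = 0, ACTION_DROP = 1, ACTION_MARK = 2 };
```

With 256 packets per frame this is a 10 KB descriptor array, small enough to live permanently in a managed buffer that both sides touch.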
The full source lives at
https://github.com/ayourtch-llm/vpp-spark-gpu,
branch dgx-spark-plugin.
The first kernel was a straightforward linear scan: 256 threads, one per packet, each walks all N rules and stops at the first match. With shared memory tiling and a block-wide early-exit vote it was reasonably fast for small rule sets. At 64 rules the GPU beat the VPP ACL plugin. At 1024 rules, with a no-match workload (traffic hits none of the deny rules), the GPU kernel time was 576 µs per 256-packet frame.
The VPP ACL plugin at 1024 rules, same workload: 36 µs.
That is a 16× beating. The GPU was the wrong tool.
The reason is architectural. 256 threads all stall simultaneously on each rule - there is no independent warp to schedule while waiting for global memory. The ACL plugin uses O(1) hash lookups grouped by mask combination: for K distinct (src_mask, dst_mask, port/proto/flags) patterns, it does K hash probes per packet and is done. At 1024 rules with one mask pattern, K=1 and the ACL finishes in ~36 µs regardless of N.
Each distinct combination of address masks and match flags gets its own open-addressing hash table. Rules with the same mask combo hash into the same table. A lookup is: mask the packet key, hash it with FNV-1a, probe the table, done.
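Here is a host-side sketch of that install-and-probe path, simplified to an IPv4-only key (structure and function names are mine, not the plugin's):

```c
#include <stdint.h>
#include <string.h>

/* Simplified masked key: IPv4 addresses, ports, protocol. 16 bytes
 * with no implicit padding, so hashing the raw bytes is deterministic. */
typedef struct {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto, pad[3];
} ckey_t;

typedef struct { ckey_t key; int32_t rule_idx; uint8_t used, action; } slot_t;

/* 32-bit FNV-1a over the masked key bytes. */
static uint32_t fnv1a(const void *p, size_t n) {
    const uint8_t *b = (const uint8_t *)p;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 16777619u; }
    return h;
}

/* Install: linear-probe to the first free slot. n_slots is a power of two. */
static void tbl_insert(slot_t *tbl, uint32_t n_slots, ckey_t k,
                       int32_t rule_idx, uint8_t action) {
    uint32_t i = fnv1a(&k, sizeof k) & (n_slots - 1);
    while (tbl[i].used) i = (i + 1) & (n_slots - 1);
    tbl[i].key = k; tbl[i].rule_idx = rule_idx;
    tbl[i].action = action; tbl[i].used = 1;
}

/* Lookup: probe until an empty slot (miss) or a key match (hit). */
static int32_t tbl_lookup(const slot_t *tbl, uint32_t n_slots, ckey_t k) {
    uint32_t i = fnv1a(&k, sizeof k) & (n_slots - 1);
    for (uint32_t s = 0; s < n_slots; s++, i = (i + 1) & (n_slots - 1)) {
        if (!tbl[i].used) return -1;
        if (memcmp(&tbl[i].key, &k, sizeof k) == 0) return tbl[i].rule_idx;
    }
    return -1;
}
```

The caller masks the packet fields with the table's masks before building the key; rules in the same table were masked the same way at install time, so equal masked keys mean a rule match.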
At rule install time, on the CPU:

- Group rules by their (src_mask, dst_mask, match_proto, match_src_port, match_dst_port, match_ip_version, tcp_flags_mask) combos. Each combo gets a table descriptor.
- Insert each rule into its combo's table keyed by the masked fields; when two rules share a key, keep the lower rule_idx (higher priority).
- Sort the table descriptors by min_rule_idx ascending - this lets the GPU break early once a match is found.
- memcpy everything into cudaMallocManaged buffers. Bump a rule_version counter so the persistent kernel reloads its shared-memory cache on the next frame.

On the GPU, per packet per frame:
best_action = PASS, best_idx = INT32_MAX
for each table descriptor (sorted by min_rule_idx):
if table.min_rule_idx >= best_idx: break // early exit
masked_key = apply table masks to packet fields
h = FNV-1a(masked_key)
probe flat entry array at (h & (n_slots-1)) with linear probing
on key match: if entry.rule_idx < best_idx → update best
results[thread_id] = best_action
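The same loop in plain C, as a host-side model of what one GPU thread does per packet (toy dst-prefix-only tables; the structure names are mine, not the plugin's):

```c
#include <stdint.h>

enum { ACT_PASS = 0, ACT_DROP = 1 };

typedef struct { uint32_t masked_dst; int32_t rule_idx; uint8_t used, action; } entry_t;

typedef struct {
    uint32_t       dst_mask;      /* applied to the packet before hashing */
    int32_t        min_rule_idx;  /* lowest rule index stored in this table */
    uint32_t       n_slots;       /* power of two */
    const entry_t *entries;
} table_t;

/* 32-bit FNV-1a over the 4 bytes of the masked address. */
static uint32_t fnv1a32(uint32_t v) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; i++) { h ^= (v >> (8 * i)) & 0xff; h *= 16777619u; }
    return h;
}

/* One packet against K tables sorted by min_rule_idx ascending. */
static uint8_t classify(const table_t *tabs, int k, uint32_t pkt_dst) {
    uint8_t best_action = ACT_PASS;
    int32_t best_idx = INT32_MAX;
    for (int t = 0; t < k; t++) {
        if (tabs[t].min_rule_idx >= best_idx) break;  /* early exit */
        uint32_t key = pkt_dst & tabs[t].dst_mask;
        uint32_t i = fnv1a32(key) & (tabs[t].n_slots - 1);
        for (uint32_t s = 0; s < tabs[t].n_slots;
             s++, i = (i + 1) & (tabs[t].n_slots - 1)) {
            const entry_t *e = &tabs[t].entries[i];
            if (!e->used) break;                      /* empty slot: miss */
            if (e->masked_dst == key) {               /* one entry per masked key */
                if (e->rule_idx < best_idx) {
                    best_idx = e->rule_idx;
                    best_action = e->action;
                }
                break;
            }
        }
    }
    return best_action;
}
```

The early exit is what bounds the worst case: once some rule matches, tables whose best possible rule_idx is already worse cannot change the answer, so the sorted order lets the thread stop probing.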
The persistent kernel (which stays resident across frames to avoid the 30 µs
cudaStreamSynchronize overhead) loads the table descriptors - 64 tables × 64
bytes = 4 KB - into shared memory on rule_version change. Entry probing goes to
global memory, but at a handful of probes per table and full Blackwell L2 caching, it is fast.
The same 80 KB shared memory allocation that previously held the full 1024-rule table now holds either descriptors (4 KB, hash path) or the full rule table (80 KB, linear fallback). The fallback activates automatically when more than 64 distinct mask combinations are present.
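A stripped-down sketch of the persistent-kernel pattern (illustrative only; the control-block fields and names are my assumptions, not the plugin's actual layout): one long-lived block spins on a flag in managed memory, and the host submits a frame by bumping a sequence counter.

```cuda
/* Control block in cudaMallocManaged memory. Coherency means plain
 * loads and stores are enough for host/device handoff. */
struct ctrl_t {
    volatile int work_seq;     /* host increments to submit a frame */
    volatile int done_seq;     /* device increments when results are ready */
    volatile int quit;         /* host sets to tear the kernel down */
    volatile int rule_version; /* bumped at rule install time */
};

__global__ void persistent_classify(ctrl_t *c /*, descs, tables, results */) {
    __shared__ int cached_version;
    if (threadIdx.x == 0) cached_version = -1;
    __syncthreads();

    int seen = 0;
    for (;;) {
        /* One thread spins on the coherent flag; the rest wait at the barrier. */
        if (threadIdx.x == 0)
            while (c->work_seq == seen && !c->quit) { }
        __syncthreads();
        if (c->quit) return;
        seen = c->work_seq;

        if (c->rule_version != cached_version) {
            /* ...reload table descriptors into shared memory here... */
            __syncthreads();
            if (threadIdx.x == 0) cached_version = c->rule_version;
        }

        /* ...one thread per packet: hash-table lookup, write result... */

        __syncthreads();                  /* all results written */
        if (threadIdx.x == 0) {
            __threadfence_system();       /* publish results to the CPU */
            c->done_seq = seen;
        }
    }
}
```

The host side is the mirror image: fill the descriptor buffer, increment work_seq, and poll done_seq until it catches up - no stream synchronize on the frame path at all.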
All measurements: 256-packet frames, 50 000 repetitions = 12.8 M packets per data point.
"kern µs" is the CPU-observed GPU round-trip, measured with clock_gettime around
the kernel dispatch and excluding VPP arc overhead (~30 µs, constant for both GPU and ACL).
No-match, varying rule count - traffic hits none of the deny rules:
Rules GPU kern µs ACL µs/fr GPU/ACL
1 3.7 37.6 1.06×
8 3.7 37.4 1.06×
64 4.0 37.6 1.06×
256 3.8 37.8 1.07×
1024 3.8 37.4 1.06×
3.8 µs flat at 1024 rules. Previously 576 µs. All 1024 rules share the same mask combo (proto + dport), so they hash into a single table: one O(1) probe per packet regardless of N. That is a 150× speedup.
Last-match, varying rule count - worst case for a linear scan, now a direct hash lookup:
Rules GPU kern µs ACL µs/fr GPU/ACL
1 3.8 36.8 1.04×
1024 5.4 36.7 0.99×
Previously also ~576 µs. The matching rule lives at index N-1, but the hash table finds it in one probe.
Diverse prefix lengths (Scenario 4) - 1024 dst-only rules spread across K distinct prefix lengths; both GPU and ACL build K hash tables per K distinct mask combos:
Tables GPU kern µs ACL µs/fr GPU/ACL
1 5.6 37.4 1.03×
8 11.9 39.9 0.96×
16 16.7 50.7 1.01×
21 22.1 57.1 1.06×
Previously ~518 µs kern at K=21 (linear scan). Now 22 µs - 23× faster. At K ≥ 16 the GPU pulls ahead of ACL; at K=8 it is roughly equal. The baseline (pure VPP forwarding, no feature) is 26.8 µs/frame, so at low K both plugins add only a few microseconds on top.
You need a DGX Spark (or any Grace Blackwell system with NVLink-C2C and CUDA 12.8+). The
repository includes a runme script that installs CUDA and VPP build dependencies
from a bare container.
git clone https://github.com/ayourtch-llm/vpp-spark-gpu.git
cd vpp-spark-gpu
git checkout dgx-spark-plugin
./runme              # installs cuda-toolkit-12-8, VPP build deps
make build-release
Run the functional tests (19 tests, IPv4 and IPv6, all actions):
make test TEST=test_gpu_classify
Run the GPU vs ACL benchmark:
make test TEST=test_gpu_classify_bench
The benchmark prints four scenario tables to stdout, with GPU Mpps, GPU µs/frame, kern µs, ACL Mpps, ACL µs/frame, and the GPU/ACL ratio side by side.
To try rules interactively from the VPP CLI:
gpu-classify enable GigabitEthernet0/0
gpu-classify rule add proto 6 dport 443 action drop
gpu-classify rule add src 10.0.0.0/8 action mark
show gpu-classify
show gpu-classify reports the number of hash tables built, total slot count, and
the full latency histogram (p50/p99/p99.9).
The NVLink-C2C coherency really does remove the DMA tax. The persistent kernel trick
(launching once and polling a shared control block) removes the 30 µs
cudaStreamSynchronize cost. But neither of those mattered until the kernel
algorithm matched the ACL plugin's O(K) complexity - before that, the GPU was just a slow
linear scanner that happened to avoid page faults.
The broader lesson is familiar: hardware throughput is not the bottleneck until the algorithm is right.