NVIDIA CUDA Acceleration
CUDA kernels offload batch cryptographic operations to NVIDIA GPUs. The implementation targets parallel workloads where hundreds or thousands of independent operations can execute simultaneously across GPU streaming multiprocessors.
Compute Capability
- Compiled for: CC 8.9 (Ada Lovelace architecture)
- PTX forward-compatible: up to CC 12.0 (Blackwell, e.g., RTX 5080)
- WSL2 compatible: synchronous operation fallback for Windows Subsystem for Linux
From the CUDA source header:
// Compiled for compute_89 (Ada Lovelace) with PTX forward compatibility
// to Blackwell (RTX 5080, CC 12.0) - WSL2 compatible
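Based on the header comment, the corresponding nvcc invocation would look roughly like this (illustrative; the project's actual build flags are not shown in this document). Embedding SASS for `sm_89` gives native Ada Lovelace binaries, while the extra PTX image lets the driver JIT-compile for newer architectures such as Blackwell:

```shell
# Embed native SASS for CC 8.9 (Ada Lovelace) plus PTX for
# forward JIT compilation on newer GPUs (e.g., CC 12.0 Blackwell).
nvcc -c blake3_batch.cu -o blake3_batch.o \
     -gencode arch=compute_89,code=sm_89 \
     -gencode arch=compute_89,code=compute_89
```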
BLAKE3 Operations
The BLAKE3 CUDA implementation (metamui-crypto-rust/metamui-blake3/cuda/blake3_batch.cu) provides three kernels:
blake3_batch_hash
Batch hashing where each CUDA thread processes one complete hash independently. Input data is provided as a concatenated buffer with per-hash offsets and sizes:
extern "C" __global__ void blake3_batch_hash(
const unsigned char *input_data,
const unsigned int *input_sizes,
const unsigned int *input_offsets,
unsigned char *output_hashes,
unsigned int num_hashes
);
Each thread:
- Initializes the chaining value from the BLAKE3 IV (stored in __constant__ memory)
- Processes full 64-byte blocks with cumulative block length tracking
- Handles the final partial block with the CHUNK_END and ROOT flags
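On the host side, the concatenated-buffer layout the kernel expects can be built with an exclusive prefix sum over the input sizes. A minimal sketch (the helper name is illustrative; device copies and the kernel launch are omitted):

```c
#include <assert.h>

/* Compute the exclusive-prefix-sum offset table that positions each
 * message inside the concatenated input_data buffer; returns the total
 * buffer length. Matches the input_sizes / input_offsets layout used
 * by blake3_batch_hash. */
static unsigned int build_offsets(const unsigned int *sizes,
                                  unsigned int *offsets,
                                  unsigned int num_hashes) {
    unsigned int total = 0;
    for (unsigned int i = 0; i < num_hashes; i++) {
        offsets[i] = total;   /* each hash starts where the previous ended */
        total += sizes[i];
    }
    return total;
}
```

After building the table, the host copies each message to `input_data + offsets[i]`, transfers the three arrays to the device, and launches the kernel with one thread per hash.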
blake3_chunk_hash
Processes 1024-byte chunks into chunk chaining values (32 bytes each) for multi-chunk tree hashing of inputs larger than 1024 bytes. Each thread handles one chunk independently.
blake3_parent_hash
Combines pairs of child chaining values into parent chaining values for tree reduction. Used iteratively to reduce chunk CVs down to a single root hash. Handles odd-child-count edge cases by duplicating the last child.
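The iterative reduction repeatedly halves the CV count until one root remains; with the odd-child duplication described above, each level produces ceil(n/2) parents. A sketch of the level-count logic only (not the hashing itself; function names are illustrative):

```c
#include <assert.h>

/* Parent CVs produced from n child CVs when an odd trailing child is
 * paired with a duplicate of itself. */
static unsigned int parents_for(unsigned int n) {
    return (n + 1) / 2;   /* ceil(n / 2) */
}

/* Number of blake3_parent_hash passes needed to reduce n chunk CVs
 * down to a single root CV. */
static unsigned int reduction_levels(unsigned int n) {
    unsigned int levels = 0;
    while (n > 1) {
        n = parents_for(n);
        levels++;
    }
    return levels;
}
```

For example, 5 chunk CVs reduce as 5 → 3 → 2 → 1, i.e., three kernel passes.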
SMAUG-T Operations
The SMAUG-T CUDA implementation (metamui-crypto-c/metamui-smaug-t/src/gpu/cuda/smaug_cuda_kernels.cu) provides polynomial-level and matrix-level kernels:
Polynomial Operations
- poly_add_kernel – element-wise polynomial addition
- poly_add_batch_kernel – batched polynomial addition
- poly_sub_kernel – element-wise polynomial subtraction
- poly_sub_batch_kernel – batched polynomial subtraction
- poly_mul_schoolbook_kernel – schoolbook polynomial multiplication with shared memory
- poly_mul_karatsuba_kernel – Karatsuba polynomial multiplication
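A CPU reference for the schoolbook multiplication, assuming the negacyclic ring Zq[x]/(x^n + 1) typically used by lattice-based KEMs (N and Q here are small illustrative values, not SMAUG-T's actual parameters; coefficients are assumed reduced to [0, Q)):

```c
#include <assert.h>

#define N 4    /* illustrative ring degree */
#define Q 17   /* illustrative modulus */

/* r = a * b in Z_Q[x] / (x^N + 1): any product term of degree >= N
 * wraps around with a sign flip (negacyclic convolution). */
static void poly_mul_schoolbook(const int a[N], const int b[N], int r[N]) {
    for (int i = 0; i < N; i++) r[i] = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int k = i + j;
            int term = (a[i] * b[j]) % Q;
            if (k < N)
                r[k] = (r[k] + term) % Q;            /* in-range term */
            else
                r[k - N] = (r[k - N] - term + Q) % Q; /* x^N = -1 wrap */
        }
    }
}
```

The GPU kernel follows the same O(n^2) structure but stages operand coefficients in shared memory so each block reuses them across threads.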
Transform Operations
- ntt_forward_kernel – NTT forward transform with twiddle factors
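The forward NTT evaluates the polynomial at successive powers of a primitive root of unity; the kernel would use butterfly stages with precomputed twiddle tables, but the transform it computes matches this naive O(n^2) CPU reference (N = 4, Q = 17, OMEGA = 4 are illustrative):

```c
#include <assert.h>

#define N 4
#define Q 17
#define OMEGA 4   /* primitive 4th root of unity mod 17: 4^2 = -1, 4^4 = 1 */

static int pow_mod(int base, int exp) {
    int r = 1;
    for (int e = 0; e < exp; e++) r = (r * base) % Q;
    return r;
}

/* X[k] = sum_j a[j] * OMEGA^(j*k) mod Q  -- the defining formula of the
 * number-theoretic transform (twiddle factors are the OMEGA powers). */
static void ntt_forward(const int a[N], int X[N]) {
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int j = 0; j < N; j++)
            X[k] = (X[k] + a[j] * pow_mod(OMEGA, j * k)) % Q;
    }
}
```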
Linear Algebra
- vec_vec_mult_kernel – vector-vector multiplication (inner product of polynomial vectors)
- matrix_vec_mult_kernel – matrix-vector multiplication
Utility Operations
- sample_gaussian_kernel – Gaussian noise sampling
- kem_keypair_batch_kernel – batch KEM key pair generation
Thread Configuration
The SMAUG-T kernels use:
- Block size: 256 threads
- Warp size: 32 threads
- Shared memory: up to 48 KB per block
- Maximum blocks: 65535
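With these limits, launch geometry reduces to a capped ceiling division (a hypothetical helper; when n exceeds the cap times the block size, kernels typically fall back to a grid-stride loop so every item is still covered):

```c
#include <assert.h>

#define BLOCK_SIZE 256     /* threads per block, per the configuration above */
#define MAX_BLOCKS 65535   /* per-dimension grid limit, per the configuration above */

/* Blocks needed so every one of n work items gets a thread,
 * clamped to the maximum grid dimension. */
static unsigned int grid_blocks(unsigned int n) {
    unsigned int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  /* ceil(n / 256) */
    return blocks > MAX_BLOCKS ? MAX_BLOCKS : blocks;
}
```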
Implementation Files
- BLAKE3: metamui-crypto-rust/metamui-blake3/cuda/blake3_batch.cu
- SMAUG-T kernels: metamui-crypto-c/metamui-smaug-t/src/gpu/cuda/smaug_cuda_kernels.cu
- SMAUG-T multi-GPU: metamui-crypto-c/metamui-smaug-t/src/gpu/cuda/smaug_cuda_multi.cu