NVIDIA CUDA Acceleration
CUDA kernels offload batch cryptographic operations to NVIDIA GPUs. The implementation targets parallel workloads where hundreds or thousands of independent operations can execute simultaneously across GPU streaming multiprocessors.
Compute Capability
- Compiled for: CC 8.9 (Ada Lovelace architecture)
- PTX forward-compatible: up to CC 12.0 (Blackwell, e.g., RTX 5080)
- WSL2 compatible: synchronous operation fallback for Windows Subsystem for Linux
From the CUDA source header:
// Compiled for compute_89 (Ada Lovelace) with PTX forward compatibility
// to Blackwell (RTX 5080, CC 12.0) - WSL2 compatible
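Based on the header comment, the corresponding nvcc invocation would look roughly like this (illustrative; the project's actual build flags are not shown in this document). Embedding SASS for `sm_89` gives native Ada Lovelace binaries, while the extra PTX image lets the driver JIT-compile for newer architectures such as Blackwell:

```shell
# Embed native SASS for CC 8.9 (Ada Lovelace) plus PTX for
# forward JIT compilation on newer GPUs (e.g., CC 12.0 Blackwell).
nvcc -c blake3_batch.cu -o blake3_batch.o \
     -gencode arch=compute_89,code=sm_89 \
     -gencode arch=compute_89,code=compute_89
```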
BLAKE3 Operations
The BLAKE3 CUDA implementation (metamui-crypto-rust/metamui-blake3/cuda/blake3_batch.cu) provides three kernels:
blake3_batch_hash
Batch hashing where each CUDA thread processes one complete hash independently. Input data is provided as a concatenated buffer with per-hash offsets and sizes:
extern "C" __global__ void blake3_batch_hash(
const unsigned char *input_data,
const unsigned int *input_sizes,
const unsigned int *input_offsets,
unsigned char *output_hashes,
unsigned int num_hashes
);
Each thread:
- Initializes the chaining value from the BLAKE3 IV (stored in __constant__ memory)
- Processes full 64-byte blocks with cumulative block length tracking
- Handles the final partial block with the CHUNK_END and ROOT flags
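On the host side, the concatenated-buffer layout the kernel expects can be built with an exclusive prefix sum over the input sizes. A minimal sketch (the helper name is illustrative; device copies and the kernel launch are omitted):

```c
#include <assert.h>

/* Compute the exclusive-prefix-sum offset table that positions each
 * message inside the concatenated input_data buffer; returns the total
 * buffer length. Matches the input_sizes / input_offsets layout used
 * by blake3_batch_hash. */
static unsigned int build_offsets(const unsigned int *sizes,
                                  unsigned int *offsets,
                                  unsigned int num_hashes) {
    unsigned int total = 0;
    for (unsigned int i = 0; i < num_hashes; i++) {
        offsets[i] = total;   /* each hash starts where the previous ended */
        total += sizes[i];
    }
    return total;
}
```

After building the table, the host copies each message to `input_data + offsets[i]`, transfers the three arrays to the device, and launches the kernel with one thread per hash.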
blake3_chunk_hash
Processes 1024-byte chunks into chunk chaining values (32 bytes each) for multi-chunk tree hashing of inputs larger than 1024 bytes. Each thread handles one chunk independently.
blake3_parent_hash
Combines pairs of child chaining values into parent chaining values for tree reduction. Used iteratively to reduce chunk CVs down to a single root hash. Handles odd-child-count edge cases by duplicating the last child.
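The iterative reduction repeatedly halves the CV count until one root remains; with the odd-child duplication described above, each level produces ceil(n/2) parents. A sketch of the level-count logic only (not the hashing itself; function names are illustrative):

```c
#include <assert.h>

/* Parent CVs produced from n child CVs when an odd trailing child is
 * paired with a duplicate of itself. */
static unsigned int parents_for(unsigned int n) {
    return (n + 1) / 2;   /* ceil(n / 2) */
}

/* Number of blake3_parent_hash passes needed to reduce n chunk CVs
 * down to a single root CV. */
static unsigned int reduction_levels(unsigned int n) {
    unsigned int levels = 0;
    while (n > 1) {
        n = parents_for(n);
        levels++;
    }
    return levels;
}
```

For example, 5 chunk CVs reduce as 5 → 3 → 2 → 1, i.e., three kernel passes.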
SMAUG-T Operations
The SMAUG-T CUDA implementation (metamui-crypto-c/metamui-smaug-t/src/gpu/cuda/smaug_cuda_kernels.cu) provides polynomial-level and matrix-level kernels:
Polynomial Operations
- poly_add_kernel – element-wise polynomial addition
- poly_add_batch_kernel – batched polynomial addition
- poly_sub_kernel – element-wise polynomial subtraction
- poly_sub_batch_kernel – batched polynomial subtraction
- poly_mul_schoolbook_kernel – schoolbook polynomial multiplication with shared memory
- poly_mul_karatsuba_kernel – Karatsuba polynomial multiplication
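A CPU reference for the schoolbook multiplication, assuming the negacyclic ring Zq[x]/(x^n + 1) typically used by lattice-based KEMs (N and Q here are small illustrative values, not SMAUG-T's actual parameters; coefficients are assumed reduced to [0, Q)):

```c
#include <assert.h>

#define N 4    /* illustrative ring degree */
#define Q 17   /* illustrative modulus */

/* r = a * b in Z_Q[x] / (x^N + 1): any product term of degree >= N
 * wraps around with a sign flip (negacyclic convolution). */
static void poly_mul_schoolbook(const int a[N], const int b[N], int r[N]) {
    for (int i = 0; i < N; i++) r[i] = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int k = i + j;
            int term = (a[i] * b[j]) % Q;
            if (k < N)
                r[k] = (r[k] + term) % Q;            /* in-range term */
            else
                r[k - N] = (r[k - N] - term + Q) % Q; /* x^N = -1 wrap */
        }
    }
}
```

The GPU kernel follows the same O(n^2) structure but stages operand coefficients in shared memory so each block reuses them across threads.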
Transform Operations
- ntt_forward_kernel – NTT forward transform with twiddle factors
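The forward NTT evaluates the polynomial at successive powers of a primitive root of unity; the kernel would use butterfly stages with precomputed twiddle tables, but the transform it computes matches this naive O(n^2) CPU reference (N = 4, Q = 17, OMEGA = 4 are illustrative):

```c
#include <assert.h>

#define N 4
#define Q 17
#define OMEGA 4   /* primitive 4th root of unity mod 17: 4^2 = -1, 4^4 = 1 */

static int pow_mod(int base, int exp) {
    int r = 1;
    for (int e = 0; e < exp; e++) r = (r * base) % Q;
    return r;
}

/* X[k] = sum_j a[j] * OMEGA^(j*k) mod Q  -- the defining formula of the
 * number-theoretic transform (twiddle factors are the OMEGA powers). */
static void ntt_forward(const int a[N], int X[N]) {
    for (int k = 0; k < N; k++) {
        X[k] = 0;
        for (int j = 0; j < N; j++)
            X[k] = (X[k] + a[j] * pow_mod(OMEGA, j * k)) % Q;
    }
}
```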
Linear Algebra
- vec_vec_mult_kernel – vector-vector multiplication (inner product of polynomial vectors)
- matrix_vec_mult_kernel – matrix-vector multiplication
Utility Operations
- sample_gaussian_kernel – Gaussian noise sampling
- kem_keypair_batch_kernel – batch KEM key pair generation
Thread Configuration
The SMAUG-T kernels use:
- Block size: 256 threads
- Warp size: 32 threads
- Shared memory: up to 48 KB per block
- Maximum blocks: 65535
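With these limits, launch geometry reduces to a capped ceiling division (a hypothetical helper; when n exceeds the cap times the block size, kernels typically fall back to a grid-stride loop so every item is still covered):

```c
#include <assert.h>

#define BLOCK_SIZE 256     /* threads per block, per the configuration above */
#define MAX_BLOCKS 65535   /* per-dimension grid limit, per the configuration above */

/* Blocks needed so every one of n work items gets a thread,
 * clamped to the maximum grid dimension. */
static unsigned int grid_blocks(unsigned int n) {
    unsigned int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  /* ceil(n / 256) */
    return blocks > MAX_BLOCKS ? MAX_BLOCKS : blocks;
}
```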
Implementation Files
- BLAKE3: metamui-crypto-rust/metamui-blake3/cuda/blake3_batch.cu
- SMAUG-T kernels: metamui-crypto-c/metamui-smaug-t/src/gpu/cuda/smaug_cuda_kernels.cu
- SMAUG-T multi-GPU: metamui-crypto-c/metamui-smaug-t/src/gpu/cuda/smaug_cuda_multi.cu