NVIDIA CUDA Acceleration

CUDA kernels offload batch cryptographic operations to NVIDIA GPUs. The implementation targets parallel workloads where hundreds or thousands of independent operations can execute simultaneously across GPU streaming multiprocessors.

Compute Capability

From the CUDA source header:

// Compiled for compute_89 (Ada Lovelace) with PTX forward compatibility
// to Blackwell (RTX 5080, CC 12.0) - WSL2 compatible
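
The comment implies a build that embeds SASS for compute capability 8.9 plus PTX that newer drivers can JIT-compile on Blackwell. A plausible nvcc invocation for that setup (a sketch, not taken from the project's build scripts) would be:

```shell
# Emit Ada (sm_89) SASS plus compute_89 PTX; the driver JIT-compiles the
# PTX on newer architectures such as Blackwell (CC 12.0).
nvcc -O3 \
  -gencode arch=compute_89,code=sm_89 \
  -gencode arch=compute_89,code=compute_89 \
  -c blake3_batch.cu -o blake3_batch.o
```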

BLAKE3 Operations

The BLAKE3 CUDA implementation (metamui-crypto-rust/metamui-blake3/cuda/blake3_batch.cu) provides three kernels:

blake3_batch_hash

Batch hashing where each CUDA thread computes one complete hash independently. Input data is supplied as a single concatenated buffer with per-hash offsets and sizes:

extern "C" __global__ void blake3_batch_hash(
    const unsigned char *input_data,
    const unsigned int *input_sizes,
    const unsigned int *input_offsets,
    unsigned char *output_hashes,
    unsigned int num_hashes
);
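
The host side must build that flat layout before launching the kernel. A minimal sketch of such a packing helper (`pack_batch` is a hypothetical name, not an API from the project):

```c
#include <assert.h>
#include <string.h>

/* Pack variable-length messages into the flat layout blake3_batch_hash
 * expects: offsets[i] is the byte position of message i inside the
 * concatenated buffer, sizes[i] is its length. Returns the total byte
 * count, i.e. how much to copy to the device with cudaMemcpy. */
unsigned int pack_batch(const unsigned char **msgs, const unsigned int *lens,
                        unsigned int n, unsigned char *buf,
                        unsigned int *sizes, unsigned int *offsets) {
    unsigned int total = 0;
    for (unsigned int i = 0; i < n; i++) {
        offsets[i] = total;
        sizes[i] = lens[i];
        memcpy(buf + total, msgs[i], lens[i]);
        total += lens[i];
    }
    return total;
}
```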

Each thread:

  1. Initializes the chaining value from the BLAKE3 IV (stored in __constant__ memory)
  2. Processes full 64-byte blocks, tracking the cumulative block length
  3. Handles the final (possibly partial) block with the CHUNK_END and ROOT flags set
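
The per-block flag selection for this single-chunk case can be modeled on the host. The flag values below are the ones defined by the BLAKE3 specification; the first block of a chunk carries CHUNK_START, and for an input that fits in one chunk the last block carries both CHUNK_END and ROOT:

```c
/* BLAKE3 domain-separation flag values, per the BLAKE3 spec. */
#define CHUNK_START 1u
#define CHUNK_END   2u
#define ROOT        8u

/* Flags for block block_idx of a num_blocks-block, single-chunk input:
 * first block starts the chunk; last block ends it and, because the
 * whole hash is one chunk, also finalizes the root. A one-block input
 * gets all three flags at once. */
unsigned int block_flags(unsigned int block_idx, unsigned int num_blocks) {
    unsigned int flags = 0;
    if (block_idx == 0)
        flags |= CHUNK_START;
    if (block_idx == num_blocks - 1)
        flags |= CHUNK_END | ROOT;
    return flags;
}
```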

blake3_chunk_hash

Processes 1024-byte chunks into chunk chaining values (32 bytes each) for multi-chunk tree hashing of inputs larger than 1024 bytes. Each thread handles one chunk independently.
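
The chunk partitioning that determines how many threads such a launch needs is straightforward to state in host code (a sketch of the arithmetic, not code from the kernel file):

```c
#define CHUNK_LEN 1024u

/* Number of 1024-byte chunks an input occupies; BLAKE3 treats an empty
 * input as a single (empty) chunk. */
unsigned int num_chunks(unsigned int input_len) {
    if (input_len == 0)
        return 1;
    return (input_len + CHUNK_LEN - 1) / CHUNK_LEN;
}

/* Length of chunk chunk_idx: every chunk is full except possibly the
 * last one. */
unsigned int chunk_len(unsigned int chunk_idx, unsigned int input_len) {
    unsigned int start = chunk_idx * CHUNK_LEN;
    unsigned int remaining = input_len - start;
    return remaining < CHUNK_LEN ? remaining : CHUNK_LEN;
}
```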

blake3_parent_hash

Combines pairs of child chaining values into parent chaining values for tree reduction. Used iteratively to reduce chunk CVs down to a single root hash. Handles odd-child-count edge cases by duplicating the last child.
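
The shape of that iterative reduction is easy to model: each round pairs children (2i, 2i+1), and with the duplication rule described above an odd count still halves (rounded up). A host-side sketch that only tracks the surviving CV count per round:

```c
/* One parent-hash round over n chunk CVs: pairs combine into parents;
 * with odd n the last child is paired with a duplicate of itself, so
 * (n + 1) / 2 CVs survive. */
unsigned int reduce_round(unsigned int n) {
    if (n <= 1)
        return n;
    return (n + 1) / 2;
}

/* Number of blake3_parent_hash rounds needed to reach the single root. */
unsigned int rounds_to_root(unsigned int n) {
    unsigned int rounds = 0;
    while (n > 1) {
        n = reduce_round(n);
        rounds++;
    }
    return rounds;
}
```

For example, 8 chunk CVs reduce 8 → 4 → 2 → 1 in three rounds, and 5 CVs reduce 5 → 3 → 2 → 1 in three as well.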

SMAUG-T Operations

The SMAUG-T CUDA implementation (metamui-crypto-c/metamui-smaug-t/src/gpu/cuda/smaug_cuda_kernels.cu) provides polynomial-level and matrix-level kernels:

Polynomial Operations

Transform Operations

Linear Algebra

Utility Operations

Thread Configuration

The SMAUG-T kernels use:

Implementation Files