Apple Metal GPU Acceleration

Metal compute shaders written in Metal Shading Language (MSL) offload batch cryptographic operations to Apple Silicon GPUs. The unified memory architecture on Apple Silicon (M1, M2, M3, M4 families) enables zero-copy CPU-GPU data handoff, eliminating PCIe transfer overhead.

Metal is used only for batch operations, where the GPU dispatch overhead is amortized across many items. The dispatch threshold is 100 items (batch_size >= 100 in the Falcon dispatch code) – below that, CPU SIMD (NEON) is faster.

Availability

The dispatch system probes for Metal device availability at runtime:

// From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
pub fn detect_best_for_batch(batch_size: usize) -> SimdType {
    if batch_size >= 100 && is_metal_available() {
        return SimdType::Metal;
    }
    detect_best()
}
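
The threshold logic can be exercised without a GPU if availability is passed in as a parameter. The following is a minimal sketch, not the crate's code: `select_backend`, `METAL_BATCH_THRESHOLD`, and the `Scalar`/`Neon` variants are hypothetical names introduced for illustration; only `SimdType::Metal` and the 100-item cutoff come from the excerpt above.

```rust
/// Backend choices; `Scalar` and `Neon` are assumed variants for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdType {
    Scalar,
    Neon,
    Metal,
}

/// Batch size at which GPU dispatch overhead is considered amortized.
pub const METAL_BATCH_THRESHOLD: usize = 100;

/// Picks Metal only for large batches on machines that expose a Metal device;
/// otherwise falls back to whatever the CPU probe reported.
pub fn select_backend(batch_size: usize, metal_available: bool, cpu_best: SimdType) -> SimdType {
    if batch_size >= METAL_BATCH_THRESHOLD && metal_available {
        SimdType::Metal
    } else {
        cpu_best
    }
}
```

Injecting `metal_available` as a plain boolean keeps the decision testable on machines without Apple hardware.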

Falcon Operations

The Falcon Metal shader (metamui-crypto-rust/metamui-falcon512/src/metal/shaders/falcon_ntt.metal) implements four compute kernels:

falcon_ntt_forward

Cooley-Tukey butterfly NTT over Z_q (q = 12289). Each threadgroup cooperatively processes one polynomial from a batch.
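
For reference, a CPU-side sketch of the same butterfly structure in Rust. It is a plain cyclic NTT with an explicit bit-reversal step, chosen for readability; the shader's memory layout, its handling of the X^n + 1 ring, and its Montgomery representation are not reproduced here. The helper names (`pow_mod`, `root_of_unity`, `bit_reverse_permute`, `ntt_forward`) are illustrative assumptions.

```rust
const Q: u32 = 12289;

/// Modular exponentiation by squaring; operands stay below q, so products fit in u32.
fn pow_mod(mut base: u32, mut exp: u32) -> u32 {
    let mut acc = 1u32;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % Q;
        }
        base = base * base % Q;
        exp >>= 1;
    }
    acc
}

/// Searches for an element of order exactly `n` (a power of two dividing q - 1).
fn root_of_unity(n: u32) -> u32 {
    assert!(n >= 2 && n.is_power_of_two() && (Q - 1) % n == 0);
    (2..Q)
        .map(|g| pow_mod(g, (Q - 1) / n))
        .find(|&w| pow_mod(w, n / 2) == Q - 1)
        .expect("a root of order n exists because n divides q - 1")
}

/// Reorders the slice so that index i and its bit-reversed index swap places.
fn bit_reverse_permute(a: &mut [u32]) {
    let n = a.len();
    let shift = usize::BITS - n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> shift;
        if i < j {
            a.swap(i, j);
        }
    }
}

/// In-place Cooley-Tukey NTT; inputs must already be reduced mod q.
fn ntt_forward(a: &mut [u32], omega: u32) {
    let n = a.len();
    assert!(n >= 2 && n.is_power_of_two());
    bit_reverse_permute(a);
    let mut len = 2;
    while len <= n {
        // Primitive len-th root of unity for this layer.
        let w_len = pow_mod(omega, (n / len) as u32);
        for start in (0..n).step_by(len) {
            let mut w = 1u32;
            for j in 0..len / 2 {
                let u = a[start + j];
                let v = a[start + j + len / 2] * w % Q;
                a[start + j] = (u + v) % Q;
                a[start + j + len / 2] = (u + Q - v) % Q;
                w = w * w_len % Q;
            }
        }
        len *= 2;
    }
}
```

On the GPU, the two innermost loops are what the threads of a threadgroup would share, with a barrier between layers; here they simply run serially.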

falcon_ntt_inverse

Gentleman-Sande butterfly (reversed layer order). Same cooperative threadgroup pattern as forward NTT, with a final scaling by n^{-1} mod q.
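
A matching CPU-side sketch of the inverse transform, again as a plain cyclic NTT for readability rather than a transcription of the shader. The Gentleman-Sande butterflies run from the widest layer to the narrowest, the output is brought back to natural order, and every coefficient is scaled by n^{-1} mod q (computed here with Fermat's little theorem). All function names are illustrative assumptions.

```rust
const Q: u32 = 12289;

/// Modular exponentiation by squaring; operands stay below q, so products fit in u32.
fn pow_mod(mut base: u32, mut exp: u32) -> u32 {
    let mut acc = 1u32;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % Q;
        }
        base = base * base % Q;
        exp >>= 1;
    }
    acc
}

/// Searches for an element of order exactly `n` (a power of two dividing q - 1).
fn root_of_unity(n: u32) -> u32 {
    assert!(n >= 2 && n.is_power_of_two() && (Q - 1) % n == 0);
    (2..Q)
        .map(|g| pow_mod(g, (Q - 1) / n))
        .find(|&w| pow_mod(w, n / 2) == Q - 1)
        .expect("a root of order n exists because n divides q - 1")
}

/// Reorders the slice so that index i and its bit-reversed index swap places.
fn bit_reverse_permute(a: &mut [u32]) {
    let n = a.len();
    let shift = usize::BITS - n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> shift;
        if i < j {
            a.swap(i, j);
        }
    }
}

/// In-place Gentleman-Sande inverse NTT; `omega` is the forward root of order n.
fn ntt_inverse(a: &mut [u32], omega: u32) {
    let n = a.len();
    assert!(n >= 2 && n.is_power_of_two());
    let omega_inv = pow_mod(omega, Q - 2);
    let mut len = n;
    while len >= 2 {
        let w_len = pow_mod(omega_inv, (n / len) as u32);
        for start in (0..n).step_by(len) {
            let mut w = 1u32;
            for j in 0..len / 2 {
                let u = a[start + j];
                let v = a[start + j + len / 2];
                a[start + j] = (u + v) % Q;
                // Twiddle is applied after the subtraction (GS butterfly).
                a[start + j + len / 2] = (u + Q - v) % Q * w % Q;
                w = w * w_len % Q;
            }
        }
        len /= 2;
    }
    bit_reverse_permute(a);
    let n_inv = pow_mod(n as u32, Q - 2);
    for x in a.iter_mut() {
        *x = *x * n_inv % Q;
    }
}
```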

falcon_pointwise_mul

Coefficient-wise multiplication of two NTT-domain polynomials. This is embarrassingly parallel – each thread processes one coefficient independently via Montgomery multiply. No threadgroup coordination needed.
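
A minimal CPU sketch of the same operation, where each iteration of the `zip` corresponds to what one GPU thread would do. For brevity it uses an ordinary `%` reduction in place of the Montgomery multiply the kernel is described as using; `pointwise_mul` is a hypothetical name.

```rust
const Q: u32 = 12289;

/// Multiplies two NTT-domain polynomials coefficient by coefficient.
/// Inputs must be reduced mod q so each product stays below 2^28.
fn pointwise_mul(a: &[u32], b: &[u32]) -> Vec<u32> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&x, &y)| x * y % Q).collect()
}
```

No element depends on any other, which is why the kernel needs no barriers or shared memory.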

falcon_batch_verify

Parallel signature norm checking across hundreds of signatures. The CPU pre-computes s0 = c - s1 * h mod q using SIMD-optimized NTT multiplication, then the GPU checks ||s0||^2 + ||s1||^2 < beta^2 in parallel.

Each threadgroup handles one signature:

  1. Threads accumulate partial norms across their assigned coefficient stripes
  2. Parallel reduction sums partial norms within the threadgroup
  3. Thread 0 writes the pass/fail result (1 = valid, 0 = invalid)

This hybrid approach leverages CPU SIMD for NTT polynomial multiplication (where it excels) and GPU for parallel norm reduction across many signatures.
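
The three steps listed above can be sketched on the CPU as follows, with serial loops standing in for GPU threads. `THREADS` is a hypothetical threadgroup width, `norm_check` is an illustrative name, and the bound is passed in as a parameter rather than hard-coded; coefficients are assumed to be centered signed values.

```rust
/// Hypothetical threadgroup width; must be a power of two for the reduction below.
const THREADS: usize = 64;

/// Returns 1 if ||s0||^2 + ||s1||^2 < bound_sq, else 0.
fn norm_check(s0: &[i16], s1: &[i16], bound_sq: u64) -> u8 {
    assert_eq!(s0.len(), s1.len());

    // Step 1: "thread" t accumulates the stripe t, t + THREADS, t + 2*THREADS, ...
    let mut partial = [0u64; THREADS];
    for t in 0..THREADS {
        let mut i = t;
        while i < s0.len() {
            let a = s0[i] as i64;
            let b = s1[i] as i64;
            partial[t] += (a * a + b * b) as u64;
            i += THREADS;
        }
    }

    // Step 2: tree reduction, halving the number of active lanes each pass.
    let mut stride = THREADS / 2;
    while stride > 0 {
        for t in 0..stride {
            partial[t] += partial[t + stride];
        }
        stride /= 2;
    }

    // Step 3: lane 0 holds the total and writes the verdict.
    (partial[0] < bound_sq) as u8
}

/// One verdict per signature, mirroring one threadgroup per signature.
fn batch_norm_check(sigs: &[(Vec<i16>, Vec<i16>)], bound_sq: u64) -> Vec<u8> {
    sigs.iter().map(|(s0, s1)| norm_check(s0, s1, bound_sq)).collect()
}
```

On the GPU, each pass of the reduction loop would be separated by a threadgroup barrier; the serial version needs none.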

Montgomery Arithmetic

All NTT kernels use branchless Montgomery arithmetic with a fixed set of precomputed constants.
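
As an illustration of what branchless Montgomery arithmetic for q = 12289 can look like, the sketch below assumes R = 2^16 and uses the constants that follow from that choice. Whether the shader uses exactly these values and names is an assumption; `mont_mul`, `to_mont`, and `from_mont` are illustrative.

```rust
const Q: u32 = 12289;
/// -q^{-1} mod 2^16 (assumes R = 2^16).
const Q0I: u32 = 12287;
/// R mod q = 2^16 mod q.
const R: u32 = 4091;
/// R^2 mod q = 2^32 mod q.
const R2: u32 = 10952;

/// Montgomery product: returns x * y * R^{-1} mod q for x, y in [0, q).
fn mont_mul(x: u32, y: u32) -> u32 {
    let z = x * y; // below q^2 < 2^28
    // Multiple of q that cancels the low 16 bits of z; only the low bits matter,
    // so the intermediate product is allowed to wrap.
    let w = (z.wrapping_mul(Q0I) & 0xFFFF) * Q;
    let r = (z + w) >> 16; // in [0, 2q)
    // Conditional subtraction without a branch: the sign bit of r - q
    // becomes an all-ones mask that adds q back when the result went negative.
    let r = r.wrapping_sub(Q);
    r.wrapping_add(Q & 0u32.wrapping_sub(r >> 31))
}

/// Converts a value in [0, q) into Montgomery form (x * R mod q).
fn to_mont(x: u32) -> u32 {
    mont_mul(x, R2)
}

/// Converts a Montgomery-form value back to the ordinary representation.
fn from_mont(x: u32) -> u32 {
    mont_mul(x, 1)
}
```

Keeping NTT twiddle factors in Montgomery form means each butterfly costs one `mont_mul` and no division.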

BLAKE3 Operations

The BLAKE3 Metal shader (metamui-crypto-rust/metamui-blake3/src/metal/shaders/blake3.metal) implements five compute kernels:

blake3_compress_blocks

Parallel block compression where each thread processes one block independently. Loads chaining value and block words per thread, performs 7 rounds of mixing, and stores the 16-word output.
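
The per-thread work can be sketched in Rust as a single-block compression function following the public BLAKE3 specification. This is a scalar CPU sketch, not a transcription of the shader; the function names (`g`, `mix_round`, `compress`) and the argument layout are assumptions.

```rust
const IV: [u32; 8] = [
    0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19,
];
const MSG_PERMUTATION: [usize; 16] = [2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8];

/// The quarter-round mixing function applied to one column or diagonal.
fn g(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize, mx: u32, my: u32) {
    s[a] = s[a].wrapping_add(s[b]).wrapping_add(mx);
    s[d] = (s[d] ^ s[a]).rotate_right(16);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_right(12);
    s[a] = s[a].wrapping_add(s[b]).wrapping_add(my);
    s[d] = (s[d] ^ s[a]).rotate_right(8);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_right(7);
}

/// One round: four column mixes followed by four diagonal mixes.
fn mix_round(s: &mut [u32; 16], m: &[u32; 16]) {
    g(s, 0, 4, 8, 12, m[0], m[1]);
    g(s, 1, 5, 9, 13, m[2], m[3]);
    g(s, 2, 6, 10, 14, m[4], m[5]);
    g(s, 3, 7, 11, 15, m[6], m[7]);
    g(s, 0, 5, 10, 15, m[8], m[9]);
    g(s, 1, 6, 11, 12, m[10], m[11]);
    g(s, 2, 7, 8, 13, m[12], m[13]);
    g(s, 3, 4, 9, 14, m[14], m[15]);
}

/// Compresses one 64-byte block (as 16 little-endian words) into a 16-word output.
fn compress(cv: &[u32; 8], block: &[u32; 16], counter: u64, block_len: u32, flags: u32) -> [u32; 16] {
    let mut s = [
        cv[0], cv[1], cv[2], cv[3], cv[4], cv[5], cv[6], cv[7],
        IV[0], IV[1], IV[2], IV[3],
        counter as u32, (counter >> 32) as u32, block_len, flags,
    ];
    let mut m = *block;
    for r in 0..7 {
        mix_round(&mut s, &m);
        if r < 6 {
            // Permute the message words between rounds.
            let mut p = [0u32; 16];
            for i in 0..16 {
                p[i] = m[MSG_PERMUTATION[i]];
            }
            m = p;
        }
    }
    for i in 0..8 {
        s[i] ^= s[i + 8];
        s[i + 8] ^= cv[i];
    }
    s
}
```

The first eight output words form the next chaining value; the full 16 words are only needed for extended output.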

blake3_compress_blocks_simd

SIMD-optimized variant using Metal uint4 vector types for the G mixing function. The g_simd function operates on four state words simultaneously using Metal’s native vector arithmetic.

blake3_process_chunks

Complete chunk processing kernel. Each thread processes one 1024-byte chunk through all its 64-byte blocks, using threadgroup shared memory for block words. Handles CHUNK_START/CHUNK_END flags and partial final blocks.
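
The flag and length handling can be sketched as a small helper that lists, for a chunk of a given byte length, how long each block is and which flags it carries. The flag values follow the BLAKE3 specification; `block_schedule` itself is a hypothetical helper, not a function from the shader.

```rust
const BLOCK_LEN: usize = 64;
const CHUNK_LEN: usize = 1024;
const CHUNK_START: u32 = 1 << 0;
const CHUNK_END: u32 = 1 << 1;

/// Returns (block_len, flags) for every block of a chunk of `chunk_len` bytes.
/// An empty chunk still yields one zero-length block carrying both flags.
fn block_schedule(chunk_len: usize) -> Vec<(usize, u32)> {
    assert!(chunk_len <= CHUNK_LEN);
    let blocks = ((chunk_len + BLOCK_LEN - 1) / BLOCK_LEN).max(1);
    (0..blocks)
        .map(|i| {
            let is_last = i + 1 == blocks;
            // Only the final block may be shorter than 64 bytes.
            let len = if is_last { chunk_len - i * BLOCK_LEN } else { BLOCK_LEN };
            let mut flags = 0;
            if i == 0 {
                flags |= CHUNK_START;
            }
            if is_last {
                flags |= CHUNK_END;
            }
            (len, flags)
        })
        .collect()
}
```

A partial final block is zero-padded to 64 bytes before compression, while its true length is passed as the block length.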

blake3_process_chunks_tile

2D tile-based variant optimized for Apple Silicon GPU cache hierarchy (M1/M2/M3 tile-based deferred rendering architecture). Uses uint2 grid/threadgroup positions for better cache utilization on Apple GPUs.

blake3_tree_merge

Parallel tree node merging for BLAKE3 tree hashing. Each thread combines one pair of child chaining values into a parent node using the PARENT flag.
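
One level of this merge can be sketched as below, with each closure call standing in for one GPU thread. The compression function is passed in as a parameter so the pairing logic can be shown on its own; `merge_level` and the `Cv` alias are hypothetical, and passing an unpaired trailing node through unchanged is an assumption of this sketch rather than documented kernel behavior.

```rust
/// BLAKE3 domain flag for parent nodes.
const PARENT: u32 = 1 << 2;

/// An 8-word chaining value.
type Cv = [u32; 8];

/// Merges adjacent pairs of chaining values into parent chaining values.
/// `compress_parent` receives the 16-word block (left || right) and the flags.
fn merge_level<F>(children: &[Cv], compress_parent: F) -> Vec<Cv>
where
    F: Fn(&[u32; 16], u32) -> Cv,
{
    children
        .chunks(2)
        .map(|pair| match pair {
            [left, right] => {
                let mut block = [0u32; 16];
                block[..8].copy_from_slice(left);
                block[8..].copy_from_slice(right);
                compress_parent(&block, PARENT)
            }
            // Odd node out: carried up to the next level unchanged (sketch assumption).
            [single] => *single,
            _ => unreachable!("chunks(2) yields one or two items"),
        })
        .collect()
}
```

Calling `merge_level` repeatedly until one node remains walks the tree bottom-up, one kernel dispatch per level.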

SMAUG-T Operations

The SMAUG-T Metal shader (metamui-crypto-c/metamui-smaug-t/src/gpu/metal/smaug_metal_kernels.metal) implements polynomial-level and matrix-level operations.

Shader File Paths