Apple Metal GPU Acceleration

Metal compute shaders written in Metal Shading Language (MSL) offload batch cryptographic operations to Apple Silicon GPUs. The unified memory architecture on Apple Silicon (M1, M2, M3, M4 families) enables zero-copy CPU-GPU data handoff, eliminating PCIe transfer overhead.

Metal is used only for batch operations, where the GPU dispatch overhead is amortized across many items. The dispatch threshold is approximately 100 items (the Falcon dispatcher shown below checks batch_size >= 100) – below that, CPU SIMD (NEON) is faster.

Availability

Hardware: Apple Silicon M1, M2, M3, M4 (and their Pro/Max/Ultra variants)
OS: macOS only
Rust: Gated behind the metal cargo feature (#[cfg(all(feature = "metal", target_os = "macos"))])

The dispatch system probes for Metal device availability at runtime:

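// From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
/// Picks the backend for a batch: Metal for large batches when a device is
/// present, otherwise the result of the regular `detect_best()` probe.
pub fn detect_best_for_batch(batch_size: usize) -> SimdType {
    // GPU dispatch overhead only pays off at roughly 100 items or more.
    if batch_size >= 100 && is_metal_available() {
        return SimdType::Metal;
    }
    // Small batch or no Metal device: fall back to the non-batch selection.
    detect_best()
}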
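
The excerpt above depends on crate internals (SimdType, is_metal_available, detect_best). As a minimal sketch of the same decision that compiles on its own, the version below passes the Metal probe result in as a boolean. The Scalar and Neon variants, the METAL_BATCH_THRESHOLD constant, and the choose_backend name are assumptions for illustration; only SimdType::Metal and the threshold of 100 come from the source.

```rust
/// Hypothetical stand-in for the crate's `SimdType`; only `Metal` appears in
/// the source, the other variants are assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdType {
    Scalar,
    Neon,
    Metal,
}

/// Batch size from which GPU dispatch is assumed to pay off.
pub const METAL_BATCH_THRESHOLD: usize = 100;

/// Same decision as `detect_best_for_batch`, with the runtime probe passed in
/// as a plain boolean so the logic can be exercised on any machine.
pub fn choose_backend(batch_size: usize, metal_available: bool) -> SimdType {
    if batch_size >= METAL_BATCH_THRESHOLD && metal_available {
        return SimdType::Metal;
    }
    // CPU fallback (assumed): NEON on aarch64, scalar code elsewhere.
    if cfg!(target_arch = "aarch64") {
        SimdType::Neon
    } else {
        SimdType::Scalar
    }
}
```

Calling choose_backend(256, true) selects Metal, while choose_backend(16, true) and choose_backend(256, false) both stay on the CPU path.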

Falcon Operations

The Falcon Metal shader (metamui-crypto-rust/metamui-falcon512/src/metal/shaders/falcon_ntt.metal) implements four compute kernels:

`falcon_ntt_forward`

Cooley-Tukey butterfly NTT over Z_q (q = 12289). Each threadgroup cooperatively processes one polynomial from a batch:

Coefficients are loaded into threadgroup shared memory (max 1024 coefficients for Falcon-1024)
Converted to Montgomery form on load (to_mont)
Butterfly layers execute with threadgroup_barrier synchronization between layers
Converted back from Montgomery form on store

`falcon_ntt_inverse`

Gentleman-Sande butterfly (reversed layer order). Same cooperative threadgroup pattern as forward NTT, with a final scaling by n^{-1} mod q.
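
The two transforms can be illustrated with a CPU reference in Rust. This is a sketch under stated assumptions, not the shader code: it uses plain % reduction instead of Montgomery form, runs the butterflies sequentially instead of across a threadgroup, and assumes the negacyclic (mod x^n + 1) variant with bit-reversed twiddles. The helper names (pow_mod, find_psi, twiddles) are hypothetical, and the root of unity is searched for at runtime rather than taken from the shader's tables.

```rust
const Q: u64 = 12289;

/// Modular exponentiation base^exp mod Q.
fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % Q;
        }
        base = base * base % Q;
        exp >>= 1;
    }
    acc
}

/// Finds psi with psi^n = -1 (mod Q), i.e. a primitive 2n-th root of unity.
/// Requires n to be a power of two with 2 <= n <= 2048.
fn find_psi(n: usize) -> u64 {
    let n = n as u64;
    (2..Q)
        .map(|g| pow_mod(g, (Q - 1) / (2 * n)))
        .find(|&psi| pow_mod(psi, n) == Q - 1)
        .expect("2n must divide q - 1")
}

/// Powers of `root` indexed in bit-reversed order.
fn twiddles(n: usize, root: u64) -> Vec<u64> {
    let bits = n.trailing_zeros();
    (0..n)
        .map(|k| pow_mod(root, (k.reverse_bits() >> (usize::BITS - bits)) as u64))
        .collect()
}

/// Forward transform: Cooley-Tukey butterflies, one `while` pass per layer.
/// Inputs must already be reduced into [0, Q).
pub fn ntt_forward(a: &mut [u64]) {
    let n = a.len();
    let tw = twiddles(n, find_psi(n));
    let (mut t, mut m) = (n, 1);
    while m < n {
        t /= 2;
        for i in 0..m {
            let s = tw[m + i];
            let j1 = 2 * i * t;
            for j in j1..j1 + t {
                let u = a[j];
                let v = a[j + t] * s % Q;
                a[j] = (u + v) % Q;
                a[j + t] = (u + Q - v) % Q;
            }
        }
        m *= 2;
    }
}

/// Inverse transform: Gentleman-Sande butterflies with the layers in reverse
/// order, followed by the final scaling by n^{-1} mod Q.
pub fn ntt_inverse(a: &mut [u64]) {
    let n = a.len();
    let tw = twiddles(n, pow_mod(find_psi(n), Q - 2));
    let (mut t, mut m) = (1, n);
    while m > 1 {
        let h = m / 2;
        for i in 0..h {
            let s = tw[h + i];
            let j1 = 2 * i * t;
            for j in j1..j1 + t {
                let u = a[j];
                let v = a[j + t];
                a[j] = (u + v) % Q;
                a[j + t] = (u + Q - v) % Q * s % Q;
            }
        }
        t *= 2;
        m = h;
    }
    let n_inv = pow_mod(n as u64, Q - 2);
    for x in a.iter_mut() {
        *x = *x * n_inv % Q;
    }
}
```

Each pass of the outer while loop corresponds to one butterfly layer, which is where the GPU kernels would place a threadgroup_barrier. A production version would precompute the twiddle tables once instead of rebuilding them on every call.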

`falcon_pointwise_mul`

Coefficient-wise multiplication of two NTT-domain polynomials. This is embarrassingly parallel – each thread processes one coefficient independently via Montgomery multiply. No threadgroup coordination needed.
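
As a rough CPU analogue, the kernel body reduces to one multiplication per index. The sketch below assumes coefficients already reduced into [0, q) and uses plain % reduction where the shader uses a Montgomery multiply; the function name pointwise_mul is hypothetical.

```rust
const NTT_Q: u32 = 12289;

/// out[i] = a[i] * b[i] mod q. Each iteration is independent of the others,
/// which is what lets the GPU assign one thread per coefficient.
/// Inputs are assumed to be reduced into [0, q), so the product fits in u32.
pub fn pointwise_mul(a: &[u32], b: &[u32]) -> Vec<u32> {
    assert_eq!(a.len(), b.len(), "polynomials must have equal length");
    a.iter().zip(b).map(|(&x, &y)| x * y % NTT_Q).collect()
}
```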

`falcon_batch_verify`

Parallel signature norm checking across hundreds of signatures. The CPU pre-computes s0 = c - s1 * h mod q, using the SIMD-optimized NTT multiply for the product s1 * h, then the GPU checks ||s0||^2 + ||s1||^2 < beta^2 for all signatures in parallel.

Each threadgroup handles one signature:

Threads accumulate partial norms across their assigned coefficient stripes
Parallel reduction sums partial norms within the threadgroup
Thread 0 writes the pass/fail result (1 = valid, 0 = invalid)
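
The three steps above can be sketched on the CPU as follows. This is an illustration under assumptions, not the kernel itself: coefficients are taken as signed, already-centered i16 values, the stripes run over s0 followed by s1, the bound beta_sq is passed in by the caller, and the names norm_check and threads are hypothetical.

```rust
/// Simulates one threadgroup checking one signature.
/// Returns 1 if ||s0||^2 + ||s1||^2 < beta_sq, else 0.
/// `threads` plays the role of the threadgroup size and must be a power of two.
pub fn norm_check(s0: &[i16], s1: &[i16], beta_sq: u64, threads: usize) -> u32 {
    assert!(threads.is_power_of_two(), "tree reduction needs a power of two");

    // Step 1: thread t accumulates coefficients t, t + threads, t + 2 * threads, ...
    let mut partial: Vec<u64> = (0..threads)
        .map(|t| {
            s0.iter()
                .chain(s1.iter())
                .skip(t)
                .step_by(threads)
                .map(|&c| {
                    let v = c as i64;
                    (v * v) as u64
                })
                .sum::<u64>()
        })
        .collect();

    // Step 2: pairwise (tree) reduction; on the GPU each halving is separated
    // by a barrier so all partial sums are visible before they are combined.
    let mut stride = threads / 2;
    while stride > 0 {
        for t in 0..stride {
            partial[t] += partial[t + stride];
        }
        stride /= 2;
    }

    // Step 3: "thread 0" holds the total and writes the pass/fail flag.
    (partial[0] < beta_sq) as u32
}
```

For example, s0 = [3, -4] and s1 = [0, 0] have squared norm 25, so the check passes for beta_sq = 26 and fails for beta_sq = 25 because the comparison is strict.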

This hybrid approach leverages CPU SIMD for NTT polynomial multiplication (where it excels) and GPU for parallel norm reduction across many signatures.

Montgomery Arithmetic

All NTT kernels use branchless Montgomery arithmetic with these constants:

NTT_Q = 12289 (Falcon prime)
MONT_R = 65536 (R = 2^16)
MONT_QINV = 12287 (-q^{-1} mod R)
MONT_R2MODQ = 10952 (R^2 mod q)
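
To show how these constants fit together, here is a minimal Rust sketch of branchless Montgomery arithmetic with R = 2^16. The constants are the ones listed above; the function bodies are an assumed, textbook formulation and may differ from the shader. Only to_mont is named in the source; mont_reduce, mont_mul, and from_mont are hypothetical names.

```rust
const NTT_Q: u32 = 12289;
const MONT_QINV: u32 = 12287; // -q^{-1} mod 2^16
const MONT_R2MODQ: u32 = 10952; // 2^32 mod q

/// Montgomery reduction: returns t * R^{-1} mod q for any t < q * R.
pub fn mont_reduce(t: u32) -> u32 {
    // m is chosen so that t + m * q is divisible by R = 2^16.
    let m = ((t & 0xFFFF) * MONT_QINV) & 0xFFFF;
    let u = (t + m * NTT_Q) >> 16; // u lies in [0, 2q)
    // Branchless conditional subtraction of q.
    let d = u.wrapping_sub(NTT_Q);
    let mask = (d >> 31).wrapping_neg(); // all ones if u < q, else zero
    d.wrapping_add(NTT_Q & mask)
}

/// Product of two values in Montgomery form (both inputs must be < q).
pub fn mont_mul(a: u32, b: u32) -> u32 {
    mont_reduce(a * b)
}

/// Converts a in [0, q) to Montgomery form a * R mod q.
pub fn to_mont(a: u32) -> u32 {
    mont_mul(a, MONT_R2MODQ)
}

/// Converts back from Montgomery form to the ordinary residue.
pub fn from_mont(a: u32) -> u32 {
    mont_reduce(a)
}
```

The constants can be checked directly: q * MONT_QINV = 12288^2 - 1, which is -1 mod 2^16, and to_mont(1) returns R mod q = 4091. All intermediate values stay below 2^32, so u32 arithmetic suffices.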

BLAKE3 Operations

The BLAKE3 Metal shader (metamui-crypto-rust/metamui-blake3/src/metal/shaders/blake3.metal) implements five compute kernels:

`blake3_compress_blocks`

Parallel block compression where each thread processes one block independently. Loads chaining value and block words per thread, performs 7 rounds of mixing, and stores the 16-word output.

`blake3_compress_blocks_simd`

SIMD-optimized variant using Metal uint4 vector types for the G mixing function. The g_simd function operates on four state words simultaneously using Metal’s native vector arithmetic.
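
The relationship between the scalar G function and a four-lane variant can be sketched in Rust by emulating uint4 with [u32; 4]. The rotation amounts (16, 12, 8, 7) follow the BLAKE3 specification; the lane layout shown (one lane per state column) is an assumption about how g_simd is organized, and g_lanes, add, and xor_rotr are hypothetical names.

```rust
/// Scalar BLAKE3 G function on the state words at indices a, b, c, d.
pub fn g(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize, mx: u32, my: u32) {
    s[a] = s[a].wrapping_add(s[b]).wrapping_add(mx);
    s[d] = (s[d] ^ s[a]).rotate_right(16);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_right(12);
    s[a] = s[a].wrapping_add(s[b]).wrapping_add(my);
    s[d] = (s[d] ^ s[a]).rotate_right(8);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_right(7);
}

/// Stand-in for Metal's `uint4`.
pub type U4 = [u32; 4];

/// Lane-wise wrapping addition.
fn add(x: U4, y: U4) -> U4 {
    std::array::from_fn(|i| x[i].wrapping_add(y[i]))
}

/// Lane-wise XOR followed by a right rotation.
fn xor_rotr(x: U4, y: U4, r: u32) -> U4 {
    std::array::from_fn(|i| (x[i] ^ y[i]).rotate_right(r))
}

/// Four G functions at once: lane i mixes column i of the 4x4 state.
pub fn g_lanes(a: &mut U4, b: &mut U4, c: &mut U4, d: &mut U4, mx: U4, my: U4) {
    *a = add(add(*a, *b), mx);
    *d = xor_rotr(*d, *a, 16);
    *c = add(*c, *d);
    *b = xor_rotr(*b, *c, 12);
    *a = add(add(*a, *b), my);
    *d = xor_rotr(*d, *a, 8);
    *c = add(*c, *d);
    *b = xor_rotr(*b, *c, 7);
}
```

With rows 0 to 3 of the state loaded as a, b, c, d and the message words split into even and odd positions, one g_lanes call produces the same result as the four scalar column calls g(s, i, i + 4, i + 8, i + 12, m[2i], m[2i + 1]).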

`blake3_process_chunks`

Complete chunk processing kernel. Each thread processes one 1024-byte chunk through all its 64-byte blocks, using threadgroup shared memory for block words. Handles CHUNK_START/CHUNK_END flags and partial final blocks.
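
The flag and length handling can be sketched independently of the compression function. The Rust sketch below lists, for a chunk of a given byte length, the block length and flags of each compression call a thread would make. The flag values follow the BLAKE3 specification; the function name block_schedule is hypothetical, and the compression itself is left out.

```rust
pub const CHUNK_START: u32 = 1 << 0;
pub const CHUNK_END: u32 = 1 << 1;
pub const BLOCK_LEN: usize = 64;
pub const CHUNK_LEN: usize = 1024;

/// Returns (block_len, flags) for every block of a chunk holding
/// `chunk_len` bytes, in the order the blocks are compressed.
pub fn block_schedule(chunk_len: usize) -> Vec<(usize, u32)> {
    assert!(chunk_len <= CHUNK_LEN, "a chunk holds at most 1024 bytes");
    // An empty chunk (only possible for empty input) still compresses one
    // zero-length block.
    let blocks = ((chunk_len + BLOCK_LEN - 1) / BLOCK_LEN).max(1);
    (0..blocks)
        .map(|i| {
            let mut flags = 0;
            if i == 0 {
                flags |= CHUNK_START;
            }
            if i == blocks - 1 {
                flags |= CHUNK_END;
            }
            // The final block may be partial; the remaining bytes are zero-padded.
            let len = (chunk_len - i * BLOCK_LEN).min(BLOCK_LEN);
            (len, flags)
        })
        .collect()
}
```

A full 1024-byte chunk yields 16 blocks, while a 100-byte chunk yields a 64-byte block flagged CHUNK_START followed by a 36-byte block flagged CHUNK_END. A chunk that fits in a single block carries both flags at once.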

`blake3_process_chunks_tile`

2D tile-based variant tuned for the Apple Silicon GPU cache hierarchy (the tile-based deferred rendering architecture of M1/M2/M3 GPUs). It indexes work with uint2 grid and threadgroup positions to improve cache utilization.

`blake3_tree_merge`

Parallel tree node merging for BLAKE3 tree hashing. Each thread combines one pair of child chaining values into a parent node using the PARENT flag.
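
What each thread feeds into the compression function can be sketched as follows. Per the BLAKE3 specification, a parent block is the left child's chaining value followed by the right child's, with counter 0, block length 64, and the PARENT flag. The struct and function names are hypothetical, the compression call is omitted, and the ROOT flag on the final merge as well as any unpaired trailing value are left out of this sketch.

```rust
pub const PARENT: u32 = 1 << 2;
pub const BLOCK_LEN: u32 = 64;

/// Inputs to one parent-node compression (hypothetical container type).
#[derive(Debug, PartialEq)]
pub struct ParentJob {
    pub block_words: [u32; 16],
    pub counter: u64,
    pub block_len: u32,
    pub flags: u32,
}

/// Builds the compression inputs for one pair of child chaining values.
pub fn parent_job(left: &[u32; 8], right: &[u32; 8]) -> ParentJob {
    let mut block_words = [0u32; 16];
    block_words[..8].copy_from_slice(left);
    block_words[8..].copy_from_slice(right);
    ParentJob { block_words, counter: 0, block_len: BLOCK_LEN, flags: PARENT }
}

/// One merge level: pair i (one GPU thread) combines cvs[2i] and cvs[2i + 1].
/// A trailing unpaired value is ignored here; a real implementation has to
/// carry it to the next level.
pub fn merge_level(cvs: &[[u32; 8]]) -> Vec<ParentJob> {
    cvs.chunks_exact(2).map(|p| parent_job(&p[0], &p[1])).collect()
}
```

Four chaining values produce two jobs at the first level; compressing those yields two parent values, which form a single job at the next level.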

SMAUG-T Operations

The SMAUG-T Metal shader (metamui-crypto-c/metamui-smaug-t/src/gpu/metal/smaug_metal_kernels.metal) implements polynomial-level and matrix-level operations:

poly_add_kernel, poly_add_batch_kernel – polynomial addition
poly_sub_kernel, poly_sub_batch_kernel – polynomial subtraction
poly_mul_schoolbook_kernel – schoolbook polynomial multiplication
poly_mul_karatsuba_kernel – Karatsuba polynomial multiplication
ntt_forward_kernel – NTT forward transform
vec_vec_mult_kernel – vector-vector multiplication
matrix_vec_mult_kernel – matrix-vector multiplication
sample_gaussian_kernel – Gaussian sampling
pack_coefficients_kernel, unpack_coefficients_kernel – coefficient serialization
kem_keypair_batch_kernel – batch key pair generation

Shader File Paths

Falcon NTT: metamui-crypto-rust/metamui-falcon512/src/metal/shaders/falcon_ntt.metal
BLAKE3: metamui-crypto-rust/metamui-blake3/src/metal/shaders/blake3.metal
SMAUG-T: metamui-crypto-c/metamui-smaug-t/src/gpu/metal/smaug_metal_kernels.metal