Apple Metal GPU Acceleration
Metal compute shaders written in Metal Shading Language (MSL) offload batch cryptographic operations to Apple Silicon GPUs. The unified memory architecture on Apple Silicon (M1, M2, M3, M4 families) enables zero-copy CPU-GPU data handoff, eliminating PCIe transfer overhead.
Metal is used exclusively for batch operations where the GPU dispatch overhead is amortized across many items. The dispatch threshold is approximately 100 items – below that, CPU SIMD (NEON) is faster.
Availability
- Hardware: Apple Silicon M1, M2, M3, M4 (and their Pro/Max/Ultra variants)
- OS: macOS only
- Rust: Gated behind the metal cargo feature (#[cfg(all(feature = "metal", target_os = "macos"))])
The dispatch system probes for Metal device availability at runtime:
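// From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
pub fn detect_best_for_batch(batch_size: usize) -> SimdType {
    // Use the GPU only when the batch is large enough to amortize dispatch overhead.
    if batch_size >= 100 && is_metal_available() {
        return SimdType::Metal;
    }
    // Otherwise fall back to the best CPU SIMD backend.
    detect_best()
}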
Falcon Operations
The Falcon Metal shader (metamui-crypto-rust/metamui-falcon512/src/metal/shaders/falcon_ntt.metal) implements four compute kernels:
falcon_ntt_forward
Cooley-Tukey butterfly NTT over Z_q (q = 12289). Each thread group cooperatively processes one polynomial from a batch:
- Coefficients are loaded into threadgroup shared memory (max 1024 coefficients for Falcon-1024)
- Converted to Montgomery form on load (to_mont)
- Butterfly layers execute with threadgroup_barrier synchronization between layers
- Converted back from Montgomery form on store
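The butterfly layering described above can be sketched on the CPU. This is a minimal reference in Rust, not the shader code: it uses plain `%` reduction instead of Montgomery form, runs the layers sequentially where the GPU would place a barrier, and computes the cyclic transform only (Falcon's ring Z_q[x]/(x^n + 1) additionally needs a 2n-th root twist, omitted here). All function names are hypothetical.

```rust
/// Falcon modulus from the text.
const Q: u64 = 12289;

/// Modular exponentiation by repeated squaring.
fn mod_pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % Q;
        }
        base = base * base % Q;
        exp >>= 1;
    }
    acc
}

/// Primitive n-th root of unity mod Q, for n a power of two dividing Q - 1.
/// Q - 1 = 2^12 * 3, so g generates the multiplicative group iff
/// g^((Q-1)/2) != 1 and g^((Q-1)/3) != 1.
fn primitive_root_of_unity(n: u64) -> u64 {
    let g = (2..Q)
        .find(|&g| mod_pow(g, (Q - 1) / 2) != 1 && mod_pow(g, (Q - 1) / 3) != 1)
        .expect("the multiplicative group mod a prime is cyclic");
    mod_pow(g, (Q - 1) / n)
}

/// Reorders `a` into bit-reversed index order (len must be a power of two >= 2).
fn bit_reverse_permute(a: &mut [u64]) {
    let bits = a.len().trailing_zeros();
    for i in 0..a.len() {
        let j = i.reverse_bits() >> (usize::BITS - bits);
        if i < j {
            a.swap(i, j);
        }
    }
}

/// Forward cyclic NTT with Cooley-Tukey butterflies:
/// out[k] = sum_j input[j] * w^(j*k) mod Q, natural order in and out.
fn ntt_forward(input: &[u64]) -> Vec<u64> {
    let n = input.len();
    assert!(n >= 2 && n.is_power_of_two() && (Q - 1) % n as u64 == 0);
    let mut a: Vec<u64> = input.iter().map(|&x| x % Q).collect();
    bit_reverse_permute(&mut a);
    let w_n = primitive_root_of_unity(n as u64);
    let mut len = 2;
    while len <= n {
        // One butterfly layer; a GPU kernel would synchronize after each layer.
        let w_len = mod_pow(w_n, (n / len) as u64);
        for start in (0..n).step_by(len) {
            let mut w = 1u64;
            for j in 0..len / 2 {
                let u = a[start + j];
                let v = a[start + j + len / 2] * w % Q;
                a[start + j] = (u + v) % Q;
                a[start + j + len / 2] = (u + Q - v) % Q;
                w = w * w_len % Q;
            }
        }
        len *= 2;
    }
    a
}
```

Within one layer the butterflies touch disjoint index pairs, which is what allows a threadgroup to run them in parallel between barriers.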
falcon_ntt_inverse
Gentleman-Sande butterfly (reversed layer order). Same cooperative threadgroup pattern as forward NTT, with a final scaling by n^{-1} mod q.
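For comparison, a Gentleman-Sande inverse can be sketched in the same style. This is a hedged CPU reference in Rust with hypothetical names, using plain `%` reduction and the cyclic transform only; it is not taken from the shader. The layers run from the widest stride down, and the result is scaled by n^{-1} mod q at the end.

```rust
/// Falcon modulus from the text.
const Q: u64 = 12289;

/// Modular exponentiation by repeated squaring.
fn mod_pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % Q;
        }
        base = base * base % Q;
        exp >>= 1;
    }
    acc
}

/// Primitive n-th root of unity mod Q (Q - 1 = 2^12 * 3).
fn primitive_root_of_unity(n: u64) -> u64 {
    let g = (2..Q)
        .find(|&g| mod_pow(g, (Q - 1) / 2) != 1 && mod_pow(g, (Q - 1) / 3) != 1)
        .expect("the multiplicative group mod a prime is cyclic");
    mod_pow(g, (Q - 1) / n)
}

/// Reorders `a` into bit-reversed index order (len must be a power of two >= 2).
fn bit_reverse_permute(a: &mut [u64]) {
    let bits = a.len().trailing_zeros();
    for i in 0..a.len() {
        let j = i.reverse_bits() >> (usize::BITS - bits);
        if i < j {
            a.swap(i, j);
        }
    }
}

/// Inverse cyclic NTT with Gentleman-Sande butterflies:
/// out[j] = n^{-1} * sum_k input[k] * w^(-j*k) mod Q, natural order in and out.
fn ntt_inverse(input: &[u64]) -> Vec<u64> {
    let n = input.len();
    assert!(n >= 2 && n.is_power_of_two() && (Q - 1) % n as u64 == 0);
    let mut a: Vec<u64> = input.iter().map(|&x| x % Q).collect();
    // Inverse root via Fermat's little theorem: w^{-1} = w^(Q-2).
    let w_inv = mod_pow(primitive_root_of_unity(n as u64), Q - 2);
    let mut len = n;
    while len >= 2 {
        let w_len = mod_pow(w_inv, (n / len) as u64);
        for start in (0..n).step_by(len) {
            let mut w = 1u64;
            for j in 0..len / 2 {
                let u = a[start + j];
                let v = a[start + j + len / 2];
                a[start + j] = (u + v) % Q;
                // The twiddle is applied after the subtraction.
                a[start + j + len / 2] = (u + Q - v) % Q * w % Q;
                w = w * w_len % Q;
            }
        }
        len /= 2;
    }
    bit_reverse_permute(&mut a);
    let n_inv = mod_pow(n as u64, Q - 2);
    a.iter().map(|&x| x * n_inv % Q).collect()
}
```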
falcon_pointwise_mul
Coefficient-wise multiplication of two NTT-domain polynomials. This is embarrassingly parallel – each thread processes one coefficient independently via Montgomery multiply. No threadgroup coordination needed.
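As a rough illustration, the per-coefficient work amounts to the following Rust sketch. It uses a plain `%` reduction rather than the Montgomery multiply the kernel uses, and `pointwise_mul` is a hypothetical name; each loop iteration corresponds to one independent GPU thread.

```rust
/// Falcon modulus from the text.
const Q: u32 = 12289;

/// Multiplies two NTT-domain polynomials coefficient by coefficient mod Q.
/// No iteration reads or writes another iteration's data.
fn pointwise_mul(a: &[u32], b: &[u32]) -> Vec<u32> {
    assert_eq!(a.len(), b.len(), "polynomials must have equal length");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| ((x as u64 * y as u64) % Q as u64) as u32)
        .collect()
}
```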
falcon_batch_verify
Parallel signature norm checking across hundreds of signatures. The CPU pre-computes s0 = c - s1 * h mod q, using SIMD-optimized NTT multiplication for the product s1 * h, then the GPU checks ||s0||^2 + ||s1||^2 < beta^2 in parallel.
Each thread group handles one signature:
- Threads accumulate partial norms across their assigned coefficient stripes
- Parallel reduction sums partial norms within the threadgroup
- Thread 0 writes the pass/fail result (1 = valid, 0 = invalid)
This hybrid approach leverages CPU SIMD for NTT polynomial multiplication (where it excels) and GPU for parallel norm reduction across many signatures.
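The per-signature norm check can be sketched on the CPU as follows. This is an illustrative Rust model under stated assumptions, not the kernel itself: coefficients are assumed to arrive as values in [0, q) and are centered before squaring, `lanes` stands in for the number of threads in a threadgroup, and the bound is passed in as a parameter rather than hard-coded. All names are hypothetical.

```rust
/// Falcon modulus from the text.
const Q: i64 = 12289;

/// Maps a coefficient in [0, Q) to its centered representative in (-Q/2, Q/2].
fn center(c: u16) -> i64 {
    let c = c as i64 % Q;
    if c > Q / 2 { c - Q } else { c }
}

/// Computes ||s0||^2 + ||s1||^2 the way a threadgroup would:
/// each lane sums a stripe of coefficients, then a tree reduction combines lanes.
fn striped_sq_norm(s0: &[u16], s1: &[u16], lanes: usize) -> u64 {
    assert_eq!(s0.len(), s1.len());
    assert!(lanes.is_power_of_two());
    let mut partial = vec![0u64; lanes];
    for lane in 0..lanes {
        // Lane `lane` handles indices lane, lane + lanes, lane + 2 * lanes, ...
        let mut i = lane;
        while i < s0.len() {
            let (x, y) = (center(s0[i]), center(s1[i]));
            partial[lane] += (x * x + y * y) as u64;
            i += lanes;
        }
    }
    // Tree reduction: halve the number of active lanes each step.
    let mut stride = lanes / 2;
    while stride > 0 {
        for t in 0..stride {
            partial[t] += partial[t + stride];
        }
        stride /= 2;
    }
    partial[0]
}

/// Returns 1 (valid) or 0 (invalid) per signature, mirroring the kernel's output.
fn batch_verify(sigs: &[(Vec<u16>, Vec<u16>)], bound_sq: u64, lanes: usize) -> Vec<u8> {
    sigs.iter()
        .map(|(s0, s1)| (striped_sq_norm(s0, s1, lanes) < bound_sq) as u8)
        .collect()
}
```

On the GPU, each step of the reduction loop would be separated by a threadgroup barrier, and the final comparison corresponds to the write performed by thread 0.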
Montgomery Arithmetic
All NTT kernels use branchless Montgomery arithmetic with these constants:
- NTT_Q = 12289 (Falcon prime)
- MONT_R = 65536 (R = 2^16)
- MONT_QINV = 12287 (-q^{-1} mod R)
- MONT_R2MODQ = 10952 (R^2 mod q)
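To show how these constants fit together, here is a minimal Rust sketch of branchless Montgomery reduction with R = 2^16. Only the constant values and the name to_mont come from the text; `mont_reduce`, `mont_mul`, and `from_mont` are hypothetical names, and the shader's actual code may differ.

```rust
const NTT_Q: u32 = 12289;
const MONT_QINV: u32 = 12287; // -q^{-1} mod 2^16
const MONT_R2MODQ: u32 = 10952; // R^2 mod q

/// Computes t * R^{-1} mod q for any t < q * 2^16, without branching.
fn mont_reduce(t: u32) -> u32 {
    // m is chosen so that t + m * q is divisible by R = 2^16.
    let m = t.wrapping_mul(MONT_QINV) & 0xFFFF;
    let u = (t + m * NTT_Q) >> 16; // u < 2q, and the sum fits in 32 bits
    // Conditional subtraction of q via a mask instead of an `if`.
    let d = u.wrapping_sub(NTT_Q);
    let mask = 0u32.wrapping_sub(d >> 31); // all ones when u < q
    d.wrapping_add(NTT_Q & mask)
}

/// Montgomery product of two values already in Montgomery form.
fn mont_mul(a: u32, b: u32) -> u32 {
    mont_reduce(a * b)
}

/// Converts a in [0, q) to Montgomery form: a * R mod q.
fn to_mont(a: u32) -> u32 {
    mont_reduce(a * MONT_R2MODQ)
}

/// Converts back from Montgomery form to the standard representative.
fn from_mont(a: u32) -> u32 {
    mont_reduce(a)
}
```

For example, `from_mont(mont_mul(to_mont(a), to_mont(b)))` yields `a * b mod q` for a and b in [0, q), which is why the NTT kernels can convert once on load and once on store.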
BLAKE3 Operations
The BLAKE3 Metal shader (metamui-crypto-rust/metamui-blake3/src/metal/shaders/blake3.metal) implements five compute kernels:
blake3_compress_blocks
Parallel block compression where each thread processes one block independently. Loads chaining value and block words per thread, performs 7 rounds of mixing, and stores the 16-word output.
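The per-thread work can be illustrated with a CPU sketch of the BLAKE3 compression function in Rust. It follows the structure of the public BLAKE3 specification (initialization vector, message permutation, 7 rounds, final feed-forward XOR); the function names are hypothetical, and the Metal kernel may organize the same steps differently.

```rust
const IV: [u32; 8] = [
    0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19,
];
const MSG_PERMUTATION: [usize; 16] = [2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8];

/// The quarter-round mixing function G on four state words.
fn g(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize, mx: u32, my: u32) {
    s[a] = s[a].wrapping_add(s[b]).wrapping_add(mx);
    s[d] = (s[d] ^ s[a]).rotate_right(16);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_right(12);
    s[a] = s[a].wrapping_add(s[b]).wrapping_add(my);
    s[d] = (s[d] ^ s[a]).rotate_right(8);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_right(7);
}

/// One round: mix the four columns, then the four diagonals.
fn round(s: &mut [u32; 16], m: &[u32; 16]) {
    g(s, 0, 4, 8, 12, m[0], m[1]);
    g(s, 1, 5, 9, 13, m[2], m[3]);
    g(s, 2, 6, 10, 14, m[4], m[5]);
    g(s, 3, 7, 11, 15, m[6], m[7]);
    g(s, 0, 5, 10, 15, m[8], m[9]);
    g(s, 1, 6, 11, 12, m[10], m[11]);
    g(s, 2, 7, 8, 13, m[12], m[13]);
    g(s, 3, 4, 9, 14, m[14], m[15]);
}

/// Compresses one 64-byte block (as 16 little-endian words) and returns the 16-word output.
fn compress(cv: &[u32; 8], block: &[u32; 16], counter: u64, block_len: u32, flags: u32) -> [u32; 16] {
    let mut s = [
        cv[0], cv[1], cv[2], cv[3], cv[4], cv[5], cv[6], cv[7],
        IV[0], IV[1], IV[2], IV[3],
        counter as u32, (counter >> 32) as u32, block_len, flags,
    ];
    let mut m = *block;
    for r in 0..7 {
        round(&mut s, &m);
        if r < 6 {
            // Permute the message words between rounds.
            let mut p = [0u32; 16];
            for i in 0..16 {
                p[i] = m[MSG_PERMUTATION[i]];
            }
            m = p;
        }
    }
    // Feed-forward: fold the upper half into the lower, and the input CV into the upper.
    for i in 0..8 {
        s[i] ^= s[i + 8];
        s[i + 8] ^= cv[i];
    }
    s
}
```

Because `compress` depends only on its arguments, a batch of blocks with known chaining values can be processed by independent threads, one call each.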
blake3_compress_blocks_simd
SIMD-optimized variant using Metal uint4 vector types for the G mixing function. The g_simd function operates on four state words simultaneously using Metal’s native vector arithmetic.
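The idea of mixing four state words per operation can be modelled in Rust with four-element arrays standing in for Metal's uint4. This is a sketch of the technique only: `g_lanes` and `g_scalar` are hypothetical names, and the real g_simd function may differ in signature and layout.

```rust
/// Stand-in for a Metal uint4 vector.
type U32x4 = [u32; 4];

/// Lane-wise wrapping addition.
fn add4(x: U32x4, y: U32x4) -> U32x4 {
    std::array::from_fn(|i| x[i].wrapping_add(y[i]))
}

/// Lane-wise XOR followed by a right rotation.
fn xor_rotr4(x: U32x4, y: U32x4, r: u32) -> U32x4 {
    std::array::from_fn(|i| (x[i] ^ y[i]).rotate_right(r))
}

/// Applies the BLAKE3 G function to four independent (a, b, c, d) groups at once.
/// Each vector holds the same-role word from four columns (or four diagonals).
fn g_lanes(a: &mut U32x4, b: &mut U32x4, c: &mut U32x4, d: &mut U32x4, mx: U32x4, my: U32x4) {
    *a = add4(add4(*a, *b), mx);
    *d = xor_rotr4(*d, *a, 16);
    *c = add4(*c, *d);
    *b = xor_rotr4(*b, *c, 12);
    *a = add4(add4(*a, *b), my);
    *d = xor_rotr4(*d, *a, 8);
    *c = add4(*c, *d);
    *b = xor_rotr4(*b, *c, 7);
}

/// Scalar G on a single (a, b, c, d) group, for comparison.
fn g_scalar(mut a: u32, mut b: u32, mut c: u32, mut d: u32, mx: u32, my: u32) -> (u32, u32, u32, u32) {
    a = a.wrapping_add(b).wrapping_add(mx);
    d = (d ^ a).rotate_right(16);
    c = c.wrapping_add(d);
    b = (b ^ c).rotate_right(12);
    a = a.wrapping_add(b).wrapping_add(my);
    d = (d ^ a).rotate_right(8);
    c = c.wrapping_add(d);
    b = (b ^ c).rotate_right(7);
    (a, b, c, d)
}
```

The four column mixes of a round are independent of each other, as are the four diagonal mixes, so each half-round maps onto a single vectorized G call.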
blake3_process_chunks
Complete chunk processing kernel. Each thread processes one 1024-byte chunk through all its 64-byte blocks, using threadgroup shared memory for block words. Handles CHUNK_START/CHUNK_END flags and partial final blocks.
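The flag and length bookkeeping for one chunk can be sketched as below. This Rust helper is hypothetical and only models the schedule of (block length, flags) pairs a thread would feed to the compression function; the flag values are those defined by the BLAKE3 specification.

```rust
const BLOCK_LEN: usize = 64;
const CHUNK_LEN: usize = 1024;
const CHUNK_START: u32 = 1 << 0;
const CHUNK_END: u32 = 1 << 1;

/// Returns (block_len, flags) for each block of a chunk holding `chunk_len` bytes.
/// An empty chunk still yields one zero-length block, and a single-block chunk
/// carries both CHUNK_START and CHUNK_END.
fn chunk_block_schedule(chunk_len: usize) -> Vec<(u32, u32)> {
    assert!(chunk_len <= CHUNK_LEN, "a chunk holds at most 1024 bytes");
    let num_blocks = std::cmp::max(1, (chunk_len + BLOCK_LEN - 1) / BLOCK_LEN);
    (0..num_blocks)
        .map(|i| {
            let last = i == num_blocks - 1;
            let mut flags = 0;
            if i == 0 {
                flags |= CHUNK_START;
            }
            if last {
                flags |= CHUNK_END;
            }
            // Only the final block may be partial.
            let len = if last { chunk_len - i * BLOCK_LEN } else { BLOCK_LEN };
            (len as u32, flags)
        })
        .collect()
}
```

A partial final block is zero-padded to 64 bytes before compression, with its true length passed as the block length.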
blake3_process_chunks_tile
2D tile-based variant optimized for Apple Silicon GPU cache hierarchy (M1/M2/M3 tile-based deferred rendering architecture). Uses uint2 grid/threadgroup positions for better cache utilization on Apple GPUs.
blake3_tree_merge
Parallel tree node merging for BLAKE3 tree hashing. Each thread combines one pair of child chaining values into a parent node using the PARENT flag.
SMAUG-T Operations
The SMAUG-T Metal shader (metamui-crypto-c/metamui-smaug-t/src/gpu/metal/smaug_metal_kernels.metal) implements polynomial-level and matrix-level operations:
- poly_add_kernel, poly_add_batch_kernel – polynomial addition
- poly_sub_kernel, poly_sub_batch_kernel – polynomial subtraction
- poly_mul_schoolbook_kernel – schoolbook polynomial multiplication
- poly_mul_karatsuba_kernel – Karatsuba polynomial multiplication
- ntt_forward_kernel – NTT forward transform
- vec_vec_mult_kernel – vector-vector multiplication
- matrix_vec_mult_kernel – matrix-vector multiplication
- sample_gaussian_kernel – Gaussian sampling
- pack_coefficients_kernel, unpack_coefficients_kernel – coefficient serialization
- kem_keypair_batch_kernel – batch key pair generation
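As a point of reference for the schoolbook kernel, here is a hedged Rust sketch of schoolbook polynomial multiplication. It assumes a negacyclic ring Z_q[x]/(x^n + 1) with the modulus passed as a parameter; the ring, the modulus, and the function name are assumptions for illustration and are not taken from the kernel source.

```rust
/// Schoolbook product of two degree-(n-1) polynomials in Z_q[x]/(x^n + 1).
/// Terms that wrap past x^(n-1) re-enter with a negated sign because x^n = -1.
fn poly_mul_schoolbook(a: &[u32], b: &[u32], q: u32) -> Vec<u32> {
    assert_eq!(a.len(), b.len(), "polynomials must have equal length");
    let n = a.len();
    let q = q as u64;
    let mut c = vec![0u64; n];
    for i in 0..n {
        for j in 0..n {
            let prod = (a[i] as u64 % q) * (b[j] as u64 % q) % q;
            let k = i + j;
            if k < n {
                c[k] = (c[k] + prod) % q;
            } else {
                c[k - n] = (c[k - n] + q - prod) % q;
            }
        }
    }
    c.into_iter().map(|x| x as u32).collect()
}
```

Each output coefficient is an independent sum of n products, so a GPU version can assign one thread per output index; Karatsuba trades some of those multiplications for additions.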
Shader File Paths
- Falcon NTT: metamui-crypto-rust/metamui-falcon512/src/metal/shaders/falcon_ntt.metal
- BLAKE3: metamui-crypto-rust/metamui-blake3/src/metal/shaders/blake3.metal
- SMAUG-T: metamui-crypto-c/metamui-smaug-t/src/gpu/metal/smaug_metal_kernels.metal