Hardware Acceleration

MetaMUI crypto primitives use zero-config runtime detection to select the fastest available SIMD or GPU backend on every platform. No build flags, no user configuration: at startup the library probes CPU features (via CPUID on x86; NEON is assumed as the baseline on AArch64) and GPU availability, caches the result, and dispatches every hot-path operation through the best backend.
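The probe-once-and-cache pattern can be sketched as follows. This is a minimal illustration, not MetaMUI's actual API; the names `SimdLevel`, `detect`, and `simd_level` are hypothetical, and only the x86 feature probe and the AArch64 NEON baseline assumption come from the text above.

```rust
use std::sync::OnceLock;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SimdLevel {
    Avx512,
    Avx2,
    Neon,
    Portable,
}

/// Probe CPU features once. On x86-64 this queries CPUID at runtime;
/// on AArch64, NEON is part of the baseline ISA, so no probe is needed.
#[allow(unreachable_code)]
fn detect() -> SimdLevel {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512bw") {
            return SimdLevel::Avx512;
        }
        if is_x86_feature_detected!("avx2") {
            return SimdLevel::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        return SimdLevel::Neon;
    }
    SimdLevel::Portable // scalar fallback: always available
}

/// Detect once at first use, cache the answer, reuse it for every dispatch.
fn simd_level() -> SimdLevel {
    static LEVEL: OnceLock<SimdLevel> = OnceLock::new();
    *LEVEL.get_or_init(detect)
}
```

Caching in a `OnceLock` means the CPUID probe runs at most once per process, so the per-call dispatch cost is a single cached load.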

Dispatch Hierarchy

The runtime dispatcher evaluates backends in this order and selects the first one available:

Metal GPU  (macOS, Apple Silicon, batch >= 100)
    |
    v
AVX-512BW  (Intel Ice Lake 2019+, AMD Zen 4 2022+)
    |
    v
AVX-2      (Intel Haswell 2013+, AMD Excavator 2015+)
    |
    v
ARM NEON   (all AArch64 processors)
    |
    v
Portable   (scalar fallback -- always available)

For GPU backends (Metal, CUDA), dispatch activates only for batch operations, where the overhead of a GPU launch is amortized across many items. Single-polynomial operations always use CPU SIMD.
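The batch-size gate amounts to a simple priority check. A hedged sketch, using the "batch >= 100" threshold from the hierarchy above; the names `Backend`, `pick_backend`, and `GPU_BATCH_THRESHOLD` are illustrative, not the library's real identifiers:

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    MetalGpu,
    CpuSimd,
    Portable,
}

const GPU_BATCH_THRESHOLD: usize = 100; // "batch >= 100" from the dispatch hierarchy

fn pick_backend(batch_len: usize, gpu_available: bool, simd_available: bool) -> Backend {
    if gpu_available && batch_len >= GPU_BATCH_THRESHOLD {
        Backend::MetalGpu // launch overhead amortized across the batch
    } else if simd_available {
        Backend::CpuSimd // single-item ops always stay on CPU SIMD
    } else {
        Backend::Portable // scalar fallback
    }
}
```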

Backend Summary

Backend       Architecture            Register Width     Coefficients per Op
AVX-512       x86-64                  512-bit            32 x uint16
AVX-2         x86-64                  256-bit            16 x uint16
ARM NEON      AArch64                 128-bit            8 x uint16
Apple Metal   Apple Silicon GPU       Threadgroup        1024 coefficients
NVIDIA CUDA   NVIDIA GPU              Warp (32 threads)  Configurable
ARM SVE2      AArch64 (experimental)  128-2048 bit       Variable
Portable      Any                     64-bit             1 x uint16
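For the fixed-width SIMD rows, the coefficients-per-op column is just the register width divided by the 16-bit coefficient size, as this one-liner (illustrative, not a library function) shows:

```rust
/// Number of u16 coefficient lanes that fit in one SIMD register.
fn lanes_per_op(register_bits: u32) -> u32 {
    register_bits / 16 // coefficients are 16-bit integers
}
```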

Algorithm Coverage

Each algorithm benefits from different backends depending on its computational profile:

Falcon-512/1024
    AVX-512:  NTT, FFT, polynomial ops
    AVX-2:    NTT, FFT, polynomial ops
    NEON:     NTT, FFT, polynomial ops
    Metal:    NTT, pointwise mul, batch verify

BLAKE3
    AVX-512:  Batch block compression
    AVX-2:    8-block batch compression
    NEON:     Block compression
    Metal:    Chunk processing, tree merge
    CUDA:     Batch hashing, tree hashing

ML-KEM
    AVX-512:  NTT, poly ops, sampling
    AVX-2:    NTT, poly ops, sampling

HAETAE
    AVX-512:  Packing, polyfix, polymat, FFT, poly, NTT (ASM)

SMAUG-T
    AVX-512:  Poly add/sub/mul, NTT, matrix-vector, Barrett reduce
    AVX-2:    Poly add/sub/mul, NTT, matrix-vector, Karatsuba

Falcon SIMD Operations

The Falcon dispatch covers both NTT-domain operations (integer arithmetic mod q = 12289) and FFT-domain operations (floating-point complex arithmetic).
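As a concrete example of an NTT-domain operation, here is scalar pointwise multiplication mod Falcon's q = 12289 (the modulus is from the text; the function itself is an illustrative sketch, and the SIMD backends perform the same computation 8, 16, or 32 coefficients per instruction):

```rust
/// Falcon's NTT modulus.
const Q: u32 = 12289;

/// NTT-domain pointwise multiplication: c[i] = a[i] * b[i] mod q.
fn pointwise_mul(a: &[u16], b: &[u16]) -> Vec<u16> {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| ((x as u32 * y as u32) % Q) as u16)
        .collect()
}
```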

Constant-Time Guarantee

All SIMD paths use branchless masking throughout. Conditional operations use arithmetic masks (e.g. the (cond - 1) pattern) rather than branches, ensuring constant-time execution regardless of input values across every accelerated backend.
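The (cond - 1) pattern works because subtracting 1 from a 0/1 condition yields either an all-zeros or all-ones mask, which then selects a value without any data-dependent branch. A minimal scalar sketch (the function name `ct_select` is illustrative):

```rust
/// Constant-time select: returns a when cond == 1, b when cond == 0.
/// cond - 1 wraps to 0xFFFF when cond == 0 and is 0x0000 when cond == 1,
/// so the choice is made by masking, never by branching on secret data.
fn ct_select(cond: u16, a: u16, b: u16) -> u16 {
    debug_assert!(cond <= 1);
    let mask = cond.wrapping_sub(1); // all-ones iff cond == 0
    (a & !mask) | (b & mask)
}
```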

Portable Fallback

Every accelerated operation has a portable scalar implementation that produces identical results. If no SIMD or GPU backend is detected, the library falls back to the portable implementation automatically. The portable path is always compiled in; it is never conditionally excluded.
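A portable scalar routine in this style might look like the following: modular addition with a branchless final reduction, one coefficient at a time. This is a hedged sketch (the name `add_mod_q` and the exact reduction sequence are assumptions, though the branchless-masking style matches the constant-time guarantee above):

```rust
/// Falcon's NTT modulus.
const Q: u16 = 12289;

/// Portable scalar path: (a + b) mod q with a branchless conditional
/// subtraction, so the fallback is constant-time just like the SIMD lanes.
fn add_mod_q(a: u16, b: u16) -> u16 {
    let t = a + b;                         // a, b < q, so t < 2^15: no overflow
    let d = t.wrapping_sub(Q);             // top bit set iff t < q (borrow)
    let mask = 0u16.wrapping_sub(d >> 15); // all-ones iff t < q
    d.wrapping_add(Q & mask)               // add q back when we over-subtracted
}
```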

Further Reading