Hardware Acceleration
MetaMUI crypto primitives use zero-config runtime detection to automatically select the fastest available SIMD or GPU backend on every platform. No build flags, no user configuration – the library probes CPU features (via CPUID on x86-64; NEON is assumed on AArch64, where it is mandatory) and GPU availability at startup, caches the result, and dispatches every hot-path operation through the best backend.
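The detect-once, dispatch-forever pattern described above can be sketched as follows. This is a minimal illustration, not the library's real API: the `Backend` enum, `detect`, and `backend` names are assumptions, and the probes are simplified to Rust's standard `is_x86_feature_detected!` macro.

```rust
use std::sync::OnceLock;

/// Hypothetical backend tiers mirroring the dispatch order described here.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx512,
    Avx2,
    Neon,
    Portable,
}

/// Probe CPU features once (CPUID on x86-64; NEON assumed on AArch64).
fn detect() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512bw") {
            return Backend::Avx512;
        }
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is mandatory on AArch64, so no runtime probe is needed.
        return Backend::Neon;
    }
    Backend::Portable
}

/// Cache the probe result; every hot-path call dispatches through this.
fn backend() -> Backend {
    static CACHED: OnceLock<Backend> = OnceLock::new();
    *CACHED.get_or_init(detect)
}

fn main() {
    println!("selected backend: {:?}", backend());
}
```

The `OnceLock` ensures the (relatively expensive) feature probe runs exactly once per process, so dispatch after startup costs only a cached load.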
Dispatch Hierarchy
The runtime dispatcher evaluates backends in this order and selects the first one available:
```
Metal GPU (macOS, Apple Silicon, batch >= 100)
  |
  v
AVX-512BW (Intel Ice Lake 2019+, AMD Zen 4 2022+)
  |
  v
AVX2 (Intel Haswell 2013+, AMD Excavator 2015+)
  |
  v
ARM NEON (all AArch64 processors)
  |
  v
Portable (scalar fallback -- always available)
```
For GPU backends (Metal, CUDA), dispatch activates only for batch operations, where the fixed cost of a GPU launch is amortized across many items. Single-polynomial operations always use CPU SIMD.
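The batch-size gate can be sketched as a simple predicate. The threshold of 100 comes from the dispatch hierarchy above; the `use_gpu` name and signature are illustrative assumptions, not the library's real API.

```rust
/// GPU dispatch pays off only when the fixed launch overhead is
/// amortized over many items; below this threshold, CPU SIMD wins.
/// (Threshold value taken from the Metal tier in the dispatch hierarchy.)
const GPU_BATCH_THRESHOLD: usize = 100;

/// Hypothetical gate: use the GPU only if one is present AND the batch
/// is large enough to amortize the dispatch cost.
fn use_gpu(gpu_available: bool, batch_len: usize) -> bool {
    gpu_available && batch_len >= GPU_BATCH_THRESHOLD
}

fn main() {
    assert!(!use_gpu(true, 1));    // single polynomial: always CPU SIMD
    assert!(use_gpu(true, 256));   // large batch: GPU launch amortized
    assert!(!use_gpu(false, 256)); // no GPU present: CPU fallback
}
```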
Backend Summary
| Backend | Architecture | Register Width | Coefficients per Op |
|---|---|---|---|
| AVX-512 | x86-64 | 512-bit | 32 x uint16 |
| AVX2 | x86-64 | 256-bit | 16 x uint16 |
| ARM NEON | AArch64 | 128-bit | 8 x uint16 |
| Apple Metal | Apple Silicon GPU | Threadgroup | 1024 coefficients |
| NVIDIA CUDA | NVIDIA GPU | Warp (32 threads) | Configurable |
| ARM SVE2 | AArch64 (experimental) | 128-2048 bit | Variable |
| Portable | Any | 64-bit | 1 x uint16 |
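The "Coefficients per Op" column falls directly out of the register width: each polynomial coefficient is a 16-bit value, so a register holds width/16 lanes. A quick sanity-check sketch:

```rust
/// Lanes of u16 coefficients per SIMD register of the given width.
fn lanes(register_bits: usize) -> usize {
    register_bits / 16
}

fn main() {
    assert_eq!(lanes(512), 32); // AVX-512
    assert_eq!(lanes(256), 16); // AVX2
    assert_eq!(lanes(128), 8);  // NEON
    // A Falcon-512 polynomial (512 coefficients) thus needs
    // 512 / 32 = 16 AVX-512 operations versus 512 scalar operations.
    assert_eq!(512 / lanes(512), 16);
}
```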
Algorithm Coverage
Each algorithm benefits from different backends depending on its computational profile:
| Algorithm | AVX-512 | AVX2 | NEON | Metal | CUDA |
|---|---|---|---|---|---|
| Falcon-512/1024 | NTT, FFT, polynomial ops | NTT, FFT, polynomial ops | NTT, FFT, polynomial ops | NTT, pointwise mul, batch verify | – |
| BLAKE3 | Batch block compression | 8-block batch compression | Block compression | Chunk processing, tree merge | Batch hashing, tree hashing |
| ML-KEM | – | NTT, poly ops, sampling | NTT, poly ops, sampling | – | – |
| HAETAE | – | Packing, polyfix, polymat, FFT, poly, NTT (ASM) | – | – | – |
| SMAUG-T | – | – | – | Poly add/sub/mul, NTT, matrix-vector, Barrett reduce | Poly add/sub/mul, NTT, matrix-vector, Karatsuba |
Falcon SIMD Operations
The Falcon dispatch covers both NTT-domain (integers mod q = 12289) and FFT-domain (floating-point complex) operations:
- NTT domain: `ntt_forward`, `ntt_inverse`, `poly_pointwise`
- FFT domain: `poly_add_fft`, `poly_sub_fft`, `poly_mul_fft`, `poly_muladj_fft`, `poly_adj_fft`, `poly_inv_fft`, `poly_norm_fft`, `poly_split_fft`, `poly_merge_fft`
- Fused operations: `poly_muladd_fft` (c = acc + a*b), `poly_mulsub_fft` (c = acc - a*b)
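To make the NTT-domain semantics concrete, here is a portable scalar sketch of `poly_pointwise` and of the fused muladd pattern for q = 12289. The real SIMD kernels use Montgomery reduction; plain `%` is used here only for clarity, and the fused example is shown in the integer domain even though `poly_muladd_fft` operates on complex floats.

```rust
/// Falcon's NTT modulus.
const Q: u32 = 12289;

/// Coefficient-wise multiplication mod q (NTT-domain pointwise product).
fn poly_pointwise(a: &[u16], b: &[u16], out: &mut [u16]) {
    for i in 0..out.len() {
        out[i] = ((a[i] as u32 * b[i] as u32) % Q) as u16;
    }
}

/// Fused multiply-add, matching the c = acc + a*b shape of poly_muladd_fft
/// (shown here mod q rather than over complex floats).
fn poly_muladd(acc: &[u16], a: &[u16], b: &[u16], out: &mut [u16]) {
    for i in 0..out.len() {
        out[i] = ((acc[i] as u32 + a[i] as u32 * b[i] as u32) % Q) as u16;
    }
}

fn main() {
    let mut out = [0u16; 2];
    poly_pointwise(&[2, 3], &[5, 7], &mut out);
    assert_eq!(out, [10, 21]);
}
```

Pointwise multiplication in the NTT domain is what makes the O(n log n) polynomial product work: transform, multiply lane by lane, transform back.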
Constant-Time Guarantee
All SIMD paths use branchless masking throughout. Conditional operations use arithmetic masks ((cond - 1) patterns) rather than branches, ensuring constant-time execution regardless of input values. This applies to:
- Montgomery reduction in NTT kernels
- Modular addition/subtraction (`addmod`, `submod`)
- Butterfly operations in NTT and FFT transforms
- Signature norm checking
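The arithmetic-mask pattern can be sketched for `addmod`/`submod` with q = 12289. This is an illustrative scalar version, not the library's kernels: the correction step derives an all-ones or all-zeros mask from the sign bit instead of branching, so the instruction sequence is identical for every input.

```rust
/// Falcon's NTT modulus; q < 2^14, so u16 sums of reduced values
/// stay below 2^15 and the top bit is free to act as a sign bit.
const Q: u16 = 12289;

/// Branchless (a + b) mod q. Inputs must already be reduced (< q).
fn addmod(a: u16, b: u16) -> u16 {
    debug_assert!(a < Q && b < Q);
    let t = (a + b).wrapping_sub(Q);       // "negative" iff a + b < q
    let mask = 0u16.wrapping_sub(t >> 15); // 0xFFFF when negative, else 0
    t.wrapping_add(Q & mask)               // conditionally add q back
}

/// Branchless (a - b) mod q, same masking pattern.
fn submod(a: u16, b: u16) -> u16 {
    debug_assert!(a < Q && b < Q);
    let t = a.wrapping_sub(b);
    let mask = 0u16.wrapping_sub(t >> 15);
    t.wrapping_add(Q & mask)
}

fn main() {
    assert_eq!(addmod(12288, 1), 0);  // wraps around q
    assert_eq!(submod(0, 1), 12288);  // borrows q back, branch-free
}
```

The same `(mask & q)` correction vectorizes directly: SIMD comparison instructions already produce all-ones/all-zeros lane masks.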
Portable Fallback
Every accelerated operation has a portable scalar implementation that produces identical results. If no SIMD or GPU backend is detected, the library falls back to the portable implementation automatically. The portable path is always compiled in – it is never conditionally excluded.