Hardware Acceleration
MetaMUI crypto primitives use zero-config runtime detection to automatically select the fastest available SIMD or GPU backend on every platform. No build flags, no user configuration – the library probes CPU features (via CPUID on x86-64; NEON is assumed on AArch64, where it is mandatory) and GPU availability at startup, caches the result, and dispatches every hot-path operation through the best backend.
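The detect-once, dispatch-forever pattern described above can be sketched as follows. This is a minimal illustration, not the library's real API: the `Backend` enum, `detect`, and `backend` names are assumptions, and the probes are simplified to Rust's standard `is_x86_feature_detected!` macro.

```rust
use std::sync::OnceLock;

/// Hypothetical backend tiers mirroring the dispatch order described here.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx512,
    Avx2,
    Neon,
    Portable,
}

/// Probe CPU features once (CPUID on x86-64; NEON assumed on AArch64).
fn detect() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512bw") {
            return Backend::Avx512;
        }
        if is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is mandatory on AArch64, so no runtime probe is needed.
        return Backend::Neon;
    }
    Backend::Portable
}

/// Cache the probe result; every hot-path call dispatches through this.
fn backend() -> Backend {
    static CACHED: OnceLock<Backend> = OnceLock::new();
    *CACHED.get_or_init(detect)
}

fn main() {
    println!("selected backend: {:?}", backend());
}
```

The `OnceLock` ensures the (relatively expensive) feature probe runs exactly once per process, so dispatch after startup costs only a cached load.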
Dispatch Hierarchy
The runtime dispatcher evaluates backends in this order and selects the first one available:
```
Metal GPU (macOS, Apple Silicon, batch >= 100)
  |
  v
AVX-512BW (Intel Ice Lake 2019+, AMD Zen 4 2022+)
  |
  v
AVX2 (Intel Haswell 2013+, AMD Excavator 2015+)
  |
  v
ARM NEON (all AArch64 processors)
  |
  v
Portable (scalar fallback -- always available)
```
For GPU backends (Metal, CUDA), dispatch activates only for batch operations, where the fixed cost of a GPU launch is amortized across many items. Single-polynomial operations always use CPU SIMD.
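The batch-size gate can be sketched as a simple predicate. The threshold of 100 comes from the dispatch hierarchy above; the `use_gpu` name and signature are illustrative assumptions, not the library's real API.

```rust
/// GPU dispatch pays off only when the fixed launch overhead is
/// amortized over many items; below this threshold, CPU SIMD wins.
/// (Threshold value taken from the Metal tier in the dispatch hierarchy.)
const GPU_BATCH_THRESHOLD: usize = 100;

/// Hypothetical gate: use the GPU only if one is present AND the batch
/// is large enough to amortize the dispatch cost.
fn use_gpu(gpu_available: bool, batch_len: usize) -> bool {
    gpu_available && batch_len >= GPU_BATCH_THRESHOLD
}

fn main() {
    assert!(!use_gpu(true, 1));    // single polynomial: always CPU SIMD
    assert!(use_gpu(true, 256));   // large batch: GPU launch amortized
    assert!(!use_gpu(false, 256)); // no GPU present: CPU fallback
}
```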
Backend Summary
| Backend | Architecture | Register Width | Coefficients per Op |
|---|---|---|---|
| AVX-512 | x86-64 | 512-bit | 32 x uint16 |
| AVX2 | x86-64 | 256-bit | 16 x uint16 |
| ARM NEON | AArch64 | 128-bit | 8 x uint16 |
| Apple Metal | Apple Silicon GPU | Threadgroup | 1024 coefficients |
| NVIDIA CUDA | NVIDIA GPU | Warp (32 threads) | Configurable |
| ARM SVE2 | AArch64 (experimental) | 128-2048 bit | Variable |
| Portable | Any | 64-bit | 1 x uint16 |
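The "Coefficients per Op" column falls directly out of the register width: each polynomial coefficient is a 16-bit value, so a register holds width/16 lanes. A quick sanity-check sketch:

```rust
/// Lanes of u16 coefficients per SIMD register of the given width.
fn lanes(register_bits: usize) -> usize {
    register_bits / 16
}

fn main() {
    assert_eq!(lanes(512), 32); // AVX-512
    assert_eq!(lanes(256), 16); // AVX2
    assert_eq!(lanes(128), 8);  // NEON
    // A Falcon-512 polynomial (512 coefficients) thus needs
    // 512 / 32 = 16 AVX-512 operations versus 512 scalar operations.
    assert_eq!(512 / lanes(512), 16);
}
```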
Algorithm Coverage
Each algorithm benefits from different backends depending on its computational profile:
| Algorithm | AVX-512 | AVX2 | NEON | Metal | CUDA |
|---|---|---|---|---|---|
| Falcon-512/1024 | NTT, FFT, polynomial ops | NTT, FFT, polynomial ops | NTT, FFT, polynomial ops | NTT, pointwise mul, batch verify | – |
| BLAKE3 | Batch block compression | 8-block batch compression | Block compression | Chunk processing, tree merge | Batch hashing, tree hashing |
| ML-KEM | – | NTT, poly ops, sampling | NTT, poly ops, sampling | – | – |
| HAETAE | – | Packing, polyfix, polymat, FFT, poly, NTT (ASM) | – | – | – |
| SMAUG-T | – | – | – | Poly add/sub/mul, NTT, matrix-vector, Barrett reduce | Poly add/sub/mul, NTT, matrix-vector, Karatsuba |
Falcon SIMD Operations
The Falcon dispatch covers both NTT-domain (integers mod q = 12289) and FFT-domain (floating-point complex) operations:
- NTT domain: `ntt_forward`, `ntt_inverse`, `poly_pointwise`
- FFT domain: `poly_add_fft`, `poly_sub_fft`, `poly_mul_fft`, `poly_muladj_fft`, `poly_adj_fft`, `poly_inv_fft`, `poly_norm_fft`, `poly_split_fft`, `poly_merge_fft`
- Fused operations: `poly_muladd_fft` (c = acc + a*b), `poly_mulsub_fft` (c = acc - a*b)
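To make the NTT-domain semantics concrete, here is a portable scalar sketch of `poly_pointwise` and of the fused muladd pattern for q = 12289. The real SIMD kernels use Montgomery reduction; plain `%` is used here only for clarity, and the fused example is shown in the integer domain even though `poly_muladd_fft` operates on complex floats.

```rust
/// Falcon's NTT modulus.
const Q: u32 = 12289;

/// Coefficient-wise multiplication mod q (NTT-domain pointwise product).
fn poly_pointwise(a: &[u16], b: &[u16], out: &mut [u16]) {
    for i in 0..out.len() {
        out[i] = ((a[i] as u32 * b[i] as u32) % Q) as u16;
    }
}

/// Fused multiply-add, matching the c = acc + a*b shape of poly_muladd_fft
/// (shown here mod q rather than over complex floats).
fn poly_muladd(acc: &[u16], a: &[u16], b: &[u16], out: &mut [u16]) {
    for i in 0..out.len() {
        out[i] = ((acc[i] as u32 + a[i] as u32 * b[i] as u32) % Q) as u16;
    }
}

fn main() {
    let mut out = [0u16; 2];
    poly_pointwise(&[2, 3], &[5, 7], &mut out);
    assert_eq!(out, [10, 21]);
}
```

Pointwise multiplication in the NTT domain is what makes the O(n log n) polynomial product work: transform, multiply lane by lane, transform back.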
Constant-Time Guarantee
All SIMD paths use branchless masking throughout. Conditional operations use arithmetic masks ((cond - 1) patterns) rather than branches, ensuring constant-time execution regardless of input values. This applies to:
- Montgomery reduction in NTT kernels
- Modular addition/subtraction (`addmod`, `submod`)
- Butterfly operations in NTT and FFT transforms
- Signature norm checking
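The arithmetic-mask pattern can be sketched for `addmod`/`submod` with q = 12289. This is an illustrative scalar version, not the library's kernels: the correction step derives an all-ones or all-zeros mask from the sign bit instead of branching, so the instruction sequence is identical for every input.

```rust
/// Falcon's NTT modulus; q < 2^14, so u16 sums of reduced values
/// stay below 2^15 and the top bit is free to act as a sign bit.
const Q: u16 = 12289;

/// Branchless (a + b) mod q. Inputs must already be reduced (< q).
fn addmod(a: u16, b: u16) -> u16 {
    debug_assert!(a < Q && b < Q);
    let t = (a + b).wrapping_sub(Q);       // "negative" iff a + b < q
    let mask = 0u16.wrapping_sub(t >> 15); // 0xFFFF when negative, else 0
    t.wrapping_add(Q & mask)               // conditionally add q back
}

/// Branchless (a - b) mod q, same masking pattern.
fn submod(a: u16, b: u16) -> u16 {
    debug_assert!(a < Q && b < Q);
    let t = a.wrapping_sub(b);
    let mask = 0u16.wrapping_sub(t >> 15);
    t.wrapping_add(Q & mask)
}

fn main() {
    assert_eq!(addmod(12288, 1), 0);  // wraps around q
    assert_eq!(submod(0, 1), 12288);  // borrows q back, branch-free
}
```

The same `(mask & q)` correction vectorizes directly: SIMD comparison instructions already produce all-ones/all-zeros lane masks.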
Portable Fallback
Every accelerated operation has a portable scalar implementation that produces identical results. If no SIMD or GPU backend is detected, the library falls back to the portable implementation automatically. The portable path is always compiled in – it is never conditionally excluded.