AVX-512 Acceleration
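AVX-512 provides 512-bit SIMD registers, so a single instruction processes 32 x uint16_t coefficients in NTT operations or 8 x f64 values in FFT-domain polynomial arithmetic. That is twice the lane count of AVX-2 (256-bit registers) per instruction. The realized speedup on vectorizable operations can be lower than 2x, for example on CPUs that reduce their clock frequency under sustained 512-bit load.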
Platform Availability
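- Intel: Skylake-X / Skylake-SP (2017) and later HEDT/server parts; Ice Lake (2019), Tiger Lake, and Rocket Lake client parts. Alder Lake (2021) and later hybrid client CPUs ship with AVX-512 disabled, so client availability cannot be inferred from the generation alone.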
- AMD: Zen 4 (2022) and later
AVX-512 is a family of extensions. The MetaMUI implementation requires two specific subsets:
- AVX-512F (Foundation) – CPUID leaf 7, EBX bit 16
- AVX-512BW (Byte and Word) – CPUID leaf 7, EBX bit 30
Both must be present for the AVX-512 backend to activate.
Runtime Detection
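C: Confirms that the OS has enabled XSAVE (OSXSAVE, CPUID leaf 1, ECX bit 27), then tests both leaf 7 feature bits. As shown, the check does not read XCR0, so it does not by itself confirm that the OS preserves ZMM and opmask state:
    // From metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c
    static int detect_avx512bw(void) {
        uint32_t eax, ebx, ecx, edx;
        cpuid(1, &eax, &ebx, &ecx, &edx);
        if (!(ecx & (1u << 27)))  // OSXSAVE not set
            return 0;
        cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
        // AVX-512F (bit 16) and AVX-512BW (bit 30) must both be set
        return ((ebx & (1u << 16)) && (ebx & (1u << 30))) != 0;
    }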
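A fuller check would also confirm, via XCR0, that the OS saves the AVX-512 register state. The following is a minimal sketch of that bit logic, not the MetaMUI implementation: the helper name avx512bw_usable and the macro names are hypothetical, and the function takes already-read register values so that it stays independent of how CPUID and XGETBV are invoked.

```c
#include <stdint.h>

/* CPUID.1:ECX bit 27 (OSXSAVE): the OS has enabled XSAVE/XGETBV. */
#define CPUID1_ECX_OSXSAVE   (1u << 27)
/* CPUID.(EAX=7,ECX=0):EBX feature bits. */
#define CPUID7_EBX_AVX512F   (1u << 16)
#define CPUID7_EBX_AVX512BW  (1u << 30)
/* XCR0 state bits: 1 = SSE (XMM), 2 = AVX (upper YMM halves),
 * 5 = opmask k0-k7, 6 = upper 256 bits of ZMM0-15, 7 = ZMM16-31. */
#define XCR0_AVX512_STATE    0xE6ull

/* Hypothetical helper: pure bit logic over register values read elsewhere. */
static int avx512bw_usable(uint32_t leaf1_ecx, uint32_t leaf7_ebx, uint64_t xcr0)
{
    if (!(leaf1_ecx & CPUID1_ECX_OSXSAVE))
        return 0;  /* XGETBV is unavailable, so XCR0 cannot be trusted */
    if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE)
        return 0;  /* OS does not save the full ZMM/opmask state */
    return (leaf7_ebx & CPUID7_EBX_AVX512F) &&
           (leaf7_ebx & CPUID7_EBX_AVX512BW);
}
```

On GCC or Clang the inputs could be obtained with __get_cpuid and __get_cpuid_count from <cpuid.h>, with XCR0 read through an XGETBV wrapper. That wiring is left out here because it depends on the toolchain.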
Rust: Returns true unconditionally when the build already targets avx512bw; otherwise checks both features at runtime with is_x86_feature_detected!:
    // From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
    pub fn is_avx512_available() -> bool {
        #[cfg(target_feature = "avx512bw")]
        { true }
        #[cfg(not(target_feature = "avx512bw"))]
        {
            is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512bw")
        }
    }
In Rust, the AVX-512 paths are gated behind the avx512 cargo feature flag (#[cfg(feature = "avx512")]).
Algorithms Accelerated
Falcon-512/1024
32-way parallel butterfly operations for NTT, and 8-wide f64 processing for FFT:
NTT operations (32 x uint16_t per register):
- Forward NTT (Cooley-Tukey butterfly)
- Inverse NTT (Gentleman-Sande butterfly)
- Pointwise polynomial multiplication
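As a scalar reference for what each of the 32 lanes computes, the three NTT operations above can be sketched as follows. This is an illustrative model only: the function names are hypothetical, inputs are assumed to lie in [0, q), and the plain % reduction stands in for whatever Montgomery or Barrett reduction a vectorized kernel would use. The modulus q = 12289 is Falcon's.

```c
#include <stddef.h>
#include <stdint.h>

#define FALCON_Q 12289u  /* Falcon's NTT modulus */

/* Cooley-Tukey butterfly: (a, b) -> (a + w*b, a - w*b) mod q. */
static void ct_butterfly(uint16_t *a, uint16_t *b, uint16_t w)
{
    uint32_t t = ((uint32_t)*b * w) % FALCON_Q;
    uint32_t x = *a;
    *a = (uint16_t)((x + t) % FALCON_Q);
    *b = (uint16_t)((x + FALCON_Q - t) % FALCON_Q);
}

/* Gentleman-Sande butterfly: (a, b) -> (a + b, (a - b)*w) mod q. */
static void gs_butterfly(uint16_t *a, uint16_t *b, uint16_t w)
{
    uint32_t x = *a, y = *b;
    *a = (uint16_t)((x + y) % FALCON_Q);
    *b = (uint16_t)((((x + FALCON_Q - y) % FALCON_Q) * w) % FALCON_Q);
}

/* Pointwise multiplication in the NTT domain: r[i] = a[i] * b[i] mod q. */
static void pointwise_mul(uint16_t *r, const uint16_t *a,
                          const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = (uint16_t)(((uint32_t)a[i] * b[i]) % FALCON_Q);
}
```

An AVX-512BW kernel would apply the same arithmetic to 32 coefficient pairs at once, one pair per 16-bit lane.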
FFT operations (8 x f64 per register):
- poly_add_fft, poly_sub_fft, poly_mul_fft
- poly_muladj_fft (multiply by conjugate)
- poly_adj_fft, poly_inv_fft, poly_norm_fft
- poly_split_fft, poly_merge_fft
- poly_muladd_fft, poly_mulsub_fft (fused multiply-add/subtract)
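To make the "multiply by conjugate" operation concrete, here is a scalar sketch of the per-element arithmetic. The function name muladj_scalar and the split real/imaginary array layout are assumptions chosen for illustration; the actual signature and memory layout of poly_muladj_fft are not shown in this document.

```c
#include <stddef.h>

/* Hypothetical scalar model: a[i] <- a[i] * conj(b[i]) for n complex values
 * stored as separate real and imaginary arrays. */
static void muladj_scalar(double *a_re, double *a_im,
                          const double *b_re, const double *b_im, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* (ar + i*ai) * (br - i*bi) */
        double re = a_re[i] * b_re[i] + a_im[i] * b_im[i];
        double im = a_im[i] * b_re[i] - a_re[i] * b_im[i];
        a_re[i] = re;
        a_im[i] = im;
    }
}
```

With 8 x f64 per register, a vectorized version would process eight such elements per iteration, and the two multiply-accumulate pairs map naturally onto fused multiply-add instructions.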
BLAKE3
16-block batch compression using 512-bit vectors (16 x uint32_t).
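One way to picture the 16 x uint32_t layout is that lane i of each vector holds the same state word from block i, so one vector instruction advances all 16 blocks. The sketch below models that with plain arrays around the BLAKE3 G mixing function; the names g_scalar and g_lanes are hypothetical, and the loop stands in for what would be single 512-bit operations.

```c
#include <stdint.h>

#define LANES 16  /* 512 bits / 32-bit words */

static uint32_t rotr32(uint32_t x, unsigned r)
{
    return (x >> r) | (x << (32 - r));  /* r is never 0 here */
}

/* BLAKE3 G mixing function on one block's state words. */
static void g_scalar(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
                     uint32_t mx, uint32_t my)
{
    *a = *a + *b + mx;  *d = rotr32(*d ^ *a, 16);
    *c = *c + *d;       *b = rotr32(*b ^ *c, 12);
    *a = *a + *b + my;  *d = rotr32(*d ^ *a, 8);
    *c = *c + *d;       *b = rotr32(*b ^ *c, 7);
}

/* Lane-wise model: each array is one 512-bit vector, one lane per block. */
static void g_lanes(uint32_t a[LANES], uint32_t b[LANES],
                    uint32_t c[LANES], uint32_t d[LANES],
                    const uint32_t mx[LANES], const uint32_t my[LANES])
{
    for (int i = 0; i < LANES; i++)
        g_scalar(&a[i], &b[i], &c[i], &d[i], mx[i], my[i]);
}
```

Because the lanes never interact inside G, the batch result for any block equals the scalar result for that block, which is what makes the 16-way layout straightforward to vectorize.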
Dispatch Priority
AVX-512 sits just below Metal GPU in the dispatch hierarchy. On machines with AVX-512 support, it is always preferred over AVX-2:
Metal GPU --> AVX-512 --> AVX-2 --> NEON --> Portable
Every AVX-512 capable CPU also supports AVX-2, so both backends are candidates on such machines; the dispatcher selects the wider implementation.
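The priority order can be expressed as a simple first-match selection. The sketch below is illustrative only: the enum, the capability flags, and select_backend are hypothetical names, not the dispatcher's actual API.

```c
/* Hypothetical backend IDs and capability flags, for illustration only. */
enum backend {
    BACKEND_PORTABLE,
    BACKEND_NEON,
    BACKEND_AVX2,
    BACKEND_AVX512,
    BACKEND_METAL
};

#define CAP_NEON   (1u << 0)
#define CAP_AVX2   (1u << 1)
#define CAP_AVX512 (1u << 2)
#define CAP_METAL  (1u << 3)

/* Return the highest-priority backend whose capability flag is set. */
static enum backend select_backend(unsigned caps)
{
    if (caps & CAP_METAL)  return BACKEND_METAL;
    if (caps & CAP_AVX512) return BACKEND_AVX512;
    if (caps & CAP_AVX2)   return BACKEND_AVX2;
    if (caps & CAP_NEON)   return BACKEND_NEON;
    return BACKEND_PORTABLE;
}
```

For example, a machine reporting both CAP_AVX2 and CAP_AVX512 resolves to BACKEND_AVX512, and a machine with no flags set falls through to the portable implementation.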
Implementation Files
C (Falcon):
- metamui-crypto-c/metamui-falcon512/src/simd/falcon_ntt_avx512.c
- metamui-crypto-c/metamui-falcon512/src/simd/falcon_fft_avx512.c
Rust (Falcon):
- metamui-crypto-rust/metamui-falcon512/src/simd/avx512_ntt.rs
- metamui-crypto-rust/metamui-falcon512/src/simd/avx512_fft.rs
Build Notes
In Rust, AVX-512 support must be enabled at build time with the avx512 cargo feature:
cargo build --features avx512
The C implementation uses compile-time guards (#if defined(__AVX512F__) && defined(__AVX512BW__)) and requires the appropriate compiler flags (e.g., -mavx512f -mavx512bw on GCC/Clang).
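A compile-time guard of this kind typically combines with the runtime check, since code built with the flags can still run on a CPU without the instructions. The following is a minimal sketch under that assumption; avx512_compiled_in and use_avx512 are hypothetical helper names.

```c
/* Hypothetical helper: 1 if this translation unit was compiled with both
 * AVX-512 subsets enabled (e.g. -mavx512f -mavx512bw), otherwise 0. */
static int avx512_compiled_in(void)
{
#if defined(__AVX512F__) && defined(__AVX512BW__)
    return 1;
#else
    return 0;
#endif
}

/* Use the AVX-512 backend only if it was built in AND the CPU/OS check passed. */
static int use_avx512(int runtime_detected)
{
    return avx512_compiled_in() && runtime_detected;
}
```

When the guard is false the AVX-512 kernels are not compiled at all, so use_avx512 returns 0 regardless of what the runtime detection reports.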