AVX-512 Acceleration

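AVX-512 provides 512-bit SIMD registers, so a single instruction processes 32 x uint16_t coefficients for NTT operations or 8 x f64 values for FFT-domain polynomial arithmetic. That is twice the lane count of AVX-2's 256-bit registers, which doubles peak per-instruction throughput for vectorizable operations; the speedup actually realized depends on the workload and the CPU.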
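
The lane counts quoted above follow directly from dividing the register width by the element width. A minimal sketch of that arithmetic (the helper name `lanes_per_register` is hypothetical and not part of the MetaMUI code base):

```rust
/// Number of elements of `element_bits` width that fit in one SIMD register.
/// Hypothetical helper, used here only to illustrate the arithmetic.
pub const fn lanes_per_register(register_bits: usize, element_bits: usize) -> usize {
    register_bits / element_bits
}

// AVX-512 (512-bit registers)
pub const AVX512_U16_LANES: usize = lanes_per_register(512, 16);
pub const AVX512_F64_LANES: usize = lanes_per_register(512, 64);

// AVX-2 (256-bit registers), for comparison
pub const AVX2_U16_LANES: usize = lanes_per_register(256, 16);
pub const AVX2_F64_LANES: usize = lanes_per_register(256, 64);
```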

Platform Availability

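Intel: Skylake-X and Skylake-SP (2017) and later HEDT/server parts; Ice Lake (2019), Tiger Lake, and Rocket Lake client parts. Alder Lake (2021) and later client CPUs ship with AVX-512 disabled.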
AMD: Zen 4 (2022) and later

AVX-512 is a family of extensions. The MetaMUI implementation requires two specific subsets:

AVX-512F (Foundation) – CPUID leaf 7, EBX bit 16
AVX-512BW (Byte and Word) – CPUID leaf 7, EBX bit 30

Both must be present for the AVX-512 backend to activate.
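
As an illustration of the "both must be present" rule, the following sketch tests the two bits on an EBX value that has already been read from CPUID leaf 7, sub-leaf 0. The constant and function names are hypothetical; only the bit positions come from the list above.

```rust
/// CPUID leaf 7, sub-leaf 0, EBX bit positions (from the list above).
const AVX512F_BIT: u32 = 1 << 16;
const AVX512BW_BIT: u32 = 1 << 30;

/// Returns true only when both AVX-512F and AVX-512BW are reported.
/// `ebx` is assumed to be the EBX output of CPUID leaf 7, sub-leaf 0.
pub fn leaf7_has_avx512_f_and_bw(ebx: u32) -> bool {
    let required = AVX512F_BIT | AVX512BW_BIT;
    (ebx & required) == required
}
```

Comparing against the combined mask, rather than testing for a non-zero result, is what rejects a CPU that reports only one of the two subsets.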

Runtime Detection

C: Confirms that the OS has enabled XSAVE (OSXSAVE, CPUID leaf 1, ECX bit 27), then checks both AVX-512 feature bits in CPUID leaf 7:

// From metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c
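static int detect_avx512bw(void) {
    uint32_t eax, ebx, ecx, edx;
    // Leaf 1, ECX bit 27: OSXSAVE (the OS has enabled XSAVE/XGETBV).
    cpuid(1, &eax, &ebx, &ecx, &edx);
    if (!(ecx & (1u << 27)))
        return 0;
    // Leaf 7, sub-leaf 0, EBX: bit 16 = AVX-512F, bit 30 = AVX-512BW.
    cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
    return ((ebx & (1u << 16)) && (ebx & (1u << 30))) != 0;
}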
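
The OSXSAVE bit reports only that the OS has enabled XSAVE and XGETBV. Whether the OS actually saves the opmask and ZMM registers across context switches is reported separately in XCR0, which is read with XGETBV. A stricter check would typically also require XCR0 bits 1 and 2 (SSE and AVX state) and bits 5, 6, and 7 (opmask, ZMM_Hi256, Hi16_ZMM). The sketch below, written in Rust for consistency with the other examples, shows only the bit test on an XCR0 value obtained elsewhere. The names are hypothetical, and nothing here is claimed to be part of falcon_dispatch.c.

```rust
/// XCR0 state-component bits needed before AVX-512 registers are safe to use:
/// 1 = SSE (XMM), 2 = AVX (YMM), 5 = opmask (k0-k7), 6 = ZMM_Hi256, 7 = Hi16_ZMM.
const XCR0_AVX512_MASK: u64 = (1 << 1) | (1 << 2) | (1 << 5) | (1 << 6) | (1 << 7);

/// Returns true when every required state component is enabled by the OS.
/// `xcr0` is assumed to be the value returned by XGETBV with ECX = 0.
pub fn os_saves_avx512_state(xcr0: u64) -> bool {
    (xcr0 & XCR0_AVX512_MASK) == XCR0_AVX512_MASK
}
```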

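Rust: Returns true at compile time when the avx512bw target feature is enabled; otherwise probes both features at runtime with is_x86_feature_detected!: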

// From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
pub fn is_avx512_available() -> bool {
    #[cfg(target_feature = "avx512bw")]
    { true }
    #[cfg(not(target_feature = "avx512bw"))]
    {
        is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512bw")
    }
}

In Rust, the AVX-512 paths are gated behind the avx512 cargo feature flag (#[cfg(feature = "avx512")]).
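
One way the Cargo feature gate and the runtime check could be combined is sketched below. The function name `avx512_backend_enabled` and the exact cfg expression are assumptions for illustration, not the contents of dispatch.rs. Without the avx512 feature, or on a non-x86 target, the function compiles to a constant false, so callers need no cfg attributes of their own.

```rust
/// Reports whether the AVX-512 backend may be used.
/// Hypothetical wrapper: combines the `avx512` Cargo feature with runtime detection.
pub fn avx512_backend_enabled() -> bool {
    // Compiled only when the feature is on and the target is x86 or x86_64,
    // where the detection macro exists.
    #[cfg(all(feature = "avx512", any(target_arch = "x86", target_arch = "x86_64")))]
    {
        is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512bw")
    }
    // Every other configuration: the backend is unavailable.
    #[cfg(not(all(feature = "avx512", any(target_arch = "x86", target_arch = "x86_64"))))]
    {
        false
    }
}
```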

Algorithms Accelerated

Falcon-512/1024

32-way parallel butterfly operations for NTT, and 8-wide f64 processing for FFT:

NTT operations (32 x uint16_t per register):

Forward NTT (Cooley-Tukey butterfly)
Inverse NTT (Gentleman-Sande butterfly)
Pointwise polynomial multiplication
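
For reference, the two butterflies named above can be written in scalar form as follows. A 32-lane AVX-512 version would apply the same arithmetic to 32 coefficient pairs at once. This is a minimal sketch using Falcon's modulus q = 12289 and plain `%` reduction; the function names are hypothetical, and the actual implementation may well use a different reduction strategy (for example Montgomery or Barrett).

```rust
/// Falcon's NTT modulus.
const Q: u32 = 12289;

/// Cooley-Tukey butterfly (forward NTT): (a, b) -> (a + w*b, a - w*b) mod q.
/// Inputs are assumed to be already reduced into [0, q).
pub fn ct_butterfly(a: u32, b: u32, w: u32) -> (u32, u32) {
    let t = (w * b) % Q; // w*b < q^2 < 2^32, so no overflow
    ((a + t) % Q, (a + Q - t) % Q)
}

/// Gentleman-Sande butterfly (inverse NTT): (a, b) -> (a + b, (a - b)*w_inv) mod q.
pub fn gs_butterfly(a: u32, b: u32, w_inv: u32) -> (u32, u32) {
    ((a + b) % Q, (((a + Q - b) % Q) * w_inv) % Q)
}

/// Modular exponentiation by squaring; used to derive w_inv = w^(q-2) mod q.
pub fn mod_pow(mut base: u32, mut exp: u32) -> u32 {
    let mut acc = 1u32;
    base %= Q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = (acc * base) % Q;
        }
        base = (base * base) % Q;
        exp >>= 1;
    }
    acc
}
```

Applying `gs_butterfly` with the inverse twiddle to the output of `ct_butterfly` returns the original pair scaled by 2, which is why an inverse NTT finishes with a multiplication by the inverse of the transform length.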

FFT operations (8 x f64 per register):

poly_add_fft, poly_sub_fft, poly_mul_fft
poly_muladj_fft (multiply by conjugate)
poly_adj_fft, poly_inv_fft, poly_norm_fft
poly_split_fft, poly_merge_fft
poly_muladd_fft, poly_mulsub_fft (fused multiply-add/subtract)
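
To make the element-wise nature of these routines concrete, here is a scalar sketch of poly_muladj_fft. It assumes the layout used by the Falcon reference implementation, where the real parts occupy the first half of the array and the imaginary parts the second half; whether the MetaMUI code uses the same layout is an assumption. An AVX-512 version would handle eight indices per loop iteration, one per f64 lane.

```rust
/// Scalar sketch: a <- a * conj(b), element-wise in the FFT domain.
/// Assumed layout: a[..n/2] holds real parts, a[n/2..] holds imaginary parts.
pub fn poly_muladj_fft_scalar(a: &mut [f64], b: &[f64]) {
    assert_eq!(a.len(), b.len());
    assert!(a.len() % 2 == 0);
    let half = a.len() / 2;
    let (a_re, a_im) = a.split_at_mut(half);
    let (b_re, b_im) = b.split_at(half);
    for i in 0..half {
        let (ar, ai) = (a_re[i], a_im[i]);
        // Conjugating b negates its imaginary part.
        let (br, bi) = (b_re[i], -b_im[i]);
        a_re[i] = ar * br - ai * bi;
        a_im[i] = ar * bi + ai * br;
    }
}
```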

BLAKE3

16-block batch compression using 512-bit vectors (16 x uint32_t).

Dispatch Priority

AVX-512 sits just below Metal GPU in the dispatch hierarchy. On machines with AVX-512 support, it is always preferred over AVX-2:

Metal GPU  -->  AVX-512  -->  AVX-2  -->  NEON  -->  Portable

Every AVX-512 capable CPU also supports AVX-2, so the AVX-2 backend would work on these machines as well; the dispatcher still selects the wider AVX-512 implementation whenever it is available.
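
The priority chain above can be expressed as a simple first-match selection. The sketch below is illustrative only: the `Backend` and `Caps` types and the `select_backend` function are hypothetical names, and the capability flags are assumed to have been filled in by the platform-specific detection routines.

```rust
/// Backends in descending priority order, mirroring the chain above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    MetalGpu,
    Avx512,
    Avx2,
    Neon,
    Portable,
}

/// Capabilities detected at startup (hypothetical structure).
#[derive(Debug, Clone, Copy, Default)]
pub struct Caps {
    pub metal: bool,
    pub avx512: bool,
    pub avx2: bool,
    pub neon: bool,
}

/// Picks the highest-priority backend whose capability flag is set.
pub fn select_backend(caps: Caps) -> Backend {
    if caps.metal {
        Backend::MetalGpu
    } else if caps.avx512 {
        Backend::Avx512
    } else if caps.avx2 {
        Backend::Avx2
    } else if caps.neon {
        Backend::Neon
    } else {
        Backend::Portable
    }
}
```

Because the checks run in priority order, a CPU that reports both AVX-512 and AVX-2 always resolves to the AVX-512 backend.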

Implementation Files

C (Falcon):

metamui-crypto-c/metamui-falcon512/src/simd/falcon_ntt_avx512.c
metamui-crypto-c/metamui-falcon512/src/simd/falcon_fft_avx512.c

Rust (Falcon):

metamui-crypto-rust/metamui-falcon512/src/simd/avx512_ntt.rs
metamui-crypto-rust/metamui-falcon512/src/simd/avx512_fft.rs

Build Notes

In Rust, AVX-512 support requires enabling the avx512 feature at build time:

cargo build --features avx512

The C implementation uses compile-time guards (#if defined(__AVX512F__) && defined(__AVX512BW__)) and requires the appropriate compiler flags (e.g., -mavx512f -mavx512bw on GCC/Clang).