AVX-2 Acceleration

AVX-2 (Advanced Vector Extensions 2) operates on 256-bit SIMD registers: a single instruction processes 16 x uint16_t coefficients for NTT operations, or 4 x f64 values for FFT-domain polynomial arithmetic.
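To make the 16-lane figure concrete, the following is a minimal sketch of coefficient-wise addition modulo q = 12289 with 16 x uint16_t per register. It is an illustration only, not code from the library: the names add_mod_q, add_mod_q_scalar, and add_mod_q_avx2 are hypothetical, and the sketch assumes GCC or Clang (for the target attribute and __builtin_cpu_supports).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

#define SKETCH_Q 12289u /* Falcon modulus q */

/* Scalar reference: r[i] = (a[i] + b[i]) mod q, inputs in [0, q). */
static void add_mod_q_scalar(uint16_t *r, const uint16_t *a,
                             const uint16_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t s = (uint32_t)a[i] + b[i];
        r[i] = (uint16_t)(s >= SKETCH_Q ? s - SKETCH_Q : s);
    }
}

#if SKETCH_X86
/* Hypothetical AVX2 path: 16 coefficients per 256-bit register. */
__attribute__((target("avx2")))
static void add_mod_q_avx2(uint16_t *r, const uint16_t *a,
                           const uint16_t *b, size_t n) {
    const __m256i q = _mm256_set1_epi16((short)SKETCH_Q);
    const __m256i q_minus_1 = _mm256_set1_epi16((short)(SKETCH_Q - 1));
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        /* Sum is below 2q = 24578 < 2^15, so a signed compare is safe. */
        __m256i s = _mm256_add_epi16(va, vb);
        __m256i ge_q = _mm256_cmpgt_epi16(s, q_minus_1);
        /* Subtract q only in the lanes where s >= q. */
        s = _mm256_sub_epi16(s, _mm256_and_si256(ge_q, q));
        _mm256_storeu_si256((__m256i *)(r + i), s);
    }
    add_mod_q_scalar(r + i, a + i, b + i, n - i); /* leftover tail */
}
#endif

/* Picks the AVX2 path at runtime when the CPU supports it. */
static void add_mod_q(uint16_t *r, const uint16_t *a,
                      const uint16_t *b, size_t n) {
#if SKETCH_X86
    if (__builtin_cpu_supports("avx2")) {
        add_mod_q_avx2(r, a, b, n);
        return;
    }
#endif
    add_mod_q_scalar(r, a, b, n);
}
```

For Falcon-512 (n = 512) the vector loop in this sketch runs 32 times instead of 512 scalar iterations.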

Platform Availability

Runtime Detection

C: The dispatch code checks CPUID leaf 7, subleaf 0, EBX bit 5 for AVX2, after first verifying OSXSAVE (leaf 1, ECX bit 27) and AVX (leaf 1, ECX bit 28) support:

// From metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c
static int detect_avx2(void) {
    uint32_t eax, ebx, ecx, edx;
    cpuid(1, &eax, &ebx, &ecx, &edx);
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))
        return 0;
    cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
    return (ebx & (1u << 5)) != 0;
}
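
The excerpt above tests CPUID bits only. Intel's documented AVX detection sequence additionally reads XCR0 with XGETBV to confirm that the operating system has enabled XMM and YMM state saving. A minimal sketch of that extra step follows; it assumes GCC or Clang (for <cpuid.h> and inline assembly), and the names read_xcr0 and detect_avx2_os_checked are hypothetical, not part of falcon_dispatch.c.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helpers: __get_cpuid, __get_cpuid_count */

/* Reads extended control register 0. Only valid once OSXSAVE is confirmed. */
static uint64_t read_xcr0(void) {
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical variant of detect_avx2() that also checks OS support. */
static int detect_avx2_os_checked(void) {
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    /* Leaf 1, ECX bit 27 = OSXSAVE, bit 28 = AVX. */
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))
        return 0;
    /* XCR0 bit 1 = XMM state, bit 2 = YMM state; both must be enabled. */
    if ((read_xcr0() & 0x6u) != 0x6u)
        return 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    /* Leaf 7, subleaf 0, EBX bit 5 = AVX2. */
    return (ebx & (1u << 5)) != 0;
}
#else
/* Non-x86 targets never report AVX2. */
static int detect_avx2_os_checked(void) { return 0; }
#endif
```

Without the XCR0 check, a CPU can advertise AVX2 while the OS does not preserve the upper halves of the YMM registers across context switches.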

Rust: The dispatch code uses the is_x86_feature_detected!("avx2") macro, with a compile-time fast path that skips the runtime check when -C target-feature=+avx2 is set:

// From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn is_avx2_available() -> bool {
    #[cfg(target_feature = "avx2")]
    { true }
    #[cfg(not(target_feature = "avx2"))]
    { is_x86_feature_detected!("avx2") }
}

Algorithms Accelerated

Falcon-512/1024

AVX-2 accelerates both NTT-domain (integer mod q=12289) and FFT-domain (f64 complex) operations:

NTT operations (16 x uint16_t per register):
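
As an illustration of 16-lane arithmetic mod q, here is a sketch of pointwise Montgomery multiplication on uint16_t coefficients. It is not taken from the library: the names mont_mul_scalar, mont_mul_avx2, and poly_mont_mul are hypothetical, and the sketch assumes GCC or Clang. The constant 12287 is -1/q mod 2^16 for q = 12289, and each product comes out scaled by 2^-16 mod q, as usual for Montgomery form.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define MONT_X86 1
#else
#define MONT_X86 0
#endif

#define FQ   12289u /* Falcon modulus q */
#define FQ0I 12287u /* -1/q mod 2^16 */

/* Scalar reference: returns a*b/2^16 mod q for a, b in [0, q). */
static uint16_t mont_mul_scalar(uint16_t a, uint16_t b) {
    uint32_t t = (uint32_t)a * b;
    uint32_t m = (t * FQ0I) & 0xFFFFu;  /* makes t + m*q divisible by 2^16 */
    uint32_t u = (t + m * FQ) >> 16;    /* u < 2q */
    return (uint16_t)(u >= FQ ? u - FQ : u);
}

#if MONT_X86
/* Hypothetical AVX2 path: 16 Montgomery products per iteration. */
__attribute__((target("avx2")))
static void mont_mul_avx2(uint16_t *r, const uint16_t *a,
                          const uint16_t *b, size_t n) {
    const __m256i q = _mm256_set1_epi16((short)FQ);
    const __m256i q0i = _mm256_set1_epi16((short)FQ0I);
    const __m256i q_minus_1 = _mm256_set1_epi16((short)(FQ - 1));
    const __m256i one = _mm256_set1_epi16(1);
    const __m256i zero = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i t_lo = _mm256_mullo_epi16(va, vb);  /* low 16 bits of a*b */
        __m256i t_hi = _mm256_mulhi_epu16(va, vb);  /* high 16 bits of a*b */
        __m256i m = _mm256_mullo_epi16(t_lo, q0i);
        __m256i mq_hi = _mm256_mulhi_epu16(m, q);
        /* The low halves of t and m*q sum to 0 mod 2^16, so the carry
           into the high half is 1 unless t_lo is zero. */
        __m256i carry = _mm256_andnot_si256(_mm256_cmpeq_epi16(t_lo, zero), one);
        __m256i u = _mm256_add_epi16(_mm256_add_epi16(t_hi, mq_hi), carry);
        /* u < 2q < 2^15: conditional subtraction with a signed compare. */
        __m256i ge_q = _mm256_cmpgt_epi16(u, q_minus_1);
        u = _mm256_sub_epi16(u, _mm256_and_si256(ge_q, q));
        _mm256_storeu_si256((__m256i *)(r + i), u);
    }
    for (; i < n; i++)
        r[i] = mont_mul_scalar(a[i], b[i]); /* leftover tail */
}
#endif

/* Picks the AVX2 path at runtime when the CPU supports it. */
static void poly_mont_mul(uint16_t *r, const uint16_t *a,
                          const uint16_t *b, size_t n) {
#if MONT_X86
    if (__builtin_cpu_supports("avx2")) {
        mont_mul_avx2(r, a, b, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++)
        r[i] = mont_mul_scalar(a[i], b[i]);
}
```

The sketch keeps everything in 16-bit lanes by splitting each 32-bit product into low and high halves, which is what allows 16 coefficients per register rather than 8.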

FFT operations (4 x f64 per register):
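
As an illustration of the 4 x f64 lane width, here is a sketch of pointwise complex multiplication. It assumes a split layout in which real and imaginary parts are stored in separate arrays; that layout, and the names fft_cmul, fft_cmul_scalar, and fft_cmul_avx2, are assumptions made for this example rather than details taken from the library. The sketch assumes GCC or Clang. The f64 arithmetic instructions used here were introduced with AVX and are available on any AVX2-capable CPU.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define CMUL_X86 1
#else
#define CMUL_X86 0
#endif

/* Scalar reference: c = a * b for hn complex values in split layout. */
static void fft_cmul_scalar(double *c_re, double *c_im,
                            const double *a_re, const double *a_im,
                            const double *b_re, const double *b_im,
                            size_t hn) {
    for (size_t i = 0; i < hn; i++) {
        double re = a_re[i] * b_re[i] - a_im[i] * b_im[i];
        double im = a_re[i] * b_im[i] + a_im[i] * b_re[i];
        c_re[i] = re;
        c_im[i] = im;
    }
}

#if CMUL_X86
/* Hypothetical vector path: 4 complex products per iteration. */
__attribute__((target("avx2")))
static void fft_cmul_avx2(double *c_re, double *c_im,
                          const double *a_re, const double *a_im,
                          const double *b_re, const double *b_im,
                          size_t hn) {
    size_t i = 0;
    for (; i + 4 <= hn; i += 4) {
        __m256d ar = _mm256_loadu_pd(a_re + i);
        __m256d ai = _mm256_loadu_pd(a_im + i);
        __m256d br = _mm256_loadu_pd(b_re + i);
        __m256d bi = _mm256_loadu_pd(b_im + i);
        /* (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br) */
        __m256d re = _mm256_sub_pd(_mm256_mul_pd(ar, br), _mm256_mul_pd(ai, bi));
        __m256d im = _mm256_add_pd(_mm256_mul_pd(ar, bi), _mm256_mul_pd(ai, br));
        _mm256_storeu_pd(c_re + i, re);
        _mm256_storeu_pd(c_im + i, im);
    }
    /* Leftover tail handled by the scalar reference. */
    fft_cmul_scalar(c_re + i, c_im + i, a_re + i, a_im + i,
                    b_re + i, b_im + i, hn - i);
}
#endif

/* Picks the vector path at runtime when the CPU supports AVX2. */
static void fft_cmul(double *c_re, double *c_im,
                     const double *a_re, const double *a_im,
                     const double *b_re, const double *b_im, size_t hn) {
#if CMUL_X86
    if (__builtin_cpu_supports("avx2")) {
        fft_cmul_avx2(c_re, c_im, a_re, a_im, b_re, b_im, hn);
        return;
    }
#endif
    fft_cmul_scalar(c_re, c_im, a_re, a_im, b_re, b_im, hn);
}
```

With a split layout no shuffles are needed: each register holds four real parts or four imaginary parts, and the multiplication reduces to four vector multiplies, one subtract, and one add.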

BLAKE3

8-way parallel compression: each 256-bit vector holds one 32-bit state word from each of 8 independent blocks, so 8 blocks are compressed simultaneously.
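
A minimal sketch of the 8-way idea, applied to BLAKE3's G mixing function: each array holds the same state word for 8 independent inputs, and one vector instruction updates all 8 at once. The names g_scalar, g_avx2_x8, and g_x8 are hypothetical and this is not the library's code; the sketch assumes GCC or Clang.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define G8_X86 1
#else
#define G8_X86 0
#endif

static uint32_t rotr32(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

/* Scalar BLAKE3 G function applied to one state. */
static void g_scalar(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
                     uint32_t mx, uint32_t my) {
    *a = *a + *b + mx; *d = rotr32(*d ^ *a, 16);
    *c = *c + *d;      *b = rotr32(*b ^ *c, 12);
    *a = *a + *b + my; *d = rotr32(*d ^ *a, 8);
    *c = *c + *d;      *b = rotr32(*b ^ *c, 7);
}

#if G8_X86
/* Rotate each 32-bit lane right by n bits. */
#define ROTR8(v, n) \
    _mm256_or_si256(_mm256_srli_epi32((v), (n)), _mm256_slli_epi32((v), 32 - (n)))

/* Hypothetical 8-way G: lane i of every vector belongs to input i. */
__attribute__((target("avx2")))
static void g_avx2_x8(uint32_t a[8], uint32_t b[8], uint32_t c[8],
                      uint32_t d[8], const uint32_t mx[8],
                      const uint32_t my[8]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_loadu_si256((const __m256i *)c);
    __m256i vd = _mm256_loadu_si256((const __m256i *)d);
    __m256i vmx = _mm256_loadu_si256((const __m256i *)mx);
    __m256i vmy = _mm256_loadu_si256((const __m256i *)my);

    va = _mm256_add_epi32(_mm256_add_epi32(va, vb), vmx);
    vd = ROTR8(_mm256_xor_si256(vd, va), 16);
    vc = _mm256_add_epi32(vc, vd);
    vb = ROTR8(_mm256_xor_si256(vb, vc), 12);
    va = _mm256_add_epi32(_mm256_add_epi32(va, vb), vmy);
    vd = ROTR8(_mm256_xor_si256(vd, va), 8);
    vc = _mm256_add_epi32(vc, vd);
    vb = ROTR8(_mm256_xor_si256(vb, vc), 7);

    _mm256_storeu_si256((__m256i *)a, va);
    _mm256_storeu_si256((__m256i *)b, vb);
    _mm256_storeu_si256((__m256i *)c, vc);
    _mm256_storeu_si256((__m256i *)d, vd);
}
#endif

/* Applies G to 8 independent states, using AVX2 when available. */
static void g_x8(uint32_t a[8], uint32_t b[8], uint32_t c[8], uint32_t d[8],
                 const uint32_t mx[8], const uint32_t my[8]) {
#if G8_X86
    if (__builtin_cpu_supports("avx2")) {
        g_avx2_x8(a, b, c, d, mx, my);
        return;
    }
#endif
    for (int i = 0; i < 8; i++)
        g_scalar(&a[i], &b[i], &c[i], &d[i], mx[i], my[i]);
}
```

Because the 8 inputs never interact inside the compression function, this transposed layout needs no cross-lane shuffles during mixing.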

ML-KEM

AVX-2 accelerates NTT operations, polynomial arithmetic, and sampling in the ML-KEM (Kyber) implementation.

HAETAE

Extensive AVX-2 acceleration including hand-written x86 assembly for performance-critical paths:

C API

The C SIMD API dispatches automatically through falcon_simd.h:

#include <stdio.h>
#include "falcon_simd.h"

// Detect best available backend
falcon_simd_t level = falcon_detect_simd();
printf("SIMD level: %s\n", falcon_simd_name(level));

// NTT operations dispatch automatically
falcon_ntt_forward_simd(coefficients, logn);
falcon_ntt_inverse_simd(coefficients, logn);
falcon_poly_pointwise_simd(result, a, b, n);

// FFT domain operations
falcon_poly_mul_fft_simd(c, a, b, logn);
falcon_poly_muladj_fft_simd(c, a, b, logn);
falcon_poly_split_fft_simd(f0, f1, f, logn);
falcon_poly_merge_fft_simd(f, f0, f1, logn);
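
One common way to implement this kind of automatic dispatch is to resolve a function pointer on first use. The sketch below illustrates that pattern with a trivial element-wise add; it is an assumption about how such a layer could be structured, not the contents of falcon_simd.h. All demo_* names are hypothetical, and the sketch assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define DEMO_X86 1
#else
#define DEMO_X86 0
#endif

typedef enum { DEMO_SIMD_SCALAR = 0, DEMO_SIMD_AVX2 = 1 } demo_simd_t;

typedef void (*demo_add_fn)(uint16_t *r, const uint16_t *a,
                            const uint16_t *b, size_t n);

/* Portable backend: wrapping 16-bit addition. */
static void demo_add_scalar(uint16_t *r, const uint16_t *a,
                            const uint16_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        r[i] = (uint16_t)(a[i] + b[i]);
}

#if DEMO_X86
/* AVX2 backend: 16 lanes per iteration, scalar tail. */
__attribute__((target("avx2")))
static void demo_add_avx2(uint16_t *r, const uint16_t *a,
                          const uint16_t *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(r + i), _mm256_add_epi16(va, vb));
    }
    demo_add_scalar(r + i, a + i, b + i, n - i);
}
#endif

/* Reports the best backend this CPU can run. */
static demo_simd_t demo_detect_simd(void) {
#if DEMO_X86
    if (__builtin_cpu_supports("avx2"))
        return DEMO_SIMD_AVX2;
#endif
    return DEMO_SIMD_SCALAR;
}

static const char *demo_simd_name(demo_simd_t level) {
    return level == DEMO_SIMD_AVX2 ? "AVX2" : "scalar";
}

/* Resolved on first call. Not thread-safe as written; a real library
   would typically resolve this once during initialization. */
static demo_add_fn demo_add_impl = NULL;

/* Public entry point: callers never choose a backend themselves. */
static void demo_add_simd(uint16_t *r, const uint16_t *a,
                          const uint16_t *b, size_t n) {
    if (demo_add_impl == NULL) {
        demo_add_impl = demo_add_scalar;
#if DEMO_X86
        if (demo_detect_simd() == DEMO_SIMD_AVX2)
            demo_add_impl = demo_add_avx2;
#endif
    }
    demo_add_impl(r, a, b, n);
}
```

Resolving the pointer once keeps the feature check out of the hot path, so each later call costs only one indirect branch.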

Implementation Files

C (Falcon):

Rust (Falcon):

Rust (ML-KEM):

C (HAETAE):