AVX-2 Acceleration
AVX-2 (Advanced Vector Extensions 2) provides 256-bit SIMD registers, processing 16 x uint16_t coefficients per instruction for NTT operations, or 4 x f64 for FFT-domain polynomial arithmetic.
Platform Availability
- Intel: Haswell (2013) and later
- AMD: Excavator (2015) and later
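- Most x86-64 desktop and server processors released since then; some low-power and budget parts (for example older Atom, Celeron, and Pentium models) omit it, which is why runtime detection is still required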
Runtime Detection
C: The dispatch code checks CPUID leaf 7, subleaf 0, EBX bit 5 for AVX2, after first verifying OSXSAVE (leaf 1, ECX bit 27) and AVX (leaf 1, ECX bit 28) support:
// From metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c
static int detect_avx2(void) {
    uint32_t eax, ebx, ecx, edx;
    cpuid(1, &eax, &ebx, &ecx, &edx);
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))
        return 0;
    cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
    return (ebx & (1u << 5)) != 0;
}
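The CPUID bits checked above say that the CPU implements AVX/AVX2 and that the OS uses XSAVE. Intel's documented detection sequence additionally reads XCR0 with XGETBV to confirm that the OS has enabled XMM and YMM state (bits 1 and 2). The excerpt above does not show that step, and whether falcon_dispatch.c performs it elsewhere is not stated here. The following is a minimal sketch of the fuller check, assuming GCC or Clang on x86; `avx2_usable` and `detect_avx2_with_xcr0` are hypothetical names, not part of the library.

```c
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Pure decision logic (hypothetical helper), separated so it can be tested.
 * leaf1_ecx: CPUID.1:ECX, leaf7_ebx: CPUID.(7,0):EBX, xcr0: XGETBV(0). */
static int avx2_usable(uint32_t leaf1_ecx, uint32_t leaf7_ebx, uint64_t xcr0) {
    const uint32_t osxsave = 1u << 27;
    const uint32_t avx = 1u << 28;
    const uint32_t avx2 = 1u << 5;
    const uint64_t xmm_ymm_state = 0x6; /* XCR0 bit 1 (SSE) and bit 2 (AVX) */

    if ((leaf1_ecx & (osxsave | avx)) != (osxsave | avx))
        return 0;
    if ((xcr0 & xmm_ymm_state) != xmm_ymm_state)
        return 0; /* OS does not save/restore YMM registers */
    return (leaf7_ebx & avx2) != 0;
}

/* Hypothetical wrapper that gathers the register values (GCC/Clang, x86). */
static int detect_avx2_with_xcr0(void) {
#ifdef HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    unsigned int leaf1_ecx, leaf7_ebx;
    uint32_t lo, hi;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    leaf1_ecx = ecx;
    if (!(leaf1_ecx & (1u << 27)))
        return 0; /* XGETBV faults unless OSXSAVE is set */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    leaf7_ebx = ebx;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return avx2_usable(leaf1_ecx, leaf7_ebx, ((uint64_t)hi << 32) | lo);
#else
    return 0; /* not x86, or no <cpuid.h>: report AVX2 as unavailable */
#endif
}
```

Keeping the bit tests in a pure function makes the decision table easy to unit-test with synthetic register values, independent of the machine running the tests.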
Rust: Uses the is_x86_feature_detected!("avx2") macro, with a compile-time fast path when -C target-feature=+avx2 is set:
// From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn is_avx2_available() -> bool {
    #[cfg(target_feature = "avx2")]
    { true }
    #[cfg(not(target_feature = "avx2"))]
    { is_x86_feature_detected!("avx2") }
}
Algorithms Accelerated
Falcon-512/1024
AVX-2 accelerates both NTT-domain (integer mod q=12289) and FFT-domain (f64 complex) operations:
NTT operations (16 x uint16_t per register):
- Forward NTT (Cooley-Tukey butterfly)
- Inverse NTT (Gentleman-Sande butterfly)
- Pointwise polynomial multiplication
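For orientation, the two butterfly shapes named above can be written in scalar C as follows. This is a reference sketch, not the library's code: the function names are hypothetical, and the `%` operator stands in for the Montgomery or Barrett reduction that vectorized implementations typically use. An AVX-2 version would apply the same arithmetic to 16 coefficients per register.

```c
#include <stdint.h>

#define FALCON_Q 12289u

/* (a * b) mod q via a 32-bit intermediate; inputs must be < q. */
static uint16_t mod_mul(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * b) % FALCON_Q);
}

/* Cooley-Tukey butterfly (forward NTT): multiply, then add/subtract.
 * (x, y) -> (x + w*y, x - w*y) mod q */
static void ct_butterfly(uint16_t *x, uint16_t *y, uint16_t w) {
    uint32_t t = mod_mul(*y, w);
    uint32_t u = *x;
    *x = (uint16_t)((u + t) % FALCON_Q);
    *y = (uint16_t)((u + FALCON_Q - t) % FALCON_Q);
}

/* Gentleman-Sande butterfly (inverse NTT): add/subtract, then multiply.
 * (x, y) -> (x + y, (x - y) * w_inv) mod q */
static void gs_butterfly(uint16_t *x, uint16_t *y, uint16_t w_inv) {
    uint32_t u = *x, v = *y;
    *x = (uint16_t)((u + v) % FALCON_Q);
    *y = mod_mul((uint16_t)((u + FALCON_Q - v) % FALCON_Q), w_inv);
}
```

Applying `ct_butterfly` with a twiddle `w` and then `gs_butterfly` with its modular inverse returns `(2x, 2y)` mod q, which is why an inverse NTT finishes with a scaling by the inverse of n.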
FFT operations (4 x f64 per register):
- poly_add_fft, poly_sub_fft, poly_mul_fft
- poly_muladj_fft (multiply by conjugate)
- poly_adj_fft, poly_inv_fft, poly_norm_fft
- poly_split_fft, poly_merge_fft
- poly_muladd_fft, poly_mulsub_fft (fused multiply-add/subtract)
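As an illustration of what "multiply by conjugate" computes, here is a scalar sketch of the operation. It assumes the layout used by the Falcon reference code, where the real parts occupy the first half of the array and the imaginary parts the second half; that layout and the name `poly_muladj_fft_ref` are assumptions for this example, not confirmed details of this library. With 4 x f64 per register, a vectorized version would handle four values of `u` per iteration.

```c
#include <stddef.h>

/* c = a * conj(b), element-wise over n/2 complex values, n = 2^logn.
 * Assumed layout: real parts in [0, n/2), imaginary parts in [n/2, n). */
static void poly_muladj_fft_ref(double *c, const double *a, const double *b,
                                unsigned logn) {
    size_t n = (size_t)1 << logn;
    size_t hn = n >> 1;

    for (size_t u = 0; u < hn; u++) {
        double a_re = a[u], a_im = a[u + hn];
        double b_re = b[u], b_im = b[u + hn];

        /* (a_re + i*a_im) * (b_re - i*b_im) */
        c[u]      = a_re * b_re + a_im * b_im;
        c[u + hn] = a_im * b_re - a_re * b_im;
    }
}
```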
BLAKE3
8-way parallel block compression using 256-bit vectors, processing 8 blocks simultaneously.
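The 8-way figure follows from the register width: 256 bits hold eight 32-bit words, so each state word can be stored as one vector with one lane per block. The sketch below models that layout in plain C with a hypothetical `vec8` type and applies the BLAKE3 quarter-round (G) to all eight lanes; real AVX-2 code would replace the inner loop with intrinsics, and this is not taken from the library's source.

```c
#include <stdint.h>

#define LANES 8 /* 256 bits / 32 bits: one lane per block */

/* Hypothetical stand-in for a 256-bit register of 8 x uint32_t. */
typedef struct { uint32_t v[LANES]; } vec8;

static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n)); /* n must be in 1..31 */
}

/* BLAKE3 G function on state words a, b, c, d, applied to every lane.
 * s[i].v[l] is state word i of block l; mx/my are message words per lane. */
static void g8(vec8 s[16], int a, int b, int c, int d,
               const vec8 *mx, const vec8 *my) {
    for (int l = 0; l < LANES; l++) {
        uint32_t va = s[a].v[l], vb = s[b].v[l];
        uint32_t vc = s[c].v[l], vd = s[d].v[l];

        va = va + vb + mx->v[l];  vd = rotr32(vd ^ va, 16);
        vc = vc + vd;             vb = rotr32(vb ^ vc, 12);
        va = va + vb + my->v[l];  vd = rotr32(vd ^ va, 8);
        vc = vc + vd;             vb = rotr32(vb ^ vc, 7);

        s[a].v[l] = va; s[b].v[l] = vb;
        s[c].v[l] = vc; s[d].v[l] = vd;
    }
}
```

Because no lane reads another lane's data, the eight blocks proceed independently, which is what lets a single vector instruction advance all of them at once.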
ML-KEM
NTT operations, polynomial arithmetic, and sampling in the ML-KEM (Kyber) implementation.
HAETAE
Extensive AVX-2 acceleration including hand-written x86 assembly for performance-critical paths:
- packing_avx2.c – coefficient packing/unpacking
- polyfix_avx2.c – fixed-point polynomial operations
- polymat_avx2.c – polynomial matrix multiplication
- fft_avx2.c – FFT operations
- poly_avx2.c – core polynomial arithmetic
- ntt.S, invntt.S, pointwise.S, shuffle.S – hand-tuned x86 assembly
C API
The C SIMD API dispatches automatically through falcon_simd.h:
#include "falcon_simd.h"
// Detect best available backend
falcon_simd_t level = falcon_detect_simd();
printf("SIMD level: %s\n", falcon_simd_name(level));
// NTT operations dispatch automatically
falcon_ntt_forward_simd(coefficients, logn);
falcon_ntt_inverse_simd(coefficients, logn);
falcon_poly_pointwise_simd(result, a, b, n);
// FFT domain operations
falcon_poly_mul_fft_simd(c, a, b, logn);
falcon_poly_muladj_fft_simd(c, a, b, logn);
falcon_poly_split_fft_simd(f0, f1, f, logn);
falcon_poly_merge_fft_simd(f, f0, f1, logn);
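One common way to implement this kind of automatic dispatch is to resolve a function pointer once, based on the detected SIMD level, and route every later call through it. The sketch below shows that pattern under stated assumptions: every `demo_`-prefixed name, `pointwise_fn`, `pointwise_scalar`, `pointwise_avx2_stub`, and `select_pointwise` are hypothetical, the AVX-2 backend is a stub that calls the scalar code, and the internals of falcon_simd.h may differ.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { DEMO_SIMD_SCALAR = 0, DEMO_SIMD_AVX2 = 1 } demo_simd_t;

typedef void (*pointwise_fn)(uint16_t *r, const uint16_t *a,
                             const uint16_t *b, size_t n);

/* Portable fallback: coefficient-wise product mod q = 12289. */
static void pointwise_scalar(uint16_t *r, const uint16_t *a,
                             const uint16_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        r[i] = (uint16_t)(((uint32_t)a[i] * b[i]) % 12289u);
}

/* Stand-in for a vectorized backend; a real one would use AVX2 intrinsics. */
static void pointwise_avx2_stub(uint16_t *r, const uint16_t *a,
                                const uint16_t *b, size_t n) {
    pointwise_scalar(r, a, b, n);
}

static demo_simd_t demo_detect(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") ? DEMO_SIMD_AVX2 : DEMO_SIMD_SCALAR;
#else
    return DEMO_SIMD_SCALAR;
#endif
}

static pointwise_fn select_pointwise(demo_simd_t level) {
    return level == DEMO_SIMD_AVX2 ? pointwise_avx2_stub : pointwise_scalar;
}

/* Public entry point: resolves the backend on first use, then reuses it. */
static void demo_poly_pointwise(uint16_t *r, const uint16_t *a,
                                const uint16_t *b, size_t n) {
    static pointwise_fn impl = NULL;
    if (impl == NULL)
        impl = select_pointwise(demo_detect());
    impl(r, a, b, n);
}
```

The lazy initialization here is not synchronized; a production library would more likely resolve the pointer during an explicit init call or guard it with an atomic or once-primitive.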
Implementation Files
C (Falcon):
- metamui-crypto-c/metamui-falcon512/src/simd/falcon_ntt_avx2.c
- metamui-crypto-c/metamui-falcon512/src/simd/falcon_fft_avx2.c
Rust (Falcon):
- metamui-crypto-rust/metamui-falcon512/src/simd/avx2_ntt.rs
- metamui-crypto-rust/metamui-falcon512/src/simd/avx2_fft.rs
Rust (ML-KEM):
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/avx2/ntt.rs
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/avx2/poly_ops.rs
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/avx2/sampling.rs
C (HAETAE):
- metamui-crypto-c/metamui-haetae/src/simd/avx2/ (14 files, including assembly)