ARM NEON Acceleration

ARM NEON (Advanced SIMD) provides 128-bit SIMD registers. Each register holds 8 x uint16_t coefficients for NTT operations, or 2 x f64 values for FFT-domain polynomial arithmetic, so a single instruction processes that many elements at once.
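As an illustration of the 8-lane layout, the following is a minimal sketch of coefficient-wise addition modulo q = 12289 (the Falcon modulus), one small building block of NTT arithmetic. The function name poly_add_mod_q and the scalar fallback branch are assumptions made for this example and are not taken from the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#endif

#define FALCON_Q 12289u /* Falcon modulus */

/* Hypothetical helper: r[i] = (a[i] + b[i]) mod q.
 * Inputs must already lie in [0, q); n must be a multiple of 8. */
void poly_add_mod_q(uint16_t *r, const uint16_t *a, const uint16_t *b, size_t n)
{
    assert(n % 8 == 0);
#if defined(__aarch64__) || defined(_M_ARM64)
    const uint16x8_t q = vdupq_n_u16(FALCON_Q);
    for (size_t i = 0; i < n; i += 8) {
        /* 8 coefficients per register; each sum is < 2q, so it fits in 16 bits */
        uint16x8_t s = vaddq_u16(vld1q_u16(a + i), vld1q_u16(b + i));
        uint16x8_t ge = vcgeq_u16(s, q);     /* all-ones in lanes where s >= q */
        s = vsubq_u16(s, vandq_u16(ge, q));  /* subtract q only in those lanes */
        vst1q_u16(r + i, s);
    }
#else
    /* Scalar fallback so the sketch also builds on non-AArch64 hosts */
    for (size_t i = 0; i < n; i++) {
        uint16_t s = (uint16_t)(a[i] + b[i]);
        r[i] = (s >= FALCON_Q) ? (uint16_t)(s - FALCON_Q) : s;
    }
#endif
}
```

The compare-and-mask form avoids a per-lane branch, which is the usual reason to prefer it in SIMD code.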

Platform Availability

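NEON (Advanced SIMD) is treated as mandatory on all AArch64 (ARM64) processors: the standard AArch64 ABIs used by mainstream operating systems assume it. No runtime detection is needed on AArch64 because the instruction set is always present. This includes: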

Runtime Detection

Because NEON is part of the base AArch64 ISA, detection is a compile-time constant:

C:

// From metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c
#elif defined(__aarch64__) || defined(_M_ARM64)
falcon_simd_t falcon_detect_simd(void) {
    return FALCON_SIMD_NEON;
}

Rust:

// From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
#[cfg(target_arch = "aarch64")]
{
    // NEON is mandatory on AArch64
    return SimdType::Neon;
}

No CPUID probing, no feature flags. If the target is AArch64, NEON is available.
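For readers who want a self-contained version of the C excerpt above, here is a minimal sketch. Only falcon_simd_t, FALCON_SIMD_NEON, and falcon_detect_simd appear in the excerpt. The FALCON_SIMD_NONE value, the enum's numeric values, the fallback branch, and the falcon_simd_name helper are assumptions added for illustration.

```c
#include <assert.h>

typedef enum {
    FALCON_SIMD_NONE = 0, /* hypothetical scalar-fallback value */
    FALCON_SIMD_NEON = 1  /* name from the excerpt; numeric value assumed */
} falcon_simd_t;

#if defined(__aarch64__) || defined(_M_ARM64)
/* NEON is always present on this target, so the result is a constant. */
falcon_simd_t falcon_detect_simd(void)
{
    return FALCON_SIMD_NEON;
}
#else
/* Hypothetical branch for other targets; the real file may probe features here. */
falcon_simd_t falcon_detect_simd(void)
{
    return FALCON_SIMD_NONE;
}
#endif

/* Hypothetical helper, useful for logging the selected backend. */
const char *falcon_simd_name(falcon_simd_t t)
{
    switch (t) {
    case FALCON_SIMD_NEON: return "neon";
    case FALCON_SIMD_NONE: return "scalar";
    }
    assert(0 && "unknown falcon_simd_t value");
    return "unknown";
}
```

Because the AArch64 branch returns a constant, a compiler can typically fold the dispatch away entirely when the call is inlined.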

Algorithms Accelerated

Falcon-512/1024

NEON accelerates both NTT-domain and FFT-domain polynomial operations:

NTT operations (8 x uint16_t per register):

FFT operations (2 x f64 per register):
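To make the 2 x f64 layout concrete, the sketch below multiplies two polynomials pointwise in the FFT domain, two complex values per iteration. It assumes a split layout with real and imaginary parts in separate arrays. That layout, the function name fft_pointwise_mul, and the scalar fallback are assumptions for this example and may not match the library's internal representation.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#endif

/* Hypothetical helper: c = a * b for n complex values in split layout.
 * n must be even so that every register holds two full lanes. */
void fft_pointwise_mul(double *c_re, double *c_im,
                       const double *a_re, const double *a_im,
                       const double *b_re, const double *b_im, size_t n)
{
    assert(n % 2 == 0);
#if defined(__aarch64__) || defined(_M_ARM64)
    for (size_t i = 0; i < n; i += 2) {
        float64x2_t ar = vld1q_f64(a_re + i), ai = vld1q_f64(a_im + i);
        float64x2_t br = vld1q_f64(b_re + i), bi = vld1q_f64(b_im + i);
        /* (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br) */
        float64x2_t re = vsubq_f64(vmulq_f64(ar, br), vmulq_f64(ai, bi));
        float64x2_t im = vaddq_f64(vmulq_f64(ar, bi), vmulq_f64(ai, br));
        vst1q_f64(c_re + i, re);
        vst1q_f64(c_im + i, im);
    }
#else
    /* Scalar fallback so the sketch also builds on non-AArch64 hosts */
    for (size_t i = 0; i < n; i++) {
        double re = a_re[i] * b_re[i] - a_im[i] * b_im[i];
        double im = a_re[i] * b_im[i] + a_im[i] * b_re[i];
        c_re[i] = re;
        c_im[i] = im;
    }
#endif
}
```

The sketch uses separate multiply and add/subtract steps for clarity. A tuned kernel would likely use fused multiply-add intrinsics, which round differently.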

BLAKE3

4-way parallel block compression using 128-bit NEON vectors (4 x uint32_t). Implementations exist across multiple languages:
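As a rough sketch of what 4-way parallelism means here, the code below applies the BLAKE3 G mixing function to four independent blocks at once, with lane k of every vector belonging to block k. The function name blake3_g_lanes, the array-based interface, and the scalar fallback are assumptions for illustration; the actual implementations are organized differently and cover the full compression function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
/* Rotate each 32-bit lane right by the constant n (0 < n < 32). */
#define ROTR4(x, n) vorrq_u32(vshrq_n_u32((x), (n)), vshlq_n_u32((x), 32 - (n)))
#else
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}
#endif

/* Hypothetical helper: a[i], b[i], c[i], d[i] are the four state words of
 * block i, and mx[i], my[i] its two message words. n must be a multiple of 4. */
void blake3_g_lanes(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
                    const uint32_t *mx, const uint32_t *my, size_t n)
{
    assert(n % 4 == 0);
#if defined(__aarch64__) || defined(_M_ARM64)
    for (size_t i = 0; i < n; i += 4) {
        uint32x4_t va = vld1q_u32(a + i), vb = vld1q_u32(b + i);
        uint32x4_t vc = vld1q_u32(c + i), vd = vld1q_u32(d + i);
        va = vaddq_u32(vaddq_u32(va, vb), vld1q_u32(mx + i));
        vd = veorq_u32(vd, va); vd = ROTR4(vd, 16);
        vc = vaddq_u32(vc, vd);
        vb = veorq_u32(vb, vc); vb = ROTR4(vb, 12);
        va = vaddq_u32(vaddq_u32(va, vb), vld1q_u32(my + i));
        vd = veorq_u32(vd, va); vd = ROTR4(vd, 8);
        vc = vaddq_u32(vc, vd);
        vb = veorq_u32(vb, vc); vb = ROTR4(vb, 7);
        vst1q_u32(a + i, va); vst1q_u32(b + i, vb);
        vst1q_u32(c + i, vc); vst1q_u32(d + i, vd);
    }
#else
    /* Scalar fallback: the same G function, one block at a time */
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i] + mx[i]; d[i] = rotr32(d[i] ^ a[i], 16);
        c[i] = c[i] + d[i];         b[i] = rotr32(b[i] ^ c[i], 12);
        a[i] = a[i] + b[i] + my[i]; d[i] = rotr32(d[i] ^ a[i], 8);
        c[i] = c[i] + d[i];         b[i] = rotr32(b[i] ^ c[i], 7);
    }
#endif
}
```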

ML-KEM

NTT operations, polynomial arithmetic, and sampling:
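One piece of ML-KEM polynomial arithmetic that maps naturally onto NEON is coefficient reduction modulo q = 3329. The sketch below applies a Barrett reduction to 8 x int16_t coefficients per register, using the constant 20159 (approximately 2^26 / q) familiar from reference-style implementations. Whether the library reduces coefficients this way is an assumption, and the function name poly_barrett_reduce is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#endif

#define MLKEM_Q   3329
#define BARRETT_V 20159 /* floor((2^26 + q/2) / q) */

/* Hypothetical helper: reduces each coefficient in place to a centered
 * representative congruent mod q. n must be a multiple of 8. */
void poly_barrett_reduce(int16_t *r, size_t n)
{
    assert(n % 8 == 0);
#if defined(__aarch64__) || defined(_M_ARM64)
    for (size_t i = 0; i < n; i += 8) {
        int16x8_t a = vld1q_s16(r + i);
        int16x8_t t = vqdmulhq_n_s16(a, BARRETT_V); /* floor(a*v / 2^15) */
        t = vrshrq_n_s16(t, 11);                    /* rounding shift, 2^26 in total */
        a = vmlsq_n_s16(a, t, MLKEM_Q);             /* a - t*q */
        vst1q_s16(r + i, a);
    }
#else
    /* Scalar fallback; assumes >> on negative values is an arithmetic shift */
    for (size_t i = 0; i < n; i++) {
        int16_t t = (int16_t)(((int32_t)BARRETT_V * r[i] + (1 << 25)) >> 26);
        r[i] = (int16_t)(r[i] - t * MLKEM_Q);
    }
#endif
}
```

The doubling-multiply-high followed by a rounding shift computes the same quotient estimate as the 32-bit scalar expression, while keeping every intermediate value in 16-bit lanes.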

Compile Flags

For C code targeting NEON, the standard compile flag is:

-march=armv8-a+simd

GCC and Clang enable the simd extension by default for armv8-a targets, so on most AArch64 toolchains NEON intrinsics (declared in arm_neon.h) are available without extra flags. The guard in C dispatch code is:

#if defined(__aarch64__) || defined(_M_ARM64)

Implementation Files

C (Falcon):

Rust (Falcon):

Rust (ML-KEM):