ARM NEON Acceleration
ARM NEON provides 128-bit SIMD registers. Each register holds 8 x uint16_t coefficients for NTT operations or 2 x f64 values for FFT-domain polynomial arithmetic, so a single instruction operates on all lanes at once.
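To make the lane counts concrete, here is a minimal sketch of an 8-lane uint16_t addition. The helper name add_u16x8 is hypothetical and not taken from the MetaMUI sources; the scalar branch exists only so the example also builds on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#endif

/* Hypothetical helper: r[i] = a[i] + b[i], wrapping modulo 2^16.
 * n must be a multiple of 8 so every iteration fills one 128-bit register. */
static void add_u16x8(uint16_t *r, const uint16_t *a, const uint16_t *b, size_t n) {
    assert(n % 8 == 0);
#if defined(__aarch64__) || defined(_M_ARM64)
    for (size_t i = 0; i < n; i += 8) {
        uint16x8_t va = vld1q_u16(a + i);     /* load 8 coefficients */
        uint16x8_t vb = vld1q_u16(b + i);
        vst1q_u16(r + i, vaddq_u16(va, vb));  /* one add covers 8 lanes */
    }
#else
    /* Portable fallback with the same result, one lane at a time. */
    for (size_t i = 0; i < n; i++) {
        r[i] = (uint16_t)(a[i] + b[i]);
    }
#endif
}
```

A real NTT kernel would follow the addition with a modular reduction; that step is omitted here to keep the lane structure visible.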
Platform Availability
NEON is mandatory on all AArch64 (ARM64) processors. No runtime detection is needed on AArch64 – the instruction set is always present. This includes:
- Apple Silicon: M1, M2, M3, M4 families
- Qualcomm: Snapdragon mobile and server SoCs
- AWS: Graviton 2, 3, and 4 instances
- Ampere: Altra and Altra Max server processors
Runtime Detection
Because NEON is part of the base AArch64 ISA, detection is a compile-time constant:
C:
// From metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c
// (excerpt: one branch of a larger #if/#elif chain)
#elif defined(__aarch64__) || defined(_M_ARM64)
falcon_simd_t falcon_detect_simd(void) {
    return FALCON_SIMD_NEON;
}
Rust:
// From metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
#[cfg(target_arch = "aarch64")]
{
    // NEON is mandatory on AArch64
    return SimdType::Neon;
}
Unlike x86 dispatch, there is no CPUID-style probing and no feature flags to check. If the target is AArch64, NEON is available.
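For readers who want to try the pattern in isolation, the following is a self-contained sketch of compile-time dispatch. Only falcon_simd_t, falcon_detect_simd, and FALCON_SIMD_NEON appear in the excerpt above; the FALCON_SIMD_NONE value, the numeric enum values, and the falcon_simd_name helper are assumptions made for illustration.

```c
#include <assert.h>

/* FALCON_SIMD_NONE and the numeric values are hypothetical placeholders. */
typedef enum {
    FALCON_SIMD_NONE = 0,
    FALCON_SIMD_NEON = 1
} falcon_simd_t;

/* The preprocessor resolves the branch, so the function body
 * reduces to a single constant return on every target. */
static falcon_simd_t falcon_detect_simd(void) {
#if defined(__aarch64__) || defined(_M_ARM64)
    return FALCON_SIMD_NEON;
#else
    return FALCON_SIMD_NONE;
#endif
}

/* Hypothetical helper for logging which path was selected. */
static const char *falcon_simd_name(falcon_simd_t t) {
    switch (t) {
    case FALCON_SIMD_NEON: return "neon";
    case FALCON_SIMD_NONE: return "scalar";
    }
    assert(0 && "unknown falcon_simd_t value");
    return "unknown";
}
```

Because the result is a constant, a compiler can typically inline the call and drop the unused code paths entirely.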
Algorithms Accelerated
Falcon-512/1024
NEON accelerates both NTT-domain and FFT-domain polynomial operations:
NTT operations (8 x uint16_t per register):
- Forward NTT (Cooley-Tukey butterfly)
- Inverse NTT (Gentleman-Sande butterfly)
- Pointwise polynomial multiplication
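The two butterflies named above can be sketched in scalar C as follows. This is not the NEON code from the repository: it processes one coefficient pair per call rather than 8 lanes, uses a plain % reduction, and every function name is hypothetical. The modulus 12289 is the Falcon modulus.

```c
#include <assert.h>
#include <stdint.h>

#define Q 12289u /* Falcon modulus */

static uint32_t mod_add(uint32_t a, uint32_t b) {
    uint32_t s = a + b;
    return s >= Q ? s - Q : s;
}

static uint32_t mod_sub(uint32_t a, uint32_t b) {
    return a >= b ? a - b : a + Q - b;
}

/* Inputs are below Q, so the product fits comfortably in 32 bits. */
static uint32_t mod_mul(uint32_t a, uint32_t b) {
    return (a * b) % Q;
}

/* Cooley-Tukey: (a, b) -> (a + w*b, a - w*b), multiply before add/sub. */
static void ct_butterfly(uint16_t *a, uint16_t *b, uint16_t w) {
    assert(*a < Q && *b < Q && w < Q);
    uint32_t u = *a;
    uint32_t t = mod_mul(*b, w);
    *a = (uint16_t)mod_add(u, t);
    *b = (uint16_t)mod_sub(u, t);
}

/* Gentleman-Sande: (a, b) -> (a + b, (a - b) * w_inv), multiply after. */
static void gs_butterfly(uint16_t *a, uint16_t *b, uint16_t w_inv) {
    assert(*a < Q && *b < Q && w_inv < Q);
    uint32_t u = *a;
    uint32_t v = *b;
    *a = (uint16_t)mod_add(u, v);
    *b = (uint16_t)mod_mul(mod_sub(u, v), w_inv);
}
```

Applying ct_butterfly with a twiddle w and then gs_butterfly with its inverse returns (2a, 2b) mod Q, which is why an inverse NTT finishes with a scaling step.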
FFT operations (2 x f64 per register):
- poly_add_fft, poly_sub_fft, poly_mul_fft
- poly_muladj_fft (multiply by conjugate)
- poly_adj_fft, poly_inv_fft, poly_norm_fft
- poly_split_fft, poly_merge_fft
- poly_muladd_fft, poly_mulsub_fft (fused multiply-add/subtract)
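As an illustration of the 2 x f64 lane width, here is a sketch of a pointwise complex multiplication in the style of poly_mul_fft. The function name sketch_poly_mul_fft and its signature are hypothetical, and the memory layout (real parts in the first half of the array, imaginary parts in the second half) is an assumption, not something confirmed by the repository.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#endif

/* Hypothetical sketch: a *= b pointwise over n/2 complex values.
 * Assumed layout: x[u] is the real part, x[u + n/2] the imaginary part.
 * n must be a multiple of 4 so that n/2 splits into pairs of doubles. */
static void sketch_poly_mul_fft(double *a, const double *b, size_t n) {
    size_t hn = n / 2;
    assert(n % 4 == 0);
#if defined(__aarch64__) || defined(_M_ARM64)
    for (size_t u = 0; u < hn; u += 2) {
        float64x2_t ar = vld1q_f64(a + u);
        float64x2_t ai = vld1q_f64(a + u + hn);
        float64x2_t br = vld1q_f64(b + u);
        float64x2_t bi = vld1q_f64(b + u + hn);
        /* (ar + i*ai) * (br + i*bi), two complex values per iteration */
        vst1q_f64(a + u, vsubq_f64(vmulq_f64(ar, br), vmulq_f64(ai, bi)));
        vst1q_f64(a + u + hn, vaddq_f64(vmulq_f64(ar, bi), vmulq_f64(ai, br)));
    }
#else
    /* Portable fallback computing the same products one value at a time. */
    for (size_t u = 0; u < hn; u++) {
        double ar = a[u], ai = a[u + hn];
        double br = b[u], bi = b[u + hn];
        a[u] = ar * br - ai * bi;
        a[u + hn] = ar * bi + ai * br;
    }
#endif
}
```

A tuned kernel would likely use fused multiply-add intrinsics instead of separate multiplies; they are kept apart here so both branches round identically.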
BLAKE3
4-way parallel block compression using 128-bit NEON vectors (4 x uint32_t). Implementations exist across multiple languages:
- C: blake3_neon.c
- Rust: blake3_simd.rs
- C#: Blake3ArmNeon.cs (System.Runtime.Intrinsics.Arm)
- Swift: Blake3Accelerate.swift (Accelerate framework with SIMD types)
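The 4-way layout can be sketched with the BLAKE3 G mixing function, where lane i of each vector belongs to independent block i. This is an illustrative sketch, not the contents of blake3_neon.c; g_scalar, g_x4, and g_x4_selfcheck are hypothetical names, and on non-ARM hosts g_x4 simply falls back to the scalar function.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) || defined(_M_ARM64)
#include <arm_neon.h>
#endif

static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* BLAKE3 G function on a single state (rotations 16, 12, 8, 7). */
static void g_scalar(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
                     uint32_t mx, uint32_t my) {
    *a = *a + *b + mx; *d = rotr32(*d ^ *a, 16);
    *c = *c + *d;      *b = rotr32(*b ^ *c, 12);
    *a = *a + *b + my; *d = rotr32(*d ^ *a, 8);
    *c = *c + *d;      *b = rotr32(*b ^ *c, 7);
}

/* Same G function applied to four independent states at once. */
static void g_x4(uint32_t a[4], uint32_t b[4], uint32_t c[4], uint32_t d[4],
                 const uint32_t mx[4], const uint32_t my[4]) {
#if defined(__aarch64__) || defined(_M_ARM64)
/* Rotate right by a constant: shift both ways and OR the halves. */
#define ROTR_X4(v, n) vorrq_u32(vshrq_n_u32((v), (n)), vshlq_n_u32((v), 32 - (n)))
    uint32x4_t va = vld1q_u32(a), vb = vld1q_u32(b);
    uint32x4_t vc = vld1q_u32(c), vd = vld1q_u32(d);
    uint32x4_t vmx = vld1q_u32(mx), vmy = vld1q_u32(my);
    va = vaddq_u32(vaddq_u32(va, vb), vmx); vd = ROTR_X4(veorq_u32(vd, va), 16);
    vc = vaddq_u32(vc, vd);                 vb = ROTR_X4(veorq_u32(vb, vc), 12);
    va = vaddq_u32(vaddq_u32(va, vb), vmy); vd = ROTR_X4(veorq_u32(vd, va), 8);
    vc = vaddq_u32(vc, vd);                 vb = ROTR_X4(veorq_u32(vb, vc), 7);
    vst1q_u32(a, va); vst1q_u32(b, vb);
    vst1q_u32(c, vc); vst1q_u32(d, vd);
#undef ROTR_X4
#else
    for (int i = 0; i < 4; i++) {
        g_scalar(&a[i], &b[i], &c[i], &d[i], mx[i], my[i]);
    }
#endif
}

/* Checks that every lane of g_x4 matches g_scalar on arbitrary inputs. */
static void g_x4_selfcheck(void) {
    uint32_t a[4] = {0x6A09E667u, 1u, 0xFFFFFFFFu, 0u};
    uint32_t b[4] = {0xBB67AE85u, 2u, 0x80000000u, 0u};
    uint32_t c[4] = {0x3C6EF372u, 3u, 0x12345678u, 0u};
    uint32_t d[4] = {0xA54FF53Au, 4u, 0x9ABCDEF0u, 0u};
    const uint32_t mx[4] = {0x01234567u, 5u, 0xDEADBEEFu, 1u};
    const uint32_t my[4] = {0x89ABCDEFu, 6u, 0xCAFEBABEu, 0u};
    uint32_t sa[4], sb[4], sc[4], sd[4];
    for (int i = 0; i < 4; i++) {
        sa[i] = a[i]; sb[i] = b[i]; sc[i] = c[i]; sd[i] = d[i];
        g_scalar(&sa[i], &sb[i], &sc[i], &sd[i], mx[i], my[i]);
    }
    g_x4(a, b, c, d, mx, my);
    for (int i = 0; i < 4; i++) {
        assert(a[i] == sa[i] && b[i] == sb[i] && c[i] == sc[i] && d[i] == sd[i]);
    }
}
```

Calling g_x4_selfcheck() on an AArch64 build compares the NEON path against the scalar one lane by lane; elsewhere it only exercises the fallback.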
ML-KEM
NTT operations, polynomial arithmetic, and sampling:
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/neon/ntt.rs
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/neon/poly_ops.rs
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/neon/sampling.rs
Compile Flags
For C code targeting NEON, the standard compile flag is:
-march=armv8-a+simd
On most AArch64 toolchains, NEON intrinsics are available by default without extra flags. The guard in C dispatch code is:
#if defined(__aarch64__) || defined(_M_ARM64)
Implementation Files
C (Falcon):
- metamui-crypto-c/metamui-falcon512/src/simd/falcon_ntt_neon.c
- metamui-crypto-c/metamui-falcon512/src/simd/falcon_fft_neon.c
Rust (Falcon):
- metamui-crypto-rust/metamui-falcon512/src/simd/neon_ntt.rs
- metamui-crypto-rust/metamui-falcon512/src/simd/neon_fft.rs
Rust (ML-KEM):
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/neon/ntt.rs
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/neon/poly_ops.rs
- metamui-crypto-rust/metamui-mlkem/src/optimized/simd/neon/sampling.rs