Runtime Dispatch Architecture
The MetaMUI crypto library uses a zero-config runtime dispatch system. No build flags or environment variables are needed – the library detects CPU and GPU capabilities at startup, caches the result, and routes every performance-critical operation through the fastest available backend.
Dispatch Hierarchy
The dispatcher evaluates backends from fastest to slowest and selects the first one available:
1. Apple Metal GPU (macOS + Apple Silicon + batch_size >= 100)
2. AVX-512BW (x86-64 with AVX-512F + AVX-512BW)
3. AVX-2 (x86-64 with AVX2)
4. ARM NEON (all AArch64 processors)
5. Portable (scalar fallback -- always available)
Metal GPU is only considered for batch operations. Single-polynomial operations always use CPU SIMD regardless of GPU availability.
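The priority order above can be sketched as a pure function over a capability snapshot. This is a minimal illustration, not the library's code: the `Backend` enum, the `Caps` struct, its field names, and `select_backend` are hypothetical names introduced here; only the ordering and the batch threshold of 100 come from the list above.

```rust
/// Backends in the priority order listed above (hypothetical mirror type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Metal,
    Avx512,
    Avx2,
    Neon,
    Portable,
}

/// Hypothetical capability snapshot; field names are illustrative only.
#[derive(Debug, Clone, Copy, Default)]
pub struct Caps {
    pub metal: bool,
    pub avx512f: bool,
    pub avx512bw: bool,
    pub avx2: bool,
    pub neon: bool,
}

/// Batch size at which the Metal path becomes eligible (from the list above).
pub const METAL_MIN_BATCH: usize = 100;

/// Returns the first available backend, walking from fastest to slowest.
pub fn select_backend(caps: Caps, batch_size: usize) -> Backend {
    // Metal is only considered for sufficiently large batches.
    if caps.metal && batch_size >= METAL_MIN_BATCH {
        return Backend::Metal;
    }
    // The AVX-512 tier requires both AVX-512F and AVX-512BW.
    if caps.avx512f && caps.avx512bw {
        return Backend::Avx512;
    }
    if caps.avx2 {
        return Backend::Avx2;
    }
    if caps.neon {
        return Backend::Neon;
    }
    // The scalar path is always available.
    Backend::Portable
}
```

With this shape, a single-polynomial call passes `batch_size = 1` and therefore never reaches the Metal branch, which matches the rule stated above.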
Rust Dispatch
The Rust implementation (metamui-crypto-rust/metamui-falcon512/src/dispatch.rs) defines a SimdType enum:
pub enum SimdType {
    Metal,    // Apple Metal GPU batch operations
    Avx512,   // 32 x u16 per register
    Avx2,     // 16 x u16 per register
    Neon,     // 8 x u16 per register
    Portable, // scalar fallback
}
Detection uses is_x86_feature_detected! macros on x86 and compile-time constants on AArch64:
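pub fn detect_best() -> SimdType {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_avx512_available() { return SimdType::Avx512; }
        if is_avx2_available() { return SimdType::Avx2; }
        return SimdType::Portable;
    }
    #[cfg(target_arch = "aarch64")]
    {
        return SimdType::Neon; // NEON is mandatory on AArch64
    }
    // Any other architecture falls back to the scalar path.
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
    {
        SimdType::Portable
    }
}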
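The helper predicates called above are not shown in this document. A minimal sketch of how they could be written with the standard `is_x86_feature_detected!` macro follows; the bodies are assumptions, and only the function names and the AVX-512F + AVX-512BW requirement come from the text.

```rust
/// Assumed body: the AVX-512 tier needs both AVX-512F and AVX-512BW.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn is_avx512_available() -> bool {
    is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512bw")
}

/// Assumed body: a single runtime check for AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn is_avx2_available() -> bool {
    is_x86_feature_detected!("avx2")
}

/// On non-x86 targets the x86 tiers are never available.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn is_avx512_available() -> bool {
    false
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn is_avx2_available() -> bool {
    false
}
```

The standard library caches the result of the underlying feature probe, so calling these predicates repeatedly is cheap.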
For batch operations, Metal is checked first:
pub fn detect_best_for_batch(batch_size: usize) -> SimdType {
    if batch_size >= 100 && is_metal_available() {
        return SimdType::Metal;
    }
    detect_best()
}
Metal availability is probed via the Metal framework device API, gated behind #[cfg(all(feature = "metal", target_os = "macos"))].
Each FFT dispatch function uses #[cfg(...)] blocks to select the implementation at compile time for the target architecture, with runtime CPUID checks for x86 feature levels:
pub fn poly_mul_fft(c: &mut [f64], a: &[f64], b: &[f64], logn: usize) {
    #[cfg(target_arch = "aarch64")]
    { /* NEON path */ return; }
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_avx512_available() { /* AVX-512 path */ return; }
        if is_avx2_available() { /* AVX-2 path */ return; }
    }
    /* Portable fallback */
}
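To make the fallback branch concrete, here is a sketch of what a portable pointwise multiply in the FFT domain might look like. The function name `poly_mul_fft_portable` is hypothetical, and the data layout is an assumption (first half of each slice holds real parts, second half the matching imaginary parts); the library's actual layout may differ.

```rust
/// Portable pointwise complex multiply in the FFT domain: c[i] = a[i] * b[i].
///
/// ASSUMPTION: each slice holds n = 2^logn f64 values, where the first n/2
/// entries are real parts and the last n/2 are the matching imaginary parts.
pub fn poly_mul_fft_portable(c: &mut [f64], a: &[f64], b: &[f64], logn: usize) {
    assert!(logn >= 1, "need at least one complex coefficient");
    let n = 1usize << logn;
    let hn = n / 2;
    assert!(c.len() >= n && a.len() >= n && b.len() >= n);

    for u in 0..hn {
        let (a_re, a_im) = (a[u], a[u + hn]);
        let (b_re, b_im) = (b[u], b[u + hn]);
        // (a_re + i*a_im) * (b_re + i*b_im)
        c[u] = a_re * b_re - a_im * b_im;
        c[u + hn] = a_re * b_im + a_im * b_re;
    }
}
```

A SIMD backend would compute the same products several lanes at a time, so a scalar version like this can double as a reference when testing the vectorized paths.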
C Dispatch
The C implementation (metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c) uses CPUID for x86 detection:
- AVX2: CPUID leaf 1 checks OSXSAVE (ECX bit 27) + AVX (ECX bit 28), then leaf 7 checks EBX bit 5
- AVX-512: Same OSXSAVE check, then leaf 7 checks EBX bit 16 (AVX-512F) + bit 30 (AVX-512BW)
- NEON: AArch64 is detected at compile time (__aarch64__ or _M_ARM64), always returns FALCON_SIMD_NEON
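The bit tests listed above can be expressed as a pure function over the raw register values, which makes them easy to unit-test without the CPUID instruction. This is a sketch only: `X86Level`, `classify`, and the constant names are hypothetical, and the bit positions are the ones stated in the list. A production detector would typically also read XCR0 with XGETBV to confirm that the OS saves the wider register state; that step is not described above and is omitted here.

```rust
/// Hypothetical result type for the x86 tiers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum X86Level {
    Avx512,
    Avx2,
    Portable,
}

// CPUID leaf 1, ECX.
const LEAF1_ECX_OSXSAVE: u32 = 1 << 27;
const LEAF1_ECX_AVX: u32 = 1 << 28;
// CPUID leaf 7 (sub-leaf 0), EBX.
const LEAF7_EBX_AVX2: u32 = 1 << 5;
const LEAF7_EBX_AVX512F: u32 = 1 << 16;
const LEAF7_EBX_AVX512BW: u32 = 1 << 30;

/// Classifies the x86 SIMD tier from raw CPUID register values.
pub fn classify(leaf1_ecx: u32, leaf7_ebx: u32) -> X86Level {
    // True only when every bit in `mask` is set in `reg`.
    let has = |reg: u32, mask: u32| (reg & mask) == mask;

    // Without OSXSAVE, neither wide-vector tier is usable.
    if !has(leaf1_ecx, LEAF1_ECX_OSXSAVE) {
        return X86Level::Portable;
    }
    if has(leaf7_ebx, LEAF7_EBX_AVX512F | LEAF7_EBX_AVX512BW) {
        return X86Level::Avx512;
    }
    if has(leaf1_ecx, LEAF1_ECX_AVX) && has(leaf7_ebx, LEAF7_EBX_AVX2) {
        return X86Level::Avx2;
    }
    X86Level::Portable
}
```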
The detected level is cached in a static variable:
static falcon_simd_t cached_level = (falcon_simd_t)-1;

static falcon_simd_t get_simd_level(void) {
    if (cached_level == (falcon_simd_t)-1) {
        cached_level = falcon_detect_simd();
    }
    return cached_level;
}
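For comparison, the same detect-once-then-cache idea can be sketched in Rust with `std::sync::OnceLock`, which also guarantees that the probe runs at most once when several threads race on the first call. The names `SimdLevel`, `probe`, `simd_level`, and `PROBE_RUNS` are hypothetical and the probe body is a stand-in; this is not the library's code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical mirror of the detected level.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdLevel {
    Portable,
    Neon,
    Avx2,
    Avx512,
}

/// Counts how often the probe actually runs (for demonstration only).
pub static PROBE_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the real detection routine.
fn probe() -> SimdLevel {
    PROBE_RUNS.fetch_add(1, Ordering::Relaxed);
    if cfg!(target_arch = "aarch64") {
        SimdLevel::Neon
    } else {
        SimdLevel::Portable
    }
}

/// Returns the cached level, running the probe on the first call only.
pub fn simd_level() -> SimdLevel {
    static LEVEL: OnceLock<SimdLevel> = OnceLock::new();
    *LEVEL.get_or_init(probe)
}
```

Every call after the first is a cheap read of the stored value, which is the property the C cache above provides for single-threaded startup.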
Dispatch uses switch statements with compile-time guards. The DISPATCH_FFT_3ARG and DISPATCH_FFT_2ARG macros generate the dispatch pattern for FFT operations:
#define DISPATCH_FFT_3ARG(name, c, a, b, logn)                  \
    do {                                                        \
        falcon_simd_t level = get_simd_level();                 \
        switch (level) {                                        \
        IF_AVX512(case FALCON_SIMD_AVX512:                      \
            falcon_##name##_avx512(c, a, b, logn); return;)     \
        IF_AVX2(case FALCON_SIMD_AVX2:                          \
            falcon_##name##_avx2(c, a, b, logn); return;)       \
        IF_NEON(case FALCON_SIMD_NEON:                          \
            falcon_##name##_neon(c, a, b, logn); return;)       \
        default: falcon_##name(c, a, b, logn); return;          \
        }                                                       \
    } while (0)
Querying the Active Backend
C API
#include "falcon_simd.h"
falcon_simd_t level = falcon_detect_simd();
printf("Active backend: %s\n", falcon_simd_name(level));
// Prints "Active backend: " followed by one of: "AVX-512BW", "AVX2", "NEON", "Portable (scalar)"
The falcon_simd_t enum values are:
- FALCON_SIMD_PORTABLE = 0
- FALCON_SIMD_NEON = 1
- FALCON_SIMD_AVX2 = 2
- FALCON_SIMD_AVX512 = 3
Rust API
use metamui_falcon512::dispatch::{detect_best, detect_best_for_batch};
let level = detect_best();
println!("Active backend: {} (width={})", level.name(), level.width());
// Prints: "Active backend: ARM NEON (width=8)" on Apple Silicon
let batch_level = detect_best_for_batch(200);
println!("Batch backend: {}", batch_level.name());
// Prints: "Batch backend: Apple Metal GPU" on macOS when built with the metal feature and a Metal device is available
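The `name()` and `width()` accessors used above are not shown in this document. A sketch of how they might be implemented follows. Only the strings "ARM NEON" and "Apple Metal GPU" and the widths 32, 16, and 8 appear in the text; the remaining names (borrowed from the C API strings) and the widths for `Portable` and `Metal` are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdType {
    Metal,
    Avx512,
    Avx2,
    Neon,
    Portable,
}

impl SimdType {
    /// Human-readable backend name.
    pub fn name(self) -> &'static str {
        match self {
            SimdType::Metal => "Apple Metal GPU", // from the example output
            SimdType::Neon => "ARM NEON",         // from the example output
            SimdType::Avx512 => "AVX-512BW",      // assumed (C API string)
            SimdType::Avx2 => "AVX2",             // assumed (C API string)
            SimdType::Portable => "Portable (scalar)", // assumed (C API string)
        }
    }

    /// Number of u16 lanes processed per register.
    pub fn width(self) -> usize {
        match self {
            SimdType::Avx512 => 32,
            SimdType::Avx2 => 16,
            SimdType::Neon => 8,
            SimdType::Portable => 1, // assumed: one element at a time
            SimdType::Metal => 1,    // assumed placeholder; no register width is documented
        }
    }
}
```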
Build Flags to Force a Specific Backend
While the library is designed for automatic dispatch, you can force specific backends for testing or benchmarking:
Rust
# Force AVX-512 (compile-time target feature)
RUSTFLAGS="-C target-feature=+avx512f,+avx512bw" cargo build
# Enable Metal support
cargo build --features metal
# Force portable only (disable all SIMD)
cargo build --target x86_64-unknown-linux-gnu # on non-x86 host
C
# Force AVX-512
cc -mavx512f -mavx512bw -c falcon_ntt_avx512.c
# Force AVX-2 only (no AVX-512)
cc -mavx2 -c falcon_ntt_avx2.c
# Force NEON (cross-compile for AArch64)
aarch64-linux-gnu-gcc -march=armv8-a+simd -c falcon_ntt_neon.c
Implementation Files
- Rust dispatch: metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
- C dispatch: metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c
- C API header: metamui-crypto-c/metamui-falcon512/include/falcon_simd.h