Runtime Dispatch Architecture

The MetaMUI crypto library uses a zero-config runtime dispatch system. No build flags or environment variables are needed: the library detects CPU and GPU capabilities at run time, caches the result, and routes every performance-critical operation through the fastest available backend.

Dispatch Hierarchy

The dispatcher evaluates backends from fastest to slowest and selects the first one available:

Apple Metal GPU   (macOS + Apple Silicon + batch_size >= 100)
AVX-512BW         (x86-64 with AVX-512F + AVX-512BW)
AVX2              (x86-64 with AVX2)
ARM NEON          (all AArch64 processors)
Portable          (scalar fallback -- always available)

Metal GPU is only considered for batch operations. Single-polynomial operations always use CPU SIMD regardless of GPU availability.
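
The priority order above can be modelled as a pure function over a set of capability flags. The sketch below is illustrative only: Caps, Backend, select_backend, and METAL_MIN_BATCH are hypothetical names rather than the library's API, and the threshold of 100 is taken from the list above.

```rust
/// Hypothetical capability flags; the real library probes these itself.
#[derive(Debug, Clone, Copy, Default)]
pub struct Caps {
    pub metal: bool,
    pub avx512bw: bool,
    pub avx2: bool,
    pub neon: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Metal,
    Avx512,
    Avx2,
    Neon,
    Portable,
}

/// Batch size at or above which the GPU is considered (from the list above).
pub const METAL_MIN_BATCH: usize = 100;

/// Walk the hierarchy from fastest to slowest and return the first match.
pub fn select_backend(caps: Caps, batch_size: usize) -> Backend {
    if caps.metal && batch_size >= METAL_MIN_BATCH {
        Backend::Metal
    } else if caps.avx512bw {
        Backend::Avx512
    } else if caps.avx2 {
        Backend::Avx2
    } else if caps.neon {
        Backend::Neon
    } else {
        Backend::Portable
    }
}
```

With metal and neon both set, a batch of 200 selects Metal while a single operation (batch size 1) selects Neon, which matches the rule that single-polynomial work stays on the CPU.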

Rust Dispatch

The Rust implementation (metamui-crypto-rust/metamui-falcon512/src/dispatch.rs) defines a SimdType enum:

pub enum SimdType {
    Metal,     // Apple Metal GPU batch operations
    Avx512,    // 32 x u16 per register
    Avx2,      // 16 x u16 per register
    Neon,      // 8 x u16 per register
    Portable,  // scalar fallback
}
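
To make the lane counts in those comments concrete, the sketch below computes how many register-wide steps it takes to cover the 512 coefficients of a Falcon-512 polynomial. CpuSimd, lanes, and steps are hypothetical names used only for this illustration. Metal is left out because the text gives no lane count for it, and the scalar path is treated as one lane.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CpuSimd {
    Avx512,
    Avx2,
    Neon,
    Portable,
}

impl CpuSimd {
    /// Number of u16 lanes per register, as in the enum comments above.
    pub const fn lanes(self) -> usize {
        match self {
            CpuSimd::Avx512 => 32, // 512 bits / 16
            CpuSimd::Avx2 => 16,   // 256 bits / 16
            CpuSimd::Neon => 8,    // 128 bits / 16
            CpuSimd::Portable => 1, // assumption: scalar counts as one lane
        }
    }

    /// Register-wide steps needed to cover `n` coefficients, rounded up.
    pub const fn steps(self, n: usize) -> usize {
        (n + self.lanes() - 1) / self.lanes()
    }
}
```

For n = 512 this gives 16 steps with AVX-512, 32 with AVX2, 64 with NEON, and 512 on the scalar path.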

Detection uses is_x86_feature_detected! macros on x86 and compile-time constants on AArch64:

pub fn detect_best() -> SimdType {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_avx512_available() { return SimdType::Avx512; }
        if is_avx2_available()   { return SimdType::Avx2; }
        return SimdType::Portable;
    }
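    #[cfg(target_arch = "aarch64")]
    {
        return SimdType::Neon;  // NEON is mandatory on AArch64
    }
    // Any other architecture: scalar fallback, so every target returns a value.
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
    {
        return SimdType::Portable;
    }
}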
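
The helper functions called in detect_best are not shown in this document. A minimal sketch of how they might be written with the standard is_x86_feature_detected! macro follows; the function bodies are an assumption, and only the function names come from the code above. The non-x86 stubs exist so that the sketch compiles on every target.

```rust
/// The AVX-512 path needs both the Foundation (F) and Byte/Word (BW) subsets.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn is_avx512_available() -> bool {
    std::arch::is_x86_feature_detected!("avx512f")
        && std::arch::is_x86_feature_detected!("avx512bw")
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn is_avx2_available() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// On other architectures the x86 feature levels are never available.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn is_avx512_available() -> bool {
    false
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn is_avx2_available() -> bool {
    false
}
```

The standard library caches the result of the macro internally, so calling these helpers repeatedly is cheap.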

For batch operations, Metal is checked first:

pub fn detect_best_for_batch(batch_size: usize) -> SimdType {
    if batch_size >= 100 && is_metal_available() {
        return SimdType::Metal;
    }
    detect_best()
}

Metal availability is probed via the Metal framework device API, gated behind #[cfg(all(feature = "metal", target_os = "macos"))].
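
One way to structure such a gated probe is to pair it with a constant-false stub so that callers need no cfg of their own. The sketch below assumes the metal crate is the dependency behind the feature and that its Device::system_default() returns None when no device is usable. The names metal_probe and use_gpu_for are hypothetical.

```rust
/// Compiled only when the `metal` feature is on and the target is macOS.
/// Assumption: the `metal` crate is a dependency of the build.
#[cfg(all(feature = "metal", target_os = "macos"))]
pub fn metal_probe() -> bool {
    metal::Device::system_default().is_some()
}

/// Everywhere else the probe is a constant `false`.
#[cfg(not(all(feature = "metal", target_os = "macos")))]
pub fn metal_probe() -> bool {
    false
}

/// Mirrors the batch rule above: GPU only for batches of at least 100.
pub fn use_gpu_for(batch_size: usize) -> bool {
    batch_size >= 100 && metal_probe()
}
```

Because the size check comes first, small batches never touch the probe at all.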

Each FFT dispatch function uses #[cfg(...)] blocks to select the implementation at compile time for the target architecture, with runtime CPUID checks for x86 feature levels:

pub fn poly_mul_fft(c: &mut [f64], a: &[f64], b: &[f64], logn: usize) {
    #[cfg(target_arch = "aarch64")]
    { /* NEON path */ return; }

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Widest first; each path returns so control never falls through.
        if is_avx512_available() { /* AVX-512 path */ return; }
        if is_avx2_available()   { /* AVX2 path */ return; }
    }

    /* Portable fallback (never reached on AArch64) */
}
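
The sketch below shows the same pattern end to end on a deliberately simple kernel: an element-wise product of two f64 slices stands in for the real FFT-domain multiplication. The function names are hypothetical. The AVX2 variant relies on #[target_feature] so the compiler may vectorise the loop with AVX2 instructions; it uses no hand-written intrinsics.

```rust
/// Scalar reference: c[i] = a[i] * b[i].
fn mul_portable(c: &mut [f64], a: &[f64], b: &[f64]) {
    for ((ci, ai), bi) in c.iter_mut().zip(a).zip(b) {
        *ci = ai * bi;
    }
}

/// Same loop, compiled with AVX2 enabled for this function only.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn mul_avx2(c: &mut [f64], a: &[f64], b: &[f64]) {
    for ((ci, ai), bi) in c.iter_mut().zip(a).zip(b) {
        *ci = ai * bi;
    }
}

/// Compile-time guard for the architecture, runtime check for the feature.
pub fn mul_dispatch(c: &mut [f64], a: &[f64], b: &[f64]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed on the line above.
            unsafe { mul_avx2(c, a, b) };
            return;
        }
    }
    mul_portable(c, a, b);
}
```

Both paths compute the same values, so the caller cannot observe which one ran except through timing.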

C Dispatch

The C implementation (metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c) uses CPUID for x86 detection:

AVX2: CPUID leaf 1 checks OSXSAVE (ECX bit 27) + AVX (ECX bit 28), then leaf 7 checks EBX bit 5
AVX-512: Same OSXSAVE check, then leaf 7 checks EBX bit 16 (AVX-512F) + bit 30 (AVX-512BW)
NEON: AArch64 is detected at compile time (__aarch64__ or _M_ARM64), so detection always returns FALCON_SIMD_NEON on that target
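
The bit tests listed above can be written as pure functions over the raw register values, which makes them easy to check without executing CPUID. The sketch is in Rust rather than C, and all names in it are hypothetical. It reads "same OSXSAVE check" as meaning the same leaf 1 test (OSXSAVE plus AVX) for both levels, which is an assumption. A complete check would typically also read XCR0 via XGETBV to confirm the OS saves the wider register state; that step is not described above and is omitted here.

```rust
const OSXSAVE: u32 = 1 << 27; // leaf 1, ECX
const AVX: u32 = 1 << 28; // leaf 1, ECX
const AVX2: u32 = 1 << 5; // leaf 7, EBX
const AVX512F: u32 = 1 << 16; // leaf 7, EBX
const AVX512BW: u32 = 1 << 30; // leaf 7, EBX

/// True when every bit in `mask` is set in `reg`.
fn all_set(reg: u32, mask: u32) -> bool {
    (reg & mask) == mask
}

pub fn has_avx2(leaf1_ecx: u32, leaf7_ebx: u32) -> bool {
    all_set(leaf1_ecx, OSXSAVE | AVX) && all_set(leaf7_ebx, AVX2)
}

pub fn has_avx512(leaf1_ecx: u32, leaf7_ebx: u32) -> bool {
    all_set(leaf1_ecx, OSXSAVE | AVX) && all_set(leaf7_ebx, AVX512F | AVX512BW)
}
```

A CPU that reports AVX-512F without AVX-512BW fails has_avx512 and would be handled by the AVX2 check instead.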

The detected level is cached in a static variable:

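/* Sentinel meaning "not detected yet"; real levels start at 0. */
static falcon_simd_t cached_level = (falcon_simd_t)-1;

static falcon_simd_t get_simd_level(void) {
    /* Detect once on first use, then reuse the cached value. */
    if (cached_level == (falcon_simd_t)-1) {
        cached_level = falcon_detect_simd();
    }
    return cached_level;
}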
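
The C cache above is a plain static with no synchronisation. For comparison, a Rust version of the same detect-once idea can use std::sync::OnceLock, which guarantees the initialiser runs at most once even under concurrent first calls. This is a sketch with hypothetical names (Level, detect_level, cached_level); the counter exists only to make the single detection observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Portable,
    Neon,
    Avx2,
    Avx512,
}

/// Counts how often detection actually runs (for demonstration only).
pub static DETECT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Placeholder probe; a real one would query the CPU.
fn detect_level() -> Level {
    DETECT_CALLS.fetch_add(1, Ordering::SeqCst);
    Level::Portable
}

/// Runs `detect_level` on the first call and returns the stored value after.
pub fn cached_level() -> Level {
    static CACHE: OnceLock<Level> = OnceLock::new();
    *CACHE.get_or_init(detect_level)
}
```

After any number of calls to cached_level, DETECT_CALLS stays at 1.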

Dispatch uses switch statements with compile-time guards. The DISPATCH_FFT_3ARG and DISPATCH_FFT_2ARG macros generate the dispatch pattern for FFT operations:

#define DISPATCH_FFT_3ARG(name, c, a, b, logn) \
    do { \
        falcon_simd_t level = get_simd_level(); \
        switch (level) { \
            IF_AVX512(case FALCON_SIMD_AVX512: \
                falcon_##name##_avx512(c, a, b, logn); return;) \
            IF_AVX2(case FALCON_SIMD_AVX2: \
                falcon_##name##_avx2(c, a, b, logn); return;) \
            IF_NEON(case FALCON_SIMD_NEON: \
                falcon_##name##_neon(c, a, b, logn); return;) \
            default: falcon_##name(c, a, b, logn); return; \
        } \
    } while (0)
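
For readers more familiar with Rust, the same "generate one switch per operation" idea can be sketched with macro_rules!. Everything here is hypothetical: the Simd enum, the stub kernels, and the dispatch_3arg macro are stand-ins, and the stubs return a tag string only so the chosen path is visible. The compile-time guards of the C macro (IF_AVX512 and friends) are not modelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Simd {
    Avx2,
    Neon,
    Portable,
}

/// Generates a wrapper that matches on the level and calls the listed kernel,
/// falling back to the portable kernel for every level not listed.
macro_rules! dispatch_3arg {
    ($wrapper:ident, $fallback:ident, $( $variant:ident => $kernel:ident ),* $(,)?) => {
        pub fn $wrapper(level: Simd, c: &mut [f64], a: &[f64], b: &[f64]) -> &'static str {
            match level {
                $( Simd::$variant => $kernel(c, a, b), )*
                #[allow(unreachable_patterns)]
                _ => $fallback(c, a, b),
            }
        }
    };
}

fn mul_generic(c: &mut [f64], a: &[f64], b: &[f64]) -> &'static str {
    for ((ci, ai), bi) in c.iter_mut().zip(a).zip(b) {
        *ci = ai * bi;
    }
    "portable"
}

// Stub kernels: same arithmetic, different tag.
fn mul_avx2_stub(c: &mut [f64], a: &[f64], b: &[f64]) -> &'static str {
    mul_generic(c, a, b);
    "avx2"
}

fn mul_neon_stub(c: &mut [f64], a: &[f64], b: &[f64]) -> &'static str {
    mul_generic(c, a, b);
    "neon"
}

dispatch_3arg!(poly_mul, mul_generic, Avx2 => mul_avx2_stub, Neon => mul_neon_stub);
```

Calling poly_mul with Simd::Portable reaches the default arm, just as the default case does in the C switch.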

Querying the Active Backend

C API

#include <stdio.h>
#include "falcon_simd.h"

falcon_simd_t level = falcon_detect_simd();
printf("Active backend: %s\n", falcon_simd_name(level));
// The name is one of: "AVX-512BW", "AVX2", "NEON", "Portable (scalar)"

The falcon_simd_t enum values are:

FALCON_SIMD_PORTABLE = 0
FALCON_SIMD_NEON = 1
FALCON_SIMD_AVX2 = 2
FALCON_SIMD_AVX512 = 3
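
If these values need to cross an FFI boundary, a Rust mirror of the enum might look like the sketch below. FalconSimd and from_raw are hypothetical names, and the i32 representation is an assumption about how the C compiler lays out falcon_simd_t. Only the four numeric values come from the list above.

```rust
/// Hypothetical Rust mirror of the C enum values listed above.
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FalconSimd {
    Portable = 0,
    Neon = 1,
    Avx2 = 2,
    Avx512 = 3,
}

impl FalconSimd {
    /// Convert a raw value from C, rejecting anything outside 0..=3.
    pub fn from_raw(raw: i32) -> Option<Self> {
        match raw {
            0 => Some(FalconSimd::Portable),
            1 => Some(FalconSimd::Neon),
            2 => Some(FalconSimd::Avx2),
            3 => Some(FalconSimd::Avx512),
            _ => None,
        }
    }
}
```

Rejecting unknown values also covers the -1 "not detected" sentinel used by the C cache, which maps to None here.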

Rust API

use metamui_falcon512::dispatch::{detect_best, detect_best_for_batch};

let level = detect_best();
println!("Active backend: {} (width={})", level.name(), level.width());
// Prints: "Active backend: ARM NEON (width=8)" on Apple Silicon

let batch_level = detect_best_for_batch(200);
println!("Batch backend: {}", batch_level.name());
// Prints: "Batch backend: Apple Metal GPU" on macOS when Metal is available

Build Flags to Force a Specific Backend

While the library is designed for automatic dispatch, you can force specific backends for testing or benchmarking:

Rust

# Force AVX-512 (compile-time target feature; the binary then requires an AVX-512 CPU)
RUSTFLAGS="-C target-feature=+avx512f,+avx512bw" cargo build

# Enable Metal support
cargo build --features metal

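# Portable only: cross-compiling for x86-64 from a non-x86 host does not by itself
# disable SIMD, because the binary still runs feature detection on the CPU it
# executes on. Expect the scalar path only where that CPU lacks AVX2 and AVX-512.
cargo build --target x86_64-unknown-linux-gnu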
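
To confirm which target features a given set of flags actually enabled for the whole build, the cfg! macro can be queried from inside the program. The helper below is a hypothetical diagnostic, not part of the library. It reports compile-time settings only and says nothing about what the CPU supports at run time.

```rust
/// Target features enabled for the whole build (for example via RUSTFLAGS).
pub fn compile_time_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(target_feature = "avx512f") {
        enabled.push("avx512f");
    }
    if cfg!(target_feature = "avx512bw") {
        enabled.push("avx512bw");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "neon") {
        enabled.push("neon");
    }
    enabled
}
```

Printing the returned list in a benchmark harness is a quick way to check that a RUSTFLAGS setting took effect.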

C

# Force AVX-512
cc -mavx512f -mavx512bw -c falcon_ntt_avx512.c

# Force AVX2 only (no AVX-512)
cc -mavx2 -c falcon_ntt_avx2.c

# Force NEON (cross-compile for AArch64)
aarch64-linux-gnu-gcc -march=armv8-a+simd -c falcon_ntt_neon.c

Implementation Files

Rust dispatch: metamui-crypto-rust/metamui-falcon512/src/dispatch.rs
C dispatch: metamui-crypto-c/metamui-falcon512/src/simd/falcon_dispatch.c
C API header: metamui-crypto-c/metamui-falcon512/include/falcon_simd.h