ONNX Model Protection: Encrypted Loading and Architecture Obfuscation¶

★★★★★ Advanced

Date: 2026-04-03 Context: Desktop C++ app (Mac + Windows), ONNX Runtime inference, protection of model weights from extraction.

Encrypted Loading via CreateSessionFromArray¶

Microsoft officially does not plan built-in encryption support in ONNX Runtime (Issue #3556). Encryption is the developer's responsibility; ONNX Runtime provides the memory-load API.

#include <onnxruntime_cxx_api.h>

std::vector<uint8_t> encrypted_data = read_file("model.onnx.enc");
std::vector<uint8_t> decrypted = decrypt_aes256_gcm(encrypted_data, key);

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "app");
Ort::SessionOptions opts;
Ort::Session session(env, decrypted.data(), decrypted.size(), opts);

// Zero buffer immediately after session creation
sodium_memzero(decrypted.data(), decrypted.size());

import onnxruntime as ort

with open("model.onnx.enc", "rb") as f:
    encrypted = f.read()
decrypted = decrypt_aes_gcm(encrypted, key, nonce)
session = ort.InferenceSession(decrypted)
# decrypted goes out of scope → GC (Python); explicitly zero in C++

Memory Optimization for Large Models¶

CreateSessionFromArray doubles memory: buffer + parsed model. For 500 MB model = ~1 GB RAM during init (Issue #23775).

Options:

// Use model bytes directly (no copy), buffer must stay alive for session lifetime
session_options.AddConfigEntry("session.use_ort_model_bytes_directly", "1");
// Initializers (weights) read from buffer directly
session_options.AddConfigEntry("session.use_ort_model_bytes_for_initializers", "1");

Both flags give maximum memory savings for .ort format models. Standard .onnx format: only partial benefit.

Chunked decryption for large models (>1 GB):

constexpr size_t CHUNK = 64 * 1024;
std::vector<uint8_t> out;
out.reserve(file_size);
for (size_t off = 0; off < file_size; off += CHUNK) {
    size_t chunk_sz = std::min(CHUNK, file_size - off);
    auto chunk = decrypt_chunk(mapped_ptr + off, chunk_sz, &ctx);
    out.insert(out.end(), chunk.begin(), chunk.end());
}
// ONNX Runtime does NOT support streaming model load - full buffer required

DirectML / WinML (Windows)¶

DirectML entered maintenance mode (2025). Microsoft recommends Windows ML (WinML) - abstraction layer over ONNX Runtime that auto-selects execution provider: - NVIDIA RTX → TensorRT EP - AMD GPU → DirectML EP - Intel GPU/NPU → DirectML or Intel EP - Qualcomm NPU → QNN EP - Fallback → XNNPACK EP

// Force DirectML EP
Ort::SessionOptions opts;
OrtSessionOptionsAppendExecutionProvider_DML(opts, 0); // device_id = 0

// Memory limits for weak GPUs
opts.SetGraphOptimizationLevel(ORT_ENABLE_ALL);
opts.SetExecutionMode(ORT_SEQUENTIAL); // less memory vs parallel

GPU memory reality for weak hardware:

GPU	VRAM	UNet ~100MB	1024×1024 FPS
Intel UHD 630	Shared 2-4 GB	Possible	0.5-2
Intel Iris Xe	Shared 4-8 GB	OK	2-5
NVIDIA GTX 1650	4 GB	OK	5-15
Intel Arc A380	6 GB	OK	8-20

ONNX model 1.77 GB FP16 → ~2.5 GB VRAM at load time → 5 GB peak on first inference (intermediate buffers). 2 GB VRAM GPU: safe max model size ~300 MB.

Fallback when VRAM exhausted: DirectML uses shared system memory → 10-100x slowdown. Always check actual memory usage, not just model file size.

CoreML Encryption (macOS)¶

Apple provides native CoreML model encryption with Secure Enclave integration.

# Generate key (coremltools)
python -c "import coremltools as ct; print(ct.models.utils.generate_model_encryption_key())"

// Encrypt at compile time
let key = try MLModelEncryptionKey(url: keyURL)
try MLModel.compileModel(at: modelURL, configuration: config, encryptionKey: key)

// Load encrypted (CoreML handles decryption transparently)
let model = try MLModel(contentsOf: compiledModelURL, configuration: config)

Secure Enclave integration: Encryption key never leaves SE. Decryption happens inside the chip. Key is device-bound (non-exportable).

Protection strength: - Apple Silicon + SIP enabled + no jailbreak: very strong - Intel Mac without SIP: medium (memory dump possible) - Without SE (Hackintosh, VM): keys in software keychain → weak

Known bypass: .mlmodelc contains model.espresso.weights - if decrypted, easily parseable. On jailbroken device, intercept model in RAM after decryption.

CoreML vs ONNX Runtime Performance (Mac)¶

Model	CoreML (ANE)	CoreML (Metal)	ONNX CPU	ONNX+CoreML EP
MobileNetV2 (M1)	~1.5 ms	~5 ms	~15 ms	~6 ms
UNet 1024×1024 (M2)	~30 ms	~80 ms	~300 ms	~90 ms
ViT-Base (M3)	~4 ms	~12 ms	~40 ms	~15 ms

Apple Neural Engine: 3-5x faster than Metal for compatible models.

When to use each: - CoreML native: need ANE, need encryption, static shapes, Apple-only - ONNX+CoreML EP: cross-platform code, dynamic shapes, complex ops not supported by CoreML

Architecture Obfuscation via Custom Ops¶

Replace standard ops (Conv, MatMul) with custom ops having meaningless names. Netron and onnx-inspector cannot determine real architecture.

struct HiddenConv2dOp : Ort::CustomOpBase<HiddenConv2dOp, HiddenConv2dKernel> {
    const char* GetName() const { return "XProcessor_v3"; } // obfuscated name
    size_t GetInputTypeCount() const { return 2; }
    size_t GetOutputTypeCount() const { return 1; }
    ONNXTensorElementDataType GetInputType(size_t) const {
        return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
    }
    // ...
};

Ort::CustomOpDomain domain("com.myapp.internal");
domain.Add(&hiddenConvOp);
session_options.Add(domain);

Limits of this approach: - Weights (initializers) still in protobuf - extractable even if op names are hidden - Tensor shapes visible - architecture guessable from sizes - Custom ops disable graph transformer optimizations (performance loss)

Wrapper approach: custom op that runs an encrypted inner session:

struct EncryptedModelOp {
    void Compute(OrtKernelContext* context) {
        // inner_session_ holds real model, decrypted at init
        inner_session_->Run(run_options, names, tensors, ...);
    }
    std::unique_ptr<Ort::Session> inner_session_;
};

Outer .onnx file contains one custom op. Real model encrypted inside custom op binary.

Model Export: Minimizing Metadata Leakage¶

torch.onnx.export(
    model, dummy_input, "model.onnx",
    opset_version=17,
    strip_doc_string=True,    # removes Python stack traces
    do_constant_folding=True,
    input_names=["input"],    # anonymized names
    output_names=["output"],
)

What remains in ONNX file after export: - Full computation graph (op names, connections) - All weights as initializers (plaintext unless encrypted) - Tensor shapes and dtypes - Opset version

strip_doc_string=True removes Python source paths but not architecture. Encrypting the .onnx file is necessary to protect weights; op name obfuscation only protects architecture.

Dual-Backend Recommendation (Desktop App)¶

Mac:     CoreML native
         - Built-in encryption (Secure Enclave)
         - Apple Neural Engine performance
         - Compile: torch → coremltools → .mlpackage

Windows: ONNX Runtime + WinML
         - AES-256-GCM encrypted .onnx / .hmod
         - WinML auto-selects best EP
         - Compile: torch → torch.onnx.export → encrypt

Convert PyTorch model to both formats at build time. CoreML encryption via Apple API, ONNX via custom AES-256-GCM.

Gotchas¶

CreateSessionFromArray doubles RAM. 200 MB model needs ~400 MB during load. On 4 GB RAM machines with shared GPU: plan carefully.
ONNX Runtime has no streaming model load. Full decrypted buffer must exist before CreateSessionFromArray. No way to feed it in chunks.
DirectML entered maintenance mode. Use WinML API for new code. DirectML EP still works but no new features.
CoreML EP in ONNX Runtime silently converts to FP16. Use native CoreML for precision-critical models. ONNX+CoreML EP may partition graph between CoreML and CPU, sometimes slower than pure CPU.
Custom op disables graph optimizations. If a hot path is inside a custom op, ONNX Runtime cannot fuse it with adjacent ops. Profile before protecting hot paths.
Secure zero after decryption. Use sodium_memzero or SecureZeroMemory - compiler may optimize away memset(ptr, 0, ...) as "unnecessary write."
AES-NI availability check required. crypto_aead_aes256gcm_is_available() returns 0 on some ARM or older x86 without AES-NI. Fall back to ChaCha20-Poly1305.