Skip to content

Neural Network Output Scrambling: Anti-Piracy for Inference Engines

Advanced

Date: 2026-04-03 Context: Desktop C++ app with ONNX inference. Model outputs scrambled tensor; separate obfuscated converter process applies secret session-key-derived formula to produce normal image.


Training Approach

Approach A: Bake Scramble into Training

Loss = MSE(model_output, scramble(ground_truth)). Model learns to predict scrambled representation.

Pros: Normal image never exists in GPU memory (not even transiently). Cons: Requires retraining. Scramble function must be differentiable: - Channel/spatial permutation: differentiable (index reorder) - XOR with key: NOT differentiable directly (use straight-through estimator) - LUTs: NOT differentiable - Quality loss 1-3% PSNR for nonlinear scramble

Approach B: Post-processing (no retraining) - RECOMMENDED

Normal training. Add scramble as ONNX custom operator (last node in graph). Scramble executes on GPU, output already scrambled when copied to CPU.

// Register ONNX Runtime custom op
struct ScrambleOp : Ort::CustomOpBase<ScrambleOp, ScrambleKernel> {
    const char* GetName() const { return "Scramble"; }
    size_t GetInputTypeCount() const { return 2; } // tensor + key
    size_t GetOutputTypeCount() const { return 1; }
    ONNXTensorElementDataType GetInputType(size_t) const {
        return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
    }
    ONNXTensorElementDataType GetOutputType(size_t) const {
        return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
    }
};

Insert scramble node into existing exported model via ONNX Graph Surgery (no retraining required). Performance overhead: ~0.1-0.5ms additional CUDA kernel. Negligible vs 50-200ms inference.


Scrambling Techniques

Channel Permutation

For RGB (3 channels): only 3! = 6 permutations. Brute-forced in microseconds.

Solution: Permute intermediate channels (64ch in penultimate layer). 64! ~ 10^89 variants. Implement by permuting rows/columns of last conv weight matrix - no model changes needed.

Spatial Scrambling

Level Search Space Jigsaw Attack Resistance Overhead
Block-level (16x16) (64×64)! ~ 10^9000+ Low - jigsaw solver recovers via block boundary matching Minimal
Pixel-level (1024×1024)! Medium - histogram preserved within channels Medium
Pixel-level + channel mix Astronomical High - destroys spatial AND color correlations Medium
DCT-domain Implementation-dependent High - destroys spatial coherence Significant

Recommendation: pixel-level permutation + channel mixing. Block-level is vulnerable to jigsaw attacks (see arxiv:2308.02227, arxiv:2211.02369).

Value Transformations

Affine (y = w*x + b): Only 2 parameters per channel. Broken with 2 known pairs. Too weak as primary defense.

LUT per channel (8-bit): 256! ~ 10^507 possible tables. Broken with 1 known (scrambled, original) pair if position-independent. Strengthen with position-dependent LUT (PDLUT): out[i] = LUT[x,y][in[i]].

Polynomial (session-key derived):

y = a0 + a1*x + a2*x^2 + ... + an*x^n  (n=4-8)
Coefficients derived from session_key via HKDF
Needs n+1 known pairs to break - but coefficients rotate with session.

Key-dependent color matrix:

[R', G', B'] = M(key) × [R, G, B]
M derived from session_key via HMAC/KDF. Linear - vulnerable to statistical analysis.

Input tensor (HWC, float32)
  [1] Channel permutation on intermediate channels (64ch)
  [2] Custom conv layer (scrambled weights from session_key)
  [3] Pixel-level spatial permutation (seed = HMAC(session_key, "spatial"))
  [4] Position-dependent polynomial transform (coefficients from session_key)
  [5] XOR with key-derived pseudorandom stream (AES-CTR)
  Scrambled output (HWC, float32)

Attacker must break ALL layers in correct order.


Session-Key Dependent Formula

session_key → HKDF-SHA256 → scramble parameters:
  - spatial_permutation_seed  (32 bytes)
  - channel_permutation_seed  (32 bytes)
  - polynomial_coefficients   (N * float64)
  - xor_stream_key            (32 bytes)
  - affine_scale              (per channel)
  - affine_offset             (per channel)

Server issues session_key at authentication (ECDH key exchange). Key rotates every 1h-1d. Converter derives same parameters and computes inverse.

Security: even if attacker extracts converter binary, without session_key the formula is useless. Compromise = only one session exposed.

XOR Stream (AES-CTR CSPRNG)

void generate_scramble_stream(const uint8_t* session_key, size_t key_len,
                              float* stream, size_t num_pixels) {
    AES_CTR_Context ctx;
    aes_ctr_init(&ctx, session_key, initial_counter);
    for (size_t i = 0; i < num_pixels * 3; i++) {
        uint32_t rand_bits;
        aes_ctr_next(&ctx, &rand_bits, 4);
        stream[i] = (float)(rand_bits) / UINT32_MAX * 2.0f - 1.0f;
    }
}
// Scramble: output[i] = clamp(model_output[i] + stream[i] * strength)
// Unscramble: original[i] = output[i] - stream[i] * strength

Pure additive stream is weak alone - trivially inverted with one known pair. Use only as Layer 5 on top of nonlinear scramble.


Converter Binary Obfuscation

The converter is the critical attack target. If broken, entire scheme is compromised.

LLVM Obfuscation (Baseline - Required)

Tools: Obfuscator-LLVM (o-LLVM), Hikari (o-LLVM fork), commercial: IRDETO Cloakware, Arxan, Promon.

Techniques: 1. Control Flow Flattening (CFF) - all basic blocks in one switch inside loop; state variable controls order 2. Bogus Control Flow - fake branches with opaque predicates (always true/false, hard to analyze statically) 3. Instruction Substitution - a+ba-(-b), x^y(x&~y)|(~x&y) 4. String Encryption - decrypt strings at runtime 5. Constant Encoding - replace numeric constants with computed equivalents

Against this: SATURN (LLVM IR deobfuscation), Triton (symbolic execution). CFF is not a silver bullet.

VM Protection (for Critical Functions)

VMProtect / Themida: critical function compiled to custom bytecode for custom interpreter. Attacker sees only interpreter + bytecode blob. Current state (2025-2026): no public universal devirtualizer for recent VMProtect. Takes weeks-months for qualified reverser.

Recommended stack:

Required (baseline):
├── LLVM obfuscation (CFF + bogus flow + string encryption)
├── Anti-debug checks (IsDebuggerPresent, ptrace, timing)
├── Integrity checks (code signing + self-checksumming)
└── Stripped symbols

Recommended (serious protection):
├── VMProtect for inverse-scramble function
├── White-box crypto for key derivation
└── Junk/dead code injection

Optional (maximum):
├── NN as one unscramble layer
├── Split into 2 processes (GPU unscramble + CPU finalize)
└── Custom VM for key management

White-box crypto reality: White-box AES broken multiple times academically (DCA - Differential Computation Analysis). Commercial WBC implementations hold better via security-through-obscurity. Useful as one layer, not sole defense.

Process Split

Process A: inverse_step_1(scrambled) → intermediate_1
Process B: inverse_step_2(intermediate_1) → normal_image

Max 2 processes - IPC overhead and complexity grow fast.


IPC Security (Engine → Converter)

// Windows: random GUID segment name
std::wstring segName = L"Global\\" + generate_uuid();
HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
    PAGE_READWRITE, 0, tensor_size, segName.c_str());
void* ptr = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, tensor_size);
VirtualLock(ptr, tensor_size); // prevent swap to disk

// macOS/Linux: anonymous, unlinked immediately
int fd = shm_open("/tensor_temp", O_CREAT | O_RDWR, 0600);
ftruncate(fd, tensor_size);
void* ptr = mmap(NULL, tensor_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
shm_unlink("/tensor_temp"); // name removed, segment lives while fd open

Encrypt data in shared memory with ChaCha20-Poly1305 (~0.3ms for 12MB on modern CPU with SIMD). Random segment names per session.

Named pipes: ~5-15ms for 12MB - too slow. Debugger on converter still sees decrypted data. In-process DLL: weakest - debugger on main process sees everything.


Performance Benchmarks

CPU (Intel i5-10400, 6-core, 2.9 GHz):

Operation 1024×1024 2048×2048 Parallelizable
Channel permutation (3ch) ~0.01ms ~0.04ms Trivially (memcpy)
Channel permutation (64ch) ~0.05ms ~0.2ms Per-channel
Spatial pixel permutation ~0.5ms ~2ms Per-row
Affine transform (per-pixel) ~0.1ms ~0.4ms SIMD AVX2
Polynomial (degree 4) ~0.3ms ~1.2ms SIMD AVX2
LUT per channel ~0.2ms ~0.8ms Cache-friendly
XOR key stream (AES-CTR) ~0.3ms ~1.2ms AES-NI
Full multi-layer scramble ~2-3ms ~8-12ms

GPU (CUDA):

Operation 1024×1024 2048×2048
Full scramble ~0.1-0.3ms ~0.3-0.8ms
+ GPU→CPU copy ~0.5ms ~2ms

Total overhead: ~3-5ms for typical 1024×1024. Against 50-200ms inference = <5%.


Known Attacks

Jigsaw Puzzle Solving

Block-level scrambling = jigsaw puzzle. Algorithms exploit block boundary statistics. - 16×16 blocks: 70-90% order recovery with sufficient blocks - 8×8 blocks: 40-60% - Pixel-level: immune (no block boundaries)

Known-Plaintext Attack (KPA) - Most Dangerous

Attacker obtains (scrambled_output, original_image) pair.

Research findings: - All permutation-only ciphers broken from O(log2(N)) pairs (arxiv, 2017) - Pixel permutation 1024×1024: ~20 pairs for full recovery - Affine transform: 2 pairs per channel - LUT (position-independent): 1 pair

Critical: if session_key doesn't rotate between runs, one intercepted pair may suffice for complete break.

Statistical Analysis

  • Channel permutation preserves histogram per channel
  • Spatial permutation preserves global histogram
  • Defense: nonlinear value transform (LUT/polynomial) changes histogram

Per-Image Salt (Mandatory)

Known-Plaintext Attack with ~20 pairs breaks permutation-only scramble. Counter: per-image salt mixes unique randomness into every scramble operation.

// Per-image salt derivation
uint8_t image_salt[16];
crypto_random_bytes(image_salt, 16); // unique per image
// embed in output (first 16 bytes of scrambled tensor, or separate channel)

// Scramble key for this image
uint8_t image_key[32];
crypto_kdf_hkdf_sha256_expand(image_key, 32,
    "image-scramble", 14,
    session_key_derived_prk); // XOR image_salt into derivation
// feed image_key into spatial permutation seed, polynomial coefficients, etc.

Effect: each image gets different permutation/transform even with same session_key. 20 (scrambled, original) pairs from different images give 20 different scramble instances → no common key to extract.

Where to embed salt: prepend 16-byte nonce to scrambled tensor, or store in a dedicated extra channel. Converter reads salt before inverting.

Latent Space Scrambling (VAE-based Models)

For VAE-based architecture (Stable Diffusion, etc.): scramble in latent space rather than pixel space.

Advantages: - Latent tensor: typically 4 channels × H/8 × W/8 (much smaller than pixel output) - ~64 latent channels (encoder output): 64! ~ 10^89 permutations available - No natural spatial statistics in latent space → statistical attacks weakened - Scramble/unscramble overhead: ~0.05ms (tiny tensor)

Implementation: add custom ONNX op after VAE encoder output, before decoder.

IPC Security: Engine → Converter

Encrypted shared memory (recommended):

// Windows: random GUID name per session
std::wstring seg_name = L"Global\\" + generate_uuid();
HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
    PAGE_READWRITE, 0, tensor_size, seg_name.c_str());
void* ptr = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, tensor_size);
VirtualLock(ptr, tensor_size); // prevent swap to disk

// macOS: anonymous, unlinked immediately
int fd = shm_open("/t", O_CREAT | O_RDWR, 0600);
void* ptr = mmap(NULL, tensor_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
shm_unlink("/t"); // name removed, segment lives while fd open

Encrypt data in shared memory: ChaCha20-Poly1305 ~0.3ms for 12MB on modern CPU with SIMD. Random segment names per session. Named pipes: ~5-15ms for 12MB - too slow.

Gotchas

  • Permutation-only scramble breaks with 20 known-plaintext pairs regardless of permutation size. Always combine with value transformation.
  • Per-image salt is mandatory. Without it, 20 (scrambled, original) pairs break the shared session key. Salt per image makes each instance independent.
  • Session key must rotate per session - not per user, not per day. One session = one key. Compromised key = one session exposed, not all.
  • AES-CTR stream by itself is trivially invertible once attacker has one (scrambled, original) pair. Use only as final layer after nonlinear transforms.
  • GPU scramble in ONNX custom op requires the op to be registered before session creation. If session is already created without the custom op, you must recreate it.
  • Block-level scramble is useless for anti-piracy - modern jigsaw solvers (2022-2023) recover 70%+ of content from 16x16 blocks. Use pixel-level only.
  • Learned scramble heads (NN) are theoretically elegant but lossy (~0.1-0.5 LSB errors) and vulnerable to approximation if attacker has enough pairs with known keys.
  • VMProtect significantly slows down the protected function - 10-50x overhead. Profile before protecting hot paths.
  • 3 RGB channels = 6 permutations = brute-forced in microseconds. Scramble must happen BEFORE final conv layer where 64+ intermediate channels exist.
  • Converter process with VirtualLock prevents scrambled tensor from being swapped to disk pagefile. Without it, hibernation/sleep can leak tensor to pagefile.sys.

See Also