Neural Network Output Scrambling: Anti-Piracy for Inference Engines
Date: 2026-04-03
Context: Desktop C++ app with ONNX inference. Model outputs scrambled tensor; separate obfuscated converter process applies secret session-key-derived formula to produce normal image.
Training Approach
Approach A: Bake Scramble into Training
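Loss = MSE(model_output, scramble(ground_truth)). Model learns to predict scrambled representation.
Pros:
- Normal image never exists in GPU memory (not even transiently).
Cons:
- Requires retraining.
- The scrambled target must stay learnable. With the loss above the scramble is applied only to the ground truth, so no gradient flows through it, but discontinuous targets are hard to fit. If the scramble is placed inside the training graph instead, it must also be differentiable:
  - Channel/spatial permutation: differentiable (index reorder)
  - XOR with key: NOT differentiable directly (use straight-through estimator)
  - LUTs: NOT differentiable
- Quality loss 1-3% PSNR for nonlinear scramble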
Approach B: Post-processing (no retraining) - RECOMMENDED
Normal training. Add scramble as ONNX custom operator (last node in graph). Scramble executes on GPU, output already scrambled when copied to CPU.
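// Register ONNX Runtime custom op.
// ScrambleKernel is defined elsewhere; this assumes it has a
// (const OrtApi&, const OrtKernelInfo*) constructor and a Compute(OrtKernelContext*) method.
struct ScrambleOp : Ort::CustomOpBase<ScrambleOp, ScrambleKernel> {
    // Signature used by recent ONNX Runtime releases; older ones pass Ort::CustomOpApi instead.
    void* CreateKernel(const OrtApi& api, const OrtKernelInfo* info) const {
        return new ScrambleKernel(api, info);
    }
    const char* GetName() const { return "Scramble"; }
    // Bind the op to the CUDA execution provider so the scramble runs on the GPU.
    const char* GetExecutionProviderType() const { return "CUDAExecutionProvider"; }
    size_t GetInputTypeCount() const { return 2; }  // tensor + key
    size_t GetOutputTypeCount() const { return 1; }
    ONNXTensorElementDataType GetInputType(size_t /*index*/) const {
        return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
    }
    ONNXTensorElementDataType GetOutputType(size_t /*index*/) const {
        return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
    }
};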
Insert scramble node into existing exported model via ONNX Graph Surgery (no retraining required). Performance overhead: ~0.1-0.5ms additional CUDA kernel. Negligible vs 50-200ms inference.
Scrambling Techniques
Channel Permutation
For RGB (3 channels): only 3! = 6 permutations. Brute-forced in microseconds.
Solution: Permute intermediate channels (64ch in penultimate layer). 64! ~ 10^89 variants. Implement by permuting rows/columns of last conv weight matrix - no model changes needed.
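The search-space sizes quoted in this note (3! = 6, 64! ~ 10^89) are easy to sanity-check. The helper below is a minimal sketch, not part of the original design; the name `log10_factorial` is an assumption. It uses the identity ln(n!) = lgamma(n + 1) to return the decimal exponent of n!.

```cpp
#include <cassert>
#include <cmath>

// Returns log10(n!) using the log-gamma function: ln(n!) = lgamma(n + 1).
// Working in log space avoids overflow for large n such as 64 or 256.
double log10_factorial(double n) {
    assert(n >= 0.0);
    return std::lgamma(n + 1.0) / std::log(10.0);
}
```

For example, `log10_factorial(64)` is about 89.1, which matches the 64! ~ 10^89 figure above. The same call works for the block and LUT counts used later in the note.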
Spatial Scrambling
| Level | Search Space (1024×1024 image) | Jigsaw Attack Resistance | Overhead |
|---|---|---|---|
| Block-level (16x16) | (64×64)! = 4096! ~ 10^13019 | Low - jigsaw solver recovers via block boundary matching | Minimal |
| Pixel-level | (1024×1024)! ~ 10^(5.9×10^6) | Medium - histogram preserved within channels | Medium |
| Pixel-level + channel mix | Astronomical | High - destroys spatial AND color correlations | Medium |
| DCT-domain | Implementation-dependent | High - destroys spatial coherence | Significant |
Recommendation: pixel-level permutation + channel mixing. Block-level is vulnerable to jigsaw attacks (see arxiv:2308.02227, arxiv:2211.02369).
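A minimal sketch of the pixel-level permutation step and its inverse follows. The function names are hypothetical, and `std::mt19937_64` is only a stand-in for a CSPRNG seeded from the session key: it is not cryptographically secure. The shuffle is written by hand because the output of `std::shuffle` and `std::uniform_int_distribution` is not specified across standard libraries, and the engine and converter must derive the identical permutation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Builds a permutation of [0, n) with a hand-written Fisher-Yates shuffle.
// mt19937_64 is a placeholder generator (NOT cryptographically secure).
std::vector<uint32_t> make_permutation(uint64_t seed, size_t n) {
    std::vector<uint32_t> perm(n);
    std::iota(perm.begin(), perm.end(), 0u);
    std::mt19937_64 rng(seed);
    for (size_t i = n; i > 1; --i) {
        size_t j = static_cast<size_t>(rng() % i);  // slight modulo bias, fine for a sketch
        std::swap(perm[i - 1], perm[j]);
    }
    return perm;
}

// Scramble: the element at source index perm[i] moves to position i.
std::vector<float> apply_permutation(const std::vector<float>& in,
                                     const std::vector<uint32_t>& perm) {
    assert(in.size() == perm.size());
    std::vector<float> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) out[i] = in[perm[i]];
    return out;
}

// Unscramble: writes each element back to its source index.
std::vector<float> invert_permutation(const std::vector<float>& in,
                                      const std::vector<uint32_t>& perm) {
    assert(in.size() == perm.size());
    std::vector<float> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) out[perm[i]] = in[i];
    return out;
}
```

Treating every (pixel, channel) float as its own element, so n = H × W × C, moves values across channels as well as positions. That is one simple way to combine the spatial step with channel mixing.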
Value Transformations
Affine (y = w*x + b): Only 2 parameters per channel. Broken with 2 known (input, output) value pairs per channel, which a single known image already provides. Too weak as primary defense.
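To illustrate why the affine layer falls so quickly, the sketch below solves y = w*x + b from two known samples of one channel. The `Affine` struct and `recover_affine` name are hypothetical helpers introduced only for this example.

```cpp
#include <cassert>

struct Affine {
    double w;  // scale
    double b;  // offset
};

// Solves y = w*x + b from two known (x, y) samples with distinct x values.
Affine recover_affine(double x1, double y1, double x2, double y2) {
    assert(x1 != x2);
    double w = (y2 - y1) / (x2 - x1);
    return Affine{w, y1 - w * x1};
}
```

Any two pixels of a channel with different original values are enough, so one intercepted image exposes the parameters for every channel.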
LUT per channel (8-bit): 256! ~ 10^507 possible tables. Broken with 1 known (scrambled, original) pair if position-independent, since every pixel reveals one table entry. Strengthen with position-dependent LUT (PDLUT): out[i] = LUT[x,y][in[i]].
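The sketch below shows the known-pair attack on a position-independent LUT, assuming 8-bit channel data. `RecoveredLut` and `recover_lut` are hypothetical names. Entries for input values that never occur in the known image stay unknown, which is why a pair covering all 256 values is the worst case for the defender.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RecoveredLut {
    std::array<uint8_t, 256> table{};  // recovered entries
    std::array<bool, 256> known{};     // true where an entry was observed
};

// Each pixel reveals one table entry: lut[original[i]] == scrambled[i].
RecoveredLut recover_lut(const std::vector<uint8_t>& original,
                         const std::vector<uint8_t>& scrambled) {
    assert(original.size() == scrambled.size());
    RecoveredLut r;
    for (size_t i = 0; i < original.size(); ++i) {
        r.table[original[i]] = scrambled[i];
        r.known[original[i]] = true;
    }
    return r;
}
```

With a PDLUT the same pixel only reveals an entry of the table at its own position, so the attacker needs far more pairs to fill every table.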
Polynomial (session-key derived): a degree-n polynomial needs n+1 known pairs to break, but the coefficients rotate with the session.
Key-dependent color matrix: M derived from session_key via HMAC/KDF. Linear - vulnerable to statistical analysis.
Recommended Multi-Layer Stack
Input tensor (HWC, float32)
↓
[1] Channel permutation on intermediate channels (64ch)
↓
[2] Custom conv layer (scrambled weights from session_key)
↓
[3] Pixel-level spatial permutation (seed = HMAC(session_key, "spatial"))
↓
[4] Position-dependent polynomial transform (coefficients from session_key)
↓
[5] XOR with key-derived pseudorandom stream (AES-CTR)
↓
Scrambled output (HWC, float32)
Attacker must break ALL layers in correct order.
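The ordering constraint applies to the converter too: it has to apply the inverse layers in reverse order. The sketch below illustrates this with two toy layers (a fixed reversal and a position-dependent scale) that stand in for the real ones. `Layer`, `scramble`, `unscramble` and both layer factories are hypothetical names, not the production stack.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Tensor = std::vector<float>;

// One scramble layer: a forward transform and its exact inverse.
struct Layer {
    std::function<Tensor(const Tensor&)> forward;
    std::function<Tensor(const Tensor&)> inverse;
};

// Applies the layers first to last.
Tensor scramble(Tensor t, const std::vector<Layer>& stack) {
    for (const Layer& layer : stack) {
        assert(layer.forward && layer.inverse);
        t = layer.forward(t);
    }
    return t;
}

// Applies the inverses last to first.
Tensor unscramble(Tensor t, const std::vector<Layer>& stack) {
    for (auto it = stack.rbegin(); it != stack.rend(); ++it) t = it->inverse(t);
    return t;
}

// Toy spatial layer: reverses element order (its own inverse).
Layer reverse_layer() {
    auto rev = [](const Tensor& t) { return Tensor(t.rbegin(), t.rend()); };
    return Layer{rev, rev};
}

// Toy position-dependent value layer: y[i] = x[i] * 2^(i % 3).
Layer position_scale_layer() {
    auto scale = [](size_t i) { return static_cast<float>(1u << (i % 3)); };
    auto fwd = [scale](const Tensor& t) {
        Tensor out(t);
        for (size_t i = 0; i < out.size(); ++i) out[i] *= scale(i);
        return out;
    };
    auto inv = [scale](const Tensor& t) {
        Tensor out(t);
        for (size_t i = 0; i < out.size(); ++i) out[i] /= scale(i);
        return out;
    };
    return Layer{fwd, inv};
}
```

Because the value layer depends on position, it does not commute with the spatial layer. Applying the two inverses in the forward order yields a different tensor from the original.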
Session-Key Dependent Formula
Parameterized Inversion (RECOMMENDED)
session_key → HKDF-SHA256 → scramble parameters:
- spatial_permutation_seed (32 bytes)
- channel_permutation_seed (32 bytes)
- polynomial_coefficients (N * float64)
- xor_stream_key (32 bytes)
- affine_scale (per channel)
- affine_offset (per channel)
Server issues session_key at authentication (ECDH key exchange). Each session gets its own key, valid for 1h-1d at most. Converter derives same parameters and computes inverse.
Security: even if attacker extracts converter binary, without session_key the formula is useless. Compromise = only one session exposed.
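XOR Stream (AES-CTR CSPRNG)
// AES_CTR_Context, aes_ctr_init and aes_ctr_next stand for the project's AES-CTR wrapper
// (not a specific library). initial_counter is the starting nonce/counter block and is
// assumed to be supplied by the caller.
void generate_scramble_stream(const uint8_t* session_key, size_t key_len,
                              const uint8_t* initial_counter,
                              float* stream, size_t num_pixels) {
    (void)key_len;  // unused unless the wrapper needs the key length
    AES_CTR_Context ctx;
    aes_ctr_init(&ctx, session_key, initial_counter);
    for (size_t i = 0; i < num_pixels * 3; i++) {  // 3 channels per pixel
        uint32_t rand_bits;
        aes_ctr_next(&ctx, &rand_bits, 4);  // next 4 bytes of keystream
        stream[i] = (float)(rand_bits) / UINT32_MAX * 2.0f - 1.0f;  // map to [-1, 1]
    }
}
// Scramble: output[i] = clamp(model_output[i] + stream[i] * strength)
// Unscramble: original[i] = output[i] - stream[i] * strength
Despite the section name, this layer adds a float stream rather than XOR-ing bits. Pure additive stream is weak alone - trivially inverted with one known pair. Use only as Layer 5 on top of nonlinear scramble. The clamp also makes the step lossy wherever it saturates, so the clamp range must cover model_output ± strength for the unscramble to be exact.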
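The weakness is easy to demonstrate. The sketch below assumes no clamping occurred and that the same key and counter were reused for a second frame. `recover_mask` and `remove_mask` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recovers the effective mask (stream[i] * strength) from one known pair:
// mask[i] = scrambled[i] - original[i].
std::vector<float> recover_mask(const std::vector<float>& scrambled,
                                const std::vector<float>& original) {
    assert(scrambled.size() == original.size());
    std::vector<float> mask(scrambled.size());
    for (size_t i = 0; i < mask.size(); ++i) mask[i] = scrambled[i] - original[i];
    return mask;
}

// Strips a recovered mask from another output produced with the same stream.
std::vector<float> remove_mask(const std::vector<float>& scrambled,
                               const std::vector<float>& mask) {
    assert(scrambled.size() == mask.size());
    std::vector<float> out(scrambled.size());
    for (size_t i = 0; i < out.size(); ++i) out[i] = scrambled[i] - mask[i];
    return out;
}
```

The attacker never needs the key: the mask alone unscrambles every later frame that reuses the stream, which is why the counter must not repeat within a session.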
Converter Binary Obfuscation
The converter is the critical attack target. If broken, entire scheme is compromised.
LLVM Obfuscation (Baseline - Required)
Tools: Obfuscator-LLVM (o-LLVM), Hikari (o-LLVM fork), commercial: Irdeto Cloakware, Arxan, Promon.
Techniques:
1. Control Flow Flattening (CFF) - all basic blocks in one switch inside loop; state variable controls order
2. Bogus Control Flow - fake branches with opaque predicates (always true/false, hard to analyze statically)
3. Instruction Substitution - a+b → a-(-b), x^y → (x&~y)|(~x&y)
4. String Encryption - decrypt strings at runtime
5. Constant Encoding - replace numeric constants with computed equivalents
Against this: SATURN (LLVM IR deobfuscation), Triton (symbolic execution). CFF is not a silver bullet.
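For intuition, the substitutions and an opaque predicate from the list above can be written out by hand. This is only an illustration with hypothetical function names: real obfuscators apply such rewrites automatically at the IR level, and an optimizing compiler may fold these source-level versions back to the simple form.

```cpp
#include <cassert>
#include <cstdint>

// a + b rewritten as a - (-b). Unsigned arithmetic wraps, so this holds for all inputs.
uint32_t add_substituted(uint32_t a, uint32_t b) { return a - (0u - b); }

// x ^ y rewritten as (x & ~y) | (~x & y).
uint32_t xor_substituted(uint32_t x, uint32_t y) { return (x & ~y) | (~x & y); }

// Opaque predicate: x * (x + 1) is a product of consecutive integers, so it is
// always even, even after 32-bit wraparound.
bool opaque_true(uint32_t x) { return ((x * (x + 1u)) & 1u) == 0u; }

// Bogus control flow: the second return is dead code that a static analyzer
// must prove unreachable.
uint32_t guarded_add(uint32_t a, uint32_t b) {
    assert(opaque_true(a));  // debug-build check that the predicate never fails
    if (opaque_true(a)) return add_substituted(a, b);
    return a * b + 7u;  // never executed
}
```

Symbolic execution can prove identities like these, which is why the note treats such transforms as a baseline rather than a complete defense.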
VM Protection (for Critical Functions)
VMProtect / Themida: the critical function is compiled to custom bytecode for a custom interpreter, so the attacker sees only the interpreter and a bytecode blob. Current state (2025-2026): no public universal devirtualizer for recent VMProtect; a qualified reverse engineer needs weeks to months.
Recommended stack:
Required (baseline):
├── LLVM obfuscation (CFF + bogus flow + string encryption)
├── Anti-debug checks (IsDebuggerPresent, ptrace, timing)
├── Integrity checks (code signing + self-checksumming)
└── Stripped symbols
Recommended (serious protection):
├── VMProtect for inverse-scramble function
├── White-box crypto for key derivation
└── Junk/dead code injection
Optional (maximum):
├── NN as one unscramble layer
├── Split into 2 processes (GPU unscramble + CPU finalize)
└── Custom VM for key management
White-box crypto reality: White-box AES has been broken repeatedly in the academic literature, notably by DCA (Differential Computation Analysis). Commercial WBC implementations hold up better, largely through security-through-obscurity. Useful as one layer, not as the sole defense.
Process Split
Process A: inverse_step_1(scrambled) → intermediate_1
Process B: inverse_step_2(intermediate_1) → normal_image
Max 2 processes - IPC overhead and complexity grow fast.
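IPC Security (Engine → Converter)
Shared Memory + Encryption (RECOMMENDED)
// Windows: random GUID segment name (generate_uuid() is a project helper returning std::wstring).
// "Local\\" is used because creating objects under "Global\\" from a normal user session
// requires SeCreateGlobalPrivilege; both processes run in the same session.
std::wstring segName = L"Local\\" + generate_uuid();
HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
    PAGE_READWRITE, 0, (DWORD)tensor_size, segName.c_str());
if (hMap == NULL) { /* handle GetLastError() */ }
void* ptr = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, tensor_size);
if (ptr == NULL) { /* handle GetLastError() */ }
VirtualLock(ptr, tensor_size); // prevent swap to disk
// macOS/Linux: anonymous, unlinked immediately
int fd = shm_open("/tensor_temp", O_CREAT | O_EXCL | O_RDWR, 0600); // O_EXCL: fail if the name already exists
if (fd < 0) { /* handle errno */ }
ftruncate(fd, tensor_size);
void* ptr = mmap(NULL, tensor_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (ptr == MAP_FAILED) { /* handle errno */ }
mlock(ptr, tensor_size); // prevent swap to disk
shm_unlink("/tensor_temp"); // name removed, segment lives while fd open; hand the fd to the converter
Encrypt data in shared memory with ChaCha20-Poly1305 and use random segment names per session. Budget ~0.3ms for 12MB only with multi-threaded SIMD, since that rate is about 40 GB/s; a single core typically sustains a few GB/s, which means several ms for 12MB. Benchmark on the target CPU.
Alternatives:
- Named pipes: ~5-15ms for 12MB - too slow.
- In-process DLL: weakest - debugger on main process sees everything.
With any transport, a debugger attached to the converter still sees decrypted data.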
Performance Benchmarks
CPU (Intel i5-10400, 6-core, 2.9 GHz):
| Operation | 1024×1024 | 2048×2048 | Parallelizable |
|---|---|---|---|
| Channel permutation (3ch) | ~0.01ms | ~0.04ms | Trivially (memcpy) |
| Channel permutation (64ch) | ~0.05ms | ~0.2ms | Per-channel |
| Spatial pixel permutation | ~0.5ms | ~2ms | Per-row |
| Affine transform (per-pixel) | ~0.1ms | ~0.4ms | SIMD AVX2 |
| Polynomial (degree 4) | ~0.3ms | ~1.2ms | SIMD AVX2 |
| LUT per channel | ~0.2ms | ~0.8ms | Cache-friendly |
| XOR key stream (AES-CTR) | ~0.3ms | ~1.2ms | AES-NI |
| Full multi-layer scramble | ~2-3ms | ~8-12ms | |
GPU (CUDA):
| Operation | 1024×1024 | 2048×2048 |
|---|---|---|
| Full scramble | ~0.1-0.3ms | ~0.3-0.8ms |
| + GPU→CPU copy | ~0.5ms | ~2ms |
Total overhead: ~3-5ms for typical 1024×1024. Against 50-200ms inference that is roughly 1.5-10%, and 5% or less once inference takes at least 100ms.
Known Attacks
Jigsaw Puzzle Solving
Block-level scrambling = jigsaw puzzle. Algorithms exploit block boundary statistics.
- 16×16 blocks: 70-90% order recovery with sufficient blocks
- 8×8 blocks: 40-60%
- Pixel-level: immune (no block boundaries)
Known-Plaintext Attack (KPA) - Most Dangerous
Attacker obtains (scrambled_output, original_image) pair.
Research findings:
- All permutation-only ciphers broken from O(log2(N)) pairs, where N is the number of permuted elements (arxiv, 2017)
- Pixel permutation 1024×1024: log2(1024×1024) = 20, so ~20 pairs for full recovery
- Affine transform: 2 pairs per channel
- LUT (position-independent): 1 pair
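The log2(N) figure is easiest to see in the chosen-plaintext variant sketched below, where the attacker picks the inputs. Probe image k carries bit k of every element's index, so each scrambled output leaks one bit of the source index per position. `recover_permutation` and the `scramble` oracle are hypothetical names. The convention assumed is out[i] = in[perm[i]]. A known-plaintext attack on natural images works on the same principle, but it narrows candidate sets instead of reading bits directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Recovers perm (with out[i] = in[perm[i]]) from ceil(log2(n)) chosen inputs.
// `scramble` is any callable mapping a vector<uint8_t> to its permuted copy.
template <typename ScrambleFn>
std::vector<uint32_t> recover_permutation(size_t n, ScrambleFn scramble) {
    std::vector<uint32_t> perm(n, 0u);
    for (unsigned bit = 0; (size_t{1} << bit) < n; ++bit) {
        // Probe image: element i holds bit `bit` of its own index.
        std::vector<uint8_t> probe(n);
        for (size_t i = 0; i < n; ++i) probe[i] = static_cast<uint8_t>((i >> bit) & 1u);
        std::vector<uint8_t> out = scramble(probe);
        assert(out.size() == n);
        // Position j now shows bit `bit` of its source index perm[j].
        for (size_t j = 0; j < n; ++j) perm[j] |= static_cast<uint32_t>(out[j] & 1u) << bit;
    }
    return perm;
}
```

For n = 1024 × 1024 the loop runs 20 times, which matches the pair count quoted above. A value transform applied on top of the permutation is what stops the bits from being read off this directly.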
Critical: if session_key doesn't rotate between runs, one intercepted pair may suffice for complete break.
Statistical Analysis
- Channel permutation preserves histogram per channel
- Spatial permutation preserves global histogram
- Defense: nonlinear value transform (LUT/polynomial) changes histogram
Gotchas
- Permutation-only scramble breaks with about log2(N) known-plaintext pairs (~20 for 1024×1024), so enlarging the permutation barely helps. Always combine with value transformation.
- Session key must rotate per session - not per user, not per day. One session = one key. Compromised key = one session exposed, not all.
- AES-CTR stream by itself is trivially invertible once attacker has one (scrambled, original) pair. Use only as final layer after nonlinear transforms.
- GPU scramble in ONNX custom op requires the op to be registered before session creation. If session is already created without the custom op, you must recreate it.
- Block-level scramble is useless for anti-piracy - modern jigsaw solvers (2022-2023) recover 70%+ of content from 16x16 blocks. Use pixel-level only.
- Learned scramble heads (NN) are theoretically elegant but lossy (~0.1-0.5 LSB errors) and vulnerable to approximation if attacker has enough pairs with known keys.
- VMProtect significantly slows down the protected function - 10-50x overhead. Profile before protecting hot paths.