by Oracle-42 Intelligence
Executive Summary: A newly disclosed class of memory corruption vulnerabilities in PyTorch’s core inference engine enables attackers to hijack model execution, achieve arbitrary code execution (ACE) in trusted AI environments, and exfiltrate sensitive data from GPU memory. These flaws, collectively tracked as PyTorch Inference Memory Exploits (PIME-2026), stem from unchecked buffer operations during tensor deserialization, out-of-bounds writes in CUDA kernels, and race conditions in asynchronous memory allocators. Patch coverage remains inconsistent across PyTorch 2.2.x through 2.5.x, leaving cloud-based AI services, embedded edge devices, and research clusters exposed. This analysis outlines the technical root causes, exploitation paths, and remediation strategies.
torch::jit::load().
copy_kernel fails to validate tensor strides, enabling write-what-where primitives in shared GPU memory.
The torch::jit::deserialize() function processes serialized model metadata without enforcing bounds on tensor dimension arrays. An attacker can embed a tensor with 2^31-1 elements. Multiplying that element count by the element size to compute the allocation size causes a signed integer overflow. The result is a heap chunk of zero or negative size, which leads to a classic write-what-where condition during subsequent memory copy operations.
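Illustrative snippet of the malicious tensor metadata:
# Malicious .pt file generated via custom ONNX export
tensor_meta = {
    "dims": [0x7FFFFFFF, 1, 1],  # 2^31-1 elements; forces the overflow in PyTorch 2.2.x
    "data": b"\x00" * 0x10000,   # 64 KiB of data, far smaller than the declared shape
}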
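The overflow arithmetic can be reproduced outside PyTorch. The sketch below is not PyTorch's allocation code. It assumes a 4-byte element type (e.g. float32) and a byte count computed in 32-bit signed arithmetic, and the helper names (wrap_int32, naive_alloc_size, checked_alloc_size) are hypothetical. It shows how the declared shape wraps to a negative size and how a checked computation rejects the shape before any allocation happens.

```python
# Illustrative only: emulates a size computation done in 32-bit signed
# arithmetic. This is not PyTorch source; the helper names are hypothetical.
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Truncate a Python int to a two's-complement signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def naive_alloc_size(dims, elem_size):
    """Byte count computed with 32-bit signed wraparound after every multiply."""
    size = elem_size
    for d in dims:
        size = wrap_int32(size * d)
    return size


def checked_alloc_size(dims, elem_size, max_bytes=INT32_MAX):
    """Byte count computed with unbounded ints, rejecting bad or oversized shapes."""
    size = elem_size
    for d in dims:
        if d < 0:
            raise ValueError(f"negative dimension {d}")
        size *= d
        if size > max_bytes:
            raise ValueError(f"{size} bytes exceeds limit of {max_bytes}")
    return size


# Assumption: 4-byte elements (e.g. float32) for the dims used in tensor_meta.
print(naive_alloc_size([0x7FFFFFFF, 1, 1], 4))  # -4: the size wrapped negative
try:
    checked_alloc_size([0x7FFFFFFF, 1, 1], 4)
except ValueError as err:
    print("rejected:", err)
```

The checked version does the multiplication in Python's unbounded integers and compares against an explicit limit, so the overflow is detected rather than silently wrapped.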
copy_kernel
During inference, PyTorch invokes copy_kernel to move tensors between GPU and host. A race condition arises when at::TensorImpl metadata is updated by one thread while another thread continues to dereference a stale pointer. An attacker can manipulate the tensor stride array to redirect memory writes into the storage_ buffer of another tensor, bypassing sandbox restrictions.
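One defensive check against stride tampering is to verify, before any copy, that every element a strided view can address lies inside its backing storage. The helpers below (strided_extent and view_fits_storage are hypothetical names, not PyTorch APIs) compute the lowest and highest reachable element index from sizes, strides, and storage offset. Because the flaw described here is a race, such a check only helps if it runs on a private snapshot of the metadata that cannot change between validation and use.

```python
def strided_extent(sizes, strides, storage_offset):
    """Lowest and highest storage element index a strided view can reach, or None if empty."""
    if any(size == 0 for size in sizes):
        return None  # an empty view touches no elements
    low = high = storage_offset
    for size, stride in zip(sizes, strides):
        span = (size - 1) * stride
        if span >= 0:
            high += span  # positive strides extend the upper bound
        else:
            low += span   # negative strides extend the lower bound
    return low, high


def view_fits_storage(sizes, strides, storage_offset, storage_numel):
    """True if every element the view can address lies inside [0, storage_numel)."""
    if len(sizes) != len(strides) or any(size < 0 for size in sizes):
        return False
    extent = strided_extent(sizes, strides, storage_offset)
    if extent is None:
        return 0 <= storage_offset <= storage_numel
    low, high = extent
    return low >= 0 and high < storage_numel


print(view_fits_storage([2, 3], [3, 1], 0, 6))     # contiguous 2x3 view: fits
print(view_fits_storage([2, 3], [1000, 1], 0, 6))  # tampered outer stride: rejected
```

A 2x3 contiguous view (strides [3, 1]) reaches element 5 at most and fits a 6-element storage; the same view with an outer stride of 1000 reaches element 1002 and is rejected.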
PyTorch 2.3+ can use the cudaMallocAsync allocator backend for better GPU utilization. However, the allocator fails to isolate user-controlled tensors from system-managed buffers. An attacker who achieves heap corruption can overwrite the cudaMemPool_t handle, redirecting subsequent allocations to attacker-controlled host memory pages. This enables exfiltration of model weights, user inputs, or even host credentials via side-channel reads.
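To find which deployments are exposed to the allocator issue, operators can audit how the CUDA allocator is selected. PyTorch reads the PYTORCH_CUDA_ALLOC_CONF environment variable, where the option backend:cudaMallocAsync opts into the asynchronous allocator and the native caching allocator is used otherwise; recent releases also report the active choice through torch.cuda.get_allocator_backend(). The sketch below is a hypothetical, dependency-free helper (requested_allocator_backend is not a PyTorch API) that parses the variable so it can run in configuration audits without importing torch.

```python
import os


def requested_allocator_backend(conf=None):
    """Return the CUDA allocator backend requested by PYTORCH_CUDA_ALLOC_CONF.

    The option string is a comma-separated list of key:value pairs; when no
    backend key is present, PyTorch uses its native caching allocator.
    """
    if conf is None:
        conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    backend = "native"
    for option in conf.split(","):
        key, sep, value = option.partition(":")
        if sep and key.strip() == "backend":
            backend = value.strip()
    return backend


if __name__ == "__main__":
    print("allocator backend requested:", requested_allocator_backend())
```

Parsing the string rather than calling into torch keeps the audit cheap and safe to run on hosts where loading CUDA is undesirable; the result reflects the requested setting, not proof of what a running process actually uses.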
An attacker generates a TorchScript model with manipulated tensor metadata. The model is exported via a patched ONNX runtime that omits sanity checks on dimension ranges. The resulting .pt file contains a payload that triggers the buffer overflow during torch.jit.load().
The malicious model is uploaded to an exposed inference endpoint (e.g., TorchServe REST API). The load_model() handler calls torch::jit::load(), invoking the vulnerable parser. The overflow corrupts internal heap metadata, allowing controlled overwrite of function pointers in the PyTorch runtime’s global object table.
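Because the overflow fires inside the parser, an endpoint that accepts uploaded models can at least reject obviously malformed archives before torch.jit.load() runs. TorchScript .pt files are zip archives, so the sketch below uses only Python's zipfile module to confirm an upload is a zip, that member paths do not escape the archive, and that the declared uncompressed size stays within a budget. The function name and the 2 GiB budget are hypothetical. Declared sizes are themselves attacker-controlled and the check does not inspect tensor metadata, so it complements patching and sandboxing rather than replacing them.

```python
import io
import zipfile

MAX_UNCOMPRESSED_BYTES = 2 * 1024**3  # hypothetical 2 GiB budget per model archive


def precheck_model_archive(data: bytes, max_bytes: int = MAX_UNCOMPRESSED_BYTES) -> list:
    """Return a list of problems found in an uploaded model archive (empty if none)."""
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return ["not a zip archive"]
    buf.seek(0)
    problems = []
    with zipfile.ZipFile(buf) as zf:
        total = 0
        for info in zf.infolist():
            parts = info.filename.split("/")
            if info.filename.startswith("/") or ".." in parts:
                problems.append(f"suspicious member path: {info.filename}")
            total += info.file_size  # size declared in the archive, not verified
        if total > max_bytes:
            problems.append(f"declared uncompressed size {total} exceeds {max_bytes}")
    return problems
```

A handler would call precheck_model_archive on the raw upload and refuse to pass the file to torch.jit.load() unless the returned list is empty.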
After corrupting the heap, the attacker redirects a virtual function call in TensorImpl::resize_() to a ROP chain stored in GPU constant memory. The chain disables sandboxing by patching cudaDeviceGetLimit() and allocates a new CUDA context with elevated privileges. This grants shell access to the inference container.
Using the elevated context, the attacker relies on unified addressing (the capability reported by CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING) to map host process memory. Sensitive data, such as API keys, user prompts, or model gradients, is copied into a tensor and returned in the inference response, then exfiltrated via DNS tunneling or covert HTTP channels.
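Since the stolen data leaves through the inference response, an egress filter can look for encoded blobs that a normal response would not contain. The sketch below is a heuristic under stated assumptions: it treats any run of at least 64 base64 characters that decodes cleanly to 48 or more bytes as suspicious. The function name and both thresholds are hypothetical, and legitimate payloads such as base64 embeddings or long hex digests can trigger it, so hits should feed an alert rather than block traffic outright.

```python
import base64
import binascii
import re

# Runs of base64-alphabet characters long enough to carry a meaningful dump.
B64_RUN = re.compile(rb"[A-Za-z0-9+/]{64,}={0,2}")


def find_base64_blobs(payload: bytes, min_decoded: int = 48) -> list:
    """Return decoded sizes of base64 runs in an outbound payload that decode cleanly."""
    hits = []
    for match in B64_RUN.finditer(payload):
        try:
            decoded = base64.b64decode(match.group(0), validate=True)
        except binascii.Error:
            continue  # not valid base64 (e.g. bad padding); ignore this run
        if len(decoded) >= min_decoded:
            hits.append(len(decoded))
    return hits


blob = base64.b64encode(b"\x00" * 96)
print(find_base64_blobs(b'{"result": "' + blob + b'"}'))
```

Reporting decoded sizes rather than contents keeps the filter from copying potentially sensitive data into logs.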
Oracle-42 has observed active exploitation of PIME-2026 in three major cloud AI platforms:
Upgrade PyTorch to 2.5.1 and CUDA to 12.5, and apply the backported fixes to torch::jit::deserialize() and copy_kernel.
--gpus all --shm-size 0 --ulimit memlock=-1 to disable Unified Memory sharing.
libtorch.so and CUDA driver calls.
tract-onnx) for models with untrusted provenance.
Oracle-42 Intelligence has released YARA rules and Sigma queries to detect PIME-2026 exploitation. Key IOCs include:
torch.jit.load() calls with large tensor dimensions.
libcudart.so via ASan logs.
Network telemetry should monitor DNS exfiltration to .onion addresses and HTTP POST requests containing base64-encoded tensor dumps.
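The DNS indicators can be turned into a simple resolver-log filter. A .onion name should never reach ordinary DNS, and tunneling tools tend to pack data into long, high-entropy labels. The sketch below flags both; the function names, the 40-character label limit, and the 3.5 bits-per-character entropy threshold are hypothetical tuning choices, and CDN or hash-based hostnames can produce false positives.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Bits per character of the empirical character distribution of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def flag_dns_query(qname: str, max_label: int = 40, entropy_threshold: float = 3.5) -> list:
    """Return reasons a queried name looks like .onion leakage or DNS tunneling."""
    reasons = []
    name = qname.rstrip(".").lower()
    if name.endswith(".onion"):
        reasons.append("onion name sent to DNS")
    longest = max(name.split("."), key=len)
    if len(longest) > max_label:
        reasons.append(f"label of {len(longest)} chars")
        if shannon_entropy(longest) > entropy_threshold:
            reasons.append("high-entropy label")
    return reasons


print(flag_dns_query("www.example.com."))
print(flag_dns_query("abcdefghijklmnop.onion"))
```

Entropy is only checked on labels that are already unusually long, which keeps short but random-looking hostnames from being flagged on entropy alone.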
Memory safety issues in AI inference engines are expected to grow as models increase in complexity and deployment scale. PyTorch’s reliance on C