Executive Summary
In April 2026, Oracle-42 Intelligence uncovered Operation Phantom Bloom, a strategic shift by the Chinese state-sponsored cyberespionage group APT10 (also tracked as Stone Panda or Red Apollo) to compromise AI development pipelines by injecting malicious code and poisoned datasets into public GitHub repositories. This operation represents a significant escalation in the weaponization of open-source AI ecosystems, aiming to infiltrate downstream AI models, including those used in critical infrastructure, defense, and enterprise decision-making systems. Unlike traditional supply-chain attacks that target software dependencies, Operation Phantom Bloom targets the foundational data layer of AI systems—training datasets—exploiting the trust placed in open-source contributions. Early indicators suggest that compromised repositories are being used to seed AI models with backdoored or adversarial samples, enabling long-term persistence and covert influence over model behavior.
Key Findings
- Poisoned datasets include ImageNet-22k, LAION-5B, and domain-specific collections (e.g., medical imaging, satellite imagery).
- Tooling observed in the operation overlaps with APT10's ChChes malware and RedLeaves backdoors, suggesting reuse of tactics, techniques, and procedures (TTPs) refined over a decade.

Background
As AI adoption accelerates across industries, the attack surface expands with it. Traditional cybersecurity models, designed for software supply chains, are ill-equipped to address threats targeting data integrity, the lifeblood of AI systems. Unlike software packages, which can be cryptographically signed and version-controlled, AI datasets are often massive, loosely curated, and frequently ingested from untrusted sources. This opacity creates ideal conditions for data poisoning attacks, where adversaries manipulate training data to induce misclassification, bias, or backdoor behavior in trained models.
APT10’s pivot reflects a broader strategic realignment within Chinese state cyber operations. Following increased scrutiny of its traditional cyberespionage and intellectual property theft operations, APT10 has shifted focus toward strategic technological dominance, particularly in AI, quantum computing, and biotechnology. By compromising AI training pipelines, APT10 aims not only to steal models but to embed long-term influence into systems that will shape global decision-making for years to come.
APT10 operators begin by identifying high-impact AI projects on GitHub with large, active contributor bases. They create fake contributor personas—often mimicking researchers from reputable institutions—and submit pull requests that appear legitimate. These contributions may include minor bug fixes or dataset updates, establishing credibility over time. In some cases, attackers compromise existing maintainer accounts via phishing or credential theft.
In parallel, APT10 exploits vulnerabilities in GitHub Actions workflows, such as insecure YAML configurations or unprotected secrets, to execute malicious CI/CD pipelines. These pipelines may inject adversarial samples during automated build processes or exfiltrate dataset metadata for later analysis.
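As a concrete defensive illustration, the sketch below (Python, using PyYAML) scans a repository's workflow files for one well-documented risky pattern of this kind: a pull_request_target trigger combined with a checkout of the pull request's own head, which runs untrusted code with access to repository secrets. The glob paths and string heuristics are illustrative assumptions, not a complete audit.

```python
"""Heuristic audit for one risky GitHub Actions pattern:
`pull_request_target` combined with checking out the PR head.
Paths and heuristics are illustrative, not exhaustive."""
import sys
from pathlib import Path

import yaml  # PyYAML


def risky_workflows(repo_root: str):
    findings = []
    for wf_path in Path(repo_root).glob(".github/workflows/*.y*ml"):
        wf = yaml.safe_load(wf_path.read_text()) or {}
        # PyYAML (YAML 1.1) parses the bare key `on:` as boolean True.
        triggers = wf.get("on", wf.get(True, {}))
        if isinstance(triggers, str):
            triggers = [triggers]
        if "pull_request_target" not in triggers:
            continue
        for job in (wf.get("jobs") or {}).values():
            for step in job.get("steps", []):
                ref = str((step.get("with") or {}).get("ref", ""))
                if "pull_request.head" in ref:
                    findings.append((wf_path, ref))
    return findings


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, ref in risky_workflows(root):
        print(f"[!] {path}: pull_request_target checks out untrusted ref {ref}")
```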
Once access is secured, attackers modify training datasets by:
- Injecting adversarial samples crafted to induce targeted misclassification;
- Mislabeling a small fraction of existing samples;
- Embedding trigger patterns that activate backdoor behavior in trained models.
These modifications are often subtle, ensuring that dataset statistics (e.g., mean, variance) remain plausible to automated validators. In one observed case, APT10 injected 1,200 adversarial samples into a medical imaging dataset used for tumor detection, with a success rate of 87% in inducing false negatives during model inference.
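The synthetic demonstration below illustrates why aggregate checks pass: perturbing 1,200 rows out of 100,000 with small zero-mean noise leaves the global mean and variance essentially unchanged. All numbers except the 1,200-sample count (taken from the observed case) are invented for the demo.

```python
"""Synthetic illustration: 1,200 subtly perturbed rows in a
100,000-sample dataset barely move aggregate statistics, so a
validator that only checks global mean/variance accepts both versions.
Dataset size, dimensionality, and noise scale are invented."""
import numpy as np

rng = np.random.default_rng(0)

clean = rng.normal(loc=0.45, scale=0.22, size=(100_000, 64))
poisoned = clean.copy()
idx = rng.choice(len(poisoned), size=1_200, replace=False)
poisoned[idx] += rng.normal(scale=0.03, size=(1_200, 64))  # subtle shift

for name, data in (("clean", clean), ("poisoned", poisoned)):
    print(f"{name:9s} mean={data.mean():.5f}  var={data.var():.5f}")
# Both lines agree to roughly four decimal places, well inside any
# realistic tolerance an automated validator would apply.
```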
The poisoned datasets are then distributed through GitHub releases, Docker images, or directly via pip or conda packages that depend on them. AI practitioners often treat datasets as immutable artifacts, leading to long-term propagation of contaminated data across multiple models and organizations.
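One low-cost countermeasure is to make that immutability assumption verifiable. The sketch below, assuming a hypothetical JSON manifest that maps artifact paths to pinned SHA-256 digests, refuses to proceed if any dataset artifact no longer matches the digest recorded at curation time.

```python
"""Minimal sketch of treating datasets as verifiable, not merely
immutable. The manifest format and file names are illustrative
assumptions, e.g. {"data/train.tar": "9f86d0..."}."""
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: str) -> None:
    manifest = json.loads(Path(manifest_path).read_text())
    for rel_path, pinned in manifest.items():
        actual = sha256_file(Path(rel_path))
        if actual != pinned:
            raise RuntimeError(f"{rel_path}: digest {actual} != pinned {pinned}")
    print(f"verified {len(manifest)} artifacts")


if __name__ == "__main__":
    verify_manifest("dataset.manifest.json")  # hypothetical manifest file
```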
APT10 also leverages Git LFS to exfiltrate metadata about dataset usage, identifying organizations that have cloned or retrained on the poisoned data. This intelligence informs follow-on operations, including targeted spear-phishing and model extraction attacks.
Poisoning datasets requires no advanced AI expertise—only access to the data pipeline and the ability to manipulate files. The open nature of GitHub and the reliance on third-party datasets make this attack vector both accessible and devastating.
Unlike software backdoors, which can be patched, poisoned training data persists across model retraining cycles. Even if a dataset is later cleaned, the poisoned version may already have influenced numerous downstream models, creating an inherited vulnerability that propagates through the AI supply chain.
Traditional security tools are blind to data integrity issues. Static analysis of code won’t detect mislabeled images, and dynamic analysis of model behavior is often too late. Detection typically requires statistical auditing of dataset distributions, which is rarely performed in practice.
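For illustration, one plausible form of such an audit is sketched below: compare each class's feature distribution in a freshly pulled dataset against a trusted earlier snapshot using a two-sample Kolmogorov-Smirnov test. The per-class feature arrays and the significance threshold are assumptions; the point is the shape of the check, not the specific statistic.

```python
"""Sketch of a per-class distribution audit: flag classes whose feature
distributions in a candidate dataset drift from a trusted snapshot.
Feature arrays are assumed to be shaped (n_samples, n_features)."""
import numpy as np
from scipy.stats import ks_2samp


def audit_per_class(reference: dict[int, np.ndarray],
                    candidate: dict[int, np.ndarray],
                    alpha: float = 1e-3) -> list[int]:
    """Return labels of classes whose distributions look implausible."""
    flagged = []
    for label, ref_feats in reference.items():
        cand_feats = candidate.get(label)
        if cand_feats is None:
            flagged.append(label)  # class disappeared entirely
            continue
        # Test every feature dimension; one implausible shift is enough
        # to send the whole class for manual review.
        for dim in range(ref_feats.shape[1]):
            result = ks_2samp(ref_feats[:, dim], cand_feats[:, dim])
            if result.pvalue < alpha:
                flagged.append(label)
                break
    return flagged
```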
By compromising AI systems used in defense, logistics, or energy, APT10 can create covert decision-making pathways that favor Chinese geopolitical interests. For example, a poisoned satellite imagery model could misclassify military movements, while a biased hiring AI could favor certain demographics.
Recommendations
- Adopt Data Version Control (DVC) or Delta Lake to maintain immutable records of dataset lineage and transformations.
- Run tools such as CleanLab or CleanVision to detect mislabeled or anomalous samples before training.
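As a minimal sketch of the CleanLab recommendation above (the input files features.npy and labels.npy are placeholders, and any classifier exposing predict_proba would do), the example below derives out-of-sample predicted probabilities via cross-validation and asks cleanlab to rank suspect labels:

```python
"""Minimal label-quality audit with cleanlab. Input arrays are
placeholder assumptions; the classifier choice is interchangeable."""
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

features = np.load("features.npy")   # hypothetical pre-extracted embeddings
labels = np.load("labels.npy")       # integer class labels

# Out-of-sample probabilities are required so the model cannot simply
# memorize the (possibly poisoned) labels it is being audited against.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    features, labels, cv=5, method="predict_proba",
)

suspect_idx = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(suspect_idx)} samples flagged for manual review")
```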