下载

检测数据和模型投毒攻击

v20260622

detecting-data-and-model-poisoning

本技能旨在贯穿整个机器学习生命周期，检测数据投毒和模型后门攻击。它通过检查数据来源完整性、识别标签错误（使用Cleanlab），并利用激活聚类和谱特征分析（ART）来检测模型中的恶意触发器或被篡改的权重，保障ML供应链安全。

AI安全数据投毒模型后门 ML供应链对抗鲁棒性数据完整性机器学习网络安全

获取技能

397 次下载

概览

Detecting Data and Model Poisoning

Authorized-use-only notice: This skill includes routines that craft poisoned samples and backdoor triggers for defensive validation. Generate and use poisoned data and backdoored models only in isolated test environments you control. Never deploy a backdoored model or distribute poisoned datasets.

Overview

Data poisoning and model backdooring attack the integrity of an ML system at training time rather than at inference. In data poisoning (MITRE ATLAS AML.T0020 Poison Training Data), an adversary injects manipulated samples into the training, fine-tuning, or RAG corpus so the resulting model misbehaves — degraded accuracy, targeted misclassification, or an attacker-chosen bias. In model backdooring (MITRE ATLAS AML.T0018 Backdoor ML Model), the model behaves normally on clean inputs but produces an attacker-chosen output whenever a hidden trigger (a pixel patch, a rare token, a phrase) is present. Both are amplified by ML supply-chain compromise (AML.T0010): poisoned public datasets, trojaned pre-trained weights downloaded from a hub, or a malicious model serialization. This is OWASP LLM04:2025 Data and Model Poisoning.

Detection spans the pipeline. On the data side: provenance and integrity checks, statistical outlier and label-flip detection, and de-duplication of suspiciously near-identical samples. On the model side: activation-clustering and spectral-signature analysis (which exploit the fact that poisoned samples activate the network differently than clean ones) and trigger reconstruction. On the supply-chain side: verifying weights hashes/signatures and refusing unsafe serialization formats (pickle-based .bin/.pt) in favor of safetensors. This skill implements all three using IBM's Adversarial Robustness Toolbox (ART), Cleanlab for label-quality issues, and integrity tooling.

When to Use

Before training/fine-tuning on third-party or user-contributed data.
Before deploying a model built on a downloaded pre-trained checkpoint.
During an ML supply-chain security review.
When investigating anomalous model behavior tied to specific inputs (possible backdoor trigger).
As a CI/CD gate that scans datasets and model artifacts before they enter the pipeline.

Prerequisites

Python 3.10+ and a virtual environment.
Install the tooling:

python -m venv .venv && source .venv/bin/activate

# IBM Adversarial Robustness Toolbox — poisoning detection defenses
pip install adversarial-robustness-toolbox

# Cleanlab — label/data quality issue detection
pip install cleanlab

# Modeling + safe serialization + hashing
pip install numpy scikit-learn safetensors

# (Choose one framework backend ART can wrap)
pip install tensorflow   # or: pip install torch

Objectives

Verify dataset and model-weight provenance and integrity (hashes/signatures, safe formats).
Detect label-quality issues and outliers in training data with Cleanlab.
Detect poisoned samples in a trained model using ART activation clustering.
Confirm findings with ART spectral-signature analysis.
Probe a suspect model for backdoor triggers and quantify trigger-induced misclassification.
Produce a poisoning-assessment report mapped to ATLAS AML.T0020 / AML.T0018.

MITRE ATT&CK Mapping

ID	Official Name	Relevance
AML.T0020	Poison Training Data	Injection of manipulated samples into the training corpus
AML.T0018	Backdoor ML Model	Trigger-activated hidden behavior in the trained model
AML.T0010	ML Supply Chain Compromise	Poisoned public datasets / trojaned downloaded weights
AML.T0024	Exfiltration via ML Inference API	Some poisoning aims to leak data via the model's responses

Workflow

1. Verify data and model provenance/integrity

Refuse artifacts whose hash/signature you cannot verify, and prefer safetensors over pickle-based formats (pickle can execute code on load).

# Verify a downloaded checkpoint against a published SHA-256
sha256sum model.safetensors
# compare to the hub-published digest

# Flag unsafe pickle-based weights in a directory
find ./models -type f \( -name "*.bin" -o -name "*.pt" -o -name "*.pkl" -o -name "*.ckpt" \)

# safe_load.py — load weights without executing pickle
from safetensors.numpy import load_file
weights = load_file("model.safetensors")   # no arbitrary code execution

2. Detect label/data-quality issues with Cleanlab

Cleanlab finds mislabeled, outlier, and near-duplicate samples — common signatures of label-flip poisoning.

# cleanlab_scan.py
import numpy as np
from cleanlab.filter import find_label_issues

# pred_probs: out-of-sample predicted probabilities (n_samples x n_classes)
# labels: given integer labels (n_samples,)
def scan(labels: np.ndarray, pred_probs: np.ndarray):
    issues = find_label_issues(
        labels=labels, pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
    print(f"[*] {len(issues)} suspected label issues (potential poisoning)")
    return issues

3. Detect poisoned samples via ART activation clustering

ActivationDefence clusters per-class activations; a class whose activations split into two distinct clusters indicates injected (poisoned) samples.

# activation_defence.py
import numpy as np
from art.estimators.classification import KerasClassifier
from art.defences.detector.poison import ActivationDefence

def detect(model, x_train, y_train):
    classifier = KerasClassifier(model=model)          # wrap your trained model
    defence = ActivationDefence(classifier, x_train, y_train)
    report, is_clean_lst = defence.detect_poison(
        nb_clusters=2, nb_dims=10, reduce="PCA"
    )
    # is_clean_lst[i] == 0 marks a suspected poisoned sample
    poisoned_idx = np.where(np.array(is_clean_lst) == 0)[0]
    print(f"[*] activation clustering flagged {len(poisoned_idx)} samples")
    return poisoned_idx, report

4. Confirm with ART spectral signatures

Spectral signatures use the covariance spectrum of feature representations to surface poisoned samples — a strong second signal.

# spectral.py
import numpy as np
from art.estimators.classification import KerasClassifier
from art.defences.detector.poison import SpectralSignatureDefense

def detect(model, x_train, y_train, nb_classes):
    classifier = KerasClassifier(model=model)
    defence = SpectralSignatureDefense(
        classifier, x_train, y_train,
        expected_pp_poison=0.05, batch_size=128, eps_multiplier=1.5,
    )
    report, is_clean_lst = defence.detect_poison()
    poisoned_idx = np.where(np.array(is_clean_lst) == 0)[0]
    print(f"[*] spectral signatures flagged {len(poisoned_idx)} samples")
    return poisoned_idx, report

5. Probe the model for backdoor triggers

Test whether a candidate trigger flips predictions to an attacker target class far above the clean baseline.

# trigger_probe.py
import numpy as np

def test_trigger(model, x_clean, target_class, apply_trigger):
    """apply_trigger(x) stamps a candidate trigger (e.g. a corner pixel patch)."""
    clean_preds = model.predict(x_clean).argmax(axis=1)
    x_trig = np.stack([apply_trigger(x.copy()) for x in x_clean])
    trig_preds = model.predict(x_trig).argmax(axis=1)
    asr = float(np.mean(trig_preds == target_class))   # attack success rate
    base = float(np.mean(clean_preds == target_class))
    print(f"[*] target-class rate clean={base:.3f} triggered={asr:.3f}")
    return {"baseline": base, "trigger_success_rate": asr,
            "backdoor_suspected": asr - base > 0.5}

6. Quarantine, retrain, and report

Remove flagged samples (intersection of Cleanlab + ART signals is highest-confidence), retrain on the cleaned set, and re-test for the trigger. Document: artifact provenance, samples flagged by each method, trigger ASR before/after, and ATLAS mapping. Recommend dataset provenance controls, signed weights (safetensors + sigstore/cosign), and ongoing pipeline scanning.

Tools and Resources

Tool	Purpose	Source
Adversarial Robustness Toolbox	Activation clustering & spectral-signature poisoning defenses	https://github.com/Trusted-AI/adversarial-robustness-toolbox
Cleanlab	Label/data-quality issue detection	https://github.com/cleanlab/cleanlab
safetensors	Safe (non-pickle) weight serialization	https://github.com/huggingface/safetensors
OWASP LLM04:2025	Data and Model Poisoning reference	https://genai.owasp.org/llmrisk/llm042025-data-and-model-poisoning/
MITRE ATLAS	AI threat technique taxonomy	https://atlas.mitre.org/

Detection Method Reference

Layer	Method	Tool	Signal
Supply chain	Hash/signature + safe format	sha256/safetensors	Tampered or unsafe artifact
Data	Label issues / outliers	Cleanlab	Mislabeled / injected samples
Model	Activation clustering	ART ActivationDefence	Per-class activation split
Model	Spectral signatures	ART SpectralSignatureDefense	Outlier covariance spectrum
Model	Trigger probing	custom	High trigger attack-success-rate

Validation Criteria

Dataset and weight provenance/integrity verified (hashes, safe format)
Unsafe pickle-based artifacts identified and avoided
Cleanlab label-issue scan run and suspicious samples listed
ART activation clustering executed with flagged sample indices
ART spectral-signature analysis run as confirmation
Backdoor trigger probe quantifies attack-success-rate vs. baseline
Highest-confidence poisoned samples quarantined (multi-method overlap)
Model retrained on cleaned data and re-tested for the trigger
Findings mapped to MITRE ATLAS AML.T0020 / AML.T0018 and OWASP LLM04:2025
Report delivered with remediation (provenance, signed weights, pipeline scanning)

信息

Category 人工智能

Name detecting-data-and-model-poisoning

版本 v20260622

大小 12.14KB

Source mukul975/Anthropic-Cybersecurity-Skills

更新时间 2026-06-26

语言

简体中文

English