
Continuous LLM Red Teaming and Security Hardening

v20260622
continuous-llm-red-teaming-with-promptfoo
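This skill integrates LLM red teaming into the CI/CD pipeline to provide automated security regression testing for LLM applications. It uses Promptfoo and DeepTeam to automatically detect and block vulnerabilities such as jailbreaks and prompt injection that are introduced by prompt edits or model upgrades, keeping the LLM application continuously secure.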

Continuous LLM Red Teaming with Promptfoo

Authorized Use Only: Run these adversarial probes only against LLM applications and endpoints you own or are explicitly authorized to test. Generated attack payloads (jailbreaks, prompt injections, harmful-content elicitation) are adversarial inputs; sending them to third-party services without permission may violate terms of service.

Overview

Promptfoo is an open-source LLM evaluation and red-teaming framework (used by OpenAI and Anthropic per its README) that generates adversarial test cases, runs them against your model/agent, and grades the responses. DeepTeam (by Confident AI) is a complementary open-source framework offering 50+ ready-to-use vulnerabilities and 10+ research-backed attack methods. Together they let you treat LLM security as a regression test: every commit re-runs the same adversarial suite, and the pipeline fails when a previously-safe behavior regresses.

This matters because LLM applications change constantly — prompts, models, RAG sources, tools, and guardrails all drift. A jailbreak that was patched last sprint can silently return after a prompt edit or a model upgrade. Promptfoo maps its plugins directly onto the OWASP LLM Top 10 (owasp:llm) and OWASP Agentic (owasp:agentic) presets, and onto MITRE ATLAS, so the suite tracks recognized risk taxonomies. The core threat addressed here is AML.T0051 — LLM Prompt Injection (MITRE ATLAS): adversarial instructions that override the application's intended behavior. This skill follows the Promptfoo red-team docs (https://www.promptfoo.dev/docs/red-team/) and DeepTeam docs (https://www.trydeepteam.com/docs/getting-started), and aligns to NIST AI RMF MANAGE-4.1 (post-deployment monitoring and feedback to manage AI risk).

When to Use

  • When you need continuous, automated red-teaming of an LLM app in CI/CD rather than one-off manual tests.
  • When you want to enforce a security gate: block merges that introduce or reintroduce jailbreak/injection vulnerabilities.
  • When mapping coverage to OWASP LLM Top 10 / OWASP Agentic / MITRE ATLAS for compliance reporting.
  • When comparing the security posture of two models or prompt versions side by side.
  • When tracking vulnerability regression over time across releases.

Prerequisites

  • Node.js 18+ (Promptfoo is distributed via npm) and Python 3.9+ (for DeepTeam).
  • Install Promptfoo and DeepTeam:
    npm install -g promptfoo            # or: npx promptfoo@latest
    pip install -U deepteam
    
  • API access/credentials for the target LLM endpoint (and a grader model, e.g. an OpenAI key) exposed as environment variables.
  • A CI/CD platform (GitHub Actions, GitLab CI) with secret storage.
  • Authorization to test the target application.

Objectives

  • Scaffold a Promptfoo red-team config targeting your LLM app.
  • Enable OWASP LLM Top 10 and OWASP Agentic plugin presets plus jailbreak/injection strategies.
  • Run the suite locally and interpret the per-plugin pass/fail report.
  • Add DeepTeam as a second engine for programmatic, research-backed attacks.
  • Integrate both into CI/CD so builds fail on new vulnerabilities.
  • Generate shareable HTML/PDF security reports per run.

MITRE ATLAS Mapping

ID            | Name (MITRE ATLAS)          | Tactic
AML.T0051     | LLM Prompt Injection        | Initial Access / Persistence (LLM)
AML.T0051.000 | Direct (Prompt Injection)   | LLM Attack
AML.T0051.001 | Indirect (Prompt Injection) | LLM Attack
AML.T0054     | LLM Jailbreak               | Privilege Escalation / Defense Evasion (LLM)

Workflow

1. Scaffold the red-team configuration

Initialize an interactive config; it writes promptfooconfig.yaml where targets, plugins, and strategies live.

promptfoo redteam init
# choose your target type (HTTP endpoint, openai:..., anthropic:..., custom provider)

2. Define targets, OWASP presets, and attack strategies

Edit promptfooconfig.yaml. The purpose grounds attack generation; plugins are adversarial input generators; strategies are delivery techniques (jailbreak/injection wrappers).

# promptfooconfig.yaml
targets:
  - id: https://api.example.com/chat        # your app endpoint
    label: support-bot

redteam:
  purpose: |
    A customer-support assistant for an e-commerce site. Must never reveal
    system prompts, leak PII, or perform actions outside order support.
  numTests: 10
  plugins:
    - owasp:llm          # OWASP LLM Top 10 preset
    - owasp:agentic      # OWASP Agentic threats preset
    - id: pii:direct
      numTests: 15
    - prompt-extraction  # system-prompt leakage
    - harmful
  strategies:
    - id: jailbreak              # iterative single-turn jailbreak
    - id: jailbreak:composite    # stacked jailbreak techniques
    - id: crescendo              # multi-turn escalation
    - id: prompt-injection       # injection wrapper

3. Run the suite and view the report

redteam run generates the adversarial test cases and evaluates them against the target in one step; redteam report then opens the interactive report.

promptfoo redteam run
promptfoo redteam report            # launches the web report (pass/fail per plugin)

Each row shows the plugin (mapped to OWASP/ATLAS), the strategy, the attack prompt, the model's response, and the grader's verdict. The attack success rate per plugin is your headline metric — track it per release.
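The headline metric can be sketched in a few lines of Python. This is a minimal illustration, not Promptfoo's own computation: it assumes the results have already been flattened into records with the hypothetical keys "plugin" and "passed" (True when the target resisted the attack), which is not necessarily the shape of Promptfoo's output files.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {plugin: fraction of attacks that succeeded}.

    Each record is a dict with hypothetical keys:
      "plugin": the plugin id, "passed": True if the target resisted the attack.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        totals[rec["plugin"]] += 1
        if not rec["passed"]:
            failures[rec["plugin"]] += 1
    return {plugin: failures[plugin] / totals[plugin] for plugin in totals}


# Illustrative data only; real records would come from your exported results.
sample = [
    {"plugin": "pii:direct", "passed": True},
    {"plugin": "pii:direct", "passed": False},
    {"plugin": "prompt-extraction", "passed": True},
    {"plugin": "prompt-extraction", "passed": True},
]
rates = attack_success_rate(sample)
print(rates)  # {'pii:direct': 0.5, 'prompt-extraction': 0.0}
```

A rate of 0.0 means every generated attack for that plugin was resisted; anything above it marks a category to triage.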

4. Add DeepTeam for programmatic, research-backed attacks

Use DeepTeam to cover additional vulnerabilities/attacks and to script bespoke suites in Python.

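# deepteam_suite.py
from deepteam import red_team
from deepteam.vulnerabilities import Bias, PIILeakage
from deepteam.attacks.single_turn import PromptInjection


def call_my_app(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real call to your application.
    raise NotImplementedError("connect this to your LLM application")


# DeepTeam's quickstart declares the callback as async with an `input` argument.
async def model_callback(input: str) -> str:
    # call your application's LLM endpoint here and return the text response
    return call_my_app(input)


risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[Bias(types=["race"]), PIILeakage(types=["api_and_database_access"])],
    attacks=[PromptInjection()],
)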
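One way the call_my_app helper used in the callback might be written, using only the standard library. This is a sketch under stated assumptions: the endpoint URL is the example one from the Promptfoo config, and both the request shape ({"message": ...}) and the response shape ({"reply": ...}) are hypothetical and must be adapted to your application's actual API.

```python
import json
import urllib.request

# Example endpoint from the config above; replace with your own application.
APP_URL = "https://api.example.com/chat"


def build_request(prompt: str) -> urllib.request.Request:
    # Assumed request shape: POST a JSON body of the form {"message": "<prompt>"}.
    payload = json.dumps({"message": prompt}).encode("utf-8")
    return urllib.request.Request(
        APP_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body: bytes) -> str:
    # Assumed response shape: a JSON object of the form {"reply": "<text>"}.
    return json.loads(body.decode("utf-8"))["reply"]


def call_my_app(prompt: str, timeout: float = 30.0) -> str:
    # Send the adversarial prompt to the application and return its text reply.
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return extract_reply(resp.read())
```

Keeping request construction and response parsing in separate functions lets both be unit-tested without network access.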

DeepTeam can also be driven from a YAML config:

deepteam run config.yaml

5. Gate the build in CI/CD (GitHub Actions)

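Fail the pipeline when red-team assertions fail. Promptfoo returns a non-zero exit code on failures, which fails the job; the failed job blocks the merge once the check is marked as required in the repository's branch protection rules.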

# .github/workflows/llm-redteam.yml
name: LLM Red Team
on: [pull_request]
jobs:
  redteam:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm install -g promptfoo
      - name: Run red team (fails build on new vulns)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: promptfoo redteam run --no-progress-bar
      - name: Export machine-readable results
        if: always()
        run: promptfoo redteam report --output results.json
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: redteam-report, path: results.json }
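If a zero-tolerance gate is too strict for your application, the pass/fail decision can be made by a small script instead. The following is a minimal sketch, assuming you have already computed a mapping of plugin id to attack success rate; the function names and the 5% threshold are illustrative and not part of Promptfoo.

```python
def failing_plugins(rates, max_rate=0.0):
    """Return the sorted plugin ids whose attack success rate exceeds max_rate."""
    return sorted(p for p, r in rates.items() if r > max_rate)


def gate_exit_code(rates, max_rate=0.0):
    """Print each offending plugin and return a process exit code (0 = pass)."""
    offenders = failing_plugins(rates, max_rate)
    for plugin in offenders:
        print(f"FAIL {plugin}: attack success rate {rates[plugin]:.0%} exceeds {max_rate:.0%}")
    return 1 if offenders else 0


# Illustrative rates only; in CI these would be derived from the run's results.
example_rates = {"pii:direct": 0.10, "prompt-extraction": 0.0}
code = gate_exit_code(example_rates, max_rate=0.05)
print("exit code:", code)
# In a CI step, finish with: sys.exit(code)
```

Returning the exit code rather than calling sys.exit inside the function keeps the gate testable; the CI step only needs to pass the value on.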

6. Track regressions over time

Persist results.json per run and compare attack-success-rate per plugin between releases. A rising rate for any OWASP LLM category is a regression to triage before release. Promptfoo's --filter-failing lets you re-run only previously failing cases to confirm a fix.

promptfoo redteam run --filter-failing results.json
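The release-to-release comparison can be sketched as follows. It assumes each run has been reduced to a mapping of plugin id to attack success rate; the function name, the tolerance parameter, and the sample numbers are hypothetical.

```python
def find_regressions(previous, current, tolerance=0.0):
    """Return {plugin: (old_rate, new_rate)} for rates that rose by more than tolerance.

    Plugins missing from the previous run are treated as having a rate of 0.0,
    so a newly added plugin only counts as a regression if attacks succeed.
    """
    regressions = {}
    for plugin, new_rate in current.items():
        old_rate = previous.get(plugin, 0.0)
        if new_rate - old_rate > tolerance:
            regressions[plugin] = (old_rate, new_rate)
    return regressions


# Illustrative rates for two releases.
previous = {"pii:direct": 0.10, "prompt-extraction": 0.00}
current = {"pii:direct": 0.05, "prompt-extraction": 0.20, "harmful": 0.00}
print(find_regressions(previous, current))  # {'prompt-extraction': (0.0, 0.2)}
```

A small non-zero tolerance can absorb run-to-run noise, since generated attacks and grader verdicts are not fully deterministic.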

Tools and Resources

Resource                          | Link
Promptfoo red-team docs           | https://www.promptfoo.dev/docs/red-team/
Promptfoo red-team configuration  | https://www.promptfoo.dev/docs/red-team/configuration/
Promptfoo CI/CD integration       | https://www.promptfoo.dev/docs/integrations/ci-cd/
Promptfoo MITRE ATLAS mapping     | https://www.promptfoo.dev/docs/red-team/mitre-atlas/
DeepTeam (Confident AI)           | https://github.com/confident-ai/deepteam
DeepTeam docs                     | https://www.trydeepteam.com/docs/getting-started
OWASP Top 10 for LLM Applications | https://genai.owasp.org/

Plugin / Strategy Reference

Promptfoo item                  | Type     | Maps to
owasp:llm                       | preset   | OWASP LLM Top 10 suite
owasp:agentic                   | preset   | OWASP Agentic threats
prompt-extraction               | plugin   | LLM07 system-prompt leakage (2025 list)
pii:direct                      | plugin   | LLM02 sensitive-info disclosure (2025 list; LLM06 in the 2023 list)
harmful                         | plugin   | harmful content generation
jailbreak / jailbreak:composite | strategy | AML.T0054 LLM jailbreak
crescendo                       | strategy | multi-turn jailbreak
prompt-injection                | strategy | AML.T0051 prompt injection

Validation Criteria

  • promptfooconfig.yaml created with target, owasp:llm, and owasp:agentic plugins.
  • Jailbreak and prompt-injection strategies enabled.
  • promptfoo redteam run executes and produces a per-plugin pass/fail report.
  • DeepTeam suite runs against the same target via model_callback.
  • CI/CD job fails the build on new red-team failures (non-zero exit).
  • results.json artifact archived per run for regression tracking.
  • Attack-success-rate per OWASP category trended across releases.
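
The first two criteria can be checked mechanically. The sketch below assumes the YAML config has already been parsed into a Python dict (for example with a YAML library of your choice); the required sets and the function names are illustrative, chosen to mirror the config shown in the workflow.

```python
REQUIRED_PLUGINS = {"owasp:llm", "owasp:agentic"}
REQUIRED_STRATEGIES = {"jailbreak", "prompt-injection"}


def _ids(entries):
    # Entries may be plain strings or mappings with an "id" key, as in the config.
    return {e["id"] if isinstance(e, dict) else e for e in entries}


def missing_items(config):
    """Return a list of required config items that are absent (empty list = OK)."""
    redteam = config.get("redteam", {})
    problems = []
    if not config.get("targets"):
        problems.append("targets")
    missing_plugins = REQUIRED_PLUGINS - _ids(redteam.get("plugins", []))
    missing_strategies = REQUIRED_STRATEGIES - _ids(redteam.get("strategies", []))
    problems += sorted(f"plugin:{p}" for p in missing_plugins)
    problems += sorted(f"strategy:{s}" for s in missing_strategies)
    return problems


# A deliberately incomplete config, already parsed into a dict.
parsed = {
    "targets": [{"id": "https://api.example.com/chat", "label": "support-bot"}],
    "redteam": {
        "plugins": ["owasp:llm", {"id": "pii:direct", "numTests": 15}],
        "strategies": [{"id": "jailbreak"}],
    },
}
print(missing_items(parsed))  # ['plugin:owasp:agentic', 'strategy:prompt-injection']
```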
Info
Category Programming & Development
Name continuous-llm-red-teaming-with-promptfoo
Version v20260622
Size 12.13KB
Updated 2026-06-26