v20260317
evaluating-cosmos-policy
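Provides a complete workflow for evaluating NVIDIA Cosmos Policy in LIBERO and RoboCasa simulation, covering headless EGL rendering, smoke and full evaluations, and inference latency profiling for robot manipulation scenarios.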

Cosmos Policy Evaluation

Evaluation workflows for NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments from the public cosmos-policy repository. Covers blank-machine setup, headless GPU evaluation, and inference profiling.

Quick start

Run a minimal LIBERO evaluation using the official public eval module:

uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Core concepts

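What Cosmos Policy is: NVIDIA Cosmos Policy is a robot manipulation policy fine-tuned from the Cosmos-Predict2 2B video model, as the config and checkpoint names used throughout this document indicate. It conditions on camera images, proprioception, and T5 embeddings of the language instruction, and it produces chunks of robot actions through a short denoising loop (see the --num_denoising_steps_* flags) rather than by decoding discrete action tokens one at a time.

Key architecture choices:

| Component | Design |
| --- | --- |
| Visual encoder | Video tokenizer of the Cosmos-Predict2 base model (continuous latents) |
| Language conditioning | Cross-attention to precomputed T5 text embeddings |
| Action prediction | Action chunks generated by iterative denoising (5 steps in the commands here) |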

Public command surface: The supported evaluation entrypoints are cosmos_policy.experiments.robot.libero.run_libero_eval and cosmos_policy.experiments.robot.robocasa.run_robocasa_eval. Keep reproduction notes anchored to these public modules and their documented flags.
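One way to keep reproduction notes anchored to these modules is to build every command from a single flag dictionary. The sketch below is illustrative only: `build_eval_command` is a hypothetical helper that is not part of cosmos-policy, and the dictionary holds a subset of the flags from the quick-start command. Only the module paths and flag names come from this document.

```python
import shlex

# Entry points named in this document.
LIBERO_EVAL = "cosmos_policy.experiments.robot.libero.run_libero_eval"
ROBOCASA_EVAL = "cosmos_policy.experiments.robot.robocasa.run_robocasa_eval"


def build_eval_command(module, flags, group):
    """Return the argv list for `uv run ... python -m <module> --flag value ...`.

    Hypothetical helper: it only formats arguments and does not validate them.
    """
    argv = ["uv", "run", "--extra", "cu128", "--group", group,
            "--python", "3.10", "python", "-m", module]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


# Subset of the quick-start flags, for illustration.
smoke_flags = {
    "config": "cosmos_predict2_2b_480p_libero__inference_only",
    "ckpt_path": "nvidia/Cosmos-Policy-LIBERO-Predict2-2B",
    "task_suite_name": "libero_10",
    "num_trials_per_task": 1,
    "seed": 195,
    "deterministic": True,
}
command = build_eval_command(LIBERO_EVAL, smoke_flags, group="libero")
print(shlex.join(command))  # paste this line into the run notes
```

Keeping the flags in one dictionary means a smoke run and a full run differ only in the values that are meant to differ, such as `num_trials_per_task`.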

Compute requirements

| Task | GPU | VRAM | Typical wall time |
| --- | --- | --- | --- |
| LIBERO smoke eval (1 trial) | 1x A40/A100 | ~16 GB | 5-10 min |
| LIBERO full eval (50 trials) | 1x A40/A100 | ~16 GB | 2-4 hours |
| RoboCasa single-task (2 trials) | 1x A40/A100 | ~18 GB | 10-15 min |
| RoboCasa all-tasks | 1x A40/A100 | ~18 GB | 4-8 hours |
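Before launching, it can help to confirm that a GPU has enough free memory for the figures above. The following is a minimal sketch that assumes `nvidia-smi` is on the PATH. The `gpus_with_free_memory` helper and the 16 GiB threshold are illustrative choices based on the table, not an official requirement check.

```python
import subprocess


def gpus_with_free_memory(csv_text, required_mib):
    """Return indices of GPUs whose free memory (MiB) meets the requirement.

    `csv_text` is expected to hold one "index, memory.free" row per GPU, as
    printed by nvidia-smi with --format=csv,noheader,nounits.
    """
    usable = []
    for row in csv_text.strip().splitlines():
        index, free_mib = (field.strip() for field in row.split(","))
        if int(free_mib) >= required_mib:
            usable.append(int(index))
    return usable


def query_nvidia_smi():
    """Query the driver; raises if nvidia-smi is missing or fails."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout


# Offline example with made-up readings for two GPUs.
sample = "0, 40000\n1, 9000\n"
print(gpus_with_free_memory(sample, required_mib=16 * 1024))
```

On a real node, pass `query_nvidia_smi()` in place of `sample`.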

When to use vs alternatives

Use this skill when:

  • Evaluating NVIDIA Cosmos Policy on LIBERO or RoboCasa benchmarks
  • Profiling inference latency and throughput for Cosmos Policy
  • Setting up headless EGL rendering for robot simulation on GPU clusters

Use alternatives when:

  • Training or fine-tuning Cosmos Policy from scratch (use official Cosmos training docs)
  • Working with OpenVLA-based policies (use fine-tuning-openvla-oft)
  • Working with Physical Intelligence pi0 models (use fine-tuning-serving-openpi)
  • Running real-robot evaluation rather than simulation

Workflow 1: LIBERO evaluation

Copy this checklist and track progress:

LIBERO Eval Progress:
- [ ] Step 1: Install environment and dependencies
- [ ] Step 2: Configure headless EGL rendering
- [ ] Step 3: Run smoke evaluation
- [ ] Step 4: Validate outputs and parse results
- [ ] Step 5: Run full benchmark if smoke passes

Step 1: Install environment

git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md to build and enter the supported Docker container.
# Then, inside the container:
uv sync --extra cu128 --group libero --python 3.10

Step 2: Configure headless rendering

export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
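When the evaluation is driven from Python rather than a shell, the same variables need to be in the environment before any rendering library is imported, since MuJoCo-based simulators and PyOpenGL typically read them at import time. A sketch, where `egl_environment` is a hypothetical helper and a single GPU is assumed:

```python
import os


def egl_environment(gpu_id=0):
    """Return the four headless-rendering variables for one GPU."""
    return {
        "CUDA_VISIBLE_DEVICES": str(gpu_id),
        "MUJOCO_EGL_DEVICE_ID": str(gpu_id),
        "MUJOCO_GL": "egl",
        "PYOPENGL_PLATFORM": "egl",
    }


# Apply before importing mujoco, robosuite, or any other renderer-backed module.
os.environ.update(egl_environment(gpu_id=0))
```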

Step 3: Run smoke evaluation

uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Step 4: Validate and parse results

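import glob
import json
import os

# Find the most recently written result file under the official log directory.
# Sorting by modification time avoids picking the wrong run when file names do not sort chronologically.
log_files = glob.glob("cosmos_policy/experiments/robot/libero/logs/**/*.json", recursive=True)
if not log_files:
    raise SystemExit("No JSON results found; check --local_log_dir and that the eval finished.")
latest = max(log_files, key=os.path.getmtime)
with open(latest) as f:
    results = json.load(f)

print(latest)
print(results)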

Step 5: Scale up

Run all four LIBERO task suites with 50 trials per task:

for suite in libero_spatial libero_object libero_goal libero_10; do
  uv run --extra cu128 --group libero --python 3.10 \
    python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
      --config cosmos_predict2_2b_480p_libero__inference_only \
      --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
      --config_file cosmos_policy/config/config.py \
      --use_wrist_image True \
      --use_proprio True \
      --normalize_proprio True \
      --unnormalize_actions True \
      --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
      --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
      --trained_with_image_aug True \
      --chunk_size 16 \
      --num_open_loop_steps 16 \
      --task_suite_name "$suite" \
      --num_trials_per_task 50 \
      --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
      --seed 195 \
      --randomize_seed False \
      --deterministic True \
      --run_id_note "suite_${suite}" \
      --ar_future_prediction False \
      --ar_value_prediction False \
      --use_jpeg_compression True \
      --flip_images True \
      --num_denoising_steps_action 5 \
      --num_denoising_steps_future_state 1 \
      --num_denoising_steps_value 1 \
      --data_collection False
done

Workflow 2: RoboCasa evaluation

Copy this checklist and track progress:

RoboCasa Eval Progress:
- [ ] Step 1: Install RoboCasa assets and verify macros
- [ ] Step 2: Run single-task smoke evaluation
- [ ] Step 3: Validate outputs
- [ ] Step 4: Expand to multi-task runs

Step 1: Install RoboCasa

git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy
python -m robocasa.scripts.setup_macros
python -m robocasa.scripts.download_kitchen_assets

This fork installs the robocasa Python package expected by Cosmos Policy while preserving the patched environment changes used in the public RoboCasa eval path. Verify macros_private.py exists and paths are correct.
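The macros check can be scripted. The sketch below assumes that `macros_private.py` sits at the top level of the installed `robocasa` package directory. Both helper names are hypothetical, and the package is located without being imported so that a broken install still produces a readable message.

```python
import importlib.util
from pathlib import Path


def find_macros_private(package_dir):
    """Return the path to macros_private.py inside package_dir, or None."""
    candidate = Path(package_dir) / "macros_private.py"
    return candidate if candidate.is_file() else None


def robocasa_package_dir():
    """Locate the installed robocasa package without importing it."""
    spec = importlib.util.find_spec("robocasa")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


package_dir = robocasa_package_dir()
if package_dir is None:
    print("robocasa is not installed in this environment")
elif find_macros_private(package_dir) is None:
    print(f"macros_private.py missing under {package_dir}; rerun setup_macros")
else:
    print(f"found {find_macros_private(package_dir)}")
```

The printed path also shows whether the editable install from the fork is the one being picked up.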

Step 2: Single-task smoke evaluation

uv run --extra cu128 --group robocasa --python 3.10 \
  python -m cosmos_policy.experiments.robot.robocasa.run_robocasa_eval \
    --config cosmos_predict2_2b_480p_robocasa_50_demos_per_task__inference \
    --ckpt_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --num_wrist_images 1 \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 32 \
    --num_open_loop_steps 16 \
    --task_name TurnOffMicrowave \
    --obj_instance_split A \
    --num_trials_per_task 2 \
    --local_log_dir cosmos_policy/experiments/robot/robocasa/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --use_variance_scale False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Step 3: Validate outputs

  • Confirm the eval log prints the expected task name, object split, and checkpoint/config values.
  • Inspect the final Success rate: line in the log.
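Reading the final result line can be automated. The exact format of the `Success rate:` line is not specified here, so the sketch below assumes a number that is optionally followed by a percent sign. The pattern, the helper name, and the log excerpt are all illustrative.

```python
import re

# Assumed format: "Success rate: 0.5" or "Success rate: 50.0%". Adjust the
# pattern if the actual log line differs.
SUCCESS_RE = re.compile(r"Success rate:\s*([0-9]*\.?[0-9]+)\s*(%?)")


def final_success_rate(log_text):
    """Return the last reported success rate as a fraction in [0, 1], or None."""
    matches = SUCCESS_RE.findall(log_text)
    if not matches:
        return None
    value, percent = matches[-1]
    rate = float(value)
    return rate / 100.0 if percent else rate


# Made-up log excerpt for illustration only.
example_log = "Task: TurnOffMicrowave\nSuccess rate: 0.0\nSuccess rate: 0.5\n"
print(final_success_rate(example_log))
```

Taking the last match rather than the first matters when the log prints a running rate after every episode.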

Step 4: Expand scope

Increase --num_trials_per_task or add more tasks. Keep --obj_instance_split fixed across repeated runs for comparability.


Workflow 3: Blank-machine cluster launch

Cluster Launch Progress:
- [ ] Step 1: Clone the public repo and enter the supported runtime
- [ ] Step 2: Sync the benchmark-specific dependency group
- [ ] Step 3: Export rendering and cache environment variables before eval

Step 1: Clone and enter the supported runtime

git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md, start the Docker container, and enter it before continuing.

Step 2: Sync dependencies

uv sync --extra cu128 --group libero --python 3.10
# or, for RoboCasa:
uv sync --extra cu128 --group robocasa --python 3.10
# then install the Cosmos-compatible RoboCasa fork:
git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy

Step 3: Export runtime environment

export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
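export HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
# TRANSFORMERS_CACHE is deprecated in recent transformers releases; HF_HOME alone is normally sufficient.
export TRANSFORMERS_CACHE="${TRANSFORMERS_CACHE:-$HF_HOME}"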

Expected performance benchmarks

Reference values from official evaluation (tied to specific setup and seeds):

| Task Suite | Success Rate | Notes |
| --- | --- | --- |
| LIBERO-Spatial | 98.1% | Official LIBERO spatial result |
| LIBERO-Object | 100.0% | Official LIBERO object result |
| LIBERO-Goal | 98.2% | Official LIBERO goal result |
| LIBERO-Long | 97.6% | Official LIBERO long-horizon result |
| LIBERO-Average | 98.5% | Official average across LIBERO suites |
| RoboCasa | 67.1% | Official RoboCasa average result |

Reproduction note: Published success rates still depend on checkpoint choice, task suite, seeds, and simulator setup. Record the exact command and environment alongside any reported number.
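Part of the run-to-run variation is plain sampling noise from a finite number of episodes. The sketch below computes a Wilson score interval for a measured success rate. The 500-episode figure assumes a LIBERO suite of 10 tasks at 50 trials each, and the success count is illustrative rather than a reported result.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative: 490 successes out of 500 episodes.
low, high = wilson_interval(490, 500)
print(f"observed {490 / 500:.3f}, interval [{low:.3f}, {high:.3f}]")
```

A reproduced number that falls inside such an interval around the published value is hard to distinguish from it on sample size alone, which is one more reason to log seeds and trial counts.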


Non-negotiable rules

  • EGL alignment: Always set CUDA_VISIBLE_DEVICES, MUJOCO_EGL_DEVICE_ID, MUJOCO_GL=egl, and PYOPENGL_PLATFORM=egl together on headless GPU nodes.
  • Official runtime first: If host-Python installs hit binary compatibility issues, fall back to the supported container workflow from SETUP.md before debugging package internals.
  • Cache consistency: Use the same cache directory across setup and eval so Hugging Face and dependency caches are reused.
  • Run comparability: Keep task name, object split, seed, and trial count fixed across repeated runs.
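The comparability rule is easier to follow when each run's settings are written down mechanically. Below is a sketch of a run manifest. The file name `run_manifest.json` and both helpers are hypothetical, and the recorded fields simply mirror the rules above.

```python
import json
import os
import sys
import time
from pathlib import Path

TRACKED_ENV = ("CUDA_VISIBLE_DEVICES", "MUJOCO_EGL_DEVICE_ID",
               "MUJOCO_GL", "PYOPENGL_PLATFORM", "HF_HOME")


def build_manifest(command, environ=os.environ):
    """Collect the command line and the environment variables that affect a run."""
    return {
        "command": list(command),
        "env": {name: environ.get(name) for name in TRACKED_ENV},
        "python": sys.version.split()[0],
        "created_unix": int(time.time()),
    }


def write_manifest(manifest, log_dir):
    """Write the manifest next to the eval logs and return its path."""
    path = Path(log_dir) / "run_manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return path


# Example with an explicit environment mapping instead of the live one.
manifest = build_manifest(
    ["python", "-m", "cosmos_policy.experiments.robot.libero.run_libero_eval",
     "--seed", "195"],
    environ={"MUJOCO_GL": "egl"},
)
print(json.dumps(manifest["env"], sort_keys=True))
```

Unset variables are stored as null, which makes a missing EGL setting visible when two manifests are compared.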

Common issues

Issue: binary compatibility or loader failures on host Python

Fix: rerun inside the official container/runtime from SETUP.md. Do not assume host-package rebuilds will match the public release environment.

Issue: LIBERO prompts for config path in a non-interactive shell

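Fix: pre-create config.yaml in the directory named by LIBERO_CONFIG_PATH (~/.libero when the variable is unset):

import os

import yaml

# LIBERO reads config.yaml from LIBERO_CONFIG_PATH, falling back to ~/.libero.
config_dir = os.environ.get("LIBERO_CONFIG_PATH", os.path.expanduser("~/.libero"))
os.makedirs(config_dir, exist_ok=True)
root = "/path/to/libero/libero"  # placeholder: the installed libero/libero package directory
# LIBERO looks up several keys, not only benchmark_root; the sub-paths assume its default layout.
paths = {"benchmark_root": root, "bddl_files": f"{root}/bddl_files", "init_states": f"{root}/init_files",
         "datasets": "/path/to/libero/datasets", "assets": f"{root}/assets"}
with open(os.path.join(config_dir, "config.yaml"), "w") as f:
    yaml.safe_dump(paths, f)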

Issue: EGL initialization or shutdown noise

Fix: align EGL environment variables first. Treat teardown-only EGL_NOT_INITIALIZED warnings as low-signal unless the job exits non-zero.
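This triage rule can be written as a small classifier. The sketch below treats a warning as teardown-only when its first occurrence comes after the last `Success rate:` line. That is an assumption about log ordering, and both the function and its labels are hypothetical.

```python
def classify_run(returncode, log_text, done_marker="Success rate:"):
    """Label a finished eval as failed, ok, ok_teardown_noise, or needs_review."""
    if returncode != 0:
        return "failed"
    first_warning = log_text.find("EGL_NOT_INITIALIZED")
    if first_warning == -1:
        return "ok"
    last_result = log_text.rfind(done_marker)
    # Warnings that only appear after the final result line are treated as
    # teardown noise; anything earlier deserves a closer look.
    if last_result != -1 and first_warning > last_result:
        return "ok_teardown_noise"
    return "needs_review"


# Made-up log text for illustration.
print(classify_run(0, "Success rate: 1.0\nEGL_NOT_INITIALIZED\n"))
```

The exit code is checked first, so a non-zero exit is reported as a failure regardless of what the log contains.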

Issue: Kitchen object sampling NaNs or asset lookup failures in RoboCasa

Fix: rerun asset setup and confirm the patched robocasa install is intact:

python -m robocasa.scripts.download_kitchen_assets
python -c "import robocasa; print(robocasa.__file__)"

Issue: MuJoCo rendering mismatch

Fix: verify GPU device alignment:

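import os

cuda_dev = os.environ.get("CUDA_VISIBLE_DEVICES")
egl_dev = os.environ.get("MUJOCO_EGL_DEVICE_ID")
# Fail loudly if either variable is missing instead of comparing two unset values.
assert cuda_dev is not None and egl_dev is not None, "Set CUDA_VISIBLE_DEVICES and MUJOCO_EGL_DEVICE_ID"
# With one visible GPU this is an equality check; with a list, the EGL ID must be one of its entries.
assert egl_dev in cuda_dev.split(","), f"GPU mismatch: CUDA={cuda_dev}, EGL={egl_dev}"
print(f"Rendering on GPU {egl_dev}")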

Advanced topics

LIBERO command matrix: See references/libero-commands.md
RoboCasa command matrix: See references/robocasa-commands.md

Resources

Info
Category Artificial Intelligence
Name evaluating-cosmos-policy
Version v20260317
Size 7.81KB
Updated 2026-03-19