
Add periodic safe eval during training#329

Open
eugenevinitsky wants to merge 43 commits into 3.0_beta from safe-eval-during-training

Conversation


@eugenevinitsky eugenevinitsky commented Mar 8, 2026

Summary

  • Periodically run policy with safe/law-abiding reward conditioning (high penalties for collisions, offroad, overspeed) and render videos + collect metrics under the eval/ wandb tab
  • Add [safe_eval] config section in drive.ini with configurable interval, episode count, and safe reward values
  • Add --config flag to C visualizer so it can render with alternative INI files
  • Refactor all 3 eval subprocess functions (wosac, human_replay, safe_eval) to share a common _run_eval_subprocess helper, eliminating ~50 lines of duplicated boilerplate
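For reference, the [safe_eval] section might look like the sketch below. The control-key names (enabled, interval, num_agents, num_episodes) follow the renames made later in this PR; the values are illustrative, not the shipped drive.ini defaults.

```ini
[safe_eval]
; control keys
enabled = True
interval = 50
num_agents = 64
num_episodes = 100
; remaining keys are treated as safe reward-conditioning values
; (illustrative numbers only)
collision = -1.0
offroad = -1.0
overspeed = -0.5
```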

How it works

  1. Every safe_eval_interval epochs, copies the exported .bin model
  2. Generates a temp INI with reward bounds at safe values
  3. Renders videos via the C visualizer using the safe INI → logged under eval/world_state and eval/agent_view in wandb
  4. Runs a Python subprocess (pufferlib.pufferl safe_eval) to collect metrics → logged under eval/* in wandb
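Step 2 can be sketched as below. generate_safe_ini is a hypothetical helper (not necessarily the PR's actual function), assuming the base INI has an [env] section and that bounds follow the reward_bound_*_min/_max naming used by the config-discovery commits later in this PR. Pinning min == max makes the safe conditioning deterministic.

```python
import configparser
import tempfile

def generate_safe_ini(base_ini_path, safe_bounds):
    """Write a temp INI whose reward bounds are pinned to safe values.

    safe_bounds maps a bound name (e.g. "collision") to the safe value.
    Setting min == max removes randomness from the conditioning.
    """
    cfg = configparser.ConfigParser()
    cfg.read(base_ini_path)
    for name, value in safe_bounds.items():
        # "collision" -> reward_bound_collision_min / _max in [env]
        cfg.set("env", f"reward_bound_{name}_min", str(value))
        cfg.set("env", f"reward_bound_{name}_max", str(value))
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".ini", delete=False)
    with tmp:
        cfg.write(tmp)
    return tmp.name  # pass to the C visualizer via --config
```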

Test plan

  • Run training with safe_eval = True and verify videos appear in wandb eval tab
  • Verify metrics are logged under eval/ prefix
  • Confirm existing render and WOSAC/human_replay evals still work after the _run_eval_subprocess refactor
  • Test with safe_eval = False (default for non-drive envs) to confirm no side effects

NOTE: code heavily generated by Claude; this is just a prototype

@eugenevinitsky eugenevinitsky changed the base branch from main to 3.0_beta March 8, 2026 05:27
@eugenevinitsky eugenevinitsky changed the title from "Add periodic safe eval during training" to "[WIP, do not review] Add periodic safe eval during training" Mar 8, 2026
@eugenevinitsky eugenevinitsky force-pushed the safe-eval-during-training branch 2 times, most recently from a200eb9 to fd6b186 Compare March 8, 2026 05:36
Periodically run policy with safe/law-abiding reward conditioning values
(high collision, offroad, overspeed penalties) and render videos + collect
metrics under the eval/ wandb tab.

- Add [safe_eval] config section with safe reward conditioning values
- Add --config flag to C visualizer for alternative INI files
- Generate temp INI with min=max bounds for deterministic safe conditioning
- Render safe eval videos and collect metrics via subprocess
- Refactor eval subprocess functions to share common helper
- Fix visualize.c: guard rand()%0 crash, propagate eval_gif return value
- Fix safe_eval(): call policy.eval(), pass actual LSTM done/reward signals
@eugenevinitsky eugenevinitsky force-pushed the safe-eval-during-training branch from fd6b186 to 479b24b Compare March 8, 2026 05:39
Eugene Vinitsky and others added 16 commits March 8, 2026 00:46
render_videos no longer deletes its input bin file as a side effect.
Cleanup of bin_path and bin_path_epoch is now handled in a finally block
in pufferl.py, eliminating the need for a defensive copy (bin_path_safe).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass --train.device from parent config so eval subprocesses use the
correct device (e.g. cpu). Use = syntax for reward bound args to prevent
argparse from interpreting negative values like -5e-05 as flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Render and safe eval now trigger on their own intervals independently
of checkpoint_interval. They use the latest available .pt checkpoint,
so they don't require a fresh one. This means safe_eval_interval no
longer needs to be a multiple of checkpoint_interval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
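The decoupled triggering can be sketched as follows; maybe_safe_eval and its signature are illustrative, not the PR's actual code. The lexicographic max works because checkpoint filenames are zero-padded epoch numbers (the same trick a later commit applies in place of os.path.getctime).

```python
import glob
import os

def maybe_safe_eval(epoch, interval, checkpoint_dir):
    """Return the checkpoint path to evaluate this epoch, or None.

    Safe eval fires on its own cadence and reuses the latest existing
    .pt checkpoint, so interval need not divide checkpoint_interval.
    """
    if interval <= 0 or epoch % interval != 0:
        return None
    checkpoints = glob.glob(os.path.join(checkpoint_dir, "*.pt"))
    if not checkpoints:
        return None  # nothing exported yet; skip rather than force one
    # zero-padded epoch filenames sort lexicographically, so max() is newest
    return max(checkpoints)
```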
Rename safe_eval config keys to drop the safe_eval_ prefix since
they're already in the [safe_eval] section. Replace SAFE_EVAL_BOUND_KEYS
mapping with a simple SAFE_EVAL_REWARD_BOUNDS list since config keys
now match bound names directly (e.g. collision instead of
safe_eval_collision). Also rename safe_eval→enabled,
safe_eval_interval→interval, safe_eval_num_agents→num_agents,
safe_eval_num_episodes→num_episodes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of maintaining a hardcoded list of reward bound names, treat
every key in [safe_eval] that isn't a control key (enabled, interval,
num_agents, num_episodes) as a reward bound. Adding a new reward bound
now only requires adding it to the config file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace SAFE_EVAL_CONTROL_KEYS exclusion list with
_get_env_reward_bound_names() that discovers valid bound names by
pattern-matching reward_bound_*_min keys in the [env] config section.
Safe eval now only passes keys that match known env bounds, so adding
a new control key to [safe_eval] can't accidentally become a bound arg.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix --eval.wosac-num-agents (nonexistent) to --eval.wosac-num-rollouts
(actual config key). Also increase stderr output from 500 to 1000 chars
to show full tracebacks when eval subprocesses fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… logs

WOSAC ground truth trajectories are 9.1s at 10Hz = 91 steps. Without setting
episode_length, the eval subprocess used the default training episode_length,
causing shape mismatches between simulated and reference trajectories.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eval subprocesses (wosac, human replay, safe eval metrics) previously
blocked the training loop. Now they optionally run in daemon threads,
with cleanup on close(). Enabled by default via eval_async=True.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduced the number of episodes from 300 to 100 and added episode length, min and max goal distance parameters.
- safe_eval: pass episode_length, min_goal_distance, max_goal_distance
  to both the safe_eval() function (env overrides) and the subprocess
  (CLI args). Also pass num_agents to subprocess.
- wosac/human_replay: pass map_dir and num_maps to subprocesses so
  CLI overrides are respected.
- Add missing human_replay_num_agents to drive.ini config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- safe_eval() was using num_episodes as num_steps, running far too few
  steps to complete any episodes. Now runs until enough episodes are
  collected, counting by the 'n' field in each info dict.
- Fix wandb step conflict: async eval threads logged at stale steps,
  causing wandb to drop metrics. Use define_metric("eval/*",
  step_metric="eval_step") so eval metrics have their own step counter.
- Fix subprocess config passthrough: pass episode_length,
  min_goal_distance, max_goal_distance as --safe-eval.* args (not
  --env.*) so safe_eval() correctly applies them from safe_eval_config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
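The episode-counting fix can be sketched as below; run_until_episodes is an illustrative stand-in, with step_fn representing one vecenv step that returns a list of info dicts.

```python
def run_until_episodes(step_fn, num_episodes):
    """Step until num_episodes episodes have completed.

    Instead of treating num_episodes as a step count (the bug fixed
    above), count completed episodes via the 'n' field of each info
    dict and keep stepping until enough have finished.
    """
    completed = 0
    infos_seen = []
    while completed < num_episodes:
        for info in step_fn():
            completed += info.get("n", 0)
            infos_seen.append(info)
    return infos_seen
```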
When render_async is enabled, the bin_path_epoch file was deleted in
the finally block while the async render process was still reading it.
Now the async render process passes bin_path through the render queue,
and check_render_queue() deletes it after the process finishes. Also
drains the queue in close() to clean up any remaining bin files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eugenevinitsky eugenevinitsky changed the title from "[WIP, do not review] Add periodic safe eval during training" to "Add periodic safe eval during training" Mar 11, 2026
Eugene Vinitsky and others added 7 commits March 10, 2026 22:34
Merge resolution:
- Keep _run_eval_subprocess helper and eval_async infrastructure from
  our branch
- Adopt 3.0_beta's new WOSAC params: wosac_batch_size,
  wosac_target_scenarios, wosac_scenario_pool_size, wosac_eval_mode
  (replaces wosac_num_rollouts)
- Adopt 3.0_beta's expanded wandb keys: realism_meta_score_std,
  kinematic_metrics, interactive_metrics, map_based_metrics
- Adopt 3.0_beta's evaluator.evaluate() batched API in pufferl.py
- Keep our eval_step metric, map_dir passthrough, episode_length
- Resolve duplicate human_replay_num_agents (use 3.0_beta's value: 16)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace os.path.getctime with lexicographic max() for finding latest
  checkpoint (filenames are zero-padded epoch numbers)
- Update SLURM account to torch_pr_355_tandon_advanced
- Fix container overlay path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Enable render_async and eval_async by default in drive.ini
- Fix human replay subprocess: --eval.num-maps → --env.num-maps
- Fix eval_async string "False" being truthy (explicit string-to-bool)
- Align WOSAC/human replay trigger to same interval as render/safe eval
- Forward env config to eval subprocesses
- Add eval documentation to docs/src/evaluation.md
- Add TIMING prints for eval component profiling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
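The string-to-bool fix matters because any nonempty string, including "False", is truthy in Python, so a config value read as a string would silently enable the feature. A minimal illustration (as_bool is a hypothetical helper, not necessarily the PR's code):

```python
def as_bool(value):
    """Coerce an INI/CLI value to bool explicitly, since
    bool("False") is True in Python."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")
```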
… prints

- close() now waits for render processes (120s timeout) before killing,
  so async renders can finish and upload videos to wandb
- Track temp files per render process; clean up when process dies
- Remove broken env_config forwarding from _run_eval_subprocess (key
  "env_config" didn't exist; WOSAC/human replay should use waymo
  defaults, not training env config)
- Fix wandb step consistency: async renders now use step= parameter
- Remove all TIMING debug prints

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The safe eval subprocess loads config from the default INI, which has
map_dir=resources/drive/binaries/carla_2D. When training overrides
map_dir via CLI (e.g. to resources/drive/binaries/training), the
subprocess needs that forwarded explicitly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eugene Vinitsky and others added 19 commits March 13, 2026 08:59
- Render a video with human replay settings (control_sdc_only,
  create_all_valid, waymo maps) alongside human replay metrics
- generate_human_replay_ini() creates temp INI for the visualize binary
- Restructure eval block: all render-based evals share the bin export
- Videos logged to wandb under human_replay/ prefix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- All three render types (training, safe eval, human replay) now use
  the same _dispatch_render() helper for async/sync dispatch
- Fixes bug: async throttling was only applied to training render,
  not safe eval or human replay renders
- Fix inconsistent config["eval"] vs config.get("eval", {}) access

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- render/* for training renders
- eval/* for safe eval (videos + metrics)
- human_replay/* for human replay (videos + metrics)
- wosac/* for WOSAC realism metrics
- Use step=global_step consistently instead of custom step metrics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- config["env"] is the env name string, not the env dict. The env
  config dict is at config["env_config"]. Fixes safe eval crash.
- Async evals finish after training moves past their step, so wandb
  rejects step= parameter. Log train_step as data field instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add threading.Lock to WandbLogger, route all async eval logging
  through log_async() to prevent concurrent wandb.log() calls
- Guard all eval types (render, safe eval, WOSAC, human replay) with
  is_rank0 check so they only run on the primary process in
  distributed training

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eval threads now put results on _eval_results_queue instead of calling
wandb directly. The main thread drains the queue in mean_and_log(),
keeping all wandb.log() calls single-threaded. Also fixes sync render
to not use step= parameter (avoids non-monotonic step warning).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Maps with active_agent_count > 0 were never freed during the shared
init loop that counts agents. Only maps with 0 agents were freed.
This leaked ~20MB per map, causing 20+ GB memory usage when
num_maps=1000 (e.g. WOSAC eval).

Fixed by: moving the store logic into an else branch for active maps,
and always freeing the temporary env after both branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Free tracks_to_predict_indices in shared() loop (leaked per map)
- Py_DECREF original lists after PyList_GetSlice (ref count leak)
- Use get_nowait() directly instead of unreliable queue.empty()
- Fix stats mean loop: del then re-insert was keeping bad values
- Add defaults for human_replay config keys to prevent KeyError

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Let errors propagate instead of silently swallowing them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Daemon threads die with the process anyway; no need to block shutdown
for 11 minutes per stuck thread.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set episode_length, resample_frequency=0, num_agents, and goal
distances in the generated ini so the render shows the same behavior
the metrics subprocess measures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Default remains 6px/world-unit. Large CARLA maps (900m+) render at 5000+px
which causes timeouts with software rendering on cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Render runs async so a long timeout doesn't block training. Large CARLA
maps with software rendering can take 5+ minutes at default scale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents render from blocking training, especially important for large
CARLA maps that can take 5+ minutes with software rendering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>