Add worker crash recovery to ProcessParallelController #395 #430

Open
jhewers-pf wants to merge 2 commits into algorithmicsuperintelligence:main from jhewers-pf:fix/worker_crash_recovery

@jhewers-pf

When a child process in the ProcessPoolExecutor crashes (OOM, segfault, etc.), Python raises BrokenExecutor (concretely its subclass BrokenProcessPool) and the pool becomes unusable: every later submit() on that pool fails as well. Previously, this was caught as a generic Exception and logged, so the evolution process failed silently.
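To illustrate why a dedicated handler is needed, here is a minimal sketch of the exception ordering. The `classify` helper and its return strings are hypothetical and not part of this PR; only the exception classes come from the standard library.

```python
from concurrent.futures import BrokenExecutor
from concurrent.futures.process import BrokenProcessPool


def classify(exc):
    """Hypothetical helper: show which handler a given exception reaches."""
    try:
        raise exc
    except BrokenExecutor:
        # Must come before the generic handler, otherwise a dead pool is
        # treated like any other per-iteration failure.
        return "recover-pool"
    except Exception:
        return "log-and-continue"


# BrokenProcessPool is what ProcessPoolExecutor raises when a worker dies.
print(classify(BrokenProcessPool("worker died")))
print(classify(ValueError("bad program")))
```

Because BrokenProcessPool derives from BrokenExecutor, catching the base class also covers the process-pool case.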

Changes

  • Add explicit BrokenExecutor exception handling in run_evolution()
  • Add _recover_process_pool() method that gracefully shuts down the broken executor, runs garbage
    collection, waits briefly for system stabilization, and recreates the pool
  • Re-queue all pending iterations after recovery
  • Track recovery attempts with a limit of 3 consecutive failures to prevent infinite loops
  • Reset recovery counter after successful iterations (only consecutive crashes count toward the limit)
  • Propagate BrokenExecutor from _submit_iteration() for centralized handling
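The recovery method described above could look roughly like the following sketch. It is not the code from this PR: the `PoolOwner` class, the `executor_factory` and `settle_seconds` parameters, and the choice of `shutdown(wait=False, cancel_futures=True)` (Python 3.9+) are assumptions made for illustration.

```python
import gc
import time
from concurrent.futures import ProcessPoolExecutor


class PoolOwner:
    """Hypothetical stand-in for the controller; only the recovery path is shown."""

    def __init__(self, num_workers=2, executor_factory=ProcessPoolExecutor,
                 settle_seconds=2.0):
        self.num_workers = num_workers
        self._executor_factory = executor_factory  # injectable for testing
        self._settle_seconds = settle_seconds      # "wait briefly" from the description
        self.executor = executor_factory(max_workers=num_workers)

    def _recover_process_pool(self):
        old, self.executor = self.executor, None
        try:
            # Assumption: do not block on a pool whose workers are already dead.
            old.shutdown(wait=False, cancel_futures=True)
        except Exception:
            pass  # the pool is broken anyway; best-effort cleanup
        gc.collect()                      # release references held by the old pool
        time.sleep(self._settle_seconds)  # give the OS time to reclaim resources
        self.executor = self._executor_factory(max_workers=self.num_workers)
```

Injecting the factory keeps the sketch testable without spawning real worker processes.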

Behavior
When a worker crashes:

  1. Detect BrokenExecutor exception
  2. Collect all pending iteration numbers
  3. Shut down broken pool, run GC, wait 2s
  4. Recreate fresh pool
  5. Re-queue failed iterations
  6. Continue evolution

If 3 crashes occur without any successful iterations in between, evolution stops gracefully.
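The steps and the consecutive-crash limit above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the PR's run_evolution(): `run_with_recovery`, `submit`, and `recover` are hypothetical names, and the sketch assumes the third consecutive crash is the one that stops the run.

```python
from collections import deque
from concurrent.futures import BrokenExecutor

MAX_RECOVERY_ATTEMPTS = 3  # limit stated in the description


def run_with_recovery(iterations, submit, recover,
                      max_attempts=MAX_RECOVERY_ATTEMPTS):
    """Run iterations, recreating the pool when BrokenExecutor is raised.

    submit(i) returns a Future; recover() replaces the broken pool.
    Returns (results, finished), where finished is False if the limit was hit.
    """
    queue = deque(iterations)
    pending = {}              # iteration number -> Future
    results = {}
    consecutive_crashes = 0

    while queue or pending:
        try:
            while queue:
                i = queue[0]
                pending[i] = submit(i)      # may raise BrokenExecutor
                queue.popleft()
            i, future = next(iter(pending.items()))
            results[i] = future.result()    # may raise BrokenExecutor
            del pending[i]
            consecutive_crashes = 0         # a success resets the counter
        except BrokenExecutor:
            consecutive_crashes += 1
            if consecutive_crashes >= max_attempts:
                return results, False       # stop instead of looping forever
            requeue = sorted(pending)       # collect pending iteration numbers
            pending.clear()
            queue.extendleft(reversed(requeue))  # re-queue them at the front
            recover()
    return results, True
```

Results already stored before a crash are kept; only iterations still pending at crash time are re-run.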

Recreation of #395
