fix: Move publishRetryMessage to end of processing loop to avoid duplicate retries #4966

Copilot · 2025-12-18T20:42:58Z

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Per reviewer feedback, publishRetryMessage is now called at the end of the processing loop for each message that is NOT marked as invalid. This prevents duplicate retry messages for the same event, since invalid messages already go back to the SQS queue for retry. Key changes: - Track validMessagesForRetry separately from all messages - Only messages that pass job queued check are added to validMessagesForRetry - publishRetryMessage is called after runner creation, not before - Messages marked as invalid (e.g., max runners reached, creation failed) are excluded from retry message publishing Tests updated to reflect new behavior: - publishRetryMessage is called AFTER runner creation - Messages marked invalid do not get retry messages published - All test scenarios updated with proper mock runner creation Co-authored-by: npalm <[email protected]>

Address code review feedback: - Use Set for invalidMessageIds to improve lookup performance from O(n) to O(1) - Extract duplicate retry message publishing logic into helper function - Maintain both invalidMessages array (for return value) and Set (for fast lookups) This improves performance when processing large message batches and makes the code more maintainable by eliminating duplication. Co-authored-by: npalm <[email protected]>

github-actions · 2025-12-18T21:03:50Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

npalm · 2025-12-18T21:28:22Z

@copilot can you re-check, I think naming can be a bit better. Do we really need this complicty?

Simplified the retry message logic based on reviewer feedback: - Renamed `validMessagesForRetry` to `queuedMessages` (clearer intent) - Renamed `invalidMessageIds` to `rejectedMessageIds` (more accurate) - Removed `invalidMessages` array duplication - now using only Set - Removed `publishRetryMessagesForValid` helper function (inline instead) - Convert Set to array only at return statement Result: clearer naming, less complexity, same functionality. Net reduction of 6 lines while maintaining all tests passing. Co-authored-by: npalm <[email protected]>

Copilot · 2025-12-18T21:33:08Z

@copilot can you re-check, I think naming can be a bit better. Do we really need this complicty?

You're right! I've simplified the code:

Naming improvements:

validMessagesForRetry → queuedMessages (clearer - these are messages with queued jobs)
invalidMessageIds → rejectedMessageIds (more accurate terminology)

Complexity reduction:

Removed duplicate tracking: was using both invalidMessages array AND invalidMessageIds Set
Now using only rejectedMessageIds Set, converting to array at return
Removed publishRetryMessagesForValid helper function - inlined the simple loop instead
Net result: 6 fewer lines, clearer intent

Fixed in commit b5399f6.

npalm · 2025-12-18T23:08:25Z

@copilot adjust the workflow https://github.com/github-aws-runners/terraform-aws-github-runner/blob/main/.github/workflows/lambda.yml to run on any pr not only the onces to main

Remove branch restriction from lambda.yml workflow to support stacked PRs and feature branch workflows. The workflow will now run on any PR that modifies lambda code or the workflow itself. Co-authored-by: npalm <[email protected]>

Copilot · 2025-12-18T23:10:23Z

@copilot adjust the workflow https://github.com/github-aws-runners/terraform-aws-github-runner/blob/main/.github/workflows/lambda.yml to run on any pr not only the onces to main

Updated the lambda workflow to run on all PRs by removing the branches: - main restriction. Now it will trigger on any PR that modifies lambda code or the workflow itself, supporting stacked PRs and feature branch workflows.

Fixed in commit cba7bbf.

lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts

stuartp44 · 2026-01-05T13:23:27Z

@copilot fix your formatting issues please

Fix formatting issues identified by prettier: - Remove trailing whitespace in scale-up.test.ts - Format .prettierrc and eslint.config.js - Format various tsconfig.json and other config files Co-authored-by: stuartp44 <[email protected]>

Copilot · 2026-01-05T13:27:54Z

@copilot fix your formatting issues please

Fixed formatting issues in commit ef5c981. The main issue was trailing whitespace in scale-up.test.ts. Also reformatted config files (.prettierrc, eslint.config.js, and various tsconfig.json files) to match prettier's style. All files now pass yarn format-check.

stuartp44 · 2026-01-05T13:34:33Z

@copilot can you fix the test error? it looks like we have an issue with {"level":"INFO","message":"Found: '1' active GitHub EC2 runner instances before clean-up.","timestamp":"2026-01-05T13:31:35.464Z","service":"service_undefined","sampling_rate":0,"environment":"N/A","module":"scale-down"} {"level":"INFO","message":"Found: '1' active GitHub EC2 runners instances after clean-up.","timestamp":"2026-01-05T13:31:35.465Z","service":"service_undefined","sampling_rate":0,"environment":"N/A","module":"scale-down"} {"level":"INFO","message":"Terminating orphan runner 'i-orphan-1-repo'","timestamp":"2026-01-05T13:31:35.466Z","service":"service_undefined","sampling_rate":0,"environment":"N/A","module":"scale-down"} {"level":"WARN","message":"Failure during orphan termination processing.","timestamp":"2026-01-05T13:31:35.466Z","service":"service_undefined","sampling_rate":0,"environment":"N/A","module":"scale-down","error":{"name":"Error","location":"file:///__w/terraform-aws-github-runner/terraform-aws-github-runner/lambdas/node_modules/@vitest/spy/dist/index.js:285","message":"Failed to terminate","stack":"Error: Failed to terminate\n at /__w/terraform-aws-github-runner/terraform-aws-github-runner/lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts:532:17\n at Mock (file:///__w/terraform-aws-github-runner/terraform-aws-github-runner/lambdas/node_modules/@vitest/spy/dist/index.js:285:34)\n at terminateOrphan (/__w/terraform

The merge from stu/fix_job_retry incorrectly reverted the core change where publishRetryMessage was moved to the end of the loop. This fix restores the correct behavior: - Restore queuedMessages.push(message) instead of await publishRetryMessage(message) - Remove conflicting test assertions from merge (lines expecting both publishRetryMessage NOT to be called AND to be called 2 times) - Reorder test assertions to check listEC2Runners before publishRetryMessage All tests now pass with the correct behavior where publishRetryMessage is called AFTER runner creation and only for non-rejected messages. Co-authored-by: stuartp44 <[email protected]>

Copilot · 2026-01-05T13:40:27Z

@copilot can you fix the test error? it looks like we have an issue with ``` {"level":"INFO","message":"Found: '1' active GitHub EC2 runner instances before clean-up.","timestamp":"2026-01-05T13:31:3...

Fixed in commit 0243458. The issue was caused by an incorrect merge resolution from the stu/fix_job_retry branch that reverted my core changes. The merge:

Changed queuedMessages.push(message) back to await publishRetryMessage(message) (reverting the fix)
Added conflicting test assertions (expecting publishRetryMessage both NOT to be called AND to be called 2 times)

Restored the correct behavior where publishRetryMessage is called at the END of the loop, only for non-rejected messages. All 334 control-plane tests now pass.

Brend-Smits · 2026-01-08T09:50:02Z

@copilot please keep the changes in this pull requests to a minimal. I'm not sure why you have touched so many files, this was not part of the original objective.
Keep focus and revert any changes to files that are not needed to solve the original objective.

Per reviewer feedback to keep changes minimal and focused on the original objective. Reverting: - Formatting changes to config files (.prettierrc, eslint.config.js, tsconfig.json files, nx.json, vitest.base.config.ts) - Formatting changes to test resource JSON files - Workflow change to lambda.yml (restoring branches: main restriction) Only keeping the essential changes: - lambdas/functions/control-plane/src/scale-runners/scale-up.ts - lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts These are the only files needed to fix the original issue of moving publishRetryMessage to the end of the processing loop to avoid duplicate retries. Co-authored-by: Brend-Smits <[email protected]>

Copilot · 2026-01-08T09:59:40Z

@copilot please keep the changes in this pull requests to a minimal. I'm not sure why you have touched so many files, this was not part of the original objective.
Keep focus and revert any changes t...

Reverted unnecessary changes in commit c2712a6. The PR now only modifies the two essential files:

lambdas/functions/control-plane/src/scale-runners/scale-up.ts (the core fix)
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts (tests for the fix)

Removed all formatting changes (config files, test resources) and the workflow change that were not part of the original objective. All tests, linting, and formatting checks still pass.

Brend-Smits · 2026-01-08T10:55:50Z

This looks good to me now @stuartp44, what do you think?

Brend-Smits · 2026-01-09T10:03:18Z

@npalm can we get this merged so we can proceed with #4961

Initial plan

3e24ec6

Copilot AI assigned Copilot and npalm Dec 18, 2025

Copilot AI mentioned this pull request Dec 18, 2025

fix: job retry mechanism not triggering #4961

Open

Copilot started work on behalf of npalm December 18, 2025 20:43 View session

Copilot AI and others added 2 commits December 18, 2025 20:54

Copilot AI changed the title ~~[WIP] Address feedback on job retry mechanism fix~~ Move publishRetryMessage to end of processing loop to avoid duplicate retries Dec 18, 2025

Copilot AI requested a review from npalm December 18, 2025 20:59

Copilot finished work on behalf of npalm December 18, 2025 20:59

Copilot started work on behalf of npalm December 18, 2025 21:28 View session

Copilot finished work on behalf of npalm December 18, 2025 21:35

Copilot started work on behalf of npalm December 18, 2025 23:08 View session

Copilot finished work on behalf of npalm December 18, 2025 23:11

npalm reviewed Dec 19, 2025

View reviewed changes

lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts Show resolved Hide resolved

lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts Show resolved Hide resolved

lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts Show resolved Hide resolved

Copilot started work on behalf of npalm December 19, 2025 09:49 View session

Copilot AI requested a review from npalm December 19, 2025 09:56

Copilot finished work on behalf of npalm December 19, 2025 09:56

Merge branch 'stu/fix_job_retry' into copilot/sub-pr-4961

dd365f1

stuartp44 changed the title ~~Move publishRetryMessage to end of processing loop to avoid duplicate retries~~ fix: Move publishRetryMessage to end of processing loop to avoid duplicate retries Jan 5, 2026

Copilot started work on behalf of stuartp44 January 5, 2026 13:24 View session

fix: apply prettier formatting fixes

ef5c981

Fix formatting issues identified by prettier: - Remove trailing whitespace in scale-up.test.ts - Format .prettierrc and eslint.config.js - Format various tsconfig.json and other config files Co-authored-by: stuartp44 <[email protected]>

Copilot AI requested a review from stuartp44 January 5, 2026 13:29

Copilot finished work on behalf of stuartp44 January 5, 2026 13:29

Copilot started work on behalf of stuartp44 January 5, 2026 13:35 View session

Copilot finished work on behalf of stuartp44 January 5, 2026 13:41

Copilot started work on behalf of Brend-Smits January 8, 2026 09:50 View session

Copilot AI requested a review from Brend-Smits January 8, 2026 10:00

Copilot finished work on behalf of Brend-Smits January 8, 2026 10:00

Brend-Smits approved these changes Jan 9, 2026

View reviewed changes

fix: Move publishRetryMessage to end of processing loop to avoid duplicate retries #4966

Are you sure you want to change the base?

fix: Move publishRetryMessage to end of processing loop to avoid duplicate retries #4966

Conversation

Copilot AI commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependency Review

Scanned Files

Uh oh!

npalm commented Dec 18, 2025

Uh oh!

Copilot AI commented Dec 18, 2025

Uh oh!

npalm commented Dec 18, 2025

Uh oh!

Copilot AI commented Dec 18, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

stuartp44 commented Jan 5, 2026

Uh oh!

Copilot AI commented Jan 5, 2026

Uh oh!

stuartp44 commented Jan 5, 2026

Uh oh!

Copilot AI commented Jan 5, 2026

Uh oh!

Brend-Smits commented Jan 8, 2026

Uh oh!

Copilot AI commented Jan 8, 2026

Uh oh!

Brend-Smits commented Jan 8, 2026

Uh oh!

Brend-Smits commented Jan 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Copilot AI commented Dec 18, 2025 •

edited

Loading

github-actions bot commented Dec 18, 2025 •

edited

Loading