Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Dec 18, 2025

  • Understand the current flow and identify where publishRetryMessage is called
  • Remove the publishRetryMessage call from line 360 (early in the loop)
  • Add logic to track which messages should have retry messages published
  • Call publishRetryMessage at the end of the loop for messages not marked as invalid
  • Update tests to reflect the new behavior (publishRetryMessage after runner creation)
  • Validate changes with linting and testing
  • Address code review feedback: use Set for performance, extract helper function
  • Simplify naming and reduce complexity per reviewer feedback
  • Update lambda workflow to run on all PRs
  • Fix formatting issues
  • Fix merge conflicts from stu/fix_job_retry branch
  • Revert unnecessary changes to keep PR minimal and focused

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 2 commits December 18, 2025 20:54
Per reviewer feedback, publishRetryMessage is now called at the end of the
processing loop for each message that is NOT marked as invalid. This prevents
duplicate retry messages for the same event, since invalid messages already go
back to the SQS queue for retry.

Key changes:
- Track validMessagesForRetry separately from all messages
- Only messages that pass job queued check are added to validMessagesForRetry
- publishRetryMessage is called after runner creation, not before
- Messages marked as invalid (e.g., max runners reached, creation failed) are
  excluded from retry message publishing

Tests updated to reflect new behavior:
- publishRetryMessage is called AFTER runner creation
- Messages marked invalid do not get retry messages published
- All test scenarios updated with proper mock runner creation

Co-authored-by: npalm <[email protected]>
Address code review feedback:
- Use Set for invalidMessageIds to improve lookup performance from O(n) to O(1)
- Extract duplicate retry message publishing logic into helper function
- Maintain both invalidMessages array (for return value) and Set (for fast lookups)

This improves performance when processing large message batches and makes the
code more maintainable by eliminating duplication.

Co-authored-by: npalm <[email protected]>
Copilot AI changed the title [WIP] Address feedback on job retry mechanism fix Move publishRetryMessage to end of processing loop to avoid duplicate retries Dec 18, 2025
Copilot AI requested a review from npalm December 18, 2025 20:59
@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@npalm
Copy link
Member

npalm commented Dec 18, 2025

@copilot can you re-check, I think naming can be a bit better. Do we really need this complicty?

Simplified the retry message logic based on reviewer feedback:
- Renamed `validMessagesForRetry` to `queuedMessages` (clearer intent)
- Renamed `invalidMessageIds` to `rejectedMessageIds` (more accurate)
- Removed `invalidMessages` array duplication - now using only Set
- Removed `publishRetryMessagesForValid` helper function (inline instead)
- Convert Set to array only at return statement

Result: clearer naming, less complexity, same functionality.
Net reduction of 6 lines while maintaining all tests passing.

Co-authored-by: npalm <[email protected]>
Copy link
Contributor Author

Copilot AI commented Dec 18, 2025

@copilot can you re-check, I think naming can be a bit better. Do we really need this complicty?

You're right! I've simplified the code:

Naming improvements:

  • validMessagesForRetryqueuedMessages (clearer - these are messages with queued jobs)
  • invalidMessageIdsrejectedMessageIds (more accurate terminology)

Complexity reduction:

  • Removed duplicate tracking: was using both invalidMessages array AND invalidMessageIds Set
  • Now using only rejectedMessageIds Set, converting to array at return
  • Removed publishRetryMessagesForValid helper function - inlined the simple loop instead
  • Net result: 6 fewer lines, clearer intent

Fixed in commit b5399f6.

@npalm
Copy link
Member

npalm commented Dec 18, 2025

@copilot adjust the workflow https://github.com/github-aws-runners/terraform-aws-github-runner/blob/main/.github/workflows/lambda.yml to run on any pr not only the onces to main

Remove branch restriction from lambda.yml workflow to support stacked PRs
and feature branch workflows. The workflow will now run on any PR that
modifies lambda code or the workflow itself.

Co-authored-by: npalm <[email protected]>
Copy link
Contributor Author

Copilot AI commented Dec 18, 2025

@copilot adjust the workflow https://github.com/github-aws-runners/terraform-aws-github-runner/blob/main/.github/workflows/lambda.yml to run on any pr not only the onces to main

Updated the lambda workflow to run on all PRs by removing the branches: - main restriction. Now it will trigger on any PR that modifies lambda code or the workflow itself, supporting stacked PRs and feature branch workflows.

Fixed in commit cba7bbf.

@stuartp44 stuartp44 changed the title Move publishRetryMessage to end of processing loop to avoid duplicate retries fix: Move publishRetryMessage to end of processing loop to avoid duplicate retries Jan 5, 2026
@stuartp44
Copy link
Contributor

@copilot fix your formatting issues please

Fix formatting issues identified by prettier:
- Remove trailing whitespace in scale-up.test.ts
- Format .prettierrc and eslint.config.js
- Format various tsconfig.json and other config files

Co-authored-by: stuartp44 <[email protected]>
Copy link
Contributor Author

Copilot AI commented Jan 5, 2026

@copilot fix your formatting issues please

Fixed formatting issues in commit ef5c981. The main issue was trailing whitespace in scale-up.test.ts. Also reformatted config files (.prettierrc, eslint.config.js, and various tsconfig.json files) to match prettier's style. All files now pass yarn format-check.

Copilot AI requested a review from stuartp44 January 5, 2026 13:29
@stuartp44
Copy link
Contributor

@copilot can you fix the test error? it looks like we have an issue with {"level":"INFO","message":"Found: '1' active GitHub EC2 runner instances before clean-up.","timestamp":"2026-01-05T13:31:35.464Z","service":"service_undefined","sampling_rate":0,"environment":"N/A","module":"scale-down"} {"level":"INFO","message":"Found: '1' active GitHub EC2 runners instances after clean-up.","timestamp":"2026-01-05T13:31:35.465Z","service":"service_undefined","sampling_rate":0,"environment":"N/A","module":"scale-down"} {"level":"INFO","message":"Terminating orphan runner 'i-orphan-1-repo'","timestamp":"2026-01-05T13:31:35.466Z","service":"service_undefined","sampling_rate":0,"environment":"N/A","module":"scale-down"} {"level":"WARN","message":"Failure during orphan termination processing.","timestamp":"2026-01-05T13:31:35.466Z","service":"service_undefined","sampling_rate":0,"environment":"N/A","module":"scale-down","error":{"name":"Error","location":"file:///__w/terraform-aws-github-runner/terraform-aws-github-runner/lambdas/node_modules/@vitest/spy/dist/index.js:285","message":"Failed to terminate","stack":"Error: Failed to terminate\n at /__w/terraform-aws-github-runner/terraform-aws-github-runner/lambdas/functions/control-plane/src/scale-runners/scale-down.test.ts:532:17\n at Mock (file:///__w/terraform-aws-github-runner/terraform-aws-github-runner/lambdas/node_modules/@vitest/spy/dist/index.js:285:34)\n at terminateOrphan (/__w/terraform

The merge from stu/fix_job_retry incorrectly reverted the core change
where publishRetryMessage was moved to the end of the loop. This fix
restores the correct behavior:

- Restore queuedMessages.push(message) instead of await publishRetryMessage(message)
- Remove conflicting test assertions from merge (lines expecting both
  publishRetryMessage NOT to be called AND to be called 2 times)
- Reorder test assertions to check listEC2Runners before publishRetryMessage

All tests now pass with the correct behavior where publishRetryMessage
is called AFTER runner creation and only for non-rejected messages.

Co-authored-by: stuartp44 <[email protected]>
Copy link
Contributor Author

Copilot AI commented Jan 5, 2026

@copilot can you fix the test error? it looks like we have an issue with ``` {"level":"INFO","message":"Found: '1' active GitHub EC2 runner instances before clean-up.","timestamp":"2026-01-05T13:31:3...

Fixed in commit 0243458. The issue was caused by an incorrect merge resolution from the stu/fix_job_retry branch that reverted my core changes. The merge:

  1. Changed queuedMessages.push(message) back to await publishRetryMessage(message) (reverting the fix)
  2. Added conflicting test assertions (expecting publishRetryMessage both NOT to be called AND to be called 2 times)

Restored the correct behavior where publishRetryMessage is called at the END of the loop, only for non-rejected messages. All 334 control-plane tests now pass.

@Brend-Smits
Copy link
Contributor

@copilot please keep the changes in this pull requests to a minimal. I'm not sure why you have touched so many files, this was not part of the original objective.
Keep focus and revert any changes to files that are not needed to solve the original objective.

Per reviewer feedback to keep changes minimal and focused on the
original objective. Reverting:

- Formatting changes to config files (.prettierrc, eslint.config.js,
  tsconfig.json files, nx.json, vitest.base.config.ts)
- Formatting changes to test resource JSON files
- Workflow change to lambda.yml (restoring branches: main restriction)

Only keeping the essential changes:
- lambdas/functions/control-plane/src/scale-runners/scale-up.ts
- lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts

These are the only files needed to fix the original issue of moving
publishRetryMessage to the end of the processing loop to avoid
duplicate retries.

Co-authored-by: Brend-Smits <[email protected]>
Copy link
Contributor Author

Copilot AI commented Jan 8, 2026

@copilot please keep the changes in this pull requests to a minimal. I'm not sure why you have touched so many files, this was not part of the original objective.
Keep focus and revert any changes t...

Reverted unnecessary changes in commit c2712a6. The PR now only modifies the two essential files:

  • lambdas/functions/control-plane/src/scale-runners/scale-up.ts (the core fix)
  • lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts (tests for the fix)

Removed all formatting changes (config files, test resources) and the workflow change that were not part of the original objective. All tests, linting, and formatting checks still pass.

@Brend-Smits
Copy link
Contributor

This looks good to me now @stuartp44, what do you think?

@Brend-Smits
Copy link
Contributor

@npalm can we get this merged so we can proceed with #4961

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants