-
Notifications
You must be signed in to change notification settings - Fork 699
feat(scale-up): add runner count caching to reduce EC2 API rate limit… #4983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(scale-up): add runner count caching to reduce EC2 API rate limit… #4983
Conversation
…ing (github-aws-runners#4710) Implements a multi-tier caching strategy to address EC2 DescribeInstances API rate limiting in high-volume environments (20K+ runners/day): - In-memory TTL cache (5s) for within-invocation deduplication - DynamoDB-based persistent cache with EventBridge for cross-invocation consistency using EC2 state change events - Atomic counters for accurate runner count tracking - Feature is opt-in via `runner_count_cache = { enable = true }` This can reduce EC2 API calls by 90%+ and eliminate 15+ second latency spikes caused by DescribeInstances throttling. Closes github-aws-runners#4710 Signed-off-by: s1v4-d <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a multi-tier caching strategy to address EC2 DescribeInstances API rate limiting in high-volume environments (20K+ runners/day). The solution introduces an in-memory TTL cache (5s) for within-invocation deduplication and a DynamoDB-based persistent cache updated via EventBridge EC2 state change events for cross-invocation consistency. The feature is opt-in via runner_count_cache = { enable = true } configuration.
Key Changes:
- Event-driven DynamoDB counter maintained by EventBridge-triggered Lambda function
- Multi-tier cache fallback strategy (DynamoDB → in-memory → EC2 API)
- Atomic counter operations to track active runner counts per environment/type/owner
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| variables.runner-count-cache.tf | Root-level variable definition for runner count cache feature configuration |
| modules/runners/variables.tf | Module-level variables to pass cache configuration to runners module |
| modules/runners/scale-up.tf | IAM policy and environment variables for scale-up Lambda to read from DynamoDB cache |
| modules/runner-count-cache/versions.tf | Terraform and provider version constraints for new cache module |
| modules/runner-count-cache/variables.tf | Comprehensive variable definitions for cache module configuration |
| modules/runner-count-cache/outputs.tf | Module outputs exposing DynamoDB table, Lambda function, and cache configuration |
| modules/runner-count-cache/main.tf | Core infrastructure: DynamoDB table, EventBridge rule, and event target configuration |
| modules/runner-count-cache/lambda.tf | Lambda function, IAM role, and policies for the counter update function |
| modules/runner-count-cache/README.md | Documentation explaining the feature, architecture, and usage |
| main.tf | Integration of runner-count-cache module into root configuration |
| lambdas/yarn.lock | Added DynamoDB SDK and dependencies for cache functionality |
| lambdas/functions/runner-count-cache/vitest.config.ts | Test configuration for runner-count-cache Lambda |
| lambdas/functions/runner-count-cache/tsconfig.json | TypeScript configuration for runner-count-cache Lambda |
| lambdas/functions/runner-count-cache/src/lambda.ts | EventBridge-triggered Lambda that updates atomic counters in DynamoDB |
| lambdas/functions/runner-count-cache/package.json | Package dependencies and scripts for runner-count-cache Lambda |
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Modified to use multi-tier cache strategy before calling EC2 API |
| lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts | Updated tests to reset cache state between test runs |
| lambdas/functions/control-plane/src/scale-runners/cache.ts | Implementation of in-memory and DynamoDB cache classes |
| lambdas/functions/control-plane/src/scale-runners/cache.test.ts | Comprehensive unit tests for both cache implementations |
| lambdas/functions/control-plane/package.json | Added DynamoDB SDK dependency for cache read operations |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| required_providers { | ||
| aws = { | ||
| source = "hashicorp/aws" | ||
| version = ">= 6.21" |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AWS provider version constraint in this new module is >= 6.21, which is inconsistent with the README documentation that states ~> 5.27 is required. This could cause confusion or compatibility issues if users are on AWS provider version 5.x. Consider aligning the version constraint with the documented requirements or updating the README accordingly.
| version = ">= 6.21" | |
| version = "~> 5.27" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means the README is most likely not up to date.
| ENABLE_ON_DEMAND_FAILOVER_FOR_ERRORS = jsonencode(var.enable_on_demand_failover_for_errors) | ||
| JOB_RETRY_CONFIG = jsonencode(local.job_retry_config) | ||
| RUNNER_COUNT_CACHE_TABLE_NAME = var.runner_count_cache != null ? var.runner_count_cache.table_name : "" | ||
| RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : 60000 |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The environment variable RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS is set even when runner_count_cache is null. This would set it to the default value of 60000 instead of an empty string. Consider using a ternary operation similar to the table name to ensure consistency: var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : ""
| RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : 60000 | |
| RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : "" |
| let increment = 0; | ||
| if (state === 'running' || state === 'pending') { | ||
| increment = 1; | ||
| } else if (state === 'terminated' || state === 'stopped' || state === 'shutting-down') { | ||
| increment = -1; | ||
| } |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The atomic counter increment/decrement logic counts "pending" and "running" states as +1, but there's a potential double-counting issue. When an instance transitions from "pending" to "running", it would increment twice (once for pending, once for running). Consider only counting "running" state as +1, or implementing logic to prevent double-counting during state transitions.
| const isStale = Date.now() - updated > dynamoDbRunnerCountCache.staleThresholdMs; | ||
|
|
||
| logger.debug('DynamoDB cache hit', { pk, count, isStale, ageMs: Date.now() - updated }); | ||
|
|
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The negative count protection Math.max(0, count) at line 231 is good defensive programming. However, if negative counts occur, this silently corrects them without alerting operators. Consider adding a warning log when a negative count is detected, as it indicates a synchronization issue that may need investigation.
| if (count < 0) { | |
| logger.warn('Negative runner count read from DynamoDB cache, normalizing to 0', { | |
| pk, | |
| rawCount: count, | |
| }); | |
| } |
| actions = [ | ||
| "ec2:DescribeInstances", | ||
| "ec2:DescribeTags", | ||
| ] |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EC2 permissions grant ec2:DescribeInstances and ec2:DescribeTags on all resources ("*"). While this is common for EC2 describe operations, consider documenting why this broad permission is necessary, especially since the counter Lambda should only need to describe specific instances that trigger the EventBridge rule.
| ] | |
| ] | |
| # EC2 Describe* APIs do not support resource-level restrictions and must use "*". | |
| # This read-only access is used by the counter Lambda to read tags for instances | |
| # that triggered the EventBridge rule. |
| githubInstallationClient, | ||
| ); | ||
|
|
||
| // Update the cache with the new runner count to avoid stale data in subsequent iterations |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential race condition in the cache increment logic at line 489. When multiple instances are created, the cache is incremented after all instances are created, but the DynamoDB cache (which tracks state changes) might also be updating. This could lead to discrepancies. Consider documenting this known limitation or implementing additional synchronization.
| // Update the cache with the new runner count to avoid stale data in subsequent iterations | |
| // Update the cache with the new runner count to avoid stale data in subsequent iterations. | |
| // NOTE: This is a best-effort, process-local update and may race with updates made via the | |
| // DynamoDB-backed runner count cache. In the event of concurrent scale-up operations for the | |
| // same (environment, runnerType, group), this increment can temporarily diverge from the | |
| // authoritative count stored in DynamoDB. |
| if (dynamoResult !== null && !dynamoResult.isStale) { | ||
| currentRunners = dynamoResult.count; | ||
| cacheSource = 'dynamodb'; | ||
| logger.debug('Using DynamoDB cached runner count', { | ||
| currentRunners, | ||
| group, | ||
| cacheSource, | ||
| }); |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stale check at line 389 uses !dynamoResult.isStale, but if the cache is stale, the code falls through to try in-memory cache and then EC2 API. However, the stale count is still available and might be better than nothing if all subsequent fallbacks also fail or are slow. Consider logging when a stale cache is ignored, or using the stale value with a warning to reduce EC2 API calls even further.
| if (dynamoResult !== null && !dynamoResult.isStale) { | |
| currentRunners = dynamoResult.count; | |
| cacheSource = 'dynamodb'; | |
| logger.debug('Using DynamoDB cached runner count', { | |
| currentRunners, | |
| group, | |
| cacheSource, | |
| }); | |
| if (dynamoResult !== null) { | |
| if (!dynamoResult.isStale) { | |
| currentRunners = dynamoResult.count; | |
| cacheSource = 'dynamodb'; | |
| logger.debug('Using DynamoDB cached runner count', { | |
| currentRunners, | |
| group, | |
| cacheSource, | |
| }); | |
| } else { | |
| logger.debug('Ignoring stale DynamoDB cached runner count', { | |
| group, | |
| environment, | |
| runnerType, | |
| isStale: dynamoResult.isStale, | |
| }); | |
| } |
| // Initialize DynamoDB cache if configured | ||
| const dynamoDbCacheTableName = process.env.RUNNER_COUNT_CACHE_TABLE_NAME; | ||
| const dynamoDbCacheStaleThreshold = parseInt(process.env.RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS || '60000', 10); | ||
| if (dynamoDbCacheTableName && process.env.AWS_REGION) { | ||
| dynamoDbRunnerCountCache.initialize(dynamoDbCacheTableName, process.env.AWS_REGION, dynamoDbCacheStaleThreshold); | ||
| logger.info('DynamoDB runner count cache enabled', { | ||
| tableName: dynamoDbCacheTableName, | ||
| staleThresholdMs: dynamoDbCacheStaleThreshold, | ||
| }); | ||
| } |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DynamoDB cache initialization happens at module load time (lines 17-23), which means the cache is initialized once when the Lambda starts. If environment variables change between invocations (unlikely but possible in some scenarios), the cache would not pick up those changes. Consider whether this is the intended behavior or if initialization should happen per invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentional. Initializing the DynamoDB client at module load time is a standard Lambda optimization pattern that ensures the client is ready when the handler is invoked, reducing latency for warm invocations. The environment variables are available during module initialization.
| source = ["aws.ec2"] | ||
| detail-type = ["EC2 Instance State-change Notification"] | ||
| detail = { | ||
| state = ["running", "pending", "terminated", "stopped", "shutting-down"] |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EventBridge rule captures state changes for all EC2 instances, not just GitHub Action runners. The filtering happens in the Lambda function based on tags. In high-volume AWS accounts with many non-runner EC2 instances, this could result in unnecessary Lambda invocations and costs. Consider adding tag-based filtering in the EventBridge rule pattern if possible to reduce unnecessary invocations.
| state = ["running", "pending", "terminated", "stopped", "shutting-down"] | |
| state = ["running", "pending", "terminated", "stopped", "shutting-down"] | |
| "tag:ghr:runner" = ["true"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, EC2 State Change Notification events do not include tag information in the event payload. The event only contains instance-id, state, and account/region. Tag-based filtering would require tagging-related events (like TagChange) which are separate from state changes. The Lambda function handles this by fetching instance details via DescribeInstances and filtering by tags before updating the counter. The environment filter in the EventBridge rule helps limit events to the correct AWS account.
| } | ||
|
|
||
| variable "counter_lambda_timeout" { | ||
| description = "Time out for the counter update lambda in seconds." |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description says "Time out for the counter update lambda" but should be "Timeout for the counter update lambda" (single word).
| description = "Time out for the counter update lambda in seconds." | |
| description = "Timeout for the counter update lambda in seconds." |
- Fix double-counting issue: only count 'running' state, ignore 'pending' - Add warning log when DynamoDB returns negative counts - Add EC2 permission comment explaining why resource='*' is required - Fix 'Time out' typo to 'Timeout' in variable descriptions - Fix README AWS provider version mismatch (~>5.27 -> >=6.21) - Add race condition documentation between cache layers - Add debug log when stale DynamoDB cache is used - Fix Terraform formatting Signed-off-by: s1v4-d <[email protected]>
|
@s1v4-d good to see you taking this challenge. Just wondering is the PR in your view ready for review. Since the PR is quite big, let me know if you would prefer also some early highlevel feedback. |
Yes, I would prefer a highlevel feedback. |
…ing (#4710)
Implements a multi-tier caching strategy to address EC2 DescribeInstances API rate limiting in high-volume environments (20K+ runners/day):
runner_count_cache = { enable = true }This can reduce EC2 API calls by 90%+ and eliminate 15+ second latency spikes caused by DescribeInstances throttling.
Closes #4710