
Conversation

s1v4-d commented Jan 5, 2026

…ing (#4710)

Implements a multi-tier caching strategy to address EC2 DescribeInstances API rate limiting in high-volume environments (20K+ runners/day):

  • In-memory TTL cache (5s) for within-invocation deduplication
  • DynamoDB-based persistent cache with EventBridge for cross-invocation consistency using EC2 state change events
  • Atomic counters for accurate runner count tracking
  • Feature is opt-in via runner_count_cache = { enable = true }

This can reduce EC2 API calls by 90%+ and eliminate 15+ second latency spikes caused by DescribeInstances throttling.

Closes #4710
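The in-memory TTL layer described above can be pictured with a small sketch; the class and key names below are illustrative, not the PR's actual implementation:

```typescript
// Minimal sketch of a per-invocation TTL cache (illustrative names, not the PR's code).
type Entry = { value: number; expiresAt: number };

class TtlCache {
  private entries = new Map<string, Entry>();

  constructor(private ttlMs: number) {}

  get(key: string): number | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      // Older than the TTL: drop the entry and report a miss.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: number): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

// A 5-second TTL mirrors the within-invocation deduplication window described above.
const runnerCountCache = new TtlCache(5000);
runnerCountCache.set('env/type/owner', 12);
```

Because the cache lives in module scope, repeated lookups within one Lambda invocation (or across warm invocations inside the TTL window) avoid redundant DescribeInstances calls.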


Copilot AI left a comment

Pull request overview

This PR implements a multi-tier caching strategy to address EC2 DescribeInstances API rate limiting in high-volume environments (20K+ runners/day). The solution introduces an in-memory TTL cache (5s) for within-invocation deduplication and a DynamoDB-based persistent cache updated via EventBridge EC2 state change events for cross-invocation consistency. The feature is opt-in via runner_count_cache = { enable = true } configuration.

Key Changes:

  • Event-driven DynamoDB counter maintained by EventBridge-triggered Lambda function
  • Multi-tier cache fallback strategy (DynamoDB → in-memory → EC2 API)
  • Atomic counter operations to track active runner counts per environment/type/owner
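The fallback order (DynamoDB → in-memory → EC2 API) can be sketched as follows; `readDynamoCount`, `readMemoryCount`, and `describeInstancesCount` are hypothetical stand-ins for the PR's actual readers:

```typescript
// Sketch of the multi-tier lookup order; the three readers are hypothetical stand-ins.
type CacheResult = { count: number; isStale: boolean } | null;

async function resolveRunnerCount(
  readDynamoCount: () => Promise<CacheResult>,
  readMemoryCount: () => number | undefined,
  describeInstancesCount: () => Promise<number>,
): Promise<{ count: number; source: string }> {
  // Tier 1: event-driven DynamoDB counter, used only when fresh.
  const dynamo = await readDynamoCount();
  if (dynamo !== null && !dynamo.isStale) {
    return { count: dynamo.count, source: 'dynamodb' };
  }
  // Tier 2: process-local TTL cache.
  const memory = readMemoryCount();
  if (memory !== undefined) {
    return { count: memory, source: 'memory' };
  }
  // Tier 3: authoritative but rate-limited EC2 DescribeInstances call.
  return { count: await describeInstancesCount(), source: 'ec2' };
}
```

The EC2 API is only reached when both cache tiers miss, which is how the PR cuts DescribeInstances volume.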

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 12 comments.

File Description
variables.runner-count-cache.tf Root-level variable definition for runner count cache feature configuration
modules/runners/variables.tf Module-level variables to pass cache configuration to runners module
modules/runners/scale-up.tf IAM policy and environment variables for scale-up Lambda to read from DynamoDB cache
modules/runner-count-cache/versions.tf Terraform and provider version constraints for new cache module
modules/runner-count-cache/variables.tf Comprehensive variable definitions for cache module configuration
modules/runner-count-cache/outputs.tf Module outputs exposing DynamoDB table, Lambda function, and cache configuration
modules/runner-count-cache/main.tf Core infrastructure: DynamoDB table, EventBridge rule, and event target configuration
modules/runner-count-cache/lambda.tf Lambda function, IAM role, and policies for the counter update function
modules/runner-count-cache/README.md Documentation explaining the feature, architecture, and usage
main.tf Integration of runner-count-cache module into root configuration
lambdas/yarn.lock Added DynamoDB SDK and dependencies for cache functionality
lambdas/functions/runner-count-cache/vitest.config.ts Test configuration for runner-count-cache Lambda
lambdas/functions/runner-count-cache/tsconfig.json TypeScript configuration for runner-count-cache Lambda
lambdas/functions/runner-count-cache/src/lambda.ts EventBridge-triggered Lambda that updates atomic counters in DynamoDB
lambdas/functions/runner-count-cache/package.json Package dependencies and scripts for runner-count-cache Lambda
lambdas/functions/control-plane/src/scale-runners/scale-up.ts Modified to use multi-tier cache strategy before calling EC2 API
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts Updated tests to reset cache state between test runs
lambdas/functions/control-plane/src/scale-runners/cache.ts Implementation of in-memory and DynamoDB cache classes
lambdas/functions/control-plane/src/scale-runners/cache.test.ts Comprehensive unit tests for both cache implementations
lambdas/functions/control-plane/package.json Added DynamoDB SDK dependency for cache read operations


required_providers {
  aws = {
    source  = "hashicorp/aws"
    version = ">= 6.21"

Copilot AI Jan 5, 2026

The AWS provider version constraint in this new module is >= 6.21, which is inconsistent with the README documentation that states ~> 5.27 is required. This could cause confusion or compatibility issues if users are on AWS provider version 5.x. Consider aligning the version constraint with the documented requirements or updating the README accordingly.

Suggested change
version = ">= 6.21"
version = "~> 5.27"


Member

this means the README is most likely not up to date.

ENABLE_ON_DEMAND_FAILOVER_FOR_ERRORS = jsonencode(var.enable_on_demand_failover_for_errors)
JOB_RETRY_CONFIG = jsonencode(local.job_retry_config)
RUNNER_COUNT_CACHE_TABLE_NAME = var.runner_count_cache != null ? var.runner_count_cache.table_name : ""
RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : 60000

Copilot AI Jan 5, 2026

The environment variable RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS is set even when runner_count_cache is null. This would set it to the default value of 60000 instead of an empty string. Consider using a ternary operation similar to the table name to ensure consistency: var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : ""

Suggested change
RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : 60000
RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : ""

Comment on lines 157 to 162
let increment = 0;
if (state === 'running' || state === 'pending') {
  increment = 1;
} else if (state === 'terminated' || state === 'stopped' || state === 'shutting-down') {
  increment = -1;
}

Copilot AI Jan 5, 2026

The atomic counter increment/decrement logic counts "pending" and "running" states as +1, but there's a potential double-counting issue. When an instance transitions from "pending" to "running", it would increment twice (once for pending, once for running). Consider only counting "running" state as +1, or implementing logic to prevent double-counting during state transitions.
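One way to count each lifecycle exactly once, in the spirit of this suggestion (a sketch, not the PR's merged logic):

```typescript
// Sketch of a transition policy that counts each instance lifecycle exactly once:
// only 'running' increments and only 'terminated' decrements, so a pending→running
// transition and a shutting-down→terminated sequence each change the counter once.
function counterDelta(state: string): number {
  if (state === 'running') return 1;
  if (state === 'terminated') return -1;
  return 0; // 'pending', 'stopped', 'shutting-down', etc. are ignored here
}
```

Treating 'stopped' or 'shutting-down' as decrements would need extra deduplication, since a later 'terminated' event for the same instance would otherwise decrement a second time.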

const isStale = Date.now() - updated > dynamoDbRunnerCountCache.staleThresholdMs;

logger.debug('DynamoDB cache hit', { pk, count, isStale, ageMs: Date.now() - updated });

Copilot AI Jan 5, 2026

The negative count protection Math.max(0, count) at line 231 is good defensive programming. However, if negative counts occur, this silently corrects them without alerting operators. Consider adding a warning log when a negative count is detected, as it indicates a synchronization issue that may need investigation.

Suggested change
if (count < 0) {
  logger.warn('Negative runner count read from DynamoDB cache, normalizing to 0', {
    pk,
    rawCount: count,
  });
}

actions = [
  "ec2:DescribeInstances",
  "ec2:DescribeTags",
]

Copilot AI Jan 5, 2026

The EC2 permissions grant ec2:DescribeInstances and ec2:DescribeTags on all resources ("*"). While this is common for EC2 describe operations, consider documenting why this broad permission is necessary, especially since the counter Lambda should only need to describe specific instances that trigger the EventBridge rule.

Suggested change
]
]
# EC2 Describe* APIs do not support resource-level restrictions and must use "*".
# This read-only access is used by the counter Lambda to read tags for instances
# that triggered the EventBridge rule.

githubInstallationClient,
);

// Update the cache with the new runner count to avoid stale data in subsequent iterations

Copilot AI Jan 5, 2026

There's a potential race condition in the cache increment logic at line 489. When multiple instances are created, the cache is incremented after all instances are created, but the DynamoDB cache (which tracks state changes) might also be updating. This could lead to discrepancies. Consider documenting this known limitation or implementing additional synchronization.

Suggested change
// Update the cache with the new runner count to avoid stale data in subsequent iterations
// Update the cache with the new runner count to avoid stale data in subsequent iterations.
// NOTE: This is a best-effort, process-local update and may race with updates made via the
// DynamoDB-backed runner count cache. In the event of concurrent scale-up operations for the
// same (environment, runnerType, group), this increment can temporarily diverge from the
// authoritative count stored in DynamoDB.

Comment on lines +389 to +396
if (dynamoResult !== null && !dynamoResult.isStale) {
  currentRunners = dynamoResult.count;
  cacheSource = 'dynamodb';
  logger.debug('Using DynamoDB cached runner count', {
    currentRunners,
    group,
    cacheSource,
  });

Copilot AI Jan 5, 2026

The stale check at line 389 uses !dynamoResult.isStale, but if the cache is stale, the code falls through to try in-memory cache and then EC2 API. However, the stale count is still available and might be better than nothing if all subsequent fallbacks also fail or are slow. Consider logging when a stale cache is ignored, or using the stale value with a warning to reduce EC2 API calls even further.

Suggested change
if (dynamoResult !== null && !dynamoResult.isStale) {
  currentRunners = dynamoResult.count;
  cacheSource = 'dynamodb';
  logger.debug('Using DynamoDB cached runner count', {
    currentRunners,
    group,
    cacheSource,
  });
if (dynamoResult !== null) {
  if (!dynamoResult.isStale) {
    currentRunners = dynamoResult.count;
    cacheSource = 'dynamodb';
    logger.debug('Using DynamoDB cached runner count', {
      currentRunners,
      group,
      cacheSource,
    });
  } else {
    logger.debug('Ignoring stale DynamoDB cached runner count', {
      group,
      environment,
      runnerType,
      isStale: dynamoResult.isStale,
    });
  }

Comment on lines +14 to +23
// Initialize DynamoDB cache if configured
const dynamoDbCacheTableName = process.env.RUNNER_COUNT_CACHE_TABLE_NAME;
const dynamoDbCacheStaleThreshold = parseInt(process.env.RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS || '60000', 10);
if (dynamoDbCacheTableName && process.env.AWS_REGION) {
  dynamoDbRunnerCountCache.initialize(dynamoDbCacheTableName, process.env.AWS_REGION, dynamoDbCacheStaleThreshold);
  logger.info('DynamoDB runner count cache enabled', {
    tableName: dynamoDbCacheTableName,
    staleThresholdMs: dynamoDbCacheStaleThreshold,
  });
}

Copilot AI Jan 5, 2026

The DynamoDB cache initialization happens at module load time (lines 17-23), which means the cache is initialized once when the Lambda starts. If environment variables change between invocations (unlikely but possible in some scenarios), the cache would not pick up those changes. Consider whether this is the intended behavior or if initialization should happen per invocation.


s1v4-d (Author)

This is intentional. Initializing the DynamoDB client at module load time is a standard Lambda optimization pattern that ensures the client is ready when the handler is invoked, reducing latency for warm invocations. The environment variables are available during module initialization.
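The idiom the author describes, constructing clients at module scope so warm invocations reuse them, can be sketched generically (this is not the PR's code; the counter is only there to make the once-per-cold-start behavior visible):

```typescript
// Generic sketch of the module-scope initialization idiom for Lambda.
// Anything evaluated here runs once per cold start; warm invocations reuse it.
let coldStartInits = 0;

function createClient(tableName: string): { tableName: string } {
  coldStartInits++; // stand-in for constructing an SDK client such as DynamoDBClient
  return { tableName };
}

// Module scope: evaluated when the Lambda runtime first loads this file,
// with environment variables already available.
const client = createClient(process.env.RUNNER_COUNT_CACHE_TABLE_NAME ?? 'runner-count-cache');

async function handler(): Promise<string> {
  // Each invocation reuses the same client; no per-invocation setup cost.
  return client.tableName;
}
```

Repeated handler calls reuse the same client instance, which is exactly the warm-invocation latency win the author is pointing at.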

source      = ["aws.ec2"]
detail-type = ["EC2 Instance State-change Notification"]
detail = {
  state = ["running", "pending", "terminated", "stopped", "shutting-down"]

Copilot AI Jan 5, 2026

The EventBridge rule captures state changes for all EC2 instances, not just GitHub Action runners. The filtering happens in the Lambda function based on tags. In high-volume AWS accounts with many non-runner EC2 instances, this could result in unnecessary Lambda invocations and costs. Consider adding tag-based filtering in the EventBridge rule pattern if possible to reduce unnecessary invocations.

Suggested change
state = ["running", "pending", "terminated", "stopped", "shutting-down"]
state = ["running", "pending", "terminated", "stopped", "shutting-down"]
"tag:ghr:runner" = ["true"]


s1v4-d (Author)

Unfortunately, EC2 State Change Notification events do not include tag information in the event payload. The event only contains instance-id, state, and account/region. Tag-based filtering would require tagging-related events (like TagChange) which are separate from state changes. The Lambda function handles this by fetching instance details via DescribeInstances and filtering by tags before updating the counter. The environment filter in the EventBridge rule helps limit events to the correct AWS account.
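The author's point can be illustrated with a representative event payload; the field values below are made up, but the shape follows AWS's documented state-change notification, and the `ghr:environment` tag key is an assumption about this module's tagging:

```typescript
// Representative shape of an EC2 Instance State-change Notification event:
// `detail` carries only the instance id and state, never tags, which is why
// EventBridge pattern matching cannot filter on runner tags.
const sampleEvent = {
  source: 'aws.ec2',
  'detail-type': 'EC2 Instance State-change Notification',
  region: 'eu-west-1',
  detail: {
    'instance-id': 'i-0123456789abcdef0',
    state: 'running',
  },
};

// The Lambda therefore filters after the fact: it fetches the instance's tags
// via DescribeInstances and ignores instances that are not managed runners.
// 'ghr:environment' is an assumed tag key; the module's actual key may differ.
function isRunnerInstance(tags: Record<string, string>): boolean {
  return 'ghr:environment' in tags;
}
```

Non-runner instances in the same account still trigger the Lambda, but are dropped cheaply after the tag lookup.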

}

variable "counter_lambda_timeout" {
  description = "Time out for the counter update lambda in seconds."

Copilot AI Jan 5, 2026

The description says "Time out for the counter update lambda" but should be "Timeout for the counter update lambda" (single word).

Suggested change
description = "Time out for the counter update lambda in seconds."
description = "Timeout for the counter update lambda in seconds."

- Fix double-counting issue: only count 'running' state, ignore 'pending'
- Add warning log when DynamoDB returns negative counts
- Add EC2 permission comment explaining why resource='*' is required
- Fix 'Time out' typo to 'Timeout' in variable descriptions
- Fix README AWS provider version mismatch (~>5.27 -> >=6.21)
- Add race condition documentation between cache layers
- Add debug log when stale DynamoDB cache is used
- Fix Terraform formatting

Signed-off-by: s1v4-d <[email protected]>

npalm (Member) commented Jan 6, 2026

@s1v4-d good to see you taking on this challenge. Just wondering, is the PR in your view ready for review? Since the PR is quite big, let me know if you would also prefer some early high-level feedback.

s1v4-d (Author) commented Jan 7, 2026

@s1v4-d good to see you taking on this challenge. Just wondering, is the PR in your view ready for review? Since the PR is quite big, let me know if you would also prefer some early high-level feedback.

Yes, I would prefer high-level feedback.



Development

Successfully merging this pull request may close these issues.

Looking up number of active runners can be slow and cause API rate limit
