
Conversation

s1v4-d commented Jan 5, 2026

…ing (#4710)

Implements a multi-tier caching strategy to address EC2 DescribeInstances API rate limiting in high-volume environments (20K+ runners/day):

  • In-memory TTL cache (5s) for within-invocation deduplication
  • DynamoDB-based persistent cache with EventBridge for cross-invocation consistency using EC2 state change events
  • Atomic counters for accurate runner count tracking
  • Feature is opt-in via runner_count_cache = { enable = true }

This can reduce EC2 API calls by 90%+ and eliminate 15+ second latency spikes caused by DescribeInstances throttling.

Closes #4710
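The in-memory TTL layer described above can be pictured with a small sketch; the class and key names below are illustrative, not the PR's actual implementation:

```typescript
// Minimal sketch of a per-invocation TTL cache (illustrative names, not the PR's code).
type Entry = { value: number; expiresAt: number };

class TtlCache {
  private entries = new Map<string, Entry>();

  constructor(private ttlMs: number) {}

  get(key: string): number | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      // Older than the TTL: drop the entry and report a miss.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: number): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

// A 5-second TTL mirrors the within-invocation deduplication window described above.
const runnerCountCache = new TtlCache(5000);
runnerCountCache.set('env/type/owner', 12);
```

Because the cache lives in module scope, repeated lookups within one Lambda invocation (or across warm invocations inside the TTL window) avoid redundant DescribeInstances calls.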


Copilot AI left a comment

Pull request overview

This PR implements a multi-tier caching strategy to address EC2 DescribeInstances API rate limiting in high-volume environments (20K+ runners/day). The solution introduces an in-memory TTL cache (5s) for within-invocation deduplication and a DynamoDB-based persistent cache updated via EventBridge EC2 state change events for cross-invocation consistency. The feature is opt-in via runner_count_cache = { enable = true } configuration.

Key Changes:

  • Event-driven DynamoDB counter maintained by EventBridge-triggered Lambda function
  • Multi-tier cache fallback strategy (DynamoDB → in-memory → EC2 API)
  • Atomic counter operations to track active runner counts per environment/type/owner
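The fallback order (DynamoDB → in-memory → EC2 API) can be sketched as follows; `readDynamoCount`, `readMemoryCount`, and `describeInstancesCount` are hypothetical stand-ins for the PR's actual readers:

```typescript
// Sketch of the multi-tier lookup order; the three readers are hypothetical stand-ins.
type CacheResult = { count: number; isStale: boolean } | null;

async function resolveRunnerCount(
  readDynamoCount: () => Promise<CacheResult>,
  readMemoryCount: () => number | undefined,
  describeInstancesCount: () => Promise<number>,
): Promise<{ count: number; source: string }> {
  // Tier 1: event-driven DynamoDB counter, used only when fresh.
  const dynamo = await readDynamoCount();
  if (dynamo !== null && !dynamo.isStale) {
    return { count: dynamo.count, source: 'dynamodb' };
  }
  // Tier 2: process-local TTL cache.
  const memory = readMemoryCount();
  if (memory !== undefined) {
    return { count: memory, source: 'memory' };
  }
  // Tier 3: authoritative but rate-limited EC2 DescribeInstances call.
  return { count: await describeInstancesCount(), source: 'ec2' };
}
```

The EC2 API is only reached when both cache tiers miss, which is how the PR cuts DescribeInstances volume.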

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 12 comments.

File Description
variables.runner-count-cache.tf Root-level variable definition for runner count cache feature configuration
modules/runners/variables.tf Module-level variables to pass cache configuration to runners module
modules/runners/scale-up.tf IAM policy and environment variables for scale-up Lambda to read from DynamoDB cache
modules/runner-count-cache/versions.tf Terraform and provider version constraints for new cache module
modules/runner-count-cache/variables.tf Comprehensive variable definitions for cache module configuration
modules/runner-count-cache/outputs.tf Module outputs exposing DynamoDB table, Lambda function, and cache configuration
modules/runner-count-cache/main.tf Core infrastructure: DynamoDB table, EventBridge rule, and event target configuration
modules/runner-count-cache/lambda.tf Lambda function, IAM role, and policies for the counter update function
modules/runner-count-cache/README.md Documentation explaining the feature, architecture, and usage
main.tf Integration of runner-count-cache module into root configuration
lambdas/yarn.lock Added DynamoDB SDK and dependencies for cache functionality
lambdas/functions/runner-count-cache/vitest.config.ts Test configuration for runner-count-cache Lambda
lambdas/functions/runner-count-cache/tsconfig.json TypeScript configuration for runner-count-cache Lambda
lambdas/functions/runner-count-cache/src/lambda.ts EventBridge-triggered Lambda that updates atomic counters in DynamoDB
lambdas/functions/runner-count-cache/package.json Package dependencies and scripts for runner-count-cache Lambda
lambdas/functions/control-plane/src/scale-runners/scale-up.ts Modified to use multi-tier cache strategy before calling EC2 API
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts Updated tests to reset cache state between test runs
lambdas/functions/control-plane/src/scale-runners/cache.ts Implementation of in-memory and DynamoDB cache classes
lambdas/functions/control-plane/src/scale-runners/cache.test.ts Comprehensive unit tests for both cache implementations
lambdas/functions/control-plane/package.json Added DynamoDB SDK dependency for cache read operations


required_providers {
  aws = {
    source  = "hashicorp/aws"
    version = ">= 6.21"

Copilot AI Jan 5, 2026

The AWS provider version constraint in this new module is >= 6.21, which is inconsistent with the README documentation that states ~> 5.27 is required. This could cause confusion or compatibility issues if users are on AWS provider version 5.x. Consider aligning the version constraint with the documented requirements or updating the README accordingly.

Suggested change
version = ">= 6.21"
version = "~> 5.27"


Member

this means the README is most likely not up to date.

ENABLE_ON_DEMAND_FAILOVER_FOR_ERRORS = jsonencode(var.enable_on_demand_failover_for_errors)
JOB_RETRY_CONFIG = jsonencode(local.job_retry_config)
RUNNER_COUNT_CACHE_TABLE_NAME = var.runner_count_cache != null ? var.runner_count_cache.table_name : ""
RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : 60000

Copilot AI Jan 5, 2026

The environment variable RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS is set even when runner_count_cache is null. This would set it to the default value of 60000 instead of an empty string. Consider using a ternary operation similar to the table name to ensure consistency: var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : ""

Suggested change
RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : 60000
RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS = var.runner_count_cache != null ? var.runner_count_cache.stale_threshold_ms : ""

Comment on lines 157 to 162
let increment = 0;
if (state === 'running' || state === 'pending') {
  increment = 1;
} else if (state === 'terminated' || state === 'stopped' || state === 'shutting-down') {
  increment = -1;
}

Copilot AI Jan 5, 2026

The atomic counter increment/decrement logic counts "pending" and "running" states as +1, but there's a potential double-counting issue. When an instance transitions from "pending" to "running", it would increment twice (once for pending, once for running). Consider only counting "running" state as +1, or implementing logic to prevent double-counting during state transitions.
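One way to count each lifecycle exactly once, in the spirit of this suggestion (a sketch, not the PR's merged logic):

```typescript
// Sketch of a transition policy that counts each instance lifecycle exactly once:
// only 'running' increments and only 'terminated' decrements, so a pending→running
// transition and a shutting-down→terminated sequence each change the counter once.
function counterDelta(state: string): number {
  if (state === 'running') return 1;
  if (state === 'terminated') return -1;
  return 0; // 'pending', 'stopped', 'shutting-down', etc. are ignored here
}
```

Treating 'stopped' or 'shutting-down' as decrements would need extra deduplication, since a later 'terminated' event for the same instance would otherwise decrement a second time.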

const isStale = Date.now() - updated > dynamoDbRunnerCountCache.staleThresholdMs;

logger.debug('DynamoDB cache hit', { pk, count, isStale, ageMs: Date.now() - updated });

Copilot AI Jan 5, 2026

The negative count protection Math.max(0, count) at line 231 is good defensive programming. However, if negative counts occur, this silently corrects them without alerting operators. Consider adding a warning log when a negative count is detected, as it indicates a synchronization issue that may need investigation.

Suggested change
if (count < 0) {
  logger.warn('Negative runner count read from DynamoDB cache, normalizing to 0', {
    pk,
    rawCount: count,
  });
}

actions = [
  "ec2:DescribeInstances",
  "ec2:DescribeTags",
]

Copilot AI Jan 5, 2026

The EC2 permissions grant ec2:DescribeInstances and ec2:DescribeTags on all resources ("*"). While this is common for EC2 describe operations, consider documenting why this broad permission is necessary, especially since the counter Lambda should only need to describe specific instances that trigger the EventBridge rule.

Suggested change
]
]
# EC2 Describe* APIs do not support resource-level restrictions and must use "*".
# This read-only access is used by the counter Lambda to read tags for instances
# that triggered the EventBridge rule.

githubInstallationClient,
);

// Update the cache with the new runner count to avoid stale data in subsequent iterations

Copilot AI Jan 5, 2026

There's a potential race condition in the cache increment logic at line 489. When multiple instances are created, the cache is incremented after all instances are created, but the DynamoDB cache (which tracks state changes) might also be updating. This could lead to discrepancies. Consider documenting this known limitation or implementing additional synchronization.

Suggested change
// Update the cache with the new runner count to avoid stale data in subsequent iterations
// Update the cache with the new runner count to avoid stale data in subsequent iterations.
// NOTE: This is a best-effort, process-local update and may race with updates made via the
// DynamoDB-backed runner count cache. In the event of concurrent scale-up operations for the
// same (environment, runnerType, group), this increment can temporarily diverge from the
// authoritative count stored in DynamoDB.

Comment on lines +389 to +396
if (dynamoResult !== null && !dynamoResult.isStale) {
  currentRunners = dynamoResult.count;
  cacheSource = 'dynamodb';
  logger.debug('Using DynamoDB cached runner count', {
    currentRunners,
    group,
    cacheSource,
  });

Copilot AI Jan 5, 2026

The stale check at line 389 uses !dynamoResult.isStale, but if the cache is stale, the code falls through to try in-memory cache and then EC2 API. However, the stale count is still available and might be better than nothing if all subsequent fallbacks also fail or are slow. Consider logging when a stale cache is ignored, or using the stale value with a warning to reduce EC2 API calls even further.

Suggested change
if (dynamoResult !== null && !dynamoResult.isStale) {
  currentRunners = dynamoResult.count;
  cacheSource = 'dynamodb';
  logger.debug('Using DynamoDB cached runner count', {
    currentRunners,
    group,
    cacheSource,
  });
if (dynamoResult !== null) {
  if (!dynamoResult.isStale) {
    currentRunners = dynamoResult.count;
    cacheSource = 'dynamodb';
    logger.debug('Using DynamoDB cached runner count', {
      currentRunners,
      group,
      cacheSource,
    });
  } else {
    logger.debug('Ignoring stale DynamoDB cached runner count', {
      group,
      environment,
      runnerType,
      isStale: dynamoResult.isStale,
    });
  }

Comment on lines +14 to +23
// Initialize DynamoDB cache if configured
const dynamoDbCacheTableName = process.env.RUNNER_COUNT_CACHE_TABLE_NAME;
const dynamoDbCacheStaleThreshold = parseInt(process.env.RUNNER_COUNT_CACHE_STALE_THRESHOLD_MS || '60000', 10);
if (dynamoDbCacheTableName && process.env.AWS_REGION) {
  dynamoDbRunnerCountCache.initialize(dynamoDbCacheTableName, process.env.AWS_REGION, dynamoDbCacheStaleThreshold);
  logger.info('DynamoDB runner count cache enabled', {
    tableName: dynamoDbCacheTableName,
    staleThresholdMs: dynamoDbCacheStaleThreshold,
  });
}

Copilot AI Jan 5, 2026

The DynamoDB cache initialization happens at module load time (lines 17-23), which means the cache is initialized once when the Lambda starts. If environment variables change between invocations (unlikely but possible in some scenarios), the cache would not pick up those changes. Consider whether this is the intended behavior or if initialization should happen per invocation.


s1v4-d (Author)

This is intentional. Initializing the DynamoDB client at module load time is a standard Lambda optimization pattern that ensures the client is ready when the handler is invoked, reducing latency for warm invocations. The environment variables are available during module initialization.
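The idiom the author describes, constructing clients at module scope so warm invocations reuse them, can be sketched generically (this is not the PR's code; the counter is only there to make the once-per-cold-start behavior visible):

```typescript
// Generic sketch of the module-scope initialization idiom for Lambda.
// Anything evaluated here runs once per cold start; warm invocations reuse it.
let coldStartInits = 0;

function createClient(tableName: string): { tableName: string } {
  coldStartInits++; // stand-in for constructing an SDK client such as DynamoDBClient
  return { tableName };
}

// Module scope: evaluated when the Lambda runtime first loads this file,
// with environment variables already available.
const client = createClient(process.env.RUNNER_COUNT_CACHE_TABLE_NAME ?? 'runner-count-cache');

async function handler(): Promise<string> {
  // Each invocation reuses the same client; no per-invocation setup cost.
  return client.tableName;
}
```

Repeated handler calls reuse the same client instance, which is exactly the warm-invocation latency win the author is pointing at.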

source      = ["aws.ec2"]
detail-type = ["EC2 Instance State-change Notification"]
detail = {
  state = ["running", "pending", "terminated", "stopped", "shutting-down"]

Copilot AI Jan 5, 2026

The EventBridge rule captures state changes for all EC2 instances, not just GitHub Action runners. The filtering happens in the Lambda function based on tags. In high-volume AWS accounts with many non-runner EC2 instances, this could result in unnecessary Lambda invocations and costs. Consider adding tag-based filtering in the EventBridge rule pattern if possible to reduce unnecessary invocations.

Suggested change
state = ["running", "pending", "terminated", "stopped", "shutting-down"]
state = ["running", "pending", "terminated", "stopped", "shutting-down"]
"tag:ghr:runner" = ["true"]


s1v4-d (Author)

Unfortunately, EC2 State Change Notification events do not include tag information in the event payload. The event only contains instance-id, state, and account/region. Tag-based filtering would require tagging-related events (like TagChange) which are separate from state changes. The Lambda function handles this by fetching instance details via DescribeInstances and filtering by tags before updating the counter. The environment filter in the EventBridge rule helps limit events to the correct AWS account.
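The author's point can be illustrated with a representative event payload; the field values below are made up, but the shape follows AWS's documented state-change notification, and the `ghr:environment` tag key is an assumption about this module's tagging:

```typescript
// Representative shape of an EC2 Instance State-change Notification event:
// `detail` carries only the instance id and state, never tags, which is why
// EventBridge pattern matching cannot filter on runner tags.
const sampleEvent = {
  source: 'aws.ec2',
  'detail-type': 'EC2 Instance State-change Notification',
  region: 'eu-west-1',
  detail: {
    'instance-id': 'i-0123456789abcdef0',
    state: 'running',
  },
};

// The Lambda therefore filters after the fact: it fetches the instance's tags
// via DescribeInstances and ignores instances that are not managed runners.
// 'ghr:environment' is an assumed tag key; the module's actual key may differ.
function isRunnerInstance(tags: Record<string, string>): boolean {
  return 'ghr:environment' in tags;
}
```

Non-runner instances in the same account still trigger the Lambda, but are dropped cheaply after the tag lookup.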

}

variable "counter_lambda_timeout" {
  description = "Time out for the counter update lambda in seconds."

Copilot AI Jan 5, 2026

The description says "Time out for the counter update lambda" but should be "Timeout for the counter update lambda" (single word).

Suggested change
description = "Time out for the counter update lambda in seconds."
description = "Timeout for the counter update lambda in seconds."

- Fix double-counting issue: only count 'running' state, ignore 'pending'
- Add warning log when DynamoDB returns negative counts
- Add EC2 permission comment explaining why resource='*' is required
- Fix 'Time out' typo to 'Timeout' in variable descriptions
- Fix README AWS provider version mismatch (~>5.27 -> >=6.21)
- Add race condition documentation between cache layers
- Add debug log when stale DynamoDB cache is used
- Fix Terraform formatting

Signed-off-by: s1v4-d <[email protected]>

npalm (Member) commented Jan 6, 2026

@s1v4-d good to see you taking on this challenge. Just wondering, is the PR in your view ready for review? Since the PR is quite big, let me know if you would also prefer some early high-level feedback.

s1v4-d (Author) commented Jan 7, 2026

@s1v4-d good to see you taking on this challenge. Just wondering, is the PR in your view ready for review? Since the PR is quite big, let me know if you would also prefer some early high-level feedback.

Yes, I would prefer high-level feedback.



Development

Successfully merging this pull request may close these issues.

Looking up number of active runners can be slow and cause API rate limit
