Skip to content

Conversation

@jestradaMS
Copy link
Contributor

@jestradaMS jestradaMS commented Dec 8, 2025

##Dependent on #5263

Description

This pull request introduces several improvements and bug fixes to the reindexing workflow and search parameter cache refresh logic in the FHIR server codebase. The main focus is on ensuring more robust handling of search parameter updates, improving job orchestration for reindexing, and addressing concurrency and cancellation scenarios. Key changes include enhancements to the search parameter cache refresh configuration, more accurate job status updates, improved resource counting logic, and better handling of job cancellation and heartbeat conflicts.

Search Parameter Cache Refresh Improvements

  • Added a new configuration property SearchParameterCacheRefreshMaxInitialDelaySeconds to CoreFeatureConfiguration to allow for a randomized initial delay when starting the cache refresh background service, helping to prevent thundering herd issues. The default is 15 seconds, and it can be set to 0 for testing.
  • Updated tests to use a 1-second refresh interval and no initial delay for faster and more predictable unit test execution.
  • Improved the timer logic in tests to wait up to 2 seconds, ensuring the background service has time to execute and log appropriately.

Reindex Job Orchestration and Processing

  • Refactored the ReindexOrchestratorJob to always refresh the search parameter cache before and after job creation, and after updating zero-count search parameters, ensuring the most up-to-date search parameter definitions are used throughout the reindex process. [1] [2] [3]
  • Enhanced the logic for creating reindex processing jobs by double-checking resource counts and updating the job record and search parameter status for zero-count resource types, improving accuracy and consistency.
  • Improved job cancellation handling by throwing OperationCanceledException promptly when cancellation is requested, both before job creation and when enqueuing processing jobs.

Job Status and Error Handling

  • Fixed the logic for updating job status to Failed only when there are no in-flight or cancelled jobs, preventing premature failure status in race conditions. Also, if there are still in-flight jobs, status is set to Running to avoid sending incorrect consumer status. [1] [2]
  • Improved the heartbeat logic during job completion checks to catch and log JobConflictException without failing the job, allowing for concurrent updates by multiple workers.

Resource Counting and Hash Map Handling

  • Changed the resource type hash map property in ReindexJobRecord to internal set for better encapsulation and updated the logic for retrieving hash values to ensure correctness. [1] [2]
  • Fixed the condition in CheckJobRecordForAnyWork to correctly identify when there is work to be done based on resource counts.

Other Codebase Improvements

  • Added missing using directives and improved code organization in several files for clarity and maintainability. [1] [2] [3]

Related issues

Addresses AB#175213, AB#178077, AB#178083.

Testing

Describe how this change was tested.

FHIR Team Checklist

  • Update the title of the PR to be succinct and less than 65 characters
  • Add a milestone to the PR for the sprint that it is merged (i.e. add S47)
  • Tag the PR with the type of update: Bug, Build, Dependencies, Enhancement, New-Feature or Documentation
  • Tag the PR with Open source, Azure API for FHIR (CosmosDB or common code) or Azure Healthcare APIs (SQL or common code) to specify where this change is intended to be released.
  • Tag the PR with Schema Version backward compatible or Schema Version backward incompatible or Schema Version unchanged if this adds or updates Sql script which is/is not backward compatible with the code.
  • When changing or adding behavior, if your code modifies the system design or changes design assumptions, please create and include an ADR.
  • CI is green before merge Build Status
  • Review squash-merge requirements

Semver Change (docs)

Patch|Skip|Feature|Breaking (reason)

…ures

Add RetryFactAttribute and supporting infrastructure to enable automatic retry of flaky tests in xunit. This is particularly useful for integration and end-to-end tests that may experience transient failures.

Changes:
- Add RetryFactAttribute with configurable max retries, delay, and assertion failure handling
- Add RetryFactDiscoverer for test case discovery
- Add RetryTestCase to implement retry logic with detailed diagnostics
- Refactor existing tests to use RetryFact where appropriate for transient failures

The retry mechanism defaults to 3 attempts with 5 second delays and skips assertion failures unless explicitly configured, helping distinguish between genuine test bugs and environmental issues.
…tests

- Set SearchParameterCacheRefreshIntervalSeconds to 1 second in tests for faster execution
- Add SearchParameterCacheRefreshMaxInitialDelaySeconds config option with default of 15 seconds to prevent thundering herd on timer startup
- Set initial delay to 0 in tests to eliminate random startup delays
- Increase test timeout from 200ms to 2000ms to account for timer initialization
- Fix reindex operation status logic to check for cancelled jobs and in-flight jobs to prevent incorrect status transitions during race conditions
- Fix to Gen1 to ensure all records are retrieve when continuation token is supplied
- Fix to Gen1 GetReindexResults query to ensure we only get Active records to reindex
- Fix to ensure we have correct hashmap at time of execution of reindex jobs vs. what was suppliled during creation in handler
@jestradaMS jestradaMS added this to the FY26\Q2\2Wk\2Wk11 milestone Dec 8, 2025
@jestradaMS jestradaMS requested a review from a team as a code owner December 8, 2025 15:09
@jestradaMS jestradaMS added Enhancement-Test Enhancement on tests. Azure API for FHIR Label denotes that the issue or PR is relevant to the Azure API for FHIR Azure Healthcare APIs Label denotes that the issue or PR is relevant to the FHIR service in the Azure Healthcare APIs Schema Version unchanged labels Dec 8, 2025
Copy link
Contributor

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Base automatically changed from users/jestrada/retryfact-infrastructure to main December 9, 2025 21:41

/// <summary>
/// Gets or sets the maximum initial delay in seconds for the SearchParameter cache background service timer.
/// This random delay (0 to this value) staggers timer startup across instances to prevent thundering herd.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never heard the term 'thundering herd' used to describe an influx of traffic due to start up across instances. I like it :)

@jestradaMS jestradaMS enabled auto-merge (squash) December 11, 2025 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Azure API for FHIR Label denotes that the issue or PR is relevant to the Azure API for FHIR Azure Healthcare APIs Label denotes that the issue or PR is relevant to the FHIR service in the Azure Healthcare APIs Enhancement-Test Enhancement on tests. Schema Version unchanged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants