Reindex core fixes and test enhancements #5264

jestradaMS · 2025-12-08T15:09:47Z

##Dependent on #5263

Description

This pull request introduces several improvements and bug fixes to the reindexing workflow and search parameter cache refresh logic in the FHIR server codebase. The main focus is on ensuring more robust handling of search parameter updates, improving job orchestration for reindexing, and addressing concurrency and cancellation scenarios. Key changes include enhancements to the search parameter cache refresh configuration, more accurate job status updates, improved resource counting logic, and better handling of job cancellation and heartbeat conflicts.

Search Parameter Cache Refresh Improvements

Added a new configuration property SearchParameterCacheRefreshMaxInitialDelaySeconds to CoreFeatureConfiguration to allow for a randomized initial delay when starting the cache refresh background service, helping to prevent thundering herd issues. The default is 15 seconds, and it can be set to 0 for testing.
Updated tests to use a 1-second refresh interval and no initial delay for faster and more predictable unit test execution.
Improved the timer logic in tests to wait up to 2 seconds, ensuring the background service has time to execute and log appropriately.

Reindex Job Orchestration and Processing

Refactored the ReindexOrchestratorJob to always refresh the search parameter cache before and after job creation, and after updating zero-count search parameters, ensuring the most up-to-date search parameter definitions are used throughout the reindex process. [1] [2] [3]
Enhanced the logic for creating reindex processing jobs by double-checking resource counts and updating the job record and search parameter status for zero-count resource types, improving accuracy and consistency.
Improved job cancellation handling by throwing OperationCanceledException promptly when cancellation is requested, both before job creation and when enqueuing processing jobs.

Job Status and Error Handling

Fixed the logic for updating job status to Failed only when there are no in-flight or cancelled jobs, preventing premature failure status in race conditions. Also, if there are still in-flight jobs, status is set to Running to avoid sending incorrect consumer status. [1] [2]
Improved the heartbeat logic during job completion checks to catch and log JobConflictException without failing the job, allowing for concurrent updates by multiple workers.

Resource Counting and Hash Map Handling

Changed the resource type hash map property in ReindexJobRecord to internal set for better encapsulation and updated the logic for retrieving hash values to ensure correctness. [1] [2]
Fixed the condition in CheckJobRecordForAnyWork to correctly identify when there is work to be done based on resource counts.

Other Codebase Improvements

Added missing using directives and improved code organization in several files for clarity and maintainability. [1] [2] [3]

Related issues

Addresses AB#175213, AB#178077, AB#178083.

Testing

Describe how this change was tested.

FHIR Team Checklist

Update the title of the PR to be succinct and less than 65 characters
Add a milestone to the PR for the sprint that it is merged (i.e. add S47)
Tag the PR with the type of update: Bug, Build, Dependencies, Enhancement, New-Feature or Documentation
Tag the PR with Open source, Azure API for FHIR (CosmosDB or common code) or Azure Healthcare APIs (SQL or common code) to specify where this change is intended to be released.
Tag the PR with Schema Version backward compatible or Schema Version backward incompatible or Schema Version unchanged if this adds or updates Sql script which is/is not backward compatible with the code.
When changing or adding behavior, if your code modifies the system design or changes design assumptions, please create and include an ADR.
CI is green before merge
Review squash-merge requirements

Semver Change (docs)

Patch|Skip|Feature|Breaking (reason)

…ures Add RetryFactAttribute and supporting infrastructure to enable automatic retry of flaky tests in xunit. This is particularly useful for integration and end-to-end tests that may experience transient failures. Changes: - Add RetryFactAttribute with configurable max retries, delay, and assertion failure handling - Add RetryFactDiscoverer for test case discovery - Add RetryTestCase to implement retry logic with detailed diagnostics - Refactor existing tests to use RetryFact where appropriate for transient failures The retry mechanism defaults to 3 attempts with 5 second delays and skips assertion failures unless explicitly configured, helping distinguish between genuine test bugs and environmental issues.

…tests - Set SearchParameterCacheRefreshIntervalSeconds to 1 second in tests for faster execution - Add SearchParameterCacheRefreshMaxInitialDelaySeconds config option with default of 15 seconds to prevent thundering herd on timer startup - Set initial delay to 0 in tests to eliminate random startup delays - Increase test timeout from 200ms to 2000ms to account for timer initialization - Fix reindex operation status logic to check for cancelled jobs and in-flight jobs to prevent incorrect status transitions during race conditions - Fix to Gen1 to ensure all records are retrieve when continuation token is supplied - Fix to Gen1 GetReindexResults query to ensure we only get Active records to reindex - Fix to ensure we have correct hashmap at time of execution of reindex jobs vs. what was suppliled during creation in handler

github-advanced-security

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

src/Microsoft.Health.Fhir.Core/Features/Operations/FhirOperationDataStoreBase.cs

src/Microsoft.Health.Fhir.Core/Features/Operations/Reindex/ReindexOrchestratorJob.cs

LTA-Thinking · 2025-12-10T15:59:38Z

src/Microsoft.Health.Fhir.Core/Configs/CoreFeatureConfiguration.cs

+
+        /// <summary>
+        /// Gets or sets the maximum initial delay in seconds for the SearchParameter cache background service timer.
+        /// This random delay (0 to this value) staggers timer startup across instances to prevent thundering herd.


I have never heard the term 'thundering herd' used to describe an influx of traffic due to start up across instances. I like it :)

jestradaMS added 2 commits December 8, 2025 08:43

jestradaMS added this to the FY26\Q2\2Wk\2Wk11 milestone Dec 8, 2025

jestradaMS requested a review from a team as a code owner December 8, 2025 15:09

Adding in missed file

314a38f

jestradaMS mentioned this pull request Dec 8, 2025

Reindex testing improvements and fixes #5240

Closed

1 task

github-advanced-security bot found potential problems Dec 8, 2025

View reviewed changes

increasing blame hang for reindex test

bbb9dab

SergeyGaluzo reviewed Dec 9, 2025

View reviewed changes

src/Microsoft.Health.Fhir.Core/Features/Operations/FhirOperationDataStoreBase.cs Outdated Show resolved Hide resolved

SergeyGaluzo reviewed Dec 9, 2025

View reviewed changes

src/Microsoft.Health.Fhir.Core/Features/Operations/FhirOperationDataStoreBase.cs Show resolved Hide resolved

SergeyGaluzo reviewed Dec 9, 2025

View reviewed changes

src/Microsoft.Health.Fhir.Core/Features/Operations/Reindex/ReindexOrchestratorJob.cs Show resolved Hide resolved

fixing typo

47b6cd5

Base automatically changed from users/jestrada/retryfact-infrastructure to main December 9, 2025 21:41

Merge branch 'main' into users/jestrada/reindex-core-fixes

0a3b7a5

LTA-Thinking reviewed Dec 10, 2025

View reviewed changes

Merge branch 'main' into users/jestrada/reindex-core-fixes

4788eb6

jestradaMS enabled auto-merge (squash) December 11, 2025 16:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reindex core fixes and test enhancements #5264

Reindex core fixes and test enhancements #5264

Uh oh!

jestradaMS commented Dec 8, 2025 •

edited

Loading

Uh oh!

github-advanced-security bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LTA-Thinking Dec 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Reindex core fixes and test enhancements #5264

Are you sure you want to change the base?

Reindex core fixes and test enhancements #5264

Uh oh!

Conversation

jestradaMS commented Dec 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Testing

FHIR Team Checklist

Semver Change (docs)

Uh oh!

github-advanced-security bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LTA-Thinking Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jestradaMS commented Dec 8, 2025 •

edited

Loading