-
Notifications
You must be signed in to change notification settings - Fork 568
Reindex core fixes and test enhancements #5264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ures Add RetryFactAttribute and supporting infrastructure to enable automatic retry of flaky tests in xunit. This is particularly useful for integration and end-to-end tests that may experience transient failures. Changes: - Add RetryFactAttribute with configurable max retries, delay, and assertion failure handling - Add RetryFactDiscoverer for test case discovery - Add RetryTestCase to implement retry logic with detailed diagnostics - Refactor existing tests to use RetryFact where appropriate for transient failures The retry mechanism defaults to 3 attempts with 5 second delays and skips assertion failures unless explicitly configured, helping distinguish between genuine test bugs and environmental issues.
…tests - Set SearchParameterCacheRefreshIntervalSeconds to 1 second in tests for faster execution - Add SearchParameterCacheRefreshMaxInitialDelaySeconds config option with default of 15 seconds to prevent thundering herd on timer startup - Set initial delay to 0 in tests to eliminate random startup delays - Increase test timeout from 200ms to 2000ms to account for timer initialization - Fix reindex operation status logic to check for cancelled jobs and in-flight jobs to prevent incorrect status transitions during race conditions - Fix to Gen1 to ensure all records are retrieve when continuation token is supplied - Fix to Gen1 GetReindexResults query to ensure we only get Active records to reindex - Fix to ensure we have correct hashmap at time of execution of reindex jobs vs. what was suppliled during creation in handler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
src/Microsoft.Health.Fhir.Core/Features/Operations/FhirOperationDataStoreBase.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Health.Fhir.Core/Features/Operations/FhirOperationDataStoreBase.cs
Show resolved
Hide resolved
src/Microsoft.Health.Fhir.Core/Features/Operations/Reindex/ReindexOrchestratorJob.cs
Show resolved
Hide resolved
|
|
||
| /// <summary> | ||
| /// Gets or sets the maximum initial delay in seconds for the SearchParameter cache background service timer. | ||
| /// This random delay (0 to this value) staggers timer startup across instances to prevent thundering herd. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have never heard the term 'thundering herd' used to describe an influx of traffic due to start up across instances. I like it :)
##Dependent on #5263
Description
This pull request introduces several improvements and bug fixes to the reindexing workflow and search parameter cache refresh logic in the FHIR server codebase. The main focus is on ensuring more robust handling of search parameter updates, improving job orchestration for reindexing, and addressing concurrency and cancellation scenarios. Key changes include enhancements to the search parameter cache refresh configuration, more accurate job status updates, improved resource counting logic, and better handling of job cancellation and heartbeat conflicts.
Search Parameter Cache Refresh Improvements
SearchParameterCacheRefreshMaxInitialDelaySecondstoCoreFeatureConfigurationto allow for a randomized initial delay when starting the cache refresh background service, helping to prevent thundering herd issues. The default is 15 seconds, and it can be set to 0 for testing.Reindex Job Orchestration and Processing
ReindexOrchestratorJobto always refresh the search parameter cache before and after job creation, and after updating zero-count search parameters, ensuring the most up-to-date search parameter definitions are used throughout the reindex process. [1] [2] [3]OperationCanceledExceptionpromptly when cancellation is requested, both before job creation and when enqueuing processing jobs.Job Status and Error Handling
Failedonly when there are no in-flight or cancelled jobs, preventing premature failure status in race conditions. Also, if there are still in-flight jobs, status is set toRunningto avoid sending incorrect consumer status. [1] [2]JobConflictExceptionwithout failing the job, allowing for concurrent updates by multiple workers.Resource Counting and Hash Map Handling
ReindexJobRecordtointernal setfor better encapsulation and updated the logic for retrieving hash values to ensure correctness. [1] [2]CheckJobRecordForAnyWorkto correctly identify when there is work to be done based on resource counts.Other Codebase Improvements
Related issues
Addresses AB#175213, AB#178077, AB#178083.
Testing
Describe how this change was tested.
FHIR Team Checklist
Semver Change (docs)
Patch|Skip|Feature|Breaking (reason)