don't cache if the schema registry is unavailable #143
Conversation
jpcosal left a comment
LG from infra POV
One minor comment
Co-authored-by: jpcosal <[email protected]>
```go
// we can't import avroregistry, to compare the error, so we're looking at the error message to see if the
// error is of type `UnavailableError` (avroregistry/errors.go)
if err != nil && strings.HasPrefix(err.Error(), "schema registry unavailability caused by") {
	return nil, err
```
Can it lead to an overload of the registry if we don't cache anything? The other open PR that addresses the same issue caches the error for 1 minute: #127
Can we just merge that one, or maybe decrease the duration a bit if necessary?
Merging the mentioned PR would be nice.
1 min might be a bit too much in our case, indeed. How about we make that duration configurable (e.g. with a default that can be overridden through an env var)?
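For illustration only, a minimal sketch of what a configurable duration could look like; the env var name, the helper, and the 1-minute default (borrowed from #127) are assumptions, not code from either PR:

```go
package schemacache

import (
	"os"
	"time"
)

// negativeCacheTTL returns how long a schema registry error should stay cached.
// The env var name SCHEMA_ERROR_CACHE_TTL is hypothetical; the 1-minute default
// mirrors what #127 proposes.
func negativeCacheTTL() time.Duration {
	if v := os.Getenv("SCHEMA_ERROR_CACHE_TTL"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return time.Minute
}
```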
When the schema registry is unavailable the request doesn't reach the schema registry pod at all, so it's not stressed.
If I'm understanding this correctly, this should only be called once per pod per topic (unless there's a race condition where it's called for each partition on the same pod at the same time). In any case, the number of requests to the schema registry is very low.
When the schema registry is unavailable the request doesn't reach the schema registry pod at all, so it's not stressed.
If the registry is unavailable for 10 minutes, and in that timeframe one pod after another is started (e.g. autoscaling, node consolidation) and runs into the error, won't all pods send their requests to the registry at the same time once it comes back up? Could that lead to issues?
If yes, the error caching could spread that initial load a bit if I'm not mistaken (assuming the pods with the errors didn't all start at the same time, but spread over those 10 minutes of registry downtime).
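Purely to illustrate the load-spreading idea (not code from either PR; all names are made up): caching the error with a jittered expiry would stagger the retries of pods whose cache entries were created at different points during the outage.

```go
package schemacache

import (
	"math/rand"
	"sync"
	"time"
)

// errEntry holds a cached error together with a jittered expiry, so that pods
// whose cache entries expire do not all retry the registry in the same instant.
type errEntry struct {
	err     error
	expires time.Time
}

type errCache struct {
	mu      sync.Mutex
	entries map[string]errEntry
}

func newErrCache() *errCache {
	return &errCache{entries: make(map[string]errEntry)}
}

// put caches err for ttl plus up to 30s of random jitter.
func (c *errCache) put(key string, err error, ttl time.Duration) {
	jitter := time.Duration(rand.Int63n(int64(30 * time.Second)))
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = errEntry{err: err, expires: time.Now().Add(ttl + jitter)}
}

// get returns the cached error if the entry has not expired yet.
func (c *errCache) get(key string) (error, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expires) {
		return nil, false
	}
	return e.err, true
}
```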
Ideally we could check the exact issue:
- unavailable registry issue: do not cache the error
- registry response issue (invalid schema, data store error, etc): cache the error for X minutes
But this current PR is good enough as a first step IMO
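A rough sketch of that split, assuming the error type could be imported directly (which is exactly what the cyclical import prevents in this PR); the import path, function name, and durations are placeholders:

```go
package schemacache

import (
	"errors"
	"time"

	"example.com/avroregistry" // placeholder path; the real import is blocked by the cycle
)

// cachePolicy decides whether a registry error should be cached and for how
// long: never cache unavailability (unreachable registry or 5xx), cache other
// registry responses (invalid schema, data store error, etc.) for a bounded time.
func cachePolicy(err error) (time.Duration, bool) {
	var unavailable *avroregistry.UnavailableError // assumes a pointer type; see avroregistry/errors.go
	if errors.As(err, &unavailable) {
		return 0, false // let the client retry immediately
	}
	return 5 * time.Minute, true
}
```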
won't all pods send their requests to the registry at the same time? Could that lead to issues?
This is already the current behavior when a deploy happens; the registry should accommodate this (as Adrian mentioned, the number of requests per service should be low, basically 1 per topic consumed/produced per instance).
My PR explicitly skips caching only for avroregistry.UnavailableError.
Looking at the code, that can only happen if:
- the schema registry isn't available at all (which should be fixed by an open PR in universe)
- or if the schema registry returns a 5xx response code
I believe that it would be a mistake to cache the error in either of those cases. If we want to limit the cache time for other types of errors, that would be beyond the scope of this PR.
This is already the current behavior when a deploy happens
Indeed. Then no blocker, and we can consider improving it in the future if necessary.
fixes #39
When the schema registry is unavailable, we should not cache the error, to allow the client to retry.
Because of a cyclical import, I can't import the actual error type and compare it, so I'm checking the error message instead. I'm open to suggestions on a better way to do this check.
I've also done some gardening on this package: upgraded the Go version, bumped packages with vulnerabilities, updated the GitHub Actions, and fixed a failing test.
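For reference, the message check from the diff could be factored into a small helper; this is just the PR's existing approach restated, with a hypothetical helper name. The prefix is the one produced by UnavailableError in avroregistry/errors.go.

```go
package schemacache

import "strings"

// isRegistryUnavailable reports whether err looks like avroregistry's
// UnavailableError without importing avroregistry (which would create an
// import cycle). It relies on the message prefix from avroregistry/errors.go.
func isRegistryUnavailable(err error) bool {
	return err != nil &&
		strings.HasPrefix(err.Error(), "schema registry unavailability caused by")
}
```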