Skip to content

Conversation

@adrianmester
Copy link
Contributor

@adrianmester adrianmester commented Mar 18, 2025

fixes #39

When the schema registry is unavailable, we should not cache the error, to allow the client to retry.

Because of a cyclical import, I can't import the actual error type and compare it, so I'm checking the error message instead, I'm open to suggestions on a better way to do this check.

I've also done some gardening on this package, upgraded the go version, packages with vulnerabilities, github actions and fixed a failed test.

@adrianmester adrianmester self-assigned this Mar 18, 2025
@gyndav gyndav requested review from a team, philippgille and skateinmars March 18, 2025 14:12
Copy link
Contributor

@jpcosal jpcosal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG from infra POV

One minor comment

// we can't import avroregistry, to compare the error, so we're looking at the error message to see if the
// error is of type `UnavailableError` (avroregistry/errors.go)
if err != nil && strings.HasPrefix(err.Error(), "schema registry unavailability caused by") {
return nil, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it lead to an overload of the registry if we don't cache anything? The other open PR that addresses the same issue caches the error for 1 minute: #127

Can we just merge that one, or maybe decrease the duration a bit if necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging mentioned PR would be nice.

1min might be a bit too much in our case indeed. How about we make that duration configurable (e.g w/ default and overridable through env var)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the schema registr\y is unavailable the request doesn't reach the schema registry pod at all, so it's not stressed.

If i'm understanding this correctly, this should only be called once per pod per topic (unless there's a race condition where it's called for each partition on the same pod at the same time). In any case, the number of requests to the schema registry is very low.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the schema registr\y is unavailable the request doesn't reach the schema registry pod at all, so it's not stressed.

If the registry is unavailable for 10 minutes, and in that timeframe one pod after another is started (e.g. autoscaling, node consolidation) and runs into the error, then the registry comes back up, won't all pods send their requests to the registry at the same time? Could that lead to issues?
If yes, the error caching could spread that initial load a bit if I'm not mistaken (assuming the pods with the errors didn't all start at the same time, but spread over those 10 minutes of registry downtime).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we could check the exact issue

  • unavailable registry issue: do not cache the error
  • registry response issue (invalid schema, data store error, etc): cache the error for X minutes

But this current PR is good enough as a first step IMO

won't all pods send their requests to the registry at the same time? Could that lead to issues?

This is already the current behavior when a deploy happens, the registry should accommodate this (as Adrian mentioned the number of requests per service should be low, basically 1 per topic consumed/produced in per instance)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR explicitly only doesn't cache avroregistry.UnavailableError
Looking at the code, that can only happen if:

  • the schema registry isn't available at all (which should be fixed by an open PR in universe)
  • or if the schema registry returns a 5xx response code

I believe that it would be a mistake to cache the error in either of those cases. If we want to limit the cache time for other types of errors, that would be beyond the scope of this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already the current behavior when a deploy happens

Indeed. Then no blocker, and we can consider improving it in the future if necessary

@adrianmester adrianmester merged commit c612002 into master Mar 20, 2025
1 check passed
@adrianmester adrianmester deleted the issue-39-error-cache branch March 20, 2025 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

avro: don't cache temporary errors

5 participants