don't cache if the schema registry is unavailable #143

adrianmester · 2025-03-18T12:18:58Z

fixes #39

When the schema registry is unavailable, we should not cache the error, to allow the client to retry.

Because of a cyclical import, I can't import the actual error type and compare it, so I'm checking the error message instead. I'm open to suggestions on a better way to do this check.

I've also done some gardening on this package: upgraded the Go version and the packages with vulnerabilities, updated the GitHub Actions, and fixed a failing test.

This reverts commit 1bf8865.

jpcosal

LG from infra POV

One minor comment

.github/workflows/avro.yaml

Co-authored-by: jpcosal <[email protected]>

go.mod

philippgille · 2025-03-18T17:53:02Z

singledecoder.go

+		// we can't import avroregistry, to compare the error, so we're looking at the error message to see if the
+		// error is of type `UnavailableError` (avroregistry/errors.go)
+		if err != nil && strings.HasPrefix(err.Error(), "schema registry unavailability caused by") {
+			return nil, err


Can it lead to an overload of the registry if we don't cache anything? The other open PR that addresses the same issue caches the error for 1 minute: #127

Can we just merge that one, or maybe decrease the duration a bit if necessary?

Merging mentioned PR would be nice.

1min might be a bit too much in our case indeed. How about we make that duration configurable (e.g. with a default that can be overridden through an env var)?

When the schema registry is unavailable, the request doesn't reach the schema registry pod at all, so it's not stressed.

If I'm understanding this correctly, this should only be called once per pod per topic (unless there's a race condition where it's called for each partition on the same pod at the same time). In any case, the number of requests to the schema registry is very low.

When the schema registry is unavailable, the request doesn't reach the schema registry pod at all, so it's not stressed.

If the registry is unavailable for 10 minutes, and in that timeframe one pod after another is started (e.g. autoscaling, node consolidation) and runs into the error, then the registry comes back up, won't all pods send their requests to the registry at the same time? Could that lead to issues?
If yes, the error caching could spread that initial load a bit if I'm not mistaken (assuming the pods with the errors didn't all start at the same time, but spread over those 10 minutes of registry downtime).

Ideally we could check the exact issue and handle each case differently:

unavailable registry issue: do not cache the error

registry response issue (invalid schema, data store error, etc): cache the error for X minutes

But this current PR is good enough as a first step IMO

won't all pods send their requests to the registry at the same time? Could that lead to issues?

This is already the current behavior when a deploy happens. The registry should accommodate this (as Adrian mentioned, the number of requests per service should be low: basically 1 per topic consumed/produced per instance).

My PR explicitly skips caching only for avroregistry.UnavailableError.
Looking at the code, that can only happen if:

the schema registry isn't available at all (which should be fixed by an open PR in universe)

or if the schema registry returns a 5xx response code

I believe that it would be a mistake to cache the error in either of those cases. If we want to limit the cache time for other types of errors, that would be beyond the scope of this PR.

This is already the current behavior when a deploy happens

Indeed. Then no blocker, and we can consider improving it in the future if necessary

adrianmester added 3 commits March 18, 2025 12:31

don't cache the error when the schema registry is unavailable

3a358e6

remove circular dependency

be6320d

add comment

d998a3e

adrianmester self-assigned this Mar 18, 2025

adrianmester added 9 commits March 18, 2025 14:25

don't panic!

39fa195

upgrade go version and update vulnerable packages

2f296b8

upgrade linter

9c67d20

upgrade github actions

b0005e5

linter config

e28e23d

wip

1bf8865

Revert "wip"

4efb647

This reverts commit 1bf8865.

wip

1bc1c70

wip

17aafff

gyndav requested review from a team, philippgille and skateinmars March 18, 2025 14:12

fix test

452ce9c

adrianmester requested review from clement-heetch and jpcosal March 18, 2025 15:02

jpcosal reviewed Mar 18, 2025

.github/workflows/avro.yaml (outdated)

Update .github/workflows/avro.yaml

64a1756

Co-authored-by: jpcosal <[email protected]>

philippgille reviewed Mar 18, 2025

change go version to 1.23

6160ecd

adrianmester requested review from jpcosal and philippgille March 19, 2025 08:40

go mod tidy

50d1e0b

philippgille approved these changes Mar 20, 2025

skateinmars approved these changes Mar 20, 2025

adrianmester merged commit c612002 into master Mar 20, 2025
1 check passed

adrianmester deleted the issue-39-error-cache branch March 20, 2025 13:57

philippgille mentioned this pull request Mar 20, 2025

Caching registry fetch errors forever leads to PODs ending needing re… #127

Open


Review comments, in page order: philippgille (Mar 18, 2025), jpcosal (Mar 19, 2025), adrianmester (Mar 19, 2025), philippgille (Mar 19, 2025), skateinmars (Mar 19, 2025), adrianmester (Mar 20, 2025), philippgille (Mar 20, 2025)
