
Add retry on 503 error#1496

Merged
sarahyurick merged 4 commits into NVIDIA-NeMo:main from omkar-334:http503
Feb 23, 2026

Conversation

@omkar-334
Contributor

Fixes #1225.

Do we need to add this for Wikipedia as well?

Signed-off-by: Omkar Kabde <omkarkabde@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Feb 12, 2026

Greptile Summary

Added HTTP 503 retry handling to Common Crawl WARC downloads using wget flags --retry-on-http-error=503, --waitretry=5, and --tries=5. This allows the downloader to automatically retry failed downloads when Common Crawl servers return 503 Service Unavailable errors.

Key changes:

  • Modified wget command to include retry flags for 503 errors with 5-second wait between retries and maximum 5 attempts
  • Updated all test assertions to reflect the new command structure

Note: The Wikipedia downloader (nemo_curator/stages/text/download/wikipedia/download.py:56) also uses wget for downloads and could benefit from similar retry logic if Wikipedia dumps experience similar transient failures.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are straightforward and well-tested: adding standard wget retry flags to handle 503 errors. All tests have been properly updated to match the new command structure. The implementation follows wget best practices for retry logic.
  • No files require special attention

Important Files Changed

  • nemo_curator/stages/text/download/common_crawl/download.py: Added wget retry flags (--retry-on-http-error=503, --waitretry=5, --tries=5) to handle 503 errors gracefully
  • tests/stages/text/download/common_crawl/test_download.py: Updated test assertions to expect the new wget retry flags in command invocations

Last reviewed commit: 15e7e4a

Contributor

@greptile-apps bot left a comment


2 files reviewed, 1 comment


  # We don't use -c (for continue resume) because we want to download file to temp path using -O
  # but -c and -O don't work well together
- cmd = ["wget", url_to_download, "-O", path]
+ cmd = ["wget", url_to_download, "-O", path, "--retry-on-http-error=503", "--waitretry=5", "--tries=5"]
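The change above can be sketched as a small Python helper. This is not the actual code in download.py, just a minimal illustration of the pattern under discussion; the function names `build_wget_cmd` and `download` are hypothetical:

```python
import subprocess

# Retry flags from the PR: retry on HTTP 503, wait 5 seconds between
# retries, give up after 5 attempts.
RETRY_FLAGS = ["--retry-on-http-error=503", "--waitretry=5", "--tries=5"]


def build_wget_cmd(url_to_download: str, path: str) -> list[str]:
    # We don't use -c (continue/resume) because the file is written to a
    # temp path with -O, and -c and -O don't work well together.
    return ["wget", url_to_download, "-O", path, *RETRY_FLAGS]


def download(url: str, path: str) -> None:
    # check=False so the caller can surface wget's stderr on failure
    result = subprocess.run(build_wget_cmd(url, path), capture_output=True, check=False)
    if result.returncode != 0:
        raise RuntimeError(f"wget failed: {result.stderr.decode(errors='replace')}")
```

With these flags, a 503 from the Common Crawl servers triggers up to four additional attempts with a 5-second wait, instead of failing immediately.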
Contributor


--retry-on-http-error requires wget 1.21+ (released Jan 2021). Consider documenting this requirement or adding a fallback for older versions

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
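If the version requirement flagged above is a concern, one possible fallback is to probe `wget --version` and only add the flag when it is supported. A minimal sketch, assuming the 1.21+ requirement stated in the review comment; the helper names are hypothetical:

```python
import re

# version_output would come from something like:
#   subprocess.run(["wget", "--version"], capture_output=True, text=True).stdout


def parse_wget_version(version_output: str) -> tuple[int, int]:
    # The first line typically looks like: "GNU Wget 1.21.3 built on linux-gnu."
    match = re.search(r"Wget\s+(\d+)\.(\d+)", version_output)
    if not match:
        raise ValueError("could not parse wget version")
    return int(match.group(1)), int(match.group(2))


def retry_flags_for(version_output: str) -> list[str]:
    # Per the review comment, --retry-on-http-error needs wget 1.21+;
    # older wgets reject unknown flags, so fall back to plain retries.
    flags = ["--waitretry=5", "--tries=5"]
    if parse_wget_version(version_output) >= (1, 21):
        flags.append("--retry-on-http-error=503")
    return flags
```

Tuple comparison handles the version check without pulling in a packaging dependency.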

@omkar-334
Contributor Author

cc @praateekmahajan

Contributor

@greptile-apps bot left a comment


2 files reviewed, no comments


  # We don't use -c (for continue resume) because we want to download file to temp path using -O
  # but -c and -O don't work well together
- cmd = ["wget", url_to_download, "-O", path]
+ cmd = ["wget", url_to_download, "-O", path, "--retry-on-http-error=503", "--waitretry=5", "--tries=5"]
Contributor


Based on https://status.commoncrawl.org/ retrying on 503 makes sense. Not sure if we should add others like 429, etc.

Adding -c could be good too.

WDYT @omkar-334 @praateekmahajan @ayushdg ?
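If 429 (or other transient codes) were ever added, wget's --retry-on-http-error accepts a comma-separated list of status codes, so the extension is a one-line change. A small hypothetical helper to illustrate:

```python
def retry_flag(codes: list[int]) -> str:
    # wget accepts --retry-on-http-error=code[,code,...], so retrying on
    # both 429 (Too Many Requests) and 503 is a single flag.
    return "--retry-on-http-error=" + ",".join(str(c) for c in codes)
```

For example, `retry_flag([429, 503])` would cover both codes discussed above.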

Contributor Author

@omkar-334 commented Feb 21, 2026


Hi @sarahyurick,
I went through the wget docs. They say:

A combination with ‘-O’/‘--output-document’ is only accepted if the given output file does not exist.

So the comment is correct: I don't think we can or should add -c, since the output file would already exist. What do you think?

Contributor


Fair enough, I am fine with skipping it. Thanks!

Contributor

@greptile-apps bot left a comment


2 files reviewed, no comments


@sarahyurick
Contributor

/ok to test 15e7e4a

Contributor

@sarahyurick left a comment


Thanks @omkar-334 , LGTM!

@sarahyurick merged commit 8c17d17 into NVIDIA-NeMo:main Feb 23, 2026
49 checks passed


Successfully merging this pull request may close these issues.

Add retry logic for common crawl download
