Conversation
Signed-off-by: Omkar Kabde <omkarkabde@gmail.com>
Greptile SummaryAdded HTTP 503 retry handling to Common Crawl WARC downloads using wget flags Key changes:
Note: The Wikipedia downloader ( Confidence Score: 5/5
Important Files Changed
Last reviewed commit: 15e7e4a |
| # We don't use -c (for continue resume) because we want to download file to temp path using -O | ||
| # but -c and -O don't work well together | ||
| cmd = ["wget", url_to_download, "-O", path] | ||
| cmd = ["wget", url_to_download, "-O", path, "--retry-on-http-error=503", "--waitretry=5", "--tries=5"] |
There was a problem hiding this comment.
--retry-on-http-error requires wget 1.21+ (released Jan 2021). Consider documenting this requirement or adding a fallback for older versions
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| # We don't use -c (for continue resume) because we want to download file to temp path using -O | ||
| # but -c and -O don't work well together | ||
| cmd = ["wget", url_to_download, "-O", path] | ||
| cmd = ["wget", url_to_download, "-O", path, "--retry-on-http-error=503", "--waitretry=5", "--tries=5"] |
There was a problem hiding this comment.
Based on https://status.commoncrawl.org/ retrying on 503 makes sense. Not sure if we should add others like 429, etc.
Adding -c could be good too.
WDYT @omkar-334 @praateekmahajan @ayushdg ?
There was a problem hiding this comment.
hi @sarahyurick
i went through the wget docs.
They say that
A combination with ‘-O’/‘--output-document’ is only accepted if the given output file does not exist.
So the comment is true, i don't think we can/should add -c, since the output file would already exist.
what do you think?
There was a problem hiding this comment.
Fair enough, I am fine with skipping it. Thanks!
|
/ok to test 15e7e4a |
sarahyurick
left a comment
There was a problem hiding this comment.
Thanks @omkar-334 , LGTM!
Fixes #1225.
Do we need to add this for wikipedia as well?