Allow IT Dockerfile to exit after all setup attempts fail #17592
Conversation
PTAL @Akshat-Jain @cryptoe
LGTM, was able to confirm the new behaviour at https://github.com/apache/druid/actions/runs/12429974963/job/34707262605?pr=17592
integration-tests/docker/Dockerfile
```diff
-    rm -f /root/base-setup.sh
+    rm -f /root/base-setup.sh; \
+    if [ "$i" -eq "$SETUP_RETRIES" ]; then \
+      exit 1; \
```
Let's put a message here mentioning that we gave up on retries.
Message like:
```
Tried [$i] times to contact [$url]. Unable to do so. Try a different url or check connectivity
```
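The suggested give-up message could slot into the retry loop along these lines. This is a hypothetical sketch, not the committed Dockerfile code: the function name is illustrative, `"$@"` stands in for the real `base-setup.sh` invocation so the logic can be traced without a Docker build, and the mirror URL is a placeholder.

```shell
#!/bin/sh
# Hypothetical retry wrapper; SETUP_RETRIES and the message wording
# follow the review suggestion, the rest is illustrative.
run_with_retries() {
  SETUP_RETRIES=3
  url="https://downloads.apache.org"   # placeholder mirror host
  i=1
  while [ "$i" -le "$SETUP_RETRIES" ]; do
    if "$@"; then
      return 0   # setup succeeded, stop retrying
    fi
    if [ "$i" -eq "$SETUP_RETRIES" ]; then
      echo "Tried [$i] times to contact [$url]. Unable to do so. Try a different url or check connectivity" >&2
      return 1   # all attempts exhausted: fail
    fi
    i=$((i + 1))
  done
}
```

In the Dockerfile's `RUN` step this would be `exit 1` rather than `return 1`, so the step fails and the image build stops.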
This seems like a fix-of-the-fix already: #17543.
I think this retry logic is completely misplaced; it seems to me that the original problem was that wget times out? Retrying all steps of the script will rerun it even if other issues happen... I wonder why not specify some retry conditions for wget instead?
https://unix.stackexchange.com/questions/227665/wget-abort-retrying-after-failure-or-timeout
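Per that Stack Exchange thread, wget's retry behavior can be tuned directly on the command line. A sketch of the relevant flags (from the wget manual); the URL is a placeholder and the snippet only echoes the command rather than hitting the network:

```shell
# Flags from the wget manual:
#   --tries=3            cap attempts at 3 (the default is 20)
#   --retry-connrefused  also retry when the connection is refused
#   --waitretry=10       cap the linear backoff between retries at 10s
#   --timeout=30         DNS/connect/read timeout per attempt
WGET_OPTS="--tries=3 --retry-connrefused --waitretry=10 --timeout=30"
echo "wget $WGET_OPTS https://downloads.apache.org/..."
```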
Hi @kgyrtkirk, really appreciate the tip. I will look into experimenting with wget retry conditions and update if it helps with our current issue.
Hi @kgyrtkirk @cryptoe @Akshat-Jain, I did some research into the `wget` commands, PTAL at my findings:
Error Logging
It turns out that our current `wget` command is called with the `-q` option, which silences all wget logs. Removing the `-q` option temporarily (I added it back after debugging; let me know if it would be better to simply not silence the logs) caused these logs to surface:
- wget success
```
#8 15.32 --2025-01-13 07:50:55-- https://downloads.apache.org/zookeeper/zookeeper-3.8.4/apache-zookeeper-3.8.4-bin.tar.gz
#8 15.33 Resolving downloads.apache.org (downloads.apache.org)... 88.99.208.237, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
#8 15.34 Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... connected.
#8 15.76 HTTP request sent, awaiting response... 200 OK
```
- wget failure
```
#8 14.55 --2025-01-13 07:42:57-- https://downloads.apache.org/zookeeper/zookeeper-3.8.4/apache-zookeeper-3.8.4-bin.tar.gz
#8 14.56 Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.208.237, 2a01:4f8:10a:39da::2, ...
#8 14.56 Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... failed: Connection timed out.
#8 149.2 Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... failed: Connection timed out.
#8 284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:39da::2|:443... failed: Network is unreachable.
#8 284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f9:3a:2c57::2|:443... failed: Network is unreachable.
#8 ERROR: process "/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh" did not complete successfully: exit code: 4
------
 > [druidbase 3/3] RUN APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org /root/base-setup.sh && rm -f /root/base-setup.sh:
14.04 Setting up python (2.7.16-1) ...
14.05 Setting up python-pkg-resources (40.8.0-1) ...
14.16 Setting up python-meld3 (1.0.2-2) ...
14.23 Setting up supervisor (3.3.5-1) ...
14.42 invoke-rc.d: could not determine current runlevel
14.43 invoke-rc.d: policy-rc.d denied execution of start.
14.52 Processing triggers for libc-bin (2.28-10+deb10u1) ...
failed: Connection timed out.
284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:39da::2|:443... failed: Network is unreachable.
284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f9:3a:2c57::2|:443... failed: Network is unreachable.
------
Dockerfile:28
--------------------
  26 |     ARG ZK_VERSION
  27 |     ARG APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org
  28 | >>> RUN APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh
  29 |
  30 |
--------------------
ERROR: failed to solve: process "/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh" did not complete successfully: exit code: 4
```
Findings
- According to the wget manual, wget will conduct up to 20 retries by default (configurable with `--tries=n`).
- Between each retry, wget employs a linear backoff of 1, 2, ... up to 10 seconds.
- The retry condition does not apply to fatal errors like "connection refused" or "not found".
- From the failure case, we get "Network is unreachable" errors when calling IPv6 addresses.
- These errors may cause wget to not trigger the default retry mechanism.
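To put the backoff finding in perspective, the worst-case wait across wget's 20 default attempts adds up quickly. A back-of-the-envelope shell calculation (the 1s-to-10s linear cap is from the wget manual; connect timeouts would come on top of this):

```shell
# 20 attempts means 19 waits between them: 1s, 2s, ..., 9s, then 10s each.
total=0
n=1
while [ "$n" -le 19 ]; do
  if [ "$n" -lt 10 ]; then wait=$n; else wait=10; fi
  total=$((total + wait))
  n=$((n + 1))
done
echo "$total"   # 145 seconds of backoff alone
```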
Actions
- Restrict `wget` to only call IPv4 addresses; this both focuses on the right addresses to call and allows retries to kick in.
- Add `--continue` to prevent `wget` retries from downloading from scratch.
- Revert the 3 retries for the entire `base-setup.sh` script. Chances are that if our `wget` fails a total of 20 times, the host is probably overloaded with download requests; a total of 60 `wget` attempts may exacerbate the situation.
- Add a message to indicate whether wget succeeded or failed.
```shell
local dest=$1
local host=$2

if ! wget --inet4-only --quiet --continue --output-document="$dest" "$host"
```
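The fragment above reads as part of a download helper; a hypothetical completed version for context (the function name, message wording, and error handling are illustrative, not the exact script):

```shell
# Hypothetical completion of the download helper shown in the diff.
download_file() {
  local dest=$1
  local host=$2

  if ! wget --inet4-only --quiet --continue --output-document="$dest" "$host"; then
    echo "Failed to download [$host] to [$dest]" >&2
    return 1
  fi
  echo "Downloaded [$host] to [$dest]"
}
```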
Thank you for the details/changes; looks better!
I feel like the current changes will be vulnerable to the same issues you've seen before you started fixing this...
You could simulate connection refused and similar errors with iptables commands like:
```
iptables -A OUTPUT -d 135.181.214.104 -j DROP
iptables -A OUTPUT -d 88.99.208.237 -j DROP
```
or using a tool like saboteur.
- I think you could use `--retry-connrefused` and/or other options to force connection retries?
- Restricting to IPv4 could possibly hit back later... I don't think it should be restricted to that.
- Instead of a custom message, wouldn't it be enough to use `-nv`?
- nit: why did you add `--continue`? Could it be already there?
Hi @kgyrtkirk, thanks for the tip on simulating network problems; I will try to reproduce the issue that way. Also, good points about using `--retry-connrefused` instead of restricting to IPv4, and `-nv` instead of `--quiet`.
About the nit: `--continue` was added as an attempt to target the timeout problem. I read that it helps wget resume downloading the file, so I added it to reduce the chance that wget is making progress but the download speed is too slow to satisfy the timeout.
I have made the changes to the `wget` options, and we can now see that using `--retry-connrefused` allows us to wget with [20 retries](https://github.com/apache/druid/actions/runs/12764706329/job/35581015481?pr=17592), instead of not retrying at all prior to this PR.
However, I found that using `-nv` may confuse people. PTAL at the `wget` output for `-nv`:
```
#8 14.22 Processing triggers for libc-bin (2.28-10+deb10u1) ...
#8 148.7 failed: Connection timed out.
#8 283.9 failed: Connection timed out.
#8 283.9 failed: Network is unreachable.
#8 283.9 failed: Network is unreachable.
#8 419.0 failed: Connection timed out.
#8 554.2 failed: Connection timed out.
#8 554.2 failed: Network is unreachable.
#8 554.2 failed: Network is unreachable.
```
For example, in this case, it is not obvious that:
- We are using `wget` to download ZooKeeper.
- The IP addresses are not shown; this may give users the impression that `wget` is alternating between `Connection timed out` and `Network is unreachable` for the same IP host.
As such, I have changed the log settings back to the default (`--verbose`).
Looks much better - I guess now it would also fail correctly if all 20 retries are exhausted.
Description
Presently the Integration Test Dockerfile continues execution even after 3 failures of `base-setup.sh`, and does not exit immediately. This PR aims to patch that up by forcing Docker to exit after 3 failures.
This PR has: