
Allow IT Dockerfile to exit after all setup attempts fail #17592

Merged
merged 21 commits into apache:master on Jan 15, 2025

Conversation

GWphua
Contributor

@GWphua GWphua commented Dec 20, 2024

Description

Presently, the Integration Test Dockerfile continues execution even after base-setup.sh fails 3 times, instead of exiting immediately. This PR aims to patch that up by forcing the Docker build to exit after 3 failures.
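For illustration, the retry-then-exit behaviour described above can be sketched in shell. The function name and messages here are hypothetical (the actual change presumably lives inline in the Dockerfile's RUN instruction); SETUP_RETRIES stands in for the retry count visible in the review diff.

```shell
#!/bin/sh
# Hedged sketch of the retry-then-exit logic; run_setup_with_retries is an
# illustrative name, not the actual Dockerfile contents.
SETUP_RETRIES=3

run_setup_with_retries() {
    i=1
    while [ "$i" -le "$SETUP_RETRIES" ]; do
        if "$@"; then
            return 0                 # setup succeeded
        fi
        echo "Setup attempt $i/$SETUP_RETRIES failed" >&2
        i=$((i + 1))
    done
    return 1                         # all attempts exhausted: fail the build
}

# A command that always fails exhausts every attempt and returns 1.
run_setup_with_retries false
echo "exit status: $?"
```

Returning a non-zero status from the last attempt is what makes the `RUN` instruction, and hence the whole build, stop instead of carrying on.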

This PR has:

  • been self-reviewed.

@GWphua GWphua changed the title Allow it dockerfile to exit Allow IT Dockerfile to exit after all setup attempts fail Dec 20, 2024
@GWphua
Contributor Author

GWphua commented Dec 20, 2024

PTAL @Akshat-Jain @cryptoe

Contributor

@Akshat-Jain Akshat-Jain left a comment


rm -f /root/base-setup.sh
rm -f /root/base-setup.sh; \
if [ "$i" -eq "$SETUP_RETRIES" ]; then \
exit 1; \
Contributor


Let's put a message here mentioning that we gave up on retries.

Contributor

@cryptoe cryptoe Jan 6, 2025


Message like
Tried [$i] times to contact [$url]. Unable to do so. Try a different url or check connectivity

Member


this seems like a fix-of-the-fix already #17543

I think this retry logic is completely misplaced; it seems to me that the original problem was that wget times out. Retrying all steps of the script means everything is rerun even when the failure was caused by some other issue...

I wonder: why not specify some retry conditions for wget instead?
https://unix.stackexchange.com/questions/227665/wget-abort-retrying-after-failure-or-timeout

Contributor Author


Hi @kgyrtkirk, really appreciate the tip. I will look into experimenting with wget retry conditions and update if it helps with our current issue.

Contributor Author


Hi @kgyrtkirk @cryptoe @Akshat-Jain, I did some research into the wget commands, PTAL at my findings:

Error Logging

Turns out that our current wget command is called with the -q option, hence silencing all wget logs. Removing the -q option temporarily (I added it back after debugging; let me know if it's better to simply not silence the logs) surfaced these logs:

  1. wget success
#8 15.32 --2025-01-13 07:50:55--  https://downloads.apache.org/zookeeper/zookeeper-3.8.4/apache-zookeeper-3.8.4-bin.tar.gz
#8 15.33 Resolving downloads.apache.org (downloads.apache.org)... 88.99.208.237, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
#8 15.34 Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... connected.
#8 15.76 HTTP request sent, awaiting response... 200 OK
  2. wget failure
#8 14.55 --2025-01-13 07:42:57--  https://downloads.apache.org/zookeeper/zookeeper-3.8.4/apache-zookeeper-3.8.4-bin.tar.gz
#8 14.56 Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.208.237, 2a01:4f8:10a:39da::2, ...
#8 14.56 Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... failed: Connection timed out.
#8 149.2 Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... failed: Connection timed out.
#8 284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:39da::2|:443... failed: Network is unreachable.
#8 284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f9:3a:2c57::2|:443... failed: Network is unreachable.
#8 ERROR: process "/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh" did not complete successfully: exit code: 4
------
 > [druidbase 3/3] RUN APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org /root/base-setup.sh && rm -f /root/base-setup.sh:
14.04 Setting up python (2.7.16-1) ...
14.05 Setting up python-pkg-resources (40.8.0-1) ...
14.16 Setting up python-meld3 (1.0.2-2) ...
14.23 Setting up supervisor (3.3.5-1) ...
14.42 invoke-rc.d: could not determine current runlevel
14.43 invoke-rc.d: policy-rc.d denied execution of start.
14.52 Processing triggers for libc-bin (2.28-10+deb10u1) ...
failed: Connection timed out.
284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f8:10a:39da::2|:443... failed: Network is unreachable.
284.3 Connecting to downloads.apache.org (downloads.apache.org)|2a01:4f9:3a:2c57::2|:443... failed: Network is unreachable.
------
Dockerfile:28
--------------------
  26 |     ARG ZK_VERSION
  27 |     ARG APACHE_ARCHIVE_MIRROR_HOST=https://downloads.apache.org
  28 | >>> RUN APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh
  29 |     
  30 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c APACHE_ARCHIVE_MIRROR_HOST=${APACHE_ARCHIVE_MIRROR_HOST} /root/base-setup.sh && rm -f /root/base-setup.sh" did not complete successfully: exit code: 4

Findings

  1. According to the wget manual, wget will conduct up to 20 tries by default (configurable with --tries=n).
  2. Between each retry, wget employs a linear backoff of 1, 2, ... seconds, capped at 10 seconds.
  3. The retry condition does not apply to fatal errors like "connection refused" or "not found".
  4. From the failure case, we get "Network is unreachable" errors when calling IPv6 addresses.
  5. These errors may cause wget not to trigger the default retry mechanism.
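Taking findings 1 and 2 together, a quick back-of-envelope sketch shows the worst-case time spent just waiting between retries, assuming 20 tries and a wait of min(n, 10) seconds before retry n (an assumption based on the linear-backoff behaviour described above, not measured wget output):

```shell
# Worst-case backoff wait across the 19 gaps between 20 tries,
# assuming the wait before retry n is min(n, 10) seconds.
total=0
n=1
while [ "$n" -le 19 ]; do
    wait=$n
    [ "$wait" -gt 10 ] && wait=10
    total=$((total + wait))
    n=$((n + 1))
done
echo "worst-case backoff wait: ${total}s"   # 1+2+...+9 + 10*10 = 145s
```

So even before any per-connection timeouts are counted, exhausting the default retries can add a couple of minutes to a failing build.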

Actions

  1. Restricted wget to IPv4 addresses, which both focuses on the right addresses to call and allows retries to trigger.
  2. Added --continue so that wget retries resume the partial download instead of starting from scratch.
  3. Reverted the 3 retries of the entire base-setup.sh script. Chances are that if wget fails a total of 20 times, the host is probably overloaded with download requests; up to 60 wget attempts may exacerbate the situation.
  4. Added a message to indicate whether wget succeeded or failed.

local dest=$1
local host=$2

if ! wget --inet4-only --quiet --continue --output-document="$dest" "$host"
Member


thank you for the details/changes; looks better!

I feel like the current changes will be vulnerable to the same issues you've seen before you started fixing this...
You could simulate connection-refused/etc. errors with iptables commands like:

iptables -A OUTPUT -d 135.181.214.104 -j DROP
iptables -A OUTPUT -d 88.99.208.237 -j DROP

or using a tool like saboteur

  • I think you could use --retry-connrefused and/or other options to force connection retries?
  • Restricting to IPv4 could possibly hit back later... I don't think it should be restricted to that.
  • Instead of a custom message, wouldn't it be enough to use -nv?
  • nit: why did you add --continue? Could it be already there?

Contributor Author

@GWphua GWphua Jan 14, 2025


Hi @kgyrtkirk, thanks for the tip on simulating network problems; I will try to reproduce the issue that way. Also, good points about using --retry-connrefused instead of restricting to IPv4, and -nv instead of --quiet.

About the nit:
--continue was added as an attempt to target the timeout problem. I read that it helps wget resume a partially downloaded file, so I added it to reduce the chance that wget is making progress but the download is too slow to finish before the timeout.

Contributor Author

@GWphua GWphua Jan 15, 2025


I have made the changes to the wget options, and we can now see that using --retry-connrefused allows wget to make [20 retries](https://github.com/apache/druid/actions/runs/12764706329/job/35581015481?pr=17592), instead of not retrying at all prior to this PR.

However, I found that using -nv may confuse people. PTAL at the wget output for -nv:

#8 14.22 Processing triggers for libc-bin (2.28-10+deb10u1) ...
#8 148.7 failed: Connection timed out.
#8 283.9 failed: Connection timed out.
#8 283.9 failed: Network is unreachable.
#8 283.9 failed: Network is unreachable.
#8 419.0 failed: Connection timed out.
#8 554.2 failed: Connection timed out.
#8 554.2 failed: Network is unreachable.
#8 554.2 failed: Network is unreachable.

For example, in this case, it is not obvious that:

  1. We are using wget to download ZooKeeper.
  2. Since the IP addresses are not shown, users may get the impression that wget is alternating between "Connection timed out" and "Network is unreachable" for the same IP host.

As such, I have changed the log settings to the default (--verbose).
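Putting the final decisions together (--retry-connrefused for connection retries, --continue for resumable downloads, logging left at wget's verbose default), the download step presumably looks something like the sketch below. The fetch() wrapper name and its messages are illustrative assumptions, not the exact merged script; the wget flags themselves are real options.

```shell
# Illustrative sketch of the post-review download step; fetch() and its
# messages are hypothetical, not the exact merged base-setup.sh code.
fetch() {
    dest=$1
    url=$2
    # --retry-connrefused: also retry on "connection refused" errors
    # --continue: resume a partial download on retry instead of restarting
    # (output left at wget's verbose default, per the discussion above)
    if wget --retry-connrefused --continue --output-document="$dest" "$url"; then
        echo "Successfully downloaded $url"
    else
        echo "Failed to download $url after all retries" >&2
        return 1
    fi
}
```

With verbose output left on, each attempt prints the resolved IP address and the specific failure, which avoids the ambiguity noted above for -nv.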

Member

@kgyrtkirk kgyrtkirk left a comment


looks much better - I guess now it would fail correctly as well if all 20 retries are exhausted

@kgyrtkirk kgyrtkirk merged commit 863e91d into apache:master Jan 15, 2025
77 checks passed

4 participants