Elastic Agent doesn't update the enrollment token in Kubernetes Deployment statefile #3586
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Hello @elastic/elastic-agent, do we have any deadline for a fix for this issue?
This isn't prioritized yet, but it is definitely annoying and has wasted some time even for people inside Elastic. CC @pierrehilbert
My idea: on Fleet-managed Elastic Agents, it would make sense to have an "Elastic State Cleanup" button, so this is easy to handle from the Kibana UI. Manually resetting the state on more than 60 deployed Elastic Agents drives admins mad.
Thanks @ch9hn, I just spent a day trying to figure out why the Kubernetes Elastic Agents were working yesterday but not today, after I updated the cluster. I uninstalled the agents and installed the previous version with no luck. I finally saw there were some files on the machine in /var/lib/elastic-agent-managed/kube-system/state and realised the install manifest had mounted and saved connection details there. Solution: I deleted the whole elastic-agent-managed folder on each machine, reinstalled the Kubernetes agent manifest file, and it worked. @elasticmachine Elastic team, please fix this basic bug! Hours wasted!
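For reference, the cleanup described above can be sketched roughly as follows (the hostPath comes from the standard Elastic-provided manifest; the DaemonSet name `elastic-agent` and namespace are assumptions based on that manifest, so adjust to your deployment):

```shell
# Run on every Kubernetes node: remove the persisted enrollment state
# (assumes the default hostPath from the Elastic-provided manifest).
sudo rm -rf /var/lib/elastic-agent-managed/kube-system/state

# Then restart the agent pods so they re-enroll using the env-provided token.
kubectl -n kube-system rollout restart daemonset/elastic-agent
```

Note that the `rm` must run on the node itself (or via a privileged pod), since the state lives on the node's disk, not inside the container filesystem.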
After hours of investigating and reinstalling elastic-agent on Kubernetes, trying to understand why I had the message "Failed to connect to backoff(elasticsearch(https://244b20202ef45ddb481e55df6b19f4.eu-west-3.aws.elastic-cloud.com:443", I searched for this error message on the elastic-cloud site but found no document. While searching, I realized that something was being stored persistently, so my suspicions fell on the state mount. Indeed, in the manifest generated by elastic-cloud, we can see that it mounts this directory:

```yaml
- name: elastic-agent-state
  hostPath:
    path: /var/lib/elastic-agent-managed/kube-system/state
    type: DirectoryOrCreate
```

Personally, I don't like the fact that they use the node's disk space to store persistent data. But here's my solution to correct the problem:

```shell
for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do
  kubectl -n kube-system exec "$pod" -- rm -rf /usr/share/elastic-agent
done
for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do
  kubectl -n kube-system delete pod "$pod"
done
```
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Quick recap of the discussion surrounding this item from our weekly meeting:
Any updates on this, especially regarding priority? We are currently in discussion with our managed k8s cluster operator team so that they implement a small operator (taking the Fleet URL and the enrollment token), as the actual cluster users are not allowed to run any kind of privileged pods on their own. Before figuring out workarounds that have only minimal impact, e.g. on shipping logs / log duplication, I would like to clarify if and how we can prioritize this upstream, or whether we can contribute.
Another workaround that persists the registry to prevent log duplication:
Based on what @mag-mkorn commented, we'll test and use the following as an init container script (a base alpine image is sufficient):

```shell
#!/usr/bin/env sh
set -eu

# Set pipefail if it works in a subshell, disregard if unsupported
# shellcheck disable=SC3040
(set -o pipefail 2>/dev/null) && set -o pipefail

STATE_DIRECTORY=/usr/share/elastic-agent/state
DATA_DIRECTORY=${STATE_DIRECTORY}/data
HASH_FILE=${STATE_DIRECTORY}/.env-hash
HASH_TARGET="$(printf "%s\0%s" "${FLEET_URL?}" "${FLEET_ENROLLMENT_TOKEN?}" | sha256sum -)"

prune_state() {
  find "${STATE_DIRECTORY}" -path "${DATA_DIRECTORY}" -prune -o -type f -print0 |
    xargs -0 --no-run-if-empty rm -v
}

save_hash() {
  echo "Saving target hash into $HASH_FILE."
  printf "%s" "$HASH_TARGET" >"$HASH_FILE"
}

if [ -f "$HASH_FILE" ]; then
  echo "Existing hash found, comparing..."
  # Compare the saved hash to the target value
  HASH_CURRENT="$(cat "$HASH_FILE")"
  if [ "$HASH_TARGET" = "$HASH_CURRENT" ]; then
    echo "No change detected, no cleanup required."
  else
    echo "Existing hash does not match target hash. Pruning files outside the data dir..."
    prune_state
    save_hash
  fi
else
  save_hash
fi
```
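To make the approach above concrete, here is a sketch of how such a script might be wired into the agent DaemonSet as an init container. This is an illustration only: the container name, the ConfigMap name `agent-state-init`, the script filename, and the Secret name are all assumptions, and the `elastic-agent-state` volume is expected to already exist in the standard manifest:

```yaml
# Hypothetical excerpt of the elastic-agent DaemonSet pod spec.
initContainers:
  - name: check-enrollment-state
    image: alpine:3.19
    command: ["sh", "/scripts/check-state.sh"]
    env:
      - name: FLEET_URL
        value: "https://fleet.example.com:8220"   # placeholder
      - name: FLEET_ENROLLMENT_TOKEN
        valueFrom:
          secretKeyRef:
            name: fleet-enrollment               # assumed Secret name
            key: token
    volumeMounts:
      - name: elastic-agent-state                # the existing state volume
        mountPath: /usr/share/elastic-agent/state
      - name: init-script
        mountPath: /scripts
volumes:
  - name: init-script
    configMap:
      name: agent-state-init                     # assumed ConfigMap holding the script
```

The init container runs before the agent container, so a changed FLEET_URL or FLEET_ENROLLMENT_TOKEN prunes the stale state before the agent attempts to connect.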
Just a note here that the issue also impacts attempts to change the target Elastic cluster, not just token changes, with the added wrinkle that the agent will take ownership of the state directory and disallow deleting it, so you're stuck unless you create an init container to clear the state.
It worked for me too, but IMO this should be considered a critical bug, since it has a big impact on everyone using Elastic Agent. Do you have a roadmap for fixing it in upcoming Elastic Agent versions? Just to be aware.
We have encountered the same problem. I agree that this is a critical error. |
It would be nice if the Agent logged that it's ignoring the env vars (or logged when the env vars don't match the current config) so that this behavior is more obvious.
I got the same issue when using Elastic Agent with systemd, and in a container as well.
Just wanted to confirm that you saw the workaround available in the comments; did it work for you?
Hi, deleting the state folder doesn't work in my case. I don't mount the elastic-agent folder to disk, and I don't think it works for the docker compose use case.
The problem described in this issue is caused by persistent state in the container. If you're using a fresh container each time and you are not persisting state via a volume or bind/host mount, then you likely do not have the problem described in this issue; you should post on the forums (https://discuss.elastic.co/c/elastic-stack/elastic-agent) to receive community assistance, or contact Elastic support if you have access to support.
Oh, come on. I just pulled my hair out for hours, and was already thinking about opening a support ticket because I kept seeing the same error. And then I found this GitHub issue. After deleting the agent from my Kubernetes cluster and deleting the state folder, it worked. It should definitely be noted somewhere in the documentation that this folder can create problems.
I have the same issue. I needed to delete /var/lib/elastic-agent-managed directly on the host first; then the Elastic Agent no longer got the "Failed to connect to backoff ... 401 Unauthorized" error after reapplying the elastic-agent DaemonSet manifest with the same FLEET_URL and FLEET_ENROLLMENT_TOKEN. I run OCP 4.14 with Elastic Agent 8.15.
When a new enrollment token is set via env or envFrom in the Kubernetes manifest, the new token is not picked up by Elastic Agent.
The reason is probably that Elastic Agent saves its state locally on every Kubernetes node and doesn't update the token there.
This leads to Unauthorized errors on the Agent; a redeploy with a new token is no longer possible.
For confirmed bugs, please report:
Version: 8.10
Operating System: Ubuntu Linux / Kubernetes 1.27
Discuss Forum URL:
Steps to Reproduce:
Error logs:
"Failed to connect to backoff(elasticsearch(https://xxxx.xxxx.cloud.es.io:443)): 401 Unauthorized: {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"unable to authenticate with provided credentials and anonymous access is not allowed for this request\",\"additional_unsuccessful_credentials\":\"API key: api key [xxxxxxx] has been invalidated\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer
How to temporarily fix:
When using a Kustomize deployment, the hostPath can be overwritten quite easily with the following DaemonSet overwrite:
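The overlay itself was not captured in this thread. As a sketch of the idea, a Kustomize JSON6902 patch that replaces the hostPath volume with an `emptyDir` (trading persistence for a clean state on every pod restart) might look like this; the DaemonSet name, namespace, and volume index are assumptions and must be matched to your manifest:

```yaml
# kustomization.yaml excerpt (illustrative)
patches:
  - target:
      kind: DaemonSet
      name: elastic-agent        # assumed DaemonSet name
      namespace: kube-system
    patch: |-
      - op: replace
        path: /spec/template/spec/volumes/0   # index of elastic-agent-state (assumption)
        value:
          name: elastic-agent-state
          emptyDir: {}           # state no longer survives pod restarts
```

A JSON6902 patch is used here rather than a strategic merge patch because merging a volume of the same name would leave both `hostPath` and `emptyDir` set, which is invalid.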