
Elastic Agent doesn't update the enrollment token in Kubernetes Deployment statefile #3586

Open
ch9hn opened this issue Oct 11, 2023 · 24 comments · May be fixed by #6568
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments


ch9hn commented Oct 11, 2023

When a new enrollment token is supplied via env or envFrom in the Kubernetes manifest, the new token is not picked up by Elastic Agent.
The likely reason is that Elastic Agent saves its state locally on every Kubernetes node and does not update the token there.
This leads to Unauthorized errors on the agent, and a redeploy with a new token is no longer possible.
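For reference, the enrollment token typically reaches the agent container in one of these two ways; a minimal sketch (the secret name elastic-agent-token mirrors the workaround below, the literal value is a placeholder):

```yaml
containers:
  - name: elastic-agent
    # Option 1: token set directly as an environment variable
    env:
      - name: FLEET_ENROLLMENT_TOKEN
        value: "ABC"
    # Option 2: token provided via a Secret
    envFrom:
      - secretRef:
          name: elastic-agent-token
```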

For confirmed bugs, please report:

  • Version: 8.10

  • Operating System: Ubuntu Linux / Kubernetes 1.27

  • Discuss Forum URL:

  • Steps to Reproduce:

  1. Install Elastic Agent on a Kubernetes cluster as described in the docs, with enrollment token "ABC"
  2. Expire "ABC" and add a new token "DEF"
  3. Restart the Elastic Agent DaemonSet
  4. Result: the old token "ABC" is persisted and used for communication with the Elastic Fleet Server

Error logs:

"Failed to connect to backoff(elasticsearch(https://xxxx.xxxx.cloud.es.io:443)): 401 Unauthorized: {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"unable to authenticate with provided credentials and anonymous access is not allowed for this request\",\"additional_unsuccessful_credentials\":\"API key: api key [xxxxxxx] has been invalidated\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer

Temporary fix:
When using a Kustomize deployment, the hostPath can be overridden quite easily with the following DaemonSet patch:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: elastic-agent
  template:
    metadata:
      labels:
        app: elastic-agent
    spec:
      containers:
        - name: elastic-agent
          env:
            - name: FLEET_URL
              $patch: delete
            - name: FLEET_ENROLLMENT_TOKEN
              $patch: delete
            - name: FLEET_INSECURE
              value: "false"
            - name: KIBANA_HOST
              $patch: delete
            - name: KIBANA_FLEET_USERNAME
              $patch: delete
            - name: KIBANA_FLEET_PASSWORD
              $patch: delete
          envFrom:
            - secretRef:
                name: elastic-agent-token
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state
      volumes:
        - name: elastic-agent-state
          hostPath:
            # Change the path here to match your deployment namespace, or use another name
            path: /var/lib/elastic-agent-managed/monitoring/state
            type: DirectoryOrCreate

@ch9hn ch9hn added the bug Something isn't working label Oct 11, 2023
@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Oct 11, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@shubhu934

Hello @elastic/elastic-agent, is there any timeline for a fix for this issue?

Member

cmacknz commented Nov 9, 2023

This isn't prioritized yet, but it is definitely annoying and has wasted some time even for people inside Elastic. CC @pierrehilbert


HGS9761 commented Nov 10, 2023

Hello
I am affected by this bug and it is not a minor issue.
So this should have high priority.


ghost commented Jan 15, 2024

Hello all,
we see the same issue on production nodes, blocking Elastic Agent tests. The workaround of cleaning up the state folder works, but it is still very unprofessional.

@rafaelbattesti

Hello folks.
I have come across this issue and reproduced with support.
This has consumed a few days digging into the root cause and I believe we should give this some attention.


ghost commented Feb 7, 2024

My idea: on Fleet-managed Elastic Agents it would make sense to have an "Elastic State Cleanup" button, so this would be easy to handle from the Kibana UI. Manually resetting the state on more than 60 deployed Elastic Agents drives the admins crazy.


neu7ron2 commented Mar 1, 2024

Thanks @ch9hn , I just spent a day trying to figure out why the Kubernetes elastic agents were working yesterday and not today after I updated the cluster. I uninstalled the agents and installed the previous version with no luck. I finally saw there were some files on the machine in /var/lib/elastic-agent-managed/kube-system/state and realised the install manifest had mounted and saved connection details.

Solution: I deleted the whole elastic-agent-managed folder on each machine, reapplied the Kubernetes agent manifest, and it worked.

@elasticmachine Elastic team please fix this basic bug!! hours wasted!


badele commented Mar 20, 2024

After hours of investigating and reinstalling elastic-agent on Kubernetes, I was trying to understand why I kept getting the message "Failed to connect to backoff(elasticsearch(https://244b20202ef45ddb481e55df6b19f4.eu-west-3.aws.elastic-cloud.com:443"

I searched for this error message on the 'elastic-cloud' site but found no documentation ...

While digging, I realized that something was being stored persistently, so my suspicions fell on the /usr/share/elastic-agent/state directory.

Indeed, in the manifest generated by elastic-cloud, we can see that it mounts the directory /var/lib/elastic-agent-managed/kube-system/state from a Kubernetes node:

        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/kube-system/state
            type: DirectoryOrCreate

Personally, I don't like the fact that they use the node's disk space to store persistence data.

But here's my solution to correct the problem

for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do kubectl -n kube-system exec "$pod" -- rm -rf /usr/share/elastic-agent; done
for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do kubectl -n kube-system delete pod "$pod"; done

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 26, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@michel-laterman
Contributor

Quick recap of the discussion surrounding this item from our weekly meeting:

  • Default behaviour for agent on k8s will be to re-enroll if the enrollment token changes
    • we will store the enrollment token hash in local state to compare when the value changes
  • some flag/env var to disable this behaviour should be provided, e.g. ELASTIC_AGENT_AUTO_REENROLL
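The proposed re-enroll-on-token-change check can be illustrated with a few lines of shell. This is a sketch under assumptions, not the agent's actual implementation: the state path, hash file name, and default token are illustrative.

```shell
#!/usr/bin/env sh
# Sketch: hash the enrollment token, compare it with the hash stored in local
# state, and re-enroll only when it changed. Paths and names are assumptions.
set -eu

STATE_DIR="${STATE_DIR:-/tmp/agent-state-demo}"
HASH_FILE="$STATE_DIR/.enrollment-token-hash"
TOKEN="${FLEET_ENROLLMENT_TOKEN:-ABC}"   # default only for this demo

mkdir -p "$STATE_DIR"
token_hash="$(printf '%s' "$TOKEN" | sha256sum | cut -d' ' -f1)"

if [ -f "$HASH_FILE" ] && [ "$(cat "$HASH_FILE")" = "$token_hash" ]; then
  echo "token unchanged, keeping existing enrollment state"
else
  echo "token changed or first run, re-enrolling"
  printf '%s' "$token_hash" >"$HASH_FILE"
fi
```

Running it twice with the same token takes the "unchanged" branch on the second run; changing FLEET_ENROLLMENT_TOKEN in between triggers the re-enroll branch.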


chrko commented Jun 19, 2024

Any updates on this, especially regarding priority? We are currently in discussion with our managed Kubernetes cluster operator team about implementing a small operator (taking the Fleet URL and the enrollment token), since the actual cluster users are not allowed to run any kind of privileged pods on their own. Before working out workarounds that have only minimal impact, e.g. on shipping logs / log duplication, I would like to clarify if and how we can prioritize this upstream, or perhaps contribute.

@mag-mkorn

Another workaround that persists the registry to prevent log duplication:

<...>
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state/data
<...>
      volumes:
        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/monitoring/state/data
            type: DirectoryOrCreate
<...>


chrko commented Jun 25, 2024

Based on what @mag-mkorn commented, we'll test and use the following as an init container script (a base alpine image is sufficient):

#!/usr/bin/env sh

set -eu

# Set pipefail if it works in a subshell, disregard if unsupported
# shellcheck disable=SC3040
(set -o pipefail 2>/dev/null) && set -o pipefail

STATE_DIRECTORY=/usr/share/elastic-agent/state
DATA_DIRECTORY=${STATE_DIRECTORY}/data

HASH_FILE=${STATE_DIRECTORY}/.env-hash
HASH_TARGET="$(printf "%s\0%s" "${FLEET_URL?}" "${FLEET_ENROLLMENT_TOKEN?}" | sha256sum -)"

prune_state() {
  find "${STATE_DIRECTORY}" -path "${DATA_DIRECTORY}" -prune -o -type f -print0 |
    xargs -0 --no-run-if-empty rm -v
}

save_hash() {
  echo "Save target hash into $HASH_FILE."
  printf "%s" "$HASH_TARGET" >"$HASH_FILE"
}

if [ -f "$HASH_FILE" ]; then
  echo "Existing hash found, comparing..."
  # cmp saved hash to target value
  HASH_CURRENT="$(cat "$HASH_FILE")"
  if [ "$HASH_TARGET" = "$HASH_CURRENT" ]; then
    echo "No change detected, no cleanup required."
  else
    echo "Existing hash does not match target hash. Pruning files outside the data dir..."
    prune_state
    save_hash
  fi
else
  save_hash
fi
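One way to wire a script like the one above into the DaemonSet is as an init container. A sketch under assumptions: the image tag, the ConfigMap name elastic-agent-state-cleanup, and the script path are illustrative, and the secret is assumed to provide FLEET_URL and FLEET_ENROLLMENT_TOKEN:

```yaml
      initContainers:
        - name: state-cleanup
          image: alpine:3.20
          command: ["sh", "/scripts/state-cleanup.sh"]
          envFrom:
            - secretRef:
                name: elastic-agent-token   # must provide FLEET_URL and FLEET_ENROLLMENT_TOKEN
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state
            - name: state-cleanup-script
              mountPath: /scripts
      volumes:
        - name: state-cleanup-script
          configMap:
            name: elastic-agent-state-cleanup   # hypothetical ConfigMap holding the script
```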

@mgaruccio

Just a note here that the issue also impacts attempting to change the target Elastic cluster, not just token changes, with the added wrinkle that the agent takes ownership of the state directory and disallows deleting it, so you're stuck unless you create an init container to clear the state.


rlanhellas commented Sep 23, 2024

> Thanks @ch9hn , I just spent a day trying to figure out why the Kubernetes elastic agents were working yesterday and not today after I updated the cluster. I uninstalled the agents and installed the previous version with no luck. I finally saw there were some files on the machine in /var/lib/elastic-agent-managed/kube-system/state. I realised the install manifest files had mounted and saved connection details.
>
> Solution: I deleted the whole elastic-agent-managed folder on each machine and reinstalled the Kubernetes agent manifest file and it worked.
>
> @elasticmachine Elastic team please fix this basic bug!! hours wasted!

It worked for me too, but IMO this should be considered a critical bug since it has a big impact on everyone using Elastic Agent. Do you have any roadmap for fixing it in upcoming Elastic Agent versions? Just to be aware.

@nikolaigut

We have encountered the same problem. I agree that this is a critical error.

@strawgate
Contributor

It would be nice if the Agent logged that it's ignoring the env vars (or logged if the env vars don't match the current config) so that this behavior is more obvious

@tientmse62290

I got the same issue when using elastic agent with systemd and container also

@strawgate
Contributor

> I got the same issue when using elastic agent with systemd and container also

Just wanted to confirm that you saw a workaround is available in the comments, did this work for you?

@tientmse62290

> I got the same issue when using elastic agent with systemd and container also

> Just wanted to confirm that you saw a workaround is available in the comments, did this work for you?

Hi, deleting the state folder doesn't work in my case. I don't mount the elastic-agent folder to disk, and I don't think it works for the docker compose use case.

@strawgate
Contributor

The problem described in this issue is caused by persistent state in the container.

If you're using a fresh container each time and are not persisting state via a volume or bind/host mount, then you likely do not have the problem described in this issue; please post on the forums for community assistance (https://discuss.elastic.co/c/elastic-stack/elastic-agent) or contact Elastic support if you have access to support.


BBQigniter commented Dec 19, 2024

oh, come on - I just pulled my hair out for hours and was already thinking about opening a support ticket because I kept seeing invalid api key to authenticate with fleet in the agent's logs. I followed https://discuss.elastic.co/t/is-it-possible-to-remove-unenrolled-agents-from-fleet/345286 (why can't we delete unenrolled hosts via Kibana?), recreated the API key, recreated the whole policy, ...

And then I found this github issue.

After deleting the agent from my kubernetes cluster, deleting /var/lib/elastic-agent-managed/kube-system/state and reapplying the manifest, the agent finally works again 😐

It should definitely be noted somewhere in the documentation that this folder can cause problems.

@jlind23 jlind23 assigned pkoutsovasilis and unassigned swiatekm Jan 6, 2025

kpi-nourman commented Jan 20, 2025

I have the same issue; I needed to delete /var/lib/elastic-agent-managed directly on the host first, and then the Elastic Agent no longer got the "Failed to connect to backoff 401 Unauthorized" error after reapplying the elastic-agent DaemonSet manifest with the same FLEET_URL and FLEET_ENROLLMENT_TOKEN.

I'm running OCP 4.14 with Elastic Agent 8.15.

@pkoutsovasilis pkoutsovasilis linked a pull request Jan 22, 2025 that will close this issue