
Elastic Agent doesn't update the enrollment token in Kubernetes Deployment statefile #3586

Open
ch9hn opened this issue Oct 11, 2023 · 24 comments · May be fixed by #6568
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments


ch9hn commented Oct 11, 2023

When a new enrollment token is supplied via env or envFrom in the Kubernetes manifest, the new token is not picked up by Elastic Agent.
The likely reason is that Elastic Agent saves its state locally on every Kubernetes node and does not update the token there.
This leads to Unauthorized errors on the agent, and a redeploy with a new token is no longer possible.
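For reference, the enrollment token typically reaches the agent container in one of these two ways; a minimal sketch (the secret name elastic-agent-token mirrors the workaround below, the literal value is a placeholder):

```yaml
containers:
  - name: elastic-agent
    # Option 1: token set directly as an environment variable
    env:
      - name: FLEET_ENROLLMENT_TOKEN
        value: "ABC"
    # Option 2: token provided via a Secret
    envFrom:
      - secretRef:
          name: elastic-agent-token
```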

For confirmed bugs, please report:

  • Version: 8.10

  • Operating System: Ubuntu Linux / Kubernetes 1.27

  • Discuss Forum URL:

  • Steps to Reproduce:

  1. Install Elastic Agent on a Kubernetes cluster as described in the docs, with enrollment token "ABC"
  2. Expire "ABC" and add a new token "DEF"
  3. Restart the Elastic Agent DaemonSet
  4. Result: the old token "ABC" is persisted and used for communication with the Elastic Fleet Server

Error logs:

"Failed to connect to backoff(elasticsearch(https://xxxx.xxxx.cloud.es.io:443)): 401 Unauthorized: {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"unable to authenticate with provided credentials and anonymous access is not allowed for this request\",\"additional_unsuccessful_credentials\":\"API key: api key [xxxxxxx] has been invalidated\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer

Temporary fix:
When using a Kustomize deployment, the hostPath can be overridden quite easily with the following DaemonSet patch:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: elastic-agent
  template:
    metadata:
      labels:
        app: elastic-agent
    spec:
      containers:
        - name: elastic-agent
          env:
            - name: FLEET_URL
              $patch: delete
            - name: FLEET_ENROLLMENT_TOKEN
              $patch: delete
            - name: FLEET_INSECURE
              value: "false"
            - name: KIBANA_HOST
              $patch: delete
            - name: KIBANA_FLEET_USERNAME
              $patch: delete
            - name: KIBANA_FLEET_PASSWORD
              $patch: delete
          envFrom:
            - secretRef:
                name: elastic-agent-token
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state
      volumes:
        - name: elastic-agent-state
          hostPath:
            # Change the path here to match your deployment namespace, or use another name
            path: /var/lib/elastic-agent-managed/monitoring/state
            type: DirectoryOrCreate

@ch9hn ch9hn added the bug Something isn't working label Oct 11, 2023
@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Oct 11, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@shubhu934

Hello @elastic/elastic-agent, is there any timeline for a fix for this issue?

Member

cmacknz commented Nov 9, 2023

This isn't prioritized yet, but it is definitely annoying and has wasted some time even for people inside Elastic. CC @pierrehilbert


HGS9761 commented Nov 10, 2023

Hello
I am affected by this bug and it is not a minor issue.
So this should have high priority.


ghost commented Jan 15, 2024

Hello all,
we see the same issue on production nodes, blocking Elastic Agent tests. The workaround of cleaning up the state folder works, but it is still very unprofessional.

@rafaelbattesti

Hello folks.
I have come across this issue and reproduced with support.
This has consumed a few days digging into the root cause and I believe we should give this some attention.


ghost commented Feb 7, 2024

My idea: on Fleet-managed Elastic Agents it would make sense to have an "Elastic State Cleanup" button, so this would be easy to handle from the Kibana UI. Manually resetting the state on more than 60 deployed Elastic Agents drives the admins crazy.


neu7ron2 commented Mar 1, 2024

Thanks @ch9hn , I just spent a day trying to figure out why the Kubernetes elastic agents were working yesterday and not today after I updated the cluster. I uninstalled the agents and installed the previous version with no luck. I finally saw there were some files on the machine in /var/lib/elastic-agent-managed/kube-system/state and realised the install manifest had mounted and saved connection details.

Solution: I deleted the whole elastic-agent-managed folder on each machine, reapplied the Kubernetes agent manifest, and it worked.

@elasticmachine Elastic team please fix this basic bug!! hours wasted!


badele commented Mar 20, 2024

After hours of investigating and reinstalling elastic-agent on Kubernetes, I was trying to understand why I kept getting the message "Failed to connect to backoff(elasticsearch(https://244b20202ef45ddb481e55df6b19f4.eu-west-3.aws.elastic-cloud.com:443"

I searched for this error message on the 'elastic-cloud' site but found no documentation ...

While digging, I realized that something was being stored persistently, so my suspicions fell on the /usr/share/elastic-agent/state directory.

Indeed, in the manifest generated by elastic-cloud, we can see that it mounts the directory /var/lib/elastic-agent-managed/kube-system/state from a Kubernetes node:

        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/kube-system/state
            type: DirectoryOrCreate

Personally, I don't like the fact that they use the node's disk space to store persistence data.

But here's my solution to correct the problem

for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do kubectl -n kube-system exec "$pod" -- rm -rf /usr/share/elastic-agent; done
for pod in $(kubectl -n kube-system get pod --no-headers -o custom-columns=":metadata.name" -l app=elastic-agent); do kubectl -n kube-system delete pod "$pod"; done

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 26, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@michel-laterman
Contributor

Quick recap of the discussion surrounding this item from our weekly meeting:

  • Default behaviour for agent on k8s will be to re-enroll if the enrollment token changes
    • we will store the enrollment token hash in local state to compare when the value changes
  • some flag/env var to disable this behaviour should be provided, e.g. ELASTIC_AGENT_AUTO_REENROLL
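The proposed re-enroll-on-token-change check can be illustrated with a few lines of shell. This is a sketch under assumptions, not the agent's actual implementation: the state path, hash file name, and default token are illustrative.

```shell
#!/usr/bin/env sh
# Sketch: hash the enrollment token, compare it with the hash stored in local
# state, and re-enroll only when it changed. Paths and names are assumptions.
set -eu

STATE_DIR="${STATE_DIR:-/tmp/agent-state-demo}"
HASH_FILE="$STATE_DIR/.enrollment-token-hash"
TOKEN="${FLEET_ENROLLMENT_TOKEN:-ABC}"   # default only for this demo

mkdir -p "$STATE_DIR"
token_hash="$(printf '%s' "$TOKEN" | sha256sum | cut -d' ' -f1)"

if [ -f "$HASH_FILE" ] && [ "$(cat "$HASH_FILE")" = "$token_hash" ]; then
  echo "token unchanged, keeping existing enrollment state"
else
  echo "token changed or first run, re-enrolling"
  printf '%s' "$token_hash" >"$HASH_FILE"
fi
```

Running it twice with the same token takes the "unchanged" branch on the second run; changing FLEET_ENROLLMENT_TOKEN in between triggers the re-enroll branch.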


chrko commented Jun 19, 2024

Any updates on this, especially regarding priority? We are currently in discussion with our managed Kubernetes cluster operator team about implementing a small operator (taking the Fleet URL and the enrollment token), since the actual cluster users are not allowed to run any kind of privileged pods on their own. Before working out workarounds that have only minimal impact, e.g. on shipping logs / log duplication, I would like to clarify if and how we can prioritize this upstream, or perhaps contribute.

@mag-mkorn

Another workaround that persists the registry to prevent log duplication:

<...>
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state/data
<...>
      volumes:
        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/monitoring/state/data
            type: DirectoryOrCreate
<...>


chrko commented Jun 25, 2024

Based on what @mag-mkorn commented, we'll test and use the following as an init container script (a base alpine image is sufficient):

#!/usr/bin/env sh

set -eu

# Set pipefail if it works in a subshell, disregard if unsupported
# shellcheck disable=SC3040
(set -o pipefail 2>/dev/null) && set -o pipefail

STATE_DIRECTORY=/usr/share/elastic-agent/state
DATA_DIRECTORY=${STATE_DIRECTORY}/data

HASH_FILE=${STATE_DIRECTORY}/.env-hash
HASH_TARGET="$(printf "%s\0%s" "${FLEET_URL?}" "${FLEET_ENROLLMENT_TOKEN?}" | sha256sum -)"

prune_state() {
  find "${STATE_DIRECTORY}" -path "${DATA_DIRECTORY}" -prune -o -type f -print0 |
    xargs -0 --no-run-if-empty rm -v
}

save_hash() {
  echo "Save target hash into $HASH_FILE."
  printf "%s" "$HASH_TARGET" >"$HASH_FILE"
}

if [ -f "$HASH_FILE" ]; then
  echo "Existing hash found, comparing..."
  # cmp saved hash to target value
  HASH_CURRENT="$(cat "$HASH_FILE")"
  if [ "$HASH_TARGET" = "$HASH_CURRENT" ]; then
    echo "No change detected, no cleanup required."
  else
    echo "Existing hash does not match target hash. Pruning files outside the data dir..."
    prune_state
    save_hash
  fi
else
  save_hash
fi
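One way to wire a script like the one above into the DaemonSet is as an init container. A sketch under assumptions: the image tag, the ConfigMap name elastic-agent-state-cleanup, and the script path are illustrative, and the secret is assumed to provide FLEET_URL and FLEET_ENROLLMENT_TOKEN:

```yaml
      initContainers:
        - name: state-cleanup
          image: alpine:3.20
          command: ["sh", "/scripts/state-cleanup.sh"]
          envFrom:
            - secretRef:
                name: elastic-agent-token   # must provide FLEET_URL and FLEET_ENROLLMENT_TOKEN
          volumeMounts:
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state
            - name: state-cleanup-script
              mountPath: /scripts
      volumes:
        - name: state-cleanup-script
          configMap:
            name: elastic-agent-state-cleanup   # hypothetical ConfigMap holding the script
```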

@mgaruccio

Just a note here that the issue also impacts attempting to change the target Elastic cluster, not just token changes, with the added wrinkle that the agent takes ownership of the state directory and disallows deleting it, so you're stuck unless you create an init container to clear the state.


rlanhellas commented Sep 23, 2024

> Thanks @ch9hn , I just spent a day trying to figure out why the Kubernetes elastic agents were working yesterday and not today after I updated the cluster. I uninstalled the agents and installed the previous version with no luck. I finally saw there were some files on the machine in /var/lib/elastic-agent-managed/kube-system/state. I realised the install manifest files had mounted and saved connection details.
>
> Solution: I deleted the whole elastic-agent-managed folder on each machine and reinstalled the Kubernetes agent manifest file and it worked.
>
> @elasticmachine Elastic team please fix this basic bug!! hours wasted!

It worked for me too, but IMO this should be considered a critical bug since it has a big impact on everyone using Elastic Agent. Do you have any roadmap for fixing it in upcoming Elastic Agent versions? Just to be aware.

@nikolaigut

We have encountered the same problem. I agree that this is a critical error.

@strawgate
Contributor

It would be nice if the Agent logged that it's ignoring the env vars (or logged if the env vars don't match the current config) so that this behavior is more obvious

@tientmse62290

I got the same issue when using elastic agent with systemd and container also

@strawgate
Contributor

> I got the same issue when using elastic agent with systemd and container also

Just wanted to confirm that you saw a workaround is available in the comments, did this work for you?

@tientmse62290

> I got the same issue when using elastic agent with systemd and container also

> Just wanted to confirm that you saw a workaround is available in the comments, did this work for you?

Hi, deleting the state folder doesn't work in my case. I don't mount the elastic-agent folder to disk, and I don't think it works for the docker compose use case.

@strawgate
Contributor

The problem described in this issue is caused by persistent state in the container.

If you're using a fresh container each time and are not persisting state via a volume or bind/host mount, then you likely do not have the problem described in this issue; please post on the forums for community assistance (https://discuss.elastic.co/c/elastic-stack/elastic-agent) or contact Elastic support if you have access to support.


BBQigniter commented Dec 19, 2024

oh, come on - I just pulled my hair out for hours and was already thinking about opening a support ticket because I kept seeing invalid api key to authenticate with fleet in the agent's logs. I followed https://discuss.elastic.co/t/is-it-possible-to-remove-unenrolled-agents-from-fleet/345286 (why can't we delete unenrolled hosts via Kibana?), recreated the API key, recreated the whole policy, ...

And then I found this github issue.

After deleting the agent from my kubernetes cluster, deleting /var/lib/elastic-agent-managed/kube-system/state and reapplying the manifest, the agent finally works again 😐

It should definitely be noted somewhere in the documentation that this folder can cause problems.

@jlind23 jlind23 assigned pkoutsovasilis and unassigned swiatekm Jan 6, 2025

kpi-nourman commented Jan 20, 2025

I have the same issue; I needed to delete /var/lib/elastic-agent-managed directly on the host first, and then the Elastic Agent no longer got the "Failed to connect to backoff 401 Unauthorized" error after reapplying the elastic-agent DaemonSet manifest with the same FLEET_URL and FLEET_ENROLLMENT_TOKEN.

I'm running OCP 4.14 with Elastic Agent 8.15.

@pkoutsovasilis pkoutsovasilis linked a pull request Jan 22, 2025 that will close this issue