PWM watch thread doesn't restart the service when unable to start #125

Closed
pit1sIBM opened this issue Oct 4, 2024 · 2 comments

pit1sIBM commented Oct 4, 2024

Describe the bug

When insufficient RBAC is provided, the service receives 403 errors from the cluster, as expected. However, if the correct permissions are not applied before the watch thread exits, the service never picks up the change and stays stuck. In the logs this looks like:

2024-10-04T15:38:07.230908 [WTCHT:ERRR:140078775780928] Unable to start watch within 5 attempts
2024-10-04T15:38:12.100343 [TMRTH:DBUG:140078758995520] Timer executing action for event: TimerEvent(time=datetime.datetime(2024, 10, 4, 15, 38, 12, 99364), action=<bound method HeartbeatThread._run_heartbeat of <HeartbeatThread(heartbeat_thread, started daemon 140078758995520)>>, args=(), kwargs={}, stale=False)
2024-10-04T15:38:12.101064 [TMRTH:DBG2:140078758995520] Timer waiting 29.999622s until next scheduled event
2024-10-04T15:38:42.100994 [TMRTH:DBUG:140078758995520] Timer executing action for event: TimerEvent(time=datetime.datetime(2024, 10, 4, 15, 38, 42, 100649), action=<bound method HeartbeatThread._run_heartbeat of <HeartbeatThread(heartbeat_thread, started daemon 140078758995520)>>, args=(), kwargs={}, stale=False)

This is due to the following code, which calls sys.exit(1). I believe that only exits the calling thread (the same as _thread.exit()), not the main process.

log.error(
    "Unable to start watch within %d attempts",
    config.python_watch_manager.watch_retry_count,
)
sys.exit(1)

I assume we want to be calling the exit on the main thread instead?
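
To illustrate the behavior described above, here is a minimal standalone sketch (not taken from oper8; `failing_watch` and `process_survives_thread_exit` are hypothetical names). It shows that sys.exit(1) called inside a non-main thread raises SystemExit in that thread only, so the thread ends while the rest of the process keeps running:

```python
import sys
import threading


def failing_watch() -> None:
    # Stand-in for a watch thread that gives up after its retries.
    # sys.exit() raises SystemExit, which only unwinds the calling thread;
    # the threading module silently ignores it for non-main threads.
    sys.exit(1)


def process_survives_thread_exit() -> bool:
    worker = threading.Thread(target=failing_watch, name="watch_thread")
    worker.start()
    worker.join()
    # Reaching this line means the process was not terminated by the
    # worker's sys.exit(1); only the worker thread itself has ended.
    return not worker.is_alive()
```

Calling `process_survives_thread_exit()` returns normally in the main thread, which matches the logs above: the heartbeat timer keeps firing after the watch thread has already given up.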

Platform

Please provide details about the environment you are using, including the following:

  • Interpreter version: 3.11.7
  • Library version: 0.1.26

Sample Code

Run the operator without the appropriate RBAC

Expected behavior

Service exits and k8s restarts the pod
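
One possible way to get this behavior, sketched under assumptions: the worker reports its failure to the main thread, and the main thread turns that into a non-zero exit code. The names `watch_failed`, `watch_thread_body`, and `main` are hypothetical and are not oper8's API, and this is not necessarily how the issue was actually fixed upstream.

```python
import threading

# Hypothetical flag used by the worker to report an unrecoverable failure.
watch_failed = threading.Event()


def watch_thread_body(max_attempts: int = 5) -> None:
    for _attempt in range(max_attempts):
        # A real implementation would try to start the watch here.
        # This sketch assumes every attempt fails (for example with a 403).
        pass
    # Report the failure instead of calling sys.exit() inside the thread.
    watch_failed.set()


def main() -> int:
    worker = threading.Thread(target=watch_thread_body, daemon=True)
    worker.start()
    # The main thread waits for the failure signal and maps it to an exit
    # code. A long-running service would wait without the short timeout.
    if watch_failed.wait(timeout=5):
        return 1
    return 0
```

An entry point would then call `sys.exit(main())` from the main thread, so the process ends with a non-zero status and the container runtime can restart it.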

Observed behavior

Operator pod hangs

Additional context


pit1sIBM commented Oct 4, 2024

Looks to be addressed in https://github.com/IBM/oper8/releases/tag/v0.1.27


HonakerM commented Nov 8, 2024

Yeah, this was resolved by #116

HonakerM closed this as completed Nov 8, 2024