Project overview #1

Open
ArchangelSDY opened this issue Nov 25, 2021 · 1 comment
ArchangelSDY commented Nov 25, 2021

Summary

Developing an app that runs in Kubernetes is not easy. Managing a Kubernetes cluster is even harder. It takes years of experience to understand how Kubernetes works, how to read logs from different components, and where to start when some part of your cluster is not working.

This project aims to create a simple tool that runs diagnostics and advises you on where to direct your troubleshooting in ops scenarios.

Goals

  • A handy ops tool for troubleshooting Kubernetes and the apps running in it

Non-Goals

  • Deep integration with app development flow
  • Debug Kubernetes itself

User Experience

At an early stage, it should be a CLI tool with minimal dependencies.

Check subcommand

The check subcommand runs specific check suites.

For example, the following command runs the DNS, HTTP, Kubernetes, and app check suites:

kdebug check -s dns,http,kube,app

It generates a report after the checks complete.
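To make the `-s` option more concrete, here is a minimal sketch of how the comma-separated suite list might be parsed. It is written in Python purely for illustration. The suite names come from the example command; the `--suites` long flag, the default of running every suite, and the function names are assumptions, not part of the proposal.

```python
import argparse
from typing import List

# Suite names taken from the example command; the final set is not decided here.
KNOWN_SUITES = ("dns", "http", "kube", "app")


def parse_suites(value: str) -> List[str]:
    """Split a comma-separated suite list and reject unknown names."""
    suites = [s.strip() for s in value.split(",") if s.strip()]
    unknown = [s for s in suites if s not in KNOWN_SUITES]
    if unknown:
        raise argparse.ArgumentTypeError("unknown suites: " + ", ".join(unknown))
    return suites


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kdebug")
    sub = parser.add_subparsers(dest="command", required=True)
    check = sub.add_parser("check", help="run specific check suites")
    # Assumption: when -s is omitted, all known suites run.
    check.add_argument("-s", "--suites", type=parse_suites,
                       default=list(KNOWN_SUITES))
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["check", "-s", "dns,http"])
    print(args.suites)
```

Validating suite names at parse time means a typo fails immediately instead of silently skipping a suite.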

An example of a healthy report:

* DNS
=> [OK] System DNS
=> [OK] In-cluster CoreDNS
=> [OK] Azure DNS
=> [OK] Google DNS
* HTTP
=> [OK] Connectivity to kube-apiserver
=> [OK] Connectivity to google.com
* Kubernetes
=> [OK] Kubelet is running
* Apps
=> [OK] All pods are running

All OK.
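
The summary layout above could be produced by something like the following sketch. The `CheckResult` type and `render_summary` function are hypothetical names used only for illustration; the output format is copied from the example report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    suite: str  # e.g. "DNS"
    name: str   # e.g. "System DNS"
    ok: bool


def render_summary(results: List[CheckResult]) -> str:
    """Group results by suite (in first-seen order) and render one line each."""
    lines = []
    suites = list(dict.fromkeys(r.suite for r in results))
    for suite in suites:
        lines.append(f"* {suite}")
        for r in results:
            if r.suite == suite:
                status = "OK" if r.ok else "Fail"
                lines.append(f"=> [{status}] {r.name}")
    # The trailing "All OK." line only appears when nothing failed.
    if all(r.ok for r in results):
        lines.extend(["", "All OK."])
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_summary([
        CheckResult("DNS", "System DNS", True),
        CheckResult("HTTP", "Connectivity to kube-apiserver", True),
    ]))
```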

An example of an unhealthy report:

* DNS
=> [OK] System DNS
=> [Fail] In-cluster CoreDNS
=> [OK] Azure DNS
=> [OK] Google DNS
* HTTP
=> [OK] Connectivity to kube-apiserver
=> [OK] Connectivity to google.com
* Kubernetes
=> [Fail] Kubelet liveness
* Apps
=> [Fail] Pods CrashLoopBackOff

kdebug has detected these problems for you:

----------
Checker: In-cluster CoreDNS
Error: Time-out
Description: In-cluster CoreDNS query failed. Check if CoreDNS pods are running.
Recommendations:
Check CoreDNS pods using command `kubectl get pods -o wide -n kube-system | grep coredns`
Help links:
https://example.com

----------
Checker: Kubelet
Error: systemd service kubelet is not running
Description: Systemd service kubelet is not running. It has crashed 300 times in the last 1h.
Logs:
[I] xxx
[I] yyy
[F] cgroup is invalid.
...
Recommendations:
Use `systemctl status kubelet` to check its status.
Use `journalctl -r -u kubelet` to see full logs.
Reboot machine.
Help links:
https://foo.com
https://bar.com

----------
Checker: App
Error: Pod default/xxx is in CrashLoopBackOff state
Description: Pod is crashing. Last exit reason is OOM.
Recommendations:
Increase pod memory limit. Current is 100MB.
Check potential memory leak in your app.
Help links:
https://foo.com
https://bar.com
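
Each problem section above carries the same fields (checker, error, description, optional logs, recommendations, help links). One possible way to model and render that is sketched below. The `Problem` type and `render_problem` function are hypothetical; treating logs as optional is inferred from the CoreDNS and App sections, which have no `Logs:` block.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Problem:
    checker: str
    error: str
    description: str
    recommendations: List[str] = field(default_factory=list)
    help_links: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)  # optional; omitted when empty


def render_problem(p: Problem) -> str:
    """Render one problem section in the layout used by the example report."""
    lines = [
        "----------",
        f"Checker: {p.checker}",
        f"Error: {p.error}",
        f"Description: {p.description}",
    ]
    if p.logs:
        lines.append("Logs:")
        lines.extend(p.logs)
    lines.append("Recommendations:")
    lines.extend(p.recommendations)
    lines.append("Help links:")
    lines.extend(p.help_links)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_problem(Problem(
        checker="Kubelet",
        error="systemd service kubelet is not running",
        description="Systemd service kubelet is not running.",
        logs=["[F] cgroup is invalid."],
        recommendations=["Use `systemctl status kubelet` to check its status."],
        help_links=["https://foo.com"],
    )))
```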

ArchangelSDY commented Nov 26, 2021

Checker ideas

Kube system

  • ConfigMap/Secret size limit
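
As a rough sketch of the size-limit idea: Kubernetes documents a 1 MiB limit on the data stored in a ConfigMap or Secret, so a checker could flag objects that are approaching it. The code below works on a manifest already parsed into a dict. The 80% threshold and the function names are assumptions, and summing only the values is an approximation of how the API server accounts for size.

```python
import base64
from typing import Dict

LIMIT_BYTES = 1024 * 1024  # documented 1 MiB limit for ConfigMap/Secret data
WARN_RATIO = 0.8           # hypothetical threshold for reporting a failure early


def configmap_data_size(manifest: Dict) -> int:
    """Approximate payload size: UTF-8 bytes of data plus decoded binaryData."""
    total = 0
    for value in (manifest.get("data") or {}).values():
        total += len(value.encode("utf-8"))
    for value in (manifest.get("binaryData") or {}).values():
        # binaryData values are base64-encoded in the manifest.
        total += len(base64.b64decode(value))
    return total


def is_near_size_limit(manifest: Dict, warn_ratio: float = WARN_RATIO) -> bool:
    """Return True when the payload is at or above warn_ratio of the limit."""
    return configmap_data_size(manifest) >= warn_ratio * LIMIT_BYTES


if __name__ == "__main__":
    cm = {"data": {"app.conf": "key=value\n" * 1000}}
    print(configmap_data_size(cm), is_near_size_limit(cm))
```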

VM

  • Disk usage
  • IMDS scheduled events
  • Reboot reason/kernel panic from kernel log
  • OOM killer log
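
Two of the VM ideas above, disk usage and the OOM killer log, can be sketched with the Python standard library alone. The 90% threshold and the function names are assumptions. The exact wording of OOM killer messages varies across kernel versions, so the regular expression below is a best-effort match on the common `Killed process <pid> (<name>)` fragment and not a complete parser.

```python
import re
import shutil
from typing import List, Tuple

DISK_THRESHOLD_PERCENT = 90.0  # hypothetical threshold

# Best-effort pattern; kernel message formats differ between versions.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def used_percent(total: int, used: int) -> float:
    return 100.0 * used / total if total else 0.0


def check_disk(path: str = "/",
               threshold: float = DISK_THRESHOLD_PERCENT) -> Tuple[bool, float]:
    """Return (ok, used percentage) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    pct = used_percent(usage.total, usage.used)
    return pct < threshold, pct


def find_oom_kills(kernel_log_lines: List[str]) -> List[Tuple[int, str]]:
    """Extract (pid, process name) pairs from kernel log lines."""
    kills = []
    for line in kernel_log_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


if __name__ == "__main__":
    print(check_disk("."))
    # Illustrative log line, not captured from a real machine.
    print(find_oom_kills(["Out of memory: Killed process 1234 (myapp)"]))
```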

App

  • Too many pod restarts/reason
  • Errors/warnings in kube events
  • Pods stuck in pending state
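
For the App ideas, a checker could inspect pod objects as returned by `kubectl get pods -o json`. The sketch below reads the standard pod status fields (`phase`, `containerStatuses[].restartCount`, `state.waiting.reason`, `lastState.terminated.reason`). The restart threshold, function name, and message wording are assumptions. A real "stuck in pending" check would also need to compare the pod's age against a timeout, which is left out here.

```python
from typing import Dict, List

RESTART_THRESHOLD = 5  # hypothetical threshold for "too many restarts"


def find_pod_problems(pod: Dict,
                      restart_threshold: int = RESTART_THRESHOLD) -> List[str]:
    """Return human-readable problems found in one pod object (parsed JSON)."""
    meta = pod.get("metadata", {})
    name = f"{meta.get('namespace', 'default')}/{meta.get('name', '?')}"
    status = pod.get("status", {})
    problems = []

    if status.get("phase") == "Pending":
        problems.append(f"Pod {name} is in Pending state")

    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting", {})
        restarts = cs.get("restartCount", 0)
        if waiting.get("reason") == "CrashLoopBackOff":
            last = cs.get("lastState", {}).get("terminated", {})
            reason = last.get("reason", "unknown")
            problems.append(
                f"Pod {name} is in CrashLoopBackOff state "
                f"(last exit reason: {reason})")
        elif restarts >= restart_threshold:
            problems.append(f"Pod {name} has restarted {restarts} times")
    return problems


if __name__ == "__main__":
    sample = {
        "metadata": {"namespace": "default", "name": "xxx"},
        "status": {"phase": "Running", "containerStatuses": [{
            "name": "app", "restartCount": 12,
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            "lastState": {"terminated": {"reason": "OOMKilled"}},
        }]},
    }
    print(find_pod_problems(sample))
```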

ArchangelSDY added the enhancement label Nov 26, 2021
ArchangelSDY added the vision label and removed the enhancement label Sep 16, 2022