This repository has been archived by the owner on Apr 24, 2023. It is now read-only.

pods go in Pending state intermittently, scheduler restart solves the issue #251

Open
hunny-garg opened this issue Apr 7, 2023 · 1 comment


hunny-garg commented Apr 7, 2023

In our environment, Spark pods intermittently get stuck in the Pending state. The only fix we have found is to restart the Spark scheduler pods.
We see the errors below in the spark-scheduler-extender logs, though we are not sure they are related to the issue.
Any pointers that could explain this behaviour would be appreciated.

k8s version: v1.23
spark-scheduler version: v0.58.0

"stacktrace": "error when looking for already bound reservations\nfailed to get resource reservations podName:agg-spark-350zvn28en0u-b29f74875b02ba23-exec-1, podNamespace:prod01\n\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*ResourceReservationManager).FindAlreadyBoundReservationNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resourcereservations.go:141\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).selectExecutorNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:382\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).selectNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:210\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).Predicate\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:151\ngithub.com/palantir/k8s-spark-scheduler/cmd.registerExtenderEndpoints.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/cmd/endpoints.go:36\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2109\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).Register.func1.1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:136\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRouteLogTraceSpan.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/route.go:107\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRouteRequestLog.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/route.go:32\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestMetricRequestMeter.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:168\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).Register.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:139\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2109\ngithub.com/julienschmidt/httprouter.(*Router).Handler.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/julienschmidt/httprouter/router.go:275\ngithub.com/julienschmidt/httprouter.(*Router).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/julienschmidt/httprouter/router.go:387\ngithub.com/palantir/witchcraft-go-server/wrouter/whttprouter.(*router).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/whttprouter/routerimpl.go:71\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestExtractIDs.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:139\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestContextLoggers.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:73\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestContextMetricsRegistry.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:84\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestPanicRecovery.func1.1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:42\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/negroni.(*Recovery).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/negroni/recovery.go:193\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestPanicRecovery.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:41\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:103\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2947\nnet/http.initALPNRequest.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:3556\nnet/http.(*http2serverConn).runHandler\n\t/usr/local/go/src/net/http/h2_bundle.go:5910",

hunny-garg (Author) commented

We also see the warning below in the spark-scheduler-extender container logs when the issue starts occurring.

{"type":"service.1","time":"2023-04-08T02:39:45.830415574Z","level":"WARN","origin":"github.com/palantir/k8s-spark-scheduler","message":"found unexplained cache size difference","params":{"rrs":0,"rrsCached":109}}
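To check how often this warning appears and how large the gap is, something like the following Python sketch could be run over saved extender logs. It assumes each log entry is a single JSON object per line, as in the sample above, and that `rrs` and `rrsCached` are the two counts being compared. The function name `cache_drift` is made up for illustration and is not part of k8s-spark-scheduler.

```python
import json

# Message text copied from the WARN entry above.
WARNING_MESSAGE = "found unexplained cache size difference"


def cache_drift(log_line):
    """Return (rrs, rrsCached) if the line is the cache-size warning, else None.

    Hypothetical helper: it only reads fields that appear in the sample log line.
    """
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return None  # not a JSON log line
    if not isinstance(entry, dict):
        return None
    if entry.get("message") != WARNING_MESSAGE:
        return None
    params = entry.get("params", {})
    return params.get("rrs"), params.get("rrsCached")


# The sample line from the extender logs.
line = (
    '{"type":"service.1","time":"2023-04-08T02:39:45.830415574Z",'
    '"level":"WARN","origin":"github.com/palantir/k8s-spark-scheduler",'
    '"message":"found unexplained cache size difference",'
    '"params":{"rrs":0,"rrsCached":109}}'
)

drift = cache_drift(line)
if drift is not None:
    rrs, cached = drift
    print(f"rrs={rrs} rrsCached={cached} diff={cached - rrs}")
```

For the sample line this reports `rrs=0` against `rrsCached=109`, a difference of 109.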
