Make readiness and liveliness probes optional #2121
Conversation
Due to widespread problems with unreliable readiness and liveness probes on installations where the apiserver and controller-manager are not on the same node (working theory), a number of users on non-dev clusters are reporting that the controller-manager pod spends most of its time in a self-induced CrashLoopBackOff. As a workaround, we have been having people manually edit the deployment to remove the checks. Long term we want to make these more configurable anyway (e.g. bumping the timeouts), but for now an "enabled" flag will make it easier for people on AKS, EKS, and GKE to work around the issue while we investigate further.
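To illustrate the shape of the change, here is a minimal sketch of how such an "enabled" flag could gate both probes in the controller-manager Deployment template of the Helm chart. The value name (`controllerManager.healthcheck.enabled`), port, paths, and thresholds are illustrative assumptions, not the chart's actual settings.

```yaml
# templates/controller-manager-deployment.yaml (sketch): render the probes only
# when the flag is set; omitting them entirely avoids the self-induced crashloop
# on clusters where the kubelet cannot reach the pod reliably.
{{- if .Values.controllerManager.healthcheck.enabled }}
          readinessProbe:
            httpGet:
              path: /healthz/ready   # illustrative path
              port: 8444             # illustrative port
              scheme: HTTPS
            failureThreshold: 1
            periodSeconds: 10
            timeoutSeconds: 2
          livenessProbe:
            httpGet:
              path: /healthz         # illustrative path
              port: 8444             # illustrative port
              scheme: HTTPS
            failureThreshold: 3
            initialDelaySeconds: 40
            periodSeconds: 10
            timeoutSeconds: 2
{{- end }}
```

With a default of `healthcheck: {enabled: true}` in values.yaml, existing installs keep the current behavior, while affected users can set the flag to false at install time instead of hand-editing the deployment.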
I don't think #2100 is really related: pings from the master to the catalog API pod time out, which looks like an infrastructure/networking/config issue. However, I have actually run into this issue in my dev environment running with hack/local-up-cluster.sh. I haven't used it in months because of this, but I'll retry to see if it's still an issue; maybe that will help with debugging the root cause.
I am not saying that #2100 is caused by this. But based on his logs, and the panic he sent in the /health logs, I think he may be hitting some of the same flakiness in the probes, no?
LGTM
All good with this. Maybe in a future PR we should split out the params for the readiness probe and liveness probe. Thoughts?
Yeah, I was thinking that by making the healthchecks a struct, we can expand it in the future to allow more interesting customization besides completely turning it all off. 👍
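A possible follow-up shape for that struct, purely as a sketch of the idea discussed here (none of these keys exist in this PR; names and defaults are hypothetical):

```yaml
# Hypothetical future values structure: per-probe switches plus tunable timings,
# so probes can be relaxed (e.g. longer timeouts) rather than only turned off.
controllerManager:
  healthcheck:
    readiness:
      enabled: true
      initialDelaySeconds: 20
      periodSeconds: 10
      timeoutSeconds: 5
    liveness:
      enabled: true
      initialDelaySeconds: 40
      periodSeconds: 10
      timeoutSeconds: 5
```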
Due to reported problems with unreliable readiness and liveness probes on installations where the apiserver and controller-manager are not on the same node (working theory), a number of users on non-dev clusters are reporting that the controller-manager pod spends most of its time in a self-induced CrashLoopBackOff. As a workaround, we have been having people manually edit the deployment to remove the checks. Long term we want to make these more configurable anyway (e.g. bumping the timeouts), but for now an "enabled" flag will make it easier for people on AKS, EKS, and GKE to work around the issue while we investigate further.

Related to #2100 and Azure/AKS#417.
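For reference, the interim workaround described above amounts to editing the Deployment (e.g. with `kubectl edit deployment`) and deleting the probe stanzas from the controller-manager container. A sketch of the blocks that get removed follows; the port and paths vary by installation and are illustrative here.

```yaml
# Delete both of these stanzas from the controller-manager container spec
# (port and paths are illustrative; they differ per chart release).
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: 8444
            scheme: HTTPS
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8444
            scheme: HTTPS
```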