Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

Make readiness and liveliness probes optional #2121

Merged

Conversation

carolynvs
Copy link
Contributor

Due to reported problems with unreliable readiness and liveliness probes on installations where the apiserver and controller-manager are not on the same node (working theory), on non-dev clusters, a number of users are reporting that the controller-manager pod spends most of its
time in a self-induced crashloopbackoff.

As a workaround, we have been having people manually edit the deployment ot remove the checks. Long term we want to make these more configurable anyway (i.e. bumping the timeouts), but for now an "enabled" flag will make it easier for people on AKS, EKS and GKE workaround the issue more
easily while we investigate further.

Related to #2100 and Azure/AKS#417.

Due to widespread problems with unreliable readiness and liveliness
probes on installations where the apiserver and controller-manager are
not on the same node (working theory), on non-dev clusters, a number of
users are reporting that the controller-manager pod spends most of its
time in a self-induced crashloopbackoff.

As a workaround, we have been having people manually edit the deployment
ot remove the checks. Long term we want to make these more configurable
anyway (i.e. bumping the timeouts), but for now an "enabled" flag will
make it easier for people on AKS, EKS and GKE workaround the issue more
easily while we investigate further.
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 14, 2018
@carolynvs carolynvs requested review from arschles and removed request for bmelville June 14, 2018 12:45
@jboyd01
Copy link
Contributor

jboyd01 commented Jun 14, 2018

I don't think #2100 is really related - pings from Master to the catalog api pod timeout, looks like infrastructure/networking/config issues. However I have actually run into this issue in my dev environment running with hack/local-up-cluster.sh. I haven't used it in months because of this but I'll retry to see if its still an issue, may it help with debugging the root cause.

@jboyd01 jboyd01 added the LGTM1 label Jun 14, 2018
@carolynvs
Copy link
Contributor Author

carolynvs commented Jun 14, 2018

I am not saying that #2100 is caused by this. But based on his logs, and the panic he sent in the /health logs, I think he may be hitting some of the same flakiness in the probes, no?

Copy link
Contributor

@arschles arschles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

All good with this. maybe in a future PR we should split out the param for the readiness probe and liveness probe. Thoughts?

@arschles arschles added the LGTM2 label Jun 14, 2018
@arschles arschles merged commit 467751a into kubernetes-retired:master Jun 14, 2018
@carolynvs
Copy link
Contributor Author

All good with this. maybe in a future PR we should split out the param for the readiness probe and liveness probe. Thoughts?

Yeah I was thinking that by making the healthchecks a struct, we can expand it in the future to allow more interesting customization besides completely turning it all off. 👍

@carolynvs carolynvs deleted the disable-healthchecks-flag branch June 14, 2018 18:07
kikisdeliveryservice pushed a commit to kikisdeliveryservice/service-catalog that referenced this pull request Jun 29, 2018
Due to widespread problems with unreliable readiness and liveliness
probes on installations where the apiserver and controller-manager are
not on the same node (working theory), on non-dev clusters, a number of
users are reporting that the controller-manager pod spends most of its
time in a self-induced crashloopbackoff.

As a workaround, we have been having people manually edit the deployment
ot remove the checks. Long term we want to make these more configurable
anyway (i.e. bumping the timeouts), but for now an "enabled" flag will
make it easier for people on AKS, EKS and GKE workaround the issue more
easily while we investigate further.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. LGTM1 LGTM2 size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants