Make readiness and liveliness probes optional #2121

carolynvs · 2018-06-14T12:42:08Z

On non-dev clusters, a number of users are reporting that the controller-manager pod spends most of its time in a self-induced CrashLoopBackOff. The cause appears to be unreliable readiness and liveness probes; the working theory is that this happens on installations where the apiserver and controller-manager are not on the same node.

As a workaround, we have been having people manually edit the deployment to remove the checks. Long term we want to make these more configurable anyway (e.g. bumping the timeouts), but for now an "enabled" flag will make it easier for people on AKS, EKS and GKE to work around the issue while we investigate further.

Related to #2100 and Azure/AKS#417.


jboyd01 · 2018-06-14T13:12:37Z

I don't think #2100 is really related: pings from the master to the catalog API pod time out, which looks like an infrastructure/networking/config issue. However, I have actually run into this issue in my dev environment running with hack/local-up-cluster.sh. I haven't used it in months because of this, but I'll retry to see if it's still an issue; it may help with debugging the root cause.

carolynvs · 2018-06-14T13:24:43Z

I am not saying that #2100 is caused by this. But based on his logs, and the panic he sent in the /health logs, I think he may be hitting some of the same flakiness in the probes, no?

arschles

LGTM

All good with this. maybe in a future PR we should split out the param for the readiness probe and liveness probe. Thoughts?

carolynvs · 2018-06-14T18:07:55Z

> All good with this. maybe in a future PR we should split out the param for the readiness probe and liveness probe. Thoughts?

Yeah I was thinking that by making the healthchecks a struct, we can expand it in the future to allow more interesting customization besides completely turning it all off. 👍
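One way the struct could grow later, sketched in Go purely as an illustration: a top-level flag acts as the master switch, and each probe can optionally override it or carry its own timeout. Every name here (`probe`, `Liveness`, `Readiness`, `TimeoutSeconds`, `active`) is hypothetical and not taken from the chart.

```go
package main

import "fmt"

// probe holds hypothetical per-probe settings.
type probe struct {
	Enabled        *bool // nil means "inherit the top-level flag"
	TimeoutSeconds int   // 0 means "use the default timeout"
}

// healthcheck is a hypothetical expanded values struct.
type healthcheck struct {
	Enabled   bool
	Liveness  probe
	Readiness probe
}

// active reports whether a probe should be rendered. The top-level flag
// turns everything off; otherwise a per-probe setting, if present, decides.
func (h healthcheck) active(p probe) bool {
	if !h.Enabled {
		return false
	}
	if p.Enabled == nil {
		return true
	}
	return *p.Enabled
}

func main() {
	off := false
	hc := healthcheck{
		Enabled:   true,
		Liveness:  probe{Enabled: &off},
		Readiness: probe{TimeoutSeconds: 5},
	}
	fmt.Println("liveness:", hc.active(hc.Liveness))   // liveness: false
	fmt.Println("readiness:", hc.active(hc.Readiness)) // readiness: true
}
```

Keeping the single `Enabled` flag at the top means existing settings would continue to work if per-probe options were added.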


k8s-ci-robot requested review from bmelville and MHBauer June 14, 2018 12:42

k8s-ci-robot added the labels "cncf-cla: yes" (the PR's author has signed the CNCF CLA) and "size/S" (a PR that changes 10-29 lines, ignoring generated files) Jun 14, 2018

carolynvs mentioned this pull request Jun 14, 2018

[URGENT SOMEONE PLEASE HELP] Can someone help me in resolving the issue mentioned in this link https://github.com/kubernetes-incubator/service-catalog/issues/1867 #2100

Closed

carolynvs requested review from arschles and removed request for bmelville June 14, 2018 12:45

jboyd01 added the LGTM1 label Jun 14, 2018

arschles approved these changes Jun 14, 2018


arschles added the LGTM2 label Jun 14, 2018

arschles merged commit 467751a into kubernetes-retired:master Jun 14, 2018

carolynvs deleted the disable-healthchecks-flag branch June 14, 2018 18:07

tamalsaha mentioned this pull request Dec 26, 2018

Readiness probe failed: HTTP probe failed with statuscode: 403 voyagermesh/voyager#1296

Closed
