Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

Keep InProgressProperties up-to-date #1916

Merged
merged 4 commits into from
Apr 27, 2018
Merged

Keep InProgressProperties up-to-date #1916

merged 4 commits into from
Apr 27, 2018

Conversation

nilebox
Copy link
Contributor

@nilebox nilebox commented Apr 5, 2018

Alternative fix for #1872 (comment) as described in #1903 (comment)

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2018
@nilebox
Copy link
Contributor Author

nilebox commented Apr 5, 2018

@kibbles-n-bytes Please check if this fix looks good.

}
const interval = 100 * time.Millisecond
const timeout = 10 * time.Second
err := wait.PollImmediate(interval, timeout, func() (bool, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @kibbles-n-bytes described previously, we need this retry loop to correctly handle the situation after we make locks optional, when user updates the spec before controller sends a successful status update.

instance.Status.InProgressProperties = &v1beta1.ServiceInstancePropertiesState{
ClusterServicePlanExternalName: testClusterServicePlanName,
ClusterServicePlanExternalID: testClusterServicePlanGUID,
Parameters: &runtime.RawExtension{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking that the InProgressProperties stale fields will be updated at retry.

if s1 == nil && s2 == nil {
return true
}
if s1 == nil && s2 != nil || s1 != nil && s2 == nil {
Copy link
Contributor

@duglin duglin Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know go will do the right thing, but to help readability, and newbie understandability, can you put parens around each side of the || ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edited - sorry, I meant ||

if s1.UserInfo != nil || s2.UserInfo != nil {
u1 := s1.UserInfo
u2 := s2.UserInfo
if u1 == nil && u2 != nil || u1 != nil && u2 == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

@@ -377,6 +379,16 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
// recordStartOfServiceInstanceOperation has updated the instance, so we need to continue in the next iteration
return nil
}
if !isServiceInstancePropertiesStateEqual(instance.Status.InProgressProperties, inProgressProperties) {
instance, err = c.updateServiceInstanceInProgressProperties(instance, inProgressProperties)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could just make it:

_, err = c.update...
return err

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, buf I follow the pattern just a few lines above with recordStartOfServiceInstanceOperation. I guess it makes the code safer for the future refactoring + has nice comments.

@kibbles-n-bytes
Copy link
Contributor

kibbles-n-bytes commented Apr 5, 2018

@nilebox Definitely an improvement! Though I think it's not just the InProgressProperties that we need to make sure are set. We should really be doing everything that is part of recordStartOfServiceInstanceOperation; after all, what we're really doing is starting a new operation that is overriding the previous one we were attempting. So this includes resetting the OperationStartTime and setting DeprovisionStatus to Required as a safe-guard (currently the code would not orphan mitigate correctly if we originally have a 4XX error, but then have a 5XX error or async failure).

So really I think what we could do is group those two blocks as so:

if instance.Status.CurrentOperation == "" || !isServiceInstancePropertiesStateEqual(instance.Status.InProgressProperties, inProgressProperties) {
    instance, err = c.recordStartOfServiceInstanceOperation(instance, v1beta1.ServiceInstanceProvision, inProgressProperties)
    if err != nil {
        // bla bla bla
        return err
    }
    // more bla
    return nil
}

I still think we should defensively start orphan mitigating before starting a new request if we notice ObservedGeneration != Generation && instance.Status.DeprovisionStatus == v1beta1.ServiceInstanceDeprovisionStatusRequired. There will always be holes that could get us into this state, no matter how defensive we are.

And finally, I think we should add the conflict-protection as the default for all status updates, as the implementation of our c.updateServiceInstanceStatus function. It's not just missing successful provisions we are worried about; instance updates (in the OSB sense) are in fact even more dangerous, since if the status update fails, we lose all information as to what came back from the broker. At least for provisioning, we can always orphan mitigate, but we're SOL for instance updates.

@@ -377,6 +379,16 @@ func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstan
// recordStartOfServiceInstanceOperation has updated the instance, so we need to continue in the next iteration
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminded me of something else.

I'm unclear about what return err vs nil means in the context of this outermost function.

Should the if instance.Status.CurrentOperation == "" { be a guard clause, with the recordStart taking place fully after the entire of the outer if body?

My read is, if we're not doing an operation, start one, but if we are doing an operation, go back to the beginning of the loop.

Is this to avoid updating the existing instance after we've already updated it, because it's been requeued by making a change? Shouldn't we proceed onward and do the rest of the instance processing? Is there something at the beginning of t he function before this that needs to be rechecked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear about what return err vs nil means in the context of this outermost function.

Returning error mostly tells the controller to perform exponential backoff (rate limiter) and not marking item as done.

Is this to avoid updating the existing instance after we've already updated it, because it's been requeued by making a change?

Yes.

Shouldn't we proceed onward and do the rest of the instance processing?

No, because there are issues with cache caused by this like #1639
We should only update instance once during a single iteration.

Copy link
Contributor Author

@nilebox nilebox Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MHBauer In case you (and others) wonder why do we update status in the middle of the process at all. The normal Kube controller logic is to never read the status - it's only used to report to the user. So it's perfectly fine if we do some multi-step work for a minute and then fail to update the status - the operation will be retried and everything's gonna be eventually fine.

We, on the other hand, have to read the status and drive our controller logic based on the last written status. Hence we can't just update the status in the end - we have to commit every stage of the process - effectively we have a "status driven workflow".

The reason for that is that OSB is not Kube friendly - it doesn't have GET endpoints (so we can't get the state of instance/binding), and it's hard to tell which exactly operation is currently in progress (there is just last_operation without details of operation itself). I see this as a hole in OSB REST API, and hope that we will eventually fix it in v3.

@nilebox
Copy link
Contributor Author

nilebox commented Apr 5, 2018

So really I think what we could do is group those two blocks

Ok, will do.

And finally, I think we should add the conflict-protection as the default for all status updates, as the implementation of our c.updateServiceInstanceStatus function.

And this as well.
The rest can be done as separate follow-ups.

I still think we should defensively start orphan mitigating before starting a new request if we notice ObservedGeneration != Generation && instance.Status.DeprovisionStatus == v1beta1.ServiceInstanceDeprovisionStatusRequired.

@kibbles-n-bytes I think it's a bit more complicated than that - it's not only about orphan mitigation, but also potentially about performing status cleanup in general.
There are at least 2 cases:

  1. ProvisionStatus == NotProvisioned. You could start orphan mitigation via processTemporaryProvisionFailure.
  2. ProvisionStatus == Provisioned. There is no need for orphan mitigation, since it was an update case, not provision. For that you could probably invoke processTemporaryUpdateServiceInstanceFailure to trigger cleanup.

That said, I don't think this is a blocker for #1917 given that we already have this issue no matter do we have locks or not.
Feel free to pick up this change.

@kibbles-n-bytes
Copy link
Contributor

There are at least 2 cases

I'm just talking about in reconcileServiceInstanceAdd; we'd only get into that function if getReconciliationActionForServiceInstance returned NotProvisioned.

That said, I don't think this is a blocker for #1917 given that we already have this issue no matter do we have locks or not.

You're right in that it is less priority, now that you've added conflict-resolution to the status updates. Without that we would have seen it a lot more frequently without the lock. With the lock it's not an issue because we just repeat the same request. But I'm fine with it not being included in this PR.

Copy link
Contributor

@kibbles-n-bytes kibbles-n-bytes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Just a note of some follow-up work items I notice:

  1. We should defensively start orphan mitigating before starting a new request if we notice ObservedGeneration != Generation && instance.Status.DeprovisionStatus == v1beta1.ServiceInstanceDeprovisionStatusRequired.

  2. We have a hole where if we start an operation, get a error where we don't have to orphan mitigate, and then retry and get an error where we do have to, we will skip actually deprovisioning at the broker because we never mark DeprovisionStatus as Required.

  3. We no longer do multiple status updates per reconciliation, so the contract of updateServiceInstanceStatus can be simplified to just return error.

I'll make issues for the first two.

@MHBauer
Copy link
Contributor

MHBauer commented Apr 13, 2018

@kibbles-n-bytes links for issues created?

@kibbles-n-bytes
Copy link
Contributor

kibbles-n-bytes commented Apr 27, 2018

@MHBauer #2001 for the first. For the second I need to double check to see if I can actually trigger it.

@@ -4269,7 +4355,7 @@ func TestResolveReferencesNoClusterServiceClass(t *testing.T) {
}

// TestReconcileServiceInstanceUpdateParameters tests updating a
// ServiceInstance with new paramaters
// ServiceInstance with new parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@pmorie pmorie added the LGTM2 label Apr 27, 2018
@pmorie
Copy link
Contributor

pmorie commented Apr 27, 2018

Thanks a lot for the patch

@pmorie pmorie merged commit 541866d into kubernetes-retired:master Apr 27, 2018
@nilebox nilebox deleted the sync-in-progress-properties branch May 14, 2018 00:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. LGTM1 LGTM2 non-happy-path size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants