
deflaky: TestSnapshotStatus #19313

Merged: 1 commit merged into etcd-io:main from fuweid:deflaky-TestSnapshotStatus on Jan 31, 2025

Conversation


@fuweid fuweid commented Jan 31, 2025

Root Cause

The backend commits pending changes into boltdb every 100 ms by default. When a brand-new member starts, it saves its member record into the backend via `MustSaveMemberToBackend`. However, that call does not force the change into boltdb; it relies on the backend's 100 ms auto-commit.

If the auto-commit happens only after `UpdateStorageVersion` runs, `UpdateStorageVersion` fails with `missing confstate information` (the confstate is written by the commit pre-hook). Once `UpdateStorageVersion` fails, `TestSnapshotStatus` has no further chance to update it, because the server has already been closed. Without the storage version information, the hash of the boltdb data does not match the expected value.

How to reproduce it?

```diff
diff --git a/etcdutl/snapshot/v3_snapshot_test.go b/etcdutl/snapshot/v3_snapshot_test.go
index c2b3d5202..cd88ff995 100644
--- a/etcdutl/snapshot/v3_snapshot_test.go
+++ b/etcdutl/snapshot/v3_snapshot_test.go
@@ -141,6 +141,7 @@ func createDB(t *testing.T, generateContent func(*etcdserver.EtcdServer)) string
        t.Helper()

        cfg := embed.NewConfig()
+       cfg.BackendBatchInterval = 5 * time.Second
        cfg.LogLevel = "fatal"
        cfg.Dir = t.TempDir()
```

```bash
➜  snapshot git:(deflaky-TestSnapshotStatus) ✗ go test -v -run "^TestSnapshotStatus$" -count 1 ./
=== RUN   TestSnapshotStatus
    v3_snapshot_test.go:47:
                Error Trace:    /home/weifu/workspace/etcd/etcdutl/snapshot/v3_snapshot_test.go:47
                Error:          Not equal:
                                expected: 0x62132b4d
                                actual  : 0x6bf96324
                Test:           TestSnapshotStatus
--- FAIL: TestSnapshotStatus (0.14s)
FAIL
FAIL    go.etcd.io/etcd/etcdutl/v3/snapshot     0.154s
FAIL
```

How to fix?

Set `BackendBatchLimit` to `1` so that the change is force-committed to boltdb during `MustSaveMemberToBackend`.
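As a rough sketch, assuming the change is applied in the test's `createDB` (mirroring the reproduction diff above; the helper name `newTestConfig` is hypothetical and only wraps the relevant lines):

```go
package snapshot

import (
	"testing"

	"go.etcd.io/etcd/server/v3/embed"
)

// newTestConfig is a hypothetical wrapper around the relevant test setup:
// with BackendBatchLimit set to 1, every pending backend write is committed
// to boltdb immediately, so the member record written by
// MustSaveMemberToBackend is persisted before UpdateStorageVersion runs.
func newTestConfig(t *testing.T) *embed.Config {
	cfg := embed.NewConfig()
	cfg.BackendBatchLimit = 1
	cfg.LogLevel = "fatal"
	cfg.Dir = t.TempDir()
	return cfg
}
```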

Fixes: #19228

cc @ahrtr @serathius

codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.87%. Comparing base (2895b19) to head (12bbee7).
Report is 43 commits behind head on main.

Additional details and impacted files

see 37 files with indirect coverage changes

```
@@            Coverage Diff             @@
##             main   #19313      +/-   ##
==========================================
- Coverage   68.88%   68.87%   -0.02%     
==========================================
  Files         420      420              
  Lines       35689    35701      +12     
==========================================
+ Hits        24586    24588       +2     
- Misses       9684     9691       +7     
- Partials     1419     1422       +3     
```



ahrtr commented Jan 31, 2025

> When brand-new member starts, that member will save member into backend by MustSaveMemberToBackend. However, it doesn't force change into boltdb. It depends on backend's auto-commit per 100ms.

The analysis basically makes sense. The reason is that UpdateStorageVersion might fail because the conf state was not persisted in time. But what's the relationship with MustSaveMemberToBackend?


ahrtr commented Jan 31, 2025

> What's the relationship with MustSaveMemberToBackend?

OK, the confState only gets updated when applyConfChange is applied.


@ahrtr ahrtr left a comment


LGTM

Thanks for the nice catch!

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid


@ahrtr ahrtr merged commit b197332 into etcd-io:main Jan 31, 2025
36 checks passed
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 16, 2025
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the
target version. However, this behavior is unpredictable and may cause unexpected
errors.

Example 1: Single-Member Cluster

  T0: Set up a single-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade the member from v3.6.0 to v3.5.0.
  T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster
should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster

  T0: Set up a three-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade two members from v3.6.0 to v3.5.0.
  T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain
at v3.5.0 instead of v3.6.0.

Key Takeaways

  * The binary version alone should not determine the cluster version.
  * The caller must explicitly track and manage the expected version.

2. Disable Downgrade Monitor by Default in Robustness

As demonstrated in Example 1, disabling the downgrade monitor reduces test
flakiness.

3. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage
version monitor detects a mismatch between the cluster and storage versions.

If traffic is insufficient to trigger a commit or if an auto-commit occurs right
after reading the storage version, the monitor may fail to update it, leading
to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
@fuweid fuweid deleted the deflaky-TestSnapshotStatus branch February 16, 2025 17:45
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 17, 2025
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the
target version. However, this behavior is unpredictable and may cause unexpected
errors.

Example 1: Single-Member Cluster

  T0: Set up a single-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade the member from v3.6.0 to v3.5.0.
  T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster
should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster

  T0: Set up a three-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade two members from v3.6.0 to v3.5.0.
  T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain
at v3.5.0 instead of v3.6.0.

Key Takeaways

  * The binary version alone should not determine the cluster version.
  * The caller must explicitly track and manage the expected version.

2. Manually cancel the downgrade process after a successful downgrade in the
MemberDowngradeUpdate failpoint

As demonstrated in Example 1, canceling the downgrade process can reduce flakiness.

3. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage
version monitor detects a mismatch between the cluster and storage versions.

If traffic is insufficient to trigger a commit or if an auto-commit occurs right
after reading the storage version, the monitor may fail to update it, leading
to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 20, 2025
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the
target version. However, this behavior is unpredictable and may cause unexpected
errors.

Example 1: Single-Member Cluster

  T0: Set up a single-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade the member from v3.6.0 to v3.5.0.
  T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster
should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster

  T0: Set up a three-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade two members from v3.6.0 to v3.5.0.
  T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain
at v3.5.0 instead of v3.6.0.

Key Takeaways

  * The binary version alone should not determine the cluster version.
  * The caller must explicitly track and manage the expected version.

2. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage
version monitor detects a mismatch between the cluster and storage versions.

If traffic is insufficient to trigger a commit or if an auto-commit occurs right
after reading the storage version, the monitor may fail to update it, leading
to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 21, 2025
1. Update DowngradeUpgradeMembersByID

If it's a downgrade, the desired cluster version should be the target version.
If it's an upgrade, the desired cluster version should be determined by the
minimum binary version across the members.
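
For illustration only, here is a minimal sketch of how a caller might compute that desired version; the helper `expectedClusterVersion` is hypothetical and not part of the etcd test framework.

```go
package main

import (
	"fmt"

	"github.com/coreos/go-semver/semver"
)

// expectedClusterVersion sketches the rule above: during a downgrade the
// desired cluster version is the downgrade target; during an upgrade it is
// the minimum binary version across members, because the cluster version
// can only advance once every member runs the newer binary.
func expectedClusterVersion(downgrading bool, target semver.Version, binaryVersions []semver.Version) semver.Version {
	if downgrading {
		return target
	}
	minVer := binaryVersions[0]
	for _, v := range binaryVersions[1:] {
		if v.LessThan(minVer) {
			minVer = v
		}
	}
	return minVer
}

func main() {
	v35 := semver.Version{Major: 3, Minor: 5}
	v36 := semver.Version{Major: 3, Minor: 6}
	// Two members upgraded back to v3.6.0 while one is still on v3.5.0:
	// the cluster is expected to stay at 3.5.0.
	fmt.Println(expectedClusterVersion(false, v35, []semver.Version{v36, v36, v35}))
}
```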

2. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage
version monitor detects a mismatch between the cluster and storage versions.

If traffic is insufficient to trigger a commit or if an auto-commit occurs right
after reading the storage version, the monitor may fail to update it, leading
to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
k8s-infra-cherrypick-robot pushed a commit to k8s-infra-cherrypick-robot/etcd that referenced this pull request Feb 21, 2025
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 23, 2025