
deflaky: TestSnapshotStatus #19313

Merged: 1 commit merged into etcd-io:main from fuweid:deflaky-TestSnapshotStatus on Jan 31, 2025

Conversation


@fuweid fuweid commented Jan 31, 2025

Root Cause

The backend commits pending changes into boltdb every 100 ms by default. When a brand-new member starts, it saves its member record into the backend via `MustSaveMemberToBackend`. However, that call does not force the change into boltdb; it relies on the backend's 100 ms auto-commit.

If the auto-commit happens only after `UpdateStorageVersion` runs, `UpdateStorageVersion` fails with `missing confstate information` (the confstate is written by the commit pre-hook). Once `UpdateStorageVersion` fails, `TestSnapshotStatus` has no further chance to update it, because the server has already been closed. Without the storage version information, the hash of the boltdb data does not match the expected value.

How to reproduce it?

```diff
diff --git a/etcdutl/snapshot/v3_snapshot_test.go b/etcdutl/snapshot/v3_snapshot_test.go
index c2b3d5202..cd88ff995 100644
--- a/etcdutl/snapshot/v3_snapshot_test.go
+++ b/etcdutl/snapshot/v3_snapshot_test.go
@@ -141,6 +141,7 @@ func createDB(t *testing.T, generateContent func(*etcdserver.EtcdServer)) string
        t.Helper()

        cfg := embed.NewConfig()
+       cfg.BackendBatchInterval = 5 * time.Second
        cfg.LogLevel = "fatal"
        cfg.Dir = t.TempDir()
```

```bash
➜  snapshot git:(deflaky-TestSnapshotStatus) ✗ go test -v -run "^TestSnapshotStatus$" -count 1 ./
=== RUN   TestSnapshotStatus
    v3_snapshot_test.go:47:
                Error Trace:    /home/weifu/workspace/etcd/etcdutl/snapshot/v3_snapshot_test.go:47
                Error:          Not equal:
                                expected: 0x62132b4d
                                actual  : 0x6bf96324
                Test:           TestSnapshotStatus
--- FAIL: TestSnapshotStatus (0.14s)
FAIL
FAIL    go.etcd.io/etcd/etcdutl/v3/snapshot     0.154s
FAIL
```

How to fix?

Set `BackendBatchLimit` to `1` so that the change is force-committed to boltdb during `MustSaveMemberToBackend`.
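As a rough sketch, assuming the change is applied in the test's `createDB` (mirroring the reproduction diff above; the helper name `newTestConfig` is hypothetical and only wraps the relevant lines):

```go
package snapshot

import (
	"testing"

	"go.etcd.io/etcd/server/v3/embed"
)

// newTestConfig is a hypothetical wrapper around the relevant test setup:
// with BackendBatchLimit set to 1, every pending backend write is committed
// to boltdb immediately, so the member record written by
// MustSaveMemberToBackend is persisted before UpdateStorageVersion runs.
func newTestConfig(t *testing.T) *embed.Config {
	cfg := embed.NewConfig()
	cfg.BackendBatchLimit = 1
	cfg.LogLevel = "fatal"
	cfg.Dir = t.TempDir()
	return cfg
}
```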

Fixes: #19228

cc @ahrtr @serathius

codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.87%. Comparing base (2895b19) to head (12bbee7).
Report is 43 commits behind head on main.

Additional details and impacted files

see 37 files with indirect coverage changes

```
@@            Coverage Diff             @@
##             main   #19313      +/-   ##
==========================================
- Coverage   68.88%   68.87%   -0.02%     
==========================================
  Files         420      420              
  Lines       35689    35701      +12     
==========================================
+ Hits        24586    24588       +2     
- Misses       9684     9691       +7     
- Partials     1419     1422       +3     
```



ahrtr commented Jan 31, 2025

> When brand-new member starts, that member will save member into backend by MustSaveMemberToBackend. However, it doesn't force change into boltdb. It depends on backend's auto-commit per 100ms.

The analysis basically makes sense. The reason is that UpdateStorageVersion might fail because the conf state was not persisted in time. But what's the relationship with MustSaveMemberToBackend?


ahrtr commented Jan 31, 2025

> What's the relationship with MustSaveMemberToBackend?

OK, the confState only gets updated when applyConfChange is applied.


@ahrtr ahrtr left a comment


LGTM

Thanks for the nice catch!

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid


@ahrtr ahrtr merged commit b197332 into etcd-io:main Jan 31, 2025
36 checks passed
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 16, 2025
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the
target version. However, this behavior is unpredictable and may cause unexpected
errors.

Example 1: Single-Member Cluster

  T0: Set up a single-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade the member from v3.6.0 to v3.5.0.
  T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster
should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster

  T0: Set up a three-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade two members from v3.6.0 to v3.5.0.
  T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain
at v3.5.0 instead of v3.6.0.

Key Takeaways

  * The binary version alone should not determine the cluster version.
  * The caller must explicitly track and manage the expected version.

2. Disable Downgrade Monitor by Default in Robustness

As demonstrated in Example 1, disabling the downgrade monitor reduces test
flakiness.

3. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage
version monitor detects a mismatch between the cluster and storage versions.

If traffic is insufficient to trigger a commit or if an auto-commit occurs right
after reading the storage version, the monitor may fail to update it, leading
to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
@fuweid fuweid deleted the deflaky-TestSnapshotStatus branch February 16, 2025 17:45
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 17, 2025
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the
target version. However, this behavior is unpredictable and may cause unexpected
errors.

Example 1: Single-Member Cluster

  T0: Set up a single-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade the member from v3.6.0 to v3.5.0.
  T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster
should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster

  T0: Set up a three-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade two members from v3.6.0 to v3.5.0.
  T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain
at v3.5.0 instead of v3.6.0.

Key Takeaways

  * The binary version alone should not determine the cluster version.
  * The caller must explicitly track and manage the expected version.

2. Manually cancel the downgrade process after a successful downgrade in the
MemberDowngradeUpdate failpoint

As demonstrated in Example 1, canceling the downgrade process can reduce flakiness.

3. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage
version monitor detects a mismatch between the cluster and storage versions.

If traffic is insufficient to trigger a commit or if an auto-commit occurs right
after reading the storage version, the monitor may fail to update it, leading
to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 20, 2025
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the
target version. However, this behavior is unpredictable and may cause unexpected
errors.

Example 1: Single-Member Cluster

  T0: Set up a single-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade the member from v3.6.0 to v3.5.0.
  T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster
should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster

  T0: Set up a three-member cluster.
  T1: Enable the downgrade process.
  T2: Downgrade two members from v3.6.0 to v3.5.0.
  T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain
at v3.5.0 instead of v3.6.0.

Key Takeaways

  * The binary version alone should not determine the cluster version.
  * The caller must explicitly track and manage the expected version.

2. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage
version monitor detects a mismatch between the cluster and storage versions.

If traffic is insufficient to trigger a commit or if an auto-commit occurs right
after reading the storage version, the monitor may fail to update it, leading
to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 21, 2025
1. Update DowngradeUpgradeMembersByID

If it's a downgrade, the desired cluster version should be the target version.
If it's an upgrade, the desired cluster version should be determined by the
minimum binary version across the members.
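
For illustration only, here is a minimal sketch of how a caller might compute that desired version; the helper `expectedClusterVersion` is hypothetical and not part of the etcd test framework.

```go
package main

import (
	"fmt"

	"github.com/coreos/go-semver/semver"
)

// expectedClusterVersion sketches the rule above: during a downgrade the
// desired cluster version is the downgrade target; during an upgrade it is
// the minimum binary version across members, because the cluster version
// can only advance once every member runs the newer binary.
func expectedClusterVersion(downgrading bool, target semver.Version, binaryVersions []semver.Version) semver.Version {
	if downgrading {
		return target
	}
	minVer := binaryVersions[0]
	for _, v := range binaryVersions[1:] {
		if v.LessThan(minVer) {
			minVer = v
		}
	}
	return minVer
}

func main() {
	v35 := semver.Version{Major: 3, Minor: 5}
	v36 := semver.Version{Major: 3, Minor: 6}
	// Two members upgraded back to v3.6.0 while one is still on v3.5.0:
	// the cluster is expected to stay at 3.5.0.
	fmt.Println(expectedClusterVersion(false, v35, []semver.Version{v36, v36, v35}))
}
```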

2. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage
version monitor detects a mismatch between the cluster and storage versions.

If traffic is insufficient to trigger a commit or if an auto-commit occurs right
after reading the storage version, the monitor may fail to update it, leading
to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
k8s-infra-cherrypick-robot pushed a commit to k8s-infra-cherrypick-robot/etcd that referenced this pull request Feb 21, 2025
fuweid added a commit to fuweid/etcd that referenced this pull request Feb 23, 2025