deflaky: TestSnapshotStatus #19313
Conversation
Root Cause

By default, the backend commits pending changes into boltdb every 100 ms. When a brand-new member starts, it saves its member information into the backend via `MustSaveMemberToBackend`. However, that call does not force the change into boltdb; it relies on the backend's auto-commit every 100 ms. If the auto-commit happens after `UpdateStorageVersion`, then `UpdateStorageVersion` fails with `missing confstate information` (the confstate information is written by the commit pre-hook). Once `UpdateStorageVersion` fails, `TestSnapshotStatus` has no further chance to update the storage version because the server has already been closed. Without the storage version information, the hash of the boltdb data does not match the expected value.

How to reproduce it?

```diff
diff --git a/etcdutl/snapshot/v3_snapshot_test.go b/etcdutl/snapshot/v3_snapshot_test.go
index c2b3d5202..cd88ff995 100644
--- a/etcdutl/snapshot/v3_snapshot_test.go
+++ b/etcdutl/snapshot/v3_snapshot_test.go
@@ -141,6 +141,7 @@ func createDB(t *testing.T, generateContent func(*etcdserver.EtcdServer)) string
 	t.Helper()
 
 	cfg := embed.NewConfig()
+	cfg.BackendBatchInterval = 5 * time.Second
 	cfg.LogLevel = "fatal"
 	cfg.Dir = t.TempDir()
```

```bash
➜ snapshot git:(deflaky-TestSnapshotStatus) ✗ go test -v -run "^TestSnapshotStatus$" -count 1 ./
=== RUN   TestSnapshotStatus
    v3_snapshot_test.go:47:
        	Error Trace:	/home/weifu/workspace/etcd/etcdutl/snapshot/v3_snapshot_test.go:47
        	Error:      	Not equal:
        	            	expected: 0x62132b4d
        	            	actual  : 0x6bf96324
        	Test:       	TestSnapshotStatus
--- FAIL: TestSnapshotStatus (0.14s)
FAIL
FAIL	go.etcd.io/etcd/etcdutl/v3/snapshot	0.154s
FAIL
```

How to fix?

Change `BackendBatchLimit` to `1` so that the change is force-committed during `MustSaveMemberToBackend`.

Signed-off-by: Wei Fu <[email protected]>
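For reference, a minimal sketch of what the fix looks like in the test setup, assuming it is applied next to the existing config code in `createDB` (the helper name below is illustrative; the merged change may differ in detail):

```go
package snapshot_test

import (
	"testing"

	"go.etcd.io/etcd/server/v3/embed"
)

// newTestConfig mirrors the config setup in createDB
// (etcdutl/snapshot/v3_snapshot_test.go) with the fix applied.
func newTestConfig(t *testing.T) *embed.Config {
	t.Helper()

	cfg := embed.NewConfig()
	// Commit every pending backend change immediately instead of waiting for
	// the 100 ms auto-commit, so the member written by MustSaveMemberToBackend
	// is flushed to boltdb before UpdateStorageVersion reads it.
	cfg.BackendBatchLimit = 1
	cfg.LogLevel = "fatal"
	cfg.Dir = t.TempDir()
	return cfg
}
```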
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files: see 37 files with indirect coverage changes.

Coverage Diff:

| | main | #19313 | +/- |
|---|---|---|---|
| Coverage | 68.88% | 68.87% | -0.02% |
| Files | 420 | 420 | |
| Lines | 35689 | 35701 | +12 |
| Hits | 24586 | 24588 | +2 |
| Misses | 9684 | 9691 | +7 |
| Partials | 1419 | 1422 | +3 |

Continue to review the full report in Codecov by Sentry.
The analysis basically makes sense. The reason is
OK, the confState only gets updated when
LGTM
Thanks for the nice catch!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid

The full list of commands accepted by this bot can be found here. The pull request process is described here.
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the target version. However, this behavior is unpredictable and may cause unexpected errors.

Example 1: Single-Member Cluster
- T0: Set up a single-member cluster.
- T1: Enable the downgrade process.
- T2: Downgrade the member from v3.6.0 to v3.5.0.
- T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster
- T0: Set up a three-member cluster.
- T1: Enable the downgrade process.
- T2: Downgrade two members from v3.6.0 to v3.5.0.
- T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain at v3.5.0 instead of v3.6.0.

Key Takeaways
* The binary version alone should not determine the cluster version.
* The caller must explicitly track and manage the expected version.

2. Disable Downgrade Monitor by Default in Robustness

As demonstrated in Example 1, disabling the downgrade monitor reduces test flakiness.

3. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage version monitor detects a mismatch between the cluster and storage versions. If traffic is insufficient to trigger a commit, or if an auto-commit occurs right after reading the storage version, the monitor may fail to update it, leading to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the target version. However, this behavior is unpredictable and may cause unexpected errors.

Example 1: Single-Member Cluster
- T0: Set up a single-member cluster.
- T1: Enable the downgrade process.
- T2: Downgrade the member from v3.6.0 to v3.5.0.
- T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster
- T0: Set up a three-member cluster.
- T1: Enable the downgrade process.
- T2: Downgrade two members from v3.6.0 to v3.5.0.
- T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain at v3.5.0 instead of v3.6.0.

Key Takeaways
* The binary version alone should not determine the cluster version.
* The caller must explicitly track and manage the expected version.

2. Manually cancel the downgrade process after a successful downgrade in the MemberDowngradeUpdate failpoint

As demonstrated in Example 1, canceling the downgrade process can reduce flakiness.

3. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage version monitor detects a mismatch between the cluster and storage versions. If traffic is insufficient to trigger a commit, or if an auto-commit occurs right after reading the storage version, the monitor may fail to update it, leading to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
1. Update DowngradeUpgradeMembersByID Interface

By default, the leader cancels the downgrade process once all members reach the target version. However, this behavior is unpredictable and may cause unexpected errors.

Example 1: Single-Member Cluster
- T0: Set up a single-member cluster.
- T1: Enable the downgrade process.
- T2: Downgrade the member from v3.6.0 to v3.5.0.
- T3: Upgrade the member from v3.5.0 to v3.6.0.

If the leader cancels the downgrade process between T2 and T3, the cluster should end up at v3.6.0; otherwise, it remains at v3.5.0.

Example 2: Three-Member Cluster
- T0: Set up a three-member cluster.
- T1: Enable the downgrade process.
- T2: Downgrade two members from v3.6.0 to v3.5.0.
- T3: Upgrade those two members back to v3.6.0.

Since the downgrade process is incomplete, after T3, the cluster should remain at v3.5.0 instead of v3.6.0.

Key Takeaways
* The binary version alone should not determine the cluster version.
* The caller must explicitly track and manage the expected version.

2. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage version monitor detects a mismatch between the cluster and storage versions. If traffic is insufficient to trigger a commit, or if an auto-commit occurs right after reading the storage version, the monitor may fail to update it, leading to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
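To make the key takeaway above concrete, here is a hedged sketch of what an explicit-expected-version call site could look like. `downgradeUpgradeMembersByID` below is a stand-in stub, not the actual e2e helper or its signature; only the shape matters.

```go
package downgrade_test

import (
	"testing"

	"github.com/coreos/go-semver/semver"
)

// downgradeUpgradeMembersByID is a stand-in for the real e2e helper. The key
// point is that the caller supplies the expected cluster version explicitly
// instead of letting the helper derive it from member binary versions.
func downgradeUpgradeMembersByID(t *testing.T, memberIDs []int, targetBinary, expectedClusterVersion *semver.Version) {
	t.Helper()
	// ... restart the selected members on targetBinary, then assert that the
	// cluster version reported by every member equals expectedClusterVersion ...
}

func TestPartialDowngradeThenUpgrade(t *testing.T) {
	v35, v36 := semver.New("3.5.0"), semver.New("3.6.0")

	// Example 2 above: downgrade two of three members, then upgrade them back.
	// The downgrade is never completed, so the caller expects the cluster
	// version to stay at 3.5.0 in both steps.
	downgradeUpgradeMembersByID(t, []int{0, 1}, v35, v35)
	downgradeUpgradeMembersByID(t, []int{0, 1}, v36, v35)
}
```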
1. Update DowngradeUpgradeMembersByID

If it's a downgrade process, the desired cluster version should be the target one. If it's an upgrade process, the desired cluster version should be determined by the minimum binary version of the members.

2. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage version monitor detects a mismatch between the cluster and storage versions. If traffic is insufficient to trigger a commit, or if an auto-commit occurs right after reading the storage version, the monitor may fail to update it, leading to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>
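A short sketch of the version-derivation rule described above, assuming go-semver for version comparison; the helper name is illustrative, not the actual test-framework API.

```go
package main

import (
	"fmt"

	"github.com/coreos/go-semver/semver"
)

// expectedClusterVersion models the rule above: while downgrading, the
// expected cluster version is the downgrade target; while upgrading, it is
// the minimum binary version across all members.
func expectedClusterVersion(downgrading bool, target semver.Version, memberBinaryVersions []semver.Version) semver.Version {
	if downgrading {
		return target
	}
	minVer := memberBinaryVersions[0]
	for _, v := range memberBinaryVersions[1:] {
		if v.LessThan(minVer) {
			minVer = v
		}
	}
	return minVer
}

func main() {
	v35, v36 := *semver.New("3.5.0"), *semver.New("3.6.0")

	// Downgrading: the cluster should move to the target version.
	fmt.Println(expectedClusterVersion(true, v35, []semver.Version{v36, v36, v36})) // 3.5.0
	// Upgrading: the cluster follows the slowest (minimum) member binary.
	fmt.Println(expectedClusterVersion(false, v36, []semver.Version{v36, v35, v36})) // 3.5.0
}
```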
1. Update DowngradeUpgradeMembersByID

If it's a downgrade process, the desired cluster version should be the target one. If it's an upgrade process, the desired cluster version should be determined by the minimum binary version of the members.

2. Remove AssertProcessLogs from DowngradeEnable

The log message "The server is ready to downgrade" appears only when the storage version monitor detects a mismatch between the cluster and storage versions. If traffic is insufficient to trigger a commit, or if an auto-commit occurs right after reading the storage version, the monitor may fail to update it, leading to errors like:

```bash
"msg":"failed to update storage version","cluster-version":"3.6.0",
"error":"cannot detect storage schema version: missing confstate information"
```

Given this, we should remove the AssertProcessLogs statement.

Similar to etcd-io#19313

Signed-off-by: Wei Fu <[email protected]>

(cherry picked from commit 65159a2)

Signed-off-by: Wei Fu <[email protected]>
Fixes: #19228
cc @ahrtr @serathius
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.