Performance regression in etcd v3.5.16 #19406

Open
4 tasks done
miancheng7 opened this issue Feb 12, 2025 · 18 comments · Fixed by #19410

Comments

@miancheng7

miancheng7 commented Feb 12, 2025

Bug report criteria

What happened?

The k8s scale test has failed consistently since we upgraded etcd from v3.5.15 to v3.5.16. We observed that all etcd requests experience a 5-second latency spike every 5 minutes, which coincides with the compaction timeline, causing the scale test to breach the upstream SLO in the clusterloader2 SLO measurement.

Here is an example from a test cluster. Note: the metric in the diagram is the maximum of grpc_server_handling_seconds and the vertical annotations are the time when compaction happened.
[Image: maximum of grpc_server_handling_seconds over time, with vertical annotations at compaction times]

Leveraging git-bisect and the scale test, we traced the issue back to a change in compaction here, introduced by commit 02635. The theory is that ticker.C acts as a fixed timer that triggers every 10ms, regardless of how long each batch of compaction takes. The change removes the guaranteed 10ms wait between compaction batches: if a previous batch takes longer than 10ms, the next batch starts immediately, making compaction a blocking operation for etcd. Previously, time.After(10ms) ensured a 10ms delay between batches, allowing other read and write operations to proceed smoothly.
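To make the theory concrete, here is a minimal, self-contained Go sketch (not the actual etcd code; compactOneBatch, the batch count, and the 25ms batch duration are made up for illustration) showing why a free-running ticker loses the pause once a batch overruns the interval, while time.After always yields the full pause:

```go
package main

import (
	"fmt"
	"time"
)

// compactOneBatch stands in for one batch of compaction work that holds the
// lock longer than the pause interval (25ms of work vs a 10ms pause).
func compactOneBatch() {
	time.Sleep(25 * time.Millisecond)
}

func main() {
	const batches = 5
	interval := 10 * time.Millisecond

	// <= v3.5.15 pattern: time.After starts counting only after each batch,
	// so every batch is followed by a full 10ms pause.
	start := time.Now()
	for i := 0; i < batches; i++ {
		compactOneBatch()
		<-time.After(interval)
	}
	fmt.Println("time.After total:", time.Since(start)) // ~5 * (25ms + 10ms)

	// v3.5.16 pattern: the ticker keeps firing on the wall clock, so a tick is
	// already buffered when a slow batch finishes and the "pause" returns
	// immediately.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	start = time.Now()
	for i := 0; i < batches; i++ {
		compactOneBatch()
		<-ticker.C
	}
	fmt.Println("ticker.C total:", time.Since(start)) // ~5 * 25ms, no real pause
}
```

With a batch that takes longer than the interval, the ticker loop spends essentially no time paused, which is why compaction effectively becomes one long blocking operation.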

To mitigate the issue, we made a change in PR-19405 to revert to the previous behavior, and all Kubernetes scale tests passed consistently.

Let me know if you need any more details.

cc @chaochn47 @hakuna-matatah

What did you expect to happen?

Compaction should not block R/W requests. We should roll back the compaction behavior to what it was before; here is a PR: PR-19405.

How can we reproduce it (as minimally and precisely as possible)?

Executing the 5k-node clusterloader2 test against etcd v3.5.16 reproduces the issue. I can also create an e2e test to reproduce it.

Anything else we need to know?

No response

Etcd version (please run commands below)

etcd v3.5.16; we suspect that all etcd v3.5.16+ releases are impacted.

Etcd configuration (command line flags or environment variables)

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

@jqmichael

CC: @AlexStocks @JalinWang @chaochn47

@JalinWang
Contributor

JalinWang commented Feb 13, 2025

The cherry-picked #15138 introduced this change as an optimization 🤔 It seems safe to revert it, and time.After() sounds more reasonable.
cc @guozhao-coder if you see this

@chaochn47
Member

@ahrtr @serathius Did you hear anything from the community about this type of performance regression?

@chaochn47
Member

chaochn47 commented Feb 13, 2025

The occurrence of the issue depends on the number of key-values each batch compacts, disk performance, and the load on the cluster. @miancheng7 We can probably demonstrate this by using the benchmark tool to describe the load parameters.

Reproducing this with the etcd standalone test framework can be done with a sleep failpoint during each batch of processing here, combined with some write load.
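If it helps, here is a rough client-side sketch of that idea (this is not the standalone test framework; the endpoint, key names, revision count, and latency threshold are all assumptions for illustration): build up a lot of history, keep a write probe running, then compact and watch the probe's latency.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Build up history: many revisions so compaction has many batches to walk.
	var rev int64
	for i := 0; i < 100000; i++ {
		resp, err := cli.Put(ctx, fmt.Sprintf("/load/%d", i%1000), "v")
		if err != nil {
			panic(err)
		}
		rev = resp.Header.Revision
	}

	// Keep a write probe running and report any unusually slow Put.
	go func() {
		for {
			start := time.Now()
			if _, err := cli.Put(ctx, "/probe", "v"); err != nil {
				return
			}
			if d := time.Since(start); d > 100*time.Millisecond {
				fmt.Println("slow Put during compaction:", d)
			}
		}
	}()

	// Compact the whole history; on v3.5.16 the probe latency spikes while
	// the compaction batches run back to back.
	if _, err := cli.Compact(ctx, rev, clientv3.WithCompactPhysical()); err != nil {
		panic(err)
	}
	time.Sleep(2 * time.Second)
}
```

A failpoint-based e2e test in the etcd repo would of course be the more reliable way to exercise this deterministically.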

@serathius
Member

serathius commented Feb 13, 2025

which coincides with the compaction timeline, causing the scale test to breach the upstream SLO in the clusterloader2 SLO measurement.

I'm pretty familiar with the clusterloader. Can you provide a link to perf-dash showing the regression?
Is this related to kubernetes/kubernetes#129593?

I would like to understand what led to the issue being discovered so late. I know that AWS runs scalability tests with much lower IOPS (<16k) than the default K8s 5k scalability test (60k). This could mean that relying only on the default scalability tests might leave us blind to this type of regression.

@ahrtr
Member

ahrtr commented Feb 13, 2025

The theory is that ticker.C acts as a fixed timer that triggers every 10ms, regardless of how long each batch of compaction takes.

Thanks for the catch!

@ahrtr
Member

ahrtr commented Feb 13, 2025

I would like to understand what led to issue being discovered so late

It's easy to understand: the only reason is the long delay in K8s bumping the etcd version.

@ahrtr
Member

ahrtr commented Feb 13, 2025

It's easy to understand: the only reason is the long delay in K8s bumping the etcd version.

I think the 2024 Christmas holiday also contributed to this delay.

@ahrtr
Member

ahrtr commented Feb 13, 2025

Here is a simple summary for users (especially those who are not familiar with the internal mechanisms of etcdserver) to reference, and to avoid explaining this to different people repeatedly.

What's the issue

etcd 3.5.16 introduced a performance regression (in PR #18514)

What's the symptom

Requests may experience longer delays when etcd happens to be doing compaction in high-traffic scenarios. K8s performs etcd compaction every 5 minutes, so you may see higher latency every 5 minutes in your K8s cluster.

What's the root cause

etcd's compaction (and, FYI, also defragmentation) is a stop-the-world operation. When etcd is doing compaction, all requests are blocked.

So previously (<= 3.5.15), etcd slept 10ms after finishing each batch of compaction (1000 revisions by default). The Go implementation is below:

		select {
		case <-time.After(10 * time.Millisecond):
		case <-s.stopc:
			return KeyValueHash{}, fmt.Errorf("interrupted due to stop signal")
		}

But since etcd v3.5.16, we backported the feature of letting users set the compaction sleep interval. The feature itself is good, but there is a bug (sorry for not identifying it in the first place) in the implementation (see below). The reason is that Go's ticker adjusts the time interval or drops ticks to make up for slow receivers. So there is some uncertainty about how long the sleep actually is, but it will definitely be less than 10ms, possibly far less or no sleep at all in high-traffic scenarios. Accordingly, compaction may stop the world for a long time, depending on how much history (revisions) there is to compact.

		select {
		case <-batchTicker.C:
		case <-s.stopc:
			return KeyValueHash{}, fmt.Errorf("interrupted due to stop signal")
		}
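For reference, one way to keep a configurable sleep interval while still guaranteeing a full pause after every batch is to start a fresh timer only once the batch has finished. This is just a sketch of the idea in the style of the fragments above (compactionSleepInterval is a placeholder name), not necessarily the exact code of the fix in #19410:

```go
	for /* more revisions to compact */ {
		// ... compact one batch of revisions while holding the lock ...

		// The timer starts only after the batch has finished, so the full
		// sleep interval is honored regardless of how long the batch took.
		timer := time.NewTimer(compactionSleepInterval)
		select {
		case <-timer.C:
		case <-s.stopc:
			timer.Stop()
			return KeyValueHash{}, fmt.Errorf("interrupted due to stop signal")
		}
	}
```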

Is it a big problem?

It's a performance regression, so from that perspective it's a big problem. We take any regression seriously.

However, it isn't a problem if you only have very low traffic in your environment, and it's not a big problem if the compaction interval is large (i.e. > 10 hours, or even days).

What etcd version will fix the issue?

v3.5.19

ahrtr reopened this Feb 13, 2025
@serathius
Member

I would like to understand what led to the issue being discovered so late

It's easy to understand: the only reason is the long delay in K8s bumping the etcd version.

Latency of adoption might be improved, but K8s has been running 3.5.16 since September and we haven't seen the regression. The problem is that the default K8s scale tests overprovision disk IOPS by 5-10x, so we might not catch this type of regression.
kubernetes/kubernetes#127285

@serathius
Member

serathius commented Feb 13, 2025

Interestingly, the default tests that run on GKE use 200GB pd-ssd disks.

Those should provide 30 read/write IOPS per GB and 0.48 MiBps throughput per GB, so the disks in the default tests should deliver around 6k IOPS (200 GB × 30) and 96 MiBps throughput (200 GB × 0.48).

kubernetes/kops#17233 was the fix used to address kubernetes/kubernetes#129593. It increased IOPS to 16k and throughput to 1 GiBps. I don't know whether GCE IOPS are directly comparable to AWS, but if so, the previous numbers were far below what was expected.

@serathius
Member

Looking at https://github.com/kubernetes/kops/blob/4a42841114f96603b2cd8fa15fb8bf5d3b71f8b0/pkg/model/master_volumes.go#L46-L54 it seems that AWS was running scalability tests on 3k IOPs and 100 MiBps throughput gp3 disk.

So it seems we needed around 3-6k IOPs to reproduce issue.

@fuweid
Member

fuweid commented Feb 13, 2025

Based on my current knowledge, the ticker should be reset before use, just like a timer. Good catch!

@hakuna-matatah

hakuna-matatah commented Feb 13, 2025

So it seems we needed around 3-6k IOPS to reproduce the issue.

@serathius +1. I thought so too ^^^. The fact that raising IOPS fixed the AWS load test issue made me think that.

I was asking @miancheng7 the same thing internally when this issue was being reproduced and root-caused last week.
Correct me if I'm wrong, @miancheng7, but I remember you mentioning that we were not breaching any IOPS/throughput limits and still ran into this?

@serathius
Member

I would consider proposing that the default K8s tests lower IOPS to catch such things. cc @mborsz

@miancheng7
Author

The occurrence of the issue depends on the number of key-values each batch compacts, disk performance, and the load on the cluster. @miancheng7 We can probably demonstrate this by using the benchmark tool to describe the load parameters.

Reproducing this with the etcd standalone test framework can be done with a sleep failpoint during each batch of processing here, combined with some write load.

Thanks @chaochn47 for the guidance. I will give it a try and submit a PR to add a test case for this.

@fuweid
Member

fuweid commented Feb 13, 2025

Would you please update the changelog for the v3.5 release? Thanks.

@miancheng7
Author

miancheng7 commented Feb 13, 2025

I was asking @miancheng7 the same thing internally when this issue was being reproduced and root-caused last week. Correct me if I'm wrong, @miancheng7, but I remember you mentioning that we were not breaching any IOPS/throughput limits and still ran into this?

Thanks for bringing this up, @hakuna-matatah. Agreed—the fact that the upstream test consistently passed after increasing IOPS is a strong indicator that IOPS plays a crucial role here. We need to take a deeper look to gain a better understanding.
