Performance regression in etcd v3.5.16 #19406
The cherry-picked #15138 introduced this change as an optimization 🤔 Seems safe to revert it.
@ahrtr @serathius Did you hear anything from the community about this type of performance regression?
The occurrence of the issue depends on the number of key-values each batch would compact, disk performance, and the load on the cluster. @miancheng7 Probably we can demonstrate this by using the benchmark tool to describe the load parameters. Reproducing it with the etcd standalone test framework can be done with a sleep failpoint during each batch processing here, together with some write load.
I'm pretty familiar with the clusterloader. Can you provide a link to perf-dash showing the regression? I would like to understand what led to the issue being discovered so late. I know that AWS runs scalability tests with much lower IOPs (<16k) than the default K8s 5k-node scalability test (60k). This could mean that using just the default scalability test might make us blind to this type of regression.
Thanks for the catch!
It's easy to understand; the only reason is the long delay in K8s bumping its etcd version.
I think the 2024 Christmas holiday also contributed to this delay.
Provide a simple summary for users (especially those who do not know much about the internal mechanism of etcdserver) to reference, and to avoid explaining this to different people repeatedly.

What's the issue
etcd 3.5.16 introduced a performance regression (in PR #18514).

What's the symptom
Requests may experience longer delays when etcd happens to be doing compaction in high-traffic scenarios. K8s performs etcd compaction every 5 minutes, so you may see higher latency every 5 minutes in your K8s cluster.

What's the root cause
etcd's compaction (FYI, also defragmentation) is a stop-the-world operation. While etcd is doing compaction, all requests are blocked. So previously (<= 3.5.15), etcd slept 10ms after finishing each batch of compaction (1000 revisions by default). The golang implementation is below.
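A minimal, self-contained sketch of that logic (not the verbatim etcd source; `compactOneBatch` and `stopc` are hypothetical stand-ins for etcd's internals):

```go
// Sketch of the <= v3.5.15 pacing: every batch is followed by a fresh 10ms
// timer, so the pause between batches is always honored.
package main

import (
	"fmt"
	"time"
)

// compactOneBatch stands in for deleting one batch (1000 revisions by default)
// of compacted keys; it is a hypothetical placeholder, not an etcd API.
func compactOneBatch(i int) (done bool) {
	time.Sleep(25 * time.Millisecond) // pretend each batch takes longer than 10ms
	return i == 4                     // finish after 5 batches
}

func main() {
	stopc := make(chan struct{}) // closed when the server shuts down
	start := time.Now()
	for i := 0; ; i++ {
		if compactOneBatch(i) {
			break
		}
		select {
		case <-time.After(10 * time.Millisecond): // a brand-new timer per batch: full 10ms pause guaranteed
		case <-stopc:
			return
		}
	}
	fmt.Println("compaction took", time.Since(start)) // ~5x25ms of work + 4x10ms of pauses
}
```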
But since etcd v3.5.16, we backported the feature of letting users set the compaction sleep interval. The feature itself is good, but there is a bug (sorry for not identifying it in the first place) in the implementation (see below). The reason is that golang's ticker will adjust the time interval or drop ticks to make up for slow receivers. So there is some uncertainty about how long the sleep is, but it will definitely be less than 10ms, even far less or no sleep at all in high-traffic scenarios. Accordingly, it may stop the world for a long time, depending on how much history (revisions) there is to compact.
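Again as a hedged sketch rather than the verbatim backported code (same hypothetical `compactOneBatch` and `stopc` as above), the ticker-based variant looks roughly like this:

```go
// Sketch of the v3.5.16 pacing: a single ticker, created before the loop,
// keeps ticking every 10ms on a fixed schedule even while a batch is running.
// If a batch takes longer than 10ms, a tick is already buffered in ticker.C,
// so the "sleep" returns immediately and the next batch starts with no pause.
package main

import (
	"fmt"
	"time"
)

// compactOneBatch is the same hypothetical placeholder as in the previous sketch.
func compactOneBatch(i int) (done bool) {
	time.Sleep(25 * time.Millisecond)
	return i == 4
}

func main() {
	stopc := make(chan struct{})
	ticker := time.NewTicker(10 * time.Millisecond) // 10ms default; configurable since the backport
	defer ticker.Stop()
	start := time.Now()
	for i := 0; ; i++ {
		if compactOneBatch(i) {
			break
		}
		select {
		case <-ticker.C: // a stale tick is usually already pending -> effectively no sleep between batches
		case <-stopc:
			return
		}
	}
	fmt.Println("compaction took", time.Since(start)) // ~5x25ms, batches run back to back
}
```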
Is it a big problem?
It's a performance regression, so from that perspective it is a big problem; we take any regression seriously. However, it isn't a problem if you only have very low traffic in your environment, and it isn't a big problem if the compaction interval is large (i.e. > 10 hours, or even days).

What etcd version will fix the issue?
v3.5.19
Latency of adoption might be improved, but K8s has still been running 3.5.16 since September and we haven't seen the regression. The problem is that the default K8s scale tests overprovision disk IOPs by 5-10x, so we might not catch this type of regression.
Interestingly, the default tests that run on GKE use 200GB pd-ssd disks. Those should provide 30 read and write IOPs per GB and 0.48 MiBps throughput per GB, so the disk performance in the default tests should be around 6k IOPs and 96 MiBps throughput. kubernetes/kops#17233 was the fix used to address kubernetes/kubernetes#129593; it increased the IOPs to 16k and the throughput to 1 GiBps. I don't know if we can compare GCE IOPs to AWS, but if so, the previous numbers were far below expected.
Looking at https://github.com/kubernetes/kops/blob/4a42841114f96603b2cd8fa15fb8bf5d3b71f8b0/pkg/model/master_volumes.go#L46-L54 it seems that AWS was running scalability tests on 3k IOPs and 100 MiBps throughput gp3 disks. So it seems we needed around 3-6k IOPs to reproduce the issue.
Based on my current knowledge, the ticker should be reset before use, just like a timer. Good catch.
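For illustration only (reusing the hypothetical `compactOneBatch`, `stopc`, and `ticker` from the sketches above, and not necessarily the fix that was actually merged), resetting the ticker after each batch and discarding any stale tick would restore a full pause per batch:

```go
// Illustrative fragment, not the actual etcd fix: reset the ticker once the
// batch is done so the next tick arrives a full interval from now, and throw
// away any tick that was buffered while the batch was running.
for i := 0; ; i++ {
	if compactOneBatch(i) {
		break
	}
	ticker.Reset(10 * time.Millisecond) // restart the period from "now"
	select {
	case <-ticker.C: // drop a tick that fired during the batch, if any
	default:
	}
	select {
	case <-ticker.C: // this receive now really waits ~10ms
	case <-stopc:
		return
	}
}
```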
@serathius +1, I thought so too ^^^. The fact that IOPS fixed the AWS load test issue made me think that. I was asking @miancheng7 the same thing internally when this issue was being reproduced and root-caused last week.
I would consider proposing that the default K8s tests lower IOPs to catch such things. cc @mborsz
Thanks @chaochn47 for the guidance. I will give it a try and submit a PR to add a test case for this.
|
Thanks for bringing this up, @hakuna-matatah. Agreed, the fact that the upstream test consistently passed after increasing IOPS is a strong indicator that IOPS plays a crucial role here. We need to take a deeper look to gain a better understanding.
Bug report criteria
What happened?
The k8s scale test has failed consistently since we upgraded the etcd version from v3.5.15 to v3.5.16. We observed that all etcd requests experience a 5-second latency spike every 5 minutes, which coincides with the compaction timeline, causing the scale test to breach the upstream SLO in the clusterloader2 SLO measurement.
Here is an example from a test cluster. Note: the metric in the diagram is the maximum of `grpc_server_handling_seconds`, and the vertical annotations mark the times when compaction happened.



Leveraging git-bisect and the scale test, we traced the issue back to a change in compaction here, introduced by commit 02635. The theory is that `ticker.C` acts as a fixed-schedule timer that fires every 10ms regardless of how long each batch of compaction takes. The change removes the guaranteed 10ms wait between compaction batches: if a previous batch takes longer than 10ms, the next batch starts immediately, making compaction a blocking operation for etcd. Previously, `time.After(10ms)` ensured a 10ms delay between batches, allowing other read and write operations to proceed smoothly.

To mitigate the issue, we made a change in PR-19405 to revert to the previous behavior, and all Kubernetes scale tests passed consistently.
Let me know if you need any more details.
cc @chaochn47 @hakuna-matatah
What did you expect to happen?
Compaction should not block R/W requests. We should roll back the compaction behavior to what it was before. Here is a PR: PR-19405
How can we reproduce it (as minimally and precisely as possible)?
Executing the 5k-node clusterloader2 test against etcd v3.5.16 can reproduce the issue. I can create an e2e test to reproduce the issue.
Anything else we need to know?
No response
Etcd version (please run commands below)
etcd v3.5.16; we suspect that all etcd v3.5.16+ releases are impacted.
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output