Performance regression in etcd v3.5.16 #19406

Open
4 tasks done
miancheng7 opened this issue Feb 12, 2025 · 18 comments · Fixed by #19410

Comments

@miancheng7

miancheng7 commented Feb 12, 2025

Bug report criteria

What happened?

The k8s scale test has failed consistently since we upgraded etcd from v3.5.15 to v3.5.16. We observed that all etcd requests experience a 5-second latency spike every 5 minutes, which coincides with the compaction timeline, causing the scale test to breach the upstream SLO in the clusterloader2 SLO measurement.

Here is an example from a test cluster. Note: the metric in the diagram is the maximum of grpc_server_handling_seconds and the vertical annotations are the time when compaction happened.
[Image: maximum of grpc_server_handling_seconds over time, with vertical annotations at compaction times]

Leveraging git-bisect and the scale test, we traced the issue back to a change in compaction here, introduced by commit 02635. The theory is that ticker.C acts as a fixed timer that triggers every 10ms, regardless of how long each batch of compaction takes. The change removes the guaranteed 10ms wait between compaction batches: if a previous batch takes longer than 10ms, the next batch starts immediately, making compaction a blocking operation for etcd. Previously, time.After(10ms) ensured a 10ms delay between batches, allowing other read and write operations to proceed smoothly.
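To make the theory concrete, here is a minimal, self-contained Go sketch (not the actual etcd code; compactOneBatch, the batch count, and the 25ms batch duration are made up for illustration) showing why a free-running ticker loses the pause once a batch overruns the interval, while time.After always yields the full pause:

```go
package main

import (
	"fmt"
	"time"
)

// compactOneBatch stands in for one batch of compaction work that holds the
// lock longer than the pause interval (25ms of work vs a 10ms pause).
func compactOneBatch() {
	time.Sleep(25 * time.Millisecond)
}

func main() {
	const batches = 5
	interval := 10 * time.Millisecond

	// <= v3.5.15 pattern: time.After starts counting only after each batch,
	// so every batch is followed by a full 10ms pause.
	start := time.Now()
	for i := 0; i < batches; i++ {
		compactOneBatch()
		<-time.After(interval)
	}
	fmt.Println("time.After total:", time.Since(start)) // ~5 * (25ms + 10ms)

	// v3.5.16 pattern: the ticker keeps firing on the wall clock, so a tick is
	// already buffered when a slow batch finishes and the "pause" returns
	// immediately.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	start = time.Now()
	for i := 0; i < batches; i++ {
		compactOneBatch()
		<-ticker.C
	}
	fmt.Println("ticker.C total:", time.Since(start)) // ~5 * 25ms, no real pause
}
```

With a batch that takes longer than the interval, the ticker loop spends essentially no time paused, which is why compaction effectively becomes one long blocking operation.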

To mitigate the issue, we made a change in PR-19405 to revert to the previous behavior, and all Kubernetes scale tests passed consistently.

Let me know if you need any more details.

cc @chaochn47 @hakuna-matatah

What did you expect to happen?

Compaction should not block R/W requests. We should roll back the compaction behavior to what it was before; here is a PR: PR-19405.

How can we reproduce it (as minimally and precisely as possible)?

Executing the 5k-node clusterloader2 test against etcd v3.5.16 reproduces the issue. I can also create an e2e test to reproduce it.

Anything else we need to know?

No response

Etcd version (please run commands below)

etcd v3.5.16; we suspect that all etcd v3.5.16+ releases are impacted.

Etcd configuration (command line flags or environment variables)

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

@jqmichael

CC: @AlexStocks @JalinWang @chaochn47

@JalinWang
Contributor

JalinWang commented Feb 13, 2025

The cherry-picked #15138 introduced this change as an optimization 🤔 It seems safe to revert it, and time.After() sounds more reasonable.
cc @guozhao-coder if you see this

@chaochn47
Member

@ahrtr @serathius Did you hear anything from the community about this type of performance regression?

@chaochn47
Member

chaochn47 commented Feb 13, 2025

The occurrence of the issue depends on the number of key-values each batch compacts, disk performance, and the load on the cluster. @miancheng7 We can probably demonstrate this by using the benchmark tool to describe the load parameters.

Reproducing this with the etcd standalone test framework can be done with a sleep failpoint during each batch of processing here, combined with some write load.
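If it helps, here is a rough client-side sketch of that idea (this is not the standalone test framework; the endpoint, key names, revision count, and latency threshold are all assumptions for illustration): build up a lot of history, keep a write probe running, then compact and watch the probe's latency.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Build up history: many revisions so compaction has many batches to walk.
	var rev int64
	for i := 0; i < 100000; i++ {
		resp, err := cli.Put(ctx, fmt.Sprintf("/load/%d", i%1000), "v")
		if err != nil {
			panic(err)
		}
		rev = resp.Header.Revision
	}

	// Keep a write probe running and report any unusually slow Put.
	go func() {
		for {
			start := time.Now()
			if _, err := cli.Put(ctx, "/probe", "v"); err != nil {
				return
			}
			if d := time.Since(start); d > 100*time.Millisecond {
				fmt.Println("slow Put during compaction:", d)
			}
		}
	}()

	// Compact the whole history; on v3.5.16 the probe latency spikes while
	// the compaction batches run back to back.
	if _, err := cli.Compact(ctx, rev, clientv3.WithCompactPhysical()); err != nil {
		panic(err)
	}
	time.Sleep(2 * time.Second)
}
```

A failpoint-based e2e test in the etcd repo would of course be the more reliable way to exercise this deterministically.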

@serathius
Member

serathius commented Feb 13, 2025

which coincides with the compaction timeline, causing the scale test to breach the upstream SLO in the clusterloader2 SLO measurement.

I'm pretty familiar with the clusterloader. Can you provide a link to perf-dash showing the regression?
Is this related to kubernetes/kubernetes#129593?

I would like to understand what led to the issue being discovered so late. I know that AWS runs scalability tests with much lower IOPS (<16k) than the default K8s 5k scalability test (60k). This could mean that relying only on the default scalability tests might leave us blind to this type of regression.

@ahrtr
Member

ahrtr commented Feb 13, 2025

The theory is that ticker.C acts as a fixed timer that triggers every 10ms, regardless of how long each batch of compaction takes.

Thanks for the catch!

@ahrtr
Member

ahrtr commented Feb 13, 2025

I would like to understand what led to issue being discovered so late

It's easy to understand: the only reason is the long delay in K8s bumping the etcd version.

@ahrtr
Member

ahrtr commented Feb 13, 2025

It's easy to understand: the only reason is the long delay in K8s bumping the etcd version.

I think the 2024 Christmas holiday also contributed to this delay.

@ahrtr
Member

ahrtr commented Feb 13, 2025

Here is a simple summary for users (especially those who are not familiar with the internal mechanisms of etcdserver) to reference, and to avoid explaining this to different people repeatedly.

What's the issue

etcd 3.5.16 introduced a performance regression (in PR #18514)

What's the symptom

Requests may experience longer delays when etcd happens to be doing compaction in high-traffic scenarios. K8s performs etcd compaction every 5 minutes, so you may see higher latency every 5 minutes in your K8s cluster.

What's the root cause

etcd's compaction (and, FYI, also defragmentation) is a stop-the-world operation. When etcd is doing compaction, all requests are blocked.

So previously (<= 3.5.15), etcd slept 10ms after finishing each batch of compaction (1000 revisions by default). The Go implementation is below:

		select {
		case <-time.After(10 * time.Millisecond):
		case <-s.stopc:
			return KeyValueHash{}, fmt.Errorf("interrupted due to stop signal")
		}

But since etcd v3.5.16, we backported the feature of letting users set the compaction sleep interval. The feature itself is good, but there is a bug (sorry for not identifying it in the first place) in the implementation (see below). The reason is that Go's ticker adjusts the time interval or drops ticks to make up for slow receivers. So there is some uncertainty about how long the sleep actually is, but it will definitely be less than 10ms, possibly far less or no sleep at all in high-traffic scenarios. Accordingly, compaction may stop the world for a long time, depending on how much history (revisions) there is to compact.

		select {
		case <-batchTicker.C:
		case <-s.stopc:
			return KeyValueHash{}, fmt.Errorf("interrupted due to stop signal")
		}
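For reference, one way to keep a configurable sleep interval while still guaranteeing a full pause after every batch is to start a fresh timer only once the batch has finished. This is just a sketch of the idea in the style of the fragments above (compactionSleepInterval is a placeholder name), not necessarily the exact code of the fix in #19410:

```go
	for /* more revisions to compact */ {
		// ... compact one batch of revisions while holding the lock ...

		// The timer starts only after the batch has finished, so the full
		// sleep interval is honored regardless of how long the batch took.
		timer := time.NewTimer(compactionSleepInterval)
		select {
		case <-timer.C:
		case <-s.stopc:
			timer.Stop()
			return KeyValueHash{}, fmt.Errorf("interrupted due to stop signal")
		}
	}
```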

Is it a big problem?

It's a performance regression, so from that perspective it's a big problem. We take any regression seriously.

However, it isn't a problem if you only have very low traffic in your environment, and it's not a big problem if the compaction interval is large (i.e. > 10 hours, or even days).

What etcd version will fix the issue?

v3.5.19

ahrtr reopened this Feb 13, 2025
@serathius
Member

I would like to understand what led to the issue being discovered so late

It's easy to understand: the only reason is the long delay in K8s bumping the etcd version.

Latency of adoption might be improved, but K8s has been running 3.5.16 since September and we haven't seen the regression. The problem is that the default K8s scale tests overprovision disk IOPS by 5-10x, so we might not catch this type of regression.
kubernetes/kubernetes#127285

@serathius
Member

serathius commented Feb 13, 2025

Interestingly, the default tests that run on GKE use 200GB pd-ssd disks.

Those should provide 30 read/write IOPS per GB and 0.48 MiBps throughput per GB, so the disks in the default tests should deliver around 6k IOPS (200 GB × 30) and 96 MiBps throughput (200 GB × 0.48).

kubernetes/kops#17233 was the fix used to address kubernetes/kubernetes#129593. It increased IOPS to 16k and throughput to 1 GiBps. I don't know whether GCE IOPS are directly comparable to AWS, but if so, the previous numbers were far below what was expected.

@serathius
Member

Looking at https://github.com/kubernetes/kops/blob/4a42841114f96603b2cd8fa15fb8bf5d3b71f8b0/pkg/model/master_volumes.go#L46-L54 it seems that AWS was running scalability tests on 3k IOPs and 100 MiBps throughput gp3 disk.

So it seems we needed around 3-6k IOPs to reproduce issue.

@fuweid
Member

fuweid commented Feb 13, 2025

Based on my current knowledge, the ticker should be reset before use, just like a timer. Good catch!

@hakuna-matatah

hakuna-matatah commented Feb 13, 2025

So it seems we needed around 3-6k IOPS to reproduce the issue.

@serathius +1. I thought so too ^^^. The fact that raising IOPS fixed the AWS load test issue made me think that.

I was asking @miancheng7 the same thing internally when this issue was being reproduced and root-caused last week.
Correct me if I'm wrong, @miancheng7, but I remember you mentioning that we were not breaching any IOPS/throughput limits and still ran into this?

@serathius
Member

I would consider proposing that the default K8s tests lower IOPS to catch such things. cc @mborsz

@miancheng7
Author

The occurrence of the issue depends on the number of key-values each batch compacts, disk performance, and the load on the cluster. @miancheng7 We can probably demonstrate this by using the benchmark tool to describe the load parameters.

Reproducing this with the etcd standalone test framework can be done with a sleep failpoint during each batch of processing here, combined with some write load.

Thanks @chaochn47 for the guidance. I will give it a try and submit a PR to add a test case for this.

@fuweid
Member

fuweid commented Feb 13, 2025

Would you please update the changelog for the v3.5 release? Thanks.

@miancheng7
Author

miancheng7 commented Feb 13, 2025

I was asking @miancheng7 the same thing internally when this issue was being reproduced and root-caused last week. Correct me if I'm wrong, @miancheng7, but I remember you mentioning that we were not breaching any IOPS/throughput limits and still ran into this?

Thanks for bringing this up, @hakuna-matatah. Agreed—the fact that the upstream test consistently passed after increasing IOPS is a strong indicator that IOPS plays a crucial role here. We need to take a deeper look to gain a better understanding.
