[Bug]: Recovery takes too long when an etcd pod fails or the network is partitioned #36393
Comments
/assign chyezh
@chyezh We need to retry on etcd failure to avoid etcd errors and a Milvus panic; otherwise, tuning the timeout to a smaller value could be dangerous. We can probably change it to 3s with 3-5 retries.
Yes, the current default request timeout of etcdKV is 10s, which is too long. Meanwhile, Milvus' KV interface has a poor implementation: it does not accept an RPC context.
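A minimal sketch (not the actual Milvus etcdKV code; the package and function names here are made up) of what a short per-attempt timeout with bounded retries could look like with the Go etcd client, using the 3s / 3-retry values suggested above:

```go
package etcdkvretry // hypothetical package, for illustration only

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const (
	attemptTimeout = 3 * time.Second // per-attempt request timeout instead of one long 10s wait
	maxAttempts    = 3               // bounded retries before surfacing the error to the caller
)

// getWithRetry issues a Get with a short per-attempt timeout and retries on
// failure, so a single blocked etcd member does not stall the caller for the
// full etcdKV timeout. The caller's context is passed through, which also
// addresses the missing-RPC-context problem mentioned above.
func getWithRetry(ctx context.Context, cli *clientv3.Client, key string) (*clientv3.GetResponse, error) {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
		resp, err := cli.Get(attemptCtx, key)
		cancel()
		if err == nil {
			return resp, nil
		}
		lastErr = err
		if ctx.Err() != nil { // stop retrying once the caller's context is done
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}
```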
Another case found.
In etcd 3.5.5, there is no server-side timeout control or redirect while the leader is changing. Summary here:
It seems we need to separate the etcd client timeout and the etcdKV timeout (or operation timeout) ourselves.
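Purely as an illustration (these field names are not Milvus's real configuration keys), the two knobs could be kept apart roughly like this:

```go
package etcdkvconfig // hypothetical package, for illustration only

import "time"

// EtcdClientConfig covers connection-level settings of the etcd client itself.
type EtcdClientConfig struct {
	Endpoints   []string
	DialTimeout time.Duration // how long to wait when establishing the connection
}

// EtcdKVConfig covers per-operation behaviour of the etcdKV layer on top.
type EtcdKVConfig struct {
	RequestTimeout time.Duration // per-attempt timeout applied via context
	MaxRetries     int           // retries before the operation is reported as failed
}
```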
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/reopen
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there an existing issue for this?
Environment
Current Behavior
The etcd client in Milvus accesses the etcd node named etcd-0. The etcd-0 node is unavailable due to a network partition and is stuck in repeated elections, so it cannot apply write operations. The request timeout of the etcd client is too long (9 seconds in the logs), so the process stays blocked on the failing node.
Expected Behavior
A shorter timeout for etcd operations, and retrying the operation against a healthy etcd node.
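A hedged sketch (the endpoint addresses and the key are placeholders, not Milvus's actual configuration) of pointing the client at all etcd members with a short per-call timeout, so a request to the unhealthy etcd-0 fails fast and can be retried against a healthy member:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Give the client every member so its balancer has healthy targets to use.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:        []string{"etcd-0:2379", "etcd-1:2379", "etcd-2:2379"},
		DialTimeout:      5 * time.Second,
		AutoSyncInterval: time.Minute, // periodically refresh the member list
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Short per-operation timeout so a blocked member surfaces an error quickly
	// instead of holding the caller for the full 9-10s seen in the logs.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "example-key"); err != nil {
		log.Printf("etcd Get failed, should be retried against a healthy member: %v", err)
	}
}
```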
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response