[Bug]: Recovery takes too long when an etcd pod fails or the network is partitioned #36393
Comments
/assign chyezh
@chyezh We need to retry on etcd failure to avoid etcd errors and a Milvus panic; otherwise, tuning the timeout to a smaller value could be dangerous. We can probably change it to 3s with 3-5 retries.
Yes, the current default request timeout of etcdKV is 10s, which is too long. Meanwhile, Milvus' KV interface has a poor implementation: it does not accept an RPC context.
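A minimal sketch (not the actual Milvus etcdKV code; the package and function names here are made up) of what a short per-attempt timeout with bounded retries could look like with the Go etcd client, using the 3s / 3-retry values suggested above:

```go
package etcdkvretry // hypothetical package, for illustration only

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const (
	attemptTimeout = 3 * time.Second // per-attempt request timeout instead of one long 10s wait
	maxAttempts    = 3               // bounded retries before surfacing the error to the caller
)

// getWithRetry issues a Get with a short per-attempt timeout and retries on
// failure, so a single blocked etcd member does not stall the caller for the
// full etcdKV timeout. The caller's context is passed through, which also
// addresses the missing-RPC-context problem mentioned above.
func getWithRetry(ctx context.Context, cli *clientv3.Client, key string) (*clientv3.GetResponse, error) {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
		resp, err := cli.Get(attemptCtx, key)
		cancel()
		if err == nil {
			return resp, nil
		}
		lastErr = err
		if ctx.Err() != nil { // stop retrying once the caller's context is done
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}
```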
Another case found.
In etcd 3.5.5, there is no server-side timeout control or redirect while the leader is changing. Summary here:
It seems we need to separate the etcd client timeout and the etcdKV timeout (or operation timeout) ourselves.
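Purely as an illustration (these field names are not Milvus's real configuration keys), the two knobs could be kept apart roughly like this:

```go
package etcdkvconfig // hypothetical package, for illustration only

import "time"

// EtcdClientConfig covers connection-level settings of the etcd client itself.
type EtcdClientConfig struct {
	Endpoints   []string
	DialTimeout time.Duration // how long to wait when establishing the connection
}

// EtcdKVConfig covers per-operation behaviour of the etcdKV layer on top.
type EtcdKVConfig struct {
	RequestTimeout time.Duration // per-attempt timeout applied via context
	MaxRetries     int           // retries before the operation is reported as failed
}
```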
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/reopen
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there an existing issue for this?
Environment
Current Behavior
The etcd client in Milvus accesses the etcd node named etcd-0. The etcd-0 node is unavailable due to a network partition and is stuck in repeated elections, so it cannot apply write operations. The request timeout of the etcd client is too long (9 seconds in the logs), so the process stays blocked on the failing node.
Expected Behavior
A shorter timeout for etcd operations, and retrying the operation against a healthy etcd node.
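A hedged sketch (the endpoint addresses and the key are placeholders, not Milvus's actual configuration) of pointing the client at all etcd members with a short per-call timeout, so a request to the unhealthy etcd-0 fails fast and can be retried against a healthy member:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Give the client every member so its balancer has healthy targets to use.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:        []string{"etcd-0:2379", "etcd-1:2379", "etcd-2:2379"},
		DialTimeout:      5 * time.Second,
		AutoSyncInterval: time.Minute, // periodically refresh the member list
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Short per-operation timeout so a blocked member surfaces an error quickly
	// instead of holding the caller for the full 9-10s seen in the logs.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "example-key"); err != nil {
		log.Printf("etcd Get failed, should be retried against a healthy member: %v", err)
	}
}
```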
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response