Data Loss During Load Testing with METADATA Enabled and Autoscale Flink #12738
Comments
Does the job work well without auto-scale? What is the state of the pipeline after the checkpoint fails? Does the writer still handle inputs?
Yes, the job works both with and without auto-scale if we don't enable MDT.
Are there any special logs in the JM logging?
I haven't seen any special log for this. Usually the checkpoint fails either when autoscaling spins up a new TM or when I kill a TM manually, and the data is discarded after that. Thanks
Are you saying the pipeline just hangs there and does nothing?
No, it simply moves on to the next checkpoint and processes nothing. If you observe the 4th checkpoint, it is currently processing millions of records (in progress, as it has not completed) and taking approximately 4 minutes. However, once it fails, the job moves on to the next checkpoint and processes nothing, completing in milliseconds (the same goes for checkpoints 2 and 3). Thanks
@danny0405 any findings?
No, without more detailed logs I cannot help much, and I have no knowledge of Flink auto-scaling.
@maheshguptags Do you mean that records ingested after the checkpoint failure will be lost? If that happens, there are usually two types of problems:
By the way, could you also paste the exception trace from the Flink dashboard?
Hi
@cshuo Let's assume 10 million records are ingested into the job. While processing these records, if the Job Manager triggers the creation of a new Task Manager (TM) due to auto-scaling, or if a TM is manually removed (to test the scenario without auto-scaling), a checkpoint failure can occur, causing all the previously ingested data (the 10 million records) to be discarded. If new data (e.g., 1 million records) is ingested after the checkpoint failure, the new data is processed and ingested into Hudi successfully, provided the next checkpoint succeeds. To summarize: ingest 10M records → checkpoint failure (due to TM change) → all data discarded. Thanks
So these records do not even go through a complete checkpoint lifecycle, and no commits occur.
I used an example to illustrate the issue. Example:
chkpnt1 → succeeded → ingested 2.5M (out of 10M)
Attempts the next checkpoint
chkpnt4 → succeeded → no data is written due to the failure at chkpnt3, and the checkpoint completes within milliseconds.
Thanks
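(Aside for readers following the thread: below is a minimal, hypothetical sketch of the commit-on-checkpoint pattern that exactly-once Flink writers generally follow, written against the plain Flink API rather than Hudi's actual sink. It illustrates why data buffered during a failed checkpoint interval never becomes visible: the commit only happens in notifyCheckpointComplete, which never fires for a failed checkpoint.)

```java
// Illustrative sketch only -- NOT Hudi's writer code. Shows the generic
// commit-on-checkpoint pattern used by exactly-once Flink sinks.
import org.apache.flink.api.common.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.util.ArrayList;
import java.util.List;

public class CommitOnCheckpointSink implements SinkFunction<String>,
        CheckpointedFunction, CheckpointListener {

    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value); // records accumulate between checkpoints
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) {
        // Flush the buffer as an *uncommitted* write tagged with ctx.getCheckpointId().
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) {
        // Restore any pending, uncommitted writes from state on recovery.
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Only here does the write become visible (committed). If the checkpoint
        // fails, this callback never fires and nothing is committed for the interval.
    }
}
```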
@danny0405 @cshuo, any progress on this?
You can try it without auto-scaling by forcefully deleting the TM while checkpointing is in progress (causing the checkpoint to fail) with MDT enabled. We can also connect over a call. Thanks
Issue
While performing load testing with METADATA enabled, I encountered a data loss issue. The issue occurs when deploying the job with Autoscale enabled. Specifically, if checkpointing fails for reasons such as TM additions or memory/heap issues, all data is discarded, and no further data is processed after that failure.
Checkpointing failures lead to data loss.
After a failed checkpoint due to lack of resources, a new checkpoint is triggered but no data is processed.
I tried to replicate this behavior on Hudi 1.0, and the same issue persists.
Hudi Properties
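(The original property list was not captured in this thread. As a stand-in, here is a hypothetical sketch of the Hudi Flink sink options the scenario implies, a COPY_ON_WRITE table on S3 with the metadata table enabled; the path is a placeholder and the exact keys should be checked against the Hudi Flink configuration docs.)

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only -- not the reporter's actual property file.
public class HudiOptionsSketch {
    public static Map<String, String> options() {
        Map<String, String> opts = new HashMap<>();
        opts.put("connector", "hudi");
        opts.put("path", "s3://<bucket>/<table-path>"); // placeholder location
        opts.put("table.type", "COPY_ON_WRITE");        // matches the environment below
        opts.put("metadata.enabled", "true");           // MDT on, which triggers the issue
        return opts;
    }
}
```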
Steps to reproduce the behavior:
Expected behavior
After checkpoint failure due to resource issues, the system should continue processing data once resources are available, without losing previously processed data.
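(A brief, hedged sketch of the Flink-side settings that govern what happens after a checkpoint fails; the values are illustrative and this does not by itself explain or fix the loss of already-ingested data, but it controls whether the job tolerates a failed checkpoint or restarts.)

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative values only.
public class CheckpointTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every 60s with exactly-once semantics.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Tolerate a few consecutive checkpoint failures (e.g. while a TM is replaced)
        // before the job is restarted from the last successful checkpoint.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
```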
Environment Description
Hudi version : 1.0.0
Flink version: 1.18
Hive version : NO
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : Yes
Table Type: COPY_ON_WRITE
Additional context
Can the Hudi team assist with troubleshooting this issue? Is this expected behavior with METADATA enabled, or is there a bug in the Flink writer under resource-constrained scenarios?
cc: @codope @bhasudha @danny0405 @xushiyan @ad1happy2go @yihua