Data Loss During Load Testing with METADATA Enabled and Autoscale Flink #12738

Open

maheshguptags opened this issue Jan 30, 2025 · 13 comments

Labels
data-loss (loss of data only; use the data-consistency label for an inconsistent view)
flink (issues related to Flink)
priority:critical (production down; pipelines stalled; need help asap)

Comments

@maheshguptags

maheshguptags commented Jan 30, 2025

Issue

While performing load testing with the metadata table (METADATA/MDT) enabled, I encountered a data loss issue. The issue occurs when the job is deployed with autoscaling enabled. Specifically, if checkpointing fails, for example because a new TaskManager (TM) is added or because of heap memory issues, all data is discarded and no further data is processed after that failure.

  • Checkpointing failures lead to data loss.
  • After a checkpoint fails due to lack of resources, a new checkpoint is triggered but no data is processed.
  • I tried to replicate this behavior on Hudi 1.0, and the same issue persists.

Hudi Properties

#Updated at 2025-01-20T07:41:05.654545Z
#Mon Jan 20 07:41:05 UTC 2025
hoodie.table.keygenerator.type=COMPLEX_AVRO
hoodie.table.type=COPY_ON_WRITE
hoodie.table.precombine.field=updated_date
hoodie.table.create.schema={}
hoodie.timeline.layout.version=2
hoodie.timeline.history.path=history
hoodie.table.checksum=1292384652
hoodie.datasource.write.drop.partition.columns=false
hoodie.record.merge.strategy.id=00000000-0000-0000-0000-000000000000
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.metadata.partitions.inflight=
hoodie.database.name=default_database
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.record.merge.mode=CUSTOM
hoodie.table.version=8
hoodie.compaction.payload.class=com.gupshup.cdp.PartialUpdate
hoodie.table.initial.version=8
hoodie.table.metadata.partitions=files
hoodie.table.partition.fields=xyz
hoodie.table.cdc.enabled=false
hoodie.archivelog.folder=history
hoodie.table.name=customer_temp
hoodie.table.recordkey.fields=xyz.abc 
hoodie.timeline.path=timeline

Steps to reproduce the behavior:

  1. Create a table with Flink Hudi with the metadata table (MDT) enabled (a minimal DDL sketch is shown after this list)
  2. Ingest some load
  3. Delete one of the TMs, or ingest a heavy enough load to cause a memory issue
  4. Once the checkpoint fails, all the data pending for that checkpoint is discarded
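
A minimal sketch of step 1, assuming hypothetical flat column names (abc as record key, xyz as partition field, updated_date as precombine field) and a placeholder S3 path; it uses the Flink Table API and turns the Hudi metadata table on via 'metadata.enabled', mirroring the COPY_ON_WRITE setup in the properties above:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class CreateHudiTableWithMdt {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical schema: the real record key in the report is xyz.abc (COMPLEX_AVRO keygen).
        tEnv.executeSql(
            "CREATE TABLE customer_temp (" +
            "  abc STRING," +
            "  xyz STRING," +
            "  updated_date TIMESTAMP(3)," +
            "  PRIMARY KEY (abc) NOT ENFORCED" +
            ") PARTITIONED BY (xyz) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 's3a://<bucket>/customer_temp'," +   // placeholder path
            "  'table.type' = 'COPY_ON_WRITE'," +
            "  'precombine.field' = 'updated_date'," +
            "  'metadata.enabled' = 'true'" +                // MDT enabled, as in the report
            ")");
    }
}
```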

Expected behavior

After checkpoint failure due to resource issues, the system should continue processing data once resources are available, without losing previously processed data.
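
Not a fix, but for completeness, here is a sketch of the Flink-side settings (the values are assumptions, not the job's actual config) that govern what happens after a checkpoint fails: how many consecutive failures are tolerated before the job fails over, whether externalized checkpoints are retained, and the restart strategy:

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFailureHandling {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000);                                   // checkpoint every 60s (assumed interval)
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);  // tolerate up to 3 failed checkpoints before failing the job
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);     // keep checkpoints for manual recovery
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        // ... build the Hudi ingestion pipeline here, then call env.execute(...)
    }
}
```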

Environment Description

  • Hudi version : 1.0.0

  • Flink version: 1.18

  • Hive version : NO

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : Yes

  • Table Type: COPY_ON_WRITE

Additional context

Can the Hudi team assist with troubleshooting this issue? Is this expected behavior with the metadata table enabled, or is there a bug in the Flink writer under resource-constrained scenarios?

Image

cc: @codope @bhasudha @danny0405 @xushiyan @ad1happy2go @yihua

@danny0405
Contributor

Does the job work well without auto-scaling? What is the state of the pipeline after the checkpoint fails: does the writer still handle inputs?

@maheshguptags
Author

Yes, the job works both with and without auto-scaling if we don't enable MDT.

@danny0405
Contributor

Are there any special logs in the JM logging?

@maheshguptags
Author

I haven't seen any special log for this. Usually the checkpoint fails either because autoscaling spins up a new TM or because I kill a TM manually, and it discards the data after that.

Thanks
Mahesh

@danny0405
Contributor

it discards the data after that.

Are you saying the pipeline just hangs there and does nothing?

@maheshguptags
Author

maheshguptags commented Feb 5, 2025

No, it simply moves on to the next checkpoint and processes nothing. If you look at the 4th checkpoint in the attached screenshot, it is processing millions of records (still in progress, not yet completed) and taking approximately 4 minutes. However, once it fails, the job moves to the next checkpoint and processes nothing, completing in milliseconds (the same goes for checkpoints 2 and 3).

Thanks
Mahesh

Image

@ad1happy2go added the data-loss, flink, and priority:critical labels on Feb 7, 2025
@maheshguptags
Author

@danny0405 any findings?

@danny0405
Contributor

No, without more detailed logs I cannot help further, and I have no knowledge of Flink auto-scaling.

@cshuo
Contributor

cshuo commented Feb 12, 2025

Once the checkpoint fails, all the data pending for that checkpoint is discarded

@maheshguptags do you mean that after the checkpoint fails, the records ingested after that point are lost? If that happens, there are usually two types of problems:

  1. The job is stuck somewhere before the Hudi write operator. Check the metrics of the Hudi write operator in the dashboard, e.g. "records received", to make sure new records keep arriving after the checkpoint failure.
  2. Records are not committed correctly. Check the JobManager logs for exception messages. (Both checks are sketched at the end of this comment.)

Btw, could you also paste the exception trace from the Flink dashboard?
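
A hedged sketch of both checks via Flink's standard REST API (the host/port and the job/vertex ids are placeholders to be taken from the dashboard; "numRecordsIn" is the task-level counterpart of the "records received" metric mentioned above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointAndThroughputCheck {
    private static final String REST = "http://localhost:8081";             // JobManager REST endpoint (assumed)
    private static final String JOB_ID = "<job-id>";                        // placeholder
    private static final String WRITE_VERTEX_ID = "<hudi-write-vertex-id>"; // placeholder

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Checkpoint statistics: completed vs. failed counts after the TM loss.
        System.out.println(get(client, REST + "/jobs/" + JOB_ID + "/checkpoints"));

        // 2. Records flowing into the Hudi write task: this should keep increasing
        //    if the job is not stuck upstream of the writer.
        System.out.println(get(client, REST + "/jobs/" + JOB_ID
                + "/vertices/" + WRITE_VERTEX_ID + "/metrics?get=numRecordsIn"));
    }

    private static String get(HttpClient client, String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```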

@maheshguptags
Author

Hi

do you mean that after the checkpoint fails, the records ingested after that point are lost?

@cshuo Let's assume 10 million records are ingested into the job. While these records are being processed, if the JobManager triggers the creation of a new TaskManager (TM) due to auto-scaling, or if a TM is manually removed (to test the scenario without auto-scaling), a checkpoint failure can occur, causing all the previously ingested data (the 10 million records) to be discarded.

If new data (e.g., 1 million records) is ingested after the checkpoint failure, the new data will be successfully processed and ingested to Hudi, provided the next checkpoint succeeds.

To summarize:

Ingest 10M records → checkpoint failure (due to TM change) → discard all data
Ingest 1M new records → checkpoint success → successfully ingested into Hudi (only 1M).

Thanks
Mahesh

@danny0405
Contributor

Ingest 10M records → checkpoint failure (due to TM change) → discard all data

So these records did not even go through a complete checkpoint lifecycle, and no commits occurred.

@maheshguptags
Author

I used an example to illustrate the issue.
The job successfully ingested and checkpointed 5.5M of the 10M records. However, when the job was interrupted (either manually or due to autoscaling), the remaining 4.5M records were discarded.

Example: ingest 10M records.

chkpnt1 → succeeded → ingested 2.5M (out of 10M)
chkpnt2 → succeeded → ingested 3M (of the remaining 7.5M)
chkpnt3 → failed (either manually or due to autoscaling) → no data written to the Hudi table, and the remaining 4.5M records are discarded after this point

The job then attempts the next checkpoint:

chkpnt4 → succeeded → no data is written due to the failure at chkpnt3, and the checkpoint completes within milliseconds

Thanks
Mahesh

@maheshguptags
Author

@danny0405 @cshuo, any progress on this?
Please let me know if you need further information.

No, without more detailed logs I cannot help further, and I have no knowledge of Flink auto-scaling.

You can reproduce it without auto-scaling by forcefully deleting a TM while a checkpoint is in progress (causing the checkpoint to fail), with MDT enabled.

We can also connect over a call.

Thanks
Mahesh
