Data Loss During Load Testing with METADATA Enabled and Autoscale Flink #12738

Open

maheshguptags opened this issue Jan 30, 2025 · 13 comments

Labels
data-loss (loss of data only; use the data-consistency label for an inconsistent view)
flink (issues related to Flink)
priority:critical (production down; pipelines stalled; need help asap)

Comments

@maheshguptags

maheshguptags commented Jan 30, 2025

Issue

While performing load testing with the metadata table (METADATA/MDT) enabled, I encountered a data loss issue. The issue occurs when the job is deployed with autoscaling enabled. Specifically, if checkpointing fails, for example because a new TaskManager (TM) is added or because of heap memory issues, all data is discarded and no further data is processed after that failure.

  • Checkpointing failures lead to data loss.
  • After a checkpoint fails due to lack of resources, a new checkpoint is triggered but no data is processed.
  • I tried to replicate this behavior on Hudi 1.0, and the same issue persists.

Hudi Properties

#Updated at 2025-01-20T07:41:05.654545Z
#Mon Jan 20 07:41:05 UTC 2025
hoodie.table.keygenerator.type=COMPLEX_AVRO
hoodie.table.type=COPY_ON_WRITE
hoodie.table.precombine.field=updated_date
hoodie.table.create.schema={}
hoodie.timeline.layout.version=2
hoodie.timeline.history.path=history
hoodie.table.checksum=1292384652
hoodie.datasource.write.drop.partition.columns=false
hoodie.record.merge.strategy.id=00000000-0000-0000-0000-000000000000
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.metadata.partitions.inflight=
hoodie.database.name=default_database
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.record.merge.mode=CUSTOM
hoodie.table.version=8
hoodie.compaction.payload.class=com.gupshup.cdp.PartialUpdate
hoodie.table.initial.version=8
hoodie.table.metadata.partitions=files
hoodie.table.partition.fields=xyz
hoodie.table.cdc.enabled=false
hoodie.archivelog.folder=history
hoodie.table.name=customer_temp
hoodie.table.recordkey.fields=xyz.abc 
hoodie.timeline.path=timeline

Steps to reproduce the behavior:

  1. Create a table with Flink Hudi with the metadata table (MDT) enabled (a minimal DDL sketch is shown after this list)
  2. Ingest some load
  3. Delete one of the TMs, or ingest a heavy enough load to cause a memory issue
  4. Once the checkpoint fails, all the data pending for that checkpoint is discarded
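
A minimal sketch of step 1, assuming hypothetical flat column names (abc as record key, xyz as partition field, updated_date as precombine field) and a placeholder S3 path; it uses the Flink Table API and turns the Hudi metadata table on via 'metadata.enabled', mirroring the COPY_ON_WRITE setup in the properties above:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class CreateHudiTableWithMdt {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical schema: the real record key in the report is xyz.abc (COMPLEX_AVRO keygen).
        tEnv.executeSql(
            "CREATE TABLE customer_temp (" +
            "  abc STRING," +
            "  xyz STRING," +
            "  updated_date TIMESTAMP(3)," +
            "  PRIMARY KEY (abc) NOT ENFORCED" +
            ") PARTITIONED BY (xyz) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 's3a://<bucket>/customer_temp'," +   // placeholder path
            "  'table.type' = 'COPY_ON_WRITE'," +
            "  'precombine.field' = 'updated_date'," +
            "  'metadata.enabled' = 'true'" +                // MDT enabled, as in the report
            ")");
    }
}
```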

Expected behavior

After checkpoint failure due to resource issues, the system should continue processing data once resources are available, without losing previously processed data.
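
Not a fix, but for completeness, here is a sketch of the Flink-side settings (the values are assumptions, not the job's actual config) that govern what happens after a checkpoint fails: how many consecutive failures are tolerated before the job fails over, whether externalized checkpoints are retained, and the restart strategy:

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFailureHandling {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000);                                   // checkpoint every 60s (assumed interval)
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);  // tolerate up to 3 failed checkpoints before failing the job
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);     // keep checkpoints for manual recovery
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        // ... build the Hudi ingestion pipeline here, then call env.execute(...)
    }
}
```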

Environment Description

  • Hudi version : 1.0.0

  • Flink version: 1.18

  • Hive version : NO

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : Yes

  • Table Type: COPY_ON_WRITE

Additional context

Can the Hudi team assist with troubleshooting this issue? Is this expected behavior with the metadata table enabled, or is there a bug in the Flink writer under resource-constrained scenarios?

Image

cc: @codope @bhasudha @danny0405 @xushiyan @ad1happy2go @yihua

@danny0405
Contributor

Does the job work well without auto-scaling? What is the state of the pipeline after the checkpoint fails: does the writer still handle inputs?

@maheshguptags
Author

Yes, the job works both with and without auto-scaling if we don't enable MDT.

@danny0405
Contributor

Are there any special logs in the JM logging?

@maheshguptags
Author

I haven't seen any special log for this. Usually the checkpoint fails either because autoscaling spins up a new TM or because I kill a TM manually, and it discards the data after that.

Thanks
Mahesh

@danny0405
Contributor

it discards the data after that.

Are you saying the pipeline just hangs there and does nothing?

@maheshguptags
Author

maheshguptags commented Feb 5, 2025

No, it simply moves on to the next checkpoint and processes nothing. If you look at the 4th checkpoint in the attached screenshot, it is processing millions of records (still in progress, not yet completed) and taking approximately 4 minutes. However, once it fails, the job moves to the next checkpoint and processes nothing, completing in milliseconds (the same goes for checkpoints 2 and 3).

Thanks
Mahesh

Image

@ad1happy2go added the data-loss, flink, and priority:critical labels on Feb 7, 2025
@maheshguptags
Author

@danny0405 any findings?

@danny0405
Contributor

No, without more detailed logs I cannot help further, and I have no knowledge of Flink auto-scaling.

@cshuo
Contributor

cshuo commented Feb 12, 2025

Once the checkpoint fails, all the data pending for that checkpoint is discarded

@maheshguptags do you mean that after the checkpoint fails, the records ingested after that point are lost? If that happens, there are usually two types of problems:

  1. The job is stuck somewhere before the Hudi write operator. Check the metrics of the Hudi write operator in the dashboard, e.g. "records received", to make sure new records keep arriving after the checkpoint failure.
  2. Records are not committed correctly. Check the JobManager logs for exception messages. (Both checks are sketched at the end of this comment.)

Btw, could you also paste the exception trace from the Flink dashboard?
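
A hedged sketch of both checks via Flink's standard REST API (the host/port and the job/vertex ids are placeholders to be taken from the dashboard; "numRecordsIn" is the task-level counterpart of the "records received" metric mentioned above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointAndThroughputCheck {
    private static final String REST = "http://localhost:8081";             // JobManager REST endpoint (assumed)
    private static final String JOB_ID = "<job-id>";                        // placeholder
    private static final String WRITE_VERTEX_ID = "<hudi-write-vertex-id>"; // placeholder

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Checkpoint statistics: completed vs. failed counts after the TM loss.
        System.out.println(get(client, REST + "/jobs/" + JOB_ID + "/checkpoints"));

        // 2. Records flowing into the Hudi write task: this should keep increasing
        //    if the job is not stuck upstream of the writer.
        System.out.println(get(client, REST + "/jobs/" + JOB_ID
                + "/vertices/" + WRITE_VERTEX_ID + "/metrics?get=numRecordsIn"));
    }

    private static String get(HttpClient client, String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```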

@maheshguptags
Author

Hi

do you mean that after the checkpoint fails, the records ingested after that point are lost?

@cshuo Let's assume 10 million records are ingested into the job. While these records are being processed, if the JobManager triggers the creation of a new TaskManager (TM) due to auto-scaling, or if a TM is manually removed (to test the scenario without auto-scaling), a checkpoint failure can occur, causing all the previously ingested data (the 10 million records) to be discarded.

If new data (e.g., 1 million records) is ingested after the checkpoint failure, the new data will be successfully processed and ingested to Hudi, provided the next checkpoint succeeds.

To summarize:

Ingest 10M records → checkpoint failure (due to TM change) → discard all data
Ingest 1M new records → checkpoint success → successfully ingested into Hudi (only 1M).

Thanks
Mahesh

@danny0405
Contributor

Ingest 10M records → checkpoint failure (due to TM change) → discard all data

So these records did not even go through a complete checkpoint lifecycle, and no commits occurred.

@maheshguptags
Author

I used an example to illustrate the issue.
The job successfully ingested and checkpointed 5.5M of the 10M records. However, when the job was interrupted (either manually or due to autoscaling), the remaining 4.5M records were discarded.

Example: ingest 10M records.

chkpnt1 → succeeded → ingested 2.5M (out of 10M)
chkpnt2 → succeeded → ingested 3M (of the remaining 7.5M)
chkpnt3 → failed (either manually or due to autoscaling) → no data written to the Hudi table, and the remaining 4.5M records are discarded after this point

The job then attempts the next checkpoint:

chkpnt4 → succeeded → no data is written due to the failure at chkpnt3, and the checkpoint completes within milliseconds

Thanks
Mahesh

@maheshguptags
Author

@danny0405 @cshuo, any progress on this?
Please let me know if you need further information.

No, without more detailed logs I cannot help further, and I have no knowledge of Flink auto-scaling.

You can reproduce it without auto-scaling by forcefully deleting a TM while a checkpoint is in progress (causing the checkpoint to fail), with MDT enabled.

We can also connect over a call.

Thanks
Mahesh
