[Bug] Enabling wal_compression Leads To Coredumps #927

antoniopetrole · 2025-02-13T16:02:12Z

### Apache Cloudberry version

Cloudberry 1.6.0 (pre apache release)

### What happened

A few days ago we set wal_compression = on in an attempt to reduce I/O in our production cluster. Shortly after enabling it, users reported that queries that were part of a big workload were failing. After some investigation, we found coredumps being generated on the segments that were throwing errors, and these coredumps are directly related to the WAL compression functionality. It seems the exception was thrown right after XLogCompressBackupBlock.cold.4 ran, which produced the coredump. Thankfully it didn't crash any segments, so I imagine the WAL stuff happens in its own thread. We quickly disabled this GUC and haven't seen the issue since. (It happened on multiple segments, multiple times, because users were retrying their jobs.)

### Client Side Error

DEBUG ERROR: Error on receive from seg25 slice1 10. :4001 pid=3498813: server closed the connection unexpectedly
DEBUG ERROR: current transaction is aborted, commands ignored until end of transaction block, command: SELECT
ERROR PSQLException: ERROR: Error on receive from seg25 slice1 10.:4001 pid=3498813: server closed the connection unexpectedly
PL/pgSQL function line 298 at EXECUTE
org.postgresql.util.PSQLException: ERROR: Error on receive from seg25 slice1 10. :4001 pid=3498813: server closed the connection unexpectedly
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
INFO Exited due to an error 1 min 1.7 secs after starting

### Coredump Trace

```
#0  0x00007f665f04e52f in raise () from /lib64/libc.so.6
#1  0x00007f665f021e65 in abort () from /lib64/libc.so.6
#2  0x00007f66600de060 in errfinish () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#3  0x00007f665fa84888 in XLogCompressBackupBlock.cold.4 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#4  0x00007f665fbfc814 in XLogRecordAssemble () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#5  0x00007f665fbfcbc4 in XLogInsert_Internal () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#6  0x00007f665fb9364f in heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#7  0x00007f665fb93836 in simple_heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#8  0x00007f665fb5c81c in toast_delete_datum () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#9  0x00007f665fbd2a5f in toast_delete_external () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#10 0x00007f665fba1070 in heap_toast_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#11 0x00007f665fb9343d in heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#12 0x00007f665fda4298 in ExecDelete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#13 0x00007f665fda62f1 in ExecModifyTable () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#14 0x00007f665fd7877b in ExecProcNodeFirst () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#15 0x00007f665fd6f47a in ExecutePlan.part.1 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#16 0x00007f665fd6ff28 in standard_ExecutorRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#17 0x00007f665fd70135 in ExecutorRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#18 0x00007f665ff8af2d in ProcessQuery.isra.3 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#19 0x00007f665ff8beb2 in PortalRunMulti () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#20 0x00007f665ff8c33d in PortalRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#21 0x00007f665ff865df in exec_mpp_query () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#22 0x00007f665ff89ebd in PostgresMain () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#23 0x00007f665fee5ddf in ServerLoop () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#24 0x00007f665fee6f1f in PostmasterMain () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#25 0x00000000004017ae in main ()```

### What you think should happen instead

Shouldn't core dump :) 

### How to reproduce

I haven't tried creating a test case for this yet, but it should be relatively easy. All we did was enable the GUC and run `gpstop -u`, and then our users started having issues.
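
A sketch of those steps, for anyone trying to reproduce this. It assumes the GUC was set cluster-wide with `gpconfig`; the exact method used on our cluster isn't recorded here, so treat that part as an assumption. Based on the trace, the failing statements were DELETEs that touched TOASTed rows, so a workload along those lines is probably the place to start.

```shell
# Assumption: the GUC is set cluster-wide via gpconfig (not confirmed above).
gpconfig -c wal_compression -v on

# Reload configuration files without restarting the cluster.
gpstop -u

# Check that the new value is visible on the coordinator and segments.
gpconfig -s wal_compression

# To revert after testing:
# gpconfig -c wal_compression -v off && gpstop -u
```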

### Operating System

Rocky Linux 8.10 (Green Obsidian)

### Anything else

_No response_

### Are you willing to submit PR?

- [ ] Yes, I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/cloudberry/blob/main/CODE_OF_CONDUCT.md).


yjhjstz · 2025-02-17T15:55:28Z

related #806

antoniopetrole added the type: Bug Something isn't working label Feb 13, 2025
