[Bug] Enabling wal_compression Leads To Coredumps #927

Open
antoniopetrole opened this issue Feb 13, 2025 · 1 comment
Labels
type: Bug Something isn't working

Comments

@antoniopetrole
Member

Apache Cloudberry version

Cloudberry 1.6.0 (pre apache release)

What happened

A few days ago we set wal_compression = on in an attempt to reduce I/O in our production cluster. Shortly after enabling it, users reached out to say that queries that were part of a big workload were failing. After some investigation, we found coredumps being generated on the segments that were throwing errors, and these coredumps are directly related to the WAL compression functionality. The abort appears to happen right after XLogCompressBackupBlock.cold.4 runs. Thankfully it didn't crash any segments, so I imagine the WAL stuff happens in its own thread. We quickly disabled this GUC and haven't seen the issue since. (It happened multiple times on multiple segments because users were retrying their jobs.)

Client Side Error

DEBUG ERROR: Error on receive from seg25 slice1 10. :4001 pid=3498813: server closed the connection unexpectedly
DEBUG ERROR: current transaction is aborted, commands ignored until end of transaction block, command: SELECT
ERROR PSQLException: ERROR: Error on receive from seg25 slice1 10.:4001 pid=3498813: server closed the connection unexpectedly
PL/pgSQL function line 298 at EXECUTE
org.postgresql.util.PSQLException: ERROR: Error on receive from seg25 slice1 10. :4001 pid=3498813: server closed the connection unexpectedly
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
INFO Exited due to an error 1 min 1.7 secs after starting

Coredump Trace

#0  0x00007f665f04e52f in raise () from /lib64/libc.so.6
#1  0x00007f665f021e65 in abort () from /lib64/libc.so.6
#2  0x00007f66600de060 in errfinish () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#3  0x00007f665fa84888 in XLogCompressBackupBlock.cold.4 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#4  0x00007f665fbfc814 in XLogRecordAssemble () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#5  0x00007f665fbfcbc4 in XLogInsert_Internal () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#6  0x00007f665fb9364f in heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#7  0x00007f665fb93836 in simple_heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#8  0x00007f665fb5c81c in toast_delete_datum () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#9  0x00007f665fbd2a5f in toast_delete_external () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#10 0x00007f665fba1070 in heap_toast_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#11 0x00007f665fb9343d in heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#12 0x00007f665fda4298 in ExecDelete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#13 0x00007f665fda62f1 in ExecModifyTable () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#14 0x00007f665fd7877b in ExecProcNodeFirst () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#15 0x00007f665fd6f47a in ExecutePlan.part.1 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#16 0x00007f665fd6ff28 in standard_ExecutorRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#17 0x00007f665fd70135 in ExecutorRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#18 0x00007f665ff8af2d in ProcessQuery.isra.3 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#19 0x00007f665ff8beb2 in PortalRunMulti () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#20 0x00007f665ff8c33d in PortalRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#21 0x00007f665ff865df in exec_mpp_query () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#22 0x00007f665ff89ebd in PostgresMain () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#23 0x00007f665fee5ddf in ServerLoop () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#24 0x00007f665fee6f1f in PostmasterMain () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#25 0x00000000004017ae in main ()
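
Frames #1 and #2 show errfinish calling abort, which in PostgreSQL is what happens when an error is reported at PANIC level, so the affected segment's server log may contain the actual error message. Below is a minimal sketch for pulling those lines out, assuming the segment logs are CSV files under a log directory. The function name `find_panic_lines` and the `SEGMENT_LOG_DIR` variable are made up for illustration, and the real log location depends on the installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cloudberry): print every line containing
# PANIC from the *.csv log files found under the given directory.
find_panic_lines() {
  log_dir="$1"
  # -r: recurse, -h: omit file names, --include: only look at CSV logs (GNU grep).
  grep -rh --include='*.csv' 'PANIC' "$log_dir" 2>/dev/null
}

# Only runs when SEGMENT_LOG_DIR is set; the variable name is an assumption.
if [ -n "${SEGMENT_LOG_DIR:-}" ]; then
  find_panic_lines "$SEGMENT_LOG_DIR"
fi
```

For example, `SEGMENT_LOG_DIR=/path/to/segment/log bash find_panic.sh` would print any PANIC entries, which could then be matched against the pid from the client-side error.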

### What you think should happen instead

Shouldn't core dump :) 

### How to reproduce

I haven't tried creating a test case for this yet, but it should be relatively easy. All we did was enable the GUC and run `gpstop -u`, and then our users started having issues.
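
The steps above could be sketched roughly as follows. This is an untested guess at a reproduction, not a confirmed test case: it assumes the GUC is set with `gpconfig`, and the SQL workload (table name `wal_compression_repro`, row counts, the `REPRO_DB` variable) is invented to mimic the DELETE of TOASTed values seen in the trace. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
REPRO_DB="${REPRO_DB:-postgres}"   # hypothetical target database

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical workload: store values out of line (TOAST), force a checkpoint
# so the next page change logs a full-page image, then delete the rows.
repro_sql() {
  cat <<'SQL'
CREATE TABLE wal_compression_repro (id int, payload text) DISTRIBUTED BY (id);
ALTER TABLE wal_compression_repro ALTER COLUMN payload SET STORAGE EXTERNAL;
INSERT INTO wal_compression_repro
SELECT g, repeat(md5(g::text), 1000) FROM generate_series(1, 10000) g;
CHECKPOINT;
DELETE FROM wal_compression_repro;
SQL
}

repro_wal_compression() {
  # Assumption: the GUC was changed with gpconfig; the report does not say how.
  run gpconfig -c wal_compression -v on
  # Reload the configuration without a restart, as described in the report.
  run gpstop -u
  if [ "$DRY_RUN" = "1" ]; then
    repro_sql
  else
    repro_sql | psql -d "$REPRO_DB" -v ON_ERROR_STOP=1
  fi
}

repro_wal_compression
```

Running it on a throwaway cluster rather than production seems prudent, since the original failure produced coredumps on the segments.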

### Operating System

Rocky Linux 8.10 (Green Obsidian)

### Anything else

_No response_

### Are you willing to submit PR?

- [ ] Yes, I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/cloudberry/blob/main/CODE_OF_CONDUCT.md).

antoniopetrole added the "type: Bug" label Feb 13, 2025

@yjhjstz
Member

yjhjstz commented Feb 17, 2025

related #806
