Describe the problem you're observing
Our Postgres data warehouse experiences intermittent corruption on streaming replicas when the primary Postgres server's write-ahead log (WAL) is located on a ZFS dataset. The replica's walreceiver process terminates and a message like one of the following appears in the replica's logs:
2024-11-25T00:49:23.562566-05:00 pgdata3 postgres[399480]: [19995-1] 673b5162.61878 (@)LOG: invalid magic number 0000 in log segment 00000002000B05B70000000C, offset 7217152
2024-11-26T10:08:03.797783-05:00 pgdata3 postgres[1262718]: [5973-1] 6744cbea.13447e (@)LOG: incorrect resource manager data checksum in record at B0913/B3FC0750
2025-01-02T07:27:34.264343-05:00 pgdata3 postgres[3414526]: [23458-1] 6749fa15.3419fe (@)LOG: record with incorrect prev-link 7084E000/70851800 at B3CF3/A90A8500
The problem does not occur if the primary's WAL is located on a non-ZFS disk--I have tried multiple times to replicate the issue with the WAL on ext4, and cannot reproduce it. It does still occur if the database cluster directory is located on a non-ZFS disk. Data corruption may also be observed using a command-line tool to receive the WAL (e.g., pg_receivewal) onto a server that isn't running ZFS at all (the corruption can be detected with pg_waldump). I have reproduced the problem when the WAL is located on a separate ZFS dataset and on a separate zpool from the database.
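For anyone wanting to check a received segment by hand, the invocation is roughly as follows; the host and user match the primary_conninfo shown further down, while the target directory is just a placeholder:
# Receive WAL over the replication protocol onto a non-ZFS box (directory is a placeholder)
pg_receivewal -h pgdata4 -U pgrepl -D /data/walcheck
# Walk the suspect segment (e.g. the one named in the first log line above);
# pg_waldump stops with an error at the first bad record
pg_waldump /data/walcheck/00000002000B05B70000000C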
The issue occurs much more frequently on Rocky 9 than RHEL 7 (we upgraded from 7 to 9 in November, and I have a year's worth of database logs on the primary and 3 replicas if we need to go searching back through them). While we'd see this issue on the order of once a month on RHEL 7, on Rocky 9 it now recurs every few minutes to hours. A smaller ZFS recordsize also seems to increase the frequency of encountering the problem.
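If you want to experiment with the recordsize effect, the WAL can be put on its own small-recordsize dataset; a minimal sketch (the pool/dataset names and the 8k value are illustrative, not our production layout):
# Example dataset for pg_wal with a small recordsize (names and value are examples only)
zfs create -o recordsize=8k -o mountpoint=/pgwal tank/pgwal
# Then point the cluster's WAL at it, e.g. via initdb --waldir=/pgwal/pg_wal or a pg_wal symlink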
The root cause from Postgres's perspective appears to be a race condition between when a WAL page is flushed and when the walsender process reads it for streaming replication. The workaround we've developed to correct the much higher frequency with which we're hitting the bug on Rocky 9 is simply to temporarily switch the replicas from streaming replication to WAL shipping and back--the delay associated with WAL shipping is enough that the affected logfiles are committed to disk on the primary with the correct contents.
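For reference, on PostgreSQL 13+ (where primary_conninfo is reloadable) the switch can be done without a restart, assuming restore_command is already configured for WAL shipping; this is a sketch rather than our exact procedure:
# On the affected replica: disable streaming so recovery falls back to restore_command
psql -c "ALTER SYSTEM SET primary_conninfo = ''"
psql -c "SELECT pg_reload_conf()"
# ...once the suspect segments have been replayed from the archive, restore the original setting
psql -c "ALTER SYSTEM RESET primary_conninfo"
psql -c "SELECT pg_reload_conf()"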
Our previous setup ran Postgres 15, ZFS 2.1.15, and RHEL 7.9, but as you can see from this pgsql-hackers thread from my colleague, we saw it at least as far back as Postgres 10.11 and ZFS 0.7.13 on RHEL 7.7. Our production environment is running ZFS 2.1.15 on Rocky 9, but I upgraded my testing environment to 2.2.7 to test the latest bugfixes, and can still trigger it.
We've seen this on multiple physical hosts at this point, so I don't suspect faulty hardware. All servers (including the replicas) use ECC RAM and have SSDs with capacitor-backed caches. For completeness, here are their specs:
24x 7.68TB SAS SSDs (Samsung PM1643a) behind a Dell H740P controller in HBA mode
10GbE (Intel X710)
Replicas:
Dell PE R650
2x Intel Xeon Platinum 8358
1.0TB ECC RAM (16x 64GB DIMMs)
2x 1.6TB (Intel P5600) NVMe (currently configured as the replicas' WAL using ext4, but previously ZFS)
24x 7.68TB SAS SSDs (Samsung PM1643a) in a Dell MD1420 behind a Dell HBA355e controller (4x RAIDz1 with 6 disks each)
10GbE (Broadcom 57412)
Servers are distributed across two datacenters, and the new/old primaries are connected to different switches, so the network paths are not identical (hopefully ruling out networking issues).
Describe how to reproduce the problem
Set up a vanilla Postgres streaming replication environment with the primary server's WAL on a ZFS dataset. The replicas should be fast enough to keep up with writing the WAL being streamed to them, but they don't need to be able to apply it quickly. The only settings that are strictly necessary to change on the primary are:
$ diff -u0 -w postgresql.conf{-vanilla,}
+listen_addresses = '*' # what IP address(es) to listen on;
+primary_conninfo = 'host=pgdata4 port=5432 user=pgrepl'
+primary_slot_name = 'pgdata3'
(and of course the primary's pg_hba.conf needs to be configured to allow replication connections)
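The pg_hba.conf entry is the usual replication rule, something along these lines (the address range and auth method here are examples, not our actual values):
# pg_hba.conf on the primary
host    replication     pgrepl          192.0.2.0/24            scram-sha-256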
Copy lots of data to the primary. Suitable data may be created with the pgbench program. To increase parallelization and trigger the bug more reliably, I create multiple copies of the data with:
n=8
# Populate the base pgbench tables (drop, create, generate data; no indexes needed here)
pgbench -i -I dtg --scale 10000
# Create n empty clones of pgbench_accounts and save a drop/recreate schema for each
for i in $(seq 1 "$n"); do
    psql -c "create table pgbench_accounts${i} (like pgbench_accounts)"
    pg_dump --schema-only --clean --table="pgbench_accounts${i}" -f "/tmp/pgbench_accounts${i}.schema"
done
# Dump the generated rows once so they can be copied back in repeatedly
psql -c "copy pgbench_accounts to '/tmp/pgbench_accounts.data'"
I then wipe the database (typically by rolling back to a ZFS snapshot) and use the following script to repeatedly copy the tables just created back into the fresh db, dropping and recreating the tables on each iteration:
#!/bin/bash
# Run from the directory holding the pgbench_accounts*.schema and pgbench_accounts.data
# files created above (e.g. /tmp).
set -euo pipefail
PORT=5432
DBNAME=postgres
pids=""
i=0
while true; do
    i=$((i + 1))
    echo "iters: $i"
    # Drop/recreate the table structures first
    for t in {1..8}; do
        psql -p $PORT $DBNAME -f "pgbench_accounts${t}.schema" > /dev/null 2>&1
    done
    # Copy data into the tables, eight COPYs in parallel
    for t in {1..8}; do
        psql -p $PORT $DBNAME -c "COPY pgbench_accounts${t} FROM STDIN" < pgbench_accounts.data > /dev/null 2>&1 &
        pids="$! $pids"
    done
    wait $pids
    pids=""
done
On our hardware I can typically trigger the bug within a few hours to a day of runtime. Be aware that when run with the above pgbench scale factor, it will create a database about 1TB in size, in addition to all the WALs that will be generated. Smaller scale factors may also work, but I haven't tested with them.
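A crude way to notice when the bug has fired is to watch the replica's log for the errors quoted above; the sample messages are in syslog format, so something like the following works (adapt to your logging setup):
# Follow the journal for the postgres syslog identifier and flag the known error strings
journalctl -f -t postgres | grep -E 'invalid magic number|incorrect resource manager data checksum|incorrect prev-link'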
Include any warning/errors/backtraces from the system logs
No warnings or errors in system or hardware logs. Example Postgres logs are above; I can provide more (at a higher debug level) upon request.
Which compression do you use?
Which hash algorithm do you use?
What does a scrub result in?
When you copy the WAL files directly to the server and replay them does it then work without any errors?
Can you try to reproduce this on ZFS 2.2.7 (maybe also the latest 2.3 RC or the git master branch) and a more modern OS/kernel (e.g. Fedora 41 with kernel 6.12.8)?
Can you please post your zpool/zfs config (e.g. zpool get all, zfs get all)?