Describe the problem you're observing
Our Postgres data warehouse experiences intermittent corruption on streaming replicas when the primary Postgres server's write-ahead log (WAL) is located on a ZFS dataset. The replica's walreceiver process terminates and a message like one of the following appears in the replica's logs:
2024-11-25T00:49:23.562566-05:00 pgdata3 postgres[399480]: [19995-1] 673b5162.61878 (@)LOG: invalid magic number 0000 in log segment 00000002000B05B70000000C, offset 7217152
2024-11-26T10:08:03.797783-05:00 pgdata3 postgres[1262718]: [5973-1] 6744cbea.13447e (@)LOG: incorrect resource manager data checksum in record at B0913/B3FC0750
2025-01-02T07:27:34.264343-05:00 pgdata3 postgres[3414526]: [23458-1] 6749fa15.3419fe (@)LOG: record with incorrect prev-link 7084E000/70851800 at B3CF3/A90A8500
The problem does not occur if the primary's WAL is located on a non-ZFS disk--I have tried multiple times to replicate the issue with the WAL on ext4, and cannot reproduce it. It does still occur if the database cluster directory is located on a non-ZFS disk. Data corruption may also be observed using a command-line tool to receive the WAL (e.g., pg_receivewal) onto a server that isn't running ZFS at all (the corruption can be detected with pg_waldump). I have reproduced the problem when the WAL is located on a separate ZFS dataset and on a separate zpool from the database.
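For anyone wanting to check a received segment by hand, the invocation is roughly as follows; the host and user match the primary_conninfo shown further down, while the target directory is just a placeholder:
# Receive WAL over the replication protocol onto a non-ZFS box (directory is a placeholder)
pg_receivewal -h pgdata4 -U pgrepl -D /data/walcheck
# Walk the suspect segment (e.g. the one named in the first log line above);
# pg_waldump stops with an error at the first bad record
pg_waldump /data/walcheck/00000002000B05B70000000C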
The issue occurs much more frequently on Rocky 9 than RHEL 7 (we upgraded from 7 to 9 in November, and I have a year's worth of database logs on the primary and 3 replicas if we need to go searching back through them). While we'd see this issue on the order of once a month on RHEL 7, on Rocky 9 it now recurs every few minutes to hours. A smaller ZFS recordsize also seems to increase the frequency of encountering the problem.
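If you want to experiment with the recordsize effect, the WAL can be put on its own small-recordsize dataset; a minimal sketch (the pool/dataset names and the 8k value are illustrative, not our production layout):
# Example dataset for pg_wal with a small recordsize (names and value are examples only)
zfs create -o recordsize=8k -o mountpoint=/pgwal tank/pgwal
# Then point the cluster's WAL at it, e.g. via initdb --waldir=/pgwal/pg_wal or a pg_wal symlink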
The root cause from Postgres's perspective appears to be a race condition between when a WAL page is flushed and when the walsender process reads it for streaming replication. The workaround we've developed to correct the much higher frequency with which we're hitting the bug on Rocky 9 is simply to temporarily switch the replicas from streaming replication to WAL shipping and back--the delay associated with WAL shipping is enough that the affected logfiles are committed to disk on the primary with the correct contents.
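For reference, on PostgreSQL 13+ (where primary_conninfo is reloadable) the switch can be done without a restart, assuming restore_command is already configured for WAL shipping; this is a sketch rather than our exact procedure:
# On the affected replica: disable streaming so recovery falls back to restore_command
psql -c "ALTER SYSTEM SET primary_conninfo = ''"
psql -c "SELECT pg_reload_conf()"
# ...once the suspect segments have been replayed from the archive, restore the original setting
psql -c "ALTER SYSTEM RESET primary_conninfo"
psql -c "SELECT pg_reload_conf()"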
Our previous setup ran Postgres 15, ZFS 2.1.15, and RHEL 7.9, but as you can see from this pgsql-hackers thread from my colleague, we saw it at least as far back as Postgres 10.11 and ZFS 0.7.13 on RHEL 7.7. Our production environment is running ZFS 2.1.15 on Rocky 9, but I upgraded my testing environment to 2.2.7 to test the latest bugfixes, and can still trigger it.
We've seen this on multiple physical hosts at this point, so I don't suspect faulty hardware. All servers (including the replicas) use ECC RAM and have SSDs with capacitor-backed caches. For completeness, here are their specs:
24x 7.68TB SAS SSDs (Samsung PM1643a) behind a Dell H740P controller in HBA mode
10GbE (Intel X710)
Replicas:
Dell PE R650
2x Intel Xeon Platinum 8358
1.0TB ECC RAM (16x 64GB DIMMs)
2x 1.6TB (Intel P5600) NVMe (currently configured as the replicas' WAL using ext4, but previously ZFS)
24x 7.68TB SAS SSDs (Samsung PM1643a) in a Dell MD1420 behind a Dell HBA355e controller (4x RAIDz1 with 6 disks each)
10GbE (Broadcom 57412)
Servers are distributed across two datacenters, and the new/old primaries are connected to different switches, so the network paths are not identical (hopefully ruling out networking issues).
Describe how to reproduce the problem
Set up a vanilla Postgres streaming replication environment with the primary server's WAL on a ZFS dataset. The replicas should be fast enough to keep up with writing the WAL being streamed to them, but they don't need to be able to apply it quickly. The only settings that are strictly necessary to change on the primary are:
$ diff -u0 -w postgresql.conf{-vanilla,}
+listen_addresses = '*' # what IP address(es) to listen on;
+primary_conninfo = 'host=pgdata4 port=5432 user=pgrepl'
+primary_slot_name = 'pgdata3'
(and of course the primary's pg_hba.conf needs to be configured to allow replication connections)
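The pg_hba.conf entry is the usual replication rule, something along these lines (the address range and auth method here are examples, not our actual values):
# pg_hba.conf on the primary
host    replication     pgrepl          192.0.2.0/24            scram-sha-256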
Copy lots of data to the primary. Suitable data may be created with the pgbench program. To increase parallelization and trigger the bug more reliably, I create multiple copies of the data with:
n=8
# Populate the base pgbench tables (drop, create, generate data; no indexes needed here)
pgbench -i -I dtg --scale 10000
# Create n empty clones of pgbench_accounts and save a drop/recreate schema for each
for i in $(seq 1 "$n"); do
    psql -c "create table pgbench_accounts${i} (like pgbench_accounts)"
    pg_dump --schema-only --clean --table="pgbench_accounts${i}" -f "/tmp/pgbench_accounts${i}.schema"
done
# Dump the generated rows once so they can be copied back in repeatedly
psql -c "copy pgbench_accounts to '/tmp/pgbench_accounts.data'"
I then wipe the database (typically by rolling back to a ZFS snapshot) and use the following script to repeatedly copy the tables just created back into the fresh db, dropping and recreating the tables on each iteration:
#!/bin/bash
# Run from the directory holding the pgbench_accounts*.schema and pgbench_accounts.data
# files created above (e.g. /tmp).
set -euo pipefail
PORT=5432
DBNAME=postgres
pids=""
i=0
while true; do
    i=$((i + 1))
    echo "iters: $i"
    # Drop/recreate the table structures first
    for t in {1..8}; do
        psql -p $PORT $DBNAME -f "pgbench_accounts${t}.schema" > /dev/null 2>&1
    done
    # Copy data into the tables, eight COPYs in parallel
    for t in {1..8}; do
        psql -p $PORT $DBNAME -c "COPY pgbench_accounts${t} FROM STDIN" < pgbench_accounts.data > /dev/null 2>&1 &
        pids="$! $pids"
    done
    wait $pids
    pids=""
done
On our hardware I can typically trigger the bug within a few hours to a day of runtime. Be aware that when run with the above pgbench scale factor, it will create a database about 1TB in size, in addition to all the WALs that will be generated. Smaller scale factors may also work, but I haven't tested with them.
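A crude way to notice when the bug has fired is to watch the replica's log for the errors quoted above; the sample messages are in syslog format, so something like the following works (adapt to your logging setup):
# Follow the journal for the postgres syslog identifier and flag the known error strings
journalctl -f -t postgres | grep -E 'invalid magic number|incorrect resource manager data checksum|incorrect prev-link'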
Include any warning/errors/backtraces from the system logs
No warnings or errors in system or hardware logs. Example Postgres logs are above; I can provide more (at a higher debug level) upon request.
Which compression do you use?
Which hash algorithm do you use?
What does a scrub result in?
When you copy the WAL files directly to the server and replay them does it then work without any errors?
Can you try to reproduce this on ZFS 2.2.7 (maybe also the latest 2.3 RC or the git master branch) and a more modern OS/kernel (e.g. Fedora 41 with kernel 6.12.8)?
Can you please post your zpool/zfs config (e.g. zpool get all, zfs get all)?