Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tail latencies of IO operations #16844

Open
sargarass opened this issue Dec 6, 2024 · 5 comments
Open

Tail latencies of IO operations #16844

sargarass opened this issue Dec 6, 2024 · 5 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@sargarass
Copy link

sargarass commented Dec 6, 2024

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 24.04.1 LTS
Kernel Version 6.8.0-45-lowlatency
Architecture x86_64
OpenZFS Version zfs-2.2.2-0ubuntu9.1

Describe the problem you're observing

The disk in use is a SAMSUNG MZQL2960HCJR-00A07 (Sequential Write: 1500 MB/s). While using ZFS on this disk, I encountered that tail latencies for I/O operations are excessively high (0.9-1.2 seconds) when performing sequential writes without fsync.
These issues are not reproducible when using ext4 on the same disk. Encryption is disabled in both cases. Could you provide any guidelines to reduce these latencies?

smartctl -a /dev/nvme0n1
smartctl output
# smartctl -a /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-45-lowlatency] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQL2960HCJR-00A07
Serial Number:                      *****
Firmware Version:                   GDC5302Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 960,197,124,096 [960 GB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.4
Number of Namespaces:               32
Namespace 1 Size/Capacity:          960,197,124,096 [960 GB]
Namespace 1 Utilization:            867,007,496,192 [867 GB]
Namespace 1 Formatted LBA Size:     4096
Local Time is:                      Fri Dec  6 17:07:41 2024 UTC
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x005f):   Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     83 Celsius
Namespace 1 Features (0x1a):        NA_Fields No_ID_Reuse NP_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
0 +    25.00W   14.00W       -    0  0  0  0       70      70
1 +     8.00W  0.0800W       -    1  1  1  1       70      70

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
0 -     512       0         0
1 +    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    568,921 [291 GB]
Data Units Written:                 23,677,053 [12.1 TB]
Host Read Commands:                 4,089,356
Host Write Commands:                143,465,109
Controller Busy Time:               966
Power Cycles:                       3
Power On Hours:                     22,089
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               38 Celsius
Temperature Sensor 2:               47 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x002)

Describe how to reproduce the problem

I used next fio configuration to measure the maximal latencies without fsync (section dirty-writers-thoughput) and with fsync(fsync-writers-thoughput). Each test starts with fs recreation, dropping system caches and fstrim.

jobfile.fio
[global]
kb_base=1024
group_reporting=1
#Type of io pattern 
# write         Sequential writes
rw=write
ioengine=sync
filesize=10m
blocksize=128k
# If value is true, use non-buffered io. This is usually O_DIRECT.
direct=0
# If value is true, use buffered io. This is the opposite of the 'direct' option
buffered=0
unit_base=8
openfiles=1
# If writing to a file, issue a sync of the dirty data for every number of blocks given. 
fsync=0
# If true, serialize the file creating for the jobs.
# This may be handy to avoid interleaving of data
# files, which may greatly depend on the filesystem
# used and even the number of processors in the system.
create_serialize=1
create_on_open=1
# Unlink the job files when done.
unlink=1
directory=/mnt/testpool
# fio defaults to forking jobs, however if this option is
# given, fio will use pthread_create(3) to create threads
# instead.
thread

clocksource=cpu

[dirty-writers-thoughput]
numjobs=10
# If true, fio will fsync() a dirty file on close.
# This differs from end_fsync in that it will happen on every
# file close, not just at the end of the job.
fsync_on_close=0

[fsync-writers-thoughput]
numjobs=10
# If true, fio will fsync() a dirty file on close.
# This differs from end_fsync in that it will happen on every
# file close, not just at the end of the job.
fsync_on_close=1

zfs pool was created with the following command:

zpool create testpool -o ashift=12 -O mountpoint=/mnt/testpool -O checksum=off -O compression=off -O atime=off -O dedup=off /dev/nvme0n1

Below is the properties set for the pool.

zfs get all testpool
root@plt-platform-helix:/mnt# zfs get all testpool
NAME      PROPERTY              VALUE                  SOURCE
testpool  type                  filesystem             -
testpool  creation              Fri Dec  6 17:29 2024  -
testpool  used                  456K                   -
testpool  available             860G                   -
testpool  referenced            96K                    -
testpool  compressratio         1.00x                  -
testpool  mounted               yes                    -
testpool  quota                 none                   default
testpool  reservation           none                   default
testpool  recordsize            128K                   default
testpool  mountpoint            /mnt/testpool          local
testpool  sharenfs              off                    default
testpool  checksum              off                    local
testpool  compression           off                    local
testpool  atime                 off                    local
testpool  devices               on                     default
testpool  exec                  on                     default
testpool  setuid                on                     default
testpool  readonly              off                    default
testpool  zoned                 off                    default
testpool  snapdir               hidden                 default
testpool  aclmode               discard                default
testpool  aclinherit            restricted             default
testpool  createtxg             1                      -
testpool  canmount              on                     default
testpool  xattr                 on                     default
testpool  copies                1                      default
testpool  version               5                      -
testpool  utf8only              off                    -
testpool  normalization         none                   -
testpool  casesensitivity       sensitive              -
testpool  vscan                 off                    default
testpool  nbmand                off                    default
testpool  sharesmb              off                    default
testpool  refquota              none                   default
testpool  refreservation        none                   default
testpool  guid                  9732613394759077339    -
testpool  primarycache          all                    default
testpool  secondarycache        all                    default
testpool  usedbysnapshots       0B                     -
testpool  usedbydataset         96K                    -
testpool  usedbychildren        360K                   -
testpool  usedbyrefreservation  0B                     -
testpool  logbias               latency                default
testpool  objsetid              54                     -
testpool  dedup                 off                    local
testpool  mlslabel              none                   default
testpool  sync                  standard               default
testpool  dnodesize             legacy                 default
testpool  refcompressratio      1.00x                  -
testpool  written               96K                    -
testpool  logicalused           156K                   -
testpool  logicalreferenced     42K                    -
testpool  volmode               default                default
testpool  filesystem_limit      none                   default
testpool  snapshot_limit        none                   default
testpool  filesystem_count      none                   default
testpool  snapshot_count        none                   default
testpool  snapdev               hidden                 default
testpool  acltype               off                    default
testpool  context               none                   default
testpool  fscontext             none                   default
testpool  defcontext            none                   default
testpool  rootcontext           none                   default
testpool  relatime              on                     default
testpool  redundant_metadata    all                    default
testpool  overlay               on                     default
testpool  encryption            off                    default
testpool  keylocation           none                   default
testpool  keyformat             none                   default
testpool  pbkdf2iters           0                      default
testpool  special_small_blocks  0                      default

No fsync writes

In this test, fio writes 10 MiB files using 10 threads without fsync. Each thread generates 200 files. The difference in maximum latency between ZFS and EXT4 is significant—241.7x (918.5 ms vs. 3.1 ms). And it is not even a bad case scenarion. The highest latency observed for ZFS was 1.2 seconds.

zfs: bw=1589MiB/s and lat (usec): min=10, max=918533, avg=731.80, stdev=22025.52
root@plt-platform-helix:/mnt# wipefs -a /dev/nvme0n1
/dev/nvme0n1: 8 bytes were erased at offset 0x00001000 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 8 bytes were erased at offset 0xdf90355000 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/nvme0n1: calling ioctl to re-read partition table: Success
root@plt-platform-helix:/mnt# zpool create testpool -O mountpoint=/mnt/testpool -o ashift=12 -O checksum=off -O compression=off -O atime=off -O dedup=off /dev/nvme0n1
root@plt-platform-helix:/mnt# zpool trim testpool
root@plt-platform-helix:/mnt# echo 3 > /proc/sys/vm/drop_caches
root@plt-platform-helix:/mnt# fio --section dirty-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
dirty-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 10 (f=10): [W(10)][33.3%][w=2167MiB/s][w=17.3k IOPS][eta 00m:08s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2439135: Fri Dec  6 18:36:24 2024
  write: IOPS=14.0k, BW=1754MiB/s (1839MB/s)(8558MiB/4879msec); 0 zone resets
    clat (usec): min=10, max=918533, avg=731.00, stdev=22025.51
     lat (usec): min=10, max=918533, avg=731.80, stdev=22025.52
    clat percentiles (usec):
     |  1.00th=[    14],  5.00th=[    15], 10.00th=[    16], 20.00th=[    16],
     | 30.00th=[    17], 40.00th=[    17], 50.00th=[    18], 60.00th=[    20],
     | 70.00th=[    21], 80.00th=[    23], 90.00th=[    28], 95.00th=[    49],
     | 99.00th=[  1516], 99.50th=[  2966], 99.90th=[ 12256], 99.95th=[530580],
     | 99.99th=[918553]
   bw (  MiB/s): min= 1315, max= 4442, per=100.00%, avg=2422.79, stdev=118.89, samples=60
   iops        : min=10524, max=35536, avg=19382.33, stdev=951.11, samples=60
  lat (usec)   : 20=66.86%, 50=28.25%, 100=1.80%, 250=0.67%, 500=0.46%
  lat (usec)   : 750=0.33%, 1000=0.24%
  lat (msec)   : 2=0.61%, 4=0.42%, 10=0.24%, 20=0.01%, 500=0.03%
  lat (msec)   : 750=0.03%, 1000=0.04%
  cpu          : usr=1.16%, sys=3.40%, ctx=29128, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,68525,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1754MiB/s (1839MB/s), 1754MiB/s-1754MiB/s (1839MB/s-1839MB/s), io=8558MiB (8974MB), run=4879-4879msec
Jobs: 10 (f=10): [W(10)][71.4%][w=1287MiB/s][w=10.3k IOPS][eta 00m:04s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2439135: Fri Dec  6 18:36:29 2024
  write: IOPS=12.6k, BW=1580MiB/s (1657MB/s)(15.2GiB/9878msec); 0 zone resets
    clat (usec): min=10, max=918533, avg=820.16, stdev=23914.44
     lat (usec): min=10, max=918533, avg=820.98, stdev=23914.46
    clat percentiles (usec):
     |  1.00th=[    14],  5.00th=[    15], 10.00th=[    16], 20.00th=[    17],
     | 30.00th=[    17], 40.00th=[    18], 50.00th=[    19], 60.00th=[    21],
     | 70.00th=[    22], 80.00th=[    24], 90.00th=[    29], 95.00th=[    38],
     | 99.00th=[   889], 99.50th=[  2278], 99.90th=[442500], 99.95th=[859833],
     | 99.99th=[918553]
   bw (  MiB/s): min= 1226, max= 4442, per=100.00%, avg=2180.26, stdev=90.12, samples=140
   iops        : min= 9812, max=35536, avg=17440.93, stdev=721.00, samples=140
  lat (usec)   : 20=58.85%, 50=37.75%, 100=1.38%, 250=0.46%, 500=0.27%
  lat (usec)   : 750=0.20%, 1000=0.14%
  lat (msec)   : 2=0.37%, 4=0.27%, 10=0.18%, 20=0.02%, 500=0.02%
  lat (msec)   : 750=0.04%, 1000=0.06%
  cpu          : usr=0.87%, sys=3.13%, ctx=31988, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,124935,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1580MiB/s (1657MB/s), 1580MiB/s-1580MiB/s (1657MB/s-1657MB/s), io=15.2GiB (16.4GB), run=9878-9878msec
Jobs: 9 (f=9): [W(6),_(1),W(1),f(1),W(1)][92.9%][w=1378MiB/s][w=11.0k IOPS][eta 00m:01s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2439135: Fri Dec  6 18:36:31 2024
  write: IOPS=12.7k, BW=1589MiB/s (1667MB/s)(19.5GiB/12584msec); 0 zone resets
    clat (usec): min=10, max=918533, avg=760.83, stdev=22905.76
     lat (usec): min=10, max=918533, avg=761.64, stdev=22905.77
    clat percentiles (usec):
     |  1.00th=[    14],  5.00th=[    15], 10.00th=[    16], 20.00th=[    17],
     | 30.00th=[    17], 40.00th=[    18], 50.00th=[    20], 60.00th=[    21],
     | 70.00th=[    23], 80.00th=[    25], 90.00th=[    30], 95.00th=[    41],
     | 99.00th=[   603], 99.50th=[  1991], 99.90th=[442500], 99.95th=[530580],
     | 99.99th=[918553]
   bw (  MiB/s): min=   30, max= 4442, per=100.00%, avg=1996.82, stdev=96.99, samples=187
   iops        : min=  241, max=35536, avg=15973.43, stdev=775.96, samples=187
  lat (usec)   : 20=56.86%, 50=39.71%, 100=1.75%, 250=0.39%, 500=0.22%
  lat (usec)   : 750=0.17%, 1000=0.11%
  lat (msec)   : 2=0.30%, 4=0.23%, 10=0.15%, 20=0.02%, 500=0.01%
  lat (msec)   : 750=0.04%, 1000=0.05%
  cpu          : usr=0.79%, sys=3.22%, ctx=33777, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1589MiB/s (1667MB/s), 1589MiB/s-1589MiB/s (1667MB/s-1667MB/s), io=19.5GiB (21.0GB), run=12584-12584msec
ext4: bw=1437MiB/s and lat (usec): min=65, max=3086, avg=846.51, stdev=297.89
root@plt-platform-helix:/mnt# zpool destroy testpool
root@plt-platform-helix:/mnt# wipefs -a /dev/nvme0n1 
/dev/nvme0n1: 8 bytes were erased at offset 0x00001000 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 8 bytes were erased at offset 0xdf90355000 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/nvme0n1: calling ioctl to re-read partition table: Success
root@plt-platform-helix:/mnt#  mkfs -t ext4 /dev/nvme0n1
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done                            
Creating filesystem with 234423126 4k blocks and 58613760 inodes
Filesystem UUID: 1351297c-106e-46c3-8e00-17e806ca61ab
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
        102400000, 214990848

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): ^[[Adone
Writing superblocks and filesystem accounting information: done     

root@plt-platform-helix:/mnt# mount /dev/nvme0n1 /mnt/testpool/
root@plt-platform-helix:/mnt# mount | grep testpool
/dev/nvme0n1 on /mnt/testpool type ext4 (rw,relatime,stripe=32)
root@plt-platform-helix:/mnt# fstrim /mnt/testpool/
root@plt-platform-helix:/mnt# echo 3 > /proc/sys/vm/drop_caches
root@plt-platform-helix:/mnt# fio --section dirty-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
dirty-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 10 (f=10): [W(10)][28.6%][w=1436MiB/s][w=11.5k IOPS][eta 00m:10s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2439468: Fri Dec  6 18:38:41 2024
  write: IOPS=11.5k, BW=1437MiB/s (1507MB/s)(7030MiB/4891msec); 0 zone resets
    clat (usec): min=65, max=2656, avg=845.70, stdev=288.05
     lat (usec): min=66, max=2657, avg=848.92, stdev=288.11
    clat percentiles (usec):
     |  1.00th=[   71],  5.00th=[  104], 10.00th=[  139], 20.00th=[  857],
     | 30.00th=[  922], 40.00th=[  955], 50.00th=[  971], 60.00th=[  971],
     | 70.00th=[  979], 80.00th=[  979], 90.00th=[  996], 95.00th=[ 1012],
     | 99.00th=[ 1057], 99.50th=[ 1074], 99.90th=[ 1106], 99.95th=[ 1123],
     | 99.99th=[ 2040]
   bw (  MiB/s): min= 1355, max= 1519, per=100.00%, avg=1438.81, stdev= 4.88, samples=90
   iops        : min=10842, max=12152, avg=11510.44, stdev=39.06, samples=90
  lat (usec)   : 100=2.62%, 250=10.41%, 500=0.03%, 750=0.23%, 1000=79.20%
  lat (msec)   : 2=7.50%, 4=0.01%
  cpu          : usr=3.40%, sys=7.09%, ctx=56385, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,56249,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1437MiB/s (1507MB/s), 1437MiB/s-1437MiB/s (1507MB/s-1507MB/s), io=7030MiB (7371MB), run=4891-4891msec

Disk stats (read/write):
  nvme0n1: ios=4/58671, sectors=32/15019360, merge=0/475, ticks=0/48747, in_queue=48747, util=98.14%
Jobs: 10 (f=10): [W(10)][64.3%][w=1436MiB/s][w=11.5k IOPS][eta 00m:05s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2439468: Fri Dec  6 18:38:46 2024
  write: IOPS=11.5k, BW=1437MiB/s (1507MB/s)(13.9GiB/9891msec); 0 zone resets
    clat (usec): min=65, max=3079, avg=849.93, stdev=290.13
     lat (usec): min=66, max=3080, avg=853.13, stdev=290.20
    clat percentiles (usec):
     |  1.00th=[   71],  5.00th=[  104], 10.00th=[  139], 20.00th=[  865],
     | 30.00th=[  930], 40.00th=[  963], 50.00th=[  971], 60.00th=[  979],
     | 70.00th=[  979], 80.00th=[  988], 90.00th=[  996], 95.00th=[ 1012],
     | 99.00th=[ 1074], 99.50th=[ 1074], 99.90th=[ 1106], 99.95th=[ 1123],
     | 99.99th=[ 2040]
   bw (  MiB/s): min= 1339, max= 1539, per=100.00%, avg=1438.34, stdev= 4.95, samples=190
   iops        : min=10716, max=12314, avg=11506.74, stdev=39.63, samples=190
  lat (usec)   : 100=2.97%, 250=10.06%, 500=0.02%, 750=0.17%, 1000=77.56%
  lat (msec)   : 2=9.21%, 4=0.01%
  cpu          : usr=2.41%, sys=7.44%, ctx=114295, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,113726,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1437MiB/s (1507MB/s), 1437MiB/s-1437MiB/s (1507MB/s-1507MB/s), io=13.9GiB (14.9GB), run=9891-9891msec

Disk stats (read/write):
  nvme0n1: ios=6/118588, sectors=48/30357632, merge=0/1190, ticks=1/99100, in_queue=99100, util=99.12%
Jobs: 5 (f=5): [_(4),W(1),_(1),W(3),f(1)][100.0%][w=1436MiB/s][w=11.5k IOPS][eta 00m:00s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2439468: Fri Dec  6 18:38:50 2024
  write: IOPS=11.5k, BW=1437MiB/s (1507MB/s)(19.5GiB/13919msec); 0 zone resets
    clat (usec): min=64, max=3085, avg=843.24, stdev=297.82
     lat (usec): min=65, max=3086, avg=846.51, stdev=297.89
    clat percentiles (usec):
     |  1.00th=[   72],  5.00th=[  104], 10.00th=[  137], 20.00th=[  865],
     | 30.00th=[  922], 40.00th=[  963], 50.00th=[  971], 60.00th=[  979],
     | 70.00th=[  988], 80.00th=[  988], 90.00th=[  996], 95.00th=[ 1012],
     | 99.00th=[ 1074], 99.50th=[ 1074], 99.90th=[ 1106], 99.95th=[ 1123],
     | 99.99th=[ 2040]
   bw (  MiB/s): min= 1325, max= 1618, per=100.00%, avg=1445.63, stdev= 6.15, samples=268
   iops        : min=10606, max=12946, avg=11565.07, stdev=49.20, samples=268
  lat (usec)   : 100=2.86%, 250=11.06%, 500=0.06%, 750=0.31%, 1000=75.96%
  lat (msec)   : 2=9.73%, 4=0.02%
  cpu          : usr=2.17%, sys=7.61%, ctx=160740, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1437MiB/s (1507MB/s), 1437MiB/s-1437MiB/s (1507MB/s-1507MB/s), io=19.5GiB (21.0GB), run=13919-13919msec

Disk stats (read/write):
  nvme0n1: ios=7/166527, sectors=56/42629392, merge=0/2042, ticks=1/138125, in_queue=138125, util=99.39%

Fsync writes

The same test as above is conducted, but this time an fsync is performed after each file is written.

zfs: bw=1210MiB/s and lat (usec): min=9, max=2205, avg=19.47, stdev=35.62
root@plt-platform-helix:/mnt# wipefs -a /dev/nvme0n1 
/dev/nvme0n1: 2 bytes were erased at offset 0x00000438 (ext4): 53 ef
root@plt-platform-helix:/mnt# zpool create testpool -O mountpoint=/mnt/testpool -o ashift=12 -O checksum=off -O compression=off -O atime=off -O dedup=off /dev/nvme0n1
root@plt-platform-helix:/mnt# mount | grep testpool
/mnt/testpool on /mnt/testpool type zfs (rw,noatime,xattr,noacl,casesensitive)
root@plt-platform-helix:/mnt# zpool trim testpool
root@plt-platform-helix:/mnt# echo 3 > /proc/sys/vm/drop_caches
root@plt-platform-helix:/mnt# fio --section fsync-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
fsync-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 10 (f=10): [W(10)][28.6%][w=1379MiB/s][w=11.0k IOPS][eta 00m:10s]
fsync-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2440009: Fri Dec  6 18:40:35 2024
 write: IOPS=11.5k, BW=1419MiB/s (1488MB/s)(6920MiB/4878msec); 0 zone resets
   clat (usec): min=8, max=2205, avg=24.60, stdev=58.71
    lat (usec): min=9, max=2205, avg=25.29, stdev=58.75
   clat percentiles (usec):
    |  1.00th=[   12],  5.00th=[   12], 10.00th=[   13], 20.00th=[   14],
    | 30.00th=[   14], 40.00th=[   15], 50.00th=[   15], 60.00th=[   15],
    | 70.00th=[   16], 80.00th=[   17], 90.00th=[   26], 95.00th=[   64],
    | 99.00th=[  265], 99.50th=[  363], 99.90th=[  783], 99.95th=[ 1045],
    | 99.99th=[ 1647]
  bw (  MiB/s): min= 1399, max= 1613, per=100.00%, avg=1426.32, stdev= 7.51, samples=90
  iops        : min=11191, max=12907, avg=11410.00, stdev=60.00, samples=90
 lat (usec)   : 10=0.36%, 20=86.00%, 50=7.87%, 100=2.96%, 250=1.73%
 lat (usec)   : 500=0.81%, 750=0.16%, 1000=0.06%
 lat (msec)   : 2=0.05%, 4=0.01%
 cpu          : usr=0.88%, sys=3.14%, ctx=9064, majf=0, minf=0
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued rwts: total=0,55954,0,0 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
 WRITE: bw=1419MiB/s (1488MB/s), 1419MiB/s-1419MiB/s (1488MB/s-1488MB/s), io=6920MiB (7256MB), run=4878-4878msec
Jobs: 10 (f=10): [W(10)][66.7%][w=1392MiB/s][w=11.1k IOPS][eta 00m:05s]
fsync-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2440009: Fri Dec  6 18:40:40 2024
 write: IOPS=11.3k, BW=1409MiB/s (1477MB/s)(13.6GiB/9879msec); 0 zone resets
   clat (usec): min=8, max=2205, avg=20.06, stdev=42.22
    lat (usec): min=9, max=2205, avg=20.74, stdev=42.26
   clat percentiles (usec):
    |  1.00th=[   12],  5.00th=[   12], 10.00th=[   13], 20.00th=[   13],
    | 30.00th=[   14], 40.00th=[   14], 50.00th=[   15], 60.00th=[   15],
    | 70.00th=[   16], 80.00th=[   17], 90.00th=[   22], 95.00th=[   32],
    | 99.00th=[  161], 99.50th=[  265], 99.90th=[  578], 99.95th=[  783],
    | 99.99th=[ 1467]
  bw (  MiB/s): min= 1340, max= 1637, per=100.00%, avg=1412.47, stdev= 6.49, samples=190
  iops        : min=10720, max=13099, avg=11299.47, stdev=51.90, samples=190
 lat (usec)   : 10=0.20%, 20=88.14%, 50=8.61%, 100=1.62%, 250=0.90%
 lat (usec)   : 500=0.41%, 750=0.08%, 1000=0.03%
 lat (msec)   : 2=0.02%, 4=0.01%
 cpu          : usr=0.75%, sys=3.05%, ctx=12792, majf=0, minf=0
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued rwts: total=0,111946,0,0 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
 WRITE: bw=1409MiB/s (1477MB/s), 1409MiB/s-1409MiB/s (1477MB/s-1477MB/s), io=13.6GiB (14.6GB), run=9879-9879msec
Jobs: 10 (f=10): [W(10)][100.0%][w=1417MiB/s][w=11.3k IOPS][eta 00m:00s]
fsync-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2440009: Fri Dec  6 18:40:47 2024
 write: IOPS=10.8k, BW=1344MiB/s (1410MB/s)(19.5GiB/14878msec); 0 zone resets
   clat (usec): min=8, max=2205, avg=18.79, stdev=35.57
    lat (usec): min=9, max=2205, avg=19.47, stdev=35.62
   clat percentiles (usec):
    |  1.00th=[   12],  5.00th=[   12], 10.00th=[   13], 20.00th=[   13],
    | 30.00th=[   14], 40.00th=[   14], 50.00th=[   15], 60.00th=[   15],
    | 70.00th=[   16], 80.00th=[   17], 90.00th=[   22], 95.00th=[   31],
    | 99.00th=[  103], 99.50th=[  212], 99.90th=[  490], 99.95th=[  668],
    | 99.99th=[ 1303]
  bw (  MiB/s): min= 1340, max= 1637, per=100.00%, avg=1407.53, stdev= 5.89, samples=280
  iops        : min=10720, max=13099, avg=11260.07, stdev=47.11, samples=280
 lat (usec)   : 10=0.14%, 20=88.19%, 50=9.42%, 100=1.23%, 250=0.64%
 lat (usec)   : 500=0.29%, 750=0.06%, 1000=0.02%
 lat (msec)   : 2=0.02%, 4=0.01%
 cpu          : usr=0.61%, sys=4.07%, ctx=186467, majf=0, minf=0
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
 WRITE: bw=1344MiB/s (1410MB/s), 1344MiB/s-1344MiB/s (1410MB/s-1410MB/s), io=19.5GiB (21.0GB), run=14878-14878msec
Jobs: 10 (f=10): [F(10)][100.0%][w=116MiB/s][w=928 IOPS][eta 00m:00s]   
fsync-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2440009: Fri Dec  6 18:40:47 2024
 write: IOPS=9678, BW=1210MiB/s (1269MB/s)(19.5GiB/16531msec); 0 zone resets
   clat (usec): min=8, max=2205, avg=18.79, stdev=35.57
    lat (usec): min=9, max=2205, avg=19.47, stdev=35.62
   clat percentiles (usec):
    |  1.00th=[   12],  5.00th=[   12], 10.00th=[   13], 20.00th=[   13],
    | 30.00th=[   14], 40.00th=[   14], 50.00th=[   15], 60.00th=[   15],
    | 70.00th=[   16], 80.00th=[   17], 90.00th=[   22], 95.00th=[   31],
    | 99.00th=[  103], 99.50th=[  212], 99.90th=[  490], 99.95th=[  668],
    | 99.99th=[ 1303]
  bw (  MiB/s): min= 1340, max= 1637, per=100.00%, avg=1407.53, stdev= 5.89, samples=280
  iops        : min=10720, max=13099, avg=11260.07, stdev=47.11, samples=280
 lat (usec)   : 10=0.14%, 20=88.19%, 50=9.42%, 100=1.23%, 250=0.64%
 lat (usec)   : 500=0.29%, 750=0.06%, 1000=0.02%
 lat (msec)   : 2=0.02%, 4=0.01%
 cpu          : usr=0.55%, sys=3.66%, ctx=186970, majf=0, minf=0
 IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
 WRITE: bw=1210MiB/s (1269MB/s), 1210MiB/s-1210MiB/s (1269MB/s-1269MB/s), io=19.5GiB (21.0GB), run=16531-16531msec
ext4: bw=1421MiB/s and lat (usec): min=67, max=4474, avg=840.67, stdev=318.26
root@plt-platform-helix:/mnt# wipefs -a /dev/nvme0n1 
/dev/nvme0n1: 8 bytes were erased at offset 0x00001000 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 8 bytes were erased at offset 0xdf90355000 (gpt): 45 46 49 20 50 41 52 54
/dev/nvme0n1: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/nvme0n1: calling ioctl to re-read partition table: Success
root@plt-platform-helix:/mnt# mkfs -t ext4 /dev/nvme0n1
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done                            
Creating filesystem with 234423126 4k blocks and 58613760 inodes
Filesystem UUID: 4f05f73b-98ff-4a7c-b33a-1ba3c8c2b62d
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
        102400000, 214990848

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done     

root@plt-platform-helix:/mnt# mount /dev/nvme0n1 /mnt/
ram/      slowpool/ testpool/ 
root@plt-platform-helix:/mnt# mount /dev/nvme0n1 /mnt/testpool/
root@plt-platform-helix:/mnt# mount | grep testpool
/dev/nvme0n1 on /mnt/testpool type ext4 (rw,relatime,stripe=32)
root@plt-platform-helix:/mnt# fstrim /mnt/testpool/
root@plt-platform-helix:/mnt# echo 3 > /proc/sys/vm/drop_caches
root@plt-platform-helix:/mnt# fio --section fsync-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
fsync-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 10 (f=10): [W(10)][28.6%][w=1423MiB/s][w=11.4k IOPS][eta 00m:10s]
fsync-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2440418: Fri Dec  6 18:43:10 2024
  write: IOPS=11.4k, BW=1425MiB/s (1494MB/s)(6970MiB/4891msec); 0 zone resets
    clat (usec): min=65, max=4469, avg=837.31, stdev=318.19
     lat (usec): min=67, max=4474, avg=840.67, stdev=318.26
    clat percentiles (usec):
     |  1.00th=[   77],  5.00th=[  105], 10.00th=[  145], 20.00th=[  816],
     | 30.00th=[  889], 40.00th=[  955], 50.00th=[  963], 60.00th=[  971],
     | 70.00th=[  979], 80.00th=[  979], 90.00th=[  996], 95.00th=[ 1029],
     | 99.00th=[ 1221], 99.50th=[ 1729], 99.90th=[ 2769], 99.95th=[ 3326],
     | 99.99th=[ 4228]
   bw (  MiB/s): min= 1343, max= 1509, per=100.00%, avg=1425.56, stdev= 5.24, samples=90
   iops        : min=10746, max=12072, avg=11404.44, stdev=41.93, samples=90
  lat (usec)   : 100=2.35%, 250=11.16%, 500=1.22%, 750=1.93%, 1000=73.80%
  lat (msec)   : 2=9.27%, 4=0.26%, 10=0.02%
  cpu          : usr=1.45%, sys=9.24%, ctx=60669, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,55767,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1425MiB/s (1494MB/s), 1425MiB/s-1425MiB/s (1494MB/s-1494MB/s), io=6970MiB (7308MB), run=4891-4891msec

Disk stats (read/write):
  nvme0n1: ios=4/59496, sectors=32/14975400, merge=0/8973, ticks=0/46807, in_queue=46808, util=97.94%
Jobs: 10 (f=10): [W(10)][64.3%][w=1426MiB/s][w=11.4k IOPS][eta 00m:05s]
fsync-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2440418: Fri Dec  6 18:43:15 2024
  write: IOPS=11.4k, BW=1424MiB/s (1493MB/s)(13.8GiB/9891msec); 0 zone resets
    clat (usec): min=65, max=4469, avg=841.56, stdev=325.10
     lat (usec): min=66, max=4474, avg=844.95, stdev=325.21
    clat percentiles (usec):
     |  1.00th=[   72],  5.00th=[  106], 10.00th=[  147], 20.00th=[  824],
     | 30.00th=[  889], 40.00th=[  955], 50.00th=[  971], 60.00th=[  971],
     | 70.00th=[  979], 80.00th=[  988], 90.00th=[ 1004], 95.00th=[ 1037],
     | 99.00th=[ 1303], 99.50th=[ 1926], 99.90th=[ 2835], 99.95th=[ 3326],
     | 99.99th=[ 3851]
   bw (  MiB/s): min= 1330, max= 1513, per=100.00%, avg=1424.79, stdev= 4.81, samples=190
   iops        : min=10642, max=12106, avg=11398.32, stdev=38.45, samples=190
  lat (usec)   : 100=2.41%, 250=11.04%, 500=1.54%, 750=1.88%, 1000=72.09%
  lat (msec)   : 2=10.65%, 4=0.38%, 10=0.01%
  cpu          : usr=2.40%, sys=7.69%, ctx=123682, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,112659,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1424MiB/s (1493MB/s), 1424MiB/s-1424MiB/s (1493MB/s-1493MB/s), io=13.8GiB (14.8GB), run=9891-9891msec

Disk stats (read/write):
  nvme0n1: ios=9/120153, sectors=72/30247032, merge=0/18753, ticks=2/94737, in_queue=94739, util=98.10%
Jobs: 10 (f=10): [W(10)][100.0%][w=1420MiB/s][w=11.4k IOPS][eta 00m:00s]
fsync-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=2440418: Fri Dec  6 18:43:19 2024
  write: IOPS=11.4k, BW=1421MiB/s (1490MB/s)(19.5GiB/14072msec); 0 zone resets
    clat (usec): min=62, max=4469, avg=842.43, stdev=332.77
     lat (usec): min=64, max=4474, avg=845.90, stdev=332.90
    clat percentiles (usec):
     |  1.00th=[   73],  5.00th=[  108], 10.00th=[  145], 20.00th=[  816],
     | 30.00th=[  889], 40.00th=[  955], 50.00th=[  971], 60.00th=[  979],
     | 70.00th=[  988], 80.00th=[  988], 90.00th=[ 1004], 95.00th=[ 1045],
     | 99.00th=[ 1336], 99.50th=[ 1991], 99.90th=[ 3032], 99.95th=[ 3392],
     | 99.99th=[ 3752]
   bw (  MiB/s): min= 1328, max= 1519, per=100.00%, avg=1424.40, stdev= 4.57, samples=276
   iops        : min=10630, max=12154, avg=11395.19, stdev=36.56, samples=276
  lat (usec)   : 100=2.24%, 250=11.47%, 500=1.70%, 750=1.94%, 1000=71.32%
  lat (msec)   : 2=10.86%, 4=0.46%, 10=0.01%
  cpu          : usr=1.74%, sys=8.32%, ctx=176177, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1421MiB/s (1490MB/s), 1421MiB/s-1421MiB/s (1490MB/s-1490MB/s), io=19.5GiB (21.0GB), run=14072-14072msec

Disk stats (read/write):
  nvme0n1: ios=10/168595, sectors=80/42445408, merge=0/26638, ticks=2/132971, in_queue=132973, util=98.10%
@sargarass sargarass added the Type: Defect Incorrect behavior (e.g. crash, hang) label Dec 6, 2024
@tonyhutter
Copy link
Contributor

Side note: zpool iostat -w will show you a histogram of all ZIO latencies since boot. zpool iostat -l will show you averages over the sample period.

@sargarass
Copy link
Author

zpool iostat -w

after reboot, created testpool using

zpool create testpool -O mountpoint=/mnt/testpool -o ashift=12 /dev/nvme0n1

then run this command 2 times:

fio --section dirty-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
output
root@plt-platform-helix:/home/ipa/abornev# zpool create testpool -O mountpoint=/mnt/testpool -o ashift=12 /dev/nvme0n1
root@plt-platform-helix:/home/ipa/abornev# zpool trim testpool
root@plt-platform-helix:/home/ipa/abornev# echo 3 > /proc/sys/vm/drop_caches
root@plt-platform-helix:/home/ipa/abornev# 
root@plt-platform-helix:/home/ipa/abornev# fio --section dirty-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
dirty-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 10 (f=8): [W(10)][30.8%][w=2224MiB/s][w=17.8k IOPS][eta 00m:09s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=19125: Tue Dec 10 14:29:59 2024
  write: IOPS=14.3k, BW=1778MiB/s (1864MB/s)(8676MiB/4880msec); 0 zone resets
    clat (usec): min=9, max=966916, avg=651.05, stdev=20845.66
     lat (usec): min=10, max=966917, avg=652.13, stdev=20845.67
    clat percentiles (usec):
     |  1.00th=[    15],  5.00th=[    16], 10.00th=[    16], 20.00th=[    18],
     | 30.00th=[    19], 40.00th=[    20], 50.00th=[    20], 60.00th=[    21],
     | 70.00th=[    22], 80.00th=[    24], 90.00th=[    42], 95.00th=[   124],
     | 99.00th=[  2040], 99.50th=[  3294], 99.90th=[  8979], 99.95th=[442500],
     | 99.99th=[960496]
   bw (  MiB/s): min= 1089, max= 5009, per=100.00%, avg=2450.46, stdev=146.86, samples=60
   iops        : min= 8712, max=40078, avg=19603.67, stdev=1174.87, samples=60
  lat (usec)   : 10=0.01%, 20=51.92%, 50=39.36%, 100=2.98%, 250=2.00%
  lat (usec)   : 500=0.81%, 750=0.55%, 1000=0.42%
  lat (msec)   : 2=0.94%, 4=0.70%, 10=0.23%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.04%, 1000=0.04%
  cpu          : usr=1.89%, sys=4.26%, ctx=33796, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,69593,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1778MiB/s (1864MB/s), 1778MiB/s-1778MiB/s (1864MB/s-1864MB/s), io=8676MiB (9097MB), run=4880-4880msec
Jobs: 10 (f=10): [W(10)][64.3%][w=1761MiB/s][w=14.1k IOPS][eta 00m:05s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=19125: Tue Dec 10 14:30:04 2024
  write: IOPS=12.5k, BW=1559MiB/s (1635MB/s)(15.0GiB/9879msec); 0 zone resets
    clat (usec): min=9, max=966916, avg=769.29, stdev=22712.79
     lat (usec): min=10, max=966917, avg=770.41, stdev=22712.80
    clat percentiles (usec):
     |  1.00th=[    15],  5.00th=[    16], 10.00th=[    17], 20.00th=[    18],
     | 30.00th=[    19], 40.00th=[    20], 50.00th=[    20], 60.00th=[    21],
     | 70.00th=[    22], 80.00th=[    25], 90.00th=[    41], 95.00th=[   106],
     | 99.00th=[  1893], 99.50th=[  3195], 99.90th=[396362], 99.95th=[471860],
     | 99.99th=[952108]
   bw (  MiB/s): min=  928, max= 5009, per=100.00%, avg=2352.71, stdev=117.73, samples=130
   iops        : min= 7430, max=40078, avg=18821.69, stdev=941.80, samples=130
  lat (usec)   : 10=0.01%, 20=51.33%, 50=40.08%, 100=3.37%, 250=1.98%
  lat (usec)   : 500=0.69%, 750=0.45%, 1000=0.35%
  lat (msec)   : 2=0.79%, 4=0.62%, 10=0.21%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.06%, 1000=0.05%
  cpu          : usr=1.43%, sys=3.73%, ctx=48955, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,123264,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1559MiB/s (1635MB/s), 1559MiB/s-1559MiB/s (1635MB/s-1635MB/s), io=15.0GiB (16.1GB), run=9879-9879msec
Jobs: 8 (f=8): [_(1),W(5),_(1),W(3)][86.7%][w=1603MiB/s][w=12.8k IOPS][eta 00m:02s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=19125: Tue Dec 10 14:30:07 2024
  write: IOPS=12.7k, BW=1586MiB/s (1663MB/s)(19.5GiB/12611msec); 0 zone resets
    clat (usec): min=9, max=966916, avg=724.86, stdev=22294.09
     lat (usec): min=10, max=966917, avg=726.01, stdev=22294.11
    clat percentiles (usec):
     |  1.00th=[    15],  5.00th=[    16], 10.00th=[    17], 20.00th=[    18],
     | 30.00th=[    19], 40.00th=[    20], 50.00th=[    20], 60.00th=[    21],
     | 70.00th=[    22], 80.00th=[    24], 90.00th=[    36], 95.00th=[    92],
     | 99.00th=[  1680], 99.50th=[  2999], 99.90th=[104334], 99.95th=[476054],
     | 99.99th=[952108]
   bw (  MiB/s): min=  597, max= 5009, per=100.00%, avg=2095.59, stdev=115.39, samples=180
   iops        : min= 4780, max=40078, avg=16764.74, stdev=923.12, samples=180
  lat (usec)   : 10=0.01%, 20=51.69%, 50=40.37%, 100=3.25%, 250=1.83%
  lat (usec)   : 500=0.63%, 750=0.40%, 1000=0.31%
  lat (msec)   : 2=0.68%, 4=0.53%, 10=0.18%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.05%, 1000=0.05%
  cpu          : usr=1.33%, sys=3.90%, ctx=54607, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1586MiB/s (1663MB/s), 1586MiB/s-1586MiB/s (1663MB/s-1663MB/s), io=19.5GiB (21.0GB), run=12611-12611msec

root@plt-platform-helix:/home/ipa/abornev# fio --section dirty-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
dirty-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 10 (f=10): [W(10)][38.5%][w=2156MiB/s][w=17.2k IOPS][eta 00m:08s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=19358: Tue Dec 10 14:31:21 2024
  write: IOPS=14.3k, BW=1792MiB/s (1879MB/s)(8745MiB/4880msec); 0 zone resets
    clat (usec): min=10, max=938354, avg=728.89, stdev=21651.15
     lat (usec): min=11, max=938354, avg=729.98, stdev=21651.16
    clat percentiles (usec):
     |  1.00th=[    15],  5.00th=[    16], 10.00th=[    17], 20.00th=[    18],
     | 30.00th=[    19], 40.00th=[    20], 50.00th=[    21], 60.00th=[    21],
     | 70.00th=[    22], 80.00th=[    25], 90.00th=[    39], 95.00th=[   124],
     | 99.00th=[  2147], 99.50th=[  3458], 99.90th=[ 22414], 99.95th=[467665],
     | 99.99th=[910164]
   bw (  MiB/s): min= 1241, max= 4453, per=100.00%, avg=2429.25, stdev=118.79, samples=60
   iops        : min= 9932, max=35628, avg=19434.00, stdev=950.35, samples=60
  lat (usec)   : 20=45.14%, 50=46.47%, 100=2.68%, 250=1.95%, 500=0.85%
  lat (usec)   : 750=0.49%, 1000=0.39%
  lat (msec)   : 2=0.93%, 4=0.75%, 10=0.23%, 20=0.02%, 50=0.01%
  lat (msec)   : 250=0.01%, 500=0.06%, 1000=0.04%
  cpu          : usr=2.15%, sys=4.32%, ctx=31674, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,70028,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1792MiB/s (1879MB/s), 1792MiB/s-1792MiB/s (1879MB/s-1879MB/s), io=8745MiB (9170MB), run=4880-4880msec
Jobs: 10 (f=10): [W(10)][66.7%][w=1575MiB/s][w=12.6k IOPS][eta 00m:05s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=19358: Tue Dec 10 14:31:26 2024
  write: IOPS=12.8k, BW=1585MiB/s (1662MB/s)(15.3GiB/9881msec); 0 zone resets
    clat (usec): min=10, max=938354, avg=758.96, stdev=22399.77
     lat (usec): min=11, max=938354, avg=760.07, stdev=22399.80
    clat percentiles (usec):
     |  1.00th=[    15],  5.00th=[    16], 10.00th=[    18], 20.00th=[    19],
     | 30.00th=[    19], 40.00th=[    20], 50.00th=[    21], 60.00th=[    21],
     | 70.00th=[    22], 80.00th=[    25], 90.00th=[    36], 95.00th=[   102],
     | 99.00th=[  1909], 99.50th=[  3294], 99.90th=[413139], 99.95th=[501220],
     | 99.99th=[910164]
   bw (  MiB/s): min=  781, max= 4453, per=100.00%, avg=1816.65, stdev=93.98, samples=160
   iops        : min= 6249, max=35628, avg=14532.56, stdev=751.86, samples=160
  lat (usec)   : 20=47.27%, 50=44.59%, 100=3.07%, 250=2.01%, 500=0.71%
  lat (usec)   : 750=0.38%, 1000=0.30%
  lat (msec)   : 2=0.72%, 4=0.62%, 10=0.20%, 20=0.03%, 50=0.01%
  lat (msec)   : 250=0.01%, 500=0.05%, 750=0.01%, 1000=0.05%
  cpu          : usr=1.57%, sys=3.83%, ctx=43363, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,126445,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1585MiB/s (1662MB/s), 1585MiB/s-1585MiB/s (1662MB/s-1662MB/s), io=15.3GiB (16.4GB), run=9881-9881msec
Jobs: 10 (f=10): [W(10)][86.7%][w=1624MiB/s][w=13.0k IOPS][eta 00m:02s]02s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=19358: Tue Dec 10 14:31:29 2024
  write: IOPS=12.9k, BW=1610MiB/s (1688MB/s)(19.5GiB/12424msec); 0 zone resets
    clat (usec): min=10, max=938354, avg=741.16, stdev=22308.85
     lat (usec): min=11, max=938354, avg=742.27, stdev=22308.88
    clat percentiles (usec):
     |  1.00th=[    15],  5.00th=[    16], 10.00th=[    18], 20.00th=[    19],
     | 30.00th=[    19], 40.00th=[    20], 50.00th=[    21], 60.00th=[    21],
     | 70.00th=[    22], 80.00th=[    25], 90.00th=[    35], 95.00th=[    92],
     | 99.00th=[  1778], 99.50th=[  3228], 99.90th=[ 29230], 99.95th=[505414],
     | 99.99th=[910164]
   bw (  MiB/s): min=  716, max= 4453, per=100.00%, avg=1738.67, stdev=88.24, samples=210
   iops        : min= 5731, max=35628, avg=13908.38, stdev=705.96, samples=210
  lat (usec)   : 20=47.90%, 50=44.20%, 100=3.21%, 250=1.91%, 500=0.65%
  lat (usec)   : 750=0.35%, 1000=0.27%
  lat (msec)   : 2=0.62%, 4=0.58%, 10=0.19%, 20=0.03%, 50=0.01%
  lat (msec)   : 250=0.01%, 500=0.04%, 750=0.01%, 1000=0.05%
  cpu          : usr=1.39%, sys=3.88%, ctx=48305, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1610MiB/s (1688MB/s), 1610MiB/s-1610MiB/s (1688MB/s-1688MB/s), io=19.5GiB (21.0GB), run=12424-12424msec
root@plt-platform-helix:/home/ipa/abornev# zpool iostat -w testpool

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      8     20      0     57      0      0      0
511ns           0      0      0      0      2    267      0    231      0      2      0
1us             0      0      0      0      0    289      0    471      0    185      0
2us             0      0      0      0      4     56      0    509      0     34      0
4us             0      0      0      0      0      1      0    132      0      0      0
8us             0      0      0      0      0      5      0    211      0      1      0
16us            0    211      0    431      0      0      0    386      0      0      0
32us            0  1.22K      0  5.15K      0      0      0    842      0      0      0
65us            0  1.80K      0  9.57K      0      0      0  1.14K      0      0      0
131us           3  2.30K      3   118K      0      0      0  1.96K      0      0      0
262us           5  1.96K      5  65.8K      0      0      0  1.34K      0      2      0
524us           2    477      2  9.51K      0      0      0    389      0    107      0
1ms             4    471      4  85.3K      0      0      0    578      0      1      0
2ms             0  1.60K      0    473      0      0      0  8.66K      0    212      0
4ms             0  97.8K      0    274      0      0      0  96.4K      0  5.62K      0
8ms             0   138K      0     12      0      0      0   158K      0    808      0
16ms            0  38.6K      0      0      0      0      0  12.8K      0      0      0
33ms            0  1.27K      0      0      0      0      0  1.06K      0      0      0
67ms            0    684      0      0      0      0      0    675      0      0      0
134ms           0  1.54K      0      0      0      0      0  1.55K      0      0      0
268ms           0  3.52K      0      0      0      0      0  3.51K      0      0      0
536ms           0  3.46K      0      0      0      0      0  3.45K      0      0      0
1s              0      2      0      0      0      0      0      2      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------
root@plt-platform-helix:/home/ipa/abornev# zpool iostat -l testpool
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
testpool    1.37M   888G      0    751  2.35K  90.9M  372us   13ms  372us  331us  603ns  702ns      -   12ms      -    3ms      -

@amotin
Copy link
Member

amotin commented Dec 10, 2024

In ZFS, due to its transactional nature and as result heavy write buffering, write latency is defined not so much by disk write latency, as by artificial write throttling algorithm it implements to keep latency more or less constant within several seconds of transaction group commit process. But it would be good to investigate those outliers indeed. It may be the throttling needs to be tuned, or there is something else unrelated.

@sargarass
Copy link
Author

sargarass commented Dec 10, 2024

I tried adjusting zfs_dirty_data_max from the default value of 4 GiB to 1 GiB. The maximum latency indeed decreased, changing from 918 ms to 66 ms.

This also affected the percentiles. This time, they look much closer to those of ext4 and provide better throughput than zfs + writes fsync (even though at 15x worse latencies at the 95th percentile). However, I see a potential issue with this approach: zfs_dirty_data_max is a system-wide setting, meaning it affects every pool on the system. This raises the questions:

  • Are these settings optimal for all pools on the same server? (probably not)
  • And could the latencies at the 95th percentile have been lowered without affecting the throughput? ext4 latencies can be significantly lower when O_DIRECT is not used (jobfile.fio does set it on).

After tuning

    |  1.00th=[   19],  5.00th=[  644], 10.00th=[  750], 20.00th=[  799],
    | 30.00th=[  824], 40.00th=[  840], 50.00th=[  848], 60.00th=[  857],
    | 70.00th=[  865], 80.00th=[  873], 90.00th=[  889], 95.00th=[  906],
    | 99.00th=[  963], 99.50th=[  996], 99.90th=[ 3982], 99.95th=[ 4883],
    | 99.99th=[13042]

vs before tuning

    clat percentiles (usec):
     |  1.00th=[    14],  5.00th=[    15], 10.00th=[    16], 20.00th=[    16],
     | 30.00th=[    17], 40.00th=[    17], 50.00th=[    18], 60.00th=[    20],
     | 70.00th=[    21], 80.00th=[    23], 90.00th=[    28], 95.00th=[    49],
     | 99.00th=[  1516], 99.50th=[  2966], 99.90th=[ 12256], 99.95th=[530580],
     | 99.99th=[918553]

vs fsync after each file

    |  1.00th=[   12],  5.00th=[   12], 10.00th=[   13], 20.00th=[   14],
    | 30.00th=[   14], 40.00th=[   15], 50.00th=[   15], 60.00th=[   15],
    | 70.00th=[   16], 80.00th=[   17], 90.00th=[   26], 95.00th=[   64],
    | 99.00th=[  265], 99.50th=[  363], 99.90th=[  783], 99.95th=[ 1045],
    | 99.99th=[ 1647]
fio
root@plt-platform-helix:/home/ipa/abornev# echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max
root@plt-platform-helix:/home/ipa/abornev# fio --section dirty-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
dirty-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 10 (f=10): [W(10)][30.8%][w=1425MiB/s][w=11.4k IOPS][eta 00m:09s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=65025: Tue Dec 10 17:08:32 2024
  write: IOPS=12.4k, BW=1553MiB/s (1628MB/s)(7579MiB/4881msec); 0 zone resets
    clat (usec): min=10, max=44347, avg=772.01, stdev=479.45
     lat (usec): min=11, max=44349, avg=774.23, stdev=479.61
    clat percentiles (usec):
     |  1.00th=[   17],  5.00th=[   22], 10.00th=[  553], 20.00th=[  766],
     | 30.00th=[  807], 40.00th=[  832], 50.00th=[  840], 60.00th=[  857],
     | 70.00th=[  857], 80.00th=[  865], 90.00th=[  881], 95.00th=[  906],
     | 99.00th=[  963], 99.50th=[ 1037], 99.90th=[ 4359], 99.95th=[ 5407],
     | 99.99th=[17957]
   bw (  MiB/s): min= 1360, max= 2699, per=100.00%, avg=1564.03, stdev=40.86, samples=90
   iops        : min=10886, max=21592, avg=12512.22, stdev=326.90, samples=90
  lat (usec)   : 20=4.21%, 50=3.93%, 100=0.27%, 250=0.57%, 500=0.77%
  lat (usec)   : 750=6.70%, 1000=82.98%
  lat (msec)   : 2=0.25%, 4=0.18%, 10=0.11%, 20=0.01%, 50=0.01%
  cpu          : usr=2.38%, sys=5.22%, ctx=60790, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,60655,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1553MiB/s (1628MB/s), 1553MiB/s-1553MiB/s (1628MB/s-1628MB/s), io=7579MiB (7947MB), run=4881-4881msec
Jobs: 10 (f=9): [W(10)][69.2%][w=1423MiB/s][w=11.4k IOPS][eta 00m:04s] 
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=65025: Tue Dec 10 17:08:37 2024
  write: IOPS=11.9k, BW=1487MiB/s (1559MB/s)(14.3GiB/9881msec); 0 zone resets
    clat (usec): min=10, max=44347, avg=809.20, stdev=393.21
     lat (usec): min=11, max=44349, avg=811.52, stdev=393.31
    clat percentiles (usec):
     |  1.00th=[   18],  5.00th=[  392], 10.00th=[  742], 20.00th=[  799],
     | 30.00th=[  824], 40.00th=[  840], 50.00th=[  848], 60.00th=[  857],
     | 70.00th=[  865], 80.00th=[  873], 90.00th=[  881], 95.00th=[  906],
     | 99.00th=[  963], 99.50th=[  996], 99.90th=[ 3949], 99.95th=[ 4883],
     | 99.99th=[15401]
   bw (  MiB/s): min= 1350, max= 2699, per=100.00%, avg=1490.28, stdev=28.96, samples=190
   iops        : min=10802, max=21592, avg=11922.21, stdev=231.69, samples=190
  lat (usec)   : 20=2.17%, 50=2.03%, 100=0.14%, 250=0.37%, 500=0.52%
  lat (usec)   : 750=5.44%, 1000=88.87%
  lat (msec)   : 2=0.23%, 4=0.14%, 10=0.08%, 20=0.01%, 50=0.01%
  cpu          : usr=2.07%, sys=5.05%, ctx=120053, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,117556,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1487MiB/s (1559MB/s), 1487MiB/s-1487MiB/s (1559MB/s-1559MB/s), io=14.3GiB (15.4GB), run=9881-9881msec
Jobs: 10 (f=9): [W(10)][100.0%][w=1422MiB/s][w=11.4k IOPS][eta 00m:00s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=65025: Tue Dec 10 17:08:41 2024
  write: IOPS=11.7k, BW=1464MiB/s (1535MB/s)(19.5GiB/13658msec); 0 zone resets
    clat (usec): min=10, max=64022, avg=818.43, stdev=422.52
     lat (usec): min=11, max=64025, avg=820.78, stdev=422.58
    clat percentiles (usec):
     |  1.00th=[   19],  5.00th=[  644], 10.00th=[  750], 20.00th=[  799],
     | 30.00th=[  824], 40.00th=[  840], 50.00th=[  848], 60.00th=[  857],
     | 70.00th=[  865], 80.00th=[  873], 90.00th=[  889], 95.00th=[  906],
     | 99.00th=[  963], 99.50th=[  996], 99.90th=[ 3982], 99.95th=[ 4883],
     | 99.99th=[13042]
   bw (  MiB/s): min= 1318, max= 2699, per=100.00%, avg=1470.41, stdev=24.57, samples=270
   iops        : min=10544, max=21592, avg=11763.26, stdev=196.56, samples=270
  lat (usec)   : 20=1.60%, 50=1.49%, 100=0.10%, 250=0.42%, 500=0.69%
  lat (usec)   : 750=5.29%, 1000=89.95%
  lat (msec)   : 2=0.24%, 4=0.11%, 10=0.08%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.79%, sys=5.04%, ctx=164172, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1464MiB/s (1535MB/s), 1464MiB/s-1464MiB/s (1535MB/s-1535MB/s), io=19.5GiB (21.0GB), run=13658-13658msec
zpool iostat -w -T d testpool 5
Tue Dec 10 05:18:42 PM UTC 2024

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      9      0      0      0
511ns           0      0      0      0      0      5      0     30      0      0      0
1us             0      0      0      0      0      1      0     60      0      0      0
2us             0      0      0      0      0      0      0     27      0      0      0
4us             0      0      0      0      0      0      0      6      0      0      0
8us             0      0      0      0      0      0      0     10      0      0      0
16us            0     14      0     19      0      0      0     14      0      0      0
32us            0     93      0    154      0      0      0     26      0      0      0
65us            0     59      0    265      0      0      0     24      0      0      0
131us           0     20      0  1.01K      0      0      0      6      0      0      0
262us           0      9      0    107      0      0      0      9      0      0      0
524us           0      7      0     21      0      0      0     24      0      0      0
1ms             0     22      0  7.95K      0      0      0    107      0      0      0
2ms             0    223      0    105      0      0      0  1.25K      0      0      0
4ms             0  3.28K      0     12      0      0      0  3.16K      0      0      0
8ms             0  4.19K      0      0      0      0      0  4.22K      0      0      0
16ms            0  1.13K      0      0      0      0      0     48      0      0      0
33ms            0     19      0      0      0      0      0     16      0      0      0
67ms            0     40      0      0      0      0      0     40      0      0      0
134ms           0    316      0      0      0      0      0    317      0      0      0
268ms           0    213      0      0      0      0      0    212      0      0      0
536ms           0     24      0      0      0      0      0     24      0      0      0
1s              0      0      0      0      0      0      0      0      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------
Tue Dec 10 05:18:47 PM UTC 2024

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0     13      0      0      0
511ns           0      0      0      0      0      5      0     42      0      0      0
1us             0      0      0      0      0      2      0     90      0      0      0
2us             0      0      0      0      0      0      0     29      0      0      0
4us             0      0      0      0      0      0      0      5      0      0      0
8us             0      0      0      0      0      0      0     10      0      0      0
16us            0     20      0     25      0      0      0     21      0      0      0
32us            0    118      0    192      0      0      0     28      0      0      0
65us            0     71      0    303      0      0      0     27      0      0      0
131us           0     23      0  1.23K      0      0      0      5      0      0      0
262us           0      7      0     79      0      0      0      9      0      0      0
524us           0     16      0     30      0      0      0     31      0      0      0
1ms             0     33      0  9.54K      0      0      0     63      0      0      0
2ms             0    129      0    105      0      0      0    662      0      0      0
4ms             0  2.78K      0     12      0      0      0  3.87K      0      0      0
8ms             0  6.35K      0      0      0      0      0  5.83K      0      0      0
16ms            0  1.25K      0      0      0      0      0     60      0      0      0
33ms            0     81      0      0      0      0      0     76      0      0      0
67ms            0     86      0      0      0      0      0     86      0      0      0
134ms           0    215      0      0      0      0      0    215      0      0      0
268ms           0    230      0      0      0      0      0    231      0      0      0
536ms           0    112      0      0      0      0      0    111      0      0      0
1s              0      0      0      0      0      0      0      0      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------
Tue Dec 10 05:18:52 PM UTC 2024

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0     11      0      0      0
511ns           0      0      0      0      0      6      0     37      0      0      0
1us             0      0      0      0      0      2      0     73      0      0      0
2us             0      0      0      0      0      0      0     22      0      0      0
4us             0      0      0      0      0      0      0     10      0      0      0
8us             0      0      0      0      0      0      0     11      0      0      0
16us            0     16      0     26      0      0      0     18      0      0      0
32us            0    108      0    195      0      0      0     30      0      0      0
65us            0     72      0    247      0      0      0     35      0      0      0
131us           0     30      0  1.34K      0      0      0     15      0      0      0
262us           0     14      0    216      0      0      0     17      0      0      0
524us           0     10      0     36      0      0      0     29      0      0      0
1ms             0     42      0  8.93K      0      0      0    112      0      0      0
2ms             0    193      0    162      0      0      0    578      0      0      0
4ms             0  2.22K      0     10      0      0      0  3.01K      0      0      0
8ms             0  6.40K      0      0      0      0      0  6.42K      0      0      0
16ms            0  1.37K      0      0      0      0      0     62      0      0      0
33ms            0     46      0      0      0      0      0     39      0      0      0
67ms            0     82      0      0      0      0      0     83      0      0      0
134ms           0    192      0      0      0      0      0    200      0      0      0
268ms           0    229      0      0      0      0      0    224      0      0      0
536ms           0    139      0      0      0      0      0    135      0      0      0
1s              0      0      0      0      0      0      0      0      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------

Same happens if I tune zfs_vdev_async_write_active_max_dirty_percent and zfs_delay_min_dirty_percent from 60 to 30

fio
root@plt-platform-helix:/home/ipa/abornev# fio --section dirty-writers-thoughput --nrfiles 200 /home/ipa/abornev/jobfile.fio --status-interval=5s
dirty-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 10 (f=10): [W(10)][36.4%][w=1429MiB/s][w=11.4k IOPS][eta 00m:07s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=70983: Tue Dec 10 17:32:49 2024
  write: IOPS=14.0k, BW=1745MiB/s (1830MB/s)(8518MiB/4880msec); 0 zone resets
    clat (usec): min=11, max=43495, avg=684.69, stdev=426.97
     lat (usec): min=12, max=43500, avg=686.78, stdev=427.27
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   17], 10.00th=[   21], 20.00th=[  502],
     | 30.00th=[  750], 40.00th=[  807], 50.00th=[  824], 60.00th=[  832],
     | 70.00th=[  840], 80.00th=[  857], 90.00th=[  873], 95.00th=[  889],
     | 99.00th=[  947], 99.50th=[ 1029], 99.90th=[ 4359], 99.95th=[ 7046],
     | 99.99th=[11338]
   bw (  MiB/s): min= 1401, max= 4441, per=100.00%, avg=1772.33, stdev=95.66, samples=90
   iops        : min=11210, max=35528, avg=14178.67, stdev=765.31, samples=90
  lat (usec)   : 20=9.82%, 50=4.41%, 100=0.48%, 250=2.20%, 500=3.05%
  lat (usec)   : 750=10.38%, 1000=69.16%
  lat (msec)   : 2=0.23%, 4=0.16%, 10=0.10%, 20=0.02%, 50=0.01%
  cpu          : usr=2.47%, sys=5.84%, ctx=67691, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,68153,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1745MiB/s (1830MB/s), 1745MiB/s-1745MiB/s (1830MB/s-1830MB/s), io=8518MiB (8932MB), run=4880-4880msec
Jobs: 10 (f=10): [W(10)][75.0%][w=1430MiB/s][w=11.4k IOPS][eta 00m:03s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=70983: Tue Dec 10 17:32:54 2024
  write: IOPS=12.7k, BW=1586MiB/s (1663MB/s)(15.3GiB/9880msec); 0 zone resets
    clat (usec): min=11, max=61378, avg=758.02, stdev=392.11
     lat (usec): min=12, max=61381, avg=760.22, stdev=392.30
    clat percentiles (usec):
     |  1.00th=[   15],  5.00th=[   20], 10.00th=[  371], 20.00th=[  758],
     | 30.00th=[  807], 40.00th=[  824], 50.00th=[  840], 60.00th=[  848],
     | 70.00th=[  857], 80.00th=[  865], 90.00th=[  881], 95.00th=[  898],
     | 99.00th=[  947], 99.50th=[  963], 99.90th=[ 3982], 99.95th=[ 6456],
     | 99.99th=[11338]
   bw (  MiB/s): min= 1381, max= 4441, per=100.00%, avg=1592.37, stdev=67.89, samples=190
   iops        : min=11052, max=35528, avg=12738.95, stdev=543.12, samples=190
  lat (usec)   : 20=5.34%, 50=2.40%, 100=0.26%, 250=1.26%, 500=1.79%
  lat (usec)   : 750=7.39%, 1000=81.20%
  lat (msec)   : 2=0.16%, 4=0.11%, 10=0.08%, 20=0.02%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.99%, sys=5.46%, ctx=127013, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,125348,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1586MiB/s (1663MB/s), 1586MiB/s-1586MiB/s (1663MB/s-1663MB/s), io=15.3GiB (16.4GB), run=9880-9880msec
Jobs: 10 (f=10): [W(1),f(1),W(4),f(3),W(1)][100.0%][w=1379MiB/s][w=11.0k IOPS][eta 00m:00s]
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=70983: Tue Dec 10 17:32:57 2024
  write: IOPS=12.3k, BW=1541MiB/s (1616MB/s)(19.5GiB/12980msec); 0 zone resets
    clat (usec): min=11, max=61378, avg=775.58, stdev=363.35
     lat (usec): min=12, max=61381, avg=777.83, stdev=363.51
    clat percentiles (usec):
     |  1.00th=[   16],  5.00th=[   22], 10.00th=[  611], 20.00th=[  775],
     | 30.00th=[  816], 40.00th=[  832], 50.00th=[  840], 60.00th=[  848],
     | 70.00th=[  857], 80.00th=[  865], 90.00th=[  881], 95.00th=[  898],
     | 99.00th=[  947], 99.50th=[  963], 99.90th=[ 3654], 99.95th=[ 5800],
     | 99.99th=[11338]
   bw (  MiB/s): min= 1378, max= 4441, per=100.00%, avg=1553.48, stdev=59.56, samples=250
   iops        : min=11024, max=35528, avg=12427.84, stdev=476.51, samples=250
  lat (usec)   : 20=4.18%, 50=1.88%, 100=0.20%, 250=1.03%, 500=1.62%
  lat (usec)   : 750=7.02%, 1000=83.74%
  lat (msec)   : 2=0.14%, 4=0.09%, 10=0.07%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.76%, sys=5.42%, ctx=162929, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,160000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1541MiB/s (1616MB/s), 1541MiB/s-1541MiB/s (1616MB/s-1616MB/s), io=19.5GiB (21.0GB), run=12980-12980msec
zpool iostat -w -T d testpool 5
Tue Dec 10 05:32:48 PM UTC 2024

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      2      0      0      0
511ns           0      0      0      0      0      1      0     14      0      0      0
1us             0      0      0      0      0      0      0     31      0      0      0
2us             0      0      0      0      0      0      0      8      0      0      0
4us             0      0      0      0      0      0      0      5      0      0      0
8us             0      0      0      0      0      0      0      4      0      0      0
16us            0      6      0      9      0      0      0      5      0      0      0
32us            0     40      0     80      0      0      0     13      0      0      0
65us            0     29      0    273      0      0      0     18      0      0      0
131us           0     18      0  1.09K      0      0      0      5      0      0      0
262us           0      2      0     53      0      0      0      4      0      0      0
524us           0      4      0     18      0      0      0      8      0      0      0
1ms             0     21      0  8.12K      0      0      0    336      0      0      0
2ms             0    454      0    100      0      0      0    733      0      0      0
4ms             0  1.07K      0      6      0      0      0    825      0      0      0
8ms             0  5.86K      0      0      0      0      0  7.22K      0      0      0
16ms            0  1.76K      0      0      0      0      0     61      0      0      0
33ms            0     68      0      0      0      0      0     68      0      0      0
67ms            0     38      0      0      0      0      0     38      0      0      0
134ms           0     55      0      0      0      0      0     55      0      0      0
268ms           0    184      0      0      0      0      0    184      0      0      0
536ms           0    130      0      0      0      0      0    132      0      0      0
1s              0     15      0      0      0      0      0     14      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------
Tue Dec 10 05:32:53 PM UTC 2024

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      3      0      0      0
511ns           0      0      0      0      0      2      0     11      0      0      0
1us             0      0      0      0      0      0      0     27      0      0      0
2us             0      0      0      0      0      0      0     10      0      0      0
4us             0      0      0      0      0      0      0      2      0      0      0
8us             0      0      0      0      0      0      0      4      0      0      0
16us            0      4      0      6      0      0      0      8      0      0      0
32us            0     39      0     84      0      0      0      9      0      0      0
65us            0     30      0    294      0      0      0     17      0      0      0
131us           0     12      0  1.21K      0      0      0      1      0      0      0
262us           0      0      0     52      0      0      0      4      0      0      0
524us           0      1      0     15      0      0      0      8      0      0      0
1ms             0     13      0  9.61K      0      0      0     23      0      0      0
2ms             0     41      0    100      0      0      0    208      0      0      0
4ms             0  1.58K      0      8      0      0      0  2.24K      0      0      0
8ms             0  7.67K      0      0      0      0      0  8.38K      0      0      0
16ms            0  1.60K      0      0      0      0      0     27      0      0      0
33ms            0     35      0      0      0      0      0     35      0      0      0
67ms            0     20      0      0      0      0      0     20      0      0      0
134ms           0     26      0      0      0      0      0     26      0      0      0
268ms           0    107      0      0      0      0      0    107      0      0      0
536ms           0    178      0      0      0      0      0    183      0      0      0
1s              0     17      0      0      0      0      0     12      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------
Tue Dec 10 05:32:58 PM UTC 2024

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      3      0      0      0
511ns           0      0      0      0      0      2      0     16      0      0      0
1us             0      0      0      0      0      1      0     33      0      0      0
2us             0      0      0      0      0      0      0     11      0      0      0
4us             0      0      0      0      0      0      0      2      0      0      0
8us             0      0      0      0      0      0      0      3      0      0      0
16us            0      7      0     13      0      0      0      7      0      0      0
32us            0     44      0    101      0      0      0     16      0      0      0
65us            0     34      0    280      0      0      0     22      0      0      0
131us           0     22      0  2.05K      0      0      0      7      0      0      0
262us           0     10      0    188      0      0      0      6      0      0      0
524us           0      3      0    125      0      0      0      4      0      0      0
1ms             0     17      0  7.84K      0      0      0     66      0      0      0
2ms             0     97      0    124      0      0      0    559      0      0      0
4ms             0  1.73K      0      7      0      0      0  1.84K      0      0      0
8ms             0  6.76K      0      2      0      0      0  7.59K      0      0      0
16ms            0  1.55K      0      0      0      0      0    104      0      0      0
33ms            0     77      0      0      0      0      0     72      0      0      0
67ms            0     21      0      0      0      0      0     21      0      0      0
134ms           0     40      0      0      0      0      0     40      0      0      0
268ms           0    159      0      0      0      0      0    158      0      0      0
536ms           0    147      0      0      0      0      0    147      0      0      0
1s              0      0      0      0      0      0      0      0      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------
Tue Dec 10 05:33:03 PM UTC 2024

testpool     total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      0      0      0      0
511ns           0      0      0      0      0      0      0      1      0      0      0
1us             0      0      0      0      0      0      0      1      0      0      0
2us             0      0      0      0      0      0      0      0      0      0      0
4us             0      0      0      0      0      0      0      0      0      0      0
8us             0      0      0      0      0      0      0      1      0      0      0
16us            0      0      0      2      0      0      0      2      0      0      0
32us            0      5      0     14      0      0      0      2      0      0      0
65us            0      5      0      4      0      0      0      2      0      0      0
131us           0      5      0      0      0      0      0      4      0      0      0
262us           0      4      0      0      0      0      0      3      0      0      0
524us           0      0      0      0      0      0      0      0      0      0      0
1ms             0      0      0      0      0      0      0      0      0      0      0
2ms             0      0      0      0      0      0      0      0      0      0      0
4ms             0      0      0      0      0      0      0      0      0      0      0
8ms             0      0      0      0      0      0      0      0      0      0      0
16ms            0      0      0      0      0      0      0      0      0      0      0
33ms            0      0      0      0      0      0      0      0      0      0      0
67ms            0      0      0      0      0      0      0      0      0      0      0
134ms           0      0      0      0      0      0      0      0      0      0      0
268ms           0      0      0      0      0      0      0      0      0      0      0
536ms           0      0      0      0      0      0      0      0      0      0      0
1s              0      0      0      0      0      0      0      0      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------

Edit: pool was created like this zpool create testpool -O mountpoint=/mnt/testpool -o ashift=12 /dev/nvme0n1

@sargarass
Copy link
Author

sargarass commented Dec 11, 2024

ext4 latencies can be significantly lower when O_DIRECT is not used (jobfile.fio does set it on).

For comparison, here are the ext4 latencies when O_DIRECT is not used. The latencies at the 95th percentile are 62µs, due to leveraging the OS cache.

jobfile.fio changes
# If value is true, use non-buffered io. This is usually O_DIRECT.
direct=0
# If value is true, use buffered io. This is the opposite of the 'direct' option
buffered=1

No fsync writes

ext4: lat (usec): min=31, max=53265, avg=675.26, stdev=4421.49
root@plt-platform-helix:/mnt# fio --section dirty-writers-thoughput --nrfiles 4000 /home/ipa/abornev/jobfile.fio
dirty-writers-thoughput: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=sync, iodepth=1
...
fio-3.36
Starting 10 threads
Jobs: 2 (f=2): [_(8),f(2)][100.0%][eta 00m:00s]                         
dirty-writers-thoughput: (groupid=0, jobs=10): err= 0: pid=410837: Wed Dec 11 13:04:04 2024
  write: IOPS=13.3k, BW=1657MiB/s (1738MB/s)(391GiB/241343msec); 0 zone resets
    clat (usec): min=30, max=53264, avg=674.50, stdev=4421.49
     lat (usec): min=31, max=53265, avg=675.26, stdev=4421.49
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   36], 10.00th=[   36], 20.00th=[   37],
     | 30.00th=[   37], 40.00th=[   38], 50.00th=[   38], 60.00th=[   39],
     | 70.00th=[   39], 80.00th=[   41], 90.00th=[   46], 95.00th=[   62],
     | 99.00th=[31589], 99.50th=[32637], 99.90th=[33817], 99.95th=[33817],
     | 99.99th=[46924]
   bw (  MiB/s): min= 1330, max= 7385, per=100.00%, avg=1660.64, stdev=97.03, samples=4817
   iops        : min=10640, max=59086, avg=13283.86, stdev=776.23, samples=4817
  lat (usec)   : 50=91.99%, 100=4.88%, 250=1.10%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=2.03%
  lat (msec)   : 100=0.01%
  cpu          : usr=10.38%, sys=5.43%, ctx=70939, majf=0, minf=3
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,3200000,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1657MiB/s (1738MB/s), 1657MiB/s-1657MiB/s (1738MB/s-1738MB/s), io=391GiB (419GB), run=241343-241343msec

Disk stats (read/write):
  nvme0n1: ios=0/2915318, sectors=0/745252680, merge=0/37605, ticks=0/127972869, in_queue=127972869, util=99.85%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

3 participants