Shortly after starting the `pg_checksums --enable...` command on a PostgreSQL v17.7 database, SoftRAID began reporting disk errors and then dismounted RAID members. After rebooting and waiting for the array to rebuild, I was able to successfully validate the array.
I have reproduced this twice in the last few hours. Failing disks were different each time. Prior to this, the array has not had any errors in two months of continuous uptime.
Any idea why this operation seems to be triggering the disk errors?
Express 4M2, 4x4TB Samsung 990 Pro in RAID 4 (media life remaining ~92%)
Intel Mac mini 8,1 connected via Thunderbolt
SoftRAID 8.6.1, driver 8.5
macOS 15.7.2
I suspect the disks were the same; remember that disks are randomly assigned disk numbers (disk#) at connection time.
You can look in the SoftRAID log, filter for error and see if the disks are actually different.
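If digging through the log in the app is awkward, a quick command-line sketch of that filtering is below. The sample lines are copied from the entries quoted later in this thread; in practice you would point the grep at an exported copy of your SoftRAID log (the `/tmp` path here is just a stand-in).

```shell
# Extract the distinct SoftRAID IDs that appear on error/removal lines,
# so you can tell whether the same physical disks were involved both times.
# The sample file stands in for an exported copy of the SoftRAID log.
cat > /tmp/softraid_sample.log <<'EOF'
2026.01.03 - 22:50:15 - SoftRAID Driver: A disk (disk4, SoftRAID ID: 0A860F0E51FB9800) for the SoftRAID volume "DB" (disk6) encountered multiple read or write errors.
2026.01.04 - 00:58:26 - SoftRAID Driver: A disk (disk4, SoftRAID ID: 0A860F0E53A16080) for the SoftRAID volume "DB" (disk9) was removed or stopped responding while the volume was mounted and in use.
EOF

# Keep only lines describing errors or dropped disks, then list unique IDs.
grep -iE "error|removed|stopped responding" /tmp/softraid_sample.log \
  | grep -oE "SoftRAID ID: [0-9A-F]+" \
  | sort -u
```

Because the SoftRAID ID is stable across reboots (unlike disk#), the unique-ID list tells you directly whether the failures landed on the same drives.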
Did you check with Samsung magician to see if there are firmware updates for your drives? (There is a MacOS version now)
No. According to the "Supported Features" matrix on the Samsung Magician web site: "*Internal SSDs are not supported on macOS..."
I used SoftRAID's utilities after rebuilding/rebooting and no errors were reported.
2026.01.04 - 03:29:55 - SoftRAID Driver: The volume "DB" (disk9) validated successfully. There were 0 blocks which were updated. All parity data is now correct.
2026.01.04 - 01:51:51 - SoftRAID Monitor: Starting SMART test on all disks which support SMART.
2026.01.04 - 01:51:51 - SoftRAID Monitor: Finished SMART test on all disks. No disks failed the SMART test.
2026.01.04 - 12:18:21 - SoftRAID Application: The verify disk command for disk disk4, Label: "Express - 3", SoftRAID ID: 0A860F0E50E90E80, PCI bus 0, id 0, lun 0 (Thunderbolt) completed successfully.
2026.01.04 - 12:21:42 - SoftRAID Application: The verify disk command for disk disk2, Label: "Express - 1", SoftRAID ID: 0A860F0E53A16080, PCI bus 0, id 0, lun 0 (Thunderbolt) completed successfully.
2026.01.04 - 12:22:19 - SoftRAID Application: The verify disk command for disk disk3, Label: "Express - 2", SoftRAID ID: 0A860F0E54D0D400, PCI bus 0, id 0, lun 0 (Thunderbolt) completed successfully.
2026.01.04 - 12:22:24 - SoftRAID Application: The verify disk command for disk disk1, Label: "Express - 4", SoftRAID ID: 0A860F0E51FB9800, PCI bus 0, id 0 (Thunderbolt) completed successfully.
There were three different SoftRAID ID numbers (out of the four SSDs in the array) reported as having errors across the two attempts I described previously:
2026.01.03 - 22:50:10 - SoftRAID Driver: A disk (disk4, SoftRAID ID: 0A860F0E51FB9800) for the SoftRAID volume "DB" (disk6) was removed or stopped responding while the volume was mounted and in use.
2026.01.03 - 22:50:15 - SoftRAID Driver: A disk (disk4, SoftRAID ID: 0A860F0E51FB9800) for the SoftRAID volume "DB" (disk6) encountered multiple read or write errors. This disk has been marked "failed" and will no longer be used for when reading volume data.
2026.01.03 - 22:51:50 - SoftRAID Driver: A disk (disk0, SoftRAID ID: 0A860F0E54D0D400) for the SoftRAID volume "DB" (disk6) was removed or stopped responding while the volume was mounted and in use.
2026.01.03 - 22:52:06 - SoftRAID Application: The get info command for disk disk0, SN: S7KGNU0XA07209M, PCI bus 0, id 0, lun 0 (Thunderbolt) hung while reading (offset 0, i/o block size = 512). This disk should be replaced immediately.
2026.01.03 - 22:52:10 - SoftRAID Driver: Discarding 121 cache blocks, unable to write to volume.
2026.01.04 - 00:58:26 - SoftRAID Driver: A disk (disk4, SoftRAID ID: 0A860F0E53A16080) for the SoftRAID volume "DB" (disk9) was removed or stopped responding while the volume was mounted and in use.
2026.01.04 - 00:58:26 - SoftRAID Driver: The SoftRAID volume "DB" (disk9) encountered an error (E00002E4). A program attempted to read or write to a volume which was no longer accepting i/o requests.
Can we rule out bad disks, given the small probability that three would fail spontaneously so close together?
And even if that can't be ruled out, why would this behavior present only during the PostgreSQL checksumming I described above? The array is operational now and has been performing error-free since I wrote the original post.
Thanks.
Yes, it's not a drive issue.
This looks like transient I/O errors under extreme load.
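The "three disks failing independently is too unlikely" intuition can be put in rough numbers. The per-day failure rate below is purely illustrative, not a measured figure for these drives:

```shell
# If a healthy SSD had, say, a 1-in-10,000 chance of spontaneously failing
# on any given day, three *independent* failures in one day would be ~1e-12.
# A shared trigger (load, enclosure, cable, firmware) explains all three at once.
awk 'BEGIN { p = 1e-4; printf "P(three independent failures) ~ %.0e\n", p^3 }'
# prints: P(three independent failures) ~ 1e-12
```

Whatever the exact rate, three near-simultaneous "failures" point overwhelmingly at a common cause rather than three coincidences.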
pg_checksums --enable is a perfect trigger
Enabling checksums is one of the heaviest storage operations Postgres can do because it effectively touches every data page (and depending on implementation/version/flags, may also force a lot of writes + fsyncs). In practice it looks like:
sustained sequential reads across the whole cluster
sustained writes (page rewrite / metadata churn)
very high queue depth + flushes
long periods of “no idle” for the NVMe + TB bridge
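For scale, here is a back-of-envelope sketch of the I/O volume involved; the 1 TiB cluster size is an assumption, so substitute your own:

```shell
# pg_checksums --enable reads every 8 kB page, computes its checksum, and
# writes the page back, so total I/O is roughly 2x the cluster size.
CLUSTER_BYTES=$((1024 * 1024 * 1024 * 1024))   # assumed 1 TiB cluster
PAGE=8192                                       # Postgres default block size

echo "pages touched: $((CLUSTER_BYTES / PAGE))"
# prints: pages touched: 134217728

echo "approx total I/O: $((2 * CLUSTER_BYTES / 1024 / 1024 / 1024)) GiB"
# prints: approx total I/O: 2048 GiB
```

That is on the order of a hundred million page operations back to back, with essentially no idle time for the drives or the Thunderbolt bridge.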
That kind of workload can expose timeouts / resets in any weak link of the chain:
Thunderbolt ↔ bridge ↔ NVMe firmware ↔ power/thermal behavior, and SoftRAID will interpret those timeouts as disk errors and may drop members to protect the array.
Can you wait until this happens again and then capture the following for me in Terminal:
log show --last 2h --predicate 'subsystem == "com.apple.iokit.IOStorageFamily"'
log show --last 2h --predicate 'eventMessage CONTAINS[c] "Thunderbolt" OR eventMessage CONTAINS[c] "IOThunderbolt" OR eventMessage CONTAINS[c] "reset"'
Also, reduce the stress and see if the issue disappears; this helps confirm it's load-triggered transport/firmware behavior:
Run pg_checksums with minimum parallelism (if using jobs/threads, set to 1).
Make sure the cluster is otherwise idle.
If possible, repeat with the enclosure cooled aggressively (fan, open air, no stacking).
If it stops happening when load is reduced, that’s strong evidence it’s not “random disk failures”.
Also, check for thermal throttling (iStat Menus has a trial mode) and see what temperatures you are getting during your checksums test.
Are you connecting through a dock? If so, connect the 4M2 directly.