Raid5 intermittent hang - despite workarounds on Catalina 10.15.4

24 Posts
3 Users
0 Reactions
3,389 Views
(@fantafly)
Posts: 13
Member
Topic starter
 

Out of curiosity, in Energy Saver preferences, do you have selected "put hard drives to sleep when possible"?

Hello
No, I don't. Just double-checked :)

 
Posted : 29/05/2020 9:11 am
(@fantafly)
Posts: 13
Member
Topic starter
 

So

I've managed to install clean versions of macOS, both Catalina AND Mojave.
With nothing else installed, the same pauses occur on both systems.

A few more observations:
- While the "pause" is occurring, the disks in the raid are audibly NOT busy. They become quiet just as the data transfer rate drops to zero.
- There is no observable pattern as to when the "pause" occurs. Five minutes is about the longest I've gone without it occurring though.

After these extensive tests (different clean installs & safe boots on different machines), I really see no other possibility left than the hardware being at fault.
Is there any chance the chosen drives have an issue with the cache?
(4x Seagate Barracuda ST6000DM003)

 
Posted : 31/05/2020 8:54 am
(@softraid-support)
Posts: 9200
Member Admin
 

Very interesting.

Let me speculate on something about the Seagates that could conceivably cause this.

Find out whether these drives use SMR (shingled magnetic recording). You may have to dig to find out, as drive manufacturers are trying to hide when they use SMR. It is much cheaper and much slower: SMR drives get much slower under sustained writes.

SMR could be the culprit; it's worth investigating.
Seagate has definitely added SMR to many of its Barracuda drives. Here are some references and information on SMR:

https://arstechnica.com/gadgets/2020/04/caveat-emptor-smr-disks-are-being-submarined-into-unexpected-channels/

https://blocksandfiles.com/2020/04/15/seagate-2-4-and-8tb-barracuda-and-desktop-hdd-smr/

https://www.ixsystems.com/community/resources/list-of-known-smr-drives.141/

 
Posted : 31/05/2020 12:31 pm
(@fantafly)
Posts: 13
Member
Topic starter
 

Now...

While researching SMR (all very intriguing), I decided to run a validation of the RAID.
Mind you, there had never been ANY errors, but upon validation one drive came up as faulty. And sure enough, once I tried to certify it, it failed (two months after purchase).
With this drive removed the RAID performs flawlessly (albeit degraded). I certainly hope a rebuilt RAID will also work fine.
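
For anyone wondering why the volume keeps working with one disk pulled: in a RAID 5, each stripe carries one parity block, so any single missing block can be recomputed from the others. Below is a minimal sketch of that idea in Python. It is the generic textbook XOR parity scheme, not SoftRAID's actual implementation, and the block contents and the `xor_blocks` name are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-disk RAID 5: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend the disk holding data[1] has been removed. XOR-ing the surviving
# data blocks with the parity block recomputes the missing block.
recovered = xor_blocks([data[0], data[2], parity])
```

Here `recovered` equals the missing block `data[1]`. Every read that touches the missing disk needs this extra work, which is why a degraded volume is usable but has no protection against a second failure.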

So I'm at fault for not having validated the RAID in the first place, and for not certifying the new disks before using them. Doh!
But in case this happens to anyone else, I hope they will check the disks first, rather than search for solutions elsewhere.
I guess SoftRAID can't be vocal enough about running certification before deploying disks!

Also, I'm curious why the SoftRAID Monitor wouldn't spot an error like this without the user attempting validation first?

Below is the SoftRAID Monitor log, for what it's worth.

Jun 1 04:15:15 - SoftRAID Driver: The volume "*****" (disk6) has started validating.
Jun 1 04:39:32 - SoftRAID Driver: The volume "*****" (disk6) failed to validate because one the disks encountered a read error. The disk (disk3, SoftRAID ID: *****) was unable to read sectors (offset 308501544960, i/o block size 3145728, error E00002E7). This disk should be replaced.
Jun 1 04:39:37 - SoftRAID Driver: A disk for the volume "*****" (disk6) encountered a read error (E00002E7). The disk (disk3, SoftRAID ID: *****) was unable to read sectors. The error occurred at volume offset 308501544960 (i/o block size 3145728). This disk should be replaced.
Jun 1 13:02:50 - SoftRAID Application: Changing the safeguard on the volume "*****" (disk6). The safeguard is now disabled so the volume can be deleted.
Jun 1 13:13:49 - SoftRAID Application: Certifying the disk disk3, SoftRAID ID: *****, SATA bus 0, id 3 (Thunderbolt). with 2 passes and 15 minutes of random access testing. During each pass, every sector on the disk is filled with a pattern. Then the pattern is read back and verified.
Jun 1 13:14:04 - SoftRAID Application: The certify disk command for disk disk3, SN: *****, SATA bus 0, id 3 (Thunderbolt) hung while writing (offset 754,974,720, i/o block size = 16,777,216). This disk should be replaced immediately.
Jun 1 13:14:28 - SoftRAID Application: The certify disk command for disk disk3, SN: *****, SATA bus 0, id 3 (Thunderbolt) failed because this disk has unreliable sectors. It should be replaced immediately (error number = 66).
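
The offsets in the log above are byte counts, which are hard to read at a glance. As a rough sketch (the `locate_error` helper and its regular expression are my own, inferred only from the three message shapes shown here, not from any documented SoftRAID log format), the following Python pulls the offset and i/o block size out of a line and converts them to a block index and a position in MiB:

```python
import re

# Assumed pattern, based only on the log lines in this thread:
# "offset N" followed somewhere later by "i/o block size N" or "i/o block size = N".
_PATTERN = re.compile(r"offset (\d[\d,]*).*?i/o block size(?: =)? (\d[\d,]*)")

def locate_error(line):
    """Return (block_index, offset_in_mib) for a log line, or None if it has no offset."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    # Some messages print numbers with thousands separators, so strip the commas.
    offset = int(match.group(1).replace(",", ""))
    block_size = int(match.group(2).replace(",", ""))
    return offset // block_size, offset / 2**20
```

Applied to the lines above, the read error sits at block 98070 of 3 MiB blocks, about 287 GiB into the volume, and the certify pass hung at block 45 of 16 MiB blocks, only 720 MiB into the disk.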

 
Posted : 01/06/2020 1:37 am
(@softraid-support)
Posts: 9200
Member Admin
 

Intriguing. This drive must have been stalling on reads without actually failing them. If the drive does not report an error, the SoftRAID Monitor has no way of knowing about it. The same goes for SMART data and predicted-failure warnings.

Glad you sorted this out. I guess we were on the wrong track after all, but it's very unusual to see this behavior from a drive: pausing constantly on reads, but never actually failing. And for this to happen the way you described, the drive must have many segments that are faulty in the same way.

Perhaps this is a controller problem, not so much a media problem. I can't explain the symptoms.
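
Since a read that stalls but eventually succeeds never produces an error, the only way to spot it from the outside is to time the reads. Here is a minimal Python sketch of that idea. The `find_stalls` name, the 1 MiB chunk size, and the 2-second threshold are arbitrary assumptions, and this is not how SoftRAID itself monitors disks.

```python
import time

def find_stalls(stream, chunk_size=1 << 20, threshold_s=2.0, clock=time.perf_counter):
    """Read `stream` to the end; return (offset, seconds) for each read at or over the threshold."""
    stalls = []
    offset = 0
    while True:
        start = clock()
        chunk = stream.read(chunk_size)
        elapsed = clock() - start
        if not chunk:
            break  # end of stream
        if elapsed >= threshold_s:
            # The read succeeded, so no error was raised; only the timing gives it away.
            stalls.append((offset, elapsed))
        offset += len(chunk)
    return stalls
```

To try it, open a large file on the volume with `open(path, "rb")` and pass the file object in. Keep in mind that recently read files may be served from the operating system's cache rather than the disks, so a file that has not been touched since the last restart gives a more honest result.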

 
Posted : 01/06/2020 11:44 am
(@softraid-support)
Posts: 9200
Member Admin
 

Can you send us a support file on your system?

(support at softraid)

(before you return the disks)

does the pause still happen after the certify attempt?

 
Posted : 01/06/2020 12:57 pm
(@fantafly)
Posts: 13
Member
Topic starter
 

Can you send us a support file on your system?

(support at softraid)

(before you return the disks)

does the pause still happen after the certify attempt?

Just sent you a report.

Since the disk was erased when attempting to certify, I tried to rebuild the RAID with the faulty disk; however, it failed again.
With only the remaining 3 disks, the RAID works fine and the pause does not happen.

 
Posted : 01/06/2020 9:46 pm
(@fantafly)
Posts: 13
Member
Topic starter
 

In any case thanks for helping me figure this out ...

 
Posted : 01/06/2020 9:49 pm
(@softraid-support)
Posts: 9200
Member Admin
 

Thanks for the file. We are in touch offline to get more data on your scenario.

 
Posted : 02/06/2020 3:03 pm