OK, I let each one verify for at least a minute and took a screenshot right around its highest read speed. When cancelling each verification, I got a popup showing an error, but no errors were recorded during the actual verification process.
This proves the disks are OK.
I assume when you "validate" you see 500 MB/s+?
That shows the volume structure is OK.
So it comes down to what is inside the volume. How much free space is left in the volume?
This is after DiskWarrior rebuilt the directory, so there should be no issue there.
It's a 12TB RAID5 volume. It has 3.6TB free. It's filled with lots of H.265 and H.264 movies and DVR recordings (but nothing records directly to this volume; that goes to an SSD then gets transferred here manually), backups, InDesign, Illustrator and Photoshop files, app installers, DMGs, etc.
It's validating now and updating blocks. It's going very slowly, according to Activity Monitor.
It's set for workstation, 16k stripe. The slowdowns just started happening recently. I've been using this volume for years without issue, at least after I stopped using it with M1 MacBooks because the Thunderbolt sleep issues were murder; so many problems. We worked for months on that, here in the forums.
I replaced a couple of drives over the last few years but after rebuilding they worked fine.
Could a bad Thunderbolt 2 cable cause this? Could it be the Thunderbay enclosure? I can try swapping the cable and then the enclosure if you like; I have two of each.
A clue:
Validation of a 12TB volume should take 4-6 hours on fast disks, not 140. On older disks, maybe 8 hours.
Do try swapping the enclosure/cable. I know of either one failing, but not "slowing down", so that is weird.
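For reference, that estimate is just capacity divided by sustained throughput. A quick sketch (the 600 MB/s and 24 MB/s rates below are illustrative assumptions, not measurements from your system):

```python
# Rough validation-time estimate: time = capacity / sustained read rate.

def validation_hours(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours to read a volume of capacity_tb terabytes at rate_mb_s MB/s."""
    total_mb = capacity_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal, as drives are rated)
    return total_mb / rate_mb_s / 3600

print(f"{validation_hours(12, 600):.1f} h")  # healthy 4-disk RAID 5 at ~600 MB/s -> about 5.6 h
print(f"{validation_hours(12, 24):.1f} h")   # the rate a 140-hour estimate implies (~24 MB/s)
```

So a 140-hour estimate means the volume is being read at roughly 24 MB/s, which is far below what even one healthy HDD should sustain.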
@softraid-support About 45 minutes ago the estimate went up to 180 hours.
Two of my drives are 6 years old with nearly 60,000 hours on them. I guess it's time to replace those. Those are the good HGST drives that they no longer make, I believe. I just noticed the last one at the bottom is now showing weird SMART status and reallocated sectors; it didn't look like that before the validation. The validation has updated 118,000+ blocks so far; does that mean blocks were updated on that disk, causing the incongruous reading?
Sorry about all this. I appreciate you taking the time, especially if it turns out to be a disk issue.
I'll stop the validation now and try swapping the cable. Then I'll try the other Thunderbay first chance I get, probably tomorrow because of an extra late work deadline tonight.
I just updated the estimated validate hours, it was too high.
Sixty thousand hours on a drive is like 100 years old in people terms. So you certainly got your value from them.
I tried a different cable and my other Thunderbay; no difference. So I would guess I need to replace one of the older drives, see how it runs, then replace the other one. Is there any way to tell which drive(s) are running slow? What if I replace the old drives but it turns out the newer ones were the bad ones? I think next time I create a RAID I'll leave a few GB open on every disk so I can create a non-RAID volume on each one to test for issues down the road. I only have about 130MB free on each disk here.
Or maybe it's time to bite the bullet and switch to SSDs.
I would replace the two 60,000 hour drives in any case. I would guess the odds of failure at that age are >20% a year, perhaps higher. You can use them perhaps for shelf backup, but not for active use. You got great use out of them.
@softraid-support Even if those enterprise drives are rated at 2 million hours MTBF?
I tried validating that volume. I forgot to turn off my scheduled shutdown at midnight, and after 20 hours it was ¾ of the way through, with millions of block updates, when the machine shut down and I lost the validation. I could not resume the next morning. Lesson learned.
But this is interesting; look at the speeds I'm getting now after validation and block updating. Much better. The slower read screenshot is the file per frame test, the faster one is single file.
I take it this is more in line with expected performance?
A 4-disk RAID 5 should be getting 400-800 MB/s depending on the disks and computer. 200 is very slow, even with a fairly full volume. So something is wrong here.
Each disk can do 200 MB/s, so 600 is theoretical, assuming Apple Silicon performance.
Validation should also survive a restart. The first time, expect lots of blocks to update. The next time, there should be zero on HDDs.
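The 600 figure comes from simple RAID 5 arithmetic: one disk's worth of the stripe holds parity, so sequential streaming bandwidth is roughly (n − 1) times the per-disk rate. A minimal sketch (the 200 MB/s per-disk rate is the assumption from the thread; real-world numbers vary with the disks and host):

```python
# Approximate sequential read throughput for a RAID 5 set:
# parity consumes one disk's worth of the stripe, leaving (n - 1) data disks.

def raid5_read_mb_s(num_disks: int, per_disk_mb_s: float) -> float:
    return (num_disks - 1) * per_disk_mb_s

print(raid5_read_mb_s(4, 200))  # -> 600
```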
I do not know of any statistical study of enterprise drives in particular and longevity, but general population studies show disks start failing at higher rates around 25-30K power-on hours. I don't often see 60K hours on drives.
MTBF does not mean they last 2 million hours. It's a very different population calculation used to estimate reliability.
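To make that concrete: under the usual constant-failure-rate model, a 2,000,000-hour MTBF describes the failure rate of a large population during the drives' design life, not a 228-year lifespan for any one unit. A rough sketch (the exponential model here is a simplifying assumption, and it stops applying once drives reach wear-out age, which is exactly the concern at 60K hours):

```python
import math

# Annualized failure rate implied by an MTBF figure, assuming a constant
# failure rate (exponential model) during the useful-life period.

HOURS_PER_YEAR = 8766  # 365.25 days

def annual_failure_rate(mtbf_hours: float) -> float:
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(f"{annual_failure_rate(2_000_000):.2%}")  # ~0.44% of such drives failing per year
```

That ~0.44%/year only holds while the population is inside its design life; past wear-out, observed failure rates climb well above what the MTBF suggests.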
@softraid-support I purchased DriveDX last night for $20 and checked out all 8 HDDs; all were OK but 4 had slower spinup times and lower throughputs. So I ordered four 4TB drives last night; they'll arrive in 2 days.
I let both Thunderbays sit overnight, powered on and connected. This morning I ran First Aid on both, then restarted. I didn't use them all day until 8:00pm, and look what happened: they're back at the speeds they were getting a few weeks ago. What the heck? Other than running First Aid and DriveDX I didn't do anything else, except restart, but I have restarted multiple times over the last few days.
Any idea what might have happened?
I did not see it running, but could it have been indexing? (mdworker, primarily.)
Neither DriveDx nor Disk Utility's First Aid could have fixed this.
Strange, but glad it is working now.

