
Three disks failed certification within an hour of each other

(@ericbarker)
Posts: 20
Member
Topic starter
 

After over a day of certification, 3 of the 4 disks I'm currently trying to verify failed. I'm on an iMac with a ThunderBay6 (USB-C). I'm trying to install six 16TB Seagate Exo drives in RAID 5, for a total of 80TB. About once a day, the SoftRAID application quits and says it needs to be restarted, though I can resume the certification where it left off.

The timing of these failures is too close together. I'm wondering whether something other than bad drives is going on.
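
For anyone checking the capacity quoted in this post: RAID 5 gives up one disk's worth of space to parity, so usable capacity is (number of disks - 1) x disk size. A minimal sketch of that arithmetic (the function name is made up for illustration):

```shell
# Usable capacity of a RAID 5 set: one disk's worth of space holds parity.
# $1 = number of disks, $2 = size of each disk in TB
raid5_usable_tb() {
    echo $(( ($1 - 1) * $2 ))
}

raid5_usable_tb 6 16   # six 16TB disks
```

With six 16TB disks this prints 80, which matches the 80TB total given here.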

 
Posted : 14/02/2020 4:17 am
(@softraid-support)
Posts: 8008
Member Admin
 

This is a very annoying bug in SoftRAID, and we have not yet found the root cause. What we do know is that the application loses contact with the SoftRAID Monitor, so it has to quit and be relaunched.

Very sorry about this. You can certify via command line, but our command line tool does not yet know how to "resume", so you would have to start over.

Fixing this is one of our highest priorities for the next update. It was triggered by a security update in Mojave, but we have not yet been able to work out why.

 
Posted : 15/02/2020 12:56 pm
(@ericbarker)
Posts: 20
Member
Topic starter
 

Very sorry about this. You can certify via command line, but our command line tool does not yet know how to "resume", so you would have to start over.

Thanks for the honest update, and I'm relieved to hear that my drives probably aren't all junked (that I know of). Can you tell me where I can find this Certify command line tool you mentioned? It's not in the Application bundle as far as I can tell.

Secondly, do you know whether this Mojave problem affects any of the other SoftRAID operations? The program has crashed a few times during certification, but I'm also worried that it could cause contact to be lost with the drives once I turn them into a RAID array. I often run overnight video renders, so that would be a fairly large concern.

 
Posted : 17/02/2020 8:16 am
(@softraid-support)
Posts: 8008
Member Admin
 

It can be run from Terminal.app.

You can type softraidtool help to see the commands and options.

But the terminal is not for the faint of heart, as it is fairly complex to string commands together.

Certify is pretty simple once you know the disk numbers (they are shown in the disk tiles in SoftRAID). Let's say disk3 disk4 disk5 disk6. Then the command would be:

softraidtool disk disk3 disk4 disk5 disk6 certify 3

(3 is for 3 passes)

The problem where the application quits never affects the functioning of the driver. It is simply the application losing contact with the SoftRAID Monitor, which reports errors, so the app has to be relaunched. There is no other impact.

 
Posted : 17/02/2020 10:23 am
(@ericbarker)
Posts: 20
Member
Topic starter
 

Alright, so I started over from scratch. 2 of the original 4 drives passed certification with the GUI app. Previously it had reported that 3 of 4 failed, but I stopped/restarted and two were deemed okay. The other two ERRORED out, but as you're saying this could be the GUI app misreporting due to the bug. I just added 2 more drives for a total of 6. I'll consider the two that passed to be good to go.

The IDs for the four remaining disks are: disk10, disk5, disk7, and disk6.

However, when I enter the unix command "softraidtool disk disk10 disk5 disk7 disk6 certify 3", it gives me an error that "command disk5 not recognized". It acts like the softraidtool only wants to perform on one disk at a time. So I just started 4 different terminal threads and am certifying a different drive on each one.

Question: the moment I hit enter it says, "Certifying disk at disk6 Number of passes is 3", and then nothing. I do see the lights blink, so I'm assuming it's working, but the terminal program doesn't seem to report any kind of progress. Is this normal?
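
The one-terminal-per-disk workaround described in this post could also be scripted as a loop. The sketch below is a dry run by default and only prints the commands. The single-disk form `softraidtool disk <id> certify <passes>` is inferred from this thread, not taken from softraidtool's documentation, and the `RUN` switch and log file names are made up for illustration.

```shell
#!/bin/sh
# One certify command per disk, since the tool rejected several disk IDs
# in a single invocation in this thread.
# ASSUMPTION: "softraidtool disk <id> certify <passes>" is inferred from the
# posts above; check "softraidtool help" before relying on it.
PASSES=3
DISKS="disk10 disk5 disk7 disk6"

certify_cmd() {
    # $1 = BSD disk ID, $2 = number of passes
    echo "softraidtool disk $1 certify $2"
}

for d in $DISKS; do
    cmd=$(certify_cmd "$d" "$PASSES")
    if [ "${RUN:-0}" = "1" ]; then
        # Hypothetical switch: only launches the real command when RUN=1.
        $cmd > "certify_$d.log" 2>&1 &
    else
        echo "would run: $cmd"
    fi
done
wait   # returns once every background job (if any) has exited
```

Running each disk as a background job keeps everything in one terminal window, and the per-disk log files capture whatever little output the tool produces.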

 
Posted : 20/02/2020 5:39 am
(@softraid-support)
Posts: 8008
Member Admin
 

The terminal certify has no feedback mechanism at present. Adding one is on our feature list.
(We are also trying to fix the issue that causes the application to report that it needs to quit.)
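
Until the tool reports progress itself, one workaround is a generic wrapper that records when a command started, when it ended, and its exit status. This is only a sketch: `run_logged` and the log file name are hypothetical, and nothing in this thread says whether softraidtool's exit status distinguishes a pass from a failure, so treat the logged status as a hint at best.

```shell
# Record start time, end time and exit status of any long-running command.
# $1 = log file, remaining arguments = the command to run
run_logged() {
    log=$1; shift
    echo "start $(date '+%Y-%m-%d %H:%M:%S'): $*" >> "$log"
    "$@"
    status=$?
    echo "end   $(date '+%Y-%m-%d %H:%M:%S'): exit status $status" >> "$log"
    return "$status"
}

# Example (hypothetical log name; single-disk command form as used in this thread):
# run_logged certify_disk6.log softraidtool disk disk6 certify 3
```

The end timestamp at least shows when the command returned to the prompt, which helps line it up with other events such as an enclosure disconnect.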

 
Posted : 20/02/2020 10:05 am
(@ericbarker)
Posts: 20
Member
Topic starter
 

Just to be clear: the application reported DISK ERRORS, not just that it quit. But it reported all three errors within minutes of each other after running for days. It ALSO quit multiple times during the process. So 3 out of the 4 disks reported errors.

I ran Certify again, and 2 of the 4 then succeeded.

I set the two successful disks aside, and am running the terminal command again on the remaining disks (two of which are brand new and haven't been tested at all yet).

 
Posted : 20/02/2020 3:08 pm
(@softraid-support)
Posts: 8008
Member Admin
 

If you look at the errors, or the times they occurred, you can get a clue as to whether this was hardware triggered. Errors at the same time mean hardware (cabling, most likely). If they are hours apart, it could be the enclosure. It's unlikely for multiple disks to fail, of course, so you need to consider what else could cause it. Certify should never fail.

(the app quitting is a bug, not the same thing)
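
The timing rule of thumb above can be turned into a quick check. The sketch below takes failure times as Unix epoch seconds and reports the spread between the earliest and latest. The function names and both thresholds (5 and 60 minutes) are arbitrary assumptions for illustration; the thread gives no exact cut-off between "same time" and "hours apart".

```shell
# Spread, in minutes, between the earliest and latest failure time.
# Arguments: failure times as Unix epoch seconds.
failure_spread_min() {
    min=$1; max=$1
    for t in "$@"; do
        if [ "$t" -lt "$min" ]; then min=$t; fi
        if [ "$t" -gt "$max" ]; then max=$t; fi
    done
    echo $(( ($max - $min) / 60 ))
}

# HYPOTHETICAL thresholds: 5 and 60 minutes are not from the thread.
spread_hint() {
    if [ "$1" -le 5 ]; then
        echo "near-simultaneous: check cabling first"
    elif [ "$1" -ge 60 ]; then
        echo "hours apart: consider the enclosure"
    else
        echo "inconclusive"
    fi
}

spread_hint "$(failure_spread_min 1000 1900 8200)"
```

The example times are 15 minutes and 2 hours after the first failure, so the spread is 120 minutes and the hint points at the enclosure.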

I have a beta I could make available that may fix this issue if you want to test it.

 
Posted : 20/02/2020 5:19 pm
(@ericbarker)
Posts: 20
Member
Topic starter
 

Ah, shoot. I thought from your first response that the bug was triggering a disk fail error, not just the app crash.

The failures were relatively close together considering the multi-day length of the test, but they were anywhere from 15 minutes to 2 hours apart. Some of them failed in close proximity to app crashes/resumes.

For the record: Hardware is a new ThunderBay6 enclosure with 6 new Seagate Exo 16TB drives attached via USB-C to a 2018 iMac running Mojave.

I've been running the terminal certify for about 14 hours now, with no error messages so far. But the entire process for a 16TB drive seems to take about 5 days, so we'll see. This is replacing a Drobo with 5 drives that has been relatively okay up until now, but recently started giving off some warning signs, so I thought it better to switch it out when I upgraded to 16TB drives.

 
Posted : 20/02/2020 5:38 pm
(@softraid-support)
Posts: 8008
Member Admin
 

If the app quits again, ask me for the beta, which may fix this. Certify generally runs at about 2TB/day.

It's possible we have a bug in the certify process where the app quitting trips an issue, but certify is just a terminal script, so it should not have that problem.
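
The 2TB/day figure gives a rough way to estimate how long a certify will run. The thread does not say whether that rate covers the whole run or each pass, so the sketch below takes the pass count as a parameter; the function name is made up for illustration.

```shell
# Rough certify duration in whole days, rounded up.
# $1 = disk size in TB, $2 = throughput in TB/day, $3 = number of passes
certify_days() {
    echo $(( ($1 * $3 + $2 - 1) / $2 ))
}

certify_days 16 2 1   # if 2TB/day covers the whole run
certify_days 16 2 3   # if 2TB/day applies to each of 3 passes
```

For a 16TB disk this prints 8 and 24 days respectively, so either reading means a multi-day run.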

 
Posted : 20/02/2020 8:33 pm
(@ericbarker)
Posts: 20
Member
Topic starter
 

Okay, cool, that's what I'm hoping. Yeah, I'm no stranger to terminal commands (I'm currently doing some advanced FFmpeg concatenation), so doing this without a GUI is just fine. It just took me a bit to figure out the proper BSD IDs for each disk. I'll update you in about a week once it's done.
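
For anyone else hunting for the BSD IDs: besides the disk tiles in SoftRAID, the macOS `diskutil list` command prints a `/dev/diskN` header line for each disk. A small sketch that pulls just the IDs out of that listing (the `bsd_ids` helper is made up for illustration):

```shell
# Extract BSD disk IDs (disk5, disk10, ...) from `diskutil list` output.
# Reads the listing on stdin and prints one ID per line.
bsd_ids() {
    sed -n 's|^/dev/\(disk[0-9][0-9]*\).*|\1|p'
}

# diskutil is a macOS tool, so only call it where it exists.
if command -v diskutil >/dev/null 2>&1; then
    diskutil list | bsd_ids
fi
```

Note that BSD IDs can change after a reboot or reconnect, so it is worth re-checking them before each certify run.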

 
Posted : 20/02/2020 9:57 pm
(@ericbarker)
Posts: 20
Member
Topic starter
 

Ugh... so I woke up this morning, not to a terminal error, but to 4 macOS dialog boxes saying "This Hard Drive is not recognizable and needs to be formatted". Which is true, because I've never formatted them. But what this means is that the bay must have momentarily unmounted (likely the cat bumped it). The result is that the certify didn't finish. It had run for about 3 days (which was about half of what the GUI took when it finished). The only readout the terminal generated was:

SoftRAIDTool status: certifying disk at disk6
Number of Passes is 3

That's it. Then it returned to the prompt this morning after I saw those dialog boxes. It didn't even echo a disk read error or anything. What does it normally echo when it passes or fails a certify?

 
Posted : 22/02/2020 5:49 pm
(@softraid-support)
Posts: 8008
Member Admin
 

The command prompt is what it returns to, as you might expect.

We are going to enhance this soon. (And we have a possible fix for the "SoftRAID must quit" issue.)

 
Posted : 22/02/2020 11:31 pm
(@ericbarker)
Posts: 20
Member
Topic starter
 

The command prompt is what it returns to, as you might expect.

So does that mean that they passed the certify? It never said anything about pass or fail at all.

 
Posted : 23/02/2020 5:37 pm
(@ericbarker)
Posts: 20
Member
Topic starter
 

Yeah, I tried running the GUI again, and all 6 disks failed at some point during the day today (I was out), not long after the GUI program crashed and I had to reboot. I'm just not sure what to do anymore.

I'm wondering whether the console program succeeded, and I'm wasting my time with the GUI app.

 
Posted : 24/02/2020 1:14 am