
Double failure in Raid 1+0.. is this going to recover?

27 Posts
2 Users
0 Reactions
2,284 Views
(@wgdixon)
Posts: 23
Member
Topic starter
 

I'll just start out by saying there's suddenly a lot of funky sh*t happening on my Mac Pro Softraid 1+0 setup.

Here's the configuration: 

MacOS Mojave 10.14.6, Mac Pro 5,1, 12 core 3.46GHz, 96GB RAM, Softraid 5.8.4

2 internal drives (SATA), "Internal 1" and "Internal 2"

2 external drives (eSATA in an enclosure) "External 1" and "External 2"

The internal drives are striped, as are the external drives, so each pair makes a single 12TB stripe; then the two striped 12TB RAID volumes are mirrored.  Usually after booting the internal drives are the primary mirrors.
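As a quick check on the capacity arithmetic: four 6 TB drives, with every block stored twice, leave roughly 12 TB usable. The sketch below is illustrative only; the function name is hypothetical and it ignores formatting overhead.

```python
def raid10_usable_tb(drive_tb: float, drive_count: int, mirror_copies: int = 2) -> float:
    """Usable capacity when every block is stored `mirror_copies` times."""
    if drive_count % mirror_copies != 0:
        raise ValueError("drive count must be a multiple of the mirror copies")
    return drive_tb * drive_count / mirror_copies


# Four WD Black 6 TB drives, two copies of everything.
usable = raid10_usable_tb(drive_tb=6, drive_count=4)
print(f"{usable:g} TB usable")
```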

All four drives are WD Black 6TB approaching 2 years old, frankly my computer is sleeping way more than it's awake

I had been leaving the system up and running for the past few days (sleep disabled) trying to remedy some unrelated issues with Mail search.

Today I woke up the monitor to see that one of the disks in the RAID was being reported as no longer responding. The log message (SoftRAID log) said it stopped responding while in use.  It reported disk0, which at the moment is appearing as internal drive 2.  That would be the primary mirror of one of the disks making up the internal 12TB striped drive.

Simple enough, right?  I powered down, assuming the drive would reappear (without checking the actual physical status with System Report or Disk Utility <slaps head>).

OK, powered back up.  NOW, Softraid is reporting all four drives alive (2 internal, 2 external) and disk0 (Internal 2 primary mirror pair 2) is showing degraded - out of sync, and needs rebuilding.   NOT ONLY THAT, but now External 1 (secondary mirror pair 1) is ALSO reporting out of sync and degraded, needs rebuilding.

So basically on my 2 disk stripe, the primary mirror of one of the disks in the stripe is out of sync and the secondary mirror of the _other_ disk in the stripe is out of sync.  I have tried to start a rebuild but nothing seems to be happening.  I did the kextstat:

$ kextstat -b com.softraid.driver.SoftRAID

Index Refs Address            Size       Wired      Name (Version) UUID <Linked Against>

  101    0 0xffffff7f8100f000 0x3c000    0x3c000    com.softraid.driver.SoftRAID (5.8.4) D7D1A553-AAAE-33DD-8344-D23C859380A5 <27 6 5 3>

So I assume it is loaded.
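For anyone who wants to script that check, here is a minimal Python sketch that pulls the driver version out of the kextstat line pasted above. The helper name is hypothetical, and a match only shows that the driver is loaded; it says nothing about whether a rebuild is running.

```python
import re

# Output line pasted above from `kextstat -b com.softraid.driver.SoftRAID`.
SAMPLE = (
    "  101    0 0xffffff7f8100f000 0x3c000    0x3c000    "
    "com.softraid.driver.SoftRAID (5.8.4) "
    "D7D1A553-AAAE-33DD-8344-D23C859380A5 <27 6 5 3>"
)


def loaded_version(kextstat_output: str, bundle_id: str):
    """Return the version string if bundle_id appears in the output, else None."""
    pattern = re.compile(re.escape(bundle_id) + r"\s+\(([^)]+)\)")
    for line in kextstat_output.splitlines():
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


version = loaded_version(SAMPLE, "com.softraid.driver.SoftRAID")
print(version)
```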

Question:  one would think this is a recoverable scenario.  Is SoftRAID going to fix it with a rebuild?

Question: why am I not getting any indication of a rebuild in progress?
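The reasoning behind the first question can be sketched as a toy model: if every mirror pair still has one in-sync member, a good copy of all the data exists somewhere. This is only an illustration of that reasoning, not how the SoftRAID driver decides anything; the pairing of Internal 1 with External 1 and Internal 2 with External 2 is an assumption based on the "mirror pair 1" and "mirror pair 2" labels above.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    in_sync: bool


def pairs_without_good_copy(mirror_pairs):
    """Return the mirror pairs that have no in-sync member left."""
    return [pair for pair in mirror_pairs if not any(d.in_sync for d in pair)]


# State reported after the reboot: one member out of sync in each pair.
reported = [
    (Disk("Internal 1", True), Disk("External 1", False)),   # mirror pair 1 (assumed)
    (Disk("Internal 2", False), Disk("External 2", True)),   # mirror pair 2 (assumed)
]
broken = pairs_without_good_copy(reported)
print("pairs with no good copy:", len(broken))
```

In this model the reported state leaves no pair without a good copy, which is why it looks recoverable on paper; it cannot say whether the driver will actually start a rebuild.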

Along with this I'm getting other weird behaviors.. my keychain was reported corrupted when I logged in and a new one created.. funkiness with iCloud connections, Safari not remembering history, Firefox not loading extensions.. I'm afraid to touch it until everything is rebuilt and I can run Disk Util or DiskWarrior on it, but it seems the RAID stuff needs to get resolved first.  What's going on here?

 

 
Posted : 31/10/2020 11:38 pm
(@softraid-support)
Posts: 9200
Member Admin
 

Attach a SoftRAID tech support file.

 

 
Posted : 01/11/2020 11:26 am
(@wgdixon)
Posts: 23
Member
Topic starter
 

So things have either gotten worse, or better from a recovery standpoint.  My Mac _is_ telling me it cannot repair the disk /Users (this is the RAID 1+0 volume).  Furthermore, after some reboots during troubleshooting, I hit the issue where if I allow it to reboot rather than power all the way down, it loses the internal drives, thus separating the internal drives from the external drives (so now I have two nearly-identical stripe drives that no longer know about mirroring each other).  Gah!  Currently in the process of trying to repair them with DiskWarrior, which is, well, trickier than it should be on my machine.

Tech support attached.

 
Posted : 01/11/2020 12:26 pm
(@wgdixon)
Posts: 23
Member
Topic starter
 

So far I have managed to get DiskWarrior and then Disk Utility happy with the external stripe, trying to do the same on the internal stripe but based on other forum threads it appears I need to basically select the external as my favorite (it had fewer errors according to DW) then rename the internal stripe then re-add it to the 1+0.   So it seems if DW is being slow about finishing the internal stripe I should just punt, have SoftRaid reformat it and then re-add to 1+0 anyway.  Just my thoughts if DW continues to spin its wheels on trying to recover overlapped files

 
Posted : 01/11/2020 2:25 pm
(@wgdixon)
Posts: 23
Member
Topic starter
 

OK, so now after identifying the "best" stripe I have (external pair), I have reformatted both drives associated with the "bad" internal pair, deleted (as far as I know) missing disks from the RAID 1+0, and re-added the newly formatted internal drives to the RAID 1+0.   HOWEVER, IT WILL NOT REBUILD!  I have included a picture of the SoftRAID window showing the status.  The thing that strikes me is that one of my "good" external disks still shows "out of sync", one does not, and the new ones (obviously) show out of sync.   I suspect that may be what is preventing the rebuild.  The volume is mounted RW and DiskWarrior reports the individual drives are writable.   I seem to have a working stripe at least, but this is a very tenuous situation and I need the rebuild to get going now.   I have also added another tech support file that should represent where I am now with this.

 
Posted : 01/11/2020 4:36 pm
(@softraid-support)
Posts: 9200
Member Admin
 

The problem is 3 disks are out of sync, so the SoftRAID driver does not know what to do.

I would backup, erase the volume with SoftRAID, then restore.

There may be another way, but I would be concerned about data corruption, as I do not know the full history of this volume, so it is not worth risking your data.

 
Posted : 02/11/2020 12:44 pm
(@wgdixon)
Posts: 23
Member
Topic starter
 

I figured as much and began the recreate process last night.  The question I have is, how does a primary disk in a mirror pair get a status of "out of sync"?  My hunch is that there is some sort of timing issue or something missing when there is a status change going on.  Case in point, the two now-primary drives are the external drives, which were formerly the secondary drives before this all started.   I can't recall the exact statuses before, but you may recall that there was one external listed as out of sync and one internal listed as out of sync (remember externals originally are secondary, internals primary) before the restart that dropped the internal drives off the map completely.  Ideally, finding a way for SoftRAID to notice when a disk becomes a primary mirror, and then unsetting the out-of-sync status, may be helpful.  Then again, it may be a warning flag to leave it and force the user to rebuild everything.   From what I can tell, yes, there is some corruption and the directory structures were whacked enough that Disk Utility couldn't fix it but DiskWarrior could.  Still, I believe there are some damaged files as a result.  Alternately, a way to manually force, through the SoftRAID program, a primary disk's status NOT to be out-of-sync, so that a rebuild would happen and we could examine the results and move forward from there, would be useful, since it is a position that we now know can happen.  Just some feedback.. I could be all wrong and it may well be an indicator the world is hosed and recreating everything from scratch and using a backup may have been the only option regardless.

 
Posted : 02/11/2020 12:52 pm
(@softraid-support)
Posts: 9200
Member Admin
 

In a RAID 1 volume, the primary cannot be "out of sync" as it is the master of record. The only way to change an out of sync secondary is a rebuild.

 

Usually when DiskWarrior can repair a volume, it recovers everything, except files it puts in a "damaged files" folder, or a "rescued items" folder. If you did not get anything like that, you should be OK.

If you are "concerned" a mirror is not in sync, perform a "validate" which will update the secondary by force, as if it were a forced rebuild across the whole volume.

 

Note that DiskWarrior does not have access to independent disks, only the published volume. If a volume is out of sync, then essentially DiskWarrior is only seeing the primary disk. SoftRAID is a layer below the file system, like the individual platters on a HDD mechanism.


 
Posted : 02/11/2020 9:37 pm
(@wgdixon)
Posts: 23
Member
Topic starter
 

Well, that's the problem, as you saw in the screen shot.. the primary was noted "out of sync"

The behavior of softRAID on my system seems to be totally unpredictable.   So I got everything rebuilt, a new 1+0 disk setup, all synced up, recovered stuff that was lost, ready to roll.  Logging in, everything looked normal, more or less.   

Thinking I could outsmart my Mac's tendency to spin up the internal drives too slowly on a restart (vs a shutdown and power up), I made the EXTERNAL drives the primary mirror and the internal drives, secondary.  I figured that now, if I ever did a restart, like during a software upgrade, the external drives would retain their primary status since they would be seen, and the internal drives, which do not mount when this happens, would remain secondary..   thinking that a subsequent power down and things would come up as desired without a fuss or split mirrors next time.  WRONG.  

Here's what happened after I got it all nice and shiny.   I then did a software update, which I had been deferring, but figured while stuff was still fresh and nothing else new was done yet, I'd go ahead.   And this time test my theory above.  So software update ran, restarted, finished its thing, then I got the login window.   I shut down without logging in.  Yes, there was a message that disks were missing but that is not unexpected after a restart.  Powered up, voilà!   SPLIT MIRROR again.  Internal and external both think they are primary.  Why is that?

Furthermore, can I recover from the split without having to tear down the 1+0 and rebuild it all over again?

It's impractical to babysit software updates to try and catch them when they restart the computer and poke the power button. Any bright ideas on how to let this setup run through a restart with some chance of not having to recreate things over and over again?

 
Posted : 03/11/2020 12:32 pm
(@wgdixon)
Posts: 23
Member
Topic starter
 

After searching the forums it appears it's another round of remove the missing disks and re-add them and another sync.   My next bright idea for doing software updates is to go back to the practice of making the internal drives primary.  So when I know I'm going to do a software update, then I would:

power down,

turn off the external enclosure with the secondary drives, 

power up, the internal drives mount and remain primary

do the software update, let it restart.   Now, on a warm restart those internal large drives never mount (known issue with Mac Pro 5,1 and large drives), but I let it boot up fully anyway, and then:

power down the Mac

Power up the external drive

Power up the Mac

Theory here is that through the software update induced restart, I've not allowed the wrong pair to take on primary role for any reason, the internal drives will come up on power up as primary as they should and the externals secondary.  They may need to sync a little but that _should_ be it.   Think that will work?

 
Posted : 03/11/2020 12:50 pm
(@softraid-support)
Posts: 9200
Member Admin
 

The problem with your plan is if the disks are not available at startup, the volume will "fail over" to the external. You can move the primary disks to the internal after startup/rebuilding; you need to pay attention, however.

 
Posted : 03/11/2020 10:06 pm
(@softraid-support)
Posts: 9200
Member Admin
 

What you can comfortably do is power up the externals well after startup. (after they mount). The RAID will rebuild quickly with fast mirror rebuilds. I think that is your best procedure.
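To make the failover behavior discussed above concrete, here is a toy Python model of it. It is a simplification built only from the description in this thread (keep the configured primary if it is visible at startup, otherwise fail over), not SoftRAID's actual logic; the function and stripe names are hypothetical.

```python
def elect_primary(configured_primary, visible_at_startup):
    """Toy rule: keep the configured primary if it is visible at startup,
    otherwise fail over to another visible stripe. None means no stripe
    is visible, so the volume cannot mount."""
    if configured_primary in visible_at_startup:
        return configured_primary
    candidates = sorted(visible_at_startup)
    return candidates[0] if candidates else None


# Cold boot with everything powered: the internals stay primary.
cold_boot = elect_primary("internal", {"internal", "external"})
# Warm restart where the internals do not appear: the externals take over.
warm_restart = elect_primary("internal", {"external"})
# Externals powered up only after startup: the internals stay primary.
externals_late = elect_primary("internal", {"internal"})

print(cold_boot, warm_restart, externals_late)
```

The last case corresponds to the suggested procedure: with only the internals visible at startup, nothing else can take over the primary role, and the externals rejoin afterwards as secondaries that need a rebuild.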

 
Posted : 03/11/2020 10:07 pm
(@wgdixon)
Posts: 23
Member
Topic starter
 

Thanks.. that aligns with my theory.  Maybe one day someone will find a cure for the slow internal drive posting.. one can dream!

 
Posted : 03/11/2020 10:11 pm
(@wgdixon)
Posts: 23
Member
Topic starter
 

I can't catch a break right now.. I set the internal drives as primary, everything is up and running.  Hooray.   Last thing to do is re-sync dropbox.  I left that running last night and this morning I wake up to a login screen.  Uh oh.  So I log in as my admin user (does not use the RAID) and lo and behold I've had a kernel panic somewhere in the Apple IO driver system.  Great.  And yes, the internal drives are not present, the externals have taken over as primary mirror.   Is there *ANYTHING* I can do to get the RAID back in service without yet another delete-the-mirror-drives-and-let-it-resync-another-8-hours before I reboot?

 
Posted : 04/11/2020 8:13 am
(@softraid-support)
Posts: 9200
Member Admin
 

If you shut down, restart, there is a good chance SoftRAID will recover. We try when possible to put it back together, but sometimes it is not possible. After a restart, though, often the driver does not have enough info to put it together.

But note that the externals may be marked the new primary disks if SoftRAID auto recovers, so you would need to set Primary disk.

 

Do some investigation on your internal drives and see if they have TLER enabled and, if so, whether you can find OEM instructions on disabling TLER. It may solve the cause of the problem.

 
Posted : 04/11/2020 1:52 pm
Page 1 / 2