Multiple i/o errors, disk "failed" then after restart 0 i/o errors
Hi. I created a new RAID 5 volume in an OWC Thunderbay enclosure using SoftRAID 6.2 and macOS 12.1 on a Late 2015 iMac. I initialised and verified the 4 brand new WD 2TB Black drives in SoftRAID before creating the volume. I have been copying over files from another external drive to the RAID and an error message came up saying a disk which was used for the volume had encountered a write error. I clicked OK and the next window said a disk which was used for the volume had encountered 2 or more read/write errors. I clicked OK and launched SoftRAID and a window popped up saying a disk which was used for the volume has encountered multiple read or write errors and has been marked "failed". I restarted the iMac and updated the OS to macOS 12.2. Now all the disks in the volume say 0 i/o errors and there are no warnings from SoftRAID. The disk that was marked as "failed" had 10 i/o errors and 0 reallocated sectors at the time of failure but now it reads as 0 i/o errors and 0 reallocated sectors. Can you tell me if perhaps the warnings were in error or what else I should do? Screenshots of the error messages and SoftRAID window showing the disk attached.
SoftRAID reports all IO errors and enters some of the error events in the SoftRAID log. (When there are many, as you noticed, only a couple are entered, to prevent the log from filling with IO errors.)
The errors are also written to the SoftRAID status partition, generally at shutdown, unmount, or restart.
If there is an unclean shutdown, or the drives cannot be written to, the errors cannot be written to disk. Perhaps that is what happened.
I would take the two disks and "certify" them. It is the only test that ensures reliable read/write on drives.
It takes a couple of days on those drives. But if they pass, you have good disks, and any further errors are occurring in the communication channels, like the bus, enclosure, cable, etc. (or even macOS, if the file pointers are damaged and pointing to unavailable locations).
If all the errors are on one disk, it is likely to be the disk. I would swap it with another disk as a test of the enclosure: if the disk is at fault, it will generate errors in whatever slot it is in. If the errors stay with the slot, it is likely a problem in the enclosure.
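The reasoning behind the swap test can be sketched in a few lines of Python. This is only an illustration of the logic, not something SoftRAID provides: the function name, the disk and bay labels, and the idea of listing errors as (disk, slot) pairs are all assumptions made for the example.

```python
def interpret_swap_test(suspect_disk, original_slot, errors_after_swap):
    """Interpret the result of moving a suspect disk to a different slot.

    errors_after_swap is a list of (disk, slot) pairs, one per I/O error
    seen after the suspect disk was moved out of original_slot.
    All names here are hypothetical labels chosen by the user.
    """
    # Did errors follow the suspect disk to its new slot?
    follows_disk = any(d == suspect_disk for d, _ in errors_after_swap)
    # Did errors stay in the old slot, now holding a different disk?
    stays_in_slot = any(
        s == original_slot and d != suspect_disk for d, s in errors_after_swap
    )

    if follows_disk and not stays_in_slot:
        return "disk"
    if stays_in_slot and not follows_disk:
        return "enclosure slot"
    # No errors yet, or errors in both places (cable, bus, etc.).
    return "inconclusive"


print(interpret_swap_test("disk C", "bay 3", [("disk C", "bay 1")]))
```

If errors show up both on the moved disk and in the old slot, or not at all, the swap by itself does not settle it, and the cable or bus is worth checking next.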
Thanks for getting back to me. This is really unnerving. I'm regretting not forking out the extra cash and getting a single 8TB Thunderblade to replace this Thunderbay.
When I restarted the iMac after SoftRAID said the disk had failed, it was an unclean shutdown as the Finder was hanging with the file transfer interrupted. When you say "If there is an unclean shutdown, or the drives cannot be written to, the errors cannot be written to disk." do you mean the errors would no longer be remembered by SoftRAID after the restart because of the unclean shutdown and it would therefore be reporting 0 errors when there were, in fact, 10 errors? I will certify all the disks. (There are 4.) Will the certify process affect my files on the disk? Prior to installing these new drives I had the Thunderbay temporarily in a cupboard with the door open for ventilation. Unfortunately my wife closed the door one night and I thought it might have cooked the drives as they were failing one at a time with i/o errors until I decided to replace them all. I see from the Monterey thread that there are reports of possibly incorrect i/o errors. Is it possible that those errors were incorrect as well? Finally, I am able to access the "certify" process for another drive that shows up in SoftRAID but is not part of my RAID 5 and is in another enclosure. Will the certify process work as well with drives in that enclosure?
Sorry, I read up on certify and I will transfer the data on the volume to another drive before I use certify. However, I would still like to know the answer to my questions about certifying a drive in a different enclosure and if it's possible that the reported errors were incorrect and are an issue with SoftRAID and Monterey. Thanks.
All IO errors are "correct". An IO error means a command was sent to a drive to perform a task, and there was an error. The cause of the error can vary.
SoftRAID tracks IO errors in the log and on the disk. If there was a write error, then it is possible the error count could not be saved. These errors are written to disk at shutdown, so the same thing can happen if the shutdown is not clean. (You can change a setting in preferences to save errors more regularly.)
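To picture why a count of 10 can come back as 0, here is a minimal Python sketch of a counter that lives in memory and is only saved at a clean shutdown. It is a toy model under that assumption, not SoftRAID's actual code; the class and method names are made up for the illustration.

```python
class ErrorCounter:
    """Toy model: errors are counted in memory and saved only on demand."""

    def __init__(self):
        self.in_memory = 0  # what the app shows while running
        self.saved = 0      # what survives a restart

    def record_error(self):
        self.in_memory += 1

    def save(self, disk_writable=True):
        # The save itself can fail if the disk cannot be written to.
        if disk_writable:
            self.saved = self.in_memory
        return disk_writable

    def restart(self, clean, disk_writable=True):
        # A clean shutdown saves first; an unclean one skips the save.
        if clean:
            self.save(disk_writable)
        self.in_memory = self.saved


counter = ErrorCounter()
for _ in range(10):
    counter.record_error()

counter.restart(clean=False)  # forced restart while the Finder was hung
print(counter.in_memory)      # prints 0
```

In this model, saving more often narrows the window in which counts can be lost, which is the point of the preference mentioned above.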
An IO error does not always mean a disk failure. SMART failure is a 100% indicator. Reallocated sectors, etc, indicate a failing drive. IO errors need more investigation, as many things can cause them. But do not ignore IO errors.
Certify can be done on any disk, except that the XT version of SoftRAID restricts SoftRAID actions to certain OWC enclosures. Certify does not use the SoftRAID driver; it is a low-level macOS script.
Maybe you need to put a spacer on the cupboard door, so that it keeps at least some gap if the door gets closed.
@softraid-support Thanks. I understand a bit better now. The cupboard thing was a dumb thing to do. It's out of it now. I ran certify on all the disks I intend to use for the RAID5. I chose 3 passes but not the random (read?) test because I was still concerned about overheating. All disks passed certify but now I'm wondering if I should have added the random test. It's in an OWC Thunderbay 4 and has a fan. Is the random test important? If so I'll run certify again and include it. I'm also now wondering if the enclosure might be at fault (affected possibly by the heat) and that when SoftRAID was reporting i/o errors on the old drives, it was actually the enclosure (3 of the 4 drives in the Thunderbay reported i/o errors over a 3 day period). I am going to buy a new Apple TB2 cable today as well to eliminate the cable as the possible cause of the i/o errors. I'll also run certify on the old drives in a Voyager 3 dock. Is it safe to run the random test in certify on a bare drive in that dock? Thanks again.
With regard to the random/stress test I suppose I'm asking what that will tell me that the 3 pass certify didn't.
You probably don't need to run the random access test. What it does is ensure the disk can be read and written under heavy stress. You do not want to run it in an enclosure without cooling, like a dock, because the drive may get hot.