
Why am I Getting Multiple Certify Disk Failures?

(@softraid-support)
Posts: 9200
Member Admin
Topic starter
 

Q) I am getting repeated failures when I certify brand new disks. Could this be a SoftRAID bug?

Certify Disk is a rigorous test of drives. The Certify Disk function does NOT use the SoftRAID driver, so a certify failure cannot be caused by a "SoftRAID bug".

There are no known bugs in the code we use to certify. This code has not needed to be updated in 10 years; it is very stable.

Whenever a failure is reported during a disk certify, the error indicates a hardware failure. The Certify Disk command writes a pattern to the disks, then reads it back. A system that fails certify is unreliable and should not be used for important data.
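
The write-then-read-back idea can be illustrated with a minimal Python sketch. This is not SoftRAID's certify code, which is not shown in this post. The function names, the 1 MiB block size, and the pattern itself are assumptions chosen for illustration, and the sketch works on an ordinary scratch file rather than a raw disk. On a real system the read-back may also be served from the operating system's cache rather than the disk, which a real certify tool has to avoid.

```python
import os
import tempfile

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block; an arbitrary choice for this sketch


def make_pattern(block_index, size=BLOCK_SIZE):
    """Build a deterministic block whose contents depend on its position."""
    seed = block_index.to_bytes(8, "little")
    return (seed * (size // len(seed) + 1))[:size]


def write_pattern(path, block_count):
    """Write block_count pattern blocks to a scratch file (hypothetical helper)."""
    with open(path, "wb") as f:
        for i in range(block_count):
            f.write(make_pattern(i))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push buffered data to the device


def verify_pattern(path, block_count):
    """Read the file back and return the indexes of blocks that do not match."""
    bad_blocks = []
    with open(path, "rb") as f:
        for i in range(block_count):
            if f.read(BLOCK_SIZE) != make_pattern(i):
                bad_blocks.append(i)
    return bad_blocks


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scratch = os.path.join(tmp, "scratch.bin")
        write_pattern(scratch, 4)
        print("mismatched blocks:", verify_pattern(scratch, 4))
```

Because each block's contents depend on its position, a block that comes back with another block's data is detected as well as one that comes back corrupted.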

In the vast majority of cases, a certify failure is a failure of the individual disk. Disks are sold untested and are not burned in. A Certify Disk command acts as a "burn in" test for new or used drives.

Q) Why did multiple disks fail to Certify?
If you have multiple Certify failures, then you need to consider other possible causes for the failures. Note: We have had users report large numbers of failures in batches of drives, in one case as many as 24 out of 32 drives failing to certify. This user's hardware consisted of Thunderbolt enclosures, so we concluded that the disks were in fact faulty, and all were replaced by the manufacturer. (No, we will not mention the brand, but these were an early batch of newly released, ultra-high-capacity disks, what are often described in the trade as "bleeding edge" mechanisms.)

If you have multiple disks failing, especially with different batches of drives and different brands, then you may have an unreliable system.

If certify is failing during the Random Access portion, with disks which passed certify, this often indicates the drives are not cooled adequately. Heat can destroy drives, and the random access test causes the drives to get very hot. They must be in well-cooled enclosures to use the random access tests.
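
This post does not describe how the Random Access portion is implemented. As a rough illustration of what random-access verification looks like in general, the sketch below fills a scratch file with position-dependent blocks and then reads blocks back in random order. The function names, the 4096-byte block size, and the fixed random seed are all assumptions made for this example. Random reads make a mechanical drive seek constantly, which is consistent with the heavy load and heat described above.

```python
import os
import random
import tempfile

BLOCK_SIZE = 4096  # small blocks so that every read needs its own seek


def expected_block(index):
    """Return the bytes that block number `index` should contain."""
    return index.to_bytes(8, "little") * (BLOCK_SIZE // 8)


def fill(path, block_count):
    """Write block_count position-dependent blocks to a scratch file."""
    with open(path, "wb") as f:
        for i in range(block_count):
            f.write(expected_block(i))


def random_read_check(path, block_count, reads, seed=0):
    """Read `reads` randomly chosen blocks and count the ones that mismatch."""
    rng = random.Random(seed)  # fixed seed so a failing run can be repeated
    failures = 0
    with open(path, "rb") as f:
        for _ in range(reads):
            i = rng.randrange(block_count)
            f.seek(i * BLOCK_SIZE)
            if f.read(BLOCK_SIZE) != expected_block(i):
                failures += 1
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scratch = os.path.join(tmp, "scratch.bin")
        fill(scratch, 256)
        print("failed reads:", random_read_check(scratch, 256, reads=1000))
```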

Another reason disks can fail random access testing is if the bus is not reliable. A "bus" here means the whole path to the disk: the computer, the type of bus (USB/FireWire/SATA/Thunderbolt), cabling, enclosure, etc.

In our experience, any of these can cause disks to fail to certify:
° Many PCI cards do not handle multiple read and write threads at the same time (called multi-threaded I/O).
° Inexpensive USB/FireWire cables
° Long cables can cause I/O problems
° Placing the enclosure close to another power source, even the Mac
° External enclosures with inadequate power supplies
° Port multipliers that cannot handle heavy I/O loads
° USB Hubs
° Long chains of devices (FireWire/USB)
° RAM incompatibilities
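
To make the multi-threaded I/O item above concrete, here is a small sketch, assuming Python's standard `concurrent.futures` thread pool, in which several workers write and read back their own scratch files at the same time. The helper names, the worker count, and the 64 KiB chunk size are hypothetical, and this is not how SoftRAID exercises the hardware. It only shows the kind of overlapping read and write load that a weak card, hub, or port multiplier may not handle.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024  # 64 KiB per write; an arbitrary size for this sketch


def write_and_verify(path, worker_id, chunks):
    """Write a worker-specific payload to `path`, then read it back."""
    payload = bytes([worker_id % 256]) * CHUNK
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push buffered data to the device
    with open(path, "rb") as f:
        for _ in range(chunks):
            if f.read(CHUNK) != payload:
                return False
    return True


def stress(directory, workers=4, chunks=16):
    """Run write_and_verify in `workers` threads at once; one result per worker."""
    paths = [os.path.join(directory, f"worker{n}.bin") for n in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_and_verify, path, n, chunks)
            for n, path in enumerate(paths)
        ]
        return [future.result() for future in futures]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("per-worker results:", stress(tmp))
```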

We have seen budget priced multi-disk enclosures with inadequate power supplies cause I/O problems. Similarly, with SATA enclosures, the port multipliers used may be at fault. We have also had many experiences with eSATA PCI cards that could not reliably handle intensive disk activity.

Q) What do I do if I suspect unreliable system hardware?

This depends on what hardware you have. You need to consider everything. We recommend you test components in this order:
Replace the disk(s)
Replace all cabling
Move enclosures further from the computer and power sources

If those steps do not work, then:
Remove/replace any USB hubs
Replace the SATA card with a brand-name card (Sonnet/LaCie/Newer)
Replace the enclosure, especially if it is a budget SATA enclosure

Isolating the source of read/write failures can be time consuming.
When possible, we recommend moving all your disks into Thunderbolt enclosures, which have proven to be far more reliable than previous bus technologies.

Q) How can I test for RAM problems?
We have learned over the years that mixing different brands of RAM can have an impact on reliability. For example, Crucial (Micron) RAM tends not to be compatible with Apple brand RAM in the same machine. So try to stick to using the same brand of RAM.
Testing RAM can be done with the Apple Hardware Test (AHT), or using third party applications.
The Apple Hardware Test can be accessed on most modern Macs by holding the D key or Option-D at startup. Older Macs shipped with a special Hardware Diagnostic DVD.
Note: Many users consider the Apple RAM test to be minimal and recommend third party test applications.

There are several excellent third party applications to test RAM:
Memory Tester: http://diglloydtools.com/memorytester.html
Techtool Pro: http://www.micromat.com/products/techtool-pro
Memtest: http://www.memtestosx.org/
Memtest86: http://www.memtest86.com/

Q) Why does other software I am running indicate the disks are OK?

The tests other vendors run are not intensive and generally perform only a pass/fail check, which is not very informative.

Q) Can I keep using my system even if I do not resolve the cause of the certify disk failures?

The unreliability of your system is a real failure; it means you will eventually have data corruption when you put it into use. You may not know whether the cause is the drives or another component.

All SoftRAID can do is report that the system you tested is not reliable.

There are no simple tests for faulty enclosures or USB/SATA buses, except trial and error.

 
Posted : 04/04/2016 3:33 pm