M1 Max MBP Gets Wedged, Cannot Unmount SR Arrays

29 Posts
3 Users
0 Reactions
1,682 Views
(@softraid-support)
Posts: 9200
Member Admin
 

@emp001 

I agree there is an issue here. The fact that you are losing recent file saves means there is a hang in your system and the driver is not flushing all data out. This can lead to data corruption, so you will want to track this down.

When you go to unmount the volume in Finder, I assume you cannot?

Here is a command you can try to see what is using the volume. Paste this into Terminal, replacing VolumeName with the name of your volume. If there is a space in the name, like My Volume, you need to escape it like this: My\ Volume

sudo lsof /Volumes/VolumeName
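As a concrete example (the volume name "My Volume" here is made up; substitute your own):

```shell
# "My Volume" is a hypothetical name; substitute your own volume.
# The backslash escapes the space in the path.
sudo lsof /Volumes/My\ Volume

# Each line of output names an open file on the volume; the COMMAND,
# PID, and USER columns identify the process holding it open, which
# is the process to quit before the volume will unmount.
```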

 
Posted : 31/08/2022 3:24 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

 

@softraid-support OK, I have an update. I have been trying to reproduce the issue on my MacBook Pro in a way that doesn't involve or require any commercial third-party software. I think I found it.

I have managed to create what is so far a quick, 100% reliable way to replicate the failure. Or at least *A* failure. While this is a fairly intense benchmark, I think it should technically work. And frankly, *on the same hardware it DOES work* when using a different target disk (for instance, a two-drive RAID, or a single drive SSD).

Here are the steps to reproduce:

1. Use a MacBook Pro 16" 2021 with M1 Max 32 core GPU/64GB RAM/4TB SSD. (Have not tried any other model.) Run Monterey 12.5.1.

2. Create a new user account. I called mine 'testuser'. The intent here was to minimize background processes such as the Adobe stuff or Backblaze. Log in.

3. Plug in an OWC ThunderBlade. (Mine is 4TB, made up of 4x 960GB blades.)

4. Open SoftRAID v6.3.

5. Set up your blades and give them labels first. (In my case I had to delete an existing volume; this part may vary.) Create a RAID0 using 128k blocks ("Video Editing").

6. Leave SoftRAID open.

7. Launch AJA System Test Lite.

8. Launch Blackmagic Disk Speed Test.

9. In AJA, set it to 3840x2160, 16GB, ProRes 4444, and set the Target Disk to the new RAID0. Also in the "Settings" area set it to "Runs continuously" but leave the other two items at default: Single file and Disable (recommended).

10. In Blackmagic Disk Speed Test, select the new RAID0 as its target disk. Set the file size to 5GB if it's not there already.

11. Start AJA.

12. Start Blackmagic.

13. Wait a few minutes.
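While the benchmarks run, a small watcher loop can timestamp the exact moment a blade drops, which makes it easier to correlate with the benchmark failures. This is just a sketch: the `diskutil list` default and the "NVME" label pattern match my setup and may need changing for yours.

```shell
#!/bin/sh
# Poll a device listing and report when a matching device disappears.
# LIST_CMD and PATTERN are assumptions: on macOS, `diskutil list`
# shows all disks, and "NVME" matches the labels I gave the blades.
LIST_CMD=${LIST_CMD:-"diskutil list"}
PATTERN=${PATTERN:-"NVME"}

baseline=$($LIST_CMD | grep -c "$PATTERN")
if [ "$baseline" -eq 0 ]; then
    echo "no matching devices found; nothing to watch"
    exit 0
fi
echo "watching $baseline device(s) matching '$PATTERN'"

while :; do
    now=$($LIST_CMD | grep -c "$PATTERN")
    if [ "$now" -lt "$baseline" ]; then
        echo "$(date): device count dropped from $baseline to $now"
        break
    fi
    sleep 5
done
```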

In two tests just now, I managed to encounter a failure in less than five minutes (!). One of the blades disappears, both disk benchmarks fail/stop, and of course SoftRAID throws a major error now that a drive has been "removed". I will attach a support file and a few screenshots that, I think, explain the steps pretty well.

EXTRA INFO: I ran this same test all day yesterday. But I did it using different volumes—mostly a SoftRAIDed RAID0 set of Samsung T5 USB SSDs, which offered a near-identical 4TB overall disk size. Running that test (and throwing extra I/O at it like adding Finder file copies onto the drive) did not fail after 12 hours of attempts.

While the problem here may be that I have a bad ThunderBlade (...but it's quite reliable in regular use...), my next step will be to try some other OWC/SoftRAID hardware and see what happens. I have a theory: it may be related to the number of drives in the array, or perhaps the overall number of drives mounted by the OS. Maybe I'm an outlier in the number of drives attached to an M1 Max based MacBook Pro? Perhaps that's why we don't see this issue very often (??).

In my daily work, I'm using an 8-bay RAID10 volume and ALSO often have a 4-bay ThunderBlade plugged in. Sometimes I also attach a 4-drive RAID4 (SSD), a 4-drive RAID5 (HDD), and a 4-drive RAID0 (HDD). Perhaps there is something about addressing that large number of individual disks that's causing an issue? Maybe that accounts for why this test could run for 12 hours on two disks, but immediately fails when tried against four? Just a guess.

 
Posted : 02/09/2022 12:07 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

It is looking like a faulty blade in your Thunderblade. I will try this test sequence also.

 

The fact that a disk is disappearing is not good. What you should do is give the disks SoftRAID disk labels. Then when a blade disappears, you can identify it more easily. See if it is the same blade each time (2 tests give 75% confidence it is the same blade; 3 tests, over 90%).
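Those confidence figures check out: if each failure struck a random blade out of the four, the chance of the same blade turning up in n consecutive tests by coincidence is (1/4)^(n-1), so the confidence it is genuinely the same blade is 1 minus that. A quick check of the arithmetic:

```shell
# Chance of n consecutive random failures hitting the same one of
# 4 blades is (1/4)^(n-1); confidence it is truly that blade is 1 minus that.
awk 'BEGIN {
    for (n = 2; n <= 3; n++)
        printf "%d tests: %.0f%% confidence\n", n, 100 * (1 - 0.25 ^ (n - 1))
}'
# prints:
# 2 tests: 75% confidence
# 3 tests: 94% confidence
```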

Then you can get the serial number of the bad blade and have it replaced (hopefully under warranty!).

 
Posted : 02/09/2022 1:11 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support OK, so here is another weird twist. I do have the blades labeled, though the names are very basic (NVME 1, 2, 3, 4). It was NVME 4 that kept "disappearing" on the MacBook Pro. So I tried Certifying that blade with default settings and it passed. Then I rebuilt the RAID, tried the test again, and it immediately failed again (within about 2 minutes). Same failure mode, very immediate. Well, OK.

But then...I unplugged it, moved it over to the OTHER M1 Max based computer on my desk, a Mac Studio. Plugged it in, re-created the volume with the same settings, configured the same test. And it has now been running for 2.5 hours. I'm going to keep it running for at least 12 hours, because that seems...weird to me. Same cables, same software stack, same everything. Instant failure on one, no failure on the other. ???

I'm going to try testing another OWC RAID product on the MacBook Pro to see if this behavior occurs with other devices. I'm a little concerned about damaging an HDD or losing data, but I think I can find something that'll work. If THAT fails on the MacBook Pro, but NOT the Mac Studio...I'm curious what we do in that case? I guess I will keep you posted.

But this whole thing continues to support my (our) theory that there is a hardware problem with this MacBook Pro. I have never spoken to Apple Tech Support, but I wonder if they'd be able to investigate. I did run the hardware diagnostics (Cmd-D on boot) and it seemed to think everything was fine.

 
Posted : 02/09/2022 4:40 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

While it could be the Mac, I have a theory that some behavior can trigger a blade to hang/eject.
I have asked product management and will get back to you when I get an answer.

 
Posted : 02/09/2022 5:17 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support Incidentally, after running the same test on a Mac Studio w/M1 Max, I DID eventually get a sort of "stuck" situation with SoftRAID. It was not as dramatic as the array dropping a disk. But the Mac Studio ended up being basically unable to access, mount, or unmount any SR volumes...which eventually led to forced removal of TBT3 cables and forced power off (long power button press). It took many hours to arrive at this situation, but it did get there. I had a second, similar failure later in the day on the Mac Studio when trying to export a video project using Adobe Premiere/Media Encoder. It appears to be the same issue I saw on the M1 Max powered MacBook Pro.

So I have on my desk two M1 Max based Macs that display some flavor of "strange problems" with heavy file I/O on at least two SoftRAID volumes. One is a RAID0 Thunderblade (4 NVME blades), the other is a RAID10 Thunderbay 8 (14TB HDDs x8).

Let me know if you need/want additional support files or other info to help track this down.

 
Posted : 03/09/2022 12:47 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

OK, I am researching this, thanks for the details.

 
Posted : 03/09/2022 2:58 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support Just wanted to check if there has been any progress on this issue? I have been using an Intel based Mac for all my SoftRAID needs, but hope one day to return to the M1 Max.

 
Posted : 07/11/2022 3:15 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

We have not seen anything similar in "casual" testing. I will be setting up 4 systems this week that will run 24/7 with intensive I/O, both HDD and flash, to look for any I/O issues. If anything shows up, we will address those issues immediately.

The problem with intermittent bugs is that you need to be able to reproduce them in a controlled environment in order to identify and fix them.

 
Posted : 07/11/2022 4:51 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support I understand. I'd love to hear the outcome of your testing...because in my testing (steps above to reproduce) I was able to cause a crash or hang 100% of the time on one system, and ~50% of the time on the other.

 
Posted : 07/11/2022 5:28 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

Both running Adobe software? Or what?

 
Posted : 07/11/2022 9:14 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support Yes, they both run Adobe Creative Cloud for my main work. However, my steps to reproduce don’t use the Adobe stack…the issue seems to be triggered by heavy use of a SoftRAID volume (and likely a RAID0 or RAID1+0 volume specifically). The steps above show disk benchmarks triggering a failure, and it is 100% reproducible on the laptop.

 
Posted : 07/11/2022 10:00 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

If the issue does not require a specific app like the Adobe suite to trigger it, I will find it.

 
Posted : 07/11/2022 10:53 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

Heads up: I think 13.3, due Tuesday, will fix some of your issues. And 13.4, whenever that is out, should address the DART kernel panics.

 
Posted : 22/03/2023 5:06 pm