M1 Max MBP Gets Wedged, Cannot Unmount SR Arrays

29 Posts
3 Users
0 Reactions
1,682 Views
(@softraid-support)
Posts: 9200
Member Admin
 

@emp001 

I agree there is an issue here. The fact that you are losing recent file saves means there is a hang in your system and the driver is not flushing all data out. This can lead to data corruption, so you will want to track this down.

When you go to unmount the volume in Finder, I assume you cannot?

Here is a command you can try to see what is using the volume. Paste this into Terminal, replacing VolumeName with the name of your volume. If there is a space in the name, like My Volume, you need to escape it like this: My\ Volume

sudo lsof /Volumes/VolumeName
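As a concrete example (the volume name "My Volume" here is made up; substitute your own):

```shell
# "My Volume" is a hypothetical name; substitute your own volume.
# The backslash escapes the space in the path.
sudo lsof /Volumes/My\ Volume

# Each line of output names an open file on the volume; the COMMAND,
# PID, and USER columns identify the process holding it open, which
# is the process to quit before the volume will unmount.
```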

 
Posted : 31/08/2022 3:24 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

 

@softraid-support OK, I have an update. I have been trying to reproduce the issue on my MacBook Pro in a way that doesn't involve or require any commercial third-party software. I think I found it.

I have managed to create what is so far a quick, 100% reliable way to replicate the failure. Or at least *A* failure. While this is a fairly intense benchmark, I think it should technically work. And frankly, *on the same hardware it DOES work* when using a different target disk (for instance, a two-drive RAID, or a single drive SSD).

Here are the steps to reproduce:

1. Use a MacBook Pro 16" 2021 with M1 Max 32 core GPU/64GB RAM/4TB SSD. (Have not tried any other model.) Run Monterey 12.5.1.

2. Create a new user account. I called mine 'testuser'. The intent here was to minimize background processes such as the Adobe stuff or Backblaze. Log in.

3. Plug in an OWC ThunderBlade. (Mine is 4TB, made up of 4x 960GB blades.)

4. Open SoftRAID v6.3.

5. Set up your blades and give them labels first. (In my case I had to delete an existing volume; this part may vary.) Create a RAID0 using 128k blocks ("Video Editing").

6. Leave SoftRAID open.

7. Launch AJA System Test Lite.

8. Launch Blackmagic Disk Speed Test.

9. In AJA, set it to 3840x2160, 16GB, ProRes 4444, and set the Target Disk to the new RAID0. Also in the "Settings" area set it to "Runs continuously" but leave the other two items at default: Single file and Disable (recommended).

10. In Blackmagic Disk Speed Test, select the new RAID0 as its target disk. Set the file size to 5GB if it's not there already.

11. Start AJA.

12. Start Blackmagic.

13. Wait a few minutes.
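While the benchmarks run, a small watcher loop can timestamp the exact moment a blade drops, which makes it easier to correlate with the benchmark failures. This is just a sketch: the `diskutil list` default and the "NVME" label pattern match my setup and may need changing for yours.

```shell
#!/bin/sh
# Poll a device listing and report when a matching device disappears.
# LIST_CMD and PATTERN are assumptions: on macOS, `diskutil list`
# shows all disks, and "NVME" matches the labels I gave the blades.
LIST_CMD=${LIST_CMD:-"diskutil list"}
PATTERN=${PATTERN:-"NVME"}

baseline=$($LIST_CMD | grep -c "$PATTERN")
if [ "$baseline" -eq 0 ]; then
    echo "no matching devices found; nothing to watch"
    exit 0
fi
echo "watching $baseline device(s) matching '$PATTERN'"

while :; do
    now=$($LIST_CMD | grep -c "$PATTERN")
    if [ "$now" -lt "$baseline" ]; then
        echo "$(date): device count dropped from $baseline to $now"
        break
    fi
    sleep 5
done
```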

In two tests just now, I managed to encounter a failure in less than five minutes (!). One of the blades disappears, both disk benchmarks fail/stop, and of course SoftRAID throws a major error now that a drive has been "removed". I will attach a support file and a few screenshots that, I think, explain the steps pretty well.

EXTRA INFO: I ran this same test all day yesterday. But I did it using different volumes—mostly a SoftRAIDed RAID0 set of Samsung T5 USB SSDs, which offered a near-identical 4TB overall disk size. Running that test (and throwing extra I/O at it like adding Finder file copies onto the drive) did not fail after 12 hours of attempts.

While the problem here may be that I have a bad ThunderBlade (...but it's quite reliable in regular use...), my next step will be to try some other OWC/SoftRAID hardware and see what happens. I have a theory: it may be related to the number of drives in the array, or perhaps the overall number of drives mounted by the OS. Maybe I'm an outlier in the number of drives attached to an M1 Max based MacBook Pro? Perhaps that's why we don't see this issue very often (??).

In my daily work, I'm using an 8-bay RAID10 volume and ALSO often have a 4-bay ThunderBlade plugged in. Sometimes I also attach a 4-drive RAID4 (SSD), a 4-drive RAID5 (HDD), and a 4-drive RAID0 (HDD). Perhaps there is something about addressing that large number of individual disks that's causing an issue? Maybe that accounts for why this test could run for 12 hours on two disks, but immediately fails when tried against four? Just a guess.

 
Posted : 02/09/2022 12:07 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

It is looking like a faulty blade in your Thunderblade. I will try this test sequence also.

 

The fact that a disk is disappearing is not good. What you should do is give the disks SoftRAID disk labels. Then when a blade disappears, you can identify it more easily. See if it is the same blade each time (2 tests give 75% confidence it is the same blade; 3 tests, over 90%).
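Those confidence figures check out: if each failure struck a random blade out of the four, the chance of the same blade turning up in n consecutive tests by coincidence is (1/4)^(n-1), so the confidence it is genuinely the same blade is 1 minus that. A quick check of the arithmetic:

```shell
# Chance of n consecutive random failures hitting the same one of
# 4 blades is (1/4)^(n-1); confidence it is truly that blade is 1 minus that.
awk 'BEGIN {
    for (n = 2; n <= 3; n++)
        printf "%d tests: %.0f%% confidence\n", n, 100 * (1 - 0.25 ^ (n - 1))
}'
# prints:
# 2 tests: 75% confidence
# 3 tests: 94% confidence
```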

Then you can get the serial number of the bad blade and have it replaced (hopefully under warranty!).

 
Posted : 02/09/2022 1:11 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support OK, so here is another weird twist. I do have the blades labeled, though the names are very basic (NVME 1, 2, 3, 4). It was NVME 4 that kept "disappearing" on the MacBook Pro. So I tried Certifying that blade with default settings and it passed. Then I rebuilt the RAID, tried the test again, and it immediately failed again (within about 2 minutes). Same failure mode, very immediate. Well, OK.

But then...I unplugged it, moved it over to the OTHER M1 Max based computer on my desk, a Mac Studio. Plugged it in, re-created the volume with the same settings, configured the same test. And it has now been running for 2.5 hours. I'm going to keep it running for at least 12 hours, because that seems...weird to me. Same cables, same software stack, same everything. Instant failure on one, no failure on the other. ???

I'm going to try testing another OWC RAID product on the MacBook Pro to see if this behavior occurs with other devices. I'm a little concerned about damaging an HDD or losing data, but I think I can find something that'll work. If THAT fails on the MacBook Pro, but NOT the Mac Studio...I'm curious what we do in that case? I guess I will keep you posted.

But this whole thing continues to support my (our) theory that there is a hardware problem with this MacBook Pro. I have never spoken to Apple Tech Support, but I wonder if they'd be able to investigate. I did run the hardware diagnostics (Cmd-D on boot) and it seemed to think everything was fine.

 
Posted : 02/09/2022 4:40 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

While it could be the Mac, I have a theory that some behavior can trigger a blade to hang/eject.
I have asked product management and will get back to you when I get an answer.

 
Posted : 02/09/2022 5:17 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support Incidentally, after running the same test on a Mac Studio w/M1 Max, I DID eventually get a sort of "stuck" situation with SoftRAID. It was not as dramatic as the array dropping a disk. But the Mac Studio ended up being basically unable to access, mount, or unmount any SR volumes...which eventually led to forced removal of TBT3 cables and forced power off (long power button press). It took many hours to arrive at this situation, but it did get there. I had a second, similar failure later in the day on the Mac Studio when trying to export a video project using Adobe Premiere/Media Encoder. It appears to be the same issue I saw on the M1 Max powered MacBook Pro.

So I have on my desk two M1 Max based Macs that display some flavor of "strange problems" with heavy file I/O on at least two SoftRAID volumes. One is a RAID0 Thunderblade (4 NVME blades), the other is a RAID10 Thunderbay 8 (14TB HDDs x8).

Let me know if you need/want additional support files or other info to help track this down.

 
Posted : 03/09/2022 12:47 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

OK, I am researching this, thanks for the details.

 
Posted : 03/09/2022 2:58 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support Just wanted to check if there has been any progress on this issue? I have been using an Intel based Mac for all my SoftRAID needs, but hope one day to return to the M1 Max.

 
Posted : 07/11/2022 3:15 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

We have not seen anything similar in "casual" testing. I will be setting up 4 systems this week that will run 24/7 with intensive I/O, both HDD and flash, to look for any I/O issues. If anything shows up, we will address those issues immediately.

The problem with intermittent bugs is that you need to be able to reproduce them in a controlled environment in order to identify and fix them.

 
Posted : 07/11/2022 4:51 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support I understand. I'd love to hear the outcome of your testing...because in my testing (steps above to reproduce) I was able to cause a crash or hang 100% of the time on one system, and ~50% of the time on the other.

 
Posted : 07/11/2022 5:28 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

Both running Adobe software? Or what?

 
Posted : 07/11/2022 9:14 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support Yes, they both run Adobe Creative Cloud for my main work. However, my steps to reproduce don’t use the Adobe stack…the issue seems to be triggered by heavy use of a SoftRAID volume (and likely a RAID0 or RAID1+0 volume specifically). The steps above show disk benchmarks triggering a failure, and it is 100% reproducible on the laptop.

 
Posted : 07/11/2022 10:00 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

If the issue does not require a specific app like the Adobe suite to trigger it, I will find it.

 
Posted : 07/11/2022 10:53 pm
(@softraid-support)
Posts: 9200
Member Admin
 

@higgins 

Heads up: I think 13.3, due Tuesday, will fix some of your issues. And 13.4, whenever that is out, should address the DART kernel panics.

 
Posted : 22/03/2023 5:06 pm