M1 Max MBP Gets Wedged, Cannot Unmount SR Arrays

29 Posts
3 Users
0 Reactions
1,613 Views
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

Hey folks,

I've had this problem on and off since I got my M1 Max-based MacBook Pro. I'm on the latest shipping version of Monterey, but I've seen this behavior throughout Monterey and the OS before it (whatever shipped with the MBP). The problem: After some amount of use (typically about 24 hours), SR volumes seem to slow down or get "flaky" in use. Much worse, though, they cannot be unmounted (either via Finder or SR itself), and prevent Restart or Shut Down, resulting in hangs during Restart/Shut Down. The only way to recover from this situation is to pull the TBT cable (!!!) or hold down the computer's power button for 10 seconds.

I have not seen others post this kind of problem, so I thought I'd ask whether it's a known issue?

Another way to explain how the problem presents. I'll just be working along doing my normal stuff, editing in Premiere or perhaps running an export, and suddenly the process will slow to a crawl. The arrays are still "usable" in that I can open them, browse, and even open files...but their speed has become severely degraded to the point where things like file exports appear to hang. In reality they're just running at glacial speed. The problem might be triggered by heavy I/O. Or it might be time. Or it might be something like total I/O requests. I'm really not sure. But it *always* eventually happens on the M1 Max machine.

The problem here is pretty extreme, because it means essentially I have to force-eject the arrays periodically. And I can't predict when it'll happen. This is not workable in a pro workflow. I have to validate the arrays constantly and also check my backups...not something I can afford to do given the size of the data involved.

In the meantime I have switched back to using my iMac Pro, which is running (to my eyes) an identical software stack. On the iMac Pro, this problem does not happen at all, ever. Software stack is latest Monterey, latest SoftRAID, latest Adobe CC stuff, same arrays plugged in, same workload (editing in Adobe Premiere, mainly). I have a Thunderbay 8 unit running RAID1+0 across 8 disks, and a second Thunderbay 8 hosting a RAID4 + a RAID5, each w/4 disks (this is rarely used). Plus several 4-bay Thunderbay units in RAID0 and a 4-stick Thunderblade also in RAID0. I have seen this problem with AT LEAST the RAID1+0 and RAID0 units, and I believe I have also seen it with both RAID4 and RAID5 (those two are much less used in my workflow, so it's rare I'd see them at all). I have also seen the problem when multiple arrays are connected. I can't pin it down to one piece of hardware; it seems to happen with ANY piece of OWC/SoftRAID hardware. If I remove all the arrays and do some other workload, the problem does not occur. This has been happening through several minor versions within SoftRAID v 6.x.

I'm not even sure how to troubleshoot this. The first time it started happening I wiped the MacBook Pro and re-installed everything from scratch under the theory it was some kind of software problem. That seemed to help for a week or two, then the problem returned. So I'm not actually convinced that effort did anything—maybe I just didn't run into the problem during that time window. Is there any kind of logging I can set up?

At this stage I think I can make this problem happen. It is guaranteed to happen if I let the machine sit there and work for a while with any SR array attached. So if there's any way to watch for something special, I can certainly do that. I'd love to resolve this, as I'd really prefer to use the much faster computer on my desk for my actual job! :)

;Chris

 
Posted : 23/08/2022 2:53 pm
(@softraid-support)
Posts: 9197
Member Admin
 

Attach a SoftRAID tech support file. I can take a first look.

This is unlikely to be SoftRAID, but we may be involved.

What we may need to do is gather a sysdiagnose file, and then we may need to report it to Apple. It appears to be some kind of threading or queuing issue.

When this happens, open Activity Monitor. Is anything gobbling CPU, threads, or memory?
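
For anyone who wants to capture the same information from Terminal, here is a minimal sketch in Python. It parses the output of `ps -axo pid,stat,comm` and keeps processes whose state begins with `U` (uninterruptible wait on macOS; Linux reports `D`). That the wedged apps in this thread would show up in that state is an assumption, not something confirmed here, and the function names are made up for illustration.

```python
import subprocess


def parse_ps(output):
    """Parse `ps -axo pid,stat,comm` text into (pid, state, command) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skips the header row and malformed lines
        pid, state, command = parts
        rows.append((int(pid), state, command.strip()))
    return rows


def uninterruptible(rows, flags=("U", "D")):
    """Keep processes whose state starts with an uninterruptible-wait flag."""
    return [row for row in rows if row[1][:1] in flags]


def snapshot():
    """Run ps once and return the processes currently in uninterruptible wait."""
    result = subprocess.run(
        ["ps", "-axo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return uninterruptible(parse_ps(result.stdout))
```

Calling `snapshot()` every few seconds and saving the result to a file on the internal disk (not on the array) would give a record of which processes were stuck first.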

 
Posted : 23/08/2022 4:39 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support Right on, attached is a tech support file generated today.

In the past I have not noticed anything fishy in Activity Monitor. No high CPU or RAM use, nothing like that. It's weird that way. What tends to happen is various processes are wedged (un-quittable, red, and specifically un-force-quittable). If I open SoftRAID itself it is GUARANTEED to get into this condition. If I try to "Force Relaunch" Finder, it ends up unable to relaunch, so I enter the condition of a Mac without a Finder. (I do still have a Dock so I can get at Terminal, which is...of limited usefulness.) FWIW, I did do the whole dance of restarting the machine so I could allow the kext to load and such.

If I had to hazard a guess, it feels like some kind of bizarro I/O problem. It's like any app that's trying to get data from the SoftRAID volume gets into some kind of zombie state. This includes Finder, plus whatever other apps are implicated (like Adobe apps). What's odd about it is the specifics of their zombie nature. Normally I would expect that I could simply terminate the app and start over. But this situation is "so broken" that the only way to even get a shutdown/restart is via Terminal (sudo shutdown -r now, etc.) and sometimes even that fails. And of course it's not a clean shutdown either.

Anyway, I will stop speculating because what do I know. But I will say it's weird that the Intel machine is just normal and OK while the M1 machine seems super flaky, but only in this one way...for other tasks it is rock solid. Reminds me of back in the day when I had a Mac with bad RAM and "weird stuff" just happened if I ended up using certain parts of the RAM modules. (G4 Cube.) I'm not sure Apple would accept a return for "ghosts in the machine" but I feel oddly compelled to try.

 

 
Posted : 23/08/2022 7:10 pm
(@softraid-support)
Posts: 9197
Member Admin
 

@higgins 

If you can describe how to get there in a way I can replicate, our engineering team can research this. If it is something involving SoftRAID volumes, we would be happy to fix (or get Apple to fix) whatever is wrong.

Is it always with Adobe software?
I have no third-party media apps like Adobe, FCP, Lightroom, etc. for duplicating issues. I need to do it with our internal tools, Finder-level stuff, etc.

It's possible this is one of the Adobe apps not working well on SoftRAID volumes (which would be a bug, as a SoftRAID volume should present to the OS exactly like any other volume). We would be happy to help Adobe fix that, if it were the case.

This stuff can be hard. We are pretty sure there are no memory leaks in SoftRAID, so it may easily be something from a third party, which can be difficult to track down.

 
Posted : 23/08/2022 9:56 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support It definitely happens when non-Adobe apps are in the mix. For instance today, I had the machine doing AI rendering using Topaz Video. The same thing happened: I/O from the SoftRAID volume slowed down substantially, the Topaz app began to run much slower than usual, then within ten minutes of this starting the Topaz app got wedged, and I was unable to un-mount the SoftRAID volume at all. Like via no means. Force-quitting Topaz became impossible (??) and ultimately the machine had to be forcibly shut down per the usual. In this case it was using a Thunderblade RAID0 stripe. I moved the same device over to the iMac Pro and it is happily churning away doing the same workload.

I agree, there is no guarantee that SoftRAID is somehow the culprit. But it feels like a possibility.

Any ideas on how to create a synthetic workload that would be Finder-based (or generally not needing paid third-party software)? I suppose I could run some kind of disk benchmark for 24 hours and see whether it just stopped working at some point....?
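
A synthetic workload along these lines needs nothing beyond the Python that ships with most systems. The sketch below is only an illustration, not a validated benchmark: it repeatedly writes a file to the target directory, forces it to disk with `fsync`, reads it back, and logs throughput, flagging any pass whose write speed falls below a fraction of the first pass. The function names, the `slow_factor` threshold, and the file name `io_stress.tmp` are all arbitrary choices.

```python
import os
import time


def write_read_pass(directory, size_mb=64, chunk_mb=4):
    """Write one test file, fsync it, read it back, delete it.

    Returns (write_MB_per_s, read_MB_per_s). Reads may be served from the
    OS cache, so the write figure is the more meaningful of the two.
    """
    path = os.path.join(directory, "io_stress.tmp")
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    chunks = max(1, size_mb // chunk_mb)
    total_mb = chunks * chunk_mb

    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reaches the volume
    write_s = time.monotonic() - start

    start = time.monotonic()
    with open(path, "rb") as f:
        while f.read(len(chunk)):
            pass
    read_s = time.monotonic() - start

    os.remove(path)
    return total_mb / max(write_s, 1e-9), total_mb / max(read_s, 1e-9)


def run(directory, passes, size_mb=64, chunk_mb=4, slow_factor=0.25, log=print):
    """Repeat passes; mark a pass SLOW if writes drop below slow_factor x pass 0."""
    results = []
    baseline = None
    for i in range(passes):
        w, r = write_read_pass(directory, size_mb, chunk_mb)
        if baseline is None:
            baseline = w
        slow = w < baseline * slow_factor
        stamp = time.strftime("%H:%M:%S")
        log(f"{stamp} pass={i} write={w:.0f}MB/s read={r:.0f}MB/s{' SLOW' if slow else ''}")
        results.append((w, r, slow))
    return results
```

Pointing `run()` at a folder on the array with a large `passes` value, and redirecting the log to the internal disk, would leave a timestamped trace showing when throughput started to degrade.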

 
Posted : 23/08/2022 11:41 pm
(@softraid-support)
Posts: 9197
Member Admin
 

@higgins 

I have run things for long periods of time, such as AJA System test, a couple Apple tools, our own tools, etc. Long file activity does not cause an issue (as a general case).

Either there is something "wrong" in your hardware, or your use case (conditions) has something we do not do in our general testing.

For example, if you "validate" a volume, that involves many reads, etc. It is intensive for the drives and driver, but should cause no issues. If you run a validate, it should be 100% smooth.

If you use Carbon Copy Cloner to copy data to or from a volume, again, there should be zero issues.

It would be fantastic if you found something like that.

A couple other things: Make sure if you have added RAM to your computer, that all modules are identical brands.

I have a computer that is in a known failing stage, probably the graphics card. I get similar issues to what you describe. IO starts getting crazy slow, and it gets so bad, the monitor starts blinking. That is from a hardware issue, however. I am not saying your problem is hardware, but if you can do standard IO and replicate your issue, then we need to consider that.

Let me know if you can make any progress. We are happy to assist.

 
Posted : 24/08/2022 12:30 am
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support I'll have more on this late this week, but there is some progress here toward reproducing the issue. I have various Support Files but I'm not sure you want them at this stage. Basically, I do think there is a semi-reproducible problem and I am trying to reduce it to its essence so it's actually feasible to reproduce without a bunch of extra steps. Whether SR has anything to do with it...unclear.

The test case I set up yesterday involved this loop:

1. Login as typical user but don't open ANY apps.

2. Mount a Thunderblade (RAID0) with lots of free space.

3. Run Blackmagic Disk Speed Test, 5GB, on a loop on the Thunderblade.

4. Wait.

I ran this for maybe 60 minutes and nothing much happened. So I decided to add more disk I/O stress, to better simulate what really happens when I'm editing a film project in Premiere. I did this:

5. Plug in an external Samsung T5 SSD (consumer grade, about 500 MB/s typical throughput) and copy files from this drive onto the Thunderblade, while leaving the speed test running.

After about ten minutes, part-way through a 1.5 TB file copy, "the bad thing" happened. In this case, a more unusual flavor happened, where SR declared that there were 7 I/O errors on the Thunderblade (!). The speed test app entered a hang/crash state and could not be force-quit. The Finder copy also got wedged and could not be stopped.
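
To timestamp the moment a workload like this wedges, one option is a small watchdog, sketched here in Python under assumptions of my own. Each I/O worker (for example a benchmark loop and a copy loop) calls `beat()` after every chunk it finishes, and a watcher reports any worker that has gone quiet for longer than a timeout. The `Heartbeat` class and the `watch` function are hypothetical names, and the timeout values are arbitrary.

```python
import threading
import time


class Heartbeat:
    """Records the last time a worker reported progress."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = time.monotonic()

    def beat(self):
        with self._lock:
            self._last = time.monotonic()

    def age(self):
        """Seconds since the last beat."""
        with self._lock:
            return time.monotonic() - self._last


def watch(heartbeats, timeout_s, poll_s, stop, on_stall):
    """Poll named heartbeats until `stop` is set.

    Calls on_stall(name, age_seconds) once for each worker whose last beat
    is older than timeout_s.
    """
    reported = set()
    while not stop.is_set():
        for name, hb in heartbeats.items():
            age = hb.age()
            if age > timeout_s and name not in reported:
                reported.add(name)
                on_stall(name, age)
        stop.wait(poll_s)
```

The workers should run as daemon threads and the `on_stall` callback should log to the internal disk. Note that a thread blocked inside a stuck kernel I/O call may still keep the process from exiting cleanly, so this only records when progress stopped; it does not recover from the hang.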

FWIW, this issue with I/O errors has happened once before in regular use...I talked to OWC support about it back then and they said as long as I verified the volume, it wasn't a real issue, it was sort of a phantom error. So I verified the array and it verified clean with no rebuilding.

Then I thought, well, this is not really the MINIMUM case. A better case would be to have a clean user account (thus less chance of any background processes) and ideally we would not need to add a Finder file copy into the mix. Also, if I am able to reproduce the issue writing repeatedly to an SR array, I wonder if I can reproduce it writing to a NON-SR volume? So I created that setup and am trying it now.

Thanks for your patience, and I hope I can locate the issue, wherever it resides. To be perfectly honest this feels like a hardware issue on the Mac where heavy file I/O just occasionally doesn't work right. I don't like my odds of convincing Apple of that, though.

;Chris

 
Posted : 25/08/2022 10:34 am
(@softraid-support)
Posts: 9197
Member Admin
 

@higgins 

Do attach a support file and I can try this, probably today.

 
Posted : 25/08/2022 4:16 pm
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@softraid-support Sure thing. Attached is an SR support file I generated yesterday at 6:59pm (though the file name date says today...that's weird but whatever), as well as the text of the macOS crash report upon forced shutdown and restart.

I have not had a chance today to try re-creating the issue again. Hope to get at it today/tomorrow/soon.

 

 
Posted : 25/08/2022 6:00 pm
(@softraid-support)
Posts: 9197
Member Admin
 

@higgins 

I just want to use this to make sure I am close to the same hardware. Thanks.

 
Posted : 25/08/2022 6:30 pm
(@emp001)
Posts: 2
New Member
 

I have also been having this problem on my M1 Max MBP. I'm using an OWC 4M2 4-slot enclosure with 4 Samsung SSDs turned into an 8TB media volume, RAID 0, striped at 64 KB (per OWC instructions re: SR and M1 Macs).

After restarting, I can safely eject the volume for a bit, but after a while it can no longer be unmounted (I have to hard-shutdown the MBP). I also use Premiere Pro for editing (the footage is stored on the RAID volume), but I'm not always using the drive, and it still refuses to unmount.

Occasionally I'll lose all the work I've been doing that day (or for the past several days).  I won't even notice until I reload a saved project and realize the version is from a few days ago - the new files I added to the folder are gone.  Thank god for cloud backups, but this isn't a solution by any means.

I started a case today with support, but saw this and wanted to chime in.  Very strange.

 

 

 
Posted : 30/08/2022 6:21 pm
(@softraid-support)
Posts: 9197
Member Admin
 

@emp001 

I do not like what you describe, it sounds terrible. Can you post the case number?

Not being able to unmount a volume means something is using it. If you quit Premiere, can you unmount the volume?
Changes not being saved is something I am not familiar with. I want to see your case, so I can comment further.
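
One way to see what is holding a volume open is `lsof`, which lists the open files on a file system when given its mount point. Below is a rough Python sketch that parses the machine-readable output of `lsof -F pcn` into a per-process summary. The function names are illustrative, and if the volume is already wedged, `lsof` itself may hang, which is why the call carries a timeout.

```python
import subprocess


def parse_lsof_fields(output):
    """Parse `lsof -F pcn` output into {pid: {"command": str, "files": [str]}}.

    Each output line starts with a one-letter field tag: p = process ID,
    c = command name, n = file name. Other tags (such as f) are ignored.
    """
    procs = {}
    current = None
    for line in output.splitlines():
        if not line:
            continue
        tag, value = line[0], line[1:]
        if tag == "p":
            current = procs.setdefault(int(value), {"command": "", "files": []})
        elif current is None:
            continue
        elif tag == "c":
            current["command"] = value
        elif tag == "n":
            current["files"].append(value)
    return procs


def holders(mount_point, timeout_s=30):
    """Return the processes with files open on the given mount point.

    lsof exits non-zero when it finds nothing, so the return code is not
    checked. Raises subprocess.TimeoutExpired if lsof does not finish.
    """
    result = subprocess.run(
        ["lsof", "-F", "pcn", mount_point],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return parse_lsof_fields(result.stdout)
```

Running `holders()` on the volume's mount point under `/Volumes` before attempting to eject would show whether Premiere, Finder, or some background process still has files open there.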

 
Posted : 31/08/2022 10:24 am
(@higgins)
Posts: 23
Eminent Member
Topic starter
 

@emp001 I'm also curious about whether Apple can tell you anything. Our issues sound somewhat different but possibly related, so anything you hear might help me too.

I'm actually about to receive a second M1 Max powered machine (a Mac Studio) in an attempt to test the same workload and see whether the laptop is uniquely weird, or if it's just generally an M1 Max type of problem.

So far, the M1 Max MacBook Pro has been...flaky. When it works it's awesome. It's fast. Everything's great. And it sometimes works for up to a couple of days, regardless of the amount of rebooting, quitting/re-launching apps, etc. And then at some point, sometimes in the middle of me doing something active and sometimes just overnight, suddenly the bad thing happens. Usually when the bad thing happens, all I/O to the SoftRAID volume just "stops" although the volume appears (in Finder) to be mounted. So in the middle of an edit or an export, I just get a crash or a hang. But I can't actually access the volume, and any process (including the SR app itself) that is communicating with that volume gets very confused. I have to force-shutdown the whole thing, and even then it often crashes during shutdown and comes up again with a crash report.

I am still trying to replicate this reliably using the laptop and a "simple" type of setup. I have been able to make it happen three or four times by hitting an SR array (NVMe-based, though I mostly work with HDD-based arrays) with tons of I/O. Like a disk benchmark + a file copy coming in over USB. This heavy I/O is actually fairly similar to what happens during a large edit session, because I'm often working with multiple streams of ProRes 8K, layering in other streams at lower resolution...in other words, just tons of I/O activity even just playing around on the timeline.

Anyway, super curious if you hear anything from Apple. I am *this close* to taking this thing to Apple and just throwing myself on their mercy. It feels like I somehow got a lemon, because I do not see other people talking about these kinds of problems.

(FWIW when I switched back to my comparatively-slow-but-reliable iMac Pro, everything works. Completely. All the time. It's slower, but it works.)

 
Posted : 31/08/2022 10:50 am
(@emp001)
Posts: 2
New Member
 

@softraid-support Yes, the case # is L01554689.

Quitting Premiere seems to have no effect; the drive simply won't unmount. I've tried using Activity Monitor to quit all Adobe processes, force-ejecting, logging out of Finder, etc. The lost changes seem to have been related to me physically unplugging the Thunderbolt cable (after hard-shutdown) and removing the laptop so I could take it somewhere else. This particular thing has happened 2-3 times since Feb.

 

Thanks for any help!

 
Posted : 31/08/2022 11:47 am
(@softraid-support)
Posts: 9197
Member Admin
 

@higgins 

I have not seen common reports of this, no. Let's see what you get with the new computer.

I see occasional reports similar to this, where users complain of slowdowns/hangs, but there is generally a specific cause, such as a bad cable or enclosure.

 
Posted : 31/08/2022 3:21 pm