NVMe boot drive randomly dropping, MAG Z690 Tomahawk WiFi DDR4

synphul159502dc

New member
Joined
Oct 4, 2022
Messages
6
I've seen a number of issues but none that seem to relate to this problem. Within the past month I built a new system with an msi mag z690 tomahawk wifi ddr4 board. I7-12700k cpu stock, 64gb (2x32gb) corsair vengeance lpx 3600 ddr4, sk hynix p41 platinum 1tb gen4 drive. Power supply is corsair rm1000x. Msi gaming z trio 3080 12gb lhr. Win11 pro.

Everything installed just fine, installed windows fresh. The only 'oc' of any kind is enabling xmp profile 1 for my ram. Every few days my pc just randomly shuts down. Not under any actual load, not while gaming. While commenting in reddit this last time. Screen turns black, rgb keyboard goes dark, system still running, rgb's and fans lit and running. Proceeds to reboot, I get the msi boot screen logo and instead of booting to windows it kicks me right into the bios. It took a few times to realize that the main boot drive, the nvme, goes missing. It just isn't in the boot list. When moving to the easy bios screen and viewing drives, 0/4 m.2 drives listed. If I flip the power switch on the psu off at the back, wait a moment, switch it back on and power my pc back up, no problem. The nvme is detected, boots straight into windows.

I get no warning, no bsod. Just a black screen like lost video signal and these annoying random reboots that force me to hard shut down and reboot. It's happened at least a half dozen times or more. I've double checked that bios is set to uefi, tried m.2 on auto and selecting gen4 pcie manually, secure boot off, secure boot enabled. I've checked my power settings, running balanced. Power settings allow my monitor to sleep after 20min, pc sleep after 30min, hybrid sleep on, hibernate never, wake timers enabled. Pci express link state power management disabled. Processor min 5%, max 100%. Usb wake settings allowed on mouse/kb. Running I believe the latest bios, 7/4/22 07d3211.

I've monitored my nvme drive temp via hwinfo64 and at most it reaches 45-46c. Well within specs, so it doesn't appear the drive is overheating. Using the included msi mainboard heatsink cover, the drive was bare (no oem heatsink). It makes no sense as to why at any time, for any reason it just disappears and the system runs to a reboot. Upon which it's lost the fact there's an nvme drive installed. Until a cold boot and hard shut down. I thought I had overcome this issue after roughly 3 days of uptime, until it just happened again an hour or two ago. The pc's also on clean stable power, connected to an apc backups pro 1500v/900w unit that's recorded no incidents since the pc shutdown. Should be noted this problem also occurred before using the UPS and powered straight from the wall.

If anyone has a suggestion for a setting or something. Otherwise I have to assume that the motherboard is defective. Drive's been checked, status healthy. I plan to keep this system 4-5yrs at least and I can't tolerate this randomly shutting down for no particular reason when the board decides it's lost the drive. Thanks.
 

citay

Pro
SERGEANT
Joined
Oct 12, 2016
Messages
9,808
First off, update to the latest BIOS version, there is a newer one: https://www.msi.com/Motherboard/MAG-Z690-TOMAHAWK-WIFI-DDR4/support

Then there are a couple of things i would start with. For example, it's quite common on Intel to inadvertently bend some socket pins during CPU installation. The M.2 slot gets its PCIe lanes directly from the CPU. So i would do a socket pins inspection. For that, take out the CPU and carefully inspect the underside and especially the socket for any bent pins, foreign objects, dirt, or other abnormalities. Check carefully (with a magnifying glass, if possible) if all the socket pins look exactly the same, in position and in height. If the pads under the CPU and some bent pins of the socket don't make proper contact, you can have all kinds of problems. If some pins look bent (even very slightly), please make photos of the socket from different angles and upload them/link them here.

If the pins are all perfectly aligned, you can put the CPU back in. For the best cooling results, you should clean the CPU heatspreader and the cooler base, because when you re-use old heatpaste, there can be trapped air bubbles which won't be good. So ideally you'd clean it and then apply new heatpaste. I use soft, lint-free paper towels and q-tips dipped in high-purity isopropyl alcohol to clean, and then apply a generous drop of new heatpaste onto the middle of the CPU, at least the size of a large grain of rice. It will be spread by the cooler pressure.

Another thing relatively high on the list would be to test with a different PSU, only because it's still relatively easy (at least compared to swapping the motherboard or things like that). Maybe you can borrow a known good and not too old PSU, or you have a spare one in another system. If it's not as powerful, leave out the GeForce for the test duration. Sadly your problem only seems to happen every couple of days, so it's not very easy to reproduce. So i would definitely have a look at the pins in the socket first.
 

synphul159502dc

New member
Joined
Oct 4, 2022
Messages
6
Thanks for the response. Wouldn't bent pins be a persistent issue vs intermittent? I did inspect everything prior, landing pads on the cpu as well as the pins on the motherboard socket. I could even see things shifting under load, such as during load tests, as they heat up, causing various pins in use to flex or shift a tiny amount. No shutdowns during p95, none during timespy. No errors after running several loops of memtest64. I think the random shut down has only happened once when coming out of sleep, the rest of the time was right in the middle of use.

I'm thinking windows never gets a chance to throw a bsod, the error event is happening physically outside of windows. Similar to reaching over and flipping the power switch on either the psu or monitor. Just quietly goes dark while fans remain spinning and rgb's remain lit. I might have to try buying another psu to test, I don't have one strong enough to power the new setup. The old one was quite old (over 15yrs) and may have been what took down the previous system. To rule it out and due to new parts exceeding the wattage I went with the new psu. I've thought maybe transient loads from the gpu, the 30 series are known for them. Though I'm pretty certain a 1000w psu should handle that aspect. And during any power failure issues I have the pc set in bios to remain shut down, not to auto reboot.

Good call on the bios, I didn't figure there would be another update so recently. And lo and behold, it does say it improves pcie storage compatibility. Hopefully things go smoother this time around. The first time I updated the bios was to clear up some audio noise, found I was unable to use m-flash. One of the items cumulative in the updates was an update to allow m-flash to work properly so fingers crossed. Being so random drives me nuts, at least if I knew what was triggering it I could recreate the scenario. Some days it's every day, other times it's fine for 2-3 days. And day 3 or 4 when everything seems resolved, nope, does it again. With 0 changes. So no matter what I do it could be a week or two before I find which has fixed it, if it fixes it. I'll have to go through it one change at a time. While I'm tempted to try a new psu, mobo, bios update etc and be done with it, too many variables to keep track of. Thanks again for your help.
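To put rough numbers on that waiting game, here is a minimal sketch. It assumes the dropouts behave like independent random events (a Poisson process) with a mean gap of about 2 days. Both the model and the 2-day figure are assumptions chosen for illustration, not measurements from this system, and the function names are hypothetical.

```python
import math


def chance_of_quiet_streak(mean_days_between_failures, quiet_days):
    """Probability of seeing no dropout for `quiet_days` even though
    nothing was fixed, under the assumed Poisson model."""
    return math.exp(-quiet_days / mean_days_between_failures)


def quiet_days_needed(mean_days_between_failures, confidence=0.95):
    """Days of trouble-free uptime needed before a lucky streak becomes
    less likely than (1 - confidence), under the same assumed model."""
    return -mean_days_between_failures * math.log(1.0 - confidence)


if __name__ == "__main__":
    mean_gap = 2.0  # assumed average days between dropouts
    print(f"Chance of 3 quiet days by luck: {chance_of_quiet_streak(mean_gap, 3.0):.0%}")
    print(f"Quiet days needed for 95%: {quiet_days_needed(mean_gap):.1f}")
```

Under those assumptions, three quiet days would happen by luck roughly one time in five, and it takes about six quiet days per change before a fix looks convincing. That is consistent with testing one change at a time taking a week or more.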
 

citay

Pro
SERGEANT
Joined
Oct 12, 2016
Messages
9,808
Well, about the PSU, i wrote a Guide: How to find a good PSU, but you already have a nice PSU model. So instead of outright buying a new one, i'd first try to borrow one for testing. What's really hampering your testing is the long intervals between each occurrence. Nobody will probably lend you their PSU for four days, not unless they have a spare one to give you. And even that has to be new and good enough. I mean, i have a couple factory sealed PSUs i could help someone with or test with, but i'm regularly building PCs for people and companies, so i realize this is unusual.

I didn't understand the part about you seeing the pins shift, did you intently monitor your CPU cooler during stress tests and could you spot some kind of movement or what? Didn't quite get that. Anyway. So to the best of your knowledge, the socket pins are all in pristine condition and all look 100% the same? Problems from bent pins can be very weird and intermittent sometimes, similar to PSU problems. But i'll take your word for it.

Monitor things now with the new BIOS. If the problem reappears, i'd take out the GPU and run in internal graphics for a while.
 

synphul159502dc

New member
Joined
Oct 4, 2022
Messages
6
I think there's some confusion about pins shifting. What I meant was, it seems unlikely that pins would move at all post install. So a bad pin to cpu pad seems (if it were a bent pin) to be persistent rather than work 95% of the time. Either contacting or not. Like trying to boot and ram not being detected or a ram error preventing booting every time. I could see that being a misaligned pin. Otherwise seems unlikely.

It's the weirdest thing, been building pc's for years and this one's given me the most grief. I barely have any programs moved over and reinstalled, coming off win7 so win11 is a learning curve. The only other thing new this time was the nvme, never bothered with them in the past. My old motherboard only supported sata m.2 so no advantage. After the last message I downloaded the new bios, managed to install via m-flash so thankfully that's working now. After flashing I was able to restart normally, so it appeared maybe that fixed it. Then I started getting odd windows errors, my first bsod on this new system and just said 'critical error' (how unhelpful). Looking that up it suggested maybe malware, so went to install mbam. Had an issue there, it didn't want to install from the installer. Ended up having to download the entire thing from their help page then run it. 0 faults found.

Another suggested maybe a virus, hardly had the pc up and running but ok. So I went to install avg free antivirus, I got the warning 'sorry, this isn't supported by your version of windows'. Finally got that installed, initial scan, 0 issues found. Had a pending windows update 22H2, downloaded that and tried to install it. The first time it glitched, went to reboot, lost the nvme drive again same as before. Reinitiated the download/update, went through successfully and restarted on its own, found the drive, updated the system and so far so good.

While using edge to search for that critical error, something else was going on. I was trying to type 'avg antivirus' into the browser bar and every time I typed 'a', edge closed. It would allow other letters, but not 'a'. First time I've had all these errors and they all piled up at once, aside from the missing drive and random reboot issue. I don't know enough about win11, have the pro version installed and activated. Aside from installing avg, mbam and the recent windows update today I haven't installed any new software in the past week or two so seems unlikely that's the cause. Nothing else has changed. I also tried running several windows troubleshooters as sites suggested, no errors found, no wonky drivers, no nothing.

Only one other oddity and it happened only once, I lost my argb. It's always come back after waking from sleep, on reboots etc. Again it happened while casually using the pc, not on a wake from sleep event. The only argb in the system are 2x 120mm fans on the cpu cooler and a gpu support bracket, aside from the built in rgb on the msi gaming z trio card. That remained lit, the ones plugging into argb on the motherboard went dark. Going into MSI center and fiddling with mystic light did nothing. I shut down, restarted and the lighting came back. Which is why I'm wondering if it's not the board somehow. Both nvme and argb are controlled via the board, both having issues at least once. And since the board runs sata and nvme on different channels, my sata drives shouldn't be conflicting in any way. My older board was like that, shared m.2 with the sata ports and when using m.2 it would disable 2 of the sata headers. With only a couple of devices using argb doesn't seem they'd be overloaded. The 2 fans are daisy chained and share one jrainbow header, the gpu support uses another jrainbow header by itself and the gpu runs its own light bar.
 

citay

Pro
SERGEANT
Joined
Oct 12, 2016
Messages
9,808
Ok, i understand about the pins. Like i said, problems from slightly bent pins defy logic sometimes. The signal quality could be just on the edge, and sometimes it causes a problem, but most of the time it doesn't, things like that. Similar with PSU faults. Not saying it's a bent pin or PSU issue that you have.

It would seem you have a stability issue with your PC now. I find it highly unlikely for a freshly installed Win11 to somehow catch a virus out of the blue. This would only happen if you were to install software from dubious sources yourself, or if you ran a malicious email attachment, things like that. Your Windows won't be randomly infected just from being online, provided that you let Windows Update install all the security updates. Third-party AV is unnecessary, the integrated Windows Defender is pretty good nowadays. As for the browser, i recommend an alternative to Edge, for example Firefox with the ad-blocking add-on uBlock Origin. This is more secure than Edge.

As soon as you get a BSOD, it's either a sign of general instability or sometimes it can be a driver acting up. What BIOS settings have you made so far? Just enabling XMP? Try with XMP disabled until we get to the bottom of this.

With all your symptoms, i wouldn't necessarily trust your Win11 installation anymore. While i don't suspect a virus to be at play, you might have some corrupted files from your instability. So after disabling XMP for testing, i would do a fresh install of Win11. If you use the Media Creation Tool from here, "Create Windows 11 Installation Media", it will actually prepare an installation USB drive with Win11 version 22H2, so no need to do the feature update later.
 

synphul159502dc

New member
Joined
Oct 4, 2022
Messages
6
With the latest bios update, about the only key change I noticed that it made, it enabled resizable bar. Before it was disabled by default (I never messed with that). It's set to uefi, not csm. I've allowed for usb devices to wake from sleep so I can use the kb or mouse to wake instead of having to use the power button on the case. I enabled cpu fan fail warning, it was enabled prior and the bios update disabled that again. Secure boot is enabled, but only generic no security keys or anything registered. Not using bitlocker and not trying to get locked out of the pc. Beyond that in bios, xmp is the only thing enabled. All voltages for ram, cpu etc are auto/default, nothing is oc'd. 1.35v on the ram.

I didn't actually think a virus or anything was the issue, was just double checking things suggested by sites as a number of steps to be taken when encountering the critical error message. And I'm sure the odd instability in windows wasn't to do with the main issue of the nvme drive dropping out. Only mentioned that because it was the first major issue I've encountered and it happened in the middle of everything. Like a cascade of issues out of nowhere. But just in case it may have indicated something semi related. I looked through event viewer, during all those issues I had tons of errors. The vast majority of them repetitive and linked to the print spooler which is odd since I don't have a printer hooked up. Haven't tried printing anything. I did have a windows message about memory integrity being 'off' so I switched it on.
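For digging through Event Viewer noise like this, one option is to query only the System log entries that Windows writes after an unclean shutdown. The sketch below is a hedged example, not a diagnosis tool: it assumes the built-in `wevtutil` command is available, and it filters on Event ID 41 (Kernel-Power, rebooted without cleanly shutting down) and Event ID 6008 (previous shutdown was unexpected). The function names are hypothetical, and the query must be run on the affected Windows machine.

```python
import subprocess


def build_query(event_ids, max_events=10):
    """Build a wevtutil argument list that returns the newest System-log
    events matching any of the given event IDs, as plain text."""
    id_filter = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",       # XPath filter on the event IDs
        f"/c:{max_events}",  # limit the number of events returned
        "/rd:true",          # newest events first
        "/f:text",           # human-readable output instead of XML
    ]


def recent_unexpected_shutdowns(max_events=10):
    """Run the query (Windows only) and return wevtutil's text output."""
    cmd = build_query([41, 6008], max_events)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling `print(recent_unexpected_shutdowns())` from a Python prompt on the affected PC lists the timestamps of the unclean shutdowns, which can then be lined up against the moments the screen went black.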

As far as drivers, once I installed win11 I went and installed msi's motherboard drivers, audio codec drivers, msi center. I installed nvidia's drivers and all that came with it, the geforce experience, control panel etc and latest nvidia drivers 3.26.0.131 currently running 516.94 game ready drivers. There's a newer one I haven't installed yet. Aside from checking a couple of photo editing programs compatibility with win11, installing steam and slowly rebuilding my skyrim game/mods most of the software is either default or diagnostic. 3dmark timespy, memtest64, hwinfo64, mbam, avg antivirus, net disabler, fan control, msi afterburner, lightshot for screen caps, crystal disk info, a drive check utility from sk hynix to check the nvme.

Typically I use chrome but I had a number of tabs open and kept it closed so I could easily restore them. Also figured edge being the default browser might be less problematic while I was in the midst of trying to sort the issues. In my travels looking for solutions the past week or two it seems nvme issues are commonplace. Across a variety of boards, drives etc. Though many of the complaints stem from the drive not showing up in windows or not even in bios. Where mine is recognized by the bios and had no issues out of the gate until the random disconnects. At least I assume that's what's happening. The issue has been using the pc as I am now, typing a response or anything basic and no bsod, no nothing. Screen just goes blank and says I've lost signal, keyboard goes dark. Fans remain running, lights remain lit. A bit of drive noise spinning up as though a reboot is taking place, but instead of returning to windows it drops me immediately to the bios. And that's where it shows the optical drive, hdd, ssd in the boot order, but for boot drive it just says 'uefi drive' with no info, no drive name. When looking at the others they indicate samsung ssd, wd hdd. When switching to the ez bios I can select drives and it lays them all out in one spot. Under m.2 it lists the 4 m.2 drive placeholders where the nvme drive should be and instead says 'not installed' on all 4 slots. 0/4.

The only thing I can guess is that the drive connection is lost suddenly, causing an auto reboot. And when it tries, it's unable because it can't find the c drive on the m.2/nvme. So it leaves me in the bios. Yet if I switch the power supply off and shut down totally, reset the switch and then power up using the case power button, everything fires up as normal. Locates the nvme, the bootloader and windows install and takes me right into windows.
 

citay

Pro
SERGEANT
Joined
Oct 12, 2016
Messages
9,808
Well, you know what i would do. If i'm not absolutely sure the pins are all 100% ok, i would do a pins inspection. And i would try with a different PSU. MSI Center i don't use for reasons. Your problems are very odd indeed, and you have to start somewhere. Pins inspection and different PSU are still relatively low-effort, compared to trying another board/CPU/RAM etc...
 

synphul159502dc

New member
Joined
Oct 4, 2022
Messages
6
I'll do a pin inspect when I can shut it down for a bit. I need to dig in there anyway since I didn't realize I needed the numbers off the side of the 24 pin connector to register the motherboard. No one around me has a spare psu and I don't have any with high enough wattage to swap out so I'll likely end up having to buy another for testing. The few psu's I have are older 450-550w units, my old 650w seasonic that I ran for 17+ years and was likely part of the issues on the last build that led to this new rig. I really appreciate your insight and help, thanks.
 

synphul159502dc

New member
Joined
Oct 4, 2022
Messages
6
Spoke with msi trying to see if they had any insight. Realized the p41 hynix nvme isn't on the qvl list though doesn't necessarily mean incompatibility, seems highly unlikely as the p31 is validated. Just in case, ordered a wd sn850x which is on the qvl list. Tried moving the nvme to another slot from slot 1 to 4, turned xmp off. Same issue. I think if this new drive with yet another fresh install of win11 suffers the same, safe to say it's the board. This last shutdown was due to some windows critical error. I didn't have a phone ready to snap a pic and the screen went off before I could write the error down. The pc went to shutdown and reboot itself and the drive was missing again. I might have some problem with windows or drivers however they shouldn't affect the drive not being detected in bios. Bios loads before windows, even if a win error caused a shutdown it should've still booted back up and seen the drive, taken me right back into windows. Double checked that my ram is on the qvl, it is. I wasn't getting windows errors until the last update so maybe that's related to that issue. Still doesn't explain the houdini drive. Xmp doesn't seem to be a factor at all. And yet again the shutdown happened not during any level of stress, simply clicking and checking messages on yt and responding. With xmp disabled and running a different nvme slot, windows stayed operational for roughly 24-30hrs before acting up again.

One odd thing (or maybe it's common), the msi rep told me when I pull the integrated cover/sink off the drive not to be alarmed if the drive looks 'wet'. Something about the thermal pad building moisture through heat cycling. The drive was in fact quite wet with a sort of oily residue, wiped it clean and reinserted in slot 4. It might be a non issue like thermal paste but still a bit unsettling seeing a drive coated in wet anything. Just like before a hard shut down and powering all the way back up and the system saw the drive just fine.
 

Starkdv815d802ec

New member
Joined
Oct 21, 2022
Messages
21
Did you ever get your NVMe situation working as it should?
 