Very odd problem K8TMaster2-Far

jimmitch

Guest
I am having a very odd problem with a new installation of a K8TMaster2-Far board. The fault is that after 5 to 10 minutes, the machine goes into what looks to be a 'standby' or 'hibernation' mode with the fans off, screen blank, and the power LED blinking at about a 2 second interval.

An infrared thermal probe of the CPUs says that the heat sinks are at 89 degrees Fahrenheit, the Northbridge at 91 degrees F, the memory at 86 degrees F, and the regulators at about 92-101 degrees F. Ambient temperature is 72 degrees F.
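For anyone who thinks in Celsius, here is a quick sketch that converts the probe readings quoted above. The dictionary labels are just illustrative names for the measurement points, not anything from the board or its utilities.

```python
# Convert the infrared probe readings from Fahrenheit to Celsius.
# The labels below are illustrative; the values are the ones quoted in the post.

def f_to_c(deg_f):
    """Standard Fahrenheit-to-Celsius conversion."""
    return (deg_f - 32) * 5 / 9

readings_f = {
    "CPU heat sink": 89,
    "Northbridge": 91,
    "memory": 86,
    "regulators (low)": 92,
    "regulators (high)": 101,
    "ambient": 72,
}

for name, deg_f in readings_f.items():
    print(f"{name}: {deg_f} F = {f_to_c(deg_f):.1f} C")
```

Everything measured sits within roughly 16 degrees C of ambient, which is well below anything that should trip a thermal shutdown.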

This is really quite frustrating, since there is no apparent recovery short of using the AC switch on the power supply to hard reset the machine, which then 'wakes up' in a locked condition with the fans on high, the drive lights all on, and no display or apparent keyboard input.

I have simplified my setup to the following:

MSI K8TMaster2-far (on bench)
512MB PC2700 Apacer memory (registered) in slot 1
1 Opteron 240 CPU (from known good machine)
Enermax 450W power supply (new, replacement)
1 Maxtor 150GB SATA drive
Nvidia Ti4200 AGP card

I have gone through the CMOS settings several times and am confident that 'suspend' and power-save modes are disabled. I have also re-flashed the BIOS to the latest version, performed a 'restore optimised defaults' from CMOS (this should re-write NVRAM, shouldn't it?) and reset CMOS with the jumper.

This machine exactly duplicates another server which I've built, and that machine has run flawlessly for almost a month now, which is what I expect, since all of the parts are on MSI's 'approved' list.

I am now out of ideas since the problem prevents me from even loading Windows Server 2000.

Any suggestions?

TIA

Jim
 
Hi

Apart from the BIOS, power settings are also set on the screen saver page in Windows.

With the cursor on a clear bit of desktop, right-click, then choose Properties, then Screen Saver. At the bottom of that page is a Power button; on the page it opens you set the time to switch off or hibernate.

Check that before panicking.


Cheers

jocko
 
'Tis true - but I can produce this failure even when booted from a Windows 98/MS-DOS boot disk - so Windows/NT OS utilities are not in play.

Sigh, one more data point: I can hear a distinct 'click' which sounds like a relay when the system enters its 'suspended' state. I'm not certain, but I suspect it may be coming from the power supply. I checked the Enermax manual (PS number EG465P-VE) and the Enermax website, and all they have to say is a rather ambiguous:

"When the total O/P load exceeds 105% to 160% of the max. output current, the power supply shall be latched into the state of shutdown."

And

"When any set of DC outputs is in short circuit, power supply shall be latched into a state of shutdown..."

OK, I've fried my share of components, and have seen this sort of PS crash - the lights go off, the fans go dead, and the supply is basically dead until the short is removed from the DC line. But not in this case - the damn supply is alive, the front panel light is blinking, and there is a small DC offset voltage on the PS fans - since they feel sort of 'notchy' when you turn them.

Mutter, mutter, mutter :wall:

Jim
 
Hi

OK, could it then be a faulty thermal protection switch in the PSU, cutting out too soon? The thermal protection is certainly jumping in too quickly. A lot of PSUs, certainly the Super Flower that I use, and many others going by reviews, have the same problem with clicking. Most think it is a relay clicking as it adjusts fan speed; on mine it only happens on the auto setting. It's just like a clock quietly ticking.


cheers

Jocko
 
OK, so here's today's update:

First, Yes BAS I did make sure that the 12v molex was in the CPU power plug, system won't boot otherwise.

The good news is the silly thing has now staggered to life as a single CPU machine and appears to be as stable and fast as the other two MSI-based servers I've got running - it's been grinding out SETI units for nearly 24 hours now without a hiccup.

The bad news is that the underlying problem with dual CPU's continues to baffle me...

Here's what I currently know:

With both CPUs installed and a single 512MB DIMM the machine will invariably go into the previously described lock-up state.

With either CPU installed on its own it runs just fine.

With dual CPUs and TWO 512MB DIMMs it also seems to be stable - but I haven't had time to let it run very long yet.
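The three observations above can be written down as a small test matrix, which makes it easy to see which combination has not been tried yet. This is only a sketch: the field names are made up for illustration, and the single-CPU case is assumed to have been run with one DIMM, since the post does not say otherwise.

```python
from itertools import product

# Observations from the post. ASSUMPTION: the single-CPU run used one DIMM.
observations = [
    {"cpus": 2, "dimms": 1, "stable": False},  # invariably locks up
    {"cpus": 1, "dimms": 1, "stable": True},   # runs fine with either CPU
    {"cpus": 2, "dimms": 2, "stable": True},   # provisional, short run only
]

tested = {(o["cpus"], o["dimms"]) for o in observations}
failing = [(o["cpus"], o["dimms"]) for o in observations if not o["stable"]]

# Every combination of 1-2 CPUs and 1-2 DIMMs that has no result yet.
untested = [combo for combo in product((1, 2), (1, 2)) if combo not in tested]

print("failing (cpus, dimms):", failing)
print("untested (cpus, dimms):", untested)
```

Under that assumption the one untested combination is a single CPU with two DIMMs, which would help separate a CPU-count effect from a DIMM-count effect.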

I'm very confident that this is not a Power Supply issue, since the lockup symptom appears with two different Enermax 460W supplies.

It might be a memory issue, but I'm puzzled that either single DIMM fails, but two don't.

It might be a CPU issue, but then all I can think of is that there is some sort of timing or access issue between the CPU's internal memory controllers.

Finally, could there be something silly that I'm overlooking on the motherboard, or might it be defective?

Sleepless in Seattle

Jim
 
Bas -

Sorry, thought I'd mentioned this already, the memory is APACER PC 2100 registered/ECC with Infineon chips, as specified on the MSI web site.

I also have several Buffalo DIMMs (PC 2700, registered/ECC) with Samsung chips, and they show the same results.

My very strong hunch at the moment is that there's an underlying conflict between the two on-CPU memory controllers as they each go to access the single DIMM, which may be mitigated when two DIMMs are installed - will do more testing today and get back with the results.

BTW - I've been blissfully ignorant of the complexities of the Opteron memory architecture, but did find a rather decent article here:

http://www.gamepc.com/labs/view_content.asp?id=opteronmemory&page=1

Cheers

Jim
 
One possibility is that one of the CPUs is getting a Thermtrip overtemperature event. I have a debug disk from AMD that triggers that via software, and the result is the same--screen off, fans off, but LED blinking, and the only way out is to hold the power switch for 5 seconds to power off all the way.

There is no chance that it is some memory access conflict, as the second cpu only accesses memory by sending requests to the first cpu.
 
Extera

Very interesting - would account for the symptoms I'm seeing.

This would have to be a board-level defect then, since both CPUs exhibit the same fault, and I have very carefully monitored both the heatsink and the under-socket temperatures with an infrared heat gun and they're within 10 degrees F of ambient.

Yes, if the heatsink were not making thermal contact it would stay cool, but you'd expect to see a rather dramatic rise in the temperature directly under the CPU wouldn't you?

Any chance of getting that AMD utility from you? It's not posted anywhere publicly I can find ....my e-mail is in the comments field of my profile

Cheers

Jim
 
OK - progress has been made ...

Taking Extera's comments to heart, I loaded up the MSI PC Alert III utility under Windows Server 2003 and noted that both CPU fans showed 0 as in <Zero> RPMS.

"Hmmmm;" I thought to myself "Self, what would you do if you were a decent, self-respecting BIOS and you suddenly noticed that your CPU fan had stopped working?"

"Well, I'd PANIC. That's what I'd do ..."

So, I went back into the CMOS and disabled the SmartFan control, now both fans are running at full speed, are correctly reporting their RPM's to the Alert utility, and the system has been running for over an hour at about 100 Degrees F on each CPU at 100% utilization.

Bas - could you pass this on to MSI as a potential BIOS bug? The SmartFan control appears to be broken with dual CPUs and a single memory DIMM, but does appear to report fan RPM correctly with a single CPU.

Nasty, nasty little bug ....

Cheers

Jim
 
Smartfan isn't broken.....
It drops the CPU-fans to 1/2 the speed....
They go from 5500 RPM to 2500RPM....then they are on the detection limit of PC-Alert then...
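To illustrate what a "detection limit" can look like: many hardware monitor chips measure fan speed with an 8-bit tachometer counter and a clock divisor, so a fan that turns too slowly for the chosen divisor overflows the counter and reads as 0 RPM. The sketch below assumes the common Winbond-style scheme (1,350,000 divided by count times divisor). Whether the K8T Master2-FAR or PC-Alert actually works this way is an assumption, not something confirmed in this thread.

```python
# Sketch of an 8-bit tachometer fan reading.
# ASSUMPTION: clock constant and divisors follow the common Winbond-style
# scheme; the monitor chip on this board is not confirmed here.

TACH_CLOCK = 1_350_000  # assumed counter clock, in counts per minute
MAX_COUNT = 255         # an 8-bit counter saturates at this value

def min_detectable_rpm(divisor):
    """Slowest fan speed that still fits in the counter for this divisor."""
    return TACH_CLOCK / (MAX_COUNT * divisor)

def reported_rpm(actual_rpm, divisor):
    """RPM a monitoring tool would show, or 0 if the counter overflows."""
    if actual_rpm <= 0:
        return 0
    count = max(1, round(TACH_CLOCK / (actual_rpm * divisor)))
    if count >= MAX_COUNT:
        return 0  # counter saturated: the fan looks stopped
    return TACH_CLOCK // (count * divisor)

for divisor in (2, 4, 8):
    print(
        f"divisor {divisor}: limit {min_detectable_rpm(divisor):.0f} RPM, "
        f"5500 RPM reads {reported_rpm(5500, divisor)}, "
        f"2500 RPM reads {reported_rpm(2500, divisor)}"
    )
```

With a divisor of 2 the limit in this model is about 2647 RPM, so a fan slowed to 2500 RPM would read as 0 while the same fan at 5500 RPM reads normally. A larger divisor moves the limit down at the cost of resolution.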
 
OK - this is believable, that the drop in fan speed dips below the threshold of PC-Alert ....

Except, that with one CPU, PC-Alert does correctly report the 2450RPM speed, it also reports the correct speed with two CPU's and two DIMMS.

PC-Alert <fails> to report the speed with two CPU's and one DIMM when Smartfan is enabled, it works fine with Smartfan disabled - I checked the CMOS status page and it appears to report the correct fan RPM's in any condition (Bios 1.1).

I've no other way to assess Fan speed at this moment, is there another Windows utility which is known to work with the K8TMaster series? I'd like to do some further verification.

The system ran all night pounding out SETI workunits with both CPU's at 98% utilization, system temperatures are 101/104 degrees F (38/39 C) and quite stable. With Smartfan enabled I show 110/114 F (41/42 C) and also stable, so actual temperatures are not a factor.

My current working idea is that Smartfan does something stupid to the WMI (Windows Management Instrumentation) data when dual CPUs are installed, such that the fan RPM objects are getting blanked or damaged. I'll install the WMI tools and report back.

Jim
 
I have the same here....
But you should use CoreCenterPro to monitor it right....
But even then, the second CPU is mostly shown as 0 RPM....
 