58 deg Tctl is high-ish with OC. Not in terms of absolute heat, but as you OC, the thermal threshold for stability goes down (instead of the default 70 deg Tctl, you may start seeing instability as early as 60 deg Tctl).
Heat and power delivery of your board will both become...
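To put numbers on the Tctl point above, here's a minimal sketch of the margin arithmetic. The 70 deg default limit and the 60 deg example are the figures from the post; the function name is made up for illustration, and the reading is assumed to be one you took by hand (e.g. with TurionPowerControl -mtemp).

```python
DEFAULT_TCTL_LIMIT = 70.0  # default max Tctl quoted in the post
OC_TCTL_LIMIT = 60.0       # example point where an OC'd chip may already get unstable


def tctl_margin(tctl: float, limit: float) -> float:
    """Return how many Tctl units are left before `limit` (negative means over)."""
    return limit - tctl


reading = 58.0  # the reading discussed above
print(tctl_margin(reading, DEFAULT_TCTL_LIMIT))  # 12.0
print(tctl_margin(reading, OC_TCTL_LIMIT))       # 2.0
```

Against the stock limit there's plenty of room, but against the (assumed) lowered OC threshold the same reading leaves almost none.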
Why not split it by whole CPU? Then you could do a 36/12 or 24/24 split and you'd be able to drop the voltage on the slower CPUs.
You probably could do that, but I can't really recall if there are any caveats. It was a long time ago and it's possible there's a reason this wasn't heavily...
212 EVOs may not look as pro as Noctuas but they get the job done as well as big (120) Noctuas, if not better.
Small (92) Noctuas will probably suffice for an open-case 140W application, but I don't think they would be fine for OC.
Big (120) Noctuas should be fine for mild OC but their fans, while...
Power LED getting lit is definitely a good sign.
It's been a while since I did a recovery on these boards so I can't really remember if there are specific recovery requirements...
I will have a few of these boards accessible next weekend and should be able to exercise the recovery process and come...
OK
Though, if possible, see if you can do the SUPER.ROM recovery using the ROM from the link above. The 3.5a (H8Q series) and 3.5b (H8D series) released in 2015 were nothing but trouble for me.
Bah, I meant regular 3.5 (non-a) there (got confused a bit as only dual-socket boards ever received a 3.5b version).
Anyway, try the ROM from this package (plain 3.5): http://www.supermicro.com/support/resources/getfile.aspx?ID=2535
EDIT: one more side question -- when you power the board on...
I'd try doing recovery again. Did you use 3.5b version? If so, I'd try 3.5a -- it's visibly more stable than 3.5b.
Your choice of hardware seems OK but I haven't used either piece so can't really vouch for them.
It would make sense to confirm pinout compatibility between the test clip and the...
Which temps do you have in mind? Temps reported by.. ? IPMI? TPC? Something else?
For CPU-reported temps you can use TurionPowerControl -mtemp
You can easily tell a thermtrip by the machine shutting off (though the PSU's OCP could be responsible for this as well). Unfortunately, I have not...
Gah, sorry, I meant post #4 in _this_ thread -- http://hardforum.com/showpost.php?p=1042051765&postcount=4
To summarize --
once H8DG6/H8DGi support was complete, there was a short (12/26 - 01/03) window to do some H8DGU development but it didn't work out (no one stepped forward...).
My new...
A comprehensive list of OC approaches has been covered here: http://hardforum.com/showthread.php?t=1765747
H8DGU support has been covered in post #4.
 
OCNG 5.3 has been released
Highlights:
supports H8DG6 and H8DGi series boards
features multiplier adjustment for unlocked CPUs
For installation instructions, see: http://area51dev.blogspot.com/p/ocng5-installation.html
Note: you'll also need ocng-utils-5.3 to configure it
Short version: do not use PSI_L for power savings on server platforms; it's typically
             used in mobile platforms and desktop/server boards are unlikely to...
Current OCNG release (5.2) does not support voltage or multiplier changes.
OCNG 5.3 (which will be released this week and will also support H8DG6/H8DGi) has
appropriate provisions but only positive adjustments can be done (e.g. Vcore +25 mV,
multiplier +2.5x).
If you want to dial things...
And if you guys would like OCNG support for H8DGU, let me know (I would most likely only need some time w/Linux remote access to get it figured out).
Time is short: my obligations will prevent me from working on it at all, effective Jan 4th.
Indeed, some ES CPUs seen in the wild behaved exactly the way you describe.
I strongly encourage you to check temps using TurionPowerControl -mtemp because
what I'm about to suggest may fry your stuff... (and I don't trust HWMonitor).
If the board is wired per AMD recommendations, making the...
I'd say new thread was totally warranted here...
Some CPUs have been identified to be causing this (the board itself, per SM, doesn't
employ any overcurrent protection circuitry). The suspicion is that they're incorrectly*
asserting THERMTRIP# signal (which causes the board to shut the system...
No problem!
Aye, seems so.
SPI clock is derived from SP5100's main clock (100 MHz) which comes from one of clock
synthesizer's SRC outputs (which are not modified by OCNG).
Also, for the record, SPI flash chip in my board is rated at 75 MHz (I suppose the same
applies to yours)...
<nod>
Missed that. Nothing to see here, then...
I suppose we now need to profile the profileinterrupt [sic!].
I spent very little time looking for the source of HalpProfileInterrupt (tried to find if it's
hardware or software) and found that it's hooked up to IRQ8 (RTC) in ReactOS*.
Not...
Ok, so nothing that would (to my knowledge) cause DMI or event log update between
attempts N-1 and N.
Right.
Question related to #1 -- are you back to using the same ROM as before?
Another related question: which OCNG version were you using when BadThings(TM) happened?
FWIW, I made...
Weeee!
You are welcome :)
Indeed.
Could be something wrong happening during DMI data update (the data seems to be located
in a dedicated 64kB block starting at 0x1E0000 in the flash). Never seen anything like that
but it is one area that may be written by the firmware at boot time. In principle...
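If you have two full flash dumps (say, one taken before and one after a failed boot), a quick way to check whether that DMI block changed is to hash just that region. This is only a sketch: the offset and size are the ones quoted above, everything else (function names, the idea that your dumps are full-size images) is an assumption.

```python
import hashlib

DMI_BLOCK_OFFSET = 0x1E0000  # start of the DMI block, per the post above
DMI_BLOCK_SIZE = 64 * 1024   # 64kB


def dmi_block(image: bytes) -> bytes:
    """Slice the presumed DMI block out of a full flash image."""
    end = DMI_BLOCK_OFFSET + DMI_BLOCK_SIZE
    if len(image) < end:
        raise ValueError("image is too small to contain the DMI block")
    return image[DMI_BLOCK_OFFSET:end]


def dmi_digest(image: bytes) -> str:
    """SHA-256 of the DMI block only, so unrelated flash areas are ignored."""
    return hashlib.sha256(dmi_block(image)).hexdigest()


def dmi_changed(before: bytes, after: bytes) -> bool:
    """True if the DMI block differs between the two images."""
    return dmi_digest(before) != dmi_digest(after)
```

You'd feed it the raw contents of each dump, e.g. `open(path, "rb").read()` for whatever (hypothetical) file names your dumps have.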
Default is DHCP on IPMI LAN port with failover to one of regular GigE ports (don't remember which one).
However, I do not know of any way of resetting IPMI configuration without booting the box.
Given it's a paid job, I don't think they would take a loss there -- ofc, I can't say for sure.
Ok...
SM is known for performing paid repairs outside warranty. A few folks in here have gone
through that process (I have been lucky). Historically, SM took their time with repairs
(4-6 weeks) but did not charge a lot for performed repairs. IIRC these were mostly socket
replacements but don't quote me...
If your system is qualified for SE (140W) chips (and you're using standard power chips from
what I can tell?), you should have a decent bit of headroom (140W vs 115W -- that's ~22%).
Though note that the box may become (even more) noisy -- not sure how much of a concern
that is to you.
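For the record, the headroom arithmetic is just the difference relative to what's installed. A tiny sketch (function name is mine, the wattages are the ones above):

```python
def tdp_headroom(qualified_w: float, installed_w: float) -> float:
    """Fractional headroom of the qualified TDP over the installed chips' TDP."""
    return (qualified_w - installed_w) / installed_w


print(f"{tdp_headroom(140, 115):.1%}")  # 21.7%
```

Note this only compares rated TDPs; actual draw under OC is a separate matter.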
 
Here's a small utility that lets you identify applications that are not coded with multiple groups in mind: http://darkswarm.org/pgatester/
Sample output:
C:\Users\Administrator\Downloads>pgatester.exe
Processor Group Affinity Tester
Copyright (c) 2015 by Kris Rusocki <[email protected]>...
Just recalled this: https://msdn.microsoft.com/en-us/library/windows/desktop/dd405503%28v=vs.85%29.aspx
Seems that unless an app explicitly spills over to another processor group,
it will never utilize more than 64 "logical" processors:
Out of curiosity, what do coreinfo -g and coreinfo -n...
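To illustrate the 64-"logical"-processor limit mentioned above, here's a rough sketch of the arithmetic. It assumes only the documented cap of 64 logical processors per group; how Windows actually splits a given box (e.g. along NUMA nodes) may leave groups smaller than that, so treat the second function as an upper bound. Function names are made up.

```python
import math

MAX_GROUP_SIZE = 64  # Windows cap on logical processors per processor group


def min_processor_groups(logical_cpus: int) -> int:
    """Fewest processor groups needed to hold this many logical processors."""
    return math.ceil(logical_cpus / MAX_GROUP_SIZE)


def visible_to_single_group_app(logical_cpus: int) -> int:
    """Upper bound on processors an app sees if it never leaves its own group."""
    return min(logical_cpus, MAX_GROUP_SIZE)


for n in (48, 64, 96):
    print(n, min_processor_groups(n), visible_to_single_group_app(n))
```

So on anything above 64 logical processors, an app that isn't group-aware leaves part of the machine idle no matter how well it threads.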
Try LINPACK. This one's Intel-made so ... you probably won't get anything better than that:
https://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
Anyway, your data suggest either a BOINC cause (tasks being cycled or throttled) or an OS
cause (poor scheduling).
Nice :-)
I think air should be enough. Incidentally, catastrophic VRM failures have been reported with
sinked boards, too. While sinking theoretically shouldn't have contributed to failures, personally,
I'm wary of throwing sinks at VRM components.
My most heat-generating system was...
Many BOINC workers (don't know about Skynet specifically) frequently reach to disk, be it to read
data to process, write (partial) results or checkpoints. When they do, the computation is halted.
On desktop boards (few threads), it's not a big deal but when you multiply this by 64 or more...
Most excellent!
Side note: this level of power requires at least some airflow across the motherboard
to reduce heat stress on its components, esp. front-side VRMs, e.g. something like:
I have no experience with H80s but if they're as good as Hyper 212+ or 120mm Noctuas,
you should be just fine. All my 4Ps only ran on air. Grandpa ran a 1000W+ (AC) setup
on air as well. Custom loop definitely is not a requirement.
PCI-E clock is completely decoupled from reference clock...
With retail chips it would allow you to do either or both (at your discretion) of:
1. Forcing all-core turbo (P-state Pb1)
2. Overclocking by means of reference clock (aka BCLK in Intel nomenclature)
EDIT: for instance, here are retail CPUs with 20% OC...
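As a rough illustration of what a reference clock OC does to core speed: core clock is reference clock times multiplier, so raising the reference clock by some percentage raises the core clock by the same percentage. The 200 MHz stock reference clock and the 10x multiplier below are assumptions for the example, not values taken from this thread.

```python
STOCK_REFCLK_MHZ = 200.0  # assumed stock reference clock


def overclocked_refclk(stock_refclk_mhz: float, oc_percent: float) -> float:
    """Reference clock after raising it by `oc_percent` percent."""
    return stock_refclk_mhz * (100 + oc_percent) / 100


def core_clock_mhz(refclk_mhz: float, multiplier: float) -> float:
    """Core clock is simply reference clock times the CPU multiplier."""
    return refclk_mhz * multiplier


refclk = overclocked_refclk(STOCK_REFCLK_MHZ, 20)    # 20% OC
print(refclk, core_clock_mhz(refclk, 10.0))          # hypothetical 10x multiplier
```

The multiplier stays untouched here, which is why this route works on locked retail chips.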
To elaborate on the explanation a few posts back -- if you have such a board and can spare some
time to examine it (per my instructions) and test the firmware, then supporting LN4F boards
(including H8QG7) should be possible.
Porting the changes is a relatively simple and low-risk process. If...
CPU will typically malfunction (crash the app/OS or freeze) before
any thermal conditions are met.
Side note: we have seen CPUs with broken thermals that would signal
overheat (and cause the board to shut the system down) even at stock
speeds and medium load.
Actually, you weren't missing anything.
A bug sneaked into the Windows version of the script (no idea how); fixed it -- should be fine now.
Thanks for the report :)
BTW, I'm thinking of integrating multiplier changes with OCNG to ultimately remove the need to run TPC at boot (w/ES chips).
The script should have worked. I'll double-check when I have a spare moment.
Temps in TPC come from on-die sensors.
Also, the units are not Celsius degrees; from AMD documentation:
Virtually all AMD CPUs have a max Tctl of 70 deg regardless of Tcase or Tjmax or whatnot.
IOW, you could look at your 27 deg...