Workunits stopping?

CaptRingold

I'm new to this, having only recently switched over from BOINC projects, but it looks like every other workunit on both computers I have this running on at the moment (three clients total, one is a dual proc) leaves either a "Gromacs could not continue any further" sort of message or something about potential system instability. It appears to save what has been done up to then, send it in (partial credit?), load up another WU and start over.

Is this.. normal? No OC on either system, but they're both not exactly brand-spankin' new.


 
No... it's not normal... very rarely should a WU end early... I usually go months without having an early end unit... and then I may get one or two in a week's time... but not every other WU I get.....

In my experience, F@H taxes the system more than just about anything else... I used to run SETI, and when I first switched to F@H, my PC couldn't handle it (because my HSF was clogged with dust).... After I dusted it out, everything worked fine...

Is it Prime stable? Memtest?

BTW... Welcome to the [H]orde!!


Keep on Folding!!

 
Coincidentally, I just got an EUE on a BIG unit this week.
What flags are you running? What unit was it?
 
I concur. I recently moved my boxen from SETI to F@H (GO [H]ORDE!!! ) and folding is much harder on my CPUs and memory than SETI ever was. For example, my [email protected] flew through most SETI units in 4:30 to 5 hours; the fastest it has ever processed any F@H unit is about 18 hours. I have a gig of RAM, but all I get is Tinkers and Gromacs (I just enabled advmethods though, so that should change soon). BTW, I'm stress testing my sister's Celeron 2.7 w/256 MB; it takes about 44 hours to process what my A64 does in 36-38, at least based on the only unit it will get to do, a 1130 Tinker. Too bad she doesn't even have dial-up :(
 
One question would be: which version of F@H are you using?

There have been a few people who have had lots of problems
with the graphical client not finishing units, or crapping out..
 
I went with console on both computers. The older of the two, I'm not sure, but it has either 256 MB or 512 MB. This computer has 640 MB.

This computer is a dual 2400 MP, so it doesn't have SSE2, and I'm thinking it may have trouble since my flags are -local -forceasm -advmethods. I'll try knocking off the last two and see if that helps, and I'll start Prime and Memtest now to stress it while I run to the post office, then see if it has come up with any errors when I get back. I feel bad after it's been cranking out a unit for 18 hours and then on the 99th frame of 100 decides it's done early; mission incomplete!



edit: I think you all knew what I was talking about, but here's the exact message from one of the two clients' log files:

[23:17:46] Completed 330000 out of 1000000 steps (33)
[23:22:57] Quit 101 - Fatal error:
[23:22:57] Step 332007, time 664.014 (ps) LINCS WARNING
[23:22:57] relative constraint deviation after LINCS:
[23:22:57] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[23:22:57]
[23:22:57] Simulation instability has been encountered. The run has entered a
[23:22:57] state from which no further progress can be made.
[23:22:57] This may be the correct result of the simulation, however if you
[23:22:57] often see other project units terminating early like this
[23:22:57] too, you may wish to check the stability of your computer (issues
[23:22:57] such as high temperature, overclocking, etc.).
[23:22:57] Going to send back what have done.
[23:22:57] logfile size: 18802
[23:22:57] - Writing 19480 bytes of core data to disk...
[23:22:57] ... Done.
[23:22:57]
[23:22:57] Folding@home Core Shutdown: EARLY_UNIT_END
[23:23:01] CoreStatus = 72 (114)
 
First of all - early unit ends may generate partial points.
Try running the Performance Monitor under Microsoft Management Console and track memory usage, CPU usage and page faults. Or at least look at Task Manager, Performance, memory usage.
Some of the QMDs have been sucking up 300 MB of memory each.
We need to know the version of the OS and of F@H, the protein names, and if possible the CPU temps.

 
I would say -forceasm is the problem here. I had some stability problems with -forceasm while my rig was overclocked. Took that out and never looked back.
 
In regards to which protein I'm working on.. well, currently, a pair of Gromacs, p246_vil00MUreGS.

Looking in the logs of one client deeper, it happened to fail while on a Gromacs core, p238_c20MUre99p, and stopped around the 33rd frame.

Memory usage: well, there is Fah502-Console, 2 MB RAM and 6 MB virtual memory, and FahCore_78.exe, which is doing the work, at about 9 MB RAM and 10 MB VM, with full CPU utilization. It lists the page faults for FahCore at 5,740 (getting all this from Task Manager), and 572 for the Fah502-Console. That sounds like a lot, but then again.. Firefox currently sits at 13,000 faults and appears to be doing okay.

I did some research, and it appears the S&M stress-testing utility lives up to its name, so I ran it in lieu of Prime95 and Memtest, as it tests both CPU and memory.

After a couple hours of it, it had thoroughly checked L1 cache, L2 cache, FPU, etc., a whole workover of both CPUs, and was about 90% through stressing my RAM, thus far without errors, when I decided I wanted it to go back to folding. A guy in another forum said it failed him in 3 minutes after a week of perfect Prime95, so I figure it did its job. He also said it heated up the CPU 5-10° hotter than Prime95. Sure as hell made my room warm :D

I removed just the -forceasm flag; not sure what advmethods will do for me, but so far so good. We'll see if it folds the rest of the week solid without me noticing an error. (Is there any utility that'll notify me if it does? I'm using EMI III, and it'll go elephant on me if I'm right there in front of the comp and it hiccups, but if I'm away I never know, except when I look at the progress and know it couldn't possibly have finished the last WU and started another.)

Thanks for the help, too, guys!
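
On the question of a utility that notices early ends while nobody is watching: here is a minimal sketch in Python, not an official tool, that scans client log lines for the shutdown message shown in the log excerpt in this thread. The function name `find_early_ends` is made up for illustration, and the script assumes the log keeps the same "[HH:MM:SS] ..." line format as that excerpt.

```python
import re

# Both strings are copied from the log excerpt posted in this thread.
EUE_MARKER = "Folding@home Core Shutdown: EARLY_UNIT_END"
PROGRESS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def find_early_ends(log_lines):
    """Return (timestamp, last_percent) for each early unit end found.

    last_percent is the most recent "Completed X out of Y steps" progress
    seen before the shutdown line, or None if no progress line was seen.
    """
    events = []
    last_percent = None
    for line in log_lines:
        match = PROGRESS_RE.search(line)
        if match:
            last_percent = 100 * int(match.group(1)) // int(match.group(2))
        elif EUE_MARKER in line:
            # Assumes lines start with "[HH:MM:SS]" as in the excerpt.
            timestamp = line[1:9] if line.startswith("[") else None
            events.append((timestamp, last_percent))
            last_percent = None
    return events
```

To use it, open the client's log file (the path depends on where the client is installed) and pass the file object to `find_early_ends`; each returned pair tells you when a unit ended early and roughly how far it had gotten.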

 
This flag (-advmethods) allows you to receive BETA cores.

the only one I know of right now is the QMD.

these things FLY on P4s with SSE and the faster the FSB the better.
these things crawl on AMDs if they finish.


The [H]orde needs You!
 
jfb9301 said:
these things FLY on P4s with SSE and the faster the FSB the better.
these things crawl on AMDs if they finish.
You won't see many QMDs on AMD systems for a while... Simple explanation from Stanford's site:
QMD FAQ @ folding.stanford.edu said:
8. What about SIMD (SSE/SSE2/3D-Now) support? QMD core uses double precision only. Therefore, only SSE2 will be useful. Currently, Intel Pentium 4 SSE2 is supported, and AMD Athlon64 SSE2 will be added soon.

I'm sure they won't crawl once A64 SSE2 is supported.

 
[19:13:12] Completed 270000 out of 1000000 steps (27)
[19:17:06] Gromacs cannot continue further.
[19:17:06] Going to send back what have done.

Does that mean it was my fault? That lacked the whole 'system instability' message that previous failings have resulted in, so I'm thinking that one was supposed to occur.
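
For keeping track of which kind of early end each one was, a rough sketch like the following could sort them by the message that preceded "Going to send back what have done." The three strings are copied from the log excerpts in this thread; the function name and the labels are made up for illustration. It only sorts the messages and says nothing about whether the hardware or the work unit was at fault.

```python
# Message fragments copied from the two log excerpts in this thread.
INSTABILITY = "Simulation instability has been encountered"
PLAIN_STOP = "Gromacs cannot continue further"
SEND_BACK = "Going to send back what have done"


def classify_early_ends(log_lines):
    """Label each 'send back' event by the message seen just before it."""
    labels = []
    reason = "unknown"
    for line in log_lines:
        if INSTABILITY in line:
            # The instability text takes priority over the plain stop message.
            reason = "instability warning"
        elif PLAIN_STOP in line and reason == "unknown":
            reason = "cannot continue, no instability warning"
        elif SEND_BACK in line:
            labels.append(reason)
            reason = "unknown"  # reset for the next work unit
    return labels
```

If most of the labels come back as "instability warning", that lines up with the log's own suggestion to check the stability of the computer.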

By the way, in just 5 min S&M heated up my CPUs 13°F hotter than FAH does after running all day, almost sort of scary.. MBM acted like my CPUs were about to melt :p

 
I would agree with the masses and try Memtest and Prime and see if they show any anomalous behavior. Those seem to be the preferred methods of testing a system, at least in this neck of the woods. Also, is there possibly any software you use that hangs or is just unstable? I have seen some of the engineering software I use cause system problems and WUs to go haywire.

 
Well, I wasted a lot of my time stress testing the CPUs, as for some reason I really didn't imagine it was a RAM issue; I didn't think RAM particularly went bad if not overclocked.. anyway, I ran Memtest overnight, and it turned up 8 errors, and Azureus and a half dozen other programs had crashed as well. (I don't know if you guys use Memtest86, the boot disk, or Memtest, the little Windows app, or both, but I used the Windows app, though Memtest86 did find 1 error after 50 minutes; I guess it'd find 8 too if it ran all night.) This folding business really does stress a system; in the two or three years I've had this system, I've honestly never suspected instability! :p [H]ard stuff indeed.

Anyway, I've got a 512 MB and a 128 MB stick in this box. I'm going to pull out the 128 first, then rerun tests for 12 to 24 hrs.. and if that doesn't fix it, I think I'll know it's the 512 stick. At least RAM isn't too expensive to replace these days.

 