6.22 SMP - stability still a problem

relic

So far I have tested the 6.22 client on 4 SMP systems with the -SMP flag set; all have EUEed. Three were running stable on 2x5.91s + 1xGPU2 for weeks. The other was running 1x5.91 and 1xGPU2, again stable for weeks.

Last night I continued testing on one box (Vista x64, Q6600 at 3GHz, 1.25 vcore, 8GB, 8800GT). At 100% load the CPU tops out at 55C and the GPU at 57C.

Two 5.91 clients with a GPU2 client will run stable on this system for weeks.
On the same system a solitary 6.22 client will crash with the old:
Folding@home Core Shutdown: MISSING_WORK_FILES...Client-core communications error: ERROR 0x1
Usually it would fail about 8-10% into a WU.

Now in the past this was stated to be a problem with the system or the WU. The WU is a 2665, nothing strange there. The system is very mildly OCed, so I stepped it down to stock but still encountered the MISSING_WORK_FILES error with 6.22. (Not that the OC mattered here; as stated earlier, TWO 5.91s + GPU2 will run without issue for weeks on end.)

So I tried turning the CPU usage down. Bingo. At the 90% setting it seems to be stable again. So now I am running two 6.22 -SMP clients set to 90% and one GPU2. Obviously this stresses the CPU to 100%. Definitely a 6.22 client problem, not a WU or system error, as the same WU on the same system works in 5.91 at 100% settings.

Now Stanford is well known for finger-pointing and blaming everything but their own software, or using the never-ending excuse "it's beta code" (have you noticed that almost all of F@H's code is beta until it's too old to be useful?), so I expect this craptacular client to be released "as is" on Saturday and the 5.91 to go away even though it's more stable.

Now you're forewarned.
We're getting another piece of garbageware, and we have to figure out how to get it to run.
At least we have a band-aid for the poor programming.
 
You are correct, I'm seeing a few reports of this issue so I'll go kick a few asses around. This is especially disturbing since it's very stupid to release a very new client version a few days before the deadline and make this version the official one, basically stuffing it down the throats of all users. This is bad beta-testing practice (a good practice would be to avoid releasing a beta client close to a deadline and instead offer a choice of either version until all the kinks get worked out).

Thanks for shedding light on the stupid solution.

 
Very interesting.

All I can say is 2xSMP 6.22 MPI = 2800-3100 PPD without a single EUE on:

Q6600 @ 3.4GHz
True HSF
EVGA 780i mobo
2x 88GTX with GPU Clients

I may be the exception rather than the rule though... not really sure!

I'll go double check those clients tonight and make sure they are not EUE'n... They show up green though... not that Fahmon is flawless.

 
You are correct, I'm seeing a few reports of this issue so I'll go kick a few asses around. This is especially disturbing since it's very stupid to release a very new client version a few days before the deadline and make this version the official one, basically stuffing it down the throats of all users. This is bad beta-testing practice (a good practice would be to avoid releasing a beta client close to a deadline and instead offer a choice of either version until all the kinks get worked out).

Thanks for shedding light on the stupid solution.

At least we have a workaround, Xil.
(It's what we've always had to do... nothing new. ;) )

If all we need to do is cut the client down to 90% to fix the EUE issue then we have a way to keep frustrated folks running. I know it's goofy, but it will keep people with us while the client is being fixed.

Thanks for going to rattle the cages.
 
So I tried turning the CPU usage down. Bingo. At the 90% setting it seems to be stable again. So now I am running two 6.22 -SMP clients set to 90% and one GPU2. Obviously this stresses the CPU to 100%. Definitely a 6.22 client problem, not a WU or system error, as the same WU on the same system works in 5.91 at 100% settings.

Are you pretty much saying you offer the GPU client the last 10% and it takes it all then?

 
A PM has been sent to Peter Kasson and Vijay Pande about the poor execution of, and decisions about, the new beta client. I hope they realize this and make the necessary changes.

This is like having a new Windows version out and telling everyone to upgrade within a few weeks with no chance to stay with the old one. That's poor practice...

 
Very interesting.

All I can say is 2xSMP 6.22 MPI = 2800-3100 PPD without a single EUE on:

I'm sure some systems run fine or the client would never have been released at all.
Well, at least I would hope that is true. ;)
All data points even "it works for me" are important when troubleshooting.

I have redone the other systems with 6.22 at 90% to see if the "fix" is as reproducible as the error condition. If yes, we know for certain that there is a serious issue in 6.22 that didn't exist in 5.91. At least on some systems.

Surprisingly I am able to reproduce the error on a variety of systems/OSs and both Intel and NVidia chipsets. The only thing in common on all systems is an NVidia GPU (3 8800GTs and an 8700M GT) and the CUDA driver (although slightly differing versions).

I have not tried 6.22 on the P4 (non SMP) yet to see if I can reproduce the EUE condition, perhaps I'll do that for another data point.

It is also possible that through an amazing streak of bad luck all of my WUs have been bad ones only when testing the 6.22 client; very unlikely, but possible. I had never seen an SMP client error until loading the 6.22 client, while others have seen them fairly regularly even with the old client.
 
I don't quite understand all I know about this BS about client v5.91 vs client v6.22. (which is very little, I haven't even upgraded yet, you know the old sayin' "good to the last WU", err... drop :rolleyes: )

It seems to me "old timers" like Mr relic or BillR are very seldom wrong about anything folding wise. ("old timers" = experience folding, not age, except Mr relic, he's got both :p)

As "Smoke" said (another pretty good authority on folding along with several other "old timers" on this team) they shouldn't have released the latest 6.22 client so soon to the deadline of the 5.91 client.

My question is: can you set the system clock back to before the expiration date of the v5.91 client and still use it for a while, or at least until some of the "bugs" that I've read about in the v6.22 client are worked out? :(

If these questions aren't particularly on the bright side I apologise in advance :(

Thanks for any answers :D

FOLD ON!

 
I've had word from Peter Kasson and Vijay Pande. They will discuss the clients today and come up with a plan. Chances are that they will end up repackaging 5.91/5.92 with a new expiration date and then let all 3 clients coexist for a short while so we can iron out the bugs in the 6.22 client.
 
I've had word from Peter Kasson and Vijay Pande. They will discuss the clients today and come up with a plan. Chances are that they will end up repackaging 5.91/5.92 with a new expiration date and then let all 3 clients coexist for a short while so we can iron out the bugs in the 6.22 client.

If they re-package the 5.91 and 5.92 clients with a new expiration date, does that mean we will still have to re-download and install them? I would imagine so, right?

 
Are you pretty much saying you offer the GPU client the last 10% and it takes it all then?

Wheresatom,
I run 3 clients on one system normally.
Up until 6.22 I could run GPU at "slightly higher" priority, and two SMPs at "idle" priority and set to 100% CPU usage.

With 6.22 the client crashes even when only the 1 client is running, unless set to 90% CPU usage. This is a very strange error as three clients should stress the system much more than just one.

I'm hoping that someone can reproduce the error and we can compare notes to see what is the cause of the 6.22 client instability....and avoid the problem.

Common to my systems able to produce this error:
Intel Core2 CPUs
NVidia GPUs
CUDA Video drivers

RAM varies from 3-8GB
OC's vary from none to 20%
OSs Vista x86, Vista x64 and XP pro x86
Chipsets are 3 Intel and one NVidia
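
For anyone wanting to compare notes, the "what do the failing boxes have in common" check can be sketched as a simple set intersection. This is only an illustration in Python: the four records are made-up stand-ins (which OS goes with which chipset is an assumption, not something stated in the list above), and `common_traits` is a hypothetical helper name.

```python
def common_traits(systems):
    """Return the (key, value) pairs shared by every system record."""
    if not systems:
        return set()
    shared = set(systems[0].items())
    for system in systems[1:]:
        # Keep only the traits this system also has with the same value.
        shared &= set(system.items())
    return shared


# Hypothetical records: the per-box pairing of OS and chipset is assumed.
FAILING_SYSTEMS = [
    {"cpu": "Intel Core2", "gpu": "NVidia", "driver": "CUDA", "os": "Vista x64", "chipset": "Intel"},
    {"cpu": "Intel Core2", "gpu": "NVidia", "driver": "CUDA", "os": "Vista x86", "chipset": "Intel"},
    {"cpu": "Intel Core2", "gpu": "NVidia", "driver": "CUDA", "os": "XP Pro x86", "chipset": "Intel"},
    {"cpu": "Intel Core2", "gpu": "NVidia", "driver": "CUDA", "os": "Vista x86", "chipset": "NVidia"},
]

print(sorted(common_traits(FAILING_SYSTEMS)))
```

If someone who runs 6.22 cleanly adds their box to a second list, whatever traits survive in the failing list but are absent from the clean one are the first things worth looking at.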
 
If they re-package the 5.91 and 5.92 clients with a new expiration date, does that mean we will still have to re-download and install them? I would imagine so, right?

You would have to redownload them but then just get the main exe file and replace it on each box. Just have to stop each client for a little bit to do it.



 
Yes, regardless of the outcome, we must redownload and replace the current SMP client. My own opinion is to give the 6.22 beta client a try since the expiration is in 6 months and not the usual 3 months. Only fall back to 5.91/5.92 if you cannot resolve an issue with 6.22.

The goal they want to reach is to have 6.22 exclusively soon, since it implements a few necessary changes to better handle newer cores and to get rid of any trace of v5. For this, we need to use it and report issues, since that is what beta testing is about. If you don't want to deal with beta testing, you should not use a beta client and should use a non-beta client instead (for SMP, the Linux/OSX 6.02 is out of beta in case you care).

We are the [H]orde and we won't give up over such a petty issue as a beta bug ;) relic proved himself by finding the band-aid for the issue. Peter also found out that the extra safety checks in 6.22 are overreacting with an A1 EUE, so expect a new version shortly.

 
If these questions aren't particularly on the bright side I apologise in advance :(

All questions are always good ones...it's those who don't ask that make the bigger mistakes. ;)

I think the best plan, since we have a workaround, is as Xilikon suggested. Go to 6.22 if possible. I would suggest dropping back to 90% CPU settings if you have an EUE; otherwise leave it at 100%. The 5.91/5.92 repackaging isn't really helping except to keep some people running while they sort out the 6.22 errors. So use the repackaged 5.9x clients as a last resort.

I have no idea how this reduced CPU usage fixes the client crashing problem, I only know that it is reproducible and effective. The programmers have to take it from there, but at least they have something to work with.
 
I don't quite understand all I know about this BS about client v5.91 vs client v6.22. (which is very little, I haven't even upgraded yet, you know the old sayin' "good to the last WU", err... drop :rolleyes: )

It seems to me "old timers" like Mr relic or BillR are very seldom wrong about anything folding wise. ("old timers" = experience folding, not age, except Mr relic, he's got both :p)

As "Smoke" said (another pretty good authority on folding along with several other "old timers" on this team) they shouldn't have released the latest 6.22 client so soon to the deadline of the 5.91 client.

My question is: can you set the system clock back to before the expiration date of the v5.91 client and still use it for a while, or at least until some of the "bugs" that I've read about in the v6.22 client are worked out? :(

If these questions aren't particularly on the bright side I apologise in advance :(

Thanks for any answers :D

FOLD ON!


Heh, you keep making allusions to (Mr.) relic’s age vs. mine. The truth is relic is but a teenager in the big scope of things. I’m all but old enough to be his daddy, and that, my friend, is one disgusting thought. :rolleyes:

I believe I have an answer to how to make the new program run reliably, although I can take no responsibility for relic’s hardware building skills. ;)




 
I just installed, then reinstalled the 6.22 program successfully three different times with repeatable results.

Step one, uninstall all instances of the old SMP client. If you were running two in Windows before, uninstall both. Done properly, each uninstall should require a standard reboot to free up the files in use. At that point manually delete all the folders containing anything about SMP that had been in use before.

Now, download the combo 32-64 bit client 6.22; it is the less buggered of the two.

Double click to install and it may well choose to use one of the directories you used before even though by now you should have deleted it. This is all good. Allow it to do so.

Prior to this, in Vista you had to disable User Account Control (UAC). Well, don’t do that this time. As long as you are the admin in your OS and have to log in with a password, you are all set.

Double click the install bat, enter your username and passwords, you will get the normal “if you see this twice crap press any key to close screen”, again, this is good.

Now, make a shortcut to your new .exe file and add -smp -configonly to the shortcut.

Click on your shortcut and do your configuration. I did use advanced options and left the work unit size at “Normal”, CPU priority at idle and CPU usage at 100%, with no advanced methods, and I left the two mystery config lines empty. I allowed machine #1 to be the default.

Now, go back to your shortcut, edit it through Properties and make it read -smp -forceasm, then save and close.

Click or double click your shortcut (depending on where you put it, I like quick launch) and bingo you should be up and running.
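
The two-pass shortcut trick in the steps above can be written down as a small sketch. This is Python purely for illustration: the flags are the ones named in the steps, but `CLIENT_EXE` is a hypothetical path (use wherever your installer actually put the .exe) and both helper functions are made-up names.

```python
def shortcut_arguments(phase):
    """Return the client flags for the given phase of the install steps."""
    flags_by_phase = {
        "configure": ["-smp", "-configonly"],  # first launch: do the configuration
        "run": ["-smp", "-forceasm"],          # afterwards: the flags left on the shortcut
    }
    return flags_by_phase[phase]


def shortcut_target(exe_path, phase):
    """Build the text that goes in the shortcut's Target field."""
    return " ".join(['"%s"' % exe_path] + shortcut_arguments(phase))


CLIENT_EXE = r"C:\FAH\client.exe"  # hypothetical path, not the real install location

print(shortcut_target(CLIENT_EXE, "configure"))
print(shortcut_target(CLIENT_EXE, "run"))
```

The point is only that the same shortcut gets two different sets of flags, one after the other, never all three flags at once.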

I’m finding temps to be about the same, and at 3.2GHz a 2665 is hitting 14 min per frame and a 3065 12 min per frame.
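
For anyone who wants to turn minutes per frame into PPD by hand, here is a minimal sketch in Python. It assumes a frame is 1% of a work unit (100 frames per WU) and that the client runs around the clock; the point value passed in is a placeholder, so look up the real credit for the project before trusting the result. `points_per_day` is a hypothetical helper name.

```python
def points_per_day(points_per_wu, minutes_per_frame, frames_per_wu=100):
    """Estimate PPD from the credit of one WU and the time taken per frame."""
    minutes_per_wu = minutes_per_frame * frames_per_wu
    minutes_per_day = 24 * 60
    return points_per_wu * minutes_per_day / minutes_per_wu


# Placeholder credit of 1000 points per WU, for illustration only.
print(points_per_day(1000, 14))
print(points_per_day(1000, 12))
```

Because PPD scales with 1 / (minutes per frame), going from 14 to 12 minutes a frame on the same WU would be roughly a 17% gain, whatever the real point value is.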

Give it a shot, leme know.;)

Luck;)


 
Bill,

Almost exactly as I installed. I disabled UAC permanently on my systems as I have no use for it, but otherwise we had similar approaches.

One of the 90% CPU usage systems EUEed today, again at 8% into WU completion.
Apparently our workaround isn't 100% reliable. This doesn't bode well....we'll likely be going back to 5.91 unless a patch is developed quickly.

This kind of instability isn't acceptable.

As a side note, this is again Project: 2665 (Run 0, Clone 207, Gen 16) WU that is having a problem in 6.22.

Code:
[20:45:52] Completed 20000 out of 250000 steps  (8 percent)
[21:01:16] Warning:  long 1-4 interactions
[21:01:17] Gromacs cannot continue further.
[21:01:17] Going to send back what have done.
[21:01:17] logfile size: 0
[21:01:17] Warning: Core could not open logfile.
[21:01:17] - Writing 536 bytes of core data to disk...
[21:01:17]   ... Done.
[21:01:17] - Failed to delete work/wudata_05.sas
[21:01:17] - Failed to delete work/wudata_05.goe
[21:01:17] Warning:  check for stray files
[21:01:17] 
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:17] 
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:20] CoreStatus = 7B (123)
[21:01:20] Client-core communications error: ERROR 0x7b
[21:01:20] This is a sign of more serious problems, shutting down.
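
To make comparing notes easier, a log like the one above can be scanned for abnormal core shutdowns and how far into the WU they hit. The following is a rough Python sketch, not anything shipped with the client: the patterns are taken only from the lines in this excerpt, `find_core_failures` is a hypothetical helper, and treating FINISHED_UNIT as the one normal shutdown reason is an assumption.

```python
import re

# Patterns based on the log lines shown above; other client versions may differ.
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps\s+\((\d+) percent\)")
SHUTDOWN = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Folding@home Core Shutdown: (\w+)")


def find_core_failures(log_text, normal_reasons=("FINISHED_UNIT",)):
    """Return one record per abnormal core shutdown with the last progress seen."""
    failures = []
    last_percent = None
    previous_key = None
    for line in log_text.splitlines():
        progress = PROGRESS.search(line)
        if progress:
            last_percent = int(progress.group(3))
            continue
        shutdown = SHUTDOWN.match(line.strip())
        if not shutdown:
            continue
        key = (shutdown.group(1), shutdown.group(2))
        # The client prints the shutdown line twice; count it once.
        if key == previous_key or key[1] in normal_reasons:
            continue
        previous_key = key
        failures.append({"time": key[0], "reason": key[1], "percent": last_percent})
    return failures


SAMPLE_LOG = """\
[20:45:52] Completed 20000 out of 250000 steps  (8 percent)
[21:01:16] Warning:  long 1-4 interactions
[21:01:17] Gromacs cannot continue further.
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:17]
[21:01:17] Folding@home Core Shutdown: EARLY_UNIT_END
[21:01:20] CoreStatus = 7B (123)
"""

print(find_core_failures(SAMPLE_LOG))
```

Run over the logs from several boxes, this gives a quick list of failure reasons and percentages, which is handy for checking whether the 8-10% pattern holds everywhere.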
 
Peter Kasson mentioned that 6.22 included some safety checks which are too aggressive, causing EUEs for nothing and then not being able to clean up the shitty mess after an EUE (MISSING_WORK_FILES). Expect a new version very soon.

 
Too funny.

Our favorite idiot has chimed in....
Finger pointing as usual.
Why do they put up with this moron?

Re: BAD_CORE_FILES
by 7im on Thu Jul 31, 2008 7:19 pm

Are you overclocking? NaN errors tend to be hardware related, and p2665s are extray chewey.

Dumber than a brick. I doubt I've met a more useless individual.
 
I just turned in my first 6.22 born work unit and have two others, 1 is almost half way and the 2nd is well over 2/3s.

Those choosing to disable UAC might want to try this old trick. Start the program once as an admin, give it a few moments, Ctrl+C and restart normally.

Please don't ask me why but I've had it work before. Keep in mind we still have a fair number of folders who couldn't get the original to work.

Luck;)

 
Too funny.

Our favorite idiot has chimed in....
Finger pointing as usual.
Why do they put up with this moron?



Dumber than a brick. I doubt I've met a more useless individual.

2665s are extra chewey??????????? I don't think anyone else knew that, what an eye opener:rolleyes:

 
Just wanted to get this back to the top... any updates relic on how it's going?

Also wanted to remind everyone who isn't aware that the GPU client is expiring 8/2 w/o a replacement yet. Nice job again there, Stanford. At least you're friggin' consistent... :rolleyes:

Q: I know you're probably working on it anyway, but I just wanted to know if we will be getting a new GPU2 client today since the beta 8 and beta 11 clients expire tomorrow. I know they will keep going until you restart the client, but you never know when a reboot is needed :)

A#1: You can set your system clock back and still use the client; hopefully we won't have to do this. Just in case we don't get a new client on time, but I'm sure it'll be up soon.

A#2: Don't worry, you'll have a new client on time ... maybe short time, but on time anyways ;)

*SIGH*

 
OOPS.... wrong thread. Forgot there was a new one... :eek:
 
Don't forget the same asshole answer from our beloved idiot :

"You should already know they are working on it, why bother asking ?"

 
Missed that one... but I try to limit my time there as much as possible. Sounds right in context for the asshole involved.

My motto is: Get in there, get what you need, get out before too much of it sticks to your shoes... :D

 
"You should already know they are working on it, why bother asking ?"

Xil...no no, you're supposed to be the positive/happy guy this week :) .... Next week you get to be bitter and BillR has to be positive, the week after that I'll pretend to be happy with Stanford, then SmokeRngs gets his turn to fake a smile. :D :D

What we really need for entertainment is a website dedicated to "Stupid things 7im said", it's amazing how worthless a foldingforum moderator can be. That guy has done more to hurt the project than anyone else. He's a one man anti-folding juggernaut.
 
LOL, don't worry about that... I'm now looking past 7im's stupid actions and I chuckle each time I see him goof again (matter of fact, he pissed off another guy yesterday on FCF and I'm sure that guy decided to leave F@H :rolleyes:).

 
LOL, don't worry about that... I'm now looking past 7im's stupid actions and I chuckle each time I see him goof again (matter of fact, he pissed off another guy yesterday on FCF and I'm sure that guy decided to leave F@H :rolleyes:).

PM the poor guy on FCF and tell him to join [H]...if he despises 7im he'll fit right in. :D
 
Xil...no no, you're supposed to be the positive/happy guy this week :) .... Next week you get to be bitter and BillR has to be positive, the week after that I'll pretend to be happy with Stanford, then SmokeRngs gets his turn to fake a smile. :D :D

What we really need for entertainment is a website dedicated to "Stupid things 7im said", it's amazing how worthless a foldingforum moderator can be. That guy has done more to hurt the project than anyone else. He's a one man anti-folding juggernaut.

And yet we could only get 70 signatures on the beat 7im with a stick petition, errrr de-mod 7im petition. Go figure. :rolleyes:


216
 
And yet we could only get 70 signatures on the beat 7im with a stick petition, errrr de-mod 7im petition. Go figure. :rolleyes:

No one bothers to go there, it's a cesspit of stupidity...70 people is half a year's activity on FCF. ;)
 
No one bothers to go there, it's a cesspit of stupidity...70 people is half a year's activity on FCF. ;)

I think 70 is too high of a number for that place myself. And out of that 70, I think a good 50 of those signatures came from OCAU.

But I do like the stupid 7im quotes thing. :D


216
 
I think I have the whole problem solved.

4 quads means 16 single core virtual machines.

16 installs of win-98 should go pretty well and we know those clients work.:rolleyes:


 
I think I have the whole problem solved.

4 quads means 16 single core virtual machines.

16 installs of win-98 should go pretty well and we know those clients work.:rolleyes:


But is the Win98 client out of beta yet?
 
I can only get 350 PPD in FahMon using this new client. 5.92 would be producing 800-900 PPD. I added the -smp flag while I was configuring it and it is using both cores at 100%. This is a 2665 though...

 