What to do when your SMP clients shut down?

leSLIe

I recently had some problems with my PC's power cord and the PC shut down. I copied the backup files back and started the SMP clients again. The first one, "FaH SMP2", gave me "Folding@Home Core Shutdown: FILE_IO_ERROR":

Code:
Note: Please read the license agreement (fah.exe -license). Further
use of this software requires that you have read and accepted this agreement.

If you see this twice, MPI is working
If you see this twice, MPI is working
4 cores detected
No directory settings found in registry. Using current directory...


--- Opening Log file [May 2 14:45:33]


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 5.91beta6

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: D:\Folding@Home\FoldSMP2
Executable: D:\Folding@Home\FoldSMP2\fah.exe
Arguments: -advmethods -forceasm -verbosity 9

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[14:45:33] - Ask before connecting: No
[14:45:33] - User name: leSLIe (Team 33)
[14:45:33] - User ID: 2A15A7363A356C29
[14:45:33] - Machine ID: 2
[14:45:33]
[14:45:33] Loaded queue successfully.
[14:45:33] - Preparing to get new work unit...
[14:45:33] - Autosending finished units...
[14:45:33] Trying to send all finished work units
[14:45:33] + No unsent completed units remaining.
[14:45:33] - Autosend completed
[14:45:33] + Attempting to get work packet
[14:45:33] - Will indicate memory of 1024 MB
[14:45:33] - Connecting to assignment server
[14:45:33] Connecting to http://assign.stanford.edu:8080/
[14:45:34] Posted data.
[14:45:34] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[14:45:34] + News From Folding@Home: Welcome to Folding@Home
[14:45:34] Loaded queue successfully.
[14:45:34] Connecting to http://171.64.65.63:8080/
[14:45:36] Posted data.
[14:45:36] Initial: 0000; - Receiving payload (expected size: 608878)
[14:45:39] - Downloaded at ~198 kB/s
[14:45:39] - Averaged speed for that direction ~189 kB/s
[14:45:39] + Received work.
[14:45:39] + Closed connections
[14:45:39]
[14:45:39] + Processing work unit
[14:45:39] Core required: FahCore_a1.exe
[14:45:39] Core found.
[14:45:39] Working on Unit 02 [May 2 14:45:39]
[14:45:39] + Working ...
[14:45:39] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 02 -priority 96 -checkpoint 15 -forceasm -verbose -lifeline 2156 -version 591'

[14:45:39]
[14:45:39] *------------------------------*
[14:45:39] Folding@Home Gromacs SMP Core
[14:45:39] Version 1.74 (March 10, 2007)
[14:45:39]
[14:45:39] Preparing to commence simulation
[14:45:39] - Ensuring status. Please wait.
[14:45:40] - Starting from initial work packet
[14:45:40]
[14:45:40] Project: 3062 (Run 4, Clone 15, Gen 97)
[14:45:40]
[14:45:40] Assembly optimizations on if available.
[14:45:40] Entering M.D.
[14:45:57]  on if available.
[14:45:57] Entering M.D.
[14:46:03] Couldn't open Go file
[14:46:03]
[14:46:03] Folding@home Core Shutdown: FILE_IO_ERROR
[14:46:03]
[14:46:03] Folding@home Core Shutdown: FILE_IO_ERROR
The weird thing here is that all the cores were at 100%, yet apparently nothing was going on. So I waited a couple of minutes and started the second client, "FaH SMP1". It gave the same "Folding@Home Core Shutdown: FILE_IO_ERROR" message as the other client, except this time it moved on and started over:

Code:
Note: Please read the license agreement (fah.exe -license). Further
use of this software requires that you have read and accepted this agreement.

If you see this twice, MPI is working
If you see this twice, MPI is working
4 cores detected
No directory settings found in registry. Using current directory...


--- Opening Log file [May 2 14:50:22]


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 5.91beta6

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: D:\Folding@Home\FoldSMP1
Executable: D:\Folding@Home\FoldSMP1\fah.exe
Arguments: -advmethods -forceasm -verbosity 9

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[14:50:22] - Ask before connecting: No
[14:50:22] - User name: leSLIe (Team 33)
[14:50:22] - User ID: 2A15A7363A356C29
[14:50:22] - Machine ID: 1
[14:50:22]
[14:50:22] Loaded queue successfully.
[14:50:22] - Preparing to get new work unit...
[14:50:22] - Autosending finished units...
[14:50:22] Trying to send all finished work units
[14:50:22] + No unsent completed units remaining.
[14:50:22] - Autosend completed
[14:50:22] + Attempting to get work packet
[14:50:22] - Will indicate memory of 1024 MB
[14:50:22] - Connecting to assignment server
[14:50:22] Connecting to http://assign.stanford.edu:8080/
[14:50:22] Posted data.
[14:50:22] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[14:50:22] + News From Folding@Home: Welcome to Folding@Home
[14:50:23] Loaded queue successfully.
[14:50:23] Connecting to http://171.64.65.63:8080/
[14:50:24] Posted data.
[14:50:24] Initial: 0000; - Receiving payload (expected size: 608255)
[14:50:27] - Downloaded at ~197 kB/s
[14:50:27] - Averaged speed for that direction ~114 kB/s
[14:50:27] + Received work.
[14:50:27] + Closed connections
[14:50:27]
[14:50:27] + Processing work unit
[14:50:27] Core required: FahCore_a1.exe
[14:50:27] Core found.
[14:50:27] Working on Unit 02 [May 2 14:50:27]
[14:50:27] + Working ...
[14:50:27] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 02 -priority 96 -checkpoint 15 -forceasm -verbose -lifeline 3780 -version 591'

[14:50:28]
[14:50:28] *------------------------------*
[14:50:28] Folding@Home Gromacs SMP Core
[14:50:28] Version 1.74 (March 10, 2007)
[14:50:28]
[14:50:28] Preparing to commence simulation
[14:50:28] - Ensuring status. Please wait.
[14:50:45] - Assembly optimizations manually forced on.
[14:50:45] - Not checking prior termination.
[14:50:45] - Expanded 607743 -> 3260637 (decompressed 536.5 percent)
[14:50:45] - Starting from initial work packet
[14:50:45]
[14:50:45] Project: 3062 (Run 5, Clone 157, Gen 9)
[14:50:45]
[14:50:45] Assembly optimizations on if available.
[14:50:45] Entering M.D.
[14:50:50]  work thread failure
[14:50:50] - Shutting down core
[14:50:50]
[14:50:50] Folding@home Core Shutdown: FILE_IO_ERROR
[14:50:50] Finalizing output
[14:50:56] CoreStatus = 7B (123)
[14:50:56] Client-core communications error: ERROR 0x7b
[14:50:56] Deleting current work unit & continuing...
[14:53:16] - Warning: Could not delete all work unit files (2): Core returned invalid code
[14:53:16] Trying to send all finished work units
[14:53:16] + No unsent completed units remaining.
[14:53:16] - Preparing to get new work unit...
[14:53:16] + Attempting to get work packet
[14:53:16] - Will indicate memory of 1024 MB
[14:53:16] - Connecting to assignment server
[14:53:16] Connecting to http://assign.stanford.edu:8080/
[14:53:16] Posted data.
[14:53:16] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[14:53:16] + News From Folding@Home: Welcome to Folding@Home
[14:53:16] Loaded queue successfully.
[14:53:16] Connecting to http://171.64.65.64:8080/
[14:53:20] Posted data.
[14:53:20] Initial: 0000; - Receiving payload (expected size: 2435236)
[14:53:32] - Downloaded at ~198 kB/s
[14:53:32] - Averaged speed for that direction ~142 kB/s
[14:53:32] + Received work.
[14:53:32] + Closed connections
[14:53:37]
[14:53:37] + Processing work unit
[14:53:37] Core required: FahCore_a1.exe
[14:53:37] Core found.
[14:53:37] Working on Unit 03 [May 2 14:53:37]
[14:53:37] + Working ...
[14:53:37] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 03 -priority 96 -checkpoint 15 -forceasm -verbose -lifeline 3780 -version 591'

[14:53:37]
[14:53:37] *------------------------------*
[14:53:37] Folding@Home Gromacs SMP Core
[14:53:37] Version 1.74 (March 10, 2007)
[14:53:37]
[14:53:37] Preparing to commence simulation
[14:53:37] - Ensuring status. Please wait.
[14:53:54] - Assembly optimizations manually forced on.
[14:53:54] - Not checking prior termination.
[14:53:58] - Expanded 2434724 -> 12901369 (decompressed 529.8 percent)
[14:53:58] - Starting from initial work packet
[14:53:58]
[14:53:58] Project: 2653 (Run 14, Clone 12, Gen 51)
[14:53:58]
[14:53:58] Assembly optimizations on if available.
[14:53:58] Entering M.D.
[14:54:06] Rejecting checkpoint
[14:54:07] Protein: Protein in POPCExtra SSE boost OK.
[14:54:07]
[14:54:08] Extra SSE boost OK.
[14:54:09] Writing local files
[14:54:09] Completed 0 out of 500000 steps  (0 percent)

To summarize: I restarted my SMP clients after a PC shutdown, using the backup files. One client got to continue, but the other one is stuck. What do I have to do to make the non-working client start folding again?

Pic of the task manager attached (cpu.PNG)


I'm getting weary of this :( ... should I start using a VM?
 
I take it you are running two instances in the Windows environment? I stick with VMs anywhere I want to run more than one SMP instance, just to avoid this...

I am not sure of the recovery method.

***DOH edit*** I see the 8 fah_cores in the task manager. I am not sure what to do here.

Since the one is starting off back at zero again, is killing the core, deleting all the work, and rebooting an option?
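For the classic v5.x client, "delete all work" boils down to stopping fah.exe (and any leftover mpiexec/FahCore_a1.exe processes), then wiping the unit state so the client fetches a fresh WU. Here's a hedged sketch of the file-level steps, run against a mock directory; on the real box the targets would be queue.dat and the work\ folder inside each client's launch directory (e.g. D:\Folding@Home\FoldSMP2):

```shell
#!/bin/sh
# Sketch of a full reset for a stuck FAH SMP client directory.
# Assumes the classic v5.x layout: queue.dat plus a work/ subdirectory.
# A mock directory stands in for the real launch directory here.
DIR=./FoldSMP2_demo
mkdir -p "$DIR/work"
touch "$DIR/queue.dat" "$DIR/work/wudata_02.dat"   # mock client state

# 1) The client and its mpiexec/FahCore processes must already be killed.
# 2) Wipe the queue and work files; the current WU is lost, but the
#    client will download a new one on restart.
rm -rf "$DIR/work"
rm -f "$DIR/queue.dat"

# 3) Confirm nothing stale remains before relaunching fah.exe.
[ ! -e "$DIR/queue.dat" ] && [ ! -d "$DIR/work" ] && echo "reset complete"
```

Note this throws away any partial progress in that client, so it's a last resort after restoring from backups fails.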
 
I stopped running it native on Windows and run CentOS VMs instead. It really does take only about 30 minutes per instance to set up, and I have not had one lockup, or even an issue coming back up after a power failure.

Even at the same clock speed, the dual Windows SMP setup would lock up on me, probably a memory leak due to Windows :p
 
I take it you are running two instances in the Windows environment? I stick with VMs anywhere I want to run more than one SMP instance, just to avoid this...

I am not sure of the recovery method.

Indeed, 2 SMP clients in WinXP.
A VM is now looking like a very good option.
 
Indeed, 2 SMP clients in WinXP.
A VM is now looking like a very good option.

Maybe for just one of them, if this is a regularly used machine. That way the one on the windows side will throttle back when you are using the machine for other purposes...which you shouldn't be doing!
 
***DOH edit*** I see the 8 fah_cores in the task manager. I am not sure what to do here.

Since the one is starting off back at zero again, is killing the core, delete all work and rebooting an option?

Already tried that, I get the same error :(
 
Maybe for just one of them, if this is a regularly used machine. That way the one on the windows side will throttle back when you are using the machine for other purposes...which you shouldn't be doing!

Well, my PC is not a dedicated folding machine :(
I use it for other things too, like CAD and FEA (Finite Element Analysis), and yes, occasionally some gaming too ;)
 
Well, my PC is not a dedicated folding machine :(
I use it for other things too, like CAD and FEA (Finite Element Analysis), and yes, occasionally some gaming too ;)

You forgot posting here! I kid, haha.

You might find stopping the VM instance necessary when running intensive apps. The Windows one should free up the resources on its own.
 
I am at work, so I did not read the other suggestions. Might you be running NOD32 antivirus? It does not play well with SMP folding. It might let a few WUs through, but after 3 or 4 it will start giving you this error. I had to find a different AV for this reason.

 
Yeah, you can easily pause the VM from the VMware Server console, or shut down one of the VM instances. Just do a Ctrl+C on the client and `init 0` to shut down Linux...
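If you go the VM route, both options can also be scripted from the host. This is a dry-run sketch only: the .vmx path and guest hostname are made up, and the `vmrun` subcommand names are from memory (vmrun ships with VMware Workstation/Server), so treat them as assumptions:

```shell
#!/bin/sh
# Dry-run sketch: free the VM's cores for other work.
# The run() wrapper just prints each command instead of executing it.
VMX="/vms/centos-fold/centos-fold.vmx"   # hypothetical path
CMDS=""
run() { CMDS="$CMDS$* ; "; echo "would run: $*"; }

# Option 1: pause the whole VM from the host; the WU freezes in RAM
# and resumes exactly where it left off on unpause.
run vmrun pause "$VMX"

# Option 2: stop the client cleanly inside the guest first (Ctrl+C
# lets it checkpoint), then power the VM off.
run ssh fold@centos-fold "init 0"
```

Swap the echo wrapper for real execution once the paths match your setup.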
 
I am at work, so I did not read the other suggestions. Might you be running NOD32 antivirus? It does not play well with SMP folding. It might let a few WUs through, but after 3 or 4 it will start giving you this error. I had to find a different AV for this reason.


Well yes, I use NOD32. Oh boy, I really liked this AV!! It would be a shame if I have to change it.
 
Since I don't have a quad like everyone else...

All those suggestions sound correct :D:D
 
I'll toss in my vote for VMs as well. I'm running VMs now on about half my boxen, and ironically enough, I had the exact same situation happen to me two days ago on a box. It turned out to be a hardware issue with Crucial Ballistix single-sided PC2-6400 RAM, though (you can read about all kinds of issues with those chips, so if any of you guys are getting RAM for your folding boxen, I'd avoid them). I've got two sets and haven't had issues with the other set so far... but it's kind of like waiting for the other shoe to drop. :rolleyes:

But, back on topic for now. Two VMs will give you a lot more stability with your WUs, because you can shut down one without affecting the other. When running dual instances of SMP on a standard box, you will most likely corrupt the second instance any time you have to stop one or the other. It was a painful lesson learned for me early on. I'm just waiting for the right opportunity to set up my other boxes with VMs, and they will all go that route.

You are on a quad, right? I'm not sure running two SMPs is effective on a dual-core CPU.



 
But, back on topic for now. Two VMs will give you a lot more stability with your WUs, because you can shut down one without affecting the other. When running dual instances of SMP on a standard box, you will most likely corrupt the second instance any time you have to stop one or the other. It was a painful lesson learned for me early on. I'm just waiting for the right opportunity to set up my other boxes with VMs, and they will all go that route.

You are on a quad, right? I'm not sure running two SMPs is effective on a dual-core CPU.

Yes, I have a quad.
What I am planning to do now is run just one VM with an SMP client, and keep the other SMP client in my current Windows install, because my PC is not a dedicated folding machine.
 
Yes, I have a quad.
What I am planning to do now is run just one VM with an SMP client, and keep the other SMP client in my current Windows install, because my PC is not a dedicated folding machine.

This is probably your best bet for overall productivity of the machine. You'll get the advantages of stability and speed of the VM over a straight Windows client and still have the ability to keep a lot of your resources ready to be used for things other than folding.

One thing to remember about VMs is that they normally do not like to give up the resources they are using. This includes RAM and processor cycles.

I used to run dual VMs on my main machine (Linux host OS with dual Linux VMs) and didn't have a problem for a long time, until I updated my nVidia graphics drivers. After that, I could no longer run both VMs while gaming. Quake Wars would get pure shit for framerates and stutter constantly. Once I killed the VMs, this was no longer a problem. However, I have lost between 800-1000 PPD by going to two native Linux SMP clients.

If I get my ass in gear, I'm going to set up a single VM for the better PPD and run a single native Linux client, so I don't have to stop folding at all when I need the CPU for something else. I hate wasting cycles, which is why I don't still run dual VMs and kill one of them when I want to game or something.

 