SMP (MPI) client shut down, won't restart.

D3v01D

My SMP (MPI) client shut down sometime in the night and won't restart.
Here's what I get:

Note: Please read the license agreement (fah622b2.exe -license). Further
use of this software requires that you have read and accepted this agreement.

4 cores detected
If you see this twice, MPI is working
If you see this twice, MPI is working


--- Opening Log file [September 24 09:58:23 UTC]


# Windows SMP Console Edition #################################################
###############################################################################

Folding@Home Client Version 6.22 SMP Beta2

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: E:\Folding\smp
Executable: E:\Folding\smp\fah622b2.exe
Arguments: -smp -verbosity 9

[09:58:23] - Ask before connecting: No
[09:58:23] - User name: D3v01D (Team 33)
[09:58:23] - User ID: <deleted>
[09:58:23] - Machine ID: 1
[09:58:23]
[09:58:24] Loaded queue successfully.
[09:58:24]
[09:58:24] - Autosending finished units... [September 24 09:58:24 UTC]
[09:58:24] + Processing work unit
[09:58:24] Trying to send all finished work units
[09:58:24] Work type a1 not eligible for variable processors
[09:58:24] + No unsent completed units remaining.
[09:58:24] Core required: FahCore_a1.exe
[09:58:24] - Autosend completed
[09:58:24] Core found.
[09:58:24] Using generic mpiexec calls
[09:58:24] Working on queue slot 04 [September 24 09:58:24 UTC]
[09:58:24] + Working ...
[09:58:24] - Calling 'mpiexec -np 4 -channel auto -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 04 -checkpoint 15 -verbose -lifeline 2368 -version 622'

[09:58:24]
[09:58:24] *------------------------------*
[09:58:24] Folding@Home Gromacs SMP Core
[09:58:24] Version 1.74 (March 10, 2007)
[09:58:24]
[09:58:24] Preparing to commence simulation
[09:58:24] - Ensuring status. Please wait.
[09:58:41] - Looking at optimizations...
[09:58:41] - Working with standard loops on this execution.
[09:58:41] Examination of work files indicates 8 consecutive improper terminations of core.
[09:58:41]
[09:58:41] Folding@home Core Shutdown: MISSING_WORK_FILES
[09:58:41] Finalizing output
[10:00:46] CoreStatus = 1 (1)
[10:00:46] Client-core communications error: ERROR 0x1
[10:00:46] This is a sign of more serious problems, shutting down.

Then I get an XP message box saying:
Folding@home has run into a serious error running the core. and will shutdown.

I click "OK" and that's it.

I tried removing all the files from the \work folder but it still wouldn't start.

It's been running fine since this weekend.

Can anyone *COUGH*xilikon*COUGH* help me?


 
I need to see the part of the log before this. MISSING_WORK_FILES often shows up because the client hit an error, tried to delete files, and messed up.
 
The above is what the client tells me when I try to start it.
I found this at the end of the FAHlog-Prev.txt file:

[04:19:48] Completed 55000 out of 250000 steps (22 percent)
[04:34:49] Timered checkpoint triggered.
[04:36:47] Writing local files
[04:36:47] Completed 57500 out of 250000 steps (23 percent)
[04:46:33] Gromacs cannot continue further.
[04:46:33] Going to send back what have done.
[04:46:33] logfile size: 52434
[04:46:33] - Writing 52970 bytes of core data to disk...
[04:46:33] ... Done.
[04:46:33] - Failed to delete work/wudata_04.sas
[04:46:33] - Failed to delete work/wudata_04.goe
[04:46:33] Warning: check for stray files
[04:46:33]
[04:46:33] Folding@home Core Shutdown: EARLY_UNIT_END
[04:46:33]
[04:46:33] Folding@home Core Shutdown: EARLY_UNIT_END
[04:46:36] CoreStatus = 7B (123)
[04:46:36] Client-core communications error: ERROR 0x7b
[04:46:36] This is a sign of more serious problems, shutting down.
[05:40:56] - Autosending finished units... [September 24 05:40:56 UTC]
[05:40:56] Trying to send all finished work units
[05:40:56] + No unsent completed units remaining.
[05:40:56] - Autosend completed

I also found "- Failed to delete work/..." lines for wudata_01 through wudata_03 in the logs of the previous units above this.

The only wudata file in my \smp\work folder is wudata_04.dyn
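
For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that pulls the "Failed to delete", "Core Shutdown" and "CoreStatus" lines out of a log. The function name `summarize_log` and the patterns are my own, written against the lines quoted in this thread only; other client or core versions may word these lines differently.

```python
import re

# Patterns based only on the log lines quoted in this thread (an assumption
# about the format; other client/core versions may differ).
FAILED_DELETE = re.compile(r"Failed to delete (\S+)")
CORE_SHUTDOWN = re.compile(r"Core Shutdown: (\w+)")
CORE_STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def summarize_log(lines):
    """Collect undeleted files, shutdown reasons, and core status codes."""
    summary = {"failed_deletes": [], "shutdowns": [], "statuses": []}
    for line in lines:
        m = FAILED_DELETE.search(line)
        if m:
            summary["failed_deletes"].append(m.group(1))
            continue
        m = CORE_SHUTDOWN.search(line)
        if m:
            summary["shutdowns"].append(m.group(1))
            continue
        m = CORE_STATUS.search(line)
        if m:
            # The log prints the status as hex followed by decimal in parentheses.
            summary["statuses"].append((int(m.group(1), 16), int(m.group(2))))
    return summary


# Sample lines copied from the FAHlog-Prev.txt excerpt above.
SAMPLE = [
    "[04:46:33] - Failed to delete work/wudata_04.sas",
    "[04:46:33] - Failed to delete work/wudata_04.goe",
    "[04:46:33] Warning: check for stray files",
    "[04:46:33] Folding@home Core Shutdown: EARLY_UNIT_END",
    "[04:46:36] CoreStatus = 7B (123)",
]

if __name__ == "__main__":
    print(summarize_log(SAMPLE))
```

To run it on a real file, pass the open file instead of `SAMPLE`, for example `summarize_log(open("FAHlog-Prev.txt"))`.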

Thanks


 
Started right up and seems to be running strong.

Once again, thanks a Xillion Xilikon!


 
Man, I thought I had applied that client to all my boxes, but I'm going to have to try again on #5... it's acting up on one of the clients! :(

 
Woke up this morning to find more "Failed to delete" crap. It was still running, though...
Restarted the PC and removed the mild overclock on my CPU (the Q6600 was at 2.6, now stock). Deleted \work and the .dat file and restarted the client.
Will report back later today.

:(
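
A gentler variant of the cleanup described above is to move the files aside instead of deleting them, so they can still be inspected later. The following is only a sketch under assumptions: `backup_client_state` is a hypothetical helper name, it treats every `*.dat` file in the client folder as "the .dat file", and it moves the files rather than deleting them as was done in the post. Stop the client before running anything like this.

```python
import shutil
import time
from pathlib import Path


def backup_client_state(client_dir, backup_root=None):
    """Move the work folder and any .dat files into a timestamped backup folder.

    Hypothetical helper: it moves files instead of deleting them, and it
    assumes the client is not running. Returns the backup folder path.
    """
    client_dir = Path(client_dir)
    backup_root = Path(backup_root) if backup_root else client_dir
    backup = backup_root / time.strftime("backup-%Y%m%d-%H%M%S")
    backup.mkdir(parents=True)

    # Move the whole work folder, if there is one.
    work = client_dir / "work"
    if work.is_dir():
        shutil.move(str(work), str(backup / "work"))

    # Move every top-level .dat file; list first so we do not move files
    # while still iterating over the directory.
    for dat in sorted(client_dir.glob("*.dat")):
        shutil.move(str(dat), str(backup / dat.name))
    return backup
```

For the setup in this thread the call would be something like `backup_client_state(r"E:\Folding\smp")`, using the launch directory shown in the first log.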


 
Can you give me the project number (including run, clone and gen) so I can see whether you were hit by a bad unit?


 
All these failed to delete:

Project: 2665 (Run 2, Clone 480, Gen 51)

Project: 2665 (Run 2, Clone 109, Gen 51)

Project: 2665 (Run 3, Clone 953, Gen 51)

Project: 2665 (Run 3, Clone 387, Gen 37) - This one died at 23 percent:

[04:36:47] Completed 57500 out of 250000 steps (23 percent)
[04:46:33] Gromacs cannot continue further.
[04:46:33] Going to send back what have done.
[04:46:33] logfile size: 52434
[04:46:33] - Writing 52970 bytes of core data to disk...
[04:46:33] ... Done.
[04:46:33] - Failed to delete work/wudata_04.sas
[04:46:33] - Failed to delete work/wudata_04.goe
[04:46:33] Warning: check for stray files
[04:46:33]
[04:46:33] Folding@home Core Shutdown: EARLY_UNIT_END
[04:46:33]
[04:46:33] Folding@home Core Shutdown: EARLY_UNIT_END
[04:46:36] CoreStatus = 7B (123)
[04:46:36] Client-core communications error: ERROR 0x7b
[04:46:36] This is a sign of more serious problems, shutting down.
[05:40:56] - Autosending finished units... [September 24 05:40:56 UTC]
[05:40:56] Trying to send all finished work units
[05:40:56] + No unsent completed units remaining.
[05:40:56] - Autosend completed

That's the end of my FAHlog_Prev.txt

I smell a rat :mad:

Currently running Project: 2653 (Run 23, Clone 21, Gen 84); it has completed 52 percent with no errors.
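
The project, run, clone and gen numbers listed above can also be collected from a log automatically. Here is a small Python sketch, assuming the log prints them in the same `Project: N (Run N, Clone N, Gen N)` form used in this post; `find_units` is just an illustrative name.

```python
import re

# Assumed line format, taken from the project lines quoted in this post.
PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def find_units(text):
    """Return unique (project, run, clone, gen) tuples in order of first appearance."""
    seen = []
    for m in PRCG.finditer(text):
        unit = tuple(int(g) for g in m.groups())
        if unit not in seen:
            seen.append(unit)
    return seen


if __name__ == "__main__":
    sample = (
        "Project: 2665 (Run 2, Clone 480, Gen 51)\n"
        "Project: 2665 (Run 3, Clone 387, Gen 37)\n"
        "Project: 2665 (Run 2, Clone 480, Gen 51)\n"
    )
    for unit in find_units(sample):
        print(unit)
```

Repeated lines for the same unit are reported once, which keeps the list short when a unit has been restarted several times.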


 
Good morning guys...

2653 (Run 23, Clone 21, Gen 84) also failed to delete... By the fourth failure I expect my client to crash and burn again. I'm just going to let it run and see.
Now running 2665 (Run 3, Clone 768, Gen 52).

I'm going to post my entire FAHlog.txt here to see if anyone can see anything here:
...
OK, I can't post the whole thing; it's too long (almost 49000 characters).
Can I email my logs to someone for forensic examination?

Any help will be greatly appreciated.


 