[Wien] Problems in parallel jobs

bothina hamad both_hamad at yahoo.com
Wed Jul 21 10:53:44 CEST 2010


Dear Wien users,

When running optimisation jobs under the Torque queuing system for anything but
very small systems, we see the following:

The job runs for many cycles through lapw0, lapw1, and lapw2 (parallel) successfully, but eventually the 'mom-superior' node (the one that launches mpirun) stops communicating with the other nodes involved in the job.

At the console of this node the load is correct (4 for a quad-core processor) and there is free memory, but the node can no longer access any NFS mounts or ping the other nodes in the cluster. We are eventually forced to reboot the node and kill the job from the queuing system: the job enters the 'E' state and stays there, so we need to stop pbs_server, manually remove the job files from /var/spool/torque/server_priv/jobs, and then restart pbs_server.
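
For reference, the manual cleanup is roughly the following (a sketch; <jobid> is a placeholder for the stuck job's Torque id, and the service commands assume the stock CentOS init scripts):

    # stop the server, remove the stuck job's files, restart
    service pbs_server stop
    rm /var/spool/torque/server_priv/jobs/<jobid>.*
    service pbs_server start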

A similar problem is encountered on a larger cluster (same install procedure), with the added problem that the .dayfile reports that for lapw2 only the 'mom-superior' node is doing work (even though top on the other job nodes reports the correct load and 100% CPU use).
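
In case the parallel setup matters: the .machines files for these runs follow the usual Wien2k MPI-parallel pattern, roughly as sketched below (node01/node02 are placeholder hostnames for two quad-core nodes; the real file is generated per job from the nodes Torque assigns):

    lapw0: node01:4 node02:4
    1: node01:4
    1: node02:4
    granularity:1
    extrafine:1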

DOS calculation seems to work properly on both clusters...

We have used the modified x_lapw that you provided earlier, and we have been
inserting 'ulimit -s unlimited' into our job scripts.
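
A stripped-down version of such a job script (node counts, job name, and the .machines generation step are illustrative, not our exact script):

    #!/bin/bash
    #PBS -N wien2k-opt
    #PBS -l nodes=2:ppn=4
    #PBS -j oe

    # avoid stack-size crashes in the ifort-compiled binaries
    ulimit -s unlimited

    cd $PBS_O_WORKDIR
    # ... build .machines from $PBS_NODEFILE here ...

    run_lapw -p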

We are using...
CentOS 5.3 x86_64
Intel compiler suite with MKL v11.1/072
openmpi-1.4.2, compiled with the Intel compilers
fftw-2.1.5, compiled with the Intel compilers and the OpenMPI above (build sketched below)
Wien2k v10.1
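
The MPI stack was built along these lines (a sketch; prefixes and the exact flag set are illustrative rather than a verbatim record):

    # OpenMPI 1.4.2 with the Intel compilers
    ./configure CC=icc CXX=icpc F77=ifort FC=ifort --prefix=/opt/openmpi-1.4.2
    make all install

    # FFTW 2.1.5 with its MPI library, using the OpenMPI just installed
    export PATH=/opt/openmpi-1.4.2/bin:$PATH
    ./configure CC=icc F77=ifort --enable-mpi --prefix=/opt/fftw-2.1.5
    make && make install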

Optimisation jobs for small systems complete OK on both clusters.

The working directories for this job are large (>2GB).

Please let us know which files from these directories might be helpful for diagnosis...

Best regards
Bothina

