[molpro-user] Non-reproducible stuck state when running Molpro on NFS drive
Gregory Magoon
gmagoon at MIT.EDU
Fri Jul 15 19:17:42 BST 2011
Hi,
I have successfully compiled Molpro (with Global Arrays/TCGMSG; mpich2 from the
Ubuntu package) on one of the compute nodes of our new server and installed it
in an NFS directory on our head node. The initial tests on the compute node ran
fine, but since the installation I have had trouble running Molpro on the
compute nodes (it seems to work fine on the head node). Sometimes (sorry I
can't be more precise, but it does not seem to be reproducible) a job running
on a compute node gets stuck in its early stages, generating heavy NFS traffic
(roughly 14+ Mbps outbound to the head node and ~7 Mbps inbound from it) and
causing fairly high CPU usage by the nfsd processes on the head node. The
Molpro-related processes in the stuck state, as shown by the "top" command, are
listed at the bottom of this e-mail. I have also attached example verbose
output for a case that works and for a case that gets stuck.
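In case the numbers above are useful, the traffic and nfsd load were observed
with standard tools along the following lines (the interface name eth0 is
illustrative, and any interface monitor would do in place of iftop):

  # On the head node: NFS server call counts and nfsd CPU usage
  nfsstat -s
  top
  # On the head node: live per-connection traffic (if iftop is installed;
  # eth0 is a placeholder for the NFS-facing interface)
  iftop -i eth0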
Some notes:
- /usr/local is mounted as a read-only NFS file system; /home is mounted as a
  read-write NFS file system.
- Runs with fewer processors (e.g. 6) seem more likely to complete
  successfully; see the example invocation below.
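For example, a 6-process run of the same input would be launched as follows
(this uses the same -d and -m settings that appear in the attached verbose
output; -n sets the number of parallel processes):

  molpro -n 6 -v -d /tmp/user -m 250M test3.inp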
I've tried several approaches to address the issue, including (1) mounting
/usr/local as a read-write file system and (2) changing the rsize and wsize
parameters of the NFS mount, but neither seemed to help. We also tried
redirecting stdin from /dev/null when invoking the process, which seemed to
help at first, but later tests suggested that it made no difference. These
attempts are sketched below.
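For concreteness, the attempted changes looked roughly like the following, run
as root on a compute node ('headnode' stands in for our actual server name,
and the rsize/wsize values shown are just one of the settings tried):

  # 1. Remount /usr/local read-write (the export itself must also be
  #    rw in /etc/exports on the head node)
  umount /usr/local
  mount -o rw headnode:/usr/local /usr/local
  # 2. Change the NFS transfer sizes
  umount /usr/local
  mount -o ro,rsize=32768,wsize=32768 headnode:/usr/local /usr/local
  # 3. Redirect stdin from /dev/null when launching the job
  molpro -v -d /tmp/user -m 250M test3.inp < /dev/null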
If anyone has tips or ideas for diagnosing the issue, it would be greatly
appreciated. I'd also be happy to provide any additional details that would
help describe the problem.
Thanks very much,
Greg
Busiest processes in the "top" output in the stuck state:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10 root 20 0 0 0 0 S 10 0.0 0:16.50 kworker/0:1
2 root 20 0 0 0 0 S 6 0.0 0:10.86 kthreadd
1496 root 20 0 0 0 0 S 1 0.0 0:04.73 kworker/0:2
3 root 20 0 0 0 0 S 1 0.0 0:00.93 ksoftirqd/0
Processes in the "top" output for the user in the stuck state (note that
nearly all hydra_pmi_proxy processes are in state D, uninterruptible sleep,
which typically means they are blocked on I/O):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29961 user 20 0 19452 1508 1072 R 0 0.0 0:00.05 top
1176 user 20 0 91708 1824 868 S 0 0.0 0:00.01 sshd
1177 user 20 0 24980 7620 1660 S 0 0.0 0:00.41 bash
1289 user 20 0 91708 1824 868 S 0 0.0 0:00.00 sshd
1290 user 20 0 24980 7600 1640 S 0 0.0 0:00.32 bash
1386 user 20 0 4220 664 524 S 0 0.0 0:00.01 molpro
1481 user 20 0 18764 1196 900 S 0 0.0 0:00.00 mpiexec
1482 user 20 0 18828 1092 820 S 0 0.0 0:00.00 hydra_pmi_proxy
1483 user 20 0 18860 488 212 D 0 0.0 0:00.00 hydra_pmi_proxy
1484 user 20 0 18860 488 212 D 0 0.0 0:00.00 hydra_pmi_proxy
1485 user 20 0 18860 488 212 D 0 0.0 0:00.00 hydra_pmi_proxy
1486 user 20 0 18860 488 212 D 0 0.0 0:00.00 hydra_pmi_proxy
1487 user 20 0 18860 488 212 D 0 0.0 0:00.00 hydra_pmi_proxy
1488 user 20 0 18860 488 212 D 0 0.0 0:00.00 hydra_pmi_proxy
1489 user 20 0 18860 488 212 D 0 0.0 0:00.00 hydra_pmi_proxy
1490 user 20 0 18860 488 208 D 0 0.0 0:00.00 hydra_pmi_proxy
1491 user 20 0 18860 488 208 D 0 0.0 0:00.00 hydra_pmi_proxy
1492 user 20 0 18860 488 208 D 0 0.0 0:00.00 hydra_pmi_proxy
1493 user 20 0 18860 488 208 D 0 0.0 0:00.00 hydra_pmi_proxy
1494 user 20 0 18860 492 212 D 0 0.0 0:00.00 hydra_pmi_proxy
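One further data point that may help: the kernel wait channel of the D-state
proxies can be read with ps (standard procps options), which should show where
they are blocked:

  ps -o pid,stat,wchan:32,cmd -C hydra_pmi_proxy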
-------------- next part --------------
Verbose output for the case that gets stuck (node02); nothing appears after
the mpiexec launch:
# PARALLEL mode
nodelist=12
first =12
second =
third =
HOSTFILE_FORMAT: $hostname
node02
node02
node02
node02
node02
node02
node02
node02
node02
node02
node02
node02
export LD_LIBRARY_PATH='/opt/acml4.4.0/gfortran64_int64/lib:'
export AIXTHREAD_SCOPE='s'
export INSTLIB='/usr/local/molpro2010.1/lib/molprop_2010_1_Linux_x86_64_i8'
export MP_NODES='0'
export MP_PROCS='12'
MP_TASKS_PER_NODE=''
export MOLPRO_NOARG='1'
export MOLPRO_OPTIONS=' -v -d /tmp/user -m 250M test3.inp'
export MOLPRO_OPTIONS_FILE='/tmp/molpro_options.1109'
MPI_MAX_CLUSTER_SIZE=''
export PROCGRP='/tmp/procgrp.1109'
export RT_GRQ='ON'
TCGRSH=''
TMPDIR=''
export XLSMPOPTS='parthds=1'
/usr/bin/mpiexec -machinefile /tmp/procgrp.1109 -np 12 /usr/local/molpro2010.1/bin/molprop_2010_1_Linux_x86_64_i8.exe -v -d /tmp/user -m 250M test3.inp
-------------- next part --------------
Verbose output for the case that works (node01):
# PARALLEL mode
nodelist=12
first =12
second =
third =
HOSTFILE_FORMAT: $hostname
node01
node01
node01
node01
node01
node01
node01
node01
node01
node01
node01
node01
export LD_LIBRARY_PATH='/opt/acml4.4.0/gfortran64_int64/lib:'
export AIXTHREAD_SCOPE='s'
export INSTLIB='/usr/local/molpro2010.1/lib/molprop_2010_1_Linux_x86_64_i8'
export MP_NODES='0'
export MP_PROCS='12'
MP_TASKS_PER_NODE=''
export MOLPRO_NOARG='1'
export MOLPRO_OPTIONS=' -v -d /tmp/user -m 250M test3.inp'
export MOLPRO_OPTIONS_FILE='/tmp/molpro_options.1229'
MPI_MAX_CLUSTER_SIZE=''
export PROCGRP='/tmp/procgrp.1229'
export RT_GRQ='ON'
TCGRSH=''
TMPDIR=''
export XLSMPOPTS='parthds=1'
/usr/bin/mpiexec -machinefile /tmp/procgrp.1229 -np 12 /usr/local/molpro2010.1/bin/molprop_2010_1_Linux_x86_64_i8.exe -v -d /tmp/user -m 250M test3.inp
token read from /usr/local/molpro2010.1/lib/molprop_2010_1_Linux_x86_64_i8//.token
token read from /home/user/.molpro/token
input from /home/user/molprowd/test3.inp
output to /home/user/molprowd/test3.out
XML stream to /home/user/molprowd/test3.xml
Creating directory /tmp/user