[molpro-user] running parallel GA Molpro from under Torque?
Grigory Shamov
Grigory.Shamov at umanitoba.ca
Tue Nov 26 20:51:05 GMT 2013
Hi All,
I am trying to install Molpro 12.1 for one of our users. We have an
InfiniBand Linux cluster with OpenMPI 1.6 and Torque; the latter has
CPUsets enabled. I wanted to build Molpro from source against our OpenMPI,
so that it would hopefully be Torque-aware, run within the CPUsets
allocated to it, etc.
I chose the GA version, as it seems to depend less on a shared
filesystem. I used the Intel 12.1 compilers and MKL. The configure line was:
./configure -blas -lapack -mpp \
    -mppbase /global/software/openmpi-1.6.1-intel1/include \
    -icc -ifort -x86_64 -i8 -nocuda -noopenmp -slater \
    -auto-ga-openmpi -noboost
It all went smoothly, and make test on a single SMP node ran
successfully. Molpro batch jobs also run fine within a single node
(nodes=1:ppn=N); however, when I try to run across nodes, the job fails
with ARMCI errors. Here is my script:
#!/bin/bash
#PBS -l pmem=1gb,nodes=1:ppn=4
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Starting run at: `date`"
echo "TMPDIR is $TMPDIR"
export TMPDIR4=$TMPDIR
#Add molpro directory to the path
export PATH=$PATH:$HOME/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin
#Run molpro in parallel
export ARMCI_DEFAULT_SHMMAX=1800
molpro -v -S ga auh_ecp_lib.com
# all done.
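For debugging runs that span several nodes, the script above can also log how Torque distributed the processes. This is only a minimal sketch, assuming the usual Torque behaviour of listing one hostname per allocated core in $PBS_NODEFILE; the function name summarize_nodefile is made up for illustration and is not part of Molpro or Torque:

```shell
#!/bin/bash
# Print "hostname count" for each distinct host in a Torque node file.
# Assumption: the file contains one hostname per allocated core.
summarize_nodefile() {
    local nodefile="$1"
    if [ ! -r "$nodefile" ]; then
        echo "cannot read node file: $nodefile" >&2
        return 1
    fi
    # uniq -c prefixes each distinct host with its count; awk swaps the columns.
    sort "$nodefile" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job, PBS_NODEFILE is set by Torque; outside a job nothing is printed.
if [ -n "${PBS_NODEFILE:-}" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

Placed before the molpro call, this would show whether a "procs=8" request was spread over several hosts, which can then be compared with the hostnames in the ARMCI messages.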
If I change "nodes=1:ppn=4" to, say, "procs=8", I get the following output,
ending in the failure:
export AIXTHREAD_SCOPE='s'
export MOLPRO_PREFIX='/home/abrown/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8'
export MP_NODES='0'
export MP_PROCS='8'
MP_TASKS_PER_NODE=''
export MOLPRO_NOARG='1'
export MOLPRO_OPTIONS=' -v -S ga auh_ecp_lib.com'
export MOLPRO_OPTIONS_FILE='/scratch/6942419.yak.local/molpro_options.12505'
MPI_MAX_CLUSTER_SIZE=''
MV2_ENABLE_AFFINITY=''
export RT_GRQ='ON'
TCGRSH=''
export TMPDIR='/scratch/6942419.yak.local'
export XLSMPOPTS='parthds=1'
/home/user/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/src/openmpi-install/bin/mpirun --mca mpi_warn_on_fork 0 -machinefile /scratch/6942419.yak.local/procgrp.12505 -np 8 /home/user/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin/molpro.exe -v -S ga auh_ecp_lib.com
-10005:Segmentation Violation error, status=: 11
(rank:-10005 hostname:n211 pid:29570):ARMCI DASSERT fail.
src/common/signaltrap.c:SigSegvHandler():310 cond:0
5:Child process terminated prematurely, status=: 11
(rank:5 hostname:n211 pid:29560):ARMCI DASSERT fail.
src/common/signaltrap.c:SigChldHandler():178 cond:0
-10004:Segmentation Violation error, status=: 11
(rank:-10004 hostname:n227 pid:13769):ARMCI DASSERT fail.
src/common/signaltrap.c:SigSegvHandler():310 cond:0
-10003:Segmentation Violation error, status=: 11
(rank:-10003 hostname:n232 pid:15721):ARMCI DASSERT fail.
src/common/signaltrap.c:SigSegvHandler():310 cond:0
3:Child process terminated prematurely, status=: 11
(rank:3 hostname:n232 pid:15717):ARMCI DASSERT fail.
src/common/signaltrap.c:SigChldHandler():178 cond:0
4:Child process terminated prematurely, status=: 11
(rank:4 hostname:n227 pid:13765):ARMCI DASSERT fail.
src/common/signaltrap.c:SigChldHandler():178 cond:0
-10000:Segmentation Violation error, status=: 11
(rank:-10000 hostname:n242 pid:12666):ARMCI DASSERT fail.
src/common/signaltrap.c:SigSegvHandler():310 cond:0
0:Child process terminated prematurely, status=: 11
(rank:0 hostname:n242 pid:12652):ARMCI DASSERT fail.
src/common/signaltrap.c:SigChldHandler():178 cond:0
Standard error also contains some ARMCI messages:
ARMCI master: wait for child process (server) failed:: No child processes
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 11.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
ARMCI master: wait for child process (server) failed:: No child processes
ARMCI master: wait for child process (server) failed:: No child processes
ARMCI master: wait for child process (server) failed:: No child processes
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine  Line     Source
molpro.exe         000000000529533A  Unknown  Unknown  Unknown
molpro.exe         0000000005293E36  Unknown  Unknown  Unknown
molpro.exe         0000000005240270  Unknown  Unknown  Unknown
molpro.exe         00000000051D4F2E  Unknown  Unknown  Unknown
molpro.exe         00000000051DD5D3  Unknown  Unknown  Unknown
molpro.exe         0000000004F6C9ED  Unknown  Unknown  Unknown
molpro.exe         0000000004F4C367  Unknown  Unknown  Unknown
molpro.exe         0000000004F6C8AB  Unknown  Unknown  Unknown
libc.so.6          0000003B40632920  Unknown  Unknown  Unknown
libpthread.so.0    0000003B40A0C170  Unknown  Unknown  Unknown
libmlx4-m-rdmav2.  00002B0147A155FE  Unknown  Unknown  Unknown
Could you please suggest what I am doing wrong, and how to run the GA
version of Molpro correctly in parallel across nodes? Any suggestions
would be much appreciated. Thanks!
--
Grigory Shamov
HPC Analyst, Westgrid/Compute Canada
E2-588 EITC Building, University of Manitoba
(204) 474-9625