[molpro-user] running parallel GA Molpro from under Torque?
Grigory Shamov
Grigory.Shamov at umanitoba.ca
Wed Nov 27 18:15:59 GMT 2013
Dear Jeff,
Thanks for the advice! We do not have MVAPICH2, but we do have Intel MPI,
which is derived from it.
The problem with both of them is that they do not run their processes
within the CPUsets allocated by Torque if you start the MPI processes
with their native launcher (Hydra). The resulting chaotic pinning to CPU
cores leads to very bad performance. Allocating whole nodes instead results
in huge queuing times and unhappy users.
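To illustrate, here is roughly the check I run on a node of a running job
(only a sketch; the /dev/cpuset/torque/$PBS_JOBID path is how our Torque
mounts cpusets, so that part is site-specific):

# Compare the cpuset Torque created for the job with the actual
# CPU affinity of each MPI rank on this node (molpro.exe is just
# the example process name here).
cat /dev/cpuset/torque/$PBS_JOBID/cpus
for pid in $(pgrep -u $USER molpro.exe); do
    grep Cpus_allowed_list /proc/$pid/status
done

When Hydra does its own binding, the ranks' Cpus_allowed_list ends up
outside the job's cpuset.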
The only way I know of to use Intel MPI properly is OSC mpiexec; it does
not allow oversubscription (i.e., no extra data-server processes), but it
uses the Torque TM API to launch processes. Is there an (easy) way to use
OSC mpiexec with the "molpro" script? Has anyone tried this?
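What I have in mind is something like the sketch below: skip the wrapper's
own mpirun call and hand molpro.exe to OSC mpiexec directly. This is only a
sketch; it assumes a Molpro build linked against the MPI that this mpiexec
was built for, the install path is just our local one, and whether
molpro.exe starts correctly without the setup the "molpro" script does
first is exactly what I am not sure about.

#!/bin/bash
#PBS -l pmem=1gb,nodes=2:ppn=4
cd $PBS_O_WORKDIR

# Sketch only: OSC mpiexec launches processes through the Torque TM API,
# by default one per allocated slot, so no machinefile is needed.
MOLPRO_BIN=$HOME/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin
export ARMCI_DEFAULT_SHMMAX=1800
mpiexec $MOLPRO_BIN/molpro.exe -v -S ga auh_ecp_lib.com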
--
Grigory Shamov
HPC Analyst, Westgrid/Compute Canada
E2-588 EITC Building, University of Manitoba
(204) 474-9625
On 13-11-26 5:24 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>The important part of your output is "Segmentation Violation error",
>which often corresponds with out-of-memory situations, but not
>necessarily in the case of ARMCI over Infiniband.
>
>Despite having spent more than half of my scientific life using ARMCI
>inside of quantum chemistry codes, I cannot offer any good advice for
>resolving this type of issue with ARMCI. I know some tricks that work
>in NWChem, but probably don't help with Molpro.
>
>I suggest you use MVAPICH2 and the non-GA build of Molpro. The reason
>I suggest MVAPICH2 instead of OpenMPI is that one-sided communication
>in OpenMPI is partially broken. The latest MVAPICH2, on the other hand,
>should have very good support for what Molpro needs.
>
>While there are some performance advantages of ARMCI over MPI-2 for
>one-sided communication, my first priority is always successful
>execution. One of my favorite quotes is related to this:
>
> "The best performance improvement is the transition from the
>nonworking state to the working state." - John Ousterhout
>
>Anyways...
>
>Jeff
>
>On Tue, Nov 26, 2013 at 2:51 PM, Grigory Shamov
><Grigory.Shamov at umanitoba.ca> wrote:
>> Hi All,
>>
>> I was trying to install Molpro 12.1 for one of our users. We have an
>> Infiniband Linux cluster with OpenMPI 1.6 and Torque; the latter has
>> CPUsets enabled. I wanted to build Molpro from sources using our OpenMPI,
>> so that hopefully it would be Torque-aware, run within the CPUsets
>> allocated for it, etc.
>>
>> I chose the GA version, as it seems to depend less on a shared
>> filesystem. I used the Intel 12.1 compilers and MKL. The configure line
>> was:
>>
>> ./configure -blas -lapack -mpp -mppbase
>> /global/software/openmpi-1.6.1-intel1/include -icc -ifort -x86_64 -i8
>> -nocuda -noopenmp -slater -auto-ga-openmpi -noboost
>>
>>
>> The build went really smoothly, and "make test" runs successfully on a
>> single SMP node. So do Molpro batch jobs within a single node
>> (nodes=1:ppn=N); however, when I try to run across nodes, it fails
>> with some ARMCI errors. Here is my script:
>>
>>
>> #!/bin/bash
>> #PBS -l pmem=1gb,nodes=1:ppn=4
>>
>> cd $PBS_O_WORKDIR
>> echo "Current working directory is `pwd`"
>> echo "Starting run at: `date`"
>>
>> echo "TMPDIR is $TMPDIR"
>> export TMPDIR4=$TMPDIR
>>
>> #Add molpro directory to the path
>> export PATH=$PATH:$HOME/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin
>>
>> #Run molpro in parallel
>>
>> export ARMCI_DEFAULT_SHMMAX=1800
>> molpro -v -S ga auh_ecp_lib.com
>>
>> # all done.
>>
>>
>>
>> If I change "nodes=1:ppn=4" to, say, "procs=8", I get the following
>> output with the failure:
>>
>>
>> export AIXTHREAD_SCOPE='s'
>> export MOLPRO_PREFIX='/home/abrown/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8'
>> export MP_NODES='0'
>> export MP_PROCS='8'
>> MP_TASKS_PER_NODE=''
>> export MOLPRO_NOARG='1'
>> export MOLPRO_OPTIONS=' -v -S ga auh_ecp_lib.com'
>> export MOLPRO_OPTIONS_FILE='/scratch/6942419.yak.local/molpro_options.12505'
>> MPI_MAX_CLUSTER_SIZE=''
>> MV2_ENABLE_AFFINITY=''
>> export RT_GRQ='ON'
>> TCGRSH=''
>> export TMPDIR='/scratch/6942419.yak.local'
>> export XLSMPOPTS='parthds=1'
>>
>> /home/user/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/src/openmpi-install/bin/mpirun
>>   --mca mpi_warn_on_fork 0 -machinefile /scratch/6942419.yak.local/procgrp.12505 -np 8
>>   /home/user/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin/molpro.exe -v -S ga auh_ecp_lib.com
>> -10005:Segmentation Violation error, status=: 11
>> (rank:-10005 hostname:n211 pid:29570):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>> 5:Child process terminated prematurely, status=: 11
>> (rank:5 hostname:n211 pid:29560):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigChldHandler():178 cond:0
>> -10004:Segmentation Violation error, status=: 11
>> (rank:-10004 hostname:n227 pid:13769):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>> -10003:Segmentation Violation error, status=: 11
>> (rank:-10003 hostname:n232 pid:15721):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>> 3:Child process terminated prematurely, status=: 11
>> (rank:3 hostname:n232 pid:15717):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigChldHandler():178 cond:0
>> 4:Child process terminated prematurely, status=: 11
>> (rank:4 hostname:n227 pid:13765):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigChldHandler():178 cond:0
>> -10000:Segmentation Violation error, status=: 11
>> (rank:-10000 hostname:n242 pid:12666):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>> 0:Child process terminated prematurely, status=: 11
>> (rank:0 hostname:n242 pid:12652):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigChldHandler():178 cond:0
>>
>>
>> The stderr also has some ARMCI messages:
>>
>> ARMCI master: wait for child process (server) failed:: No child processes
>>
>>--------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 3 in communicator MPI COMMUNICATOR 4 DUP
>> FROM 0
>> with errorcode 11.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>>
>>--------------------------------------------------------------------------
>> ARMCI master: wait for child process (server) failed:: No child processes
>> ARMCI master: wait for child process (server) failed:: No child processes
>> ARMCI master: wait for child process (server) failed:: No child processes
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line        Source
>> molpro.exe         000000000529533A  Unknown            Unknown     Unknown
>> molpro.exe         0000000005293E36  Unknown            Unknown     Unknown
>> molpro.exe         0000000005240270  Unknown            Unknown     Unknown
>> molpro.exe         00000000051D4F2E  Unknown            Unknown     Unknown
>> molpro.exe         00000000051DD5D3  Unknown            Unknown     Unknown
>> molpro.exe         0000000004F6C9ED  Unknown            Unknown     Unknown
>> molpro.exe         0000000004F4C367  Unknown            Unknown     Unknown
>> molpro.exe         0000000004F6C8AB  Unknown            Unknown     Unknown
>> libc.so.6          0000003B40632920  Unknown            Unknown     Unknown
>> libpthread.so.0    0000003B40A0C170  Unknown            Unknown     Unknown
>> libmlx4-m-rdmav2.  00002B0147A155FE  Unknown            Unknown     Unknown
>>
>>
>>
>>
>> Could you please suggest what I am doing wrong, and how to run the GA
>> version of Molpro correctly in parallel across nodes? Any suggestions
>> would be really appreciated! Thanks!
>>
>> --
>> Grigory Shamov
>>
>> HPC Analyst, Westgrid/Compute Canada
>> E2-588 EITC Building, University of Manitoba
>> (204) 474-9625
>>
>>
>>
>>
>>
>
>
>
>--
>Jeff Hammond
>jeff.science at gmail.com