[molpro-user] MRCI problem when running on multiple nodes
aristotle Papakondylis
papakondylis at chem.uoa.gr
Thu Dec 3 13:13:13 GMT 2009
Dear all,
I am trying to run an MRCI calculation with Molpro 2009.1 on two nodes (4 Itanium processors each) of my system, but Molpro crashes with the error message attached below. However, if I run the same calculation on a single node using, for example, 4 processors, the job finishes without any problems. Molpro was built with ga-4-2 and tcgmsg, and I use Infiniband. Any suggestions would be appreciated.
Thanks
A. Papakondylis
Laboratory of Physical Chemistry
Department of Chemistry
University of Athens
The output:
......................................................................................................................................
Number of blocks in overlap matrix: 20 Smallest eigenvalue: 0.30D-06
Number of N-2 electron functions: 210
Number of N-1 electron functions: 139698
Number of internal configurations: 20627
Number of singly external configurations: 6446250
Number of doubly external configurations: 894852
Total number of contracted configurations: 7361729
Total number of uncontracted configurations: 636698630
Diagonal Coupling coefficients finished. Storage: 9109747 words, CPU-Time: 5.02 seconds.
Energy denominators for pairs finished in 1 passes. Storage: 893537 words, CPU-time: 0.09 seconds.
ITER. STATE ROOT   SQ.NORM    CORR.ENERGY   TOTAL ENERGY    ENERGY CHANGE   DEN1         VAR(S)    VAR(P)    TIME
  1     1     1   1.00000000   0.00000000  -1384.06452996   0.00000000     -0.14494506  0.17D-01  0.37D-01  58.08
  1     2     2   1.00000000   0.00000000  -1384.03046431   0.00000000     -0.18878279  0.16D-01  0.56D-01  58.08
GLOBAL ERROR fehler on processor 5
GLOBAL ERROR fehler on processor 4
GLOBAL ERROR fehler on processor 7
Last System Error Message from Task 7:: Invalid argument
Last System Error Message from Task 5:: Invalid argument
7:7:fehler:: 1010707757
(rank:7 hostname:nodeib_08 pid:32674):ARMCI DASSERT fail.
armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 4:: Invalid argument
5:5:fehler:: 1010707757
(rank:5 hostname:nodeib_08 pid:32622):ARMCI DASSERT fail.
armci.c:ARMCI_Error():260 cond:0
5: ARMCI aborting 0 (0).
system error message: Invalid argument
5: ARMCI aborting 0 (0).
7: ARMCI aborting 0 (0).
7: ARMCI aborting 0 (0).
system error message: Invalid argument
GLOBAL ERROR fehler on processor
6
Last System Error Message from Task 6:: Invalid argument
6:6:fehler:: 1010707757
(rank:6 hostname:nodeib_08 pid:32648):ARMCI DASSERT fail.
armci.c:ARMCI_Error():260 cond:0
6: ARMCI aborting 0 (0).
6: ARMCI aborting 0 (0).
system error message: Invalid argument
8: interrupt(1)
2:SigIntHandler: interrupt signal was caught: 2
1:SigIntHandler: interrupt signal was caught: 2
3:SigIntHandler: interrupt signal was caught: 2
Last System Error Message from Task 2:: Numerical result out of range
Last System Error Message from Task 1:: Numerical result out of range
Last System Error Message from Task 3:: Numerical result out of range
2:SigIntHandler: abort signal was caught: cleaning up: 2
2: ARMCI aborting 0 (0).
agonal Coupling coefficients finished. Storage: 9109747
words, CPU-Time: 5.02 seconds.
Energy denominators for pairs Diagonal Coupling coefficients
finished. Storage: 9109747 words, CPU-Time: 5.02 seconds.
Energy denominators for pairs
system error message: Illegal seek
1:SigIntHandler: abort signal was caught: cleaning up: 2
1: ARMCI aborting 0 (0).
1: ARMCI aborting 0 (0).
system error message: Illegal seek
0:SigIntHandler: interrupt signal was caught: 2
(rank:0 hostname:nodeib_07 pid:4468):ARMCI DASSERT fail.
signaltrap.c:SigIntHandler():69 cond:0
3:SigIntHandler: abort signal was caught: cleaning up: 2
3: ARMCI aborting 0 (0).
3: ARMCI aborting 0 (0).
system error message: Illegal seek
Last System Error Message from Task 0:: Inappropriate ioctl for device
4:4:fehler:: 1010707757
(rank:4 hostname:nodeib_08 pid:32596):ARMCI DASSERT fail.
armci.c:ARMCI_Error():260 cond:0
4:SigIntHandler: interrupt signal was caught: 2
4:SigIntHandler: abort signal was caught: cleaning up: 2
4: ARMCI aborting 0 (0).
4: ARMCI aborting 0 (0).
system error message: Transport endpoint is not connected
WaitAll: Child (4471) finished, status=0x100 (exited with code 1).
WaitAll: Child (4470) finished, status=0x100 (exited with code 1).
WaitAll: Child (4469) finished, status=0x100 (exited with code 1).
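Because the messages from the eight processes are interleaved, it can help to group them by rank and host before looking for a cause. The following is only a minimal sketch, not part of Molpro or ARMCI: the function name `summarize` and the two regular expressions are hypothetical helpers written for the line formats seen in the output above, and the sample text is a subset of those lines.

```python
import re
from collections import defaultdict

# Hypothetical patterns, written only for the line formats in the log above.
# Example: "(rank:7 hostname:nodeib_08 pid:32674):ARMCI DASSERT fail."
RANK_HOST = re.compile(r"\(rank:(\d+) hostname:(\S+) pid:\d+\)")
# Example: "GLOBAL ERROR fehler on processor 5" (the number may sit on the next line)
FEHLER = re.compile(r"GLOBAL ERROR fehler on processor\s+(\d+)")


def summarize(log_text):
    """Return (host -> sorted ranks seen, sorted ranks that reported 'fehler')."""
    hosts = defaultdict(set)
    for rank, host in RANK_HOST.findall(log_text):
        hosts[host].add(int(rank))
    failed = sorted({int(r) for r in FEHLER.findall(log_text)})
    return {h: sorted(r) for h, r in hosts.items()}, failed


# A subset of the lines from the output above.
sample = """\
GLOBAL ERROR fehler on processor 5
GLOBAL ERROR fehler on processor 4
GLOBAL ERROR fehler on processor 7
(rank:7 hostname:nodeib_08 pid:32674):ARMCI DASSERT fail.
(rank:5 hostname:nodeib_08 pid:32622):ARMCI DASSERT fail.
GLOBAL ERROR fehler on processor
6
(rank:6 hostname:nodeib_08 pid:32648):ARMCI DASSERT fail.
(rank:0 hostname:nodeib_07 pid:4468):ARMCI DASSERT fail.
(rank:4 hostname:nodeib_08 pid:32596):ARMCI DASSERT fail.
"""

hosts, failed = summarize(sample)
print(hosts)   # ranks grouped by host name
print(failed)  # ranks that printed "GLOBAL ERROR fehler"
```

On these lines, the ranks that report `fehler` are 4 to 7, all on nodeib_08, while the only rank identified on nodeib_07 is rank 0, which reports an interrupt signal instead. This only describes where the first errors appear in the log; it does not by itself identify the cause.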