[molpro-user] MRCI problem when running on multiple nodes

Aristotle Papakondylis papakondylis at chem.uoa.gr
Thu Dec 3 13:13:13 GMT 2009


Dear all,

I am trying to run an MRCI calculation with Molpro 2009.1 on two nodes
(4 Itanium processors each) of my system, but Molpro crashes with the
error message attached below. However, if I run the same calculation on
a single node, using for example 4 processors, the job finishes without
any problems. Molpro was built with ga-4-2 and tcgmsg, and I use
InfiniBand. Any suggestions would be appreciated.

Thanks,

A. Papakondylis
Laboratory of Physical Chemistry
Department of Chemistry
University of Athens



The output:

......................................................................................................................................
 Number of blocks in overlap matrix:    20   Smallest eigenvalue:  0.30D-06
 Number of N-2 electron functions:     210
 Number of N-1 electron functions:  139698

 Number of internal configurations:                20627
 Number of singly external configurations:       6446250
 Number of doubly external configurations:        894852
 Total number of contracted configurations:      7361729
 Total number of uncontracted configurations:  636698630

 Diagonal Coupling coefficients finished.               Storage: 9109747 words, CPU-Time:      5.02 seconds.
 Energy denominators for pairs finished in 1 passes.    Storage:  893537 words, CPU-time:      0.09 seconds.

  ITER. STATE  ROOT     SQ.NORM     CORR.ENERGY   TOTAL ENERGY   ENERGY CHANGE       DEN1      VAR(S)    VAR(P)      TIME
    1     1     1     1.00000000     0.00000000 -1384.06452996     0.00000000    -0.14494506  0.17D-01  0.37D-01    58.08
    1     2     2     1.00000000     0.00000000 -1384.03046431     0.00000000    -0.18878279  0.16D-01  0.56D-01    58.08

 GLOBAL ERROR fehler on processor   5

 GLOBAL ERROR fehler on processor   4

 GLOBAL ERROR fehler on processor   7
Last System Error Message from Task 7:: Invalid argument
Last System Error Message from Task 5:: Invalid argument
7:7:fehler:: 1010707757
(rank:7 hostname:nodeib_08 pid:32674):ARMCI DASSERT fail. 
armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 4:: Invalid argument
5:5:fehler:: 1010707757
(rank:5 hostname:nodeib_08 pid:32622):ARMCI DASSERT fail. 
armci.c:ARMCI_Error():260 cond:0
  5: ARMCI aborting 0 (0).
system error message: Invalid argument
  5: ARMCI aborting 0 (0).
  7: ARMCI aborting 0 (0).
  7: ARMCI aborting 0 (0).
system error message: Invalid argument

 GLOBAL ERROR fehler on processor   6
Last System Error Message from Task 6:: Invalid argument
6:6:fehler:: 1010707757
(rank:6 hostname:nodeib_08 pid:32648):ARMCI DASSERT fail. 
armci.c:ARMCI_Error():260 cond:0
  6: ARMCI aborting 0 (0).
  6: ARMCI aborting 0 (0).
system error message: Invalid argument
  8: interrupt(1)
2:SigIntHandler: interrupt signal was caught: 2
1:SigIntHandler: interrupt signal was caught: 2
3:SigIntHandler: interrupt signal was caught: 2
Last System Error Message from Task 2:: Numerical result out of range
Last System Error Message from Task 1:: Numerical result out of range
Last System Error Message from Task 3:: Numerical result out of range
2:SigIntHandler: abort signal was caught: cleaning up: 2
  2: ARMCI aborting 0 (0).
agonal Coupling coefficients finished.               Storage: 9109747 
words, CPU-Time:      5.02 seconds.
 Energy denominators for pairs  Diagonal Coupling coefficients 
finished.               Storage: 9109747 words, CPU-Time:      5.02 seconds.
 Energy denominators for pairs
system error message: Illegal seek
1:SigIntHandler: abort signal was caught: cleaning up: 2
  1: ARMCI aborting 0 (0).
  1: ARMCI aborting 0 (0).
system error message: Illegal seek
0:SigIntHandler: interrupt signal was caught: 2
(rank:0 hostname:nodeib_07 pid:4468):ARMCI DASSERT fail. 
signaltrap.c:SigIntHandler():69 cond:0
3:SigIntHandler: abort signal was caught: cleaning up: 2
  3: ARMCI aborting 0 (0).
  3: ARMCI aborting 0 (0).
system error message: Illegal seek
Last System Error Message from Task 0:: Inappropriate ioctl for device
4:4:fehler:: 1010707757
(rank:4 hostname:nodeib_08 pid:32596):ARMCI DASSERT fail. 
armci.c:ARMCI_Error():260 cond:0
4:SigIntHandler: interrupt signal was caught: 2
4:SigIntHandler: abort signal was caught: cleaning up: 2
  4: ARMCI aborting 0 (0).
  4: ARMCI aborting 0 (0).
system error message: Transport endpoint is not connected
WaitAll: Child (4471) finished, status=0x100 (exited with code 1).
WaitAll: Child (4470) finished, status=0x100 (exited with code 1).
WaitAll: Child (4469) finished, status=0x100 (exited with code 1).
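
Because the messages from the different processes are interleaved, it can help to tabulate which ranks on which hosts reported an ARMCI assertion. The following is only a rough sketch: the regular expression assumes the "(rank:N hostname:H pid:P):ARMCI DASSERT fail." layout seen in the output above, and `summarize_ranks` is a hypothetical helper name, not part of Molpro, Global Arrays, or ARMCI.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the output above, e.g.
# "(rank:7 hostname:nodeib_08 pid:32674):ARMCI DASSERT fail."
RANK_LINE = re.compile(
    r"\(rank:(\d+) hostname:(\S+) pid:(\d+)\):ARMCI DASSERT fail"
)


def summarize_ranks(log_text):
    """Return {hostname: sorted ranks} for ranks that hit an ARMCI DASSERT."""
    by_host = defaultdict(set)
    for match in RANK_LINE.finditer(log_text):
        rank, host, _pid = match.groups()
        by_host[host].add(int(rank))
    return {host: sorted(ranks) for host, ranks in by_host.items()}


if __name__ == "__main__":
    # A few lines copied from the output above, used as sample input.
    sample = (
        "(rank:7 hostname:nodeib_08 pid:32674):ARMCI DASSERT fail. \n"
        "(rank:5 hostname:nodeib_08 pid:32622):ARMCI DASSERT fail. \n"
        "(rank:0 hostname:nodeib_07 pid:4468):ARMCI DASSERT fail. \n"
    )
    for host, ranks in sorted(summarize_ranks(sample).items()):
        print(host, ranks)
```

Applied to the full output above, this would group ranks 4, 5, 6 and 7 under nodeib_08 and rank 0 under nodeib_07. The rank 0 entry comes from the SigIntHandler assertion rather than from a "fehler" error, so the original failures all appear on the second node.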



