[molpro-user] problems with global file system when running in parallel
Jörg Saßmannshausen
j.sassmannshausen at ucl.ac.uk
Mon Feb 4 23:51:24 GMT 2013
Hi Jeff,
thanks for the feedback.
What I cannot really work out is this: on my 8-core machine everything works,
and there I only have one (local) scratch space.
So I would have thought that this should not be a problem.
I can see where you are coming from; however, I do not know how to set up
separate scratch directories for the different nodes a job is running on. The
only option I found in the Molpro manual regarding scratch space is the -d
flag, which takes a full path for the scratch directory, so I do not see how
to tell core 1 to use space 1, and so on.
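If I understand your suggestion correctly, I suppose one could wrap the Molpro
call in something like the sketch below (untested; the base path, the input
file name and the variable names are just placeholders I made up) so that
every node gets its own subdirectory on the global file system, but I am not
sure whether that is the intended way of doing it:

  # create a per-node directory on the global file system and point Molpro at it
  SCRATCH_BASE=/global/scratch/$USER            # placeholder path
  NODE_SCRATCH=$SCRATCH_BASE/$(hostname)-$$     # one directory per node and run
  mkdir -p "$NODE_SCRATCH"
  molpro -d "$NODE_SCRATCH" input.inp           # -d as documented in the manual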
I thought of Sys5 shm as well, but since I have already raised the limits in
order to run NWChem on that machine, and since the job runs fine with local
scratch, I would have thought that is not the problem here either.
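(Just for completeness, the System V limits can be inspected and raised
roughly like this; the value below is only an example, not what I actually
use:)

  # show the current System V shared memory limits
  sysctl kernel.shmmax kernel.shmall
  # raise shmmax temporarily, e.g. to 8 GiB (example value only)
  sudo sysctl -w kernel.shmmax=8589934592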
I am still a bit puzzled here.
All the best from London
Jörg
On Monday, 4 February 2013, Jeff Hammond wrote:
> If you want shared scratch to behave as if it was local scratch, just
> create a subdirectory for each process to ensure that no I/O is
> conflicted. NWChem does this automagically with *.${procid} file
> suffixes but it's easy enough to use a directory instead since that
> requires no source changes.
>
> Molpro might have an option for this but I don't know what it is.
>
> Note also that I cannot be certain that the error messages you see
> aren't a side-effect of Sys5 shm exhaustion, which has nothing to do
> with file I/O, but since you say this job runs fine on local scratch,
> I'll assume that Sys5 is not the issue. ARMCI error messages are not
> always as they seem.
>
> Jeff
>
> On Mon, Feb 4, 2013 at 6:42 AM, Jörg Saßmannshausen
> <j.sassmannshausen at ucl.ac.uk> wrote:
> > Dear all,
> >
> > I was wondering if somebody could shed some light on this.
> >
> > When I am doing a DF-LCCSD(T) calculation, the first few steps work fine,
> > but then the program crashes when it gets to this point:
> > MP2 energy of close pairs: -0.09170948
> > MP2 energy of weak pairs: -0.06901764
> > MP2 energy of distant pairs: -0.00191297
> >
> > MP2 correlation energy: -2.48344057
> > MP2 total energy: -940.89652776
> >
> > LMP2 singlet pair energy -1.53042229
> > LMP2 triplet pair energy -0.95301828
> >
> > SCS-LMP2 correlation energy: -2.42949590 (PS= 1.200000 PT= 0.333333)
> > SCS-LMP2 total energy: -940.84258309
> >
> > Minimum Memory for K-operators: 2.48 MW  Maximum memory for K-operators: 28.97 MW  used: 28.97 MW
> >
> > Memory for amplitude vector: 0.52 MW
> >
> > Minimum memory for LCCSD: 8.15 MW, used: 65.01 MW, max: 64.48 MW
> >
> > ITER.  SQ.NORM      CORR.ENERGY   TOTAL ENERGY    ENERGY CHANGE   DEN1          VAR(S)    VAR(P)   DIIS  TIME
> >   1    1.96000293   -2.52977250   -940.94285970   -0.04633193     -2.42872569   0.35D-01  0.15D-01  1 1   348.20
> >
> >
> > Here are the error messages which I found:
> >
> > 5:Segmentation Violation error, status=: 11
> > (rank:5 hostname:node32 pid:5885):ARMCI DASSERT fail.
> > src/common/signaltrap.c:SigSegvHandler():310 cond:0
> >
> > 5: ARMCI aborting 11 (0xb).
> >
> > tmp = /home/sassy/pdir//usr/local/molpro-2012.1/bin/molpro.exe.p
> >
> > Creating: host=node33, user=sassy,
> >
> > [ ... ]
> >
> > and
> >
> > Last System Error Message from Task 5:: Bad file descriptor
> >
> > 5: ARMCI aborting 11 (0xb).
> >
> > system error message: Invalid argument
> >
> > 24: interrupt(1)
> >
> > Last System Error Message from Task 2:: Bad file descriptor
> > Last System Error Message from Task 0:: Inappropriate ioctl for device
> >
> > 2: ARMCI aborting 2 (0x2).
> >
> > system error message: Invalid argument
> > Last System Error Message from Task 3:: Bad file descriptor
> >
> > 3: ARMCI aborting 2 (0x2).
> >
> > system error message: Invalid argument
> > WaitAll: Child (25216) finished, status=0x8200 (exited with code 130).
> > [ ... ]
> >
> > I have the feeling there is a problem with reading/writing some files.
> > The global file system has around 158 GB of disc space free and, as far as
> > I could see, it was not full at the time of the run.
> >
> > Interestingly, the same input file runs fine when I use the local scratch
> > space. However, as the local scratch is rather small, I would prefer to use
> > the larger global file system.
> >
> > Are there any known problems with that approach, or is there something I
> > am doing wrong here?
> >
> > All the best from a sunny London
> >
> > Jörg
> >
--
*************************************************************
Jörg Saßmannshausen
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ
email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net
Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html