[molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster
Andy May
MayAJ1 at cardiff.ac.uk
Thu Jun 21 09:59:13 BST 2012
Anatoliy,
It could be that gfortran has been too aggressive when optimizing some
file. If you can send me the CONFIG file then I'll try to reproduce with
exactly the same tools.
I think that -blaspath should be:
/usr/local/acml5.1.0/gfortran64_int64/lib
i.e. the 64-bit integer version of the acml library.
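For example, with the other options kept exactly as in the configure line quoted
below and only the int64 library path swapped in, that would be:
./configure -gcc -gfortran -openmpi -mpp -mppbase /usr/local/openmpi_gcc46/include -blas -blaspath /usr/local/acml5.1.0/gfortran64_int64/lib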
Best wishes,
Andy
On 20/06/12 17:36, Anatoliy Volkov wrote:
> Dear Andy,
>
> Thank you very much for your prompt reply. I have managed to compile Molpro 2010.1.25
> using the following options:
> ./configure -gcc -gfortran -openmpi -mpp -mppbase /usr/local/openmpi_gcc46/include -blas -blaspath /usr/local/acml5.1.0/gfortran64/lib
> I believe tuning went well (no errors reported), but when running tests on 4 processors:
> make MOLPRO_OPTIONS=-n4 test
> I have gotten errors for the following tests: Cs_DKH10, Cs_DKH2, Cs_DKH2_standard,
> Cs_DKH3, Cs_DKH4, Cs_DKH7, Cs_DKH8. I believe the error is the same and comes from
> MPI:
>
> Running job Cs_DKH10.test
> [wizard:11602] *** An error occurred in MPI_Allreduce
> [wizard:11602] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> [wizard:11602] *** MPI_ERR_TRUNCATE: message truncated
> [wizard:11602] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>
> Running job Cs_DKH2.test
> [wizard:11720] *** An error occurred in MPI_Allreduce
> [wizard:11720] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> [wizard:11720] *** MPI_ERR_TRUNCATE: message truncated
> [wizard:11720] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>
> .....
> .....
>
> Running job Cs_DKH8.test
> [wizard:12281] *** An error occurred in MPI_Allreduce
> [wizard:12281] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> [wizard:12281] *** MPI_ERR_TRUNCATE: message truncated
> [wizard:12281] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>
>
> Interestingly enough, there are no errors for the other tests I have run so far:
> Running job Cs_nr.test
> Running job allene_opt.test
> Running job allyl_cipt2.test
> Running job allyl_ls.test
> Running job ar2_dk_dummy.test
> Running job au2o_optdftecp1.test
> Running job au2o_optdftecp2.test
> Running job au2o_optecp.test
> Running job aucs4k2.test
> Running job b_cidft.test
> Running job basisinput.test
> Running job bccd_opt.test
> Running job bccd_save.test
> Running job benz_nlmo.test
> Running job benzol_giao.test
> Running job big_lattice.test
> Running job br2_f12_multgem.test
> Running job c2f4_cosmo.test
> Running job c2h2_dfmp2.test
> Running job c2h4_c1_freq.test
> Running job c2h4_ccsd-f12.test
> Running job c2h4_ccsdfreq.test
> Running job c2h4_cosmo.test
> Running job c2h4_cosmo_direct.test
> Running job c2h4_d2.test
> Running job c2h4_d2h.test
> Running job c2h4_d2h_freq.test
> Running job c2h4_ksfreq.test
> Running job c2h4_lccsd.test
> Running job c2h4_lccsd2.test
> Running job c2h4_lccsd3.test
> Running job c2h4_lmp2.test
> Running job c2h4_optnum.test
> Running job c2h4_prop.test
> Running job c2h4o_cosmo.test
> Running job c6h6_freq.test
> Running job c6h6_freq_restart.test
> Running job c6h6_opt.test
>
> Do you have any suggestions on how to fix the MPI error? Should I worry about it at all?
>
> Thank you in advance for your help.
>
> Best Regards,
> Anatoliy
>
> ________________________________________
> From: mayaj1 at Cardiff.ac.uk [mayaj1 at Cardiff.ac.uk]
> Sent: Wednesday, June 20, 2012 4:21 AM
> To: Anatoliy Volkov
> Cc: molpro-user at molpro.net
> Subject: Re: [molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster
>
> Anatoliy,
>
> Yes, there appears to be some problem when running the binaries on
> openSUSE 12.1. It is not simply a binary incompatibility problem; I see
> the same issue building from source code on 12.1. Apparently, building
> with pure TCGMSG produces an executable which crashes upon the first
> call to a global arrays routine. We have this bug reported:
>
> https://www.molpro.net/bugzilla/show_bug.cgi?id=3712
>
> and are looking into a fix.
>
> I see that you have access to the source code, so you can easily build a
> TCGMSG-MPI or MPI2 version from source which should work fine. Please
> let us know if you have any problems building.
>
> Just for information, rsh should be the default for the binaries, but it
> can be changed by setting the TCGRSH environment variable (or by passing the
> --tcgssh option to the bin/molpro shell script).
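>
> For instance, pointing the variable at the ssh binary (the path below is only
> an illustration; use whatever is correct on your system):
>
> export TCGRSH=/usr/bin/ssh
> molpro -n4 test.com
>
> or, equivalently, pass --tcgssh on the molpro command line.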
>
> Best wishes,
>
> Andy
>
> On 19/06/12 18:16, Anatoliy Volkov wrote:
>> Greetings,
>>
>> I seem to have hit a wall trying to get molpro-mpp-2010.1-24.Linux_x86_64 to run
>> on my AMD-based cluster (16 nodes, 6-core Phenom II X6 1090T or FX-6100 CPUs, and
>> 16 GB RAM per node, openSUSE 12.1 x86-64, kernel 3.1.10-1.9-desktop), while there are
>> absolutely no issues running the same version of Molpro on my old Intel-based cluster
>> (dual-socket quad-core Xeon E5230 CPUs, 16 GB RAM per node, openSUSE 11.4 x86-64,
>> kernel 2.6.37.6-0.11-desktop).
>>
>> On the AMD cluster, when Molpro starts to run on the master node, it tries to allocate a lot of memory,
>> and then dies. I have taken a couple of snapshots of 'top' (see the attached top.log file).
>>
>> At first it tries to allocate 9 GB, then 19 GB, then 25 GB, etc., and then it dies with the
>> following error in the TORQUE log file:
>>
>> Running Molpro
>> tmp = /home/avolkov/pdir//usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe.p
>>
>> Creating: host=viz01, user=avolkov,
>> file=/usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe, port=34803
>>
>> 60: ListenAndAccept: timeout waiting for connection 0 (0).
>>
>> 0.008u 0.133s 3:01.86 0.0% 0+0k 9344+40io 118pf+0w
>>
>> I am not sure I understand what is happening here. My cluster uses passwordless rsh and
>> I have not noticed any issues with communication between nodes. At least my own code, which
>> I compile using rsh-enabled OpenMPI, runs just fine on the cluster. Could it be that this version of
>> Molpro tries to use ssh? But then I do not understand why it works on my Intel cluster, where only
>> rsh is available...
>>
>> On both clusters Molpro has been installed the same way (/usr/local/molpro, NFS mounted on
>> all nodes) and pretty much the same TORQUE script is used.
>>
>> On both clusters, I start Molpro using the following command in my TORQUE script:
>>
>> time /usr/local/molpro/molpro -m 64M -o $ofile -d $SCR -N $TASKLIST $ifile
>>
>> where $TASKLIST is defined by the TORQUE script and, in the case of the latest failed job
>> on the AMD cluster, had the following value:
>> TASKLIST = viz01:6,viz02:6,viz03:6,viz04:6,viz05:6,viz06:6,viz07:6,viz08:6,viz09:6,viz10:6
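>>
>> (The step of the script that builds $TASKLIST is not shown here; a simplified
>> sketch of one way to derive it from the standard TORQUE node file, assuming
>> $PBS_NODEFILE lists one hostname per allocated core slot, would be
>>
>> # collapse the one-hostname-per-slot node file into a host:ncores list
>> TASKLIST=$(sort $PBS_NODEFILE | uniq -c | awk '{printf "%s%s:%s", sep, $2, $1; sep=","}')
>>
>> which yields exactly the host:count form shown above.)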
>>
>> In the temp directory, file molpro_options.31159 contained:
>> -m 64M -o test.out -d /tmp/16.wizard.cs.mtsu.edu test.com
>> while file procgrp.31159 was as follows:
>> avolkov viz01 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>> ......
>> avolkov viz02 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>> .....
>> .....
>> avolkov viz10 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>>
>> BTW, the test.out file is never created...
>>
>> Contents of test.com file:
>> ! $Revision: 2006.3 $
>> ***,bullvalene !A title
>> memory,64,M ! 1 MW = 8 MB
>> basis=cc-pVTZ
>> geomtyp=xyz
>> geometry={
>> 20 ! number of atoms
>> this is where you put your title
>> C 1.36577619 -0.62495122 -0.63870960
>> C 0.20245537 -1.27584792 -1.26208804
>> C -1.09275642 -1.01415419 -1.01302123
>> .........
>> }
>> ks,b3lyp
>>
>> What am I doing wrong here?
>>
>> Thank you in advance for your help!
>>
>> Best Regards,
>> Anatoliy
>> ---------------------------
>> Anatoliy Volkov, Ph.D.
>> Associate Professor
>> Department of Chemistry
>> Middle Tennessee State University
>>
>>
>>
>>
>
>
>