[molpro-user] technical molpro questions
Jeff Hammond
jeff.science at gmail.com
Wed Oct 30 18:18:17 GMT 2013
> 4) Another way I tried to accelerate Molpro is to use the parallel MKL as
> the BLAS and LAPACK library. From my, probably naive, point of view this
> should be ideal, since all of the time-consuming calculations done by
> Molpro should be some kind of matrix operation, which is ultimately done
> by BLAS and LAPACK; the parallelization there is via OpenMP threads, so I
> don't need n times the memory when using several threads, and especially
> on Intel CPUs the parallelization should be quite good.
Integral-direct algorithms do not spend all of their time in BLAS. AO
integral evaluation is expensive and for some basis sets it may be the
dominant cost (e.g. Roos ANO-TZ is heavily contracted and thus
presumably rather expensive). I can't tell you exactly which Molpro
algorithms are integral-direct, but I'm sure the documentation and
various papers will.
While BLAS dgemm scales nicely with the thread count, LAPACK thread
scaling is never perfect. For any method that uses linear solvers
and/or eigensolvers heavily, you may find that thread scaling across
sockets is poor.
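For illustration only (a hypothetical two-socket node with eight cores
per socket; adjust the numbers to your hardware), one way to keep each
rank's MKL threads on a single socket is to cap the thread count per
MPI rank before launching:

  # hypothetical node: 2 sockets x 8 cores, one MPI rank per socket
  export MKL_NUM_THREADS=8      # MKL threads per rank = cores per socket
  export OMP_NUM_THREADS=8      # in case other threaded code is active
  export KMP_AFFINITY=compact   # Intel OpenMP: keep a rank's threads together
  molpro -n 2 input.com         # launch as usual; exact invocation depends on your build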
A good rule of thumb is to run one MPI process per NUMA domain. A
NUMA domain is often a whole CPU socket, but AMD Magny-Cours and
related parts put two dies (and thus two last-level caches) on a
socket, so you get two NUMA domains per socket. You'll find the NERSC
documentation for the Cray XE6 tells users to run 4x6 (four MPI
processes of six cores each) on their 24-core nodes for this very
reason.
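For example (generic Linux and Open MPI commands, nothing
Molpro-specific; other launchers such as aprun or srun spell the
binding options differently), you can check how many NUMA domains a
node has and bind one rank per domain like so:

  numactl --hardware         # list the NUMA nodes and their memory
  lscpu | grep -i numa       # same information, condensed
  mpirun -np 4 --map-by numa --bind-to numa ./a.out   # Open MPI: one rank per NUMA domain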
> Unfortunately I got rather mixed results - sometimes parts of Molpro run
> faster, even those where the MPICH parallelization doesn't help at all
> (e.g. the RS2C iterations), but often the speedup is much smaller than
> with MPICH, or there is hardly any at all (at least no slowdowns here),
> although the fraction of time the CPU usage sits at n*100% would suggest
> much better results (or, as it is, a lot of overhead). Generally it is
> hard to compare using the CPU times given in the output. Looking at the
> real times, a combination of MPICH and the parallel MKL gave the fastest
> runs (by a very small margin) - but, as I wrote, this depends very much
> on the type of calculation performed.
> I'm very interested in why it does not work as I expected, or whether
> someone has made similar attempts with more success and could tell me
> exactly what they did, so I could compare with my setup... my configure
> options for this were (for the flags I used Intel's MKL link line advisor):
If you have a late-model Intel compiler, just use "-mkl" (see "ifort
-help" for suboptions). Writing MKL link lines by hand is horrendous
and should be avoided at all costs.
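For reference, here is the short form with a recent Intel compiler
versus the sort of line the link line advisor produces (the long form
below assumes LP64 interfaces and Intel threading; check it against
your own MKL version before copying it):

  # short form: let the compiler driver pull in MKL
  ifort -mkl=parallel ...       # also -mkl=sequential, -mkl=cluster

  # roughly equivalent hand-written link line from the advisor
  -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread \
    -lmkl_core -liomp5 -lpthread -lm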
Best,
Jeff
--
Jeff Hammond
jeff.science at gmail.com