[molpro-user] MOLPRO on dual-core AMD Opteron

Wed Nov 16 11:53:43 GMT 2005

Greetings:

Following up on my earlier request, a kind soul here provided me with  
root access to a demo machine with four Opteron 875s (2.2 GHz, dual  
core) and 16 GB of 400 MHz DDR2 RAM. Red Hat Advanced Server 4 was  
installed on it.

The benchmark I ran was HClO4 CCSD(T)/aug-cc-pV(T+d)Z single-point  
energy (in Cs symmetry).

From the viewpoint of Linux, the cores appear as eight independent
CPUs, divided into four nodes. One can lock jobs onto individual  
*nodes* by running the job as follows:

numactl --cpubind=0,1 molpro -n 4 -m 150000000 testjob-2cpus-4cores.com &

but there is no way (at least not that I could see) to bind processes  
to specific *cores*. So, in order to compare N dual-core with 2 N  
single-core CPUs, I used a somewhat "dirty" trick: I wrote a simple  
program that executes an endless loop of integer multiplies, saved it  
as "block1core", and ran

numactl --cpubind=0 block1core &
numactl --cpubind=1 block1core &
numactl --cpubind=2 block1core &
numactl --cpubind=3 block1core &

which effectively leaves me with what amounts to exactly the same  
machine but with four Opteron 848s (the single-core equivalent of the  
875).
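
For illustration, a core-blocking program along the lines of
"block1core" might look like the Python sketch below. The original
program is not shown in this post and was not necessarily written in
Python; the function name spin, the optional iteration cap, and the
--forever flag are assumptions introduced here so that the loop can
also be tried for a bounded number of steps.

```python
import sys


def spin(max_iterations=None):
    """Keep one core busy with integer multiplies.

    Runs forever when max_iterations is None; otherwise stops after
    that many multiplies and returns the count (handy for a dry run).
    """
    x = 1
    done = 0
    while max_iterations is None or done < max_iterations:
        # Mask to 32 bits so the integer does not grow without bound.
        x = (x * 3) & 0xFFFFFFFF
        done += 1
    return done


# Only loop endlessly when asked to explicitly (hypothetical flag).
if __name__ == "__main__" and "--forever" in sys.argv[1:]:
    spin()
```

Saved under a hypothetical name such as block1core.py, it would be
started once per node in the same way as above, for example
"numactl --cpubind=0 python block1core.py --forever &".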

The results (CPU times in seconds as reported in MOLPRO's output):

#CPUs  #cores  t[xform]  t[CCSD]   (a)  t[(T)]
    1       1     82.74     2399  2316    3132
    2       2     54.40     1251  1197    1509
    1       2     76.34     1308  1232    1593
    3       3     53.14      924   871    1048
    4       4     47.73      735   687     823
    2       4     55.43      728   672     804
    3       6     53.45      538   484     540
    4       8     46.43      478   432     461

(a) = CCSD minus transformation

There may be some slight measurement errors here, as well as some  
fluctuation because of other processes running (although the machine  
was basically empty otherwise), but the bottom line appears to be:

* N dual-core CPUs perform only slightly worse than 2N equivalent
single-core CPUs: with two cores, the dual-core times are about 5%
longer (1308 vs. 1251 s for CCSD, 1593 vs. 1509 s for (T)), and with
four cores the difference is within the noise. Presumably this will
deteriorate with larger and more memory-intensive jobs, but clearly the
performance penalty from the memory access channel shared by the two
cores isn't nearly as bad as I feared.
* (T) in this job size range parallelizes nearly perfectly with the
number of cores (not just CPUs) up to about 6 of them.
* Even CCSD in this job size range still parallelizes with about 80%
efficiency over 6 cores (if the transformation step is taken out of
the total figures) and with about 2/3 efficiency over 8 cores.
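
The efficiency figures quoted above can be reproduced from the table
as speedup divided by core count, E(N) = t(1) / (N * t(N)). The
sketch below does this in Python, assuming the CPU times reported by
MOLPRO can be compared directly against the single-core run; the
dictionary layout and the function name are illustrative only.

```python
# (CPUs, cores) -> (CCSD minus transformation, (T)), CPU seconds from the table
timings = {
    (1, 1): (2316, 3132),
    (2, 2): (1197, 1509),
    (1, 2): (1232, 1593),
    (3, 3): (871, 1048),
    (4, 4): (687, 823),
    (2, 4): (672, 804),
    (3, 6): (484, 540),
    (4, 8): (432, 461),
}


def efficiency(t_single, t_parallel, n_cores):
    """Parallel efficiency: speedup over one core divided by core count."""
    return t_single / (t_parallel * n_cores)


base_ccsd, base_t = timings[(1, 1)]
for (cpus, cores), (ccsd, t) in timings.items():
    e_ccsd = efficiency(base_ccsd, ccsd, cores)
    e_t = efficiency(base_t, t, cores)
    print(f"{cpus} CPUs / {cores} cores: CCSD(a) {e_ccsd:.0%}, (T) {e_t:.0%}")
```

For the 6-core run this gives roughly 80% for CCSD (a) and 97% for
(T); for the 8-core run, roughly 67% and 85%.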

Any comments or observations are welcome :-)

Best regards,
JMLM
------------------------------------------------------------------------
Gershom (Jan M.L.) Martin   /  Baroness Thatcher Professor of Chemistry
Member, Lise Meitner-Minerva Center for Computational Quantum Chemistry
    and Helen and Martin Kimmel Center for Molecular Design
Dept. of Organic Chemistry  /  Weizmann Institute of Science
Kimmelman Bldg., Room 252   /  76100 Rechovot, Israel
Email: comartin at wicc.weizmann.ac.il
Phone: +972 8 9342533 (office),  +972 54 4631676 (mobile)
FAX: +972 8 9344142 (dept.),  +972 8 9342621 (direct to computer)
Web: http://theochem.weizmann.ac.il
------------------------------------------------------------------------