The hottest part is the WOLFF subroutine:
samples % symbol name 778707 73.8191 wolff_ 126106 11.9545 energ_ 121502 11.5180 ggl_ 14281 1.3538 suscep_ 14185 1.3447 MAIN__
WOLFF has the following loop:
:! ------------------------
:! Check nearest neighbors:
:! ------------------------
: DO j = 1 , 4
308337 18.3276 : nny = nn(j,1)
13106 0.7790 : nnx = nn(j,2)
133639 7.9435 : IF ( ishere.NE.Iz(nny,nnx) ) GOTO 125
156916 9.3271 : isize = isize + 1
12357 0.7345 : rans = GGL(Dseed)
198040 11.7715 : IF ( rans.GT.Prob ) GOTO 125
4027 0.2394 : Iz(nny,nnx) = -Iz(nny,nnx)
92837 5.5183 : Iclsiz = Iclsiz + 1
:! -----------------------------------------------------------------
:! Increment the stack of lattice sites which are part of a cluster,
:! but whose nearest neighbors have not yet been checked:
:! -----------------------------------------------------------------
574 0.0341 : ipt = ipt + 1
80619 4.7920 : Istack(ipt,1) = nny
13238 0.7869 : Istack(ipt,2) = nnx
: 125 ENDDO
samples % symbol name 535460 12.7126 exp2 475739 11.2947 anyavg_ 468200 11.1157 powf 370922 8.8062 locate_ 206648 4.9061 iblval_ 192494 4.5701 sigz_ 187145 4.4431 _gfortrani_compare_stringOne hot function is
anyavg which is called a lot.
samples % symbol name 1118193 58.2766 MAIN__ 212250 11.0618 derivy_ 170644 8.8934 fvspltx2_ 170459 8.8838 derivx_ 151798 7.9112 fvsplty2_ 57339 2.9883 state_No really hot parts, just a lot of loops that add up.
DERIVX has
2188 0.1140 : DO jm = 1 , M
: jmax = 0
: jmin = 1
1288 0.0671 : DO i = 1 , Nd
3727 0.1942 : jmax = jmax + Np(i) + 1
7938 0.4137 : DO j = jmin , jmax
: uxt = 0.
107253 5.5897 : DO k = 0 , Np(i)
37659 1.9627 : uxt = uxt + D(j,k+1)*U(jmin+k,jm)
: ENDDO
8758 0.4564 : Ux(j,jm) = uxt*Al(i)
: ENDDO
:!
732 0.0381 : jmin = jmin + Np(i) + 1
: ENDDO
: ENDDO
DERIVY has a similar loop.
samples % symbol name 5595624 88.2744 __solv_cap__fourir 440779 6.9536 __solv_cap__fourir2dx 105672 1.6670 __solv_cap__prod1 71401 1.1264 __solv_cap__preco
fourir has the following loop:
: loop : do
111211 1.7544 : if(ij) then
26394 0.4164 : h=A(i)*root
75373 1.1891 : A(i)=A(j)*root
8838 0.1394 : A(j)=h
: else if(i==j) then
18611 0.2936 : A(j)=A(i)*root
: end if
22557 0.3559 : j=j+1
: else
17098 0.2697 : inc=inc*2
24064 0.3796 : if(inc>ntot) then
: write(unit=*, fmt=*) "error in fourier: n=", ntot
: stop
: end if
: cycle loop ! recursive invocation I, followed by II
: end if
: else ! returned from recursive invocation II
284 0.0045 : i=i-inc
1149463 18.1335 : inc=inc/2
158780 2.5049 : n=ntot/inc
16125 0.2544 : j0=j-n
33206 0.5238 : j2=j0+n/2
78837 1.2437 : h = A(j0) + A(j2)
49847 0.7864 : A(j2) = A(j0) - A(j2)
1 1.6e-05 : A(j0) = h
33412 0.5271 : if(n>2) then
8156 0.1287 : j1=j0+n/4
15141 0.2389 : j3=j2+n/4
79795 1.2588 : h = A(j1) + A(j3)*Icc
23359 0.3685 : A(j3) = A(j1) - A(j3)*Icc
10781 0.1701 : A(j1) = h
782045 12.3372 : do m=1,n/4-1
1068196 16.8514 : h = A(j0+m) + A(j2+m)*E(m*inc)
115815 1.8271 : A(j2+m) = A(j0+m) - A(j2+m)*E(m*inc)
1 1.6e-05 : A(j0+m) = h
205199 3.2371 : eh = conjg(E(ntot/4-m*inc))
840185 13.2544 : h = A(j1+m) - A(j3+m)*eh
146492 2.3110 : A(j3+m) = A(j1+m) + A(j3+m)*eh
1 1.6e-05 : A(j1+m) = h
: end do
: end if
: end if
99516 1.5699 : if(inc==1) then
: exit loop ! return from root-invocation
: end if
103912 1.6393 : i = i+inc/2
: end do loop ! return from recursive invo
samples % symbol name 2494816 63.8103 MAIN__ 726182 18.5737 ddx.1380 687084 17.5736 ddy.1376Most of the time is spent on the time marching loop here:
: u(2:M-1,1:N,new) = u(2:M-1,1:N,old) & ! interior u points
: +2.d0*dt*f(2:M-1,1:N)*v(2:M-1,1:N,mid) &
970191 24.8147 : -2.d0*dt/(2.d0*dx)*g*dhdx(2:M-1,1:N)
:
: v(1:M,2:N-1,new) = v(1:M,2:N-1,old) & ! interior v points
: -2.d0*dt*f(1:M,2:N-1)*u(1:M,2:N-1,mid) &
822404 21.0347 : -2.d0*dt/(2.d0*dy)*g*dhdy(1:M,2:N-1)
:
: h(1:M,2:N-1,new) = h(1:M,2:N-1,old) & ! interior h points
: -2.d0*dt/(2.d0*dx)*Href*dudx(1:M,2:N-1) &
656240 16.7847 : -2.d0*dt/(2.d0*dy)*Href*dvdy(1:M,2:N-1)
and in the ddx and ddy functions:
721331 18.4496 : ddx(2:I-1,1:J) = array(3:I,1:J)-array(1:I-2,1:J) ! interior points 666704 17.0524 : ddy(1:I,2:J-1) = array(1:I,3:J)-array(1:I,1:J-2) ! interior points
samples % symbol name 1063280 21.5747 pow 659946 13.3908 si_ 292427 5.9336 s00017_ 258190 5.2389 exp 256393 5.2024 s00089_ 213899 4.3402 f00113_ 196424 3.9856 mul12and lots of tiny routines that add up.
SI has a loop like:
128350 2.6043 : DO ik = 2 , n
: ii = ik
332341 6.7434 : IF ( Xtbl(ik).GE.x ) GOTO 100
: ENDDO
samples % symbol name 673758 26.8266 __perdida_m__perdida 564683 22.4836 __remainder_piby2 426911 16.9980 __perdida_m__generalized_hookes_law 253172 10.0804 MAIN__ 134442 5.3530 sin 120423 4.7948 cos 108797 4.3319 _int_malloc 89870 3.5783 _int_freeWe spend a lot of time in reducing sin and cos arguments it seems.
PERDIDA has
: damaged_dev_stress_tensor(:,:) = deviatoric_stress_tensor(:,:)/(1.0_LONGreal - &
104077 4.1440 : crack_parameter * damage)
generalized_hookes_law:
172913 6.8848 : generalized_constitutive_tensor(:,:) = 0.0_LONGrealand
: do i = 1, 6
: generalized_stress_vector(i) = dot_product(generalized_constitutive_tensor(i,:), &
77066 3.0685 : generalized_strain_vector(:))
: end do
samples % symbol name 1507338 50.4328 eos_ 604520 20.2261 chozdt_ 385809 12.9085 _gfortran_minloc0_4_r4 153227 5.1267 MAIN__ 120127 4.0192 area_In
EOS we have
268 0.0090 : IF (SHEAT>0.0 .AND. CGAMMA>0.0) THEN
345224 11.5506 : TEMP(:NODES) = IENER(:NODES)/SHEAT
377779 12.6398 : PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES)
196586 6.5774 : GAMMA(:NODES) = CGAMMA
587464 19.6555 : CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
: ELSE
And CHOZDT spends all its time in
603054 20.1771 : DTEMP = DX/(ABS(VEL) + SOUND) 752 0.0252 : ISET = MINLOC (DTEMP)there is also the
MINLOC call.
samples % symbol name 4492315 72.7969 __mqc_m__mutual_ind_quad_cir_coil 1411917 22.8798 __mqr_m__mutual_ind_quad_rec_coil 154445 2.5027 __remainder_piby2dot_products galore. Also a lot of sqrt of dot_products, maybe a pattern to recognize and optimize.
samples % symbol name 3637781 99.4426 daxpy_The classic one, with some manual unrolling:
405466 11.0838 : DO i = mp1 , N , 4
631494 17.2626 : Dy(i) = Dy(i) + Da*Dx(i)
1396884 38.1853 : Dy(i+1) = Dy(i+1) + Da*Dx(i+1)
371128 10.1452 : Dy(i+2) = Dy(i+2) + Da*Dx(i+2)
481208 13.1543 : Dy(i+3) = Dy(i+3) + Da*Dx(i+3)
: ENDDO
samples % symbol name 1441821 47.7020 mforce_ 764881 25.3057 fbuild_ 451006 14.9213 mstep_ 266520 8.8177 cbuild_ 89573 2.9635 gbuild_Loops, loops, loops - with control flow. Like for example (
FBUILD):
: DO i = 1 , MOLsa
11678 0.3864 : DO j = i + 1 , MOLsa
11596 0.3836 : xij = X0(1,i) - X0(1,j)
45792 1.5150 : IF ( xij.GT.+HALf ) xij = xij - PBCx
29208 0.9663 : IF ( xij.LT.-HALf ) xij = xij + PBCx
11502 0.3805 : yij = X0(2,i) - X0(2,j)
114264 3.7804 : IF ( yij.GT.+HALf ) yij = yij - PBCy
76163 2.5198 : IF ( yij.LT.-HALf ) yij = yij + PBCy
22806 0.7545 : zij = X0(3,i) - X0(3,j)
65153 2.1556 : IF ( zij.GT.+HALf ) zij = zij - PBCz
52288 1.7299 : IF ( zij.LT.-HALf ) zij = zij + PBCz
: rsq = xij*(g11*xij+g12d*yij+g13d*zij) &
: & + yij*(g22*yij+g23d*zij) + g33*zij*zij
145 0.0048 : MARk(j) = 0
266293 8.8102 : IF ( rsq.LT.Ransq ) MARk(j) = 1
: ENDDO
: MRKr1(i) = nll
11613 0.3842 : DO j = i + 1 , MOLsa
15790 0.5224 : LISt(nll) = j
18679 0.6180 : nll = nll + MARk(j)
: ENDDO
250 0.0083 : MRKr2(i) = nll - 1
: ENDDO
samples % symbol name 1922085 34.3511 spmmult.1796 1480771 26.4641 trisolve.1787 1208249 21.5936 nfcg_ 510377 9.1214 nf2dprecon.1778 434281 7.7614 nf3dprecon.1769
:subroutine spmmult(x,b)
:real(dpkind),dimension(nxyz):: x,b
270662 4.8372 :b = ad*x
276608 4.9435 :b(1:nxyz-1) = b(1:nxyz-1) + au1(1:nxyz-1) *x(2:nxyz)
275996 4.9325 :b(2:nxyz) = b(2:nxyz) + au1(1:nxyz-1) *x(1:nxyz-1) ! symmetric - so use au for lower
band too
275781 4.9287 :b(1:nxyz-nx) = b(1:nxyz-nx) + au2(1:nxyz-nx) *x(nx+1:nxyz)
242638 4.3364 :b(nx+1:nxyz) = b(nx+1:nxyz) + au2(1:nxyz-nx) *x(1:nxyz-nx)
273006 4.8791 :b(1:nxyz-nxy) = b(1:nxyz-nxy) + au3(1:nxyz-nxy)*x(nxy+1:nxyz)
239446 4.2793 :b(nxy+1:nxyz) = b(nxy+1:nxyz) + au3(1:nxyz-nxy)*x(1:nxyz-nxy)
:end subroutine spmmult !=========================================
23408 0.4183 :subroutine trisolve(x,i1,i2) ! solve along a single line of cells ===== /* trisolve.1787 total: 1480771 26.4641 */
:integer :: i1 , i2
:real(dpkind),dimension(i2)::x
:integer :: i
4317 0.0772 :x(i1) = gi(i1)* x(i1)
22584 0.4036 :do i = i1+1 , i2
780462 13.9483 : x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1))
:enddo
97761 1.7472 :do i = i2-1 , i1 , -1
552239 9.8695 : x(i) = x(i) - gi(i)*au1(i)*x(i+1)
:enddo
:end subroutine trisolve !=========================================
samples % symbol name 3819311 76.4573 __conformations__conformation 872399 17.4642 __energy__compute_energy 219551 4.3951 cshift0
samples % symbol name 1930486 43.5695 trs2a2.3579 630563 14.2313 minlst.5369 486611 10.9824 cptrf2_ 334118 7.5408 gentrs_ 267859 6.0454 _gfortran_matmul_r4 121412 2.7402 invima.3572
TRS2A2 looks like
30803 0.6952 : trs2a2 = 0.0
29 6.5e-04 : do iclw1 = j, k - 1
4127 0.0931 : do iclw2 = j, k - 1
: dtmp = 0.0d0
583839 13.1768 : do iclww = j, k - 1
1151037 25.9780 : dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2)
: enddo
151279 3.4142 : trs2a2 (iclw1, iclw2) = dtmp
: enddo
: enddo
MINLST looks like
4926 0.1112 : minlst = ipos2 /* minlst.5369 total: 630563 14.2313 */
121931 2.7519 : do ipos = ipos2 - 1, ipos1, -1
503706 11.3682 : if (xxtrt (ipos) < xxtrt (minlst)) then
: minlst = ipos
: endif
: enddo
samples % symbol name 1532040 51.2070 crout_ 591500 19.7703 dgemm_ 567353 18.9632 gauss_ 149244 4.9883 dtrmm_ 31588 1.0558 dtrsm_dot_product again, in
CROUT. DGEMM has:
139 0.0046 : DO l = 1 , K
622 0.0208 : IF ( B(l,j)/=ZERO ) THEN
: temp = Alpha*B(l,j)
21380 0.7146 : DO i = 1 , M
569348 19.0299 : C(i,j) = C(i,j) + temp*A(i,l)
: ENDDO
: ENDIF
: ENDDO
samples % symbol name 1078001 99.8597 rfft_