The hottest part is the WOLFF
subroutine:
samples % symbol name 778707 73.8191 wolff_ 126106 11.9545 energ_ 121502 11.5180 ggl_ 14281 1.3538 suscep_ 14185 1.3447 MAIN__
WOLFF
has the following loop:
:! ------------------------ :! Check nearest neighbors: :! ------------------------ : DO j = 1 , 4 308337 18.3276 : nny = nn(j,1) 13106 0.7790 : nnx = nn(j,2) 133639 7.9435 : IF ( ishere.NE.Iz(nny,nnx) ) GOTO 125 156916 9.3271 : isize = isize + 1 12357 0.7345 : rans = GGL(Dseed) 198040 11.7715 : IF ( rans.GT.Prob ) GOTO 125 4027 0.2394 : Iz(nny,nnx) = -Iz(nny,nnx) 92837 5.5183 : Iclsiz = Iclsiz + 1 :! ----------------------------------------------------------------- :! Increment the stack of lattice sites which are part of a cluster, :! but whose nearest neighbors have not yet been checked: :! ----------------------------------------------------------------- 574 0.0341 : ipt = ipt + 1 80619 4.7920 : Istack(ipt,1) = nny 13238 0.7869 : Istack(ipt,2) = nnx : 125 ENDDO
samples % symbol name 535460 12.7126 exp2 475739 11.2947 anyavg_ 468200 11.1157 powf 370922 8.8062 locate_ 206648 4.9061 iblval_ 192494 4.5701 sigz_ 187145 4.4431 _gfortrani_compare_stringOne hot function is
anyavg
which is called a lot.
samples % symbol name 1118193 58.2766 MAIN__ 212250 11.0618 derivy_ 170644 8.8934 fvspltx2_ 170459 8.8838 derivx_ 151798 7.9112 fvsplty2_ 57339 2.9883 state_No really hot parts, just a lot of loops that add up.
DERIVX
has
2188 0.1140 : DO jm = 1 , M : jmax = 0 : jmin = 1 1288 0.0671 : DO i = 1 , Nd 3727 0.1942 : jmax = jmax + Np(i) + 1 7938 0.4137 : DO j = jmin , jmax : uxt = 0. 107253 5.5897 : DO k = 0 , Np(i) 37659 1.9627 : uxt = uxt + D(j,k+1)*U(jmin+k,jm) : ENDDO 8758 0.4564 : Ux(j,jm) = uxt*Al(i) : ENDDO :! 732 0.0381 : jmin = jmin + Np(i) + 1 : ENDDO : ENDDO
DERIVY
has a similar loop.
samples % symbol name 5595624 88.2744 __solv_cap__fourir 440779 6.9536 __solv_cap__fourir2dx 105672 1.6670 __solv_cap__prod1 71401 1.1264 __solv_cap__preco
fourir
has the following loop:
: loop : do 111211 1.7544 : if(ij) then 26394 0.4164 : h=A(i)*root 75373 1.1891 : A(i)=A(j)*root 8838 0.1394 : A(j)=h : else if(i==j) then 18611 0.2936 : A(j)=A(i)*root : end if 22557 0.3559 : j=j+1 : else 17098 0.2697 : inc=inc*2 24064 0.3796 : if(inc>ntot) then : write(unit=*, fmt=*) "error in fourier: n=", ntot : stop : end if : cycle loop ! recursive invocation I, followed by II : end if : else ! returned from recursive invocation II 284 0.0045 : i=i-inc 1149463 18.1335 : inc=inc/2 158780 2.5049 : n=ntot/inc 16125 0.2544 : j0=j-n 33206 0.5238 : j2=j0+n/2 78837 1.2437 : h = A(j0) + A(j2) 49847 0.7864 : A(j2) = A(j0) - A(j2) 1 1.6e-05 : A(j0) = h 33412 0.5271 : if(n>2) then 8156 0.1287 : j1=j0+n/4 15141 0.2389 : j3=j2+n/4 79795 1.2588 : h = A(j1) + A(j3)*Icc 23359 0.3685 : A(j3) = A(j1) - A(j3)*Icc 10781 0.1701 : A(j1) = h 782045 12.3372 : do m=1,n/4-1 1068196 16.8514 : h = A(j0+m) + A(j2+m)*E(m*inc) 115815 1.8271 : A(j2+m) = A(j0+m) - A(j2+m)*E(m*inc) 1 1.6e-05 : A(j0+m) = h 205199 3.2371 : eh = conjg(E(ntot/4-m*inc)) 840185 13.2544 : h = A(j1+m) - A(j3+m)*eh 146492 2.3110 : A(j3+m) = A(j1+m) + A(j3+m)*eh 1 1.6e-05 : A(j1+m) = h : end do : end if : end if 99516 1.5699 : if(inc==1) then : exit loop ! return from root-invocation : end if 103912 1.6393 : i = i+inc/2 : end do loop ! return from recursive invo
samples % symbol name 2494816 63.8103 MAIN__ 726182 18.5737 ddx.1380 687084 17.5736 ddy.1376Most of the time is spent on the time marching loop here:
: u(2:M-1,1:N,new) = u(2:M-1,1:N,old) & ! interior u points : +2.d0*dt*f(2:M-1,1:N)*v(2:M-1,1:N,mid) & 970191 24.8147 : -2.d0*dt/(2.d0*dx)*g*dhdx(2:M-1,1:N) : : v(1:M,2:N-1,new) = v(1:M,2:N-1,old) & ! interior v points : -2.d0*dt*f(1:M,2:N-1)*u(1:M,2:N-1,mid) & 822404 21.0347 : -2.d0*dt/(2.d0*dy)*g*dhdy(1:M,2:N-1) : : h(1:M,2:N-1,new) = h(1:M,2:N-1,old) & ! interior h points : -2.d0*dt/(2.d0*dx)*Href*dudx(1:M,2:N-1) & 656240 16.7847 : -2.d0*dt/(2.d0*dy)*Href*dvdy(1:M,2:N-1)and in the
ddx
and ddy
functions:
721331 18.4496 : ddx(2:I-1,1:J) = array(3:I,1:J)-array(1:I-2,1:J) ! interior points 666704 17.0524 : ddy(1:I,2:J-1) = array(1:I,3:J)-array(1:I,1:J-2) ! interior points
samples % symbol name 1063280 21.5747 pow 659946 13.3908 si_ 292427 5.9336 s00017_ 258190 5.2389 exp 256393 5.2024 s00089_ 213899 4.3402 f00113_ 196424 3.9856 mul12and lots of tiny routines that add up.
SI
has a loop like:
128350 2.6043 : DO ik = 2 , n : ii = ik 332341 6.7434 : IF ( Xtbl(ik).GE.x ) GOTO 100 : ENDDO
samples % symbol name 673758 26.8266 __perdida_m__perdida 564683 22.4836 __remainder_piby2 426911 16.9980 __perdida_m__generalized_hookes_law 253172 10.0804 MAIN__ 134442 5.3530 sin 120423 4.7948 cos 108797 4.3319 _int_malloc 89870 3.5783 _int_freeWe spend a lot of time in reducing sin and cos arguments it seems.
PERDIDA
has
: damaged_dev_stress_tensor(:,:) = deviatoric_stress_tensor(:,:)/(1.0_LONGreal - & 104077 4.1440 : crack_parameter * damage)
generalized_hookes_law
:
172913 6.8848 : generalized_constitutive_tensor(:,:) = 0.0_LONGrealand
: do i = 1, 6 : generalized_stress_vector(i) = dot_product(generalized_constitutive_tensor(i,:), & 77066 3.0685 : generalized_strain_vector(:)) : end do
samples % symbol name 1507338 50.4328 eos_ 604520 20.2261 chozdt_ 385809 12.9085 _gfortran_minloc0_4_r4 153227 5.1267 MAIN__ 120127 4.0192 area_In
EOS
we have
268 0.0090 : IF (SHEAT>0.0 .AND. CGAMMA>0.0) THEN 345224 11.5506 : TEMP(:NODES) = IENER(:NODES)/SHEAT 377779 12.6398 : PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES) 196586 6.5774 : GAMMA(:NODES) = CGAMMA 587464 19.6555 : CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES)) : ELSEAnd
CHOZDT
spends all its time in
603054 20.1771 : DTEMP = DX/(ABS(VEL) + SOUND) 752 0.0252 : ISET = MINLOC (DTEMP)there is also the
MINLOC
call.
samples % symbol name 4492315 72.7969 __mqc_m__mutual_ind_quad_cir_coil 1411917 22.8798 __mqr_m__mutual_ind_quad_rec_coil 154445 2.5027 __remainder_piby2dot_products galore. Also a lot of sqrt of dot_products, maybe a pattern to recognize and optimize.
samples % symbol name 3637781 99.4426 daxpy_The classic one, with some manual unrolling:
405466 11.0838 : DO i = mp1 , N , 4 631494 17.2626 : Dy(i) = Dy(i) + Da*Dx(i) 1396884 38.1853 : Dy(i+1) = Dy(i+1) + Da*Dx(i+1) 371128 10.1452 : Dy(i+2) = Dy(i+2) + Da*Dx(i+2) 481208 13.1543 : Dy(i+3) = Dy(i+3) + Da*Dx(i+3) : ENDDO
samples % symbol name 1441821 47.7020 mforce_ 764881 25.3057 fbuild_ 451006 14.9213 mstep_ 266520 8.8177 cbuild_ 89573 2.9635 gbuild_Loops, loops, loops - with control flow. Like for example (
FBUILD
):
: DO i = 1 , MOLsa 11678 0.3864 : DO j = i + 1 , MOLsa 11596 0.3836 : xij = X0(1,i) - X0(1,j) 45792 1.5150 : IF ( xij.GT.+HALf ) xij = xij - PBCx 29208 0.9663 : IF ( xij.LT.-HALf ) xij = xij + PBCx 11502 0.3805 : yij = X0(2,i) - X0(2,j) 114264 3.7804 : IF ( yij.GT.+HALf ) yij = yij - PBCy 76163 2.5198 : IF ( yij.LT.-HALf ) yij = yij + PBCy 22806 0.7545 : zij = X0(3,i) - X0(3,j) 65153 2.1556 : IF ( zij.GT.+HALf ) zij = zij - PBCz 52288 1.7299 : IF ( zij.LT.-HALf ) zij = zij + PBCz : rsq = xij*(g11*xij+g12d*yij+g13d*zij) & : & + yij*(g22*yij+g23d*zij) + g33*zij*zij 145 0.0048 : MARk(j) = 0 266293 8.8102 : IF ( rsq.LT.Ransq ) MARk(j) = 1 : ENDDO : MRKr1(i) = nll 11613 0.3842 : DO j = i + 1 , MOLsa 15790 0.5224 : LISt(nll) = j 18679 0.6180 : nll = nll + MARk(j) : ENDDO 250 0.0083 : MRKr2(i) = nll - 1 : ENDDO
samples % symbol name 1922085 34.3511 spmmult.1796 1480771 26.4641 trisolve.1787 1208249 21.5936 nfcg_ 510377 9.1214 nf2dprecon.1778 434281 7.7614 nf3dprecon.1769
:subroutine spmmult(x,b) :real(dpkind),dimension(nxyz):: x,b 270662 4.8372 :b = ad*x 276608 4.9435 :b(1:nxyz-1) = b(1:nxyz-1) + au1(1:nxyz-1) *x(2:nxyz) 275996 4.9325 :b(2:nxyz) = b(2:nxyz) + au1(1:nxyz-1) *x(1:nxyz-1) ! symmetric - so use au for lower band too 275781 4.9287 :b(1:nxyz-nx) = b(1:nxyz-nx) + au2(1:nxyz-nx) *x(nx+1:nxyz) 242638 4.3364 :b(nx+1:nxyz) = b(nx+1:nxyz) + au2(1:nxyz-nx) *x(1:nxyz-nx) 273006 4.8791 :b(1:nxyz-nxy) = b(1:nxyz-nxy) + au3(1:nxyz-nxy)*x(nxy+1:nxyz) 239446 4.2793 :b(nxy+1:nxyz) = b(nxy+1:nxyz) + au3(1:nxyz-nxy)*x(1:nxyz-nxy) :end subroutine spmmult !=========================================
23408 0.4183 :subroutine trisolve(x,i1,i2) ! solve along a single line of cells ===== /* trisolve.1787 total: 1480771 26.4641 */ :integer :: i1 , i2 :real(dpkind),dimension(i2)::x :integer :: i 4317 0.0772 :x(i1) = gi(i1)* x(i1) 22584 0.4036 :do i = i1+1 , i2 780462 13.9483 : x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1)) :enddo 97761 1.7472 :do i = i2-1 , i1 , -1 552239 9.8695 : x(i) = x(i) - gi(i)*au1(i)*x(i+1) :enddo :end subroutine trisolve !=========================================
samples % symbol name 3819311 76.4573 __conformations__conformation 872399 17.4642 __energy__compute_energy 219551 4.3951 cshift0
samples % symbol name 1930486 43.5695 trs2a2.3579 630563 14.2313 minlst.5369 486611 10.9824 cptrf2_ 334118 7.5408 gentrs_ 267859 6.0454 _gfortran_matmul_r4 121412 2.7402 invima.3572
TRS2A2
looks like
30803 0.6952 : trs2a2 = 0.0 29 6.5e-04 : do iclw1 = j, k - 1 4127 0.0931 : do iclw2 = j, k - 1 : dtmp = 0.0d0 583839 13.1768 : do iclww = j, k - 1 1151037 25.9780 : dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2) : enddo 151279 3.4142 : trs2a2 (iclw1, iclw2) = dtmp : enddo : enddo
MINLST
looks like
4926 0.1112 : minlst = ipos2 /* minlst.5369 total: 630563 14.2313 */ 121931 2.7519 : do ipos = ipos2 - 1, ipos1, -1 503706 11.3682 : if (xxtrt (ipos) < xxtrt (minlst)) then : minlst = ipos : endif : enddo
samples % symbol name 1532040 51.2070 crout_ 591500 19.7703 dgemm_ 567353 18.9632 gauss_ 149244 4.9883 dtrmm_ 31588 1.0558 dtrsm_dot_product again, in
CROUT
. DGEMM
has:
139 0.0046 : DO l = 1 , K 622 0.0208 : IF ( B(l,j)/=ZERO ) THEN : temp = Alpha*B(l,j) 21380 0.7146 : DO i = 1 , M 569348 19.0299 : C(i,j) = C(i,j) + temp*A(i,l) : ENDDO : ENDIF : ENDDO
samples % symbol name 1078001 99.8597 rfft_