Polyhedron analysis

AC

The hottest part is the WOLFF subroutine:

samples  %        symbol name
778707   73.8191  wolff_
126106   11.9545  energ_
121502   11.5180  ggl_
14281     1.3538  suscep_
14185     1.3447  MAIN__
WOLFF has the following loop:
               :!     ------------------------
               :!     Check nearest neighbors:
               :!     ------------------------
               :      DO j = 1 , 4
308337 18.3276 :         nny = nn(j,1)
 13106  0.7790 :         nnx = nn(j,2)
133639  7.9435 :         IF ( ishere.NE.Iz(nny,nnx) ) GOTO 125
156916  9.3271 :         isize = isize + 1
 12357  0.7345 :         rans = GGL(Dseed)
198040 11.7715 :         IF ( rans.GT.Prob ) GOTO 125
  4027  0.2394 :         Iz(nny,nnx) = -Iz(nny,nnx)
 92837  5.5183 :         Iclsiz = Iclsiz + 1
               :!     -----------------------------------------------------------------
               :!     Increment the stack of lattice sites which are part of a cluster,
               :!     but whose nearest neighbors have not yet been checked:
               :!     -----------------------------------------------------------------
   574  0.0341 :         ipt = ipt + 1
 80619  4.7920 :         Istack(ipt,1) = nny
 13238  0.7869 :         Istack(ipt,2) = nnx
               : 125  ENDDO

AERMOD

Aermod spends most of the time in library routines:
samples  %        symbol name
535460   12.7126  exp2
475739   11.2947  anyavg_
468200   11.1157  powf
370922    8.8062  locate_
206648    4.9061  iblval_
192494    4.5701  sigz_
187145    4.4431  _gfortrani_compare_string
One hot function is anyavg which is called a lot.

AIR

samples  %        symbol name
1118193  58.2766  MAIN__
212250   11.0618  derivy_
170644    8.8934  fvspltx2_
170459    8.8838  derivx_
151798    7.9112  fvsplty2_
57339     2.9883  state_
No really hot parts, just a lot of loops that add up. DERIVX has
  2188  0.1140 :      DO jm = 1 , M
               :         jmax = 0
               :         jmin = 1
  1288  0.0671 :         DO i = 1 , Nd
  3727  0.1942 :            jmax = jmax + Np(i) + 1
  7938  0.4137 :            DO j = jmin , jmax
               :               uxt = 0.
107253  5.5897 :               DO k = 0 , Np(i)
 37659  1.9627 :                  uxt = uxt + D(j,k+1)*U(jmin+k,jm)
               :               ENDDO
  8758  0.4564 :               Ux(j,jm) = uxt*Al(i)
               :            ENDDO
               :!
   732  0.0381 :            jmin = jmin + Np(i) + 1
               :         ENDDO
               :      ENDDO
DERIVY has a similar loop.

CAPACITA

samples  %        symbol name
5595624  88.2744  __solv_cap__fourir
440779    6.9536  __solv_cap__fourir2dx
105672    1.6670  __solv_cap__prod1
71401     1.1264  __solv_cap__preco
fourir has the following loop:
               :    loop : do
111211  1.7544 :      if(ij) then
 26394  0.4164 :            h=A(i)*root
 75373  1.1891 :            A(i)=A(j)*root
  8838  0.1394 :            A(j)=h
               :          else if(i==j) then
 18611  0.2936 :            A(j)=A(i)*root
               :          end if
 22557  0.3559 :          j=j+1
               :        else
 17098  0.2697 :          inc=inc*2
 24064  0.3796 :          if(inc>ntot) then
               :            write(unit=*, fmt=*) "error in fourier: n=", ntot
               :            stop
               :          end if
               :          cycle loop             ! recursive invocation I, followed by II
               :        end if
               :      else                       ! returned from recursive invocation II
   284  0.0045 :        i=i-inc
1149463 18.1335 :        inc=inc/2
158780  2.5049 :        n=ntot/inc
 16125  0.2544 :        j0=j-n
 33206  0.5238 :        j2=j0+n/2
 78837  1.2437 :        h = A(j0) + A(j2)
 49847  0.7864 :        A(j2) = A(j0) - A(j2)
     1 1.6e-05 :        A(j0) = h
 33412  0.5271 :        if(n>2) then
  8156  0.1287 :          j1=j0+n/4
 15141  0.2389 :          j3=j2+n/4
 79795  1.2588 :          h = A(j1) + A(j3)*Icc
 23359  0.3685 :          A(j3) = A(j1) - A(j3)*Icc
 10781  0.1701 :          A(j1) = h
782045 12.3372 :          do m=1,n/4-1
1068196 16.8514 :            h = A(j0+m) + A(j2+m)*E(m*inc)
115815  1.8271 :            A(j2+m) = A(j0+m) - A(j2+m)*E(m*inc)
     1 1.6e-05 :            A(j0+m) = h
205199  3.2371 :            eh = conjg(E(ntot/4-m*inc))
840185 13.2544 :            h = A(j1+m) - A(j3+m)*eh
146492  2.3110 :            A(j3+m) = A(j1+m) + A(j3+m)*eh
     1 1.6e-05 :            A(j1+m) = h
               :          end do
               :        end if
               :      end if
 99516  1.5699 :      if(inc==1) then
               :        exit loop                    ! return from root-invocation
               :      end if
103912  1.6393 :      i = i+inc/2
               :    end do loop                      ! return from recursive invo

CHANNEL

Channel looks like a candidate for optimizing scalarization of Fortran9x array operations. Nearly all time is spent on such operations.
samples  %        symbol name
2494816  63.8103  MAIN__
726182   18.5737  ddx.1380
687084   17.5736  ddy.1376
Most of the time is spent on the time marching loop here:
               :    u(2:M-1,1:N,new) = u(2:M-1,1:N,old) &               ! interior u points
               :        +2.d0*dt*f(2:M-1,1:N)*v(2:M-1,1:N,mid) &
970191 24.8147 :        -2.d0*dt/(2.d0*dx)*g*dhdx(2:M-1,1:N)
               :
               :    v(1:M,2:N-1,new) = v(1:M,2:N-1,old) &               ! interior v points
               :        -2.d0*dt*f(1:M,2:N-1)*u(1:M,2:N-1,mid) &
822404 21.0347 :        -2.d0*dt/(2.d0*dy)*g*dhdy(1:M,2:N-1)
               :
               :    h(1:M,2:N-1,new) = h(1:M,2:N-1,old) &               ! interior h points
               :        -2.d0*dt/(2.d0*dx)*Href*dudx(1:M,2:N-1) &
656240 16.7847 :        -2.d0*dt/(2.d0*dy)*Href*dvdy(1:M,2:N-1) 
and in the ddx and ddy functions:
721331 18.4496 :    ddx(2:I-1,1:J) = array(3:I,1:J)-array(1:I-2,1:J)    ! interior points

666704 17.0524 :    ddy(1:I,2:J-1) = array(1:I,3:J)-array(1:I,1:J-2)    ! interior points

DODUC

samples  %        symbol name
1063280  21.5747  pow
659946   13.3908  si_
292427    5.9336  s00017_
258190    5.2389  exp
256393    5.2024  s00089_
213899    4.3402  f00113_
196424    3.9856  mul12
and lots of tiny routines that add up. SI has a loop like:
128350  2.6043 :         DO ik = 2 , n
               :            ii = ik
332341  6.7434 :            IF ( Xtbl(ik).GE.x ) GOTO 100
               :         ENDDO

FATIGUE

samples  %        symbol name
673758   26.8266  __perdida_m__perdida
564683   22.4836  __remainder_piby2
426911   16.9980  __perdida_m__generalized_hookes_law
253172   10.0804  MAIN__
134442    5.3530  sin
120423    4.7948  cos
108797    4.3319  _int_malloc
89870     3.5783  _int_free
We spend a lot of time in reducing sin and cos arguments it seems. PERDIDA has
               :      damaged_dev_stress_tensor(:,:) = deviatoric_stress_tensor(:,:)/(1.0_LONGreal -  &
104077  4.1440 :                                                                    crack_parameter * damage)
generalized_hookes_law:
172913  6.8848 :      generalized_constitutive_tensor(:,:) = 0.0_LONGreal
and
               :      do i = 1, 6
               :          generalized_stress_vector(i) = dot_product(generalized_constitutive_tensor(i,:),  &   
 77066  3.0685 :                                                                generalized_strain_vector(:))
               :      end do

GAS_DYN

samples  %        symbol name
1507338  50.4328  eos_
604520   20.2261  chozdt_
385809   12.9085  _gfortran_minloc0_4_r4
153227    5.1267  MAIN__
120127    4.0192  area_
In EOS we have
   268  0.0090 :      IF (SHEAT>0.0 .AND. CGAMMA>0.0) THEN
345224 11.5506 :          TEMP(:NODES) = IENER(:NODES)/SHEAT
377779 12.6398 :          PRES(:NODES) = (CGAMMA - 1.0)*DENS(:NODES)*IENER(:NODES)
196586  6.5774 :          GAMMA(:NODES) = CGAMMA
587464 19.6555 :          CS(:NODES) = SQRT(CGAMMA*PRES(:NODES)/DENS(:NODES))
               :      ELSE
And CHOZDT spends all its time in
603054 20.1771 :      DTEMP = DX/(ABS(VEL) + SOUND)
   752  0.0252 :      ISET = MINLOC (DTEMP)
there is also the MINLOC call.

INDUCT

samples  %        symbol name
4492315  72.7969  __mqc_m__mutual_ind_quad_cir_coil
1411917  22.8798  __mqr_m__mutual_ind_quad_rec_coil
154445    2.5027  __remainder_piby2
dot_products galore. Also a lot of sqrt of dot_products, maybe a pattern to recognize and optimize.

LINPK

samples  %        symbol name
3637781  99.4426  daxpy_
The classic one, with some manual unrolling:
405466 11.0838 :               DO i = mp1 , N , 4
631494 17.2626 :                  Dy(i) = Dy(i) + Da*Dx(i)
1396884 38.1853 :                  Dy(i+1) = Dy(i+1) + Da*Dx(i+1)
371128 10.1452 :                  Dy(i+2) = Dy(i+2) + Da*Dx(i+2)
481208 13.1543 :                  Dy(i+3) = Dy(i+3) + Da*Dx(i+3)
               :               ENDDO

MDBX

samples  %        symbol name
1441821  47.7020  mforce_
764881   25.3057  fbuild_
451006   14.9213  mstep_
266520    8.8177  cbuild_
89573     2.9635  gbuild_
Loops, loops, loops - with control flow. Like for example (FBUILD):
               :      DO i = 1 , MOLsa
 11678  0.3864 :         DO j = i + 1 , MOLsa
 11596  0.3836 :            xij = X0(1,i) - X0(1,j)
 45792  1.5150 :            IF ( xij.GT.+HALf ) xij = xij - PBCx
 29208  0.9663 :            IF ( xij.LT.-HALf ) xij = xij + PBCx
 11502  0.3805 :            yij = X0(2,i) - X0(2,j)
114264  3.7804 :            IF ( yij.GT.+HALf ) yij = yij - PBCy
 76163  2.5198 :            IF ( yij.LT.-HALf ) yij = yij + PBCy
 22806  0.7545 :            zij = X0(3,i) - X0(3,j)
 65153  2.1556 :            IF ( zij.GT.+HALf ) zij = zij - PBCz
 52288  1.7299 :            IF ( zij.LT.-HALf ) zij = zij + PBCz
               :            rsq = xij*(g11*xij+g12d*yij+g13d*zij)                       &
               :     &            + yij*(g22*yij+g23d*zij) + g33*zij*zij
   145  0.0048 :            MARk(j) = 0
266293  8.8102 :            IF ( rsq.LT.Ransq ) MARk(j) = 1
               :         ENDDO
               :         MRKr1(i) = nll
 11613  0.3842 :         DO j = i + 1 , MOLsa
 15790  0.5224 :            LISt(nll) = j
 18679  0.6180 :            nll = nll + MARk(j)
               :         ENDDO
   250  0.0083 :         MRKr2(i) = nll - 1
               :      ENDDO

NF

samples  %        symbol name
1922085  34.3511  spmmult.1796
1480771  26.4641  trisolve.1787
1208249  21.5936  nfcg_
510377    9.1214  nf2dprecon.1778
434281    7.7614  nf3dprecon.1769
               :subroutine spmmult(x,b)
               :real(dpkind),dimension(nxyz):: x,b
270662  4.8372 :b = ad*x
276608  4.9435 :b(1:nxyz-1)  = b(1:nxyz-1)  + au1(1:nxyz-1) *x(2:nxyz)
275996  4.9325 :b(2:nxyz)  = b(2:nxyz)  + au1(1:nxyz-1) *x(1:nxyz-1) ! symmetric - so use au for lower
 band too
275781  4.9287 :b(1:nxyz-nx)  = b(1:nxyz-nx)  + au2(1:nxyz-nx) *x(nx+1:nxyz)
242638  4.3364 :b(nx+1:nxyz)  = b(nx+1:nxyz)  + au2(1:nxyz-nx) *x(1:nxyz-nx)
273006  4.8791 :b(1:nxyz-nxy) = b(1:nxyz-nxy) + au3(1:nxyz-nxy)*x(nxy+1:nxyz)
239446  4.2793 :b(nxy+1:nxyz) = b(nxy+1:nxyz) + au3(1:nxyz-nxy)*x(1:nxyz-nxy)
               :end subroutine spmmult               !=========================================
 23408  0.4183 :subroutine trisolve(x,i1,i2)         ! solve along a single line of cells ===== /* trisolve.1787 total: 1480771 26.4641 */
               :integer :: i1 , i2
               :real(dpkind),dimension(i2)::x
               :integer :: i
  4317  0.0772 :x(i1) = gi(i1)* x(i1)
 22584  0.4036 :do i = i1+1 , i2
780462 13.9483 :   x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1))
               :enddo
 97761  1.7472 :do i = i2-1 , i1 , -1
552239  9.8695 :   x(i) = x(i) - gi(i)*au1(i)*x(i+1)
               :enddo
               :end subroutine trisolve              !=========================================

PROTEIN

samples  %        symbol name
3819311  76.4573  __conformations__conformation
872399   17.4642  __energy__compute_energy
219551    4.3951  cshift0

RNFLOW

samples  %        symbol name
1930486  43.5695  trs2a2.3579
630563   14.2313  minlst.5369
486611   10.9824  cptrf2_
334118    7.5408  gentrs_
267859    6.0454  _gfortran_matmul_r4
121412    2.7402  invima.3572
TRS2A2 looks like
 30803  0.6952 :      trs2a2 = 0.0
    29 6.5e-04 :      do iclw1 = j, k - 1
  4127  0.0931 :         do iclw2 = j, k - 1
               :            dtmp = 0.0d0
583839 13.1768 :            do iclww = j, k - 1
1151037 25.9780 :               dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2)
               :            enddo
151279  3.4142 :            trs2a2 (iclw1, iclw2) = dtmp
               :         enddo
               :      enddo
MINLST looks like
  4926  0.1112 :      minlst = ipos2 /* minlst.5369 total: 630563 14.2313 */
121931  2.7519 :      do ipos = ipos2 - 1, ipos1, -1
503706 11.3682 :         if (xxtrt (ipos) < xxtrt (minlst)) then
               :            minlst = ipos
               :         endif
               :      enddo

TEST_FPU

samples  %        symbol name
1532040  51.2070  crout_
591500   19.7703  dgemm_
567353   18.9632  gauss_
149244    4.9883  dtrmm_
31588     1.0558  dtrsm_
dot_product again, in CROUT. DGEMM has:
   139  0.0046 :               DO l = 1 , K
   622  0.0208 :                  IF ( B(l,j)/=ZERO ) THEN
               :                     temp = Alpha*B(l,j)
 21380  0.7146 :                     DO i = 1 , M
569348 19.0299 :                        C(i,j) = C(i,j) + temp*A(i,l)
               :                     ENDDO
               :                  ENDIF
               :               ENDDO

TFFT

samples  %        symbol name
1078001  99.8597  rfft_