Comment by teo_zero

13 hours ago

The author forgot to add "fused" here, like they did in other parts of the same section.

Non-fused:

  foreach i
    y[i] = cos(x[i])
  foreach i
    z[i] = cos(y[i])

Fused, no intermediate array:

  foreach i
    t = cos(x[i])
    z[i] = cos(t)

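The temporary "t" can stay in a register, so it never leaves the GPU. The non-fused version sweeps memory twice (read x, write y, then read y, write z), so it moves twice as much data and is twice as dependent on memory bandwidth.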
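To put rough numbers on that, here is a small CPU-side sketch in plain Python. It is not GPU code: the `traffic` counter is a made-up stand-in for memory traffic that just counts array element reads and writes, and the function names `unfused` and `fused` are mine, not from the article.

```python
import math

def unfused(x):
    """Two sweeps: the intermediate array y is written out, then read back."""
    n = len(x)
    traffic = 0  # hypothetical counter: array element reads + writes
    y = [0.0] * n
    z = [0.0] * n
    for i in range(n):
        y[i] = math.cos(x[i])
        traffic += 2  # read x[i], write y[i]
    for i in range(n):
        z[i] = math.cos(y[i])
        traffic += 2  # read y[i], write z[i]
    return z, traffic

def fused(x):
    """One sweep: the intermediate value is a scalar, never stored in an array."""
    n = len(x)
    traffic = 0  # same hypothetical counter as above
    z = [0.0] * n
    for i in range(n):
        t = math.cos(x[i])  # scalar temporary; on a GPU this could live in a register
        z[i] = math.cos(t)
        traffic += 2  # read x[i], write z[i]
    return z, traffic

x = [0.1 * i for i in range(8)]
z1, traffic1 = unfused(x)
z2, traffic2 = fused(x)
print(z1 == z2, traffic1, traffic2)  # prints: True 32 16
```

Same results, half the array accesses. On real hardware the actual ratio would depend on caches and on how the kernels are launched, so treat the 2x as the idealized case.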