I am not denying that this is the case for the implementation in the EG paper. I am just saying that it is not a real problem for the implementation presented in the TOG paper. The 3.5x (out of 4) figure is the average over the scene tested, not the best case (we reach 4x with hyper-threading). Unfortunately, there is definitely a scaling problem for incoherent rays.
For instance, for the conference scene we have:
                             1 thread   8 threads   speed-up
TOG paper 2011 (i7 3.GHz)    1.1 MR/s   5.5 MR/s    4.23x
EG paper 2012 (i7 3.2GHz)    1.7 MR/s   6.2 MR/s    3.64x
Keep in mind as well that the partitioning scheme is not the same, which may have some consequences for the behaviour of the two implementations. I think that BVHs are more prone to performance variability depending on the scene, though they also have advantages that can't be denied.
On my Sandy Bridge laptop, performance is noticeably better than on my i7-920. By the way, was turbo mode deactivated in the BIOS for the benchmarks provided in this EG paper?
Swapping indices works well for coherent rays, but performs consistently worse for incoherent rays because of poor cache utilization.
Incoherent rays are certainly less efficient in terms of cache utilization. I have no exact idea of how bad it is; "poor" will have to be quantified in further research. I also believe that prefetching actually does a very good job of mitigating the issue you mention. By the way, do you know why 8-bounce tests usually perform better than 1-bounce tests in the EG paper? Other publications seem to show a different trend.
Overall, writing large chunks of data into a linear list is not efficient either when compared to just swapping indices, and it clearly increases bandwidth usage considerably.
In conclusion, there is a choice to be made between favoring single-threaded performance and favoring multi-threaded applications with good scaling.