A Tale of Two Optimisations
With the LBR (Last Branch Record) processor feature, the CPU logs the from and to addresses of taken branches, both predicted and mispredicted, to a set of special-purpose registers. Looking at the trace data, the brstackinsn field allows us to see the instructions executed and the approximate CPU cycles for different branches. One annotation in the trace, for example, reads "PRED 5 cycles 0.20 IPC".

Even without decoding the machine code, what immediately stands out is the 22 cycles taken following the branch misprediction. The likely cause of this is the number of branch misses, which is up at ~12%. These branch prediction misses are destroying any benefit of pipelining in the CPU.

I've got a sneaking suspicion that the cause of these missed branch predictions is our input data, so let's test the hypothesis. Once the input is no longer random there are, in the grand scheme of things, almost no branch prediction misses, and a huge speed increase compared to the random bytes.

We found a surprising result with the switch statement, and were able to determine using the perf tool that we were paying a penalty for missed branch predictions due to the randomness of the input data. I've certainly learned a lot about perf, LBR stacks and branch prediction in putting this article together.