The P6 has only two read ports in its permanent register file for operand values: http://www.cs.tau.ac.il/~afek/p6tx050111.pdf (p. 36). P-M upped it to three, and Sandy Bridge removed the limitation completely.
The optimization manual mentions examples of the stall occurring when e.g. often-used constants are stored in registers, or when a load is hoisted "too high" and the value "goes cold" before its consumers use it.
Agner Fog's microarchitecture manual discusses this starting on pp. 69 and 84: http://www.agner.org/optimize/microarchitecture.pdf. Note his use of an unnecessary MOV to "refresh" a register to avoid the stall.
I only glanced at the code quickly, but the comment about how he got rid of a load by holding a value in a register made me think the load was keeping the value from "going cold." Of course, I didn't profile it so I'm probably completely wrong...
Thanks for those very interesting links. But I still don't think that is what is going on here.
In designs which rename using the ROB, the register file holds values produced by instructions which are completed and retired, the ROB holds values from instructions that are completed but not retired, and the bypass network supplies values from instructions currently completing.
What Agner is doing in his example with the seemingly useless instruction is transferring a value from the register file to the ROB, so that instructions which try to read logical register ECX will now source it from the ROB instead of the register file. But when I look at the code in the Stack Overflow question, nothing actually reads from s1. So these are even "more useless" instructions than Agner's example.
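To make the register file vs. ROB distinction concrete, here is a rough toy model in Python. It is only a sketch under loud assumptions: the function names, the example registers, and the rule that each distinct "cold" register costs one read port per rename group are all illustrative, and real hardware has more cases than this. The two-port default comes from the P6 figure quoted earlier in the thread.

```python
def register_file_reads(group, in_flight):
    """Return the distinct source registers that must be read from the
    retired register file for one group of uops renamed together.

    group     -- list of (dest, sources) tuples, in program order
    in_flight -- registers whose latest value still lives in the ROB
    (Toy model: names and rules are assumptions, not a real pipeline.)
    """
    pending = set(in_flight)
    cold = set()
    for dest, sources in group:
        for src in sources:
            if src not in pending:
                cold.add(src)      # value is only in the retired register file
        if dest is not None:
            pending.add(dest)      # later uops get this value from the ROB
    return cold


def would_stall(group, in_flight, read_ports=2):
    """True if the group needs more register-file reads than there are ports."""
    return len(register_file_reads(group, in_flight)) > read_ports


# Hypothetical three-uop group; registers chosen only for illustration.
group = [
    ("eax", ("ebx", "ecx")),
    ("edx", ("esi", "edi")),
    ("ebp", ("eax", "ecx")),   # eax comes from the first uop, not the file
]

print(would_stall(group, in_flight=set()))           # four cold registers
print(would_stall(group, in_flight={"ecx", "esi"}))  # two were "refreshed"
```

In this model, a "refreshing" MOV is simply whatever put `ecx` into `in_flight` shortly before the group is renamed. An instruction whose result nobody reads changes nothing here, which is the point being made about s1.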
Some people have already mentioned instruction alignment issues, so that is one likely explanation. There are a whole bunch of other possible issues involving the scheduler and dispatch restrictions. For example, I've seen processors where there were two pipelines with slightly different instruction schedulers. So adding a useless instruction like this might push your bottleneck instruction into a pipe with a scheduler that is slightly better for your code. Sometimes bypassing across different pipes is more expensive than within the same pipe, so again the useless instruction might push some instructions into pipes that have more of their sources. It could be any one of a number of reasons, and it's going to be very hard to tell from the outside without knowing the details of the microarchitecture.
For some reason I thought I read in the original question that he'd replaced the MOV with an equivalent string of NOPs, but now that I read the example again I clearly just made that up in my head... In that case, I agree that it's probably an instruction alignment issue, specifically the MOV pushing some group of instructions to align better within the 16-byte fetch/decode window. It'd be interesting if someone could run the code on Sandy Bridge+ and see if the useless MOV still helps. The decoded u-op cache should take a lot of the instruction alignment issues off the table.
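The alignment argument can be sketched the same way. The snippet below is a minimal illustration, not a model of any real decoder. The instruction lengths and start addresses are made up, and the assumption that an instruction straddling a 16-byte boundary is the thing that hurts is a simplification. It only shows how a 2-byte shift (the size of a register-to-register MOV in 32-bit code) can change which aligned window each instruction falls into.

```python
def fetch_windows(start, lengths, window=16):
    """For each instruction, return (first, last) indices of the aligned
    fetch windows its bytes occupy, given the address of the first one."""
    result = []
    addr = start
    for n in lengths:
        first = addr // window
        last = (addr + n - 1) // window
        result.append((first, last))
        addr += n
    return result


def split_instructions(start, lengths, window=16):
    """Count instructions whose bytes straddle a window boundary."""
    return sum(1 for first, last in fetch_windows(start, lengths, window)
               if first != last)


# Hypothetical loop body: instruction lengths in bytes (illustrative only).
body = [3, 5, 6, 2]

print(split_instructions(11, body))  # 1: the 5-byte instruction spans 14..18
print(split_instructions(13, body))  # 0: shifted by 2 bytes, nothing straddles
```

Whether removing a split like this is worth a cycle depends on the fetch and decode rules of the specific core, which is why actually measuring on a part with a decoded u-op cache would be informative.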
Intel's optimization manual describes the stall: http://www.intel.com/content/dam/doc/manual/64-ia-32-archite... (3.5.2.1, "ROB Read Port Stalls.").