cmd/compile: poor spill decisions making code 14% slower #71868
Labels: BugReport, compiler/runtime, NeedsInvestigation, Performance
At commit 266b0cf from earlier today (but also with some older toolchains; I'm not claiming the behavior is new), suppose you build the math/big test binary and run the Scan benchmarks.
I've attached big.zip, which contains the test executable (big.test.old) and the objdump output, lightly annotated (old.asm).
Now consider this line-number-preserving diff, which introduces a new variable (bZ) that must be saved on the stack across a function call.
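Illustratively (hypothetical code, not the actual patch, and glossing over how the diff keeps line numbers unchanged), it is enough to introduce a temporary that is written before the call and read after it:

```go
bZ := cap(z)           // new temporary, written before the call...
ch, err = r.ReadByte() // ...so it must be saved on the stack across the call...
if bZ != cap(z) {      // ...and read afterward, which keeps it live (hypothetical use)
	panic("unexpected reallocation")
}
```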
The big.zip attachment also contains the resulting test executable (big.test.new) and the objdump output (new.asm).
On x86-64, this diff makes the 'Scan/1000/Base10' benchmark run 14% faster! Quite an optimization!
How can this be? It turns out that the compiler is making poor choices when spilling to the stack, and for whatever reason, that extra temporary causes different, better choices. The nat.scan method has a loop that looks like this:
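Paraphrased shape only, not the verbatim math/big source; the relevant points are the indirect ReadByte call at the bottom of the loop and the slice z that stays live across it:

```go
for err == nil {
	// ...classify ch as a digit in the given base and accumulate it
	// into z, so z (and with it len(z) and cap(z)) is live here...
	ch, err = r.ReadByte() // the call at line 219
}
```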
That last ReadByte call is a basic block of its own, with three possible predecessor blocks. There is a slice variable `z` in the code that is live across that call.

In the slow version of the code (without `bZ`), the compiler chooses to store len(z) in 0xb0(SP) and cap(z) in 0xb8(SP) pretty consistently throughout the function, except in this one basic block, where it swaps their locations. On entry to this basic block, len(z) is in R8, and cap(z) is in SI. The assembly, lightly annotated, is:
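(What follows is an illustrative reconstruction of the pattern rather than the verbatim listing; old.asm in big.zip has the real annotated disassembly.)

```
> MOVQ R8, 0xb8(SP)  // spill len(z) into the slot usually used for cap(z)
> MOVQ SI, 0xb0(SP)  // spill cap(z) into the slot usually used for len(z)
  ...                // set up the arguments
  CALL ...           // the indirect io.ByteScanner.ReadByte call
  ...
> MOVQ 0xb8(SP), R8  // reload len(z) from the swapped slot
> MOVQ 0xb0(SP), SI  // reload cap(z) from the swapped slot
> MOVQ R8, 0xb0(SP)  // store len(z) where the rest of the function expects it
> MOVQ SI, 0xb8(SP)  // store cap(z) likewise
> MOVQ AX, 0x80(SP)  // store ch, the result of the call
```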
The important part is the lines marked with a leading >. At the start of the block, the compiler stores len(z) and cap(z) in the slots usually used for cap(z) and len(z). At the end of the block, the compiler must now correct the stores to match where other parts of the function will expect to load from. The marked sequence is a bit confusing, but it reloads len(z) and cap(z) from the swapped slots, stores them back to their usual slots, and stores the `ch` result of the call to 0x80(SP).

If the marked stores at the top of the basic block wrote to 0xb0(SP) and 0xb8(SP) instead of 0xb8(SP) and 0xb0(SP), then all the marked lines at the bottom could be deleted entirely. In the `bZ` version of the code, something about the extra temporary leads the compiler to make better choices and do exactly that.

We can also make that change directly. The big.zip attachment contains fixbig.go, which reads big.test.old, makes exactly those edits (swaps the two store locations and deletes the marked lines at the end), and writes big.test.old.fix. Sure enough, big.test.old.fix runs about as fast as big.test.new does.
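For flavor, here is a minimal sketch of a patcher in the fixbig.go style. The instruction encodings and offsets below are illustrative placeholders (valid x86-64 encodings for the named instructions, but not the actual bytes or locations in big.test.old, which the real tool works from):

```go
package main

import (
	"bytes"
	"log"
	"os"
)

// patchOnce replaces the first occurrence of old with repl (same length).
func patchOnce(data, old, repl []byte) {
	i := bytes.Index(data, old)
	if i < 0 {
		log.Fatalf("pattern % x not found", old)
	}
	copy(data[i:], repl)
}

func main() {
	data, err := os.ReadFile("big.test.old")
	if err != nil {
		log.Fatal(err)
	}

	// Swap the two spill destinations by rewriting the disp32 bytes:
	// MOVQ R8, 0xb8(SP) becomes MOVQ R8, 0xb0(SP), and vice versa for SI.
	patchOnce(data,
		[]byte{0x4c, 0x89, 0x84, 0x24, 0xb8, 0x00, 0x00, 0x00}, // MOVQ R8, 0xb8(SP)
		[]byte{0x4c, 0x89, 0x84, 0x24, 0xb0, 0x00, 0x00, 0x00}) // MOVQ R8, 0xb0(SP)
	patchOnce(data,
		[]byte{0x48, 0x89, 0xb4, 0x24, 0xb0, 0x00, 0x00, 0x00}, // MOVQ SI, 0xb0(SP)
		[]byte{0x48, 0x89, 0xb4, 0x24, 0xb8, 0x00, 0x00, 0x00}) // MOVQ SI, 0xb8(SP)

	// NOP out the now-redundant fixup sequence at the bottom of the block.
	// These offsets are placeholders; the real ones come from old.asm.
	const fixupOff, fixupLen = 0x123456, 40
	if fixupOff+fixupLen > len(data) {
		log.Fatal("fixup range out of bounds")
	}
	for i := 0; i < fixupLen; i++ {
		data[fixupOff+i] = 0x90 // x86 NOP
	}

	if err := os.WriteFile("big.test.old.fix", data, 0o755); err != nil {
		log.Fatal(err)
	}
}
```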
Aside 1. Note that the line numbers are bogus. The ReadByte call is line 219; the rest is spilling in preparation for the call and reloading in preparation for falling through to the next basic block. I would argue that the initial spills should be given the first line number from the basic block's actual code, and the cleanup should be given the last line number from the actual code, so that every instruction in this basic block would say line 219. It looks like maybe the spill/load code is assigned the line number of the final mention of that variable in the function. As it stands now, the resulting scattershot line numbering makes source-level profiling almost worthless.
Aside 2. Note the number of loads in the disassembly that I've marked `// dead`. The registers these instructions load are overwritten later in the basic block without being used. It seems like a basic peephole optimizer could remove them (see the sketch below), although better would be not to generate them in the first place. There are also a few loads marked `// could be just ...` whose destination register's only use is to be moved into a final destination register later. These suboptimal load sequences persist in the `bZ` version that is 14% faster. I wrote a version of fixbig.go that removes the lines I marked `// dead` (except the final one before the > section at the bottom, which stops being dead when that section is deleted), and it runs 17% faster than the base. This is not the tightest loop in the world (it contains an indirect call!), and yet these basic cleanups give a very significant speedup. I wonder whether this is an isolated case or whether this happens in basically all code we generate.
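The peephole in question could be as simple as a per-block backwards liveness scan over register operands. Here is a sketch of the idea; the Instr type and its fields are hypothetical simplifications, not cmd/compile's actual IR, and it assumes the loads are pure reloads from stack slots (no side effects):

```go
package main

import "fmt"

// Instr is a simplified, hypothetical instruction form.
type Instr struct {
	Text string   // disassembly text, for printing
	Load bool     // a plain reload from a stack slot
	Def  string   // register written, "" if none
	Uses []string // registers read
}

// elimDeadLoads scans one basic block backwards, tracking which
// registers are still needed. A reload whose destination register is
// overwritten later in the block without an intervening use is dead
// and can be dropped.
func elimDeadLoads(block []Instr, liveOut map[string]bool) []Instr {
	live := make(map[string]bool, len(liveOut))
	for r := range liveOut {
		live[r] = true
	}
	keep := make([]bool, len(block))
	for i := len(block) - 1; i >= 0; i-- {
		ins := block[i]
		keep[i] = true
		if ins.Load && ins.Def != "" && !live[ins.Def] {
			keep[i] = false // dead: Def is redefined before any use
			continue        // a dropped load neither kills nor uses anything
		}
		if ins.Def != "" {
			live[ins.Def] = false // Def is produced here, not needed earlier
		}
		for _, r := range ins.Uses {
			live[r] = true
		}
	}
	var out []Instr
	for i, ins := range block {
		if keep[i] {
			out = append(out, ins)
		}
	}
	return out
}

func main() {
	block := []Instr{
		{Text: "MOVQ 0xb0(SP), R8", Load: true, Def: "R8"}, // dead: R8 rewritten below
		{Text: "MOVQ 0x88(SP), R8", Load: true, Def: "R8"},
		{Text: "MOVQ R8, AX", Def: "AX", Uses: []string{"R8"}},
	}
	for _, ins := range elimDeadLoads(block, map[string]bool{"AX": true}) {
		fmt.Println(ins.Text)
	}
}
```

On this toy block, the first reload into R8 is deleted because R8 is redefined before any use, which is exactly the `// dead` pattern.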