
cmd/compile: poor spill decisions making code 14% slower #71868

Open
rsc opened this issue Feb 21, 2025 · 4 comments
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@rsc
Contributor

rsc commented Feb 21, 2025

At commit 266b0cf from earlier today (but also with some older toolchains, not claiming the behavior is new), suppose you run:

GOOS=linux GOARCH=amd64 go test -c math/big
go tool objdump -s big.nat.scan big.test

I've attached big.zip, which contains the test executable (big.test.old) and the objdump output, lightly annotated (old.asm).

Now consider this line-number-preserving diff, which introduces a new variable (bZ) that must be saved on the stack across a function call.

diff --git a/src/math/big/natconv.go b/src/math/big/natconv.go
index ce94f2cf72..a92343befe 100644
--- a/src/math/big/natconv.go
+++ b/src/math/big/natconv.go
@@ -121,9 +121,9 @@ func (z nat) scan(r io.ByteScanner, base int, fracOk bool) (res nat, b, count in
 	prev := '.'
 	invalSep := false
 
-	// one char look-ahead
-	ch, err := r.ReadByte()
-
+	bZ := base + 1
+	ch, err := r.ReadByte()
+	base = bZ - 1
 	// determine actual base
 	b, prefix := base, 0
 	if base == 0 {

The big.zip attachment also contains the resulting test executable (big.test.new) and the objdump output (new.asm).

On x86-64, this diff makes the 'Scan/1000/Base10' benchmark run 14% faster! Quite an optimization!

goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor            
                    │     old     │                 new                 │
                    │   sec/op    │   sec/op     vs base                │
Scan/1000/Base10-32   4.694µ ± 0%   4.032µ ± 0%  -14.11% (p=0.000 n=20)

How can this be? It turns out that the compiler is making poor choices when spilling to the stack, and for whatever reason, that extra temporary causes different, better choices. The nat.scan method has a loop that looks like:

for err == nil {
	if ch == '.' && fracOk {
		...
	} else if ch == '_' && base == 0 {
		...
	} else {
		...
	}
	ch, err = r.ReadByte()
}

That last ReadByte call is a basic block of its own, with three possible predecessor blocks. There is a slice variable z in the code that is live across that call.

In the slow version of the code (without bZ), the compiler chooses to store len(z) in 0xb0(SP) and cap(z) in 0xb8(SP) pretty consistently throughout the function, except in this one basic block, where it swaps their locations. On entry to this basic block, len(z) is in R8, and cap(z) is in SI. The assembly, lightly annotated, is:

> natconv.go:243	0x5fc72f		4c898424b8000000	MOVQ R8, 0xb8(SP)	// len(z) (NOT cap!)
  natconv.go:243	0x5fc737		4c89ac24d0000000	MOVQ R13, 0xd0(SP)
> natconv.go:243	0x5fc73f		4889b424b0000000	MOVQ SI, 0xb0(SP) // cap(z) (NOT len!)
  natconv.go:243	0x5fc747		4c897c2470		MOVQ R15, 0x70(SP)
  natconv.go:242	0x5fc74c		4889542460		MOVQ DX, 0x60(SP)
  natconv.go:231	0x5fc751		4889bc24a0000000	MOVQ DI, 0xa0(SP)
  natconv.go:171	0x5fc759		4488642443		MOVB R12, 0x43(SP)
  natconv.go:219	0x5fc75e		498b4b18		MOVQ 0x18(R11), CX
  natconv.go:219	0x5fc762		ffd1			CALL CX
  natconv.go:170	0x5fc764		488b9424c0000000	MOVQ 0xc0(SP), DX
  natconv.go:170	0x5fc76c		488bb42410010000	MOVQ 0x110(SP), SI // could be just R11
  natconv.go:170	0x5fc774		488b7c2458		MOVQ 0x58(SP), DI // could be just R10
  natconv.go:170	0x5fc779		4c8b842420010000	MOVQ 0x120(SP), R8 // could be just R9
  natconv.go:170	0x5fc781		4c8b8c2418010000	MOVQ 0x118(SP), R9 // dead
  natconv.go:170	0x5fc789		4c8b542468		MOVQ 0x68(SP), R10 // dead
  natconv.go:170	0x5fc78e		4c8b5c2470		MOVQ 0x70(SP), R11 // dead
  natconv.go:170	0x5fc793		4c8ba424b8000000	MOVQ 0xb8(SP), R12 // len(z) (NOT cap!); dead
  natconv.go:170	0x5fc79b		4c8bac24d0000000	MOVQ 0xd0(SP), R13
  natconv.go:170	0x5fc7a3		4c8bbc24b0000000	MOVQ 0xb0(SP), R15 // cap(z) (NOT len!); dead
  natconv.go:110	0x5fc7ab		4d89c1			MOVQ R8, R9
  natconv.go:212	0x5fc7ae		4989fa			MOVQ DI, R10
  natconv.go:219	0x5fc7b1		4989f3			MOVQ SI, R11
  natconv.go:233	0x5fc7b4		4c8b642448		MOVQ 0x48(SP), R12
  natconv.go:213	0x5fc7b9		4c8b7c2450		MOVQ 0x50(SP), R15
  natconv.go:171	0x5fc7be		4189c0			MOVL AX, R8
  natconv.go:227	0x5fc7c1		8b742444		MOVL 0x44(SP), SI
  natconv.go:231	0x5fc7c5		488bbc24a0000000	MOVQ 0xa0(SP), DI
  natconv.go:219	0x5fc7cd		488b842418010000	MOVQ 0x118(SP), AX // dead
> natconv.go:243	0x5fc7d5		488b8424b8000000	MOVQ 0xb8(SP), AX // len(z) (NOT cap!)
> natconv.go:171	0x5fc7dd		4488842480000000	MOVB R8, 0x80(SP) // temporary
> natconv.go:243	0x5fc7e5		4c8b8424b0000000	MOVQ 0xb0(SP), R8 // cap(z) (NOT len!)
> natconv.go:243	0x5fc7ed		4c898424b8000000	MOVQ R8, 0xb8(SP)	// cap(z)
> natconv.go:243	0x5fc7f5		48898424b0000000	MOVQ AX, 0xb0(SP) // len(z)
> natconv.go:219	0x5fc7fd		488b842418010000	MOVQ 0x118(SP), AX
> natconv.go:171	0x5fc805		440fb6842480000000	MOVZX 0x80(SP), R8

The important part is the lines marked with a leading >. At the start of the block, the compiler stores len(z) and cap(z) in the slots usually used for cap(z) and len(z). At the end of the block, the compiler must then correct the stores to match where other parts of the function will expect to load from. The marked sequence is a bit confusing, but what it does is:

  • Reload len(z) from 0xb8(SP) into AX.
  • Spill a byte from R8, which holds the ch result of the call, to 0x80(SP).
  • Reload cap(z) from 0xb0(SP) into R8, and then store it into 0xb8(SP).
  • Store len(z) from AX into 0xb0(SP).
  • Reload AX with the value it held before the marked sequence (0x118(SP)).
  • Reload R8 from its spilled stack location, zero-extending the byte into a full word.

If the marked stores at the top of the basic block wrote to 0xb0(SP) and 0xb8(SP) instead of 0xb8(SP) and 0xb0(SP), then all the marked lines at the bottom could be deleted entirely. In the bZ version of the code, something about the extra temporary leads the compiler to make better choices and do exactly that.

We can also make that change directly. The big.zip attachment contains fixbig.go, which reads big.test.old, makes exactly those edits (swap the two store locations and delete the marked lines at the end), and writes big.test.old.fix. Sure enough, big.test.old.fix runs about as fast as big.test.new does.
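
For reference, here is a minimal sketch of the kind of patch fixbig.go applies; the actual fixbig.go in big.zip may be structured differently. The addresses and encodings below are taken from the listing above, and NOPing out the fixup sequence stands in for deleting it, since bytes cannot simply be removed from a linked executable:

// fixbig_sketch.go: hypothetical sketch of the edit, not the real fixbig.go.
package main

import (
	"bytes"
	"debug/elf"
	"encoding/hex"
	"log"
	"os"
)

func main() {
	buf, err := os.ReadFile("big.test.old")
	if err != nil {
		log.Fatal(err)
	}
	f, err := elf.NewFile(bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}
	// fileOffset maps a virtual address from the objdump listing
	// to an offset in the executable file.
	fileOffset := func(addr uint64) uint64 {
		for _, p := range f.Progs {
			if p.Type == elf.PT_LOAD && addr >= p.Vaddr && addr < p.Vaddr+p.Filesz {
				return p.Off + (addr - p.Vaddr)
			}
		}
		log.Fatalf("address %#x not mapped", addr)
		return 0
	}
	// write overwrites the instruction at addr with the given encoding.
	write := func(addr uint64, encoding string) {
		b, err := hex.DecodeString(encoding)
		if err != nil {
			log.Fatal(err)
		}
		copy(buf[fileOffset(addr):], b)
	}
	// Swap the two spill destinations at the top of the block.
	write(0x5fc72f, "4c898424b0000000") // MOVQ R8, 0xb0(SP) (was 0xb8)
	write(0x5fc73f, "4889b424b8000000") // MOVQ SI, 0xb8(SP) (was 0xb0)
	// "Delete" the fixup sequence at the bottom of the block
	// (0x5fc7d5 through the 9-byte MOVZX at 0x5fc805) by NOPing it out.
	for off, end := fileOffset(0x5fc7d5), fileOffset(0x5fc805)+9; off < end; off++ {
		buf[off] = 0x90 // NOP
	}
	if err := os.WriteFile("big.test.old.fix", buf, 0o755); err != nil {
		log.Fatal(err)
	}
}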

Aside 1. Note that the line numbers are bogus. The ReadByte call is line 219. The rest is spilling in preparation for the call and reloading in preparation for falling through to the next basic block. I would argue that the initial spills should be given the first line number from the basic block's actual code, and the cleanup should be given the last line number from the actual code, so that every instruction in this basic block would say line 219. It looks like maybe the spill/load code is assigned the line number of the final mention of that variable in the function. As it stands now, the resulting scattershot line numbering makes source-level profiling almost worthless.

Aside 2. Note the number of loads in the disassembly that I've marked // dead. The registers these instructions load are overwritten later in the basic block without being used. It seems like a basic peephole optimizer could remove them, although better would be not to generate them in the first place. There are also a few marked // could be just, which load into a register whose only use is to be moved into the final destination register later. These suboptimal load sequences persist in the bZ version that is 14% faster. I wrote a version of fixbig.go that removes the lines I marked // dead (except the final one before the > section at the bottom, which stops being dead when that section is deleted), and it runs 17% faster than the base:

goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor            
                    │     old     │                 fix                 │
                    │   sec/op    │   sec/op     vs base                │
Scan/1000/Base10-32   4.694µ ± 0%   3.859µ ± 1%  -17.80% (p=0.000 n=20)

This is not the tightest loop in the world (it contains an indirect call!) and yet these basic cleanups give a very significant speedup. I wonder whether this is an isolated case or whether this happens in basically all the code we generate.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 21, 2025
@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Feb 21, 2025
@gopherbot
Contributor

Change https://go.dev/cl/650640 mentions this issue: math/big: optmize atoi of base 2, 4, 16

@mknyszek mknyszek added this to the Unplanned milestone Feb 21, 2025
@mknyszek mknyszek added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 21, 2025
@mknyszek
Contributor

CC @golang/compiler

@mcy

mcy commented Feb 23, 2025

I've done a lot of reading of gc's output and this is very much par for the course. gc is very bad at many regalloc decisions, some of which show up in the assembly listings here: https://mcyoung.xyz/2024/12/12/go-abi/ (although I don't comment on them there, for the reason I explain below). Admittedly it's hard to tell when a spill is necessary so that the GC can scan the stack, but I have seen a lot of this behavior both for non-pointers and for pointers that should not be marked as such in a GC shape, such as the itab pointer of an interface fat pointer.

I often find that gc is very bad at things I would expect not to be an issue coming out of SSA, where you would run cleanups between passes the way LLVM does (DCE, DSE, mem2reg, and so on). This includes unnecessary spills, but also lots of dead stores, both to registers and to memory. The lack of callee-saved registers also hurts a lot, because it means every call forces spills.

I would be filing bugs, but I have always had the expectation that Go doesn't care about codegen micro-optimizations the way that I (an LLVM-shaped person) do. The most I do is email @dr2chase once a quarter whenever I encounter something obviously wrong (wrong in the "this code was written by GCC in the 90s" sense; surprisingly, I have not hit a miscompile in official releases).

If this expectation is based on stale information... well, let me know, and I'll have a lot of missed optimizations to share. One that irked me recently is that Go cannot optimize if a || b {} into if a | b {} to avoid an extra branch when both a and b are pure, such as when doing integer comparisons.
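
For concreteness, here is a hypothetical example of the shape I mean; both comparisons are pure, so the compiler could evaluate both and OR the results instead of emitting a separate short-circuit branch for the first one:

// eitherZero is a made-up example: gc compiles the || below as two
// separate compare-and-branch sequences, even though both operands are
// side-effect-free integer comparisons whose results could be computed
// unconditionally and combined with a single bitwise OR and one branch.
func eitherZero(x, y int) bool {
	if x == 0 || y == 0 {
		return true
	}
	return false
}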

(If you're wondering where I get them. Well, I kinda just torture compilers to find missed optimizations for fun, and the current job has me writing high performance Go, so there's been a lot of opportunities...)
