Making a Go program run 1.7x faster with a one character change

2022-11-13

If you read the title and thought “well, you were probably just doing something silly beforehand”, you’re right! But what is programming if not an exercise in making silly mistakes? Tracking down silly mistakes is where all the fun is to be had!

I’ll also state the usual benchmarking caveat up front: the 1.7x speedup was measured while running the program on my data on my computer, so take that number with a big old pinch of salt.

What does the program do?

codeowners is a Go program that prints out the owners for each file in a repository according to a set of rules defined in a GitHub CODEOWNERS file. A rule might say that all files ending in .go are owned by the @gophers team, or all files in the docs/ directory are owned by the @docs team.

When considering a given path, the last rule that matches wins. A simple but naive matching algorithm iterates backwards through the rules for each path, stopping when it finds a match. Smarter algorithms do exist, but that’s for another day. Here’s what the Ruleset.Match function looks like:

type Ruleset []Rule

func (r Ruleset) Match(path string) (*Rule, error) {
  for i := len(r) - 1; i >= 0; i-- {
    rule := r[i]
    match, err := rule.Match(path)
    if match || err != nil {
      return &rule, err
    }
  }
  return nil, nil
}

Finding the slow bits with pprof and flamegraphs

The tool was a tad slow when running it against a moderately large repository:

$ hyperfine codeowners
Benchmark 1: codeowners
  Time (mean ± σ):      4.221 s ±  0.089 s    [User: 5.862 s, System: 0.248 s]
  Range (min … max):    4.141 s …  4.358 s    10 runs

To see where the program was spending its time, I recorded a CPU profile with pprof. You can get a CPU profile generated by adding this snippet to the top of your main function:

pprofFile, pprofErr := os.Create("cpu.pprof")
if pprofErr != nil {
  log.Fatal(pprofErr)
}
pprof.StartCPUProfile(pprofFile)
defer pprof.StopCPUProfile()

Aside: I use pprof quite a lot, so I’ve got that code saved as a vscode snippet. I just type pprof, hit tab, and that snippet appears.

Go comes with a handy interactive profile visualisation tool. I visualised the profile as a flamegraph by running the following command then navigating to the flamegraph view in the menu at the top of the page.

$ go tool pprof -http=":8000" ./codeowners ./cpu.pprof

As I expected, most of the time was being spent in that Match function. The CODEOWNERS patterns are compiled to regular expressions, and the majority of the Match function’s time was spent in Go’s regex engine. But I also noticed a lot of time was being spent allocating and reclaiming memory. The purple blocks in the flamegraph below match the pattern gc|malloc, and you can see in aggregate they represent a meaningful part of the program’s execution time.

Flamegraph showing time spent in the regex engine and malloc

Hunting heap allocations with escape analysis traces

So let’s see if there are any allocations we can get rid of to reduce the GC pressure and time spent in malloc.

The Go compiler uses a technique called escape analysis to figure out when some memory needs to live on the heap. Say a function initialises a struct then returns a pointer to it. If the struct was allocated on the stack, the pointer that’s returned would become invalid as soon as the function returns and the corresponding stack frame is invalidated. In that case the Go compiler would determine that the pointer has “escaped” the function, and move the struct to the heap instead.

You can see these decisions being made by passing -gcflags=-m to go build:

$ go build -gcflags=-m *.go 2>&1 | grep codeowners.go
./codeowners.go:82:18: inlining call to os.IsNotExist
./codeowners.go:71:28: inlining call to filepath.Join
./codeowners.go:52:19: inlining call to os.Open
./codeowners.go:131:6: can inline Rule.Match
./codeowners.go:108:27: inlining call to Rule.Match
./codeowners.go:126:6: can inline Rule.RawPattern
./codeowners.go:155:6: can inline Owner.String
./codeowners.go:92:29: ... argument does not escape
./codeowners.go:96:33: string(output) escapes to heap
./codeowners.go:80:17: leaking param: path
./codeowners.go:70:31: []string{...} does not escape
./codeowners.go:71:28: ... argument does not escape
./codeowners.go:51:15: leaking param: path
./codeowners.go:105:7: leaking param content: r
./codeowners.go:105:24: leaking param: path
./codeowners.go:107:3: moved to heap: rule
./codeowners.go:126:7: leaking param: r to result ~r0 level=0
./codeowners.go:131:7: leaking param: r
./codeowners.go:131:21: leaking param: path
./codeowners.go:155:7: leaking param: o to result ~r0 level=0
./codeowners.go:159:13: "@" + o.Value escapes to heap

The output is a little noisy, but you can ignore most of it. As we’re looking for allocations, “moved to heap” is the phrase we should be concerned about. Looking back at the Match code above, the Rule structs are stored within the Ruleset slice, which we can be confident is already on the heap. And as a pointer to the rule is returned, no extra allocation should be necessary.

Then I saw it—by assigning rule := r[i], we copy the heap allocated Rule out of the slice onto the stack, then by returning &rule we create an (escaping) pointer to the copy of the struct. Fortunately, fixing this is easy. We just need to move the ampersand up a bit so we take a reference to the struct in the slice rather than copying it:

 func (r Ruleset) Match(path string) (*Rule, error) {
 	for i := len(r) - 1; i >= 0; i-- {
-		rule := r[i]
+		rule := &r[i]
 		match, err := rule.Match(path)
 		if match || err != nil {
-			return &rule, err
+			return rule, err
 		}
 	}
 	return nil, nil
 }

I did consider two other approaches:

Changing Ruleset from being []Rule to []*Rule, which would mean we no longer need to explicitly take a reference to the rule.
Returning a Rule rather than a *Rule. This would still copy the Rule, but it should stay on the stack instead of moving to the heap.

However, both of these would have resulted in a breaking change as this method is part of the public API.

Anyway, after making that change we can see if it had the desired effect by getting a new trace from the compiler and comparing it to the old one:

$ diff trace-a trace-b
14a15
> ./codeowners.go:105:7: leaking param: r to result ~r0 level=0
16d16
< ./codeowners.go:107:3: moved to heap: rule

Success! The allocation is gone. Now let’s see how removing that one heap allocation affects performance:

$ hyperfine ./codeowners-a ./codeowners-b
Benchmark 1: ./codeowners-a
  Time (mean ± σ):      4.146 s ±  0.003 s    [User: 5.809 s, System: 0.249 s]
  Range (min … max):    4.139 s …  4.149 s    10 runs

Benchmark 2: ./codeowners-b
  Time (mean ± σ):      2.435 s ±  0.029 s    [User: 2.424 s, System: 0.026 s]
  Range (min … max):    2.413 s …  2.516 s    10 runs

Summary
  ./codeowners-b ran
    1.70 ± 0.02 times faster than ./codeowners-a

As that allocation was happening for every single path that was being matched against, removing it got a 1.7x speedup in this instance. Not bad for a one character change.